From 1567c3e3467cddeb019a7b53ec632f834b6a9239 Mon Sep 17 00:00:00 2001 From: Giovanni Gherdovich Date: Wed, 22 Jan 2020 16:16:12 +0100 Subject: x86, sched: Add support for frequency invariance Implement arch_scale_freq_capacity() for 'modern' x86. This function is used by the scheduler to correctly account usage in the face of DVFS. The present patch addresses Intel processors specifically and has positive performance and performance-per-watt implications for the schedutil cpufreq governor, bringing it closer to, if not on-par with, the powersave governor from the intel_pstate driver/framework. Large performance gains are obtained when the machine is lightly loaded and no regression are observed at saturation. The benchmarks with the largest gains are kernel compilation, tbench (the networking version of dbench) and shell-intensive workloads. 1. FREQUENCY INVARIANCE: MOTIVATION * Without it, a task looks larger if the CPU runs slower 2. PECULIARITIES OF X86 * freq invariance accounting requires knowing the ratio freq_curr/freq_max 2.1 CURRENT FREQUENCY * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz") 2.2 MAX FREQUENCY * It varies with time (turbo). As an approximation, we set it to a constant, i.e. 4-cores turbo frequency. 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR * The invariant schedutil's formula has no feedback loop and reacts faster to utilization changes 4. KNOWN LIMITATIONS * In some cases tasks can't reach max util despite how hard they try 5. PERFORMANCE TESTING 5.1 MACHINES * Skylake, Broadwell, Haswell 5.2 SETUP * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12 active cores turbo w/ invariant schedutil, and intel_pstate/powersave 5.3 BENCHMARK RESULTS 5.3.1 NEUTRAL BENCHMARKS * NAS Parallel Benchmark (HPC), hackbench 5.3.2 NON-NEUTRAL BENCHMARKS * tbench (10-30% better), kernbench (10-15% better), shell-intensive-scripts (30-50% better) * no regressions 5.3.3 SELECTION OF DETAILED RESULTS 5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT * dbench (5% worse on one machine), kernbench (3% worse), tbench (5-10% better), shell-intensive-scripts (10-40% better) 6. MICROARCH'ES ADDRESSED HERE * Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum etc have different MSRs semantic for querying turbo levels) 7. REFERENCES * MMTests performance testing framework, github.com/gormanm/mmtests +-------------------------------------------------------------------------+ | 1. FREQUENCY INVARIANCE: MOTIVATION +-------------------------------------------------------------------------+ For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When running a task that would consume 1/3rd of a CPU at 1000 MHz, it would appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the false impression this CPU is almost at capacity, even though it can go faster [*]. In a nutshell, without frequency scale-invariance tasks look larger just because the CPU is running slower. [*] (footnote: this assumes a linear frequency/performance relation; which everybody knows to be false, but given realities its the best approximation we can make.) +-------------------------------------------------------------------------+ | 2. PECULIARITIES OF X86 +-------------------------------------------------------------------------+ Accounting for frequency changes in PELT signals requires the computation of the ratio freq_curr / freq_max. On x86 neither of those terms is readily available. 2.1 CURRENT FREQUENCY ==================== Since modern x86 has hardware control over the actual frequency we run at (because amongst other things, Turbo-Mode), we cannot simply use the frequency as requested through cpufreq. Instead we use the APERF/MPERF MSRs to compute the effective frequency over the recent past. Also, because reading MSRs is expensive, don't do so every time we need the value, but amortize the cost by doing it every tick. 2.2 MAX FREQUENCY ================= Obtaining freq_max is also non-trivial because at any time the hardware can provide a frequency boost to a selected subset of cores if the package has enough power to spare (eg: Turbo Boost). This means that the maximum frequency available to a given core changes with time. The approach taken in this change is to arbitrarily set freq_max to a constant value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most microarchitectures, after evaluating the following candidates: * 1-core (1C) turbo frequency (the fastest turbo state available) * around base frequency (a.k.a. max P-state) * something in between, such as 4C turbo To interpret these options, consider that this is the denominator in freq_curr/freq_max, and that ratio will be used to scale PELT signals such as util_avg and load_avg. A large denominator will undershoot (util_avg looks a bit smaller than it really is), viceversa with a smaller denominator PELT signals will tend to overshoot. Given that PELT drives frequency selection in the schedutil governor, we will have: freq_max set to | effect on DVFS --------------------+------------------ 1C turbo | power efficiency (lower freq choices) base freq | performance (higher util_avg, higher freq requests) 4C turbo | a bit of both 4C turbo proves to be a good compromise in a number of benchmarks (see below). +-------------------------------------------------------------------------+ | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR +-------------------------------------------------------------------------+ Once an architecture implements a frequency scale-invariant utilization (the PELT signal util_avg), schedutil switches its frequency selection formula from freq_next = 1.25 * freq_curr * util [non-invariant util signal] to freq_next = 1.25 * freq_max * util [invariant util signal] where, in the second formula, freq_max is set to the 1C turbo frequency (max turbo). The advantage of the second formula, whose usage we unlock with this patch, is that freq_next doesn't depend on the current frequency in an iterative fashion, but can jump to any frequency in a single update. This absence of feedback in the formula makes it quicker to react to utilization changes and more robust against pathological instabilities. Compare it to the update formula of intel_pstate/powersave: freq_next = 1.25 * freq_max * Busy% where again freq_max is 1C turbo and Busy% is the percentage of time not spent idling (calculated with delta_MPERF / delta_TSC); essentially the same as invariant schedutil, and largely responsible for intel_pstate/powersave good reputation. The non-invariant schedutil formula is derived from the invariant one by approximating util_inv with util_raw * freq_curr / freq_max, but this has limitations. Testing shows improved performances due to better frequency selections when the machine is lightly loaded, and essentially no change in behaviour at saturation / overutilization. +-------------------------------------------------------------------------+ | 4. KNOWN LIMITATIONS +-------------------------------------------------------------------------+ It's been shown that it is possible to create pathological scenarios where a CPU-bound task cannot reach max utilization, if the normalizing factor freq_max is fixed to a constant value (see [Lelli-2018]). If freq_max is set to 4C turbo as we do here, one needs to peg at least 5 cores in a package doing some busywork, and observe that none of those task will ever reach max util (1024) because they're all running at less than the 4C turbo frequency. While this concern still applies, we believe the performance benefit of frequency scale-invariant PELT signals outweights the cost of this limitation. [Lelli-2018] https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/ +-------------------------------------------------------------------------+ | 5. PERFORMANCE TESTING +-------------------------------------------------------------------------+ 5.1 MACHINES ============ We tested the patch on three machines, with Skylake, Broadwell and Haswell CPUs. The details are below, together with the available turbo ratios as reported by the appropriate MSRs. * 8x-SKYLAKE-UMA: Single socket E3-1240 v5, Skylake 4 cores/8 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 800 |******** BASE 3500 |*********************************** 4C 3700 |************************************* 3C 3800 |************************************** 2C 3900 |*************************************** 1C 3900 |*************************************** * 80x-BROADWELL-NUMA: Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 1200 |************ BASE 2200 |********************** 8C 2900 |***************************** 7C 3000 |****************************** 6C 3100 |******************************* 5C 3200 |******************************** 4C 3300 |********************************* 3C 3400 |********************************** 2C 3600 |************************************ 1C 3600 |************************************ * 48x-HASWELL-NUMA Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 1200 |************ BASE 2300 |*********************** 12C 2600 |************************** 11C 2600 |************************** 10C 2600 |************************** 9C 2600 |************************** 8C 2600 |************************** 7C 2600 |************************** 6C 2600 |************************** 5C 2700 |*************************** 4C 2800 |**************************** 3C 2900 |***************************** 2C 3100 |******************************* 1C 3100 |******************************* 5.2 SETUP ========= * The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate driver in passive mode. * The rationale for choosing the various freq_max values to test have been to try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical on all machines), plus one more value closer to base_freq but still in the turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA). * In addition we've run all tests with intel_pstate/powersave for comparison. * The filesystem is always XFS, the userspace is openSUSE Leap 15.1. * 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs with active intel_pstate on this machine use that. This gives, in terms of combinations tested on each machine: * 8x-SKYLAKE-UMA * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive * intel_pstate active + powersave + HWP * invariant schedutil, freq_max = 1C turbo * invariant schedutil, freq_max = 3C turbo * invariant schedutil, freq_max = 4C turbo * both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA * [same as 8x-SKYLAKE-UMA, but no HWP capable] * invariant schedutil, freq_max = 8C turbo (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo") 5.3 BENCHMARK RESULTS ===================== 5.3.1 NEUTRAL BENCHMARKS ------------------------ Tests that didn't show any measurable difference in performance on any of the test machines between non-invariant schedutil and our patch are: * NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any computational kernel * flexible I/O (FIO) * hackbench (using threads or processes, and using pipes or sockets) 5.3.2 NON-NEUTRAL BENCHMARKS ---------------------------- What follow are summary tables where each benchmark result is given a score. * A tilde (~) means a neutral result, i.e. no difference from baseline. * Scores are computed with the ratio result_new / result_baseline, so a tilde means a score of 1.00. * The results in the score ratio are the geometric means of results running the benchmark with different parameters (eg: for kernbench: using 1, 2, 4, ... number of processes; for pgbench: varying the number of clients, and so on). * The first three tables show higher-is-better kind of tests (i.e. measured in operations/second), the subsequent three show lower-is-better kind of tests (i.e. the workload is fixed and we measure elapsed time, think kernbench). * "gitsource" is a name we made up for the test consisting in running the entire unit tests suite of the Git SCM and measuring how long it takes. We take it as a typical example of shell-intensive serialized workload. * In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other columns show invariant schedutil for different values of freq_max. 4C turbo is circled as it's the value we've chosen for the final implementation. 80x-BROADWELL-NUMA (comparison ratio; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.14 ~ ~ | 1.11 | 1.14 pgbench-rw ~ ~ ~ | ~ | ~ netperf-udp 1.06 ~ 1.06 | 1.05 | 1.07 netperf-tcp ~ 1.03 ~ | 1.01 | 1.02 tbench4 1.57 1.18 1.22 | 1.30 | 1.56 +------+ 8x-SKYLAKE-UMA (comparison ratio; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.30 1.14 1.14 | 1.16 | +------+ 48x-HASWELL-NUMA (comparison ratio; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.15 ~ ~ | 1.06 | 1.16 pgbench-rw ~ ~ ~ | ~ | ~ netperf-udp 1.05 0.97 1.04 | 1.04 | 1.02 netperf-tcp 0.96 1.01 1.01 | 1.01 | 1.01 tbench4 1.50 1.05 1.13 | 1.13 | 1.25 +------+ In the table above we see that active intel_pstate is slightly better than our 4C-turbo patch (both in reference to the baseline non-invariant schedutil) on read-only pgbench and much better on tbench. Both cases are notable in which it shows that lowering our freq_max (to 8C-turbo and 12C-turbo on 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant schedutil to get closer. If we ignore active intel_pstate and focus on the comparison with baseline alone, there are several instances of double-digit performance improvement. 80x-BROADWELL-NUMA (comparison ratio; lower is better) +------+ I_PSTATE 1C 3C | 4C | 8C dbench4 1.23 0.95 0.95 | 0.95 | 0.95 kernbench 0.93 0.83 0.83 | 0.83 | 0.82 gitsource 0.98 0.49 0.49 | 0.49 | 0.48 +------+ 8x-SKYLAKE-UMA (comparison ratio; lower is better) +------+ I_PSTATE/HWP 1C 3C | 4C | dbench4 ~ ~ ~ | ~ | kernbench ~ ~ ~ | ~ | gitsource 0.92 0.55 0.55 | 0.55 | +------+ 48x-HASWELL-NUMA (comparison ratio; lower is better) +------+ I_PSTATE 1C 3C | 4C | 8C dbench4 ~ ~ ~ | ~ | ~ kernbench 0.94 0.90 0.89 | 0.90 | 0.90 gitsource 0.97 0.69 0.69 | 0.69 | 0.69 +------+ dbench is not very remarkable here, unless we notice how poorly active intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus non-invariant schedutil. We repeated that run getting consistent results. Out of scope for the patch at hand, but deserving future investigation. Other than that, we previously ran this campaign with Linux v5.0 and saw the patch doing better on dbench a the time. We haven't checked closely and can only speculate at this point. On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in the detailed tables that the gains concentrate on low process counts (lightly loaded machines). The test we call "gitsource" (running the git unit test suite, a long-running single-threaded shell script) appears rather spectacular in this table (gains of 30-50% depending on the machine). It is to be noted, however, that gitsource has no adjustable parameters (such as the number of jobs in kernbench, which we average over in order to get a single-number summary score) and is exactly the kind of low-parallelism workload that benefits the most from this patch. When looking at the detailed tables of kernbench or tbench4, at low process or client counts one can see similar numbers. 5.3.3 SELECTION OF DETAILED RESULTS ----------------------------------- Machine : 48x-HASWELL-NUMA Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback) Varying parameter : number of clients Unit : MB/sec (higher is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 126.73 +- 0.31% ( ) 315.91 +- 0.66% ( 149.28%) 125.03 +- 0.76% ( -1.34%) Hmean 2 258.04 +- 0.62% ( ) 614.16 +- 0.51% ( 138.01%) 269.58 +- 1.45% ( 4.47%) Hmean 4 514.30 +- 0.67% ( ) 1146.58 +- 0.54% ( 122.94%) 533.84 +- 1.99% ( 3.80%) Hmean 8 1111.38 +- 2.52% ( ) 2159.78 +- 0.38% ( 94.33%) 1359.92 +- 1.56% ( 22.36%) Hmean 16 2286.47 +- 1.36% ( ) 3338.29 +- 0.21% ( 46.00%) 2720.20 +- 0.52% ( 18.97%) Hmean 32 4704.84 +- 0.35% ( ) 4759.03 +- 0.43% ( 1.15%) 4774.48 +- 0.30% ( 1.48%) Hmean 64 7578.04 +- 0.27% ( ) 7533.70 +- 0.43% ( -0.59%) 7462.17 +- 0.65% ( -1.53%) Hmean 128 6998.52 +- 0.16% ( ) 6987.59 +- 0.12% ( -0.16%) 6909.17 +- 0.14% ( -1.28%) Hmean 192 6901.35 +- 0.25% ( ) 6913.16 +- 0.10% ( 0.17%) 6855.47 +- 0.21% ( -0.66%) 5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 12C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 128.43 +- 0.28% ( 1.34%) 130.64 +- 3.81% ( 3.09%) 153.71 +- 5.89% ( 21.30%) Hmean 2 311.70 +- 6.15% ( 20.79%) 281.66 +- 3.40% ( 9.15%) 305.08 +- 5.70% ( 18.23%) Hmean 4 641.98 +- 2.32% ( 24.83%) 623.88 +- 5.28% ( 21.31%) 906.84 +- 4.65% ( 76.32%) Hmean 8 1633.31 +- 1.56% ( 46.96%) 1714.16 +- 0.93% ( 54.24%) 2095.74 +- 0.47% ( 88.57%) Hmean 16 3047.24 +- 0.42% ( 33.27%) 3155.02 +- 0.30% ( 37.99%) 3634.58 +- 0.15% ( 58.96%) Hmean 32 4734.31 +- 0.60% ( 0.63%) 4804.38 +- 0.23% ( 2.12%) 4674.62 +- 0.27% ( -0.64%) Hmean 64 7699.74 +- 0.35% ( 1.61%) 7499.72 +- 0.34% ( -1.03%) 7659.03 +- 0.25% ( 1.07%) Hmean 128 6935.18 +- 0.15% ( -0.91%) 6942.54 +- 0.10% ( -0.80%) 7004.85 +- 0.12% ( 0.09%) Hmean 192 6901.62 +- 0.12% ( 0.00%) 6856.93 +- 0.10% ( -0.64%) 6978.74 +- 0.10% ( 1.12%) This is one of the cases where the patch still can't surpass active intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are visible up to 16 clients and the saturated scenario is the same as baseline. The scores in the summary table from the previous sections are ratios of geometric means of the results over different clients, as seen in this table. Machine : 80x-BROADWELL-NUMA Benchmark : kernbench (kernel compilation) Varying parameter : number of jobs Unit : seconds (lower is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 379.68 +- 0.06% ( ) 330.20 +- 0.43% ( 13.03%) 285.93 +- 0.07% ( 24.69%) Amean 4 200.15 +- 0.24% ( ) 175.89 +- 0.22% ( 12.12%) 153.78 +- 0.25% ( 23.17%) Amean 8 106.20 +- 0.31% ( ) 95.54 +- 0.23% ( 10.03%) 86.74 +- 0.10% ( 18.32%) Amean 16 56.96 +- 1.31% ( ) 53.25 +- 1.22% ( 6.50%) 48.34 +- 1.73% ( 15.13%) Amean 32 34.80 +- 2.46% ( ) 33.81 +- 0.77% ( 2.83%) 30.28 +- 1.59% ( 12.99%) Amean 64 26.11 +- 1.63% ( ) 25.04 +- 1.07% ( 4.10%) 22.41 +- 2.37% ( 14.16%) Amean 128 24.80 +- 1.36% ( ) 23.57 +- 1.23% ( 4.93%) 21.44 +- 1.37% ( 13.55%) Amean 160 24.85 +- 0.56% ( ) 23.85 +- 1.17% ( 4.06%) 21.25 +- 1.12% ( 14.49%) 5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 8C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 284.08 +- 0.13% ( 25.18%) 283.96 +- 0.51% ( 25.21%) 285.05 +- 0.21% ( 24.92%) Amean 4 153.18 +- 0.22% ( 23.47%) 154.70 +- 1.64% ( 22.71%) 153.64 +- 0.30% ( 23.24%) Amean 8 87.06 +- 0.28% ( 18.02%) 86.77 +- 0.46% ( 18.29%) 86.78 +- 0.22% ( 18.28%) Amean 16 48.03 +- 0.93% ( 15.68%) 47.75 +- 1.99% ( 16.17%) 47.52 +- 1.61% ( 16.57%) Amean 32 30.23 +- 1.20% ( 13.14%) 30.08 +- 1.67% ( 13.57%) 30.07 +- 1.67% ( 13.60%) Amean 64 22.59 +- 2.02% ( 13.50%) 22.63 +- 0.81% ( 13.32%) 22.42 +- 0.76% ( 14.12%) Amean 128 21.37 +- 0.67% ( 13.82%) 21.31 +- 1.15% ( 14.07%) 21.17 +- 1.93% ( 14.63%) Amean 160 21.68 +- 0.57% ( 12.76%) 21.18 +- 1.74% ( 14.77%) 21.22 +- 1.00% ( 14.61%) The patch outperform active intel_pstate (and baseline) by a considerable margin; the summary table from the previous section says 4C turbo and active intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is 0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no noticeable difference with regard to the value of freq_max. Machine : 8x-SKYLAKE-UMA Benchmark : gitsource (time to run the git unit test suite) Varying parameter : none Unit : seconds (lower is better) 5.2.0 vanilla 5.2.0 intel_pstate/hwp 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 858.85 +- 1.16% ( ) 791.94 +- 0.21% ( 7.79%) 474.95 ( 44.70%) 5.2.0 3C-turbo 5.2.0 4C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 475.26 +- 0.20% ( 44.66%) 474.34 +- 0.13% ( 44.77%) In this test, which is of interest as representing shell-intensive (i.e. fork-intensive) serialized workloads, invariant schedutil outperforms intel_pstate/powersave by a whopping 40% margin. 5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT --------------------------------------------- The following table shows average power consumption in watt for each benchmark. Data comes from turbostat (package average), which in turn is read from the RAPL interface on CPUs. We know the patch affects CPU frequencies so it's reasonable to ignore other power consumers (such as memory or I/O). Also, we don't have a power meter available in the lab so RAPL is the best we have. turbostat sampled average power every 10 seconds for the entire duration of each benchmark. We took all those values and averaged them (i.e. with don't have detail on a per-parameter granularity, only on whole benchmarks). 80x-BROADWELL-NUMA (power consumption, watts) +--------+ BASELINE I_PSTATE 1C 3C | 4C | 8C pgbench-ro 130.01 142.77 131.11 132.45 | 134.65 | 136.84 pgbench-rw 68.30 60.83 71.45 71.70 | 71.65 | 72.54 dbench4 90.25 59.06 101.43 99.89 | 101.10 | 102.94 netperf-udp 65.70 69.81 66.02 68.03 | 68.27 | 68.95 netperf-tcp 88.08 87.96 88.97 88.89 | 88.85 | 88.20 tbench4 142.32 176.73 153.02 163.91 | 165.58 | 176.07 kernbench 92.94 101.95 114.91 115.47 | 115.52 | 115.10 gitsource 40.92 41.87 75.14 75.20 | 75.40 | 75.70 +--------+ 8x-SKYLAKE-UMA (power consumption, watts) +--------+ BASELINE I_PSTATE/HWP 1C 3C | 4C | pgbench-ro 46.49 46.68 46.56 46.59 | 46.52 | pgbench-rw 29.34 31.38 30.98 31.00 | 31.00 | dbench4 27.28 27.37 27.49 27.41 | 27.38 | netperf-udp 22.33 22.41 22.36 22.35 | 22.36 | netperf-tcp 27.29 27.29 27.30 27.31 | 27.33 | tbench4 41.13 45.61 43.10 43.33 | 43.56 | kernbench 42.56 42.63 43.01 43.01 | 43.01 | gitsource 13.32 13.69 17.33 17.30 | 17.35 | +--------+ 48x-HASWELL-NUMA (power consumption, watts) +--------+ BASELINE I_PSTATE 1C 3C | 4C | 12C pgbench-ro 128.84 136.04 129.87 132.43 | 132.30 | 134.86 pgbench-rw 37.68 37.92 37.17 37.74 | 37.73 | 37.31 dbench4 28.56 28.73 28.60 28.73 | 28.70 | 28.79 netperf-udp 56.70 60.44 56.79 57.42 | 57.54 | 57.52 netperf-tcp 75.49 75.27 75.87 76.02 | 76.01 | 75.95 tbench4 115.44 139.51 119.53 123.07 | 123.97 | 130.22 kernbench 83.23 91.55 95.58 95.69 | 95.72 | 96.04 gitsource 36.79 36.99 39.99 40.34 | 40.35 | 40.23 +--------+ A lower power consumption isn't necessarily better, it depends on what is done with that energy. Here are tables with the ratio of performance-per-watt on each machine and benchmark. Higher is always better; a tilde (~) means a neutral ratio (i.e. 1.00). 80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.04 1.06 0.94 | 1.07 | 1.08 pgbench-rw 1.10 0.97 0.96 | 0.96 | 0.97 dbench4 1.24 0.94 0.95 | 0.94 | 0.92 netperf-udp ~ 1.02 1.02 | ~ | 1.02 netperf-tcp ~ 1.02 ~ | ~ | 1.02 tbench4 1.26 1.10 1.06 | 1.12 | 1.26 kernbench 0.98 0.97 0.97 | 0.97 | 0.98 gitsource ~ 1.11 1.11 | 1.11 | 1.13 +------+ 8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw 0.95 0.97 0.96 | 0.96 | dbench4 ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.17 1.09 1.08 | 1.10 | kernbench ~ ~ ~ | ~ | gitsource 1.06 1.40 1.40 | 1.40 | +------+ 48x-HASWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.09 ~ 1.09 | 1.03 | 1.11 pgbench-rw ~ 0.86 ~ | ~ | 0.86 dbench4 ~ 1.02 1.02 | 1.02 | ~ netperf-udp ~ 0.97 1.03 | 1.02 | ~ netperf-tcp 0.96 ~ ~ | ~ | ~ tbench4 1.24 ~ 1.06 | 1.05 | 1.11 kernbench 0.97 0.97 0.98 | 0.97 | 0.96 gitsource 1.03 1.33 1.32 | 1.32 | 1.33 +------+ These results are overall pleasing: in plenty of cases we observe performance-per-watt improvements. The few regressions (read/write pgbench and dbench on the Broadwell machine) are of small magnitude. kernbench loses a few percentage points (it has a 10-15% performance improvement, but apparently the increase in power consumption is larger than that). tbench4 and gitsource, which benefit the most from the patch, keep a positive score in this table which is a welcome surprise; that suggests that in those particular workloads the non-invariant schedutil (and active intel_pstate, too) makes some rather suboptimal frequency selections. +-------------------------------------------------------------------------+ | 6. MICROARCH'ES ADDRESSED HERE +-------------------------------------------------------------------------+ The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies respectively. This excludes the recent Xeon Scalable Performance processors line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently. Subsequent patches will address: * Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus * Xeon Phi (Knights Landing, Knights Mill) * Atom Silvermont +-------------------------------------------------------------------------+ | 7. REFERENCES +-------------------------------------------------------------------------+ Tests have been run with the help of the MMTests performance testing framework, see github.com/gormanm/mmtests. The configuration file names for the benchmark used are: db-pgbench-timed-ro-small-xfs db-pgbench-timed-rw-small-xfs io-dbench4-async-xfs network-netperf-unbound network-tbench scheduler-unbound workload-kerndevel-xfs workload-shellscripts-xfs hpc-nas-c-class-mpi-full-xfs hpc-nas-c-class-omp-full All those benchmarks are generally available on the web: pgbench: https://www.postgresql.org/docs/10/pgbench.html netperf: https://hewlettpackard.github.io/netperf/ dbench/tbench: https://dbench.samba.org/ gitsource: git unit test suite, github.com/git/git NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c Suggested-by: Peter Zijlstra Signed-off-by: Giovanni Gherdovich Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Acked-by: Doug Smythies Acked-by: Rafael J. Wysocki Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz --- arch/x86/include/asm/topology.h | 20 +++++ arch/x86/kernel/smpboot.c | 183 +++++++++++++++++++++++++++++++++++++++- 2 files changed, 202 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 4b14d2318251..2ebf7b7b2126 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -193,4 +193,24 @@ static inline void sched_clear_itmt_support(void) } #endif /* CONFIG_SCHED_MC_PRIO */ +#ifdef CONFIG_SMP +#include + +DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key); + +#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key) + +DECLARE_PER_CPU(unsigned long, arch_freq_scale); + +static inline long arch_scale_freq_capacity(int cpu) +{ + return per_cpu(arch_freq_scale, cpu); +} +#define arch_scale_freq_capacity arch_scale_freq_capacity + +extern void arch_scale_freq_tick(void); +#define arch_scale_freq_tick arch_scale_freq_tick + +#endif + #endif /* _ASM_X86_TOPOLOGY_H */ diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 69881b2d446c..28696bccf912 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -147,6 +147,8 @@ static inline void smpboot_restore_warm_reset_vector(void) *((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0; } +static void init_freq_invariance(void); + /* * Report back to the Boot Processor during boot time or to the caller processor * during CPU online. @@ -183,6 +185,8 @@ static void smp_callin(void) */ set_cpu_sibling_map(raw_smp_processor_id()); + init_freq_invariance(); + /* * Get our bogomips. * Update loops_per_jiffy in cpu_data. Previous call to @@ -1337,7 +1341,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus) set_sched_topology(x86_topology); set_cpu_sibling_map(0); - + init_freq_invariance(); smp_sanity_check(); switch (apic_intr_mode) { @@ -1764,3 +1768,180 @@ void native_play_dead(void) } #endif + +/* + * APERF/MPERF frequency ratio computation. + * + * The scheduler wants to do frequency invariant accounting and needs a <1 + * ratio to account for the 'current' frequency, corresponding to + * freq_curr / freq_max. + * + * Since the frequency freq_curr on x86 is controlled by micro-controller and + * our P-state setting is little more than a request/hint, we need to observe + * the effective frequency 'BusyMHz', i.e. the average frequency over a time + * interval after discarding idle time. This is given by: + * + * BusyMHz = delta_APERF / delta_MPERF * freq_base + * + * where freq_base is the max non-turbo P-state. + * + * The freq_max term has to be set to a somewhat arbitrary value, because we + * can't know which turbo states will be available at a given point in time: + * it all depends on the thermal headroom of the entire package. We set it to + * the turbo level with 4 cores active. + * + * Benchmarks show that's a good compromise between the 1C turbo ratio + * (freq_curr/freq_max would rarely reach 1) and something close to freq_base, + * which would ignore the entire turbo range (a conspicuous part, making + * freq_curr/freq_max always maxed out). + * + * Setting freq_max to anything less than the 1C turbo ratio makes the ratio + * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1. + */ + +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key); + +static DEFINE_PER_CPU(u64, arch_prev_aperf); +static DEFINE_PER_CPU(u64, arch_prev_mperf); +static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE; + +static bool turbo_disabled(void) +{ + u64 misc_en; + int err; + + err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en); + if (err) + return false; + + return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE); +} + +#include +#include + +#define ICPU(model) \ + {X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0} + +static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = { + ICPU(INTEL_FAM6_XEON_PHI_KNL), + ICPU(INTEL_FAM6_XEON_PHI_KNM), + {} +}; + +static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = { + ICPU(INTEL_FAM6_SKYLAKE_X), + {} +}; + +static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = { + ICPU(INTEL_FAM6_ATOM_GOLDMONT), + ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), + ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS), + {} +}; + +static bool core_set_max_freq_ratio(void) +{ + u64 base_freq, turbo_freq; + int err; + + err = rdmsrl_safe(MSR_PLATFORM_INFO, &base_freq); + if (err) + return false; + + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_freq); + if (err) + return false; + + base_freq = (base_freq >> 8) & 0xFF; /* max P state */ + turbo_freq = (turbo_freq >> 24) & 0xFF; /* 4C turbo */ + + arch_max_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, + base_freq); + return true; +} + +static bool intel_set_max_freq_ratio(void) +{ + /* + * TODO: add support for: + * + * - Xeon Gold/Platinum + * - Xeon Phi (KNM, KNL) + * - Atom Goldmont + * - Atom Silvermont + */ + + if (x86_match_cpu(has_skx_turbo_ratio_limits) || + x86_match_cpu(has_knl_turbo_ratio_limits) || + x86_match_cpu(has_glm_turbo_ratio_limits)) + return false; + + if (turbo_disabled() || core_set_max_freq_ratio()) + return true; + + return false; +} + +static void init_counter_refs(void *arg) +{ + u64 aperf, mperf; + + rdmsrl(MSR_IA32_APERF, aperf); + rdmsrl(MSR_IA32_MPERF, mperf); + + this_cpu_write(arch_prev_aperf, aperf); + this_cpu_write(arch_prev_mperf, mperf); +} + +static void init_freq_invariance(void) +{ + bool ret = false; + + if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF)) + return; + + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + ret = intel_set_max_freq_ratio(); + + if (ret) { + on_each_cpu(init_counter_refs, NULL, 1); + static_branch_enable(&arch_scale_freq_key); + } else { + pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n"); + } +} + +DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE; + +void arch_scale_freq_tick(void) +{ + u64 freq_scale; + u64 aperf, mperf; + u64 acnt, mcnt; + + if (!arch_scale_freq_invariant()) + return; + + rdmsrl(MSR_IA32_APERF, aperf); + rdmsrl(MSR_IA32_MPERF, mperf); + + acnt = aperf - this_cpu_read(arch_prev_aperf); + mcnt = mperf - this_cpu_read(arch_prev_mperf); + if (!mcnt) + return; + + this_cpu_write(arch_prev_aperf, aperf); + this_cpu_write(arch_prev_mperf, mperf); + + acnt <<= 2*SCHED_CAPACITY_SHIFT; + mcnt *= arch_max_freq_ratio; + + freq_scale = div64_u64(acnt, mcnt); + + if (freq_scale > SCHED_CAPACITY_SCALE) + freq_scale = SCHED_CAPACITY_SCALE; + + this_cpu_write(arch_freq_scale, freq_scale); +} -- cgit From 2a0abc59699896f03bf6f16efb8a3a490511216f Mon Sep 17 00:00:00 2001 From: Giovanni Gherdovich Date: Wed, 22 Jan 2020 16:16:13 +0100 Subject: x86, sched: Add support for frequency invariance on SKYLAKE_X The scheduler needs the ratio freq_curr/freq_max for frequency-invariant accounting. On SKYLAKE_X CPUs set freq_max to the highest frequency that can be sustained by a group of at least 4 cores. From the changelog of commit 31e07522be56 ("tools/power turbostat: fix decoding for GLM, DNV, SKX turbo-ratio limits"): > Newer processors do not hard-code the the number of cpus in each bin > to {1, 2, 3, 4, 5, 6, 7, 8} Rather, they can specify any number > of CPUS in each of the 8 bins: > > eg. > > ... > 37 * 100.0 = 3600.0 MHz max turbo 4 active cores > 38 * 100.0 = 3700.0 MHz max turbo 3 active cores > 39 * 100.0 = 3800.0 MHz max turbo 2 active cores > 39 * 100.0 = 3900.0 MHz max turbo 1 active cores > > could now look something like this: > > ... > 37 * 100.0 = 3600.0 MHz max turbo 16 active cores > 38 * 100.0 = 3700.0 MHz max turbo 8 active cores > 39 * 100.0 = 3800.0 MHz max turbo 4 active cores > 39 * 100.0 = 3900.0 MHz max turbo 2 active cores This encoding of turbo levels applies to both SKYLAKE_X and GOLDMONT/GOLDMONT_D, but we treat these two classes in separate commits because their freq_max values need to be different. For SKX we prefer a lower freq_max in the ratio freq_curr/freq_max, allowing load and utilization to overshoot and the schedutil governor to be more performance-oriented. Models from the Atom series (such as GOLDMONT*) are handled in a forthcoming commit as they have to favor power-efficiency over performance. Results from a performance evaluation follow. 1. TEST SETUP 2. NEUTRAL BENCHMARKS 3. NON-NEUTRAL BENCHMARKS 4. DETAILED TABLES 1. TEST SETUP ------------- Test machine: CPU Model : Intel Xeon Platinum 8260L CPU @ 2.40GHz (a.k.a. Cascade Lake) Fam/Mod/Ste : 6:85:6 Topology : 2 sockets, 24 cores / 48 threads each socket Memory : 192G Storage : SSD, XFS filesystem Max EFFICiency, BASE frequency and available turbo levels (MHz): EFFIC 1000 |********** BASE 2400 |************************ 24C 3100 |******************************* 20C 3300 |********************************* 16C 3600 |************************************ 12C 3600 |************************************ 8C 3600 |************************************ 4C 3700 |************************************* 2C 3900 |*************************************** Tested kernels: Baseline : v5.2, intel_pstate passive, schedutil Comparison #1 : v5.2, intel_pstate active , powersave+HWP Comparison #2 : v5.2, this patch, intel_pstate passive, schedutil 2. NEUTRAL BENCHMARKS --------------------- * pgbench read/write * NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing * hackbench * netperf 3. NON-NEUTRAL BENCHMARKS ------------------------- comparison ratio with baseline; 1.00 means neutral, higher is better: I_PSTATE FREQ-INV ---------------------------------------- pgbench read-only 1.10 ~ tbench 1.82 1.14 comparison ratio with baseline; 1.00 means neutral, lower is better: I_PSTATE FREQ-INV ---------------------------------------- dbench ~ 0.97 kernbench 0.88 0.78 gitsource[*] ~ 0.46 [*] "gitsource" consists in running git's unit tests tilde (~) means 1.00, ie result identical to baseline Performance per watt: performance-per-watt ratios with baseline; 1.00 means neutral, higher is better: I_PSTATE FREQ-INV ---------------------------------------- dbench 0.92 0.91 tbench 1.26 1.04 kernbench 0.95 0.96 gitsource 1.03 1.30 Similarly to earlier Xeons, measurable performance gains over non-invariant schedutil are observed on dbench, tbench, kernel compilation and running the git unit tests suite. Looking at the detailed tables show that the patch scores the largest difference when the machine is lightly loaded. Power efficiency suffers lightly on kernbench and a bit more on dbench, but largely improves on gitsource (which also runs considerably faster). For reference, we also report results using active intel_pstate with powersave and HWP; the largest gap between non-invariant schedutil and intel_pstate+powersave is still tbench, which runs 82% better and with 26% improved efficiency on the latter configuration -- this divide isn't closed yet by frequency-invariant schedutil. 4. DETAILED TABLES ------------------ Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback) Varying parameter : number of clients Unit : MB/sec (higher is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate/HWP 5.2.0 freq-inv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 183.56 +- 0.21% ( ) 516.12 +- 0.57% ( 181.18%) 185.59 +- 0.59% ( 1.11%) Hmean 2 365.75 +- 0.25% ( ) 1015.14 +- 0.33% ( 177.55%) 402.59 +- 4.48% ( 10.07%) Hmean 4 720.99 +- 0.44% ( ) 1951.75 +- 0.28% ( 170.70%) 738.39 +- 1.72% ( 2.41%) Hmean 8 1449.93 +- 0.34% ( ) 3830.56 +- 0.24% ( 164.19%) 1750.36 +- 4.65% ( 20.72%) Hmean 16 2874.26 +- 0.57% ( ) 7381.62 +- 0.53% ( 156.82%) 4348.35 +- 2.22% ( 51.29%) Hmean 32 6116.17 +- 5.10% ( ) 13013.05 +- 0.08% ( 112.76%) 8980.35 +- 0.66% ( 46.83%) Hmean 64 14485.04 +- 3.46% ( ) 17835.12 +- 0.35% ( 23.13%) 16540.73 +- 0.51% ( 14.19%) Hmean 128 30779.16 +- 3.20% ( ) 32796.94 +- 2.13% ( 6.56%) 31512.58 +- 0.20% ( 2.38%) Hmean 256 34664.66 +- 0.81% ( ) 34604.67 +- 0.46% ( -0.17%) 34943.70 +- 0.25% ( 0.80%) Hmean 384 33957.51 +- 0.11% ( ) 34091.50 +- 0.14% ( 0.39%) 33921.41 +- 0.09% ( -0.11%) Benchmark : kernbench (kernel compilation) Varying parameter : number of jobs Unit : seconds (lower is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate/HWP 5.2.0 freq-inv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 332.94 +- 0.40% ( ) 260.16 +- 0.45% ( 21.86%) 233.56 +- 0.21% ( 29.85%) Amean 4 173.04 +- 0.43% ( ) 138.76 +- 0.03% ( 19.81%) 123.59 +- 0.11% ( 28.58%) Amean 8 89.65 +- 0.20% ( ) 73.54 +- 0.09% ( 17.97%) 65.69 +- 0.10% ( 26.72%) Amean 16 48.08 +- 1.41% ( ) 41.64 +- 1.61% ( 13.40%) 36.00 +- 1.80% ( 25.11%) Amean 32 28.78 +- 0.72% ( ) 26.61 +- 1.99% ( 7.55%) 23.19 +- 1.68% ( 19.43%) Amean 64 20.46 +- 1.85% ( ) 19.76 +- 0.35% ( 3.42%) 17.38 +- 0.92% ( 15.06%) Amean 128 18.69 +- 1.70% ( ) 17.59 +- 1.04% ( 5.90%) 15.73 +- 1.40% ( 15.85%) Amean 192 18.82 +- 1.01% ( ) 17.76 +- 0.77% ( 5.67%) 15.57 +- 1.80% ( 17.28%) Benchmark : gitsource (time to run the git unit test suite) Varying parameter : none Unit : seconds (lower is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate/HWP 5.2.0 freq-inv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 792.49 +- 0.20% ( ) 779.35 +- 0.24% ( 1.66%) 427.14 +- 0.16% ( 46.10%) Signed-off-by: Giovanni Gherdovich Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Acked-by: Rafael J. Wysocki Link: https://lkml.kernel.org/r/20200122151617.531-3-ggherdovich@suse.cz --- arch/x86/kernel/smpboot.c | 66 +++++++++++++++++++++++++++++++++++++---------- 1 file changed, 53 insertions(+), 13 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 28696bccf912..ba9d3bdc191c 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1841,24 +1841,52 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = { {} }; -static bool core_set_max_freq_ratio(void) +static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size) +{ + u64 ratios, counts; + u32 group_size; + int err, i; + + err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq); + if (err) + return false; + + *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */ + + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios); + if (err) + return false; + + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts); + if (err) + return false; + + for (i = 0; i < 64; i += 8) { + group_size = (counts >> i) & 0xFF; + if (group_size >= size) { + *turbo_freq = (ratios >> i) & 0xFF; + return true; + } + } + + return false; +} + +static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq) { - u64 base_freq, turbo_freq; int err; - err = rdmsrl_safe(MSR_PLATFORM_INFO, &base_freq); + err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq); if (err) return false; - err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_freq); + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, turbo_freq); if (err) return false; - base_freq = (base_freq >> 8) & 0xFF; /* max P state */ - turbo_freq = (turbo_freq >> 24) & 0xFF; /* 4C turbo */ + *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */ + *turbo_freq = (*turbo_freq >> 24) & 0xFF; /* 4C turbo */ - arch_max_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, - base_freq); return true; } @@ -1867,21 +1895,33 @@ static bool intel_set_max_freq_ratio(void) /* * TODO: add support for: * - * - Xeon Gold/Platinum * - Xeon Phi (KNM, KNL) * - Atom Goldmont * - Atom Silvermont */ - if (x86_match_cpu(has_skx_turbo_ratio_limits) || - x86_match_cpu(has_knl_turbo_ratio_limits) || + u64 base_freq = 1, turbo_freq = 1; + + if (x86_match_cpu(has_knl_turbo_ratio_limits) || x86_match_cpu(has_glm_turbo_ratio_limits)) return false; - if (turbo_disabled() || core_set_max_freq_ratio()) - return true; + if (turbo_disabled()) + goto out; + + if (x86_match_cpu(has_skx_turbo_ratio_limits) && + skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4)) + goto out; + + if (core_set_max_freq_ratio(&base_freq, &turbo_freq)) + goto out; return false; + +out: + arch_max_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, + base_freq); + return true; } static void init_counter_refs(void *arg) -- cgit From 8bea0dfb4a820ae063568a87cc2e7d8f587377af Mon Sep 17 00:00:00 2001 From: Giovanni Gherdovich Date: Wed, 22 Jan 2020 16:16:14 +0100 Subject: x86, sched: Add support for frequency invariance on XEON_PHI_KNL/KNM The scheduler needs the ratio freq_curr/freq_max for frequency-invariant accounting. On Xeon Phi CPUs set freq_max to the second-highest frequency reported by the CPU. Xeon Phi CPUs such as Knights Landing and Knights Mill typically have either one or two turbo frequencies; in the former case that's 100 MHz above the base frequency, in the latter case the two levels are 100 MHz and 200 MHz above base frequency. We set freq_max to the second-highest frequency reported by the CPU. This could be the base frequency (if only one turbo level is available) or the first turbo level (if two levels are available). The rationale is to compromise between power efficiency or performance -- going straight to max turbo would favor efficiency and blindly using base freq would favor performance. For reference, this is how MSR_TURBO_RATIO_LIMIT must be parsed on a Xeon Phi to get the available frequencies (taken from a comment in turbostat's sources): [0] -- Reserved [7:1] -- Base value of number of active cores of bucket 1. [15:8] -- Base value of freq ratio of bucket 1. [20:16] -- +ve delta of number of active cores of bucket 2. i.e. active cores of bucket 2 = active cores of bucket 1 + delta [23:21] -- Negative delta of freq ratio of bucket 2. i.e. freq ratio of bucket 2 = freq ratio of bucket 1 - delta [28:24]-- +ve delta of number of active cores of bucket 3. [31:29]-- -ve delta of freq ratio of bucket 3. [36:32]-- +ve delta of number of active cores of bucket 4. [39:37]-- -ve delta of freq ratio of bucket 4. [44:40]-- +ve delta of number of active cores of bucket 5. [47:45]-- -ve delta of freq ratio of bucket 5. [52:48]-- +ve delta of number of active cores of bucket 6. [55:53]-- -ve delta of freq ratio of bucket 6. [60:56]-- +ve delta of number of active cores of bucket 7. [63:61]-- -ve delta of freq ratio of bucket 7. 1. PERFORMANCE EVALUATION: TBENCH +5% 2. NEUTRAL BENCHMARKS (ALL OTHERS) 3. TEST SETUP 1. PERFORMANCE EVALUATION: TBENCH +5% ------------------------------------- A performance evaluation was conducted on a Knights Mill machine (see "Test Setup" below), were the frequency-invariance patch (on schedutil) is compared to both non-invariant schedutil and active intel_pstate with powersave: all three tested kernels behave the same performance-wise and with regard to power consumption (performance per watt). The only notable difference is tbench: comparison ratio of performance with baseline; 1.00 means neutral, higher is better: I_PSTATE FREQ-INV ---------------------------------------- tbench 1.04 1.05 performance-per-watt ratios with baseline; 1.00 means neutral, higher is better: I_PSTATE FREQ-INV ---------------------------------------- tbench 1.03 1.04 which essentially means that frequency-invariant schedutil is 5% better than baseline, the same as intel_pstate+powersave. As the results above are averaged over the varying parameter, here the detailed table. Varying parameter : number of clients Unit : MB/sec (higher is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 freq-inv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 49.06 +- 2.12% ( ) 51.66 +- 1.52% ( 5.30%) 52.87 +- 0.88% ( 7.76%) Hmean 2 93.82 +- 0.45% ( ) 103.24 +- 0.70% ( 10.05%) 105.90 +- 0.70% ( 12.88%) Hmean 4 192.46 +- 1.15% ( ) 215.95 +- 0.60% ( 12.21%) 215.78 +- 1.43% ( 12.12%) Hmean 8 406.74 +- 2.58% ( ) 438.58 +- 0.36% ( 7.83%) 437.61 +- 0.97% ( 7.59%) Hmean 16 857.70 +- 1.22% ( ) 890.26 +- 0.72% ( 3.80%) 889.11 +- 0.73% ( 3.66%) Hmean 32 1760.10 +- 0.92% ( ) 1791.70 +- 0.44% ( 1.79%) 1787.95 +- 0.44% ( 1.58%) Hmean 64 3183.50 +- 0.34% ( ) 3183.19 +- 0.36% ( -0.01%) 3187.53 +- 0.36% ( 0.13%) Hmean 128 4830.96 +- 0.31% ( ) 4846.53 +- 0.30% ( 0.32%) 4855.86 +- 0.30% ( 0.52%) Hmean 256 5467.98 +- 0.38% ( ) 5793.80 +- 0.28% ( 5.96%) 5821.94 +- 0.17% ( 6.47%) Hmean 512 5398.10 +- 0.06% ( ) 5745.56 +- 0.08% ( 6.44%) 5503.68 +- 0.07% ( 1.96%) Hmean 1024 5290.43 +- 0.63% ( ) 5221.07 +- 0.47% ( -1.31%) 5277.22 +- 0.80% ( -0.25%) Hmean 1088 5139.71 +- 0.57% ( ) 5236.02 +- 0.71% ( 1.87%) 5190.57 +- 0.41% ( 0.99%) 2. NEUTRAL BENCHMARKS (ALL OTHERS) ---------------------------------- * pgbench (both read/write and read-only) * NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing * hackbench * netperf * dbench * kernbench * gitsource (git unit test suite) 3. TEST SETUP ------------- Test machine: CPU Model : Intel Xeon Phi CPU 7255 @ 1.10GHz (a.k.a. Knights Mill) Fam/Mod/Ste : 6:133:0 Topology : 1 socket, 68 cores / 272 threads Memory : 96G Storage : rotary, XFS filesystem Max EFFICiency, BASE frequency and available turbo levels (MHz): EFFIC 1000 |********** BASE 1100 |*********** 68C 1100 |*********** 30C 1200 |************ Tested kernels: Baseline : v5.2, intel_pstate passive, schedutil Comparison #1 : v5.2, intel_pstate active , powersave Comparison #2 : v5.2, this patch, intel_pstate passive, schedutil Signed-off-by: Giovanni Gherdovich Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Acked-by: Rafael J. Wysocki Link: https://lkml.kernel.org/r/20200122151617.531-4-ggherdovich@suse.cz --- arch/x86/kernel/smpboot.c | 49 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 46 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index ba9d3bdc191c..8cb3113377a9 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1841,6 +1841,48 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = { {} }; +static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, + int num_delta_fratio) +{ + int fratio, delta_fratio, found; + int err, i; + u64 msr; + + if (!x86_match_cpu(has_knl_turbo_ratio_limits)) + return false; + + err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq); + if (err) + return false; + + *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */ + + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr); + if (err) + return false; + + fratio = (msr >> 8) & 0xFF; + i = 16; + found = 0; + do { + if (found >= num_delta_fratio) { + *turbo_freq = fratio; + return true; + } + + delta_fratio = (msr >> (i + 5)) & 0x7; + + if (delta_fratio) { + found += 1; + fratio -= delta_fratio; + } + + i += 8; + } while (i < 64); + + return true; +} + static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size) { u64 ratios, counts; @@ -1895,20 +1937,21 @@ static bool intel_set_max_freq_ratio(void) /* * TODO: add support for: * - * - Xeon Phi (KNM, KNL) * - Atom Goldmont * - Atom Silvermont */ u64 base_freq = 1, turbo_freq = 1; - if (x86_match_cpu(has_knl_turbo_ratio_limits) || - x86_match_cpu(has_glm_turbo_ratio_limits)) + if (x86_match_cpu(has_glm_turbo_ratio_limits)) return false; if (turbo_disabled()) goto out; + if (knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1)) + goto out; + if (x86_match_cpu(has_skx_turbo_ratio_limits) && skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4)) goto out; -- cgit From eacf0474aec8bdccdc7f19386319127c67be3588 Mon Sep 17 00:00:00 2001 From: Giovanni Gherdovich Date: Wed, 22 Jan 2020 16:16:15 +0100 Subject: x86, sched: Add support for frequency invariance on ATOM_GOLDMONT* The scheduler needs the ratio freq_curr/freq_max for frequency-invariant accounting. On GOLDMONT (aka Apollo Lake), GOLDMONT_D (aka Denverton) and GOLDMONT_PLUS CPUs (aka Gemini Lake) set freq_max to the highest frequency reported by the CPU. The encoding of turbo ratios for GOLDMONT* is identical to the one for SKYLAKE_X, but we treat the Atom case apart because we want to set freq_max to a higher value, thus the ratio freq_curr/freq_max to be lower, leading to more conservative frequency selections (favoring power efficiency). Signed-off-by: Giovanni Gherdovich Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Acked-by: Rafael J. Wysocki Link: https://lkml.kernel.org/r/20200122151617.531-5-ggherdovich@suse.cz --- arch/x86/kernel/smpboot.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 8cb3113377a9..3e32d620f1fb 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1795,6 +1795,10 @@ void native_play_dead(void) * which would ignore the entire turbo range (a conspicuous part, making * freq_curr/freq_max always maxed out). * + * An exception to the heuristic above is the Atom uarch, where we choose the + * highest turbo level for freq_max since Atom's are generally oriented towards + * power efficiency. + * * Setting freq_max to anything less than the 1C turbo ratio makes the ratio * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1. */ @@ -1937,18 +1941,18 @@ static bool intel_set_max_freq_ratio(void) /* * TODO: add support for: * - * - Atom Goldmont * - Atom Silvermont */ u64 base_freq = 1, turbo_freq = 1; - if (x86_match_cpu(has_glm_turbo_ratio_limits)) - return false; - if (turbo_disabled()) goto out; + if (x86_match_cpu(has_glm_turbo_ratio_limits) && + skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1)) + goto out; + if (knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1)) goto out; -- cgit From 298c6f99bf30ef735e79f7f6d086bdfae505d380 Mon Sep 17 00:00:00 2001 From: Giovanni Gherdovich Date: Wed, 22 Jan 2020 16:16:16 +0100 Subject: x86, sched: Add support for frequency invariance on ATOM The scheduler needs the ratio freq_curr/freq_max for frequency-invariant accounting. On all ATOM CPUs prior to Goldmont, set freq_max to the 1-core turbo ratio. We intended to perform tests validating that this patch doesn't regress in terms of energy efficiency, given that this is the primary concern on Atom processors. Alas, we found out that turbostat doesn't support reading RAPL interfaces on our test machine (Airmont), and we don't have external equipment to measure power consumption; all we have is the performance results of the benchmarks we ran. Test machine: Platform : Dell Wyse 3040 Thin Client[1] CPU Model : Intel Atom x5-Z8350 (aka Cherry Trail, aka Airmont) Fam/Mod/Ste : 6:76:4 Topology : 1 socket, 4 cores / 4 threads Memory : 2G Storage : onboard flash, XFS filesystem [1] https://www.dell.com/en-us/work/shop/wyse-endpoints-and-software/wyse-3040-thin-client/spd/wyse-3040-thin-client Base frequency and available turbo levels (MHz): Min Operating Freq 266 |*** Low Freq Mode 800 |******** Base Freq 2400 |************************ 4 Cores 2800 |**************************** 3 Cores 2800 |**************************** 2 Cores 3200 |******************************** 1 Core 3200 |******************************** Tested kernels: Baseline : v5.4-rc1, intel_pstate passive, schedutil Comparison #1 : v5.4-rc1, intel_pstate active , powersave Comparison #2 : v5.4-rc1, this patch, intel_pstate passive, schedutil tbench, hackbench and kernbench performed the same under all three kernels; dbench ran faster with intel_pstate/powersave and the git unit tests were a lot faster with intel_pstate/powersave and invariant schedutil wrt the baseline. Not that any of this is terrbily interesting anyway, one doesn't buy an Atom system to go fast. Power consumption regressions aren't expected but we lack the equipment to make that measurement. Turbostat seems to think that reading RAPL on this machine isn't a good idea and we're trusting that decision. comparison ratio of performance with baseline; 1.00 means neutral, lower is better: I_PSTATE FREQ-INV ---------------------------------------- dbench 0.90 ~ kernbench 0.98 0.97 gitsource 0.63 0.43 Signed-off-by: Giovanni Gherdovich Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Acked-by: Rafael J. Wysocki Link: https://lkml.kernel.org/r/20200122151617.531-6-ggherdovich@suse.cz --- arch/x86/kernel/smpboot.c | 27 +++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 3e32d620f1fb..5f04bf8419f9 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1821,6 +1821,24 @@ static bool turbo_disabled(void) return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE); } +static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq) +{ + int err; + + err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq); + if (err) + return false; + + err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq); + if (err) + return false; + + *base_freq = (*base_freq >> 16) & 0x3F; /* max P state */ + *turbo_freq = *turbo_freq & 0x3F; /* 1C turbo */ + + return true; +} + #include #include @@ -1938,17 +1956,14 @@ static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq) static bool intel_set_max_freq_ratio(void) { - /* - * TODO: add support for: - * - * - Atom Silvermont - */ - u64 base_freq = 1, turbo_freq = 1; if (turbo_disabled()) goto out; + if (slv_set_max_freq_ratio(&base_freq, &turbo_freq)) + goto out; + if (x86_match_cpu(has_glm_turbo_ratio_limits) && skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1)) goto out; -- cgit From 918229cdd5abb50d8a2edfcd8dc6b6bc53afd765 Mon Sep 17 00:00:00 2001 From: Giovanni Gherdovich Date: Wed, 22 Jan 2020 16:16:17 +0100 Subject: x86/intel_pstate: Handle runtime turbo disablement/enablement in frequency invariance On some platforms such as the Dell XPS 13 laptop the firmware disables turbo when the machine is disconnected from AC, and viceversa it enables it again when it's reconnected. In these cases a _PPC ACPI notification is issued. The scheduler needs to know freq_max for frequency-invariant calculations. To account for turbo availability to come and go, record freq_max at boot as if turbo was available and store it in a helper variable. Use a setter function to swap between freq_base and freq_max every time turbo goes off or on. Signed-off-by: Giovanni Gherdovich Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Acked-by: Rafael J. Wysocki Link: https://lkml.kernel.org/r/20200122151617.531-7-ggherdovich@suse.cz --- arch/x86/include/asm/topology.h | 5 +++++ arch/x86/kernel/smpboot.c | 15 ++++++++++----- 2 files changed, 15 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 2ebf7b7b2126..79d8d5496330 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -211,6 +211,11 @@ static inline long arch_scale_freq_capacity(int cpu) extern void arch_scale_freq_tick(void); #define arch_scale_freq_tick arch_scale_freq_tick +extern void arch_set_max_freq_ratio(bool turbo_disabled); +#else +static inline void arch_set_max_freq_ratio(bool turbo_disabled) +{ +} #endif #endif /* _ASM_X86_TOPOLOGY_H */ diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 5f04bf8419f9..467191e51196 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1807,8 +1807,15 @@ DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key); static DEFINE_PER_CPU(u64, arch_prev_aperf); static DEFINE_PER_CPU(u64, arch_prev_mperf); +static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE; static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE; +void arch_set_max_freq_ratio(bool turbo_disabled) +{ + arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE : + arch_turbo_freq_ratio; +} + static bool turbo_disabled(void) { u64 misc_en; @@ -1956,10 +1963,7 @@ static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq) static bool intel_set_max_freq_ratio(void) { - u64 base_freq = 1, turbo_freq = 1; - - if (turbo_disabled()) - goto out; + u64 base_freq, turbo_freq; if (slv_set_max_freq_ratio(&base_freq, &turbo_freq)) goto out; @@ -1981,8 +1985,9 @@ static bool intel_set_max_freq_ratio(void) return false; out: - arch_max_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, + arch_turbo_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq); + arch_set_max_freq_ratio(turbo_disabled()); return true; } -- cgit From 6c1c07b33eb093e5a2a313ece89baa596ba6135e Mon Sep 17 00:00:00 2001 From: Kan Liang Date: Tue, 21 Jan 2020 10:13:38 -0800 Subject: perf/x86/intel: Avoid unnecessary PEBS_ENABLE MSR access in PMI The perf PMI handler, intel_pmu_handle_irq(), currently does unnecessary MSR accesses for PEBS_ENABLE MSR in __intel_pmu_enable/disable_all() when PEBS is enabled. When entering the handler, global ctrl is explicitly disabled. All counters do not count anymore. It doesn't matter if PEBS is enabled or not in a PMI handler. Furthermore, for most cases, the cpuc->pebs_enabled is not changed in PMI. The PEBS status doesn't change. The PEBS_ENABLE MSR doesn't need to be changed either when exiting the handler. PMI throttle may change the PEBS status during PMI handler. The x86_pmu_stop() ends up in intel_pmu_pebs_disable() which can update cpuc->pebs_enabled. But the MSR_IA32_PEBS_ENABLE is not updated at the same time. Because the cpuc->enabled has been forced to 0. The patch explicitly update the MSR_IA32_PEBS_ENABLE for this case. Use ftrace to measure the duration of intel_pmu_handle_irq() on BDX. #perf record -e cycles:P -- ./tchain_edit The average duration of intel_pmu_handle_irq(): Without the patch 1.144 us With the patch 1.025 us Signed-off-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200121181338.3234-1-kan.liang@linux.intel.com --- arch/x86/events/intel/core.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index dff6623804c2..332954cccece 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -1945,6 +1945,14 @@ static __initconst const u64 knl_hw_cache_extra_regs * intel_bts events don't coexist with intel PMU's BTS events because of * x86_add_exclusive(x86_lbr_exclusive_lbr); there's no need to keep them * disabled around intel PMU's event batching etc, only inside the PMI handler. + * + * Avoid PEBS_ENABLE MSR access in PMIs. + * The GLOBAL_CTRL has been disabled. All the counters do not count anymore. + * It doesn't matter if the PEBS is enabled or not. + * Usually, the PEBS status are not changed in PMIs. It's unnecessary to + * access PEBS_ENABLE MSR in disable_all()/enable_all(). + * However, there are some cases which may change PEBS status, e.g. PMI + * throttle. The PEBS_ENABLE should be updated where the status changes. */ static void __intel_pmu_disable_all(void) { @@ -1954,13 +1962,12 @@ static void __intel_pmu_disable_all(void) if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask)) intel_pmu_disable_bts(); - - intel_pmu_pebs_disable_all(); } static void intel_pmu_disable_all(void) { __intel_pmu_disable_all(); + intel_pmu_pebs_disable_all(); intel_pmu_lbr_disable_all(); } @@ -1968,7 +1975,6 @@ static void __intel_pmu_enable_all(int added, bool pmi) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); - intel_pmu_pebs_enable_all(); intel_pmu_lbr_enable_all(pmi); wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask); @@ -1986,6 +1992,7 @@ static void __intel_pmu_enable_all(int added, bool pmi) static void intel_pmu_enable_all(int added) { + intel_pmu_pebs_enable_all(); __intel_pmu_enable_all(added, false); } @@ -2374,9 +2381,21 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status) * PEBS overflow sets bit 62 in the global status register */ if (__test_and_clear_bit(62, (unsigned long *)&status)) { + u64 pebs_enabled = cpuc->pebs_enabled; + handled++; x86_pmu.drain_pebs(regs); status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI; + + /* + * PMI throttle may be triggered, which stops the PEBS event. + * Although cpuc->pebs_enabled is updated accordingly, the + * MSR_IA32_PEBS_ENABLE is not updated. Because the + * cpuc->enabled has been forced to 0 in PMI. + * Update the MSR if pebs_enabled is changed. + */ + if (pebs_enabled != cpuc->pebs_enabled) + wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled); } /* -- cgit From bbfd5e4fab63703375eafaf241a0c696024a59e1 Mon Sep 17 00:00:00 2001 From: Kan Liang Date: Mon, 27 Jan 2020 08:53:54 -0800 Subject: perf/core: Add new branch sample type for HW index of raw branch records The low level index is the index in the underlying hardware buffer of the most recently captured taken branch which is always saved in branch_entries[0]. It is very useful for reconstructing the call stack. For example, in Intel LBR call stack mode, the depth of reconstructed LBR call stack limits to the number of LBR registers. With the low level index information, perf tool may stitch the stacks of two samples. The reconstructed LBR call stack can break the HW limitation. Add a new branch sample type to retrieve low level index of raw branch records. The low level index is between -1 (unknown) and max depth which can be retrieved in /sys/devices/cpu/caps/branches. Only when the new branch sample type is set, the low level index information is dumped into the PERF_SAMPLE_BRANCH_STACK output. Perf tool should check the attr.branch_sample_type, and apply the corresponding format for PERF_SAMPLE_BRANCH_STACK samples. Otherwise, some user case may be broken. For example, users may parse a perf.data, which include the new branch sample type, with an old version perf tool (without the check). Users probably get incorrect information without any warning. Signed-off-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200127165355.27495-2-kan.liang@linux.intel.com --- arch/x86/events/intel/lbr.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 534c76606049..7639e2097101 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -585,6 +585,7 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc) cpuc->lbr_entries[i].reserved = 0; } cpuc->lbr_stack.nr = i; + cpuc->lbr_stack.hw_idx = -1ULL; } /* @@ -680,6 +681,7 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc) out++; } cpuc->lbr_stack.nr = out; + cpuc->lbr_stack.hw_idx = -1ULL; } void intel_pmu_lbr_read(void) @@ -1120,6 +1122,7 @@ void intel_pmu_store_pebs_lbrs(struct pebs_lbr *lbr) int i; cpuc->lbr_stack.nr = x86_pmu.lbr_nr; + cpuc->lbr_stack.hw_idx = -1ULL; for (i = 0; i < x86_pmu.lbr_nr; i++) { u64 info = lbr->lbr[i].info; struct perf_branch_entry *e = &cpuc->lbr_entries[i]; -- cgit From db278b90c326ce5895be09b6171f5ff3df1e3cca Mon Sep 17 00:00:00 2001 From: Kan Liang Date: Mon, 27 Jan 2020 08:53:55 -0800 Subject: perf/x86/intel: Output LBR TOS information correctly For Intel LBR, the LBR Top-of-Stack (TOS) information is the HW index of raw branch record for the most recent branch. For non-adaptive PEBS and non-PEBS, the TOS information can be directly retrieved from TOS MSR read in intel_pmu_lbr_read(). For adaptive PEBS, the LBR information stored in PEBS record doesn't include the TOS information. For single PEBS, TOS can be directly read from MSR, because the PMI is triggered immediately after PEBS is written. TOS MSR is still unchanged. For large PEBS, TOS MSR has stale value. Set -1ULL to indicate that the TOS information is not available. Signed-off-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200127165355.27495-3-kan.liang@linux.intel.com --- arch/x86/events/intel/lbr.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 7639e2097101..65113b16804a 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -585,7 +585,7 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc) cpuc->lbr_entries[i].reserved = 0; } cpuc->lbr_stack.nr = i; - cpuc->lbr_stack.hw_idx = -1ULL; + cpuc->lbr_stack.hw_idx = tos; } /* @@ -681,7 +681,7 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc) out++; } cpuc->lbr_stack.nr = out; - cpuc->lbr_stack.hw_idx = -1ULL; + cpuc->lbr_stack.hw_idx = tos; } void intel_pmu_lbr_read(void) @@ -1122,7 +1122,13 @@ void intel_pmu_store_pebs_lbrs(struct pebs_lbr *lbr) int i; cpuc->lbr_stack.nr = x86_pmu.lbr_nr; - cpuc->lbr_stack.hw_idx = -1ULL; + + /* Cannot get TOS for large PEBS */ + if (cpuc->n_pebs == cpuc->n_large_pebs) + cpuc->lbr_stack.hw_idx = -1ULL; + else + cpuc->lbr_stack.hw_idx = intel_pmu_lbr_tos(); + for (i = 0; i < x86_pmu.lbr_nr; i++) { u64 info = lbr->lbr[i].info; struct perf_branch_entry *e = &cpuc->lbr_entries[i]; -- cgit From fdb64822443ec9fb8c3a74b598a74790ae8d2e22 Mon Sep 17 00:00:00 2001 From: Kan Liang Date: Thu, 6 Feb 2020 08:15:27 -0800 Subject: perf/x86: Add Intel Tiger Lake uncore support For MSR type of uncore units, there is no difference between Ice Lake and Tiger Lake. Share the same code with Ice Lake. Tiger Lake has two MCs. Both of them are located at 0:0:0. The BAR offset is still 0x48. The offset of the two MCs is 0x10000. Each MC has three counters to count every read/write/total issued by the Memory Controller to DRAM. The counters can be accessed by MMIO. They are free-running counters. The offset of counters are different for TIGERLAKE_L and TIGERLAKE. Add separated mmio_init() functions. Signed-off-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Reviewed-by: Andi Kleen Link: https://lkml.kernel.org/r/20200206161527.3529-1-kan.liang@linux.intel.com --- arch/x86/events/intel/uncore.c | 12 +++ arch/x86/events/intel/uncore.h | 2 + arch/x86/events/intel/uncore_snb.c | 159 +++++++++++++++++++++++++++++++++++++ 3 files changed, 173 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index 86467f85c383..63922e3a34f5 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -1470,6 +1470,16 @@ static const struct intel_uncore_init_fun icl_uncore_init __initconst = { .pci_init = skl_uncore_pci_init, }; +static const struct intel_uncore_init_fun tgl_uncore_init __initconst = { + .cpu_init = icl_uncore_cpu_init, + .mmio_init = tgl_uncore_mmio_init, +}; + +static const struct intel_uncore_init_fun tgl_l_uncore_init __initconst = { + .cpu_init = icl_uncore_cpu_init, + .mmio_init = tgl_l_uncore_mmio_init, +}; + static const struct intel_uncore_init_fun snr_uncore_init __initconst = { .cpu_init = snr_uncore_cpu_init, .pci_init = snr_uncore_pci_init, @@ -1505,6 +1515,8 @@ static const struct x86_cpu_id intel_uncore_match[] __initconst = { X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ICELAKE_L, icl_uncore_init), X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ICELAKE_NNPI, icl_uncore_init), X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ICELAKE, icl_uncore_init), + X86_UNCORE_MODEL_MATCH(INTEL_FAM6_TIGERLAKE_L, tgl_l_uncore_init), + X86_UNCORE_MODEL_MATCH(INTEL_FAM6_TIGERLAKE, tgl_uncore_init), X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ATOM_TREMONT_D, snr_uncore_init), {}, }; diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h index bbfdaa720b45..1204dcc9fe9b 100644 --- a/arch/x86/events/intel/uncore.h +++ b/arch/x86/events/intel/uncore.h @@ -527,6 +527,8 @@ void snb_uncore_cpu_init(void); void nhm_uncore_cpu_init(void); void skl_uncore_cpu_init(void); void icl_uncore_cpu_init(void); +void tgl_uncore_mmio_init(void); +void tgl_l_uncore_mmio_init(void); int snb_pci2phy_map_init(int devid); /* uncore_snbep.c */ diff --git a/arch/x86/events/intel/uncore_snb.c b/arch/x86/events/intel/uncore_snb.c index c37cb12d0ef6..3de1065eefc4 100644 --- a/arch/x86/events/intel/uncore_snb.c +++ b/arch/x86/events/intel/uncore_snb.c @@ -44,6 +44,11 @@ #define PCI_DEVICE_ID_INTEL_WHL_UD_IMC 0x3e35 #define PCI_DEVICE_ID_INTEL_ICL_U_IMC 0x8a02 #define PCI_DEVICE_ID_INTEL_ICL_U2_IMC 0x8a12 +#define PCI_DEVICE_ID_INTEL_TGL_U1_IMC 0x9a02 +#define PCI_DEVICE_ID_INTEL_TGL_U2_IMC 0x9a04 +#define PCI_DEVICE_ID_INTEL_TGL_U3_IMC 0x9a12 +#define PCI_DEVICE_ID_INTEL_TGL_U4_IMC 0x9a14 +#define PCI_DEVICE_ID_INTEL_TGL_H_IMC 0x9a36 /* SNB event control */ @@ -1002,3 +1007,157 @@ void nhm_uncore_cpu_init(void) } /* end of Nehalem uncore support */ + +/* Tiger Lake MMIO uncore support */ + +static const struct pci_device_id tgl_uncore_pci_ids[] = { + { /* IMC */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_TGL_U1_IMC), + .driver_data = UNCORE_PCI_DEV_DATA(SNB_PCI_UNCORE_IMC, 0), + }, + { /* IMC */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_TGL_U2_IMC), + .driver_data = UNCORE_PCI_DEV_DATA(SNB_PCI_UNCORE_IMC, 0), + }, + { /* IMC */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_TGL_U3_IMC), + .driver_data = UNCORE_PCI_DEV_DATA(SNB_PCI_UNCORE_IMC, 0), + }, + { /* IMC */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_TGL_U4_IMC), + .driver_data = UNCORE_PCI_DEV_DATA(SNB_PCI_UNCORE_IMC, 0), + }, + { /* IMC */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_TGL_H_IMC), + .driver_data = UNCORE_PCI_DEV_DATA(SNB_PCI_UNCORE_IMC, 0), + }, + { /* end: all zeroes */ } +}; + +enum perf_tgl_uncore_imc_freerunning_types { + TGL_MMIO_UNCORE_IMC_DATA_TOTAL, + TGL_MMIO_UNCORE_IMC_DATA_READ, + TGL_MMIO_UNCORE_IMC_DATA_WRITE, + TGL_MMIO_UNCORE_IMC_FREERUNNING_TYPE_MAX +}; + +static struct freerunning_counters tgl_l_uncore_imc_freerunning[] = { + [TGL_MMIO_UNCORE_IMC_DATA_TOTAL] = { 0x5040, 0x0, 0x0, 1, 64 }, + [TGL_MMIO_UNCORE_IMC_DATA_READ] = { 0x5058, 0x0, 0x0, 1, 64 }, + [TGL_MMIO_UNCORE_IMC_DATA_WRITE] = { 0x50A0, 0x0, 0x0, 1, 64 }, +}; + +static struct freerunning_counters tgl_uncore_imc_freerunning[] = { + [TGL_MMIO_UNCORE_IMC_DATA_TOTAL] = { 0xd840, 0x0, 0x0, 1, 64 }, + [TGL_MMIO_UNCORE_IMC_DATA_READ] = { 0xd858, 0x0, 0x0, 1, 64 }, + [TGL_MMIO_UNCORE_IMC_DATA_WRITE] = { 0xd8A0, 0x0, 0x0, 1, 64 }, +}; + +static struct uncore_event_desc tgl_uncore_imc_events[] = { + INTEL_UNCORE_EVENT_DESC(data_total, "event=0xff,umask=0x10"), + INTEL_UNCORE_EVENT_DESC(data_total.scale, "6.103515625e-5"), + INTEL_UNCORE_EVENT_DESC(data_total.unit, "MiB"), + + INTEL_UNCORE_EVENT_DESC(data_read, "event=0xff,umask=0x20"), + INTEL_UNCORE_EVENT_DESC(data_read.scale, "6.103515625e-5"), + INTEL_UNCORE_EVENT_DESC(data_read.unit, "MiB"), + + INTEL_UNCORE_EVENT_DESC(data_write, "event=0xff,umask=0x30"), + INTEL_UNCORE_EVENT_DESC(data_write.scale, "6.103515625e-5"), + INTEL_UNCORE_EVENT_DESC(data_write.unit, "MiB"), + + { /* end: all zeroes */ } +}; + +static struct pci_dev *tgl_uncore_get_mc_dev(void) +{ + const struct pci_device_id *ids = tgl_uncore_pci_ids; + struct pci_dev *mc_dev = NULL; + + while (ids && ids->vendor) { + mc_dev = pci_get_device(PCI_VENDOR_ID_INTEL, ids->device, NULL); + if (mc_dev) + return mc_dev; + ids++; + } + + return mc_dev; +} + +#define TGL_UNCORE_MMIO_IMC_MEM_OFFSET 0x10000 + +static void tgl_uncore_imc_freerunning_init_box(struct intel_uncore_box *box) +{ + struct pci_dev *pdev = tgl_uncore_get_mc_dev(); + struct intel_uncore_pmu *pmu = box->pmu; + resource_size_t addr; + u32 mch_bar; + + if (!pdev) { + pr_warn("perf uncore: Cannot find matched IMC device.\n"); + return; + } + + pci_read_config_dword(pdev, SNB_UNCORE_PCI_IMC_BAR_OFFSET, &mch_bar); + /* MCHBAR is disabled */ + if (!(mch_bar & BIT(0))) { + pr_warn("perf uncore: MCHBAR is disabled. Failed to map IMC free-running counters.\n"); + return; + } + mch_bar &= ~BIT(0); + addr = (resource_size_t)(mch_bar + TGL_UNCORE_MMIO_IMC_MEM_OFFSET * pmu->pmu_idx); + +#ifdef CONFIG_PHYS_ADDR_T_64BIT + pci_read_config_dword(pdev, SNB_UNCORE_PCI_IMC_BAR_OFFSET + 4, &mch_bar); + addr |= ((resource_size_t)mch_bar << 32); +#endif + + box->io_addr = ioremap(addr, SNB_UNCORE_PCI_IMC_MAP_SIZE); +} + +static struct intel_uncore_ops tgl_uncore_imc_freerunning_ops = { + .init_box = tgl_uncore_imc_freerunning_init_box, + .exit_box = uncore_mmio_exit_box, + .read_counter = uncore_mmio_read_counter, + .hw_config = uncore_freerunning_hw_config, +}; + +static struct attribute *tgl_uncore_imc_formats_attr[] = { + &format_attr_event.attr, + &format_attr_umask.attr, + NULL +}; + +static const struct attribute_group tgl_uncore_imc_format_group = { + .name = "format", + .attrs = tgl_uncore_imc_formats_attr, +}; + +static struct intel_uncore_type tgl_uncore_imc_free_running = { + .name = "imc_free_running", + .num_counters = 3, + .num_boxes = 2, + .num_freerunning_types = TGL_MMIO_UNCORE_IMC_FREERUNNING_TYPE_MAX, + .freerunning = tgl_uncore_imc_freerunning, + .ops = &tgl_uncore_imc_freerunning_ops, + .event_descs = tgl_uncore_imc_events, + .format_group = &tgl_uncore_imc_format_group, +}; + +static struct intel_uncore_type *tgl_mmio_uncores[] = { + &tgl_uncore_imc_free_running, + NULL +}; + +void tgl_l_uncore_mmio_init(void) +{ + tgl_uncore_imc_free_running.freerunning = tgl_l_uncore_imc_freerunning; + uncore_mmio_uncores = tgl_mmio_uncores; +} + +void tgl_uncore_mmio_init(void) +{ + uncore_mmio_uncores = tgl_mmio_uncores; +} + +/* end of Tiger Lake MMIO uncore support */ -- cgit From c12e13dcd814023a903399ec5ac2e7082d664b8b Mon Sep 17 00:00:00 2001 From: Yu-cheng Yu Date: Thu, 9 Jan 2020 13:14:50 -0800 Subject: x86/fpu/xstate: Fix last_good_offset in setup_xstate_features() The function setup_xstate_features() uses CPUID to find each xfeature's standard-format offset and size. Since XSAVES always uses the compacted format, supervisor xstates are *NEVER* in the standard-format and their offsets are left as -1's. However, they are still being tracked as last_good_offset. Fix it by tracking only user xstate offsets. [ bp: Use xfeature_is_supervisor() and save an indentation level. Drop now unused xfeature_is_user(). ] Signed-off-by: Yu-cheng Yu Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Link: https://lkml.kernel.org/r/20200109211452.27369-2-yu-cheng.yu@intel.com --- arch/x86/kernel/fpu/xstate.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index a1806598aaa4..fe67cabfb4a1 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -120,11 +120,6 @@ static bool xfeature_is_supervisor(int xfeature_nr) return ecx & 1; } -static bool xfeature_is_user(int xfeature_nr) -{ - return !xfeature_is_supervisor(xfeature_nr); -} - /* * When executing XSAVEOPT (or other optimized XSAVE instructions), if * a processor implementation detects that an FPU state component is still @@ -265,21 +260,25 @@ static void __init setup_xstate_features(void) cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx); + xstate_sizes[i] = eax; + /* - * If an xfeature is supervisor state, the offset - * in EBX is invalid. We leave it to -1. + * If an xfeature is supervisor state, the offset in EBX is + * invalid, leave it to -1. */ - if (xfeature_is_user(i)) - xstate_offsets[i] = ebx; + if (xfeature_is_supervisor(i)) + continue; + + xstate_offsets[i] = ebx; - xstate_sizes[i] = eax; /* - * In our xstate size checks, we assume that the - * highest-numbered xstate feature has the - * highest offset in the buffer. Ensure it does. + * In our xstate size checks, we assume that the highest-numbered + * xstate feature has the highest offset in the buffer. Ensure + * it does. */ WARN_ONCE(last_good_offset > xstate_offsets[i], - "x86/fpu: misordered xstate at %d\n", last_good_offset); + "x86/fpu: misordered xstate at %d\n", last_good_offset); + last_good_offset = xstate_offsets[i]; } } -- cgit From 48bfdb9deffdc6b683feb25e15f4f26aac503501 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Tue, 7 Jan 2020 14:44:35 -0500 Subject: x86/boot/compressed/64: Use LEA to initialize boot stack pointer It's shorter, and it's what is used in every other place, so make it consistent. Signed-off-by: Arvind Sankar Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200107194436.2166846-2-nivedita@alum.mit.edu --- arch/x86/boot/compressed/head_64.S | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index 1f1f6c8139b3..d1220de1de52 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -81,9 +81,7 @@ SYM_FUNC_START(startup_32) subl $1b, %ebp /* setup a stack and make sure cpu supports long mode. */ - movl $boot_stack_end, %eax - addl %ebp, %eax - movl %eax, %esp + leal boot_stack_end(%ebp), %esp call verify_cpu testl %eax, %eax -- cgit From a86255fe5258714e1f7c1bdfe95f08e4d098d450 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Tue, 11 Feb 2020 12:33:33 -0500 Subject: x86/boot/compressed/64: Use 32-bit (zero-extended) MOV for z_output_len z_output_len is the size of the decompressed payload (i.e. vmlinux + vmlinux.relocs) and is generated as an unsigned 32-bit quantity by mkpiggy.c. The current movq $z_output_len, %r9 instruction generates a sign-extended move to %r9. Using movl $z_output_len, %r9d will instead zero-extend into %r9, which is appropriate for an unsigned 32-bit quantity. This is also what is already done for z_input_len, the size of the compressed payload. [ bp: Also, z_output_len cannot be a 64-bit quantity because it participates in: init_size: .long INIT_SIZE # kernel initialization size through INIT_SIZE which is a 32-bit quantity determined by the .long directive (vs .quad for 64-bit). Furthermore, if it really must be a 64-bit quantity, then the insn must be MOVABS which can accommodate a 64-bit immediate and which the toolchain does not generate automatically. ] Signed-off-by: Arvind Sankar Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200211173333.1722739-1-nivedita@alum.mit.edu --- arch/x86/boot/compressed/head_64.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index d1220de1de52..68f31c48d6c2 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -482,7 +482,7 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated) leaq input_data(%rip), %rdx /* input_data */ movl $z_input_len, %ecx /* input_len */ movq %rbp, %r8 /* output target address */ - movq $z_output_len, %r9 /* decompressed length, end of relocs */ + movl $z_output_len, %r9d /* decompressed length, end of relocs */ call extract_kernel /* returns kernel location in %rax */ popq %rsi -- cgit From 49a91d61aed1db01097b51a24c77137eb348a0bf Mon Sep 17 00:00:00 2001 From: Yu-cheng Yu Date: Thu, 9 Jan 2020 13:14:51 -0800 Subject: x86/fpu/xstate: Fix XSAVES offsets in setup_xstate_comp() In setup_xstate_comp(), each XSAVES component offset starts from the end of its preceding component plus alignment. A disabled feature does not take space and its offset should be set to the end of its preceding one with no alignment. However, in this case, alignment is incorrectly added to the offset, which can cause the next component to have a wrong offset. This problem has not been visible because currently there is no xfeature requiring alignment. Fix it by tracking the next starting offset only from enabled xfeatures. To make it clear, also change the function name to setup_xstate_comp_offsets(). [ bp: Fix a typo in the comment above it, while at it. ] Signed-off-by: Yu-cheng Yu Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Link: https://lkml.kernel.org/r/20200109211452.27369-3-yu-cheng.yu@intel.com --- arch/x86/kernel/fpu/xstate.c | 32 ++++++++++++-------------------- 1 file changed, 12 insertions(+), 20 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index fe67cabfb4a1..edcaacd583d8 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -337,11 +337,11 @@ static int xfeature_is_aligned(int xfeature_nr) /* * This function sets up offsets and sizes of all extended states in * xsave area. This supports both standard format and compacted format - * of the xsave aread. + * of the xsave area. */ -static void __init setup_xstate_comp(void) +static void __init setup_xstate_comp_offsets(void) { - unsigned int xstate_comp_sizes[XFEATURE_MAX]; + unsigned int next_offset; int i; /* @@ -355,31 +355,23 @@ static void __init setup_xstate_comp(void) if (!boot_cpu_has(X86_FEATURE_XSAVES)) { for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) { - if (xfeature_enabled(i)) { + if (xfeature_enabled(i)) xstate_comp_offsets[i] = xstate_offsets[i]; - xstate_comp_sizes[i] = xstate_sizes[i]; - } } return; } - xstate_comp_offsets[FIRST_EXTENDED_XFEATURE] = - FXSAVE_SIZE + XSAVE_HDR_SIZE; + next_offset = FXSAVE_SIZE + XSAVE_HDR_SIZE; for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) { - if (xfeature_enabled(i)) - xstate_comp_sizes[i] = xstate_sizes[i]; - else - xstate_comp_sizes[i] = 0; + if (!xfeature_enabled(i)) + continue; - if (i > FIRST_EXTENDED_XFEATURE) { - xstate_comp_offsets[i] = xstate_comp_offsets[i-1] - + xstate_comp_sizes[i-1]; + if (xfeature_is_aligned(i)) + next_offset = ALIGN(next_offset, 64); - if (xfeature_is_aligned(i)) - xstate_comp_offsets[i] = - ALIGN(xstate_comp_offsets[i], 64); - } + xstate_comp_offsets[i] = next_offset; + next_offset += xstate_sizes[i]; } } @@ -773,7 +765,7 @@ void __init fpu__init_system_xstate(void) fpu__init_prepare_fx_sw_frame(); setup_init_fpu_buf(); - setup_xstate_comp(); + setup_xstate_comp_offsets(); print_xstate_offset_size(); pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n", -- cgit From e70b100806d63fb79775858ea92e1a716da46186 Mon Sep 17 00:00:00 2001 From: Yu-cheng Yu Date: Thu, 9 Jan 2020 13:14:52 -0800 Subject: x86/fpu/xstate: Warn when checking alignment of disabled xfeatures An XSAVES component's alignment/offset is meaningful only when the feature is enabled. Return zero and WARN_ONCE on checking alignment of disabled features. Signed-off-by: Yu-cheng Yu Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Link: https://lkml.kernel.org/r/20200109211452.27369-4-yu-cheng.yu@intel.com --- arch/x86/kernel/fpu/xstate.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index edcaacd583d8..73fe5979629c 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -325,6 +325,13 @@ static int xfeature_is_aligned(int xfeature_nr) u32 eax, ebx, ecx, edx; CHECK_XFEATURE(xfeature_nr); + + if (!xfeature_enabled(xfeature_nr)) { + WARN_ONCE(1, "Checking alignment of disabled xfeature %d\n", + xfeature_nr); + return 0; + } + cpuid_count(XSTATE_CPUID, xfeature_nr, &eax, &ebx, &ecx, &edx); /* * The value returned by ECX[1] indicates the alignment -- cgit From 07b586fe06625b0b610dc3d3a969c51913d143d4 Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Mon, 20 Jan 2020 18:18:15 +0100 Subject: crypto: x86/curve25519 - replace with formally verified implementation This comes from INRIA's HACL*/Vale. It implements the same algorithm and implementation strategy as the code it replaces, only this code has been formally verified, sans the base point multiplication, which uses code similar to prior, only it uses the formally verified field arithmetic alongside reproducable ladder generation steps. This doesn't have a pure-bmi2 version, which means haswell no longer benefits, but the increased (doubled) code complexity is not worth it for a single generation of chips that's already old. Performance-wise, this is around 1% slower on older microarchitectures, and slightly faster on newer microarchitectures, mainly 10nm ones or backports of 10nm to 14nm. This implementation is "everest" below: Xeon E5-2680 v4 (Broadwell) armfazh: 133340 cycles per call everest: 133436 cycles per call Xeon Gold 5120 (Sky Lake Server) armfazh: 112636 cycles per call everest: 113906 cycles per call Core i5-6300U (Sky Lake Client) armfazh: 116810 cycles per call everest: 117916 cycles per call Core i7-7600U (Kaby Lake) armfazh: 119523 cycles per call everest: 119040 cycles per call Core i7-8750H (Coffee Lake) armfazh: 113914 cycles per call everest: 113650 cycles per call Core i9-9880H (Coffee Lake Refresh) armfazh: 112616 cycles per call everest: 114082 cycles per call Core i3-8121U (Cannon Lake) armfazh: 113202 cycles per call everest: 111382 cycles per call Core i7-8265U (Whiskey Lake) armfazh: 127307 cycles per call everest: 127697 cycles per call Core i7-8550U (Kaby Lake Refresh) armfazh: 127522 cycles per call everest: 127083 cycles per call Xeon Platinum 8275CL (Cascade Lake) armfazh: 114380 cycles per call everest: 114656 cycles per call Achieving these kind of results with formally verified code is quite remarkable, especialy considering that performance is favorable for newer chips. Signed-off-by: Jason A. Donenfeld Signed-off-by: Herbert Xu --- arch/x86/crypto/curve25519-x86_64.c | 3546 +++++++++++++---------------------- 1 file changed, 1292 insertions(+), 2254 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/crypto/curve25519-x86_64.c b/arch/x86/crypto/curve25519-x86_64.c index eec7d2d24239..e4e58b8e9afe 100644 --- a/arch/x86/crypto/curve25519-x86_64.c +++ b/arch/x86/crypto/curve25519-x86_64.c @@ -1,8 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause +// SPDX-License-Identifier: GPL-2.0 OR MIT /* - * Copyright (c) 2017 Armando Faz . All Rights Reserved. - * Copyright (C) 2018-2019 Jason A. Donenfeld . All Rights Reserved. - * Copyright (C) 2018 Samuel Neves . All Rights Reserved. + * Copyright (C) 2020 Jason A. Donenfeld . All Rights Reserved. + * Copyright (c) 2016-2020 INRIA, CMU and Microsoft Corporation */ #include @@ -16,2337 +15,1378 @@ #include #include -static __ro_after_init DEFINE_STATIC_KEY_FALSE(curve25519_use_bmi2); -static __ro_after_init DEFINE_STATIC_KEY_FALSE(curve25519_use_adx); - -enum { NUM_WORDS_ELTFP25519 = 4 }; -typedef __aligned(32) u64 eltfp25519_1w[NUM_WORDS_ELTFP25519]; -typedef __aligned(32) u64 eltfp25519_1w_buffer[2 * NUM_WORDS_ELTFP25519]; - -#define mul_eltfp25519_1w_adx(c, a, b) do { \ - mul_256x256_integer_adx(m.buffer, a, b); \ - red_eltfp25519_1w_adx(c, m.buffer); \ -} while (0) - -#define mul_eltfp25519_1w_bmi2(c, a, b) do { \ - mul_256x256_integer_bmi2(m.buffer, a, b); \ - red_eltfp25519_1w_bmi2(c, m.buffer); \ -} while (0) - -#define sqr_eltfp25519_1w_adx(a) do { \ - sqr_256x256_integer_adx(m.buffer, a); \ - red_eltfp25519_1w_adx(a, m.buffer); \ -} while (0) - -#define sqr_eltfp25519_1w_bmi2(a) do { \ - sqr_256x256_integer_bmi2(m.buffer, a); \ - red_eltfp25519_1w_bmi2(a, m.buffer); \ -} while (0) - -#define mul_eltfp25519_2w_adx(c, a, b) do { \ - mul2_256x256_integer_adx(m.buffer, a, b); \ - red_eltfp25519_2w_adx(c, m.buffer); \ -} while (0) - -#define mul_eltfp25519_2w_bmi2(c, a, b) do { \ - mul2_256x256_integer_bmi2(m.buffer, a, b); \ - red_eltfp25519_2w_bmi2(c, m.buffer); \ -} while (0) - -#define sqr_eltfp25519_2w_adx(a) do { \ - sqr2_256x256_integer_adx(m.buffer, a); \ - red_eltfp25519_2w_adx(a, m.buffer); \ -} while (0) - -#define sqr_eltfp25519_2w_bmi2(a) do { \ - sqr2_256x256_integer_bmi2(m.buffer, a); \ - red_eltfp25519_2w_bmi2(a, m.buffer); \ -} while (0) - -#define sqrn_eltfp25519_1w_adx(a, times) do { \ - int ____counter = (times); \ - while (____counter-- > 0) \ - sqr_eltfp25519_1w_adx(a); \ -} while (0) - -#define sqrn_eltfp25519_1w_bmi2(a, times) do { \ - int ____counter = (times); \ - while (____counter-- > 0) \ - sqr_eltfp25519_1w_bmi2(a); \ -} while (0) - -#define copy_eltfp25519_1w(C, A) do { \ - (C)[0] = (A)[0]; \ - (C)[1] = (A)[1]; \ - (C)[2] = (A)[2]; \ - (C)[3] = (A)[3]; \ -} while (0) - -#define setzero_eltfp25519_1w(C) do { \ - (C)[0] = 0; \ - (C)[1] = 0; \ - (C)[2] = 0; \ - (C)[3] = 0; \ -} while (0) - -__aligned(32) static const u64 table_ladder_8k[252 * NUM_WORDS_ELTFP25519] = { - /* 1 */ 0xfffffffffffffff3UL, 0xffffffffffffffffUL, - 0xffffffffffffffffUL, 0x5fffffffffffffffUL, - /* 2 */ 0x6b8220f416aafe96UL, 0x82ebeb2b4f566a34UL, - 0xd5a9a5b075a5950fUL, 0x5142b2cf4b2488f4UL, - /* 3 */ 0x6aaebc750069680cUL, 0x89cf7820a0f99c41UL, - 0x2a58d9183b56d0f4UL, 0x4b5aca80e36011a4UL, - /* 4 */ 0x329132348c29745dUL, 0xf4a2e616e1642fd7UL, - 0x1e45bb03ff67bc34UL, 0x306912d0f42a9b4aUL, - /* 5 */ 0xff886507e6af7154UL, 0x04f50e13dfeec82fUL, - 0xaa512fe82abab5ceUL, 0x174e251a68d5f222UL, - /* 6 */ 0xcf96700d82028898UL, 0x1743e3370a2c02c5UL, - 0x379eec98b4e86eaaUL, 0x0c59888a51e0482eUL, - /* 7 */ 0xfbcbf1d699b5d189UL, 0xacaef0d58e9fdc84UL, - 0xc1c20d06231f7614UL, 0x2938218da274f972UL, - /* 8 */ 0xf6af49beff1d7f18UL, 0xcc541c22387ac9c2UL, - 0x96fcc9ef4015c56bUL, 0x69c1627c690913a9UL, - /* 9 */ 0x7a86fd2f4733db0eUL, 0xfdb8c4f29e087de9UL, - 0x095e4b1a8ea2a229UL, 0x1ad7a7c829b37a79UL, - /* 10 */ 0x342d89cad17ea0c0UL, 0x67bedda6cced2051UL, - 0x19ca31bf2bb42f74UL, 0x3df7b4c84980acbbUL, - /* 11 */ 0xa8c6444dc80ad883UL, 0xb91e440366e3ab85UL, - 0xc215cda00164f6d8UL, 0x3d867c6ef247e668UL, - /* 12 */ 0xc7dd582bcc3e658cUL, 0xfd2c4748ee0e5528UL, - 0xa0fd9b95cc9f4f71UL, 0x7529d871b0675ddfUL, - /* 13 */ 0xb8f568b42d3cbd78UL, 0x1233011b91f3da82UL, - 0x2dce6ccd4a7c3b62UL, 0x75e7fc8e9e498603UL, - /* 14 */ 0x2f4f13f1fcd0b6ecUL, 0xf1a8ca1f29ff7a45UL, - 0xc249c1a72981e29bUL, 0x6ebe0dbb8c83b56aUL, - /* 15 */ 0x7114fa8d170bb222UL, 0x65a2dcd5bf93935fUL, - 0xbdc41f68b59c979aUL, 0x2f0eef79a2ce9289UL, - /* 16 */ 0x42ecbf0c083c37ceUL, 0x2930bc09ec496322UL, - 0xf294b0c19cfeac0dUL, 0x3780aa4bedfabb80UL, - /* 17 */ 0x56c17d3e7cead929UL, 0xe7cb4beb2e5722c5UL, - 0x0ce931732dbfe15aUL, 0x41b883c7621052f8UL, - /* 18 */ 0xdbf75ca0c3d25350UL, 0x2936be086eb1e351UL, - 0xc936e03cb4a9b212UL, 0x1d45bf82322225aaUL, - /* 19 */ 0xe81ab1036a024cc5UL, 0xe212201c304c9a72UL, - 0xc5d73fba6832b1fcUL, 0x20ffdb5a4d839581UL, - /* 20 */ 0xa283d367be5d0fadUL, 0x6c2b25ca8b164475UL, - 0x9d4935467caaf22eUL, 0x5166408eee85ff49UL, - /* 21 */ 0x3c67baa2fab4e361UL, 0xb3e433c67ef35cefUL, - 0x5259729241159b1cUL, 0x6a621892d5b0ab33UL, - /* 22 */ 0x20b74a387555cdcbUL, 0x532aa10e1208923fUL, - 0xeaa17b7762281dd1UL, 0x61ab3443f05c44bfUL, - /* 23 */ 0x257a6c422324def8UL, 0x131c6c1017e3cf7fUL, - 0x23758739f630a257UL, 0x295a407a01a78580UL, - /* 24 */ 0xf8c443246d5da8d9UL, 0x19d775450c52fa5dUL, - 0x2afcfc92731bf83dUL, 0x7d10c8e81b2b4700UL, - /* 25 */ 0xc8e0271f70baa20bUL, 0x993748867ca63957UL, - 0x5412efb3cb7ed4bbUL, 0x3196d36173e62975UL, - /* 26 */ 0xde5bcad141c7dffcUL, 0x47cc8cd2b395c848UL, - 0xa34cd942e11af3cbUL, 0x0256dbf2d04ecec2UL, - /* 27 */ 0x875ab7e94b0e667fUL, 0xcad4dd83c0850d10UL, - 0x47f12e8f4e72c79fUL, 0x5f1a87bb8c85b19bUL, - /* 28 */ 0x7ae9d0b6437f51b8UL, 0x12c7ce5518879065UL, - 0x2ade09fe5cf77aeeUL, 0x23a05a2f7d2c5627UL, - /* 29 */ 0x5908e128f17c169aUL, 0xf77498dd8ad0852dUL, - 0x74b4c4ceab102f64UL, 0x183abadd10139845UL, - /* 30 */ 0xb165ba8daa92aaacUL, 0xd5c5ef9599386705UL, - 0xbe2f8f0cf8fc40d1UL, 0x2701e635ee204514UL, - /* 31 */ 0x629fa80020156514UL, 0xf223868764a8c1ceUL, - 0x5b894fff0b3f060eUL, 0x60d9944cf708a3faUL, - /* 32 */ 0xaeea001a1c7a201fUL, 0xebf16a633ee2ce63UL, - 0x6f7709594c7a07e1UL, 0x79b958150d0208cbUL, - /* 33 */ 0x24b55e5301d410e7UL, 0xe3a34edff3fdc84dUL, - 0xd88768e4904032d8UL, 0x131384427b3aaeecUL, - /* 34 */ 0x8405e51286234f14UL, 0x14dc4739adb4c529UL, - 0xb8a2b5b250634ffdUL, 0x2fe2a94ad8a7ff93UL, - /* 35 */ 0xec5c57efe843faddUL, 0x2843ce40f0bb9918UL, - 0xa4b561d6cf3d6305UL, 0x743629bde8fb777eUL, - /* 36 */ 0x343edd46bbaf738fUL, 0xed981828b101a651UL, - 0xa401760b882c797aUL, 0x1fc223e28dc88730UL, - /* 37 */ 0x48604e91fc0fba0eUL, 0xb637f78f052c6fa4UL, - 0x91ccac3d09e9239cUL, 0x23f7eed4437a687cUL, - /* 38 */ 0x5173b1118d9bd800UL, 0x29d641b63189d4a7UL, - 0xfdbf177988bbc586UL, 0x2959894fcad81df5UL, - /* 39 */ 0xaebc8ef3b4bbc899UL, 0x4148995ab26992b9UL, - 0x24e20b0134f92cfbUL, 0x40d158894a05dee8UL, - /* 40 */ 0x46b00b1185af76f6UL, 0x26bac77873187a79UL, - 0x3dc0bf95ab8fff5fUL, 0x2a608bd8945524d7UL, - /* 41 */ 0x26449588bd446302UL, 0x7c4bc21c0388439cUL, - 0x8e98a4f383bd11b2UL, 0x26218d7bc9d876b9UL, - /* 42 */ 0xe3081542997c178aUL, 0x3c2d29a86fb6606fUL, - 0x5c217736fa279374UL, 0x7dde05734afeb1faUL, - /* 43 */ 0x3bf10e3906d42babUL, 0xe4f7803e1980649cUL, - 0xe6053bf89595bf7aUL, 0x394faf38da245530UL, - /* 44 */ 0x7a8efb58896928f4UL, 0xfbc778e9cc6a113cUL, - 0x72670ce330af596fUL, 0x48f222a81d3d6cf7UL, - /* 45 */ 0xf01fce410d72caa7UL, 0x5a20ecc7213b5595UL, - 0x7bc21165c1fa1483UL, 0x07f89ae31da8a741UL, - /* 46 */ 0x05d2c2b4c6830ff9UL, 0xd43e330fc6316293UL, - 0xa5a5590a96d3a904UL, 0x705edb91a65333b6UL, - /* 47 */ 0x048ee15e0bb9a5f7UL, 0x3240cfca9e0aaf5dUL, - 0x8f4b71ceedc4a40bUL, 0x621c0da3de544a6dUL, - /* 48 */ 0x92872836a08c4091UL, 0xce8375b010c91445UL, - 0x8a72eb524f276394UL, 0x2667fcfa7ec83635UL, - /* 49 */ 0x7f4c173345e8752aUL, 0x061b47feee7079a5UL, - 0x25dd9afa9f86ff34UL, 0x3780cef5425dc89cUL, - /* 50 */ 0x1a46035a513bb4e9UL, 0x3e1ef379ac575adaUL, - 0xc78c5f1c5fa24b50UL, 0x321a967634fd9f22UL, - /* 51 */ 0x946707b8826e27faUL, 0x3dca84d64c506fd0UL, - 0xc189218075e91436UL, 0x6d9284169b3b8484UL, - /* 52 */ 0x3a67e840383f2ddfUL, 0x33eec9a30c4f9b75UL, - 0x3ec7c86fa783ef47UL, 0x26ec449fbac9fbc4UL, - /* 53 */ 0x5c0f38cba09b9e7dUL, 0x81168cc762a3478cUL, - 0x3e23b0d306fc121cUL, 0x5a238aa0a5efdcddUL, - /* 54 */ 0x1ba26121c4ea43ffUL, 0x36f8c77f7c8832b5UL, - 0x88fbea0b0adcf99aUL, 0x5ca9938ec25bebf9UL, - /* 55 */ 0xd5436a5e51fccda0UL, 0x1dbc4797c2cd893bUL, - 0x19346a65d3224a08UL, 0x0f5034e49b9af466UL, - /* 56 */ 0xf23c3967a1e0b96eUL, 0xe58b08fa867a4d88UL, - 0xfb2fabc6a7341679UL, 0x2a75381eb6026946UL, - /* 57 */ 0xc80a3be4c19420acUL, 0x66b1f6c681f2b6dcUL, - 0x7cf7036761e93388UL, 0x25abbbd8a660a4c4UL, - /* 58 */ 0x91ea12ba14fd5198UL, 0x684950fc4a3cffa9UL, - 0xf826842130f5ad28UL, 0x3ea988f75301a441UL, - /* 59 */ 0xc978109a695f8c6fUL, 0x1746eb4a0530c3f3UL, - 0x444d6d77b4459995UL, 0x75952b8c054e5cc7UL, - /* 60 */ 0xa3703f7915f4d6aaUL, 0x66c346202f2647d8UL, - 0xd01469df811d644bUL, 0x77fea47d81a5d71fUL, - /* 61 */ 0xc5e9529ef57ca381UL, 0x6eeeb4b9ce2f881aUL, - 0xb6e91a28e8009bd6UL, 0x4b80be3e9afc3fecUL, - /* 62 */ 0x7e3773c526aed2c5UL, 0x1b4afcb453c9a49dUL, - 0xa920bdd7baffb24dUL, 0x7c54699f122d400eUL, - /* 63 */ 0xef46c8e14fa94bc8UL, 0xe0b074ce2952ed5eUL, - 0xbea450e1dbd885d5UL, 0x61b68649320f712cUL, - /* 64 */ 0x8a485f7309ccbdd1UL, 0xbd06320d7d4d1a2dUL, - 0x25232973322dbef4UL, 0x445dc4758c17f770UL, - /* 65 */ 0xdb0434177cc8933cUL, 0xed6fe82175ea059fUL, - 0x1efebefdc053db34UL, 0x4adbe867c65daf99UL, - /* 66 */ 0x3acd71a2a90609dfUL, 0xe5e991856dd04050UL, - 0x1ec69b688157c23cUL, 0x697427f6885cfe4dUL, - /* 67 */ 0xd7be7b9b65e1a851UL, 0xa03d28d522c536ddUL, - 0x28399d658fd2b645UL, 0x49e5b7e17c2641e1UL, - /* 68 */ 0x6f8c3a98700457a4UL, 0x5078f0a25ebb6778UL, - 0xd13c3ccbc382960fUL, 0x2e003258a7df84b1UL, - /* 69 */ 0x8ad1f39be6296a1cUL, 0xc1eeaa652a5fbfb2UL, - 0x33ee0673fd26f3cbUL, 0x59256173a69d2cccUL, - /* 70 */ 0x41ea07aa4e18fc41UL, 0xd9fc19527c87a51eUL, - 0xbdaacb805831ca6fUL, 0x445b652dc916694fUL, - /* 71 */ 0xce92a3a7f2172315UL, 0x1edc282de11b9964UL, - 0xa1823aafe04c314aUL, 0x790a2d94437cf586UL, - /* 72 */ 0x71c447fb93f6e009UL, 0x8922a56722845276UL, - 0xbf70903b204f5169UL, 0x2f7a89891ba319feUL, - /* 73 */ 0x02a08eb577e2140cUL, 0xed9a4ed4427bdcf4UL, - 0x5253ec44e4323cd1UL, 0x3e88363c14e9355bUL, - /* 74 */ 0xaa66c14277110b8cUL, 0x1ae0391610a23390UL, - 0x2030bd12c93fc2a2UL, 0x3ee141579555c7abUL, - /* 75 */ 0x9214de3a6d6e7d41UL, 0x3ccdd88607f17efeUL, - 0x674f1288f8e11217UL, 0x5682250f329f93d0UL, - /* 76 */ 0x6cf00b136d2e396eUL, 0x6e4cf86f1014debfUL, - 0x5930b1b5bfcc4e83UL, 0x047069b48aba16b6UL, - /* 77 */ 0x0d4ce4ab69b20793UL, 0xb24db91a97d0fb9eUL, - 0xcdfa50f54e00d01dUL, 0x221b1085368bddb5UL, - /* 78 */ 0xe7e59468b1e3d8d2UL, 0x53c56563bd122f93UL, - 0xeee8a903e0663f09UL, 0x61efa662cbbe3d42UL, - /* 79 */ 0x2cf8ddddde6eab2aUL, 0x9bf80ad51435f231UL, - 0x5deadacec9f04973UL, 0x29275b5d41d29b27UL, - /* 80 */ 0xcfde0f0895ebf14fUL, 0xb9aab96b054905a7UL, - 0xcae80dd9a1c420fdUL, 0x0a63bf2f1673bbc7UL, - /* 81 */ 0x092f6e11958fbc8cUL, 0x672a81e804822fadUL, - 0xcac8351560d52517UL, 0x6f3f7722c8f192f8UL, - /* 82 */ 0xf8ba90ccc2e894b7UL, 0x2c7557a438ff9f0dUL, - 0x894d1d855ae52359UL, 0x68e122157b743d69UL, - /* 83 */ 0xd87e5570cfb919f3UL, 0x3f2cdecd95798db9UL, - 0x2121154710c0a2ceUL, 0x3c66a115246dc5b2UL, - /* 84 */ 0xcbedc562294ecb72UL, 0xba7143c36a280b16UL, - 0x9610c2efd4078b67UL, 0x6144735d946a4b1eUL, - /* 85 */ 0x536f111ed75b3350UL, 0x0211db8c2041d81bUL, - 0xf93cb1000e10413cUL, 0x149dfd3c039e8876UL, - /* 86 */ 0xd479dde46b63155bUL, 0xb66e15e93c837976UL, - 0xdafde43b1f13e038UL, 0x5fafda1a2e4b0b35UL, - /* 87 */ 0x3600bbdf17197581UL, 0x3972050bbe3cd2c2UL, - 0x5938906dbdd5be86UL, 0x34fce5e43f9b860fUL, - /* 88 */ 0x75a8a4cd42d14d02UL, 0x828dabc53441df65UL, - 0x33dcabedd2e131d3UL, 0x3ebad76fb814d25fUL, - /* 89 */ 0xd4906f566f70e10fUL, 0x5d12f7aa51690f5aUL, - 0x45adb16e76cefcf2UL, 0x01f768aead232999UL, - /* 90 */ 0x2b6cc77b6248febdUL, 0x3cd30628ec3aaffdUL, - 0xce1c0b80d4ef486aUL, 0x4c3bff2ea6f66c23UL, - /* 91 */ 0x3f2ec4094aeaeb5fUL, 0x61b19b286e372ca7UL, - 0x5eefa966de2a701dUL, 0x23b20565de55e3efUL, - /* 92 */ 0xe301ca5279d58557UL, 0x07b2d4ce27c2874fUL, - 0xa532cd8a9dcf1d67UL, 0x2a52fee23f2bff56UL, - /* 93 */ 0x8624efb37cd8663dUL, 0xbbc7ac20ffbd7594UL, - 0x57b85e9c82d37445UL, 0x7b3052cb86a6ec66UL, - /* 94 */ 0x3482f0ad2525e91eUL, 0x2cb68043d28edca0UL, - 0xaf4f6d052e1b003aUL, 0x185f8c2529781b0aUL, - /* 95 */ 0xaa41de5bd80ce0d6UL, 0x9407b2416853e9d6UL, - 0x563ec36e357f4c3aUL, 0x4cc4b8dd0e297bceUL, - /* 96 */ 0xa2fc1a52ffb8730eUL, 0x1811f16e67058e37UL, - 0x10f9a366cddf4ee1UL, 0x72f4a0c4a0b9f099UL, - /* 97 */ 0x8c16c06f663f4ea7UL, 0x693b3af74e970fbaUL, - 0x2102e7f1d69ec345UL, 0x0ba53cbc968a8089UL, - /* 98 */ 0xca3d9dc7fea15537UL, 0x4c6824bb51536493UL, - 0xb9886314844006b1UL, 0x40d2a72ab454cc60UL, - /* 99 */ 0x5936a1b712570975UL, 0x91b9d648debda657UL, - 0x3344094bb64330eaUL, 0x006ba10d12ee51d0UL, - /* 100 */ 0x19228468f5de5d58UL, 0x0eb12f4c38cc05b0UL, - 0xa1039f9dd5601990UL, 0x4502d4ce4fff0e0bUL, - /* 101 */ 0xeb2054106837c189UL, 0xd0f6544c6dd3b93cUL, - 0x40727064c416d74fUL, 0x6e15c6114b502ef0UL, - /* 102 */ 0x4df2a398cfb1a76bUL, 0x11256c7419f2f6b1UL, - 0x4a497962066e6043UL, 0x705b3aab41355b44UL, - /* 103 */ 0x365ef536d797b1d8UL, 0x00076bd622ddf0dbUL, - 0x3bbf33b0e0575a88UL, 0x3777aa05c8e4ca4dUL, - /* 104 */ 0x392745c85578db5fUL, 0x6fda4149dbae5ae2UL, - 0xb1f0b00b8adc9867UL, 0x09963437d36f1da3UL, - /* 105 */ 0x7e824e90a5dc3853UL, 0xccb5f6641f135cbdUL, - 0x6736d86c87ce8fccUL, 0x625f3ce26604249fUL, - /* 106 */ 0xaf8ac8059502f63fUL, 0x0c05e70a2e351469UL, - 0x35292e9c764b6305UL, 0x1a394360c7e23ac3UL, - /* 107 */ 0xd5c6d53251183264UL, 0x62065abd43c2b74fUL, - 0xb5fbf5d03b973f9bUL, 0x13a3da3661206e5eUL, - /* 108 */ 0xc6bd5837725d94e5UL, 0x18e30912205016c5UL, - 0x2088ce1570033c68UL, 0x7fba1f495c837987UL, - /* 109 */ 0x5a8c7423f2f9079dUL, 0x1735157b34023fc5UL, - 0xe4f9b49ad2fab351UL, 0x6691ff72c878e33cUL, - /* 110 */ 0x122c2adedc5eff3eUL, 0xf8dd4bf1d8956cf4UL, - 0xeb86205d9e9e5bdaUL, 0x049b92b9d975c743UL, - /* 111 */ 0xa5379730b0f6c05aUL, 0x72a0ffacc6f3a553UL, - 0xb0032c34b20dcd6dUL, 0x470e9dbc88d5164aUL, - /* 112 */ 0xb19cf10ca237c047UL, 0xb65466711f6c81a2UL, - 0xb3321bd16dd80b43UL, 0x48c14f600c5fbe8eUL, - /* 113 */ 0x66451c264aa6c803UL, 0xb66e3904a4fa7da6UL, - 0xd45f19b0b3128395UL, 0x31602627c3c9bc10UL, - /* 114 */ 0x3120dc4832e4e10dUL, 0xeb20c46756c717f7UL, - 0x00f52e3f67280294UL, 0x566d4fc14730c509UL, - /* 115 */ 0x7e3a5d40fd837206UL, 0xc1e926dc7159547aUL, - 0x216730fba68d6095UL, 0x22e8c3843f69cea7UL, - /* 116 */ 0x33d074e8930e4b2bUL, 0xb6e4350e84d15816UL, - 0x5534c26ad6ba2365UL, 0x7773c12f89f1f3f3UL, - /* 117 */ 0x8cba404da57962aaUL, 0x5b9897a81999ce56UL, - 0x508e862f121692fcUL, 0x3a81907fa093c291UL, - /* 118 */ 0x0dded0ff4725a510UL, 0x10d8cc10673fc503UL, - 0x5b9d151c9f1f4e89UL, 0x32a5c1d5cb09a44cUL, - /* 119 */ 0x1e0aa442b90541fbUL, 0x5f85eb7cc1b485dbUL, - 0xbee595ce8a9df2e5UL, 0x25e496c722422236UL, - /* 120 */ 0x5edf3c46cd0fe5b9UL, 0x34e75a7ed2a43388UL, - 0xe488de11d761e352UL, 0x0e878a01a085545cUL, - /* 121 */ 0xba493c77e021bb04UL, 0x2b4d1843c7df899aUL, - 0x9ea37a487ae80d67UL, 0x67a9958011e41794UL, - /* 122 */ 0x4b58051a6697b065UL, 0x47e33f7d8d6ba6d4UL, - 0xbb4da8d483ca46c1UL, 0x68becaa181c2db0dUL, - /* 123 */ 0x8d8980e90b989aa5UL, 0xf95eb14a2c93c99bUL, - 0x51c6c7c4796e73a2UL, 0x6e228363b5efb569UL, - /* 124 */ 0xc6bbc0b02dd624c8UL, 0x777eb47dec8170eeUL, - 0x3cde15a004cfafa9UL, 0x1dc6bc087160bf9bUL, - /* 125 */ 0x2e07e043eec34002UL, 0x18e9fc677a68dc7fUL, - 0xd8da03188bd15b9aUL, 0x48fbc3bb00568253UL, - /* 126 */ 0x57547d4cfb654ce1UL, 0xd3565b82a058e2adUL, - 0xf63eaf0bbf154478UL, 0x47531ef114dfbb18UL, - /* 127 */ 0xe1ec630a4278c587UL, 0x5507d546ca8e83f3UL, - 0x85e135c63adc0c2bUL, 0x0aa7efa85682844eUL, - /* 128 */ 0x72691ba8b3e1f615UL, 0x32b4e9701fbe3ffaUL, - 0x97b6d92e39bb7868UL, 0x2cfe53dea02e39e8UL, - /* 129 */ 0x687392cd85cd52b0UL, 0x27ff66c910e29831UL, - 0x97134556a9832d06UL, 0x269bb0360a84f8a0UL, - /* 130 */ 0x706e55457643f85cUL, 0x3734a48c9b597d1bUL, - 0x7aee91e8c6efa472UL, 0x5cd6abc198a9d9e0UL, - /* 131 */ 0x0e04de06cb3ce41aUL, 0xd8c6eb893402e138UL, - 0x904659bb686e3772UL, 0x7215c371746ba8c8UL, - /* 132 */ 0xfd12a97eeae4a2d9UL, 0x9514b7516394f2c5UL, - 0x266fd5809208f294UL, 0x5c847085619a26b9UL, - /* 133 */ 0x52985410fed694eaUL, 0x3c905b934a2ed254UL, - 0x10bb47692d3be467UL, 0x063b3d2d69e5e9e1UL, - /* 134 */ 0x472726eedda57debUL, 0xefb6c4ae10f41891UL, - 0x2b1641917b307614UL, 0x117c554fc4f45b7cUL, - /* 135 */ 0xc07cf3118f9d8812UL, 0x01dbd82050017939UL, - 0xd7e803f4171b2827UL, 0x1015e87487d225eaUL, - /* 136 */ 0xc58de3fed23acc4dUL, 0x50db91c294a7be2dUL, - 0x0b94d43d1c9cf457UL, 0x6b1640fa6e37524aUL, - /* 137 */ 0x692f346c5fda0d09UL, 0x200b1c59fa4d3151UL, - 0xb8c46f760777a296UL, 0x4b38395f3ffdfbcfUL, - /* 138 */ 0x18d25e00be54d671UL, 0x60d50582bec8aba6UL, - 0x87ad8f263b78b982UL, 0x50fdf64e9cda0432UL, - /* 139 */ 0x90f567aac578dcf0UL, 0xef1e9b0ef2a3133bUL, - 0x0eebba9242d9de71UL, 0x15473c9bf03101c7UL, - /* 140 */ 0x7c77e8ae56b78095UL, 0xb678e7666e6f078eUL, - 0x2da0b9615348ba1fUL, 0x7cf931c1ff733f0bUL, - /* 141 */ 0x26b357f50a0a366cUL, 0xe9708cf42b87d732UL, - 0xc13aeea5f91cb2c0UL, 0x35d90c991143bb4cUL, - /* 142 */ 0x47c1c404a9a0d9dcUL, 0x659e58451972d251UL, - 0x3875a8c473b38c31UL, 0x1fbd9ed379561f24UL, - /* 143 */ 0x11fabc6fd41ec28dUL, 0x7ef8dfe3cd2a2dcaUL, - 0x72e73b5d8c404595UL, 0x6135fa4954b72f27UL, - /* 144 */ 0xccfc32a2de24b69cUL, 0x3f55698c1f095d88UL, - 0xbe3350ed5ac3f929UL, 0x5e9bf806ca477eebUL, - /* 145 */ 0xe9ce8fb63c309f68UL, 0x5376f63565e1f9f4UL, - 0xd1afcfb35a6393f1UL, 0x6632a1ede5623506UL, - /* 146 */ 0x0b7d6c390c2ded4cUL, 0x56cb3281df04cb1fUL, - 0x66305a1249ecc3c7UL, 0x5d588b60a38ca72aUL, - /* 147 */ 0xa6ecbf78e8e5f42dUL, 0x86eeb44b3c8a3eecUL, - 0xec219c48fbd21604UL, 0x1aaf1af517c36731UL, - /* 148 */ 0xc306a2836769bde7UL, 0x208280622b1e2adbUL, - 0x8027f51ffbff94a6UL, 0x76cfa1ce1124f26bUL, - /* 149 */ 0x18eb00562422abb6UL, 0xf377c4d58f8c29c3UL, - 0x4dbbc207f531561aUL, 0x0253b7f082128a27UL, - /* 150 */ 0x3d1f091cb62c17e0UL, 0x4860e1abd64628a9UL, - 0x52d17436309d4253UL, 0x356f97e13efae576UL, - /* 151 */ 0xd351e11aa150535bUL, 0x3e6b45bb1dd878ccUL, - 0x0c776128bed92c98UL, 0x1d34ae93032885b8UL, - /* 152 */ 0x4ba0488ca85ba4c3UL, 0x985348c33c9ce6ceUL, - 0x66124c6f97bda770UL, 0x0f81a0290654124aUL, - /* 153 */ 0x9ed09ca6569b86fdUL, 0x811009fd18af9a2dUL, - 0xff08d03f93d8c20aUL, 0x52a148199faef26bUL, - /* 154 */ 0x3e03f9dc2d8d1b73UL, 0x4205801873961a70UL, - 0xc0d987f041a35970UL, 0x07aa1f15a1c0d549UL, - /* 155 */ 0xdfd46ce08cd27224UL, 0x6d0a024f934e4239UL, - 0x808a7a6399897b59UL, 0x0a4556e9e13d95a2UL, - /* 156 */ 0xd21a991fe9c13045UL, 0x9b0e8548fe7751b8UL, - 0x5da643cb4bf30035UL, 0x77db28d63940f721UL, - /* 157 */ 0xfc5eeb614adc9011UL, 0x5229419ae8c411ebUL, - 0x9ec3e7787d1dcf74UL, 0x340d053e216e4cb5UL, - /* 158 */ 0xcac7af39b48df2b4UL, 0xc0faec2871a10a94UL, - 0x140a69245ca575edUL, 0x0cf1c37134273a4cUL, - /* 159 */ 0xc8ee306ac224b8a5UL, 0x57eaee7ccb4930b0UL, - 0xa1e806bdaacbe74fUL, 0x7d9a62742eeb657dUL, - /* 160 */ 0x9eb6b6ef546c4830UL, 0x885cca1fddb36e2eUL, - 0xe6b9f383ef0d7105UL, 0x58654fef9d2e0412UL, - /* 161 */ 0xa905c4ffbe0e8e26UL, 0x942de5df9b31816eUL, - 0x497d723f802e88e1UL, 0x30684dea602f408dUL, - /* 162 */ 0x21e5a278a3e6cb34UL, 0xaefb6e6f5b151dc4UL, - 0xb30b8e049d77ca15UL, 0x28c3c9cf53b98981UL, - /* 163 */ 0x287fb721556cdd2aUL, 0x0d317ca897022274UL, - 0x7468c7423a543258UL, 0x4a7f11464eb5642fUL, - /* 164 */ 0xa237a4774d193aa6UL, 0xd865986ea92129a1UL, - 0x24c515ecf87c1a88UL, 0x604003575f39f5ebUL, - /* 165 */ 0x47b9f189570a9b27UL, 0x2b98cede465e4b78UL, - 0x026df551dbb85c20UL, 0x74fcd91047e21901UL, - /* 166 */ 0x13e2a90a23c1bfa3UL, 0x0cb0074e478519f6UL, - 0x5ff1cbbe3af6cf44UL, 0x67fe5438be812dbeUL, - /* 167 */ 0xd13cf64fa40f05b0UL, 0x054dfb2f32283787UL, - 0x4173915b7f0d2aeaUL, 0x482f144f1f610d4eUL, - /* 168 */ 0xf6210201b47f8234UL, 0x5d0ae1929e70b990UL, - 0xdcd7f455b049567cUL, 0x7e93d0f1f0916f01UL, - /* 169 */ 0xdd79cbf18a7db4faUL, 0xbe8391bf6f74c62fUL, - 0x027145d14b8291bdUL, 0x585a73ea2cbf1705UL, - /* 170 */ 0x485ca03e928a0db2UL, 0x10fc01a5742857e7UL, - 0x2f482edbd6d551a7UL, 0x0f0433b5048fdb8aUL, - /* 171 */ 0x60da2e8dd7dc6247UL, 0x88b4c9d38cd4819aUL, - 0x13033ac001f66697UL, 0x273b24fe3b367d75UL, - /* 172 */ 0xc6e8f66a31b3b9d4UL, 0x281514a494df49d5UL, - 0xd1726fdfc8b23da7UL, 0x4b3ae7d103dee548UL, - /* 173 */ 0xc6256e19ce4b9d7eUL, 0xff5c5cf186e3c61cUL, - 0xacc63ca34b8ec145UL, 0x74621888fee66574UL, - /* 174 */ 0x956f409645290a1eUL, 0xef0bf8e3263a962eUL, - 0xed6a50eb5ec2647bUL, 0x0694283a9dca7502UL, - /* 175 */ 0x769b963643a2dcd1UL, 0x42b7c8ea09fc5353UL, - 0x4f002aee13397eabUL, 0x63005e2c19b7d63aUL, - /* 176 */ 0xca6736da63023beaUL, 0x966c7f6db12a99b7UL, - 0xace09390c537c5e1UL, 0x0b696063a1aa89eeUL, - /* 177 */ 0xebb03e97288c56e5UL, 0x432a9f9f938c8be8UL, - 0xa6a5a93d5b717f71UL, 0x1a5fb4c3e18f9d97UL, - /* 178 */ 0x1c94e7ad1c60cdceUL, 0xee202a43fc02c4a0UL, - 0x8dafe4d867c46a20UL, 0x0a10263c8ac27b58UL, - /* 179 */ 0xd0dea9dfe4432a4aUL, 0x856af87bbe9277c5UL, - 0xce8472acc212c71aUL, 0x6f151b6d9bbb1e91UL, - /* 180 */ 0x26776c527ceed56aUL, 0x7d211cb7fbf8faecUL, - 0x37ae66a6fd4609ccUL, 0x1f81b702d2770c42UL, - /* 181 */ 0x2fb0b057eac58392UL, 0xe1dd89fe29744e9dUL, - 0xc964f8eb17beb4f8UL, 0x29571073c9a2d41eUL, - /* 182 */ 0xa948a18981c0e254UL, 0x2df6369b65b22830UL, - 0xa33eb2d75fcfd3c6UL, 0x078cd6ec4199a01fUL, - /* 183 */ 0x4a584a41ad900d2fUL, 0x32142b78e2c74c52UL, - 0x68c4e8338431c978UL, 0x7f69ea9008689fc2UL, - /* 184 */ 0x52f2c81e46a38265UL, 0xfd78072d04a832fdUL, - 0x8cd7d5fa25359e94UL, 0x4de71b7454cc29d2UL, - /* 185 */ 0x42eb60ad1eda6ac9UL, 0x0aad37dfdbc09c3aUL, - 0x81004b71e33cc191UL, 0x44e6be345122803cUL, - /* 186 */ 0x03fe8388ba1920dbUL, 0xf5d57c32150db008UL, - 0x49c8c4281af60c29UL, 0x21edb518de701aeeUL, - /* 187 */ 0x7fb63e418f06dc99UL, 0xa4460d99c166d7b8UL, - 0x24dd5248ce520a83UL, 0x5ec3ad712b928358UL, - /* 188 */ 0x15022a5fbd17930fUL, 0xa4f64a77d82570e3UL, - 0x12bc8d6915783712UL, 0x498194c0fc620abbUL, - /* 189 */ 0x38a2d9d255686c82UL, 0x785c6bd9193e21f0UL, - 0xe4d5c81ab24a5484UL, 0x56307860b2e20989UL, - /* 190 */ 0x429d55f78b4d74c4UL, 0x22f1834643350131UL, - 0x1e60c24598c71fffUL, 0x59f2f014979983efUL, - /* 191 */ 0x46a47d56eb494a44UL, 0x3e22a854d636a18eUL, - 0xb346e15274491c3bUL, 0x2ceafd4e5390cde7UL, - /* 192 */ 0xba8a8538be0d6675UL, 0x4b9074bb50818e23UL, - 0xcbdab89085d304c3UL, 0x61a24fe0e56192c4UL, - /* 193 */ 0xcb7615e6db525bcbUL, 0xdd7d8c35a567e4caUL, - 0xe6b4153acafcdd69UL, 0x2d668e097f3c9766UL, - /* 194 */ 0xa57e7e265ce55ef0UL, 0x5d9f4e527cd4b967UL, - 0xfbc83606492fd1e5UL, 0x090d52beb7c3f7aeUL, - /* 195 */ 0x09b9515a1e7b4d7cUL, 0x1f266a2599da44c0UL, - 0xa1c49548e2c55504UL, 0x7ef04287126f15ccUL, - /* 196 */ 0xfed1659dbd30ef15UL, 0x8b4ab9eec4e0277bUL, - 0x884d6236a5df3291UL, 0x1fd96ea6bf5cf788UL, - /* 197 */ 0x42a161981f190d9aUL, 0x61d849507e6052c1UL, - 0x9fe113bf285a2cd5UL, 0x7c22d676dbad85d8UL, - /* 198 */ 0x82e770ed2bfbd27dUL, 0x4c05b2ece996f5a5UL, - 0xcd40a9c2b0900150UL, 0x5895319213d9bf64UL, - /* 199 */ 0xe7cc5d703fea2e08UL, 0xb50c491258e2188cUL, - 0xcce30baa48205bf0UL, 0x537c659ccfa32d62UL, - /* 200 */ 0x37b6623a98cfc088UL, 0xfe9bed1fa4d6aca4UL, - 0x04d29b8e56a8d1b0UL, 0x725f71c40b519575UL, - /* 201 */ 0x28c7f89cd0339ce6UL, 0x8367b14469ddc18bUL, - 0x883ada83a6a1652cUL, 0x585f1974034d6c17UL, - /* 202 */ 0x89cfb266f1b19188UL, 0xe63b4863e7c35217UL, - 0xd88c9da6b4c0526aUL, 0x3e035c9df0954635UL, - /* 203 */ 0xdd9d5412fb45de9dUL, 0xdd684532e4cff40dUL, - 0x4b5c999b151d671cUL, 0x2d8c2cc811e7f690UL, - /* 204 */ 0x7f54be1d90055d40UL, 0xa464c5df464aaf40UL, - 0x33979624f0e917beUL, 0x2c018dc527356b30UL, - /* 205 */ 0xa5415024e330b3d4UL, 0x73ff3d96691652d3UL, - 0x94ec42c4ef9b59f1UL, 0x0747201618d08e5aUL, - /* 206 */ 0x4d6ca48aca411c53UL, 0x66415f2fcfa66119UL, - 0x9c4dd40051e227ffUL, 0x59810bc09a02f7ebUL, - /* 207 */ 0x2a7eb171b3dc101dUL, 0x441c5ab99ffef68eUL, - 0x32025c9b93b359eaUL, 0x5e8ce0a71e9d112fUL, - /* 208 */ 0xbfcccb92429503fdUL, 0xd271ba752f095d55UL, - 0x345ead5e972d091eUL, 0x18c8df11a83103baUL, - /* 209 */ 0x90cd949a9aed0f4cUL, 0xc5d1f4cb6660e37eUL, - 0xb8cac52d56c52e0bUL, 0x6e42e400c5808e0dUL, - /* 210 */ 0xa3b46966eeaefd23UL, 0x0c4f1f0be39ecdcaUL, - 0x189dc8c9d683a51dUL, 0x51f27f054c09351bUL, - /* 211 */ 0x4c487ccd2a320682UL, 0x587ea95bb3df1c96UL, - 0xc8ccf79e555cb8e8UL, 0x547dc829a206d73dUL, - /* 212 */ 0xb822a6cd80c39b06UL, 0xe96d54732000d4c6UL, - 0x28535b6f91463b4dUL, 0x228f4660e2486e1dUL, - /* 213 */ 0x98799538de8d3abfUL, 0x8cd8330045ebca6eUL, - 0x79952a008221e738UL, 0x4322e1a7535cd2bbUL, - /* 214 */ 0xb114c11819d1801cUL, 0x2016e4d84f3f5ec7UL, - 0xdd0e2df409260f4cUL, 0x5ec362c0ae5f7266UL, - /* 215 */ 0xc0462b18b8b2b4eeUL, 0x7cc8d950274d1afbUL, - 0xf25f7105436b02d2UL, 0x43bbf8dcbff9ccd3UL, - /* 216 */ 0xb6ad1767a039e9dfUL, 0xb0714da8f69d3583UL, - 0x5e55fa18b42931f5UL, 0x4ed5558f33c60961UL, - /* 217 */ 0x1fe37901c647a5ddUL, 0x593ddf1f8081d357UL, - 0x0249a4fd813fd7a6UL, 0x69acca274e9caf61UL, - /* 218 */ 0x047ba3ea330721c9UL, 0x83423fc20e7e1ea0UL, - 0x1df4c0af01314a60UL, 0x09a62dab89289527UL, - /* 219 */ 0xa5b325a49cc6cb00UL, 0xe94b5dc654b56cb6UL, - 0x3be28779adc994a0UL, 0x4296e8f8ba3a4aadUL, - /* 220 */ 0x328689761e451eabUL, 0x2e4d598bff59594aUL, - 0x49b96853d7a7084aUL, 0x4980a319601420a8UL, - /* 221 */ 0x9565b9e12f552c42UL, 0x8a5318db7100fe96UL, - 0x05c90b4d43add0d7UL, 0x538b4cd66a5d4edaUL, - /* 222 */ 0xf4e94fc3e89f039fUL, 0x592c9af26f618045UL, - 0x08a36eb5fd4b9550UL, 0x25fffaf6c2ed1419UL, - /* 223 */ 0x34434459cc79d354UL, 0xeeecbfb4b1d5476bUL, - 0xddeb34a061615d99UL, 0x5129cecceb64b773UL, - /* 224 */ 0xee43215894993520UL, 0x772f9c7cf14c0b3bUL, - 0xd2e2fce306bedad5UL, 0x715f42b546f06a97UL, - /* 225 */ 0x434ecdceda5b5f1aUL, 0x0da17115a49741a9UL, - 0x680bd77c73edad2eUL, 0x487c02354edd9041UL, - /* 226 */ 0xb8efeff3a70ed9c4UL, 0x56a32aa3e857e302UL, - 0xdf3a68bd48a2a5a0UL, 0x07f650b73176c444UL, - /* 227 */ 0xe38b9b1626e0ccb1UL, 0x79e053c18b09fb36UL, - 0x56d90319c9f94964UL, 0x1ca941e7ac9ff5c4UL, - /* 228 */ 0x49c4df29162fa0bbUL, 0x8488cf3282b33305UL, - 0x95dfda14cabb437dUL, 0x3391f78264d5ad86UL, - /* 229 */ 0x729ae06ae2b5095dUL, 0xd58a58d73259a946UL, - 0xe9834262d13921edUL, 0x27fedafaa54bb592UL, - /* 230 */ 0xa99dc5b829ad48bbUL, 0x5f025742499ee260UL, - 0x802c8ecd5d7513fdUL, 0x78ceb3ef3f6dd938UL, - /* 231 */ 0xc342f44f8a135d94UL, 0x7b9edb44828cdda3UL, - 0x9436d11a0537cfe7UL, 0x5064b164ec1ab4c8UL, - /* 232 */ 0x7020eccfd37eb2fcUL, 0x1f31ea3ed90d25fcUL, - 0x1b930d7bdfa1bb34UL, 0x5344467a48113044UL, - /* 233 */ 0x70073170f25e6dfbUL, 0xe385dc1a50114cc8UL, - 0x2348698ac8fc4f00UL, 0x2a77a55284dd40d8UL, - /* 234 */ 0xfe06afe0c98c6ce4UL, 0xc235df96dddfd6e4UL, - 0x1428d01e33bf1ed3UL, 0x785768ec9300bdafUL, - /* 235 */ 0x9702e57a91deb63bUL, 0x61bdb8bfe5ce8b80UL, - 0x645b426f3d1d58acUL, 0x4804a82227a557bcUL, - /* 236 */ 0x8e57048ab44d2601UL, 0x68d6501a4b3a6935UL, - 0xc39c9ec3f9e1c293UL, 0x4172f257d4de63e2UL, - /* 237 */ 0xd368b450330c6401UL, 0x040d3017418f2391UL, - 0x2c34bb6090b7d90dUL, 0x16f649228fdfd51fUL, - /* 238 */ 0xbea6818e2b928ef5UL, 0xe28ccf91cdc11e72UL, - 0x594aaa68e77a36cdUL, 0x313034806c7ffd0fUL, - /* 239 */ 0x8a9d27ac2249bd65UL, 0x19a3b464018e9512UL, - 0xc26ccff352b37ec7UL, 0x056f68341d797b21UL, - /* 240 */ 0x5e79d6757efd2327UL, 0xfabdbcb6553afe15UL, - 0xd3e7222c6eaf5a60UL, 0x7046c76d4dae743bUL, - /* 241 */ 0x660be872b18d4a55UL, 0x19992518574e1496UL, - 0xc103053a302bdcbbUL, 0x3ed8e9800b218e8eUL, - /* 242 */ 0x7b0b9239fa75e03eUL, 0xefe9fb684633c083UL, - 0x98a35fbe391a7793UL, 0x6065510fe2d0fe34UL, - /* 243 */ 0x55cb668548abad0cUL, 0xb4584548da87e527UL, - 0x2c43ecea0107c1ddUL, 0x526028809372de35UL, - /* 244 */ 0x3415c56af9213b1fUL, 0x5bee1a4d017e98dbUL, - 0x13f6b105b5cf709bUL, 0x5ff20e3482b29ab6UL, - /* 245 */ 0x0aa29c75cc2e6c90UL, 0xfc7d73ca3a70e206UL, - 0x899fc38fc4b5c515UL, 0x250386b124ffc207UL, - /* 246 */ 0x54ea28d5ae3d2b56UL, 0x9913149dd6de60ceUL, - 0x16694fc58f06d6c1UL, 0x46b23975eb018fc7UL, - /* 247 */ 0x470a6a0fb4b7b4e2UL, 0x5d92475a8f7253deUL, - 0xabeee5b52fbd3adbUL, 0x7fa20801a0806968UL, - /* 248 */ 0x76f3faf19f7714d2UL, 0xb3e840c12f4660c3UL, - 0x0fb4cd8df212744eUL, 0x4b065a251d3a2dd2UL, - /* 249 */ 0x5cebde383d77cd4aUL, 0x6adf39df882c9cb1UL, - 0xa2dd242eb09af759UL, 0x3147c0e50e5f6422UL, - /* 250 */ 0x164ca5101d1350dbUL, 0xf8d13479c33fc962UL, - 0xe640ce4d13e5da08UL, 0x4bdee0c45061f8baUL, - /* 251 */ 0xd7c46dc1a4edb1c9UL, 0x5514d7b6437fd98aUL, - 0x58942f6bb2a1c00bUL, 0x2dffb2ab1d70710eUL, - /* 252 */ 0xccdfcf2fc18b6d68UL, 0xa8ebcba8b7806167UL, - 0x980697f95e2937e3UL, 0x02fbba1cd0126e8cUL -}; - -/* c is two 512-bit products: c0[0:7]=a0[0:3]*b0[0:3] and c1[8:15]=a1[4:7]*b1[4:7] - * a is two 256-bit integers: a0[0:3] and a1[4:7] - * b is two 256-bit integers: b0[0:3] and b1[4:7] - */ -static void mul2_256x256_integer_adx(u64 *const c, const u64 *const a, - const u64 *const b) -{ - asm volatile( - "xorl %%r14d, %%r14d ;" - "movq (%1), %%rdx; " /* A[0] */ - "mulx (%2), %%r8, %%r15; " /* A[0]*B[0] */ - "xorl %%r10d, %%r10d ;" - "movq %%r8, (%0) ;" - "mulx 8(%2), %%r10, %%rax; " /* A[0]*B[1] */ - "adox %%r10, %%r15 ;" - "mulx 16(%2), %%r8, %%rbx; " /* A[0]*B[2] */ - "adox %%r8, %%rax ;" - "mulx 24(%2), %%r10, %%rcx; " /* A[0]*B[3] */ - "adox %%r10, %%rbx ;" - /******************************************/ - "adox %%r14, %%rcx ;" - - "movq 8(%1), %%rdx; " /* A[1] */ - "mulx (%2), %%r8, %%r9; " /* A[1]*B[0] */ - "adox %%r15, %%r8 ;" - "movq %%r8, 8(%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[1]*B[1] */ - "adox %%r10, %%r9 ;" - "adcx %%r9, %%rax ;" - "mulx 16(%2), %%r8, %%r13; " /* A[1]*B[2] */ - "adox %%r8, %%r11 ;" - "adcx %%r11, %%rbx ;" - "mulx 24(%2), %%r10, %%r15; " /* A[1]*B[3] */ - "adox %%r10, %%r13 ;" - "adcx %%r13, %%rcx ;" - /******************************************/ - "adox %%r14, %%r15 ;" - "adcx %%r14, %%r15 ;" - - "movq 16(%1), %%rdx; " /* A[2] */ - "xorl %%r10d, %%r10d ;" - "mulx (%2), %%r8, %%r9; " /* A[2]*B[0] */ - "adox %%rax, %%r8 ;" - "movq %%r8, 16(%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[2]*B[1] */ - "adox %%r10, %%r9 ;" - "adcx %%r9, %%rbx ;" - "mulx 16(%2), %%r8, %%r13; " /* A[2]*B[2] */ - "adox %%r8, %%r11 ;" - "adcx %%r11, %%rcx ;" - "mulx 24(%2), %%r10, %%rax; " /* A[2]*B[3] */ - "adox %%r10, %%r13 ;" - "adcx %%r13, %%r15 ;" - /******************************************/ - "adox %%r14, %%rax ;" - "adcx %%r14, %%rax ;" - - "movq 24(%1), %%rdx; " /* A[3] */ - "xorl %%r10d, %%r10d ;" - "mulx (%2), %%r8, %%r9; " /* A[3]*B[0] */ - "adox %%rbx, %%r8 ;" - "movq %%r8, 24(%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[3]*B[1] */ - "adox %%r10, %%r9 ;" - "adcx %%r9, %%rcx ;" - "movq %%rcx, 32(%0) ;" - "mulx 16(%2), %%r8, %%r13; " /* A[3]*B[2] */ - "adox %%r8, %%r11 ;" - "adcx %%r11, %%r15 ;" - "movq %%r15, 40(%0) ;" - "mulx 24(%2), %%r10, %%rbx; " /* A[3]*B[3] */ - "adox %%r10, %%r13 ;" - "adcx %%r13, %%rax ;" - "movq %%rax, 48(%0) ;" - /******************************************/ - "adox %%r14, %%rbx ;" - "adcx %%r14, %%rbx ;" - "movq %%rbx, 56(%0) ;" - - "movq 32(%1), %%rdx; " /* C[0] */ - "mulx 32(%2), %%r8, %%r15; " /* C[0]*D[0] */ - "xorl %%r10d, %%r10d ;" - "movq %%r8, 64(%0);" - "mulx 40(%2), %%r10, %%rax; " /* C[0]*D[1] */ - "adox %%r10, %%r15 ;" - "mulx 48(%2), %%r8, %%rbx; " /* C[0]*D[2] */ - "adox %%r8, %%rax ;" - "mulx 56(%2), %%r10, %%rcx; " /* C[0]*D[3] */ - "adox %%r10, %%rbx ;" - /******************************************/ - "adox %%r14, %%rcx ;" - - "movq 40(%1), %%rdx; " /* C[1] */ - "xorl %%r10d, %%r10d ;" - "mulx 32(%2), %%r8, %%r9; " /* C[1]*D[0] */ - "adox %%r15, %%r8 ;" - "movq %%r8, 72(%0);" - "mulx 40(%2), %%r10, %%r11; " /* C[1]*D[1] */ - "adox %%r10, %%r9 ;" - "adcx %%r9, %%rax ;" - "mulx 48(%2), %%r8, %%r13; " /* C[1]*D[2] */ - "adox %%r8, %%r11 ;" - "adcx %%r11, %%rbx ;" - "mulx 56(%2), %%r10, %%r15; " /* C[1]*D[3] */ - "adox %%r10, %%r13 ;" - "adcx %%r13, %%rcx ;" - /******************************************/ - "adox %%r14, %%r15 ;" - "adcx %%r14, %%r15 ;" - - "movq 48(%1), %%rdx; " /* C[2] */ - "xorl %%r10d, %%r10d ;" - "mulx 32(%2), %%r8, %%r9; " /* C[2]*D[0] */ - "adox %%rax, %%r8 ;" - "movq %%r8, 80(%0);" - "mulx 40(%2), %%r10, %%r11; " /* C[2]*D[1] */ - "adox %%r10, %%r9 ;" - "adcx %%r9, %%rbx ;" - "mulx 48(%2), %%r8, %%r13; " /* C[2]*D[2] */ - "adox %%r8, %%r11 ;" - "adcx %%r11, %%rcx ;" - "mulx 56(%2), %%r10, %%rax; " /* C[2]*D[3] */ - "adox %%r10, %%r13 ;" - "adcx %%r13, %%r15 ;" - /******************************************/ - "adox %%r14, %%rax ;" - "adcx %%r14, %%rax ;" - - "movq 56(%1), %%rdx; " /* C[3] */ - "xorl %%r10d, %%r10d ;" - "mulx 32(%2), %%r8, %%r9; " /* C[3]*D[0] */ - "adox %%rbx, %%r8 ;" - "movq %%r8, 88(%0);" - "mulx 40(%2), %%r10, %%r11; " /* C[3]*D[1] */ - "adox %%r10, %%r9 ;" - "adcx %%r9, %%rcx ;" - "movq %%rcx, 96(%0) ;" - "mulx 48(%2), %%r8, %%r13; " /* C[3]*D[2] */ - "adox %%r8, %%r11 ;" - "adcx %%r11, %%r15 ;" - "movq %%r15, 104(%0) ;" - "mulx 56(%2), %%r10, %%rbx; " /* C[3]*D[3] */ - "adox %%r10, %%r13 ;" - "adcx %%r13, %%rax ;" - "movq %%rax, 112(%0) ;" - /******************************************/ - "adox %%r14, %%rbx ;" - "adcx %%r14, %%rbx ;" - "movq %%rbx, 120(%0) ;" - : - : "r"(c), "r"(a), "r"(b) - : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", - "%r10", "%r11", "%r13", "%r14", "%r15"); -} - -static void mul2_256x256_integer_bmi2(u64 *const c, const u64 *const a, - const u64 *const b) +static __always_inline u64 eq_mask(u64 a, u64 b) { - asm volatile( - "movq (%1), %%rdx; " /* A[0] */ - "mulx (%2), %%r8, %%r15; " /* A[0]*B[0] */ - "movq %%r8, (%0) ;" - "mulx 8(%2), %%r10, %%rax; " /* A[0]*B[1] */ - "addq %%r10, %%r15 ;" - "mulx 16(%2), %%r8, %%rbx; " /* A[0]*B[2] */ - "adcq %%r8, %%rax ;" - "mulx 24(%2), %%r10, %%rcx; " /* A[0]*B[3] */ - "adcq %%r10, %%rbx ;" - /******************************************/ - "adcq $0, %%rcx ;" - - "movq 8(%1), %%rdx; " /* A[1] */ - "mulx (%2), %%r8, %%r9; " /* A[1]*B[0] */ - "addq %%r15, %%r8 ;" - "movq %%r8, 8(%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[1]*B[1] */ - "adcq %%r10, %%r9 ;" - "mulx 16(%2), %%r8, %%r13; " /* A[1]*B[2] */ - "adcq %%r8, %%r11 ;" - "mulx 24(%2), %%r10, %%r15; " /* A[1]*B[3] */ - "adcq %%r10, %%r13 ;" - /******************************************/ - "adcq $0, %%r15 ;" - - "addq %%r9, %%rax ;" - "adcq %%r11, %%rbx ;" - "adcq %%r13, %%rcx ;" - "adcq $0, %%r15 ;" - - "movq 16(%1), %%rdx; " /* A[2] */ - "mulx (%2), %%r8, %%r9; " /* A[2]*B[0] */ - "addq %%rax, %%r8 ;" - "movq %%r8, 16(%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[2]*B[1] */ - "adcq %%r10, %%r9 ;" - "mulx 16(%2), %%r8, %%r13; " /* A[2]*B[2] */ - "adcq %%r8, %%r11 ;" - "mulx 24(%2), %%r10, %%rax; " /* A[2]*B[3] */ - "adcq %%r10, %%r13 ;" - /******************************************/ - "adcq $0, %%rax ;" - - "addq %%r9, %%rbx ;" - "adcq %%r11, %%rcx ;" - "adcq %%r13, %%r15 ;" - "adcq $0, %%rax ;" - - "movq 24(%1), %%rdx; " /* A[3] */ - "mulx (%2), %%r8, %%r9; " /* A[3]*B[0] */ - "addq %%rbx, %%r8 ;" - "movq %%r8, 24(%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[3]*B[1] */ - "adcq %%r10, %%r9 ;" - "mulx 16(%2), %%r8, %%r13; " /* A[3]*B[2] */ - "adcq %%r8, %%r11 ;" - "mulx 24(%2), %%r10, %%rbx; " /* A[3]*B[3] */ - "adcq %%r10, %%r13 ;" - /******************************************/ - "adcq $0, %%rbx ;" - - "addq %%r9, %%rcx ;" - "movq %%rcx, 32(%0) ;" - "adcq %%r11, %%r15 ;" - "movq %%r15, 40(%0) ;" - "adcq %%r13, %%rax ;" - "movq %%rax, 48(%0) ;" - "adcq $0, %%rbx ;" - "movq %%rbx, 56(%0) ;" - - "movq 32(%1), %%rdx; " /* C[0] */ - "mulx 32(%2), %%r8, %%r15; " /* C[0]*D[0] */ - "movq %%r8, 64(%0) ;" - "mulx 40(%2), %%r10, %%rax; " /* C[0]*D[1] */ - "addq %%r10, %%r15 ;" - "mulx 48(%2), %%r8, %%rbx; " /* C[0]*D[2] */ - "adcq %%r8, %%rax ;" - "mulx 56(%2), %%r10, %%rcx; " /* C[0]*D[3] */ - "adcq %%r10, %%rbx ;" - /******************************************/ - "adcq $0, %%rcx ;" - - "movq 40(%1), %%rdx; " /* C[1] */ - "mulx 32(%2), %%r8, %%r9; " /* C[1]*D[0] */ - "addq %%r15, %%r8 ;" - "movq %%r8, 72(%0) ;" - "mulx 40(%2), %%r10, %%r11; " /* C[1]*D[1] */ - "adcq %%r10, %%r9 ;" - "mulx 48(%2), %%r8, %%r13; " /* C[1]*D[2] */ - "adcq %%r8, %%r11 ;" - "mulx 56(%2), %%r10, %%r15; " /* C[1]*D[3] */ - "adcq %%r10, %%r13 ;" - /******************************************/ - "adcq $0, %%r15 ;" - - "addq %%r9, %%rax ;" - "adcq %%r11, %%rbx ;" - "adcq %%r13, %%rcx ;" - "adcq $0, %%r15 ;" - - "movq 48(%1), %%rdx; " /* C[2] */ - "mulx 32(%2), %%r8, %%r9; " /* C[2]*D[0] */ - "addq %%rax, %%r8 ;" - "movq %%r8, 80(%0) ;" - "mulx 40(%2), %%r10, %%r11; " /* C[2]*D[1] */ - "adcq %%r10, %%r9 ;" - "mulx 48(%2), %%r8, %%r13; " /* C[2]*D[2] */ - "adcq %%r8, %%r11 ;" - "mulx 56(%2), %%r10, %%rax; " /* C[2]*D[3] */ - "adcq %%r10, %%r13 ;" - /******************************************/ - "adcq $0, %%rax ;" - - "addq %%r9, %%rbx ;" - "adcq %%r11, %%rcx ;" - "adcq %%r13, %%r15 ;" - "adcq $0, %%rax ;" - - "movq 56(%1), %%rdx; " /* C[3] */ - "mulx 32(%2), %%r8, %%r9; " /* C[3]*D[0] */ - "addq %%rbx, %%r8 ;" - "movq %%r8, 88(%0) ;" - "mulx 40(%2), %%r10, %%r11; " /* C[3]*D[1] */ - "adcq %%r10, %%r9 ;" - "mulx 48(%2), %%r8, %%r13; " /* C[3]*D[2] */ - "adcq %%r8, %%r11 ;" - "mulx 56(%2), %%r10, %%rbx; " /* C[3]*D[3] */ - "adcq %%r10, %%r13 ;" - /******************************************/ - "adcq $0, %%rbx ;" - - "addq %%r9, %%rcx ;" - "movq %%rcx, 96(%0) ;" - "adcq %%r11, %%r15 ;" - "movq %%r15, 104(%0) ;" - "adcq %%r13, %%rax ;" - "movq %%rax, 112(%0) ;" - "adcq $0, %%rbx ;" - "movq %%rbx, 120(%0) ;" - : - : "r"(c), "r"(a), "r"(b) - : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", - "%r10", "%r11", "%r13", "%r15"); + u64 x = a ^ b; + u64 minus_x = ~x + (u64)1U; + u64 x_or_minus_x = x | minus_x; + u64 xnx = x_or_minus_x >> (u32)63U; + return xnx - (u64)1U; } -static void sqr2_256x256_integer_adx(u64 *const c, const u64 *const a) +static __always_inline u64 gte_mask(u64 a, u64 b) { - asm volatile( - "movq (%1), %%rdx ;" /* A[0] */ - "mulx 8(%1), %%r8, %%r14 ;" /* A[1]*A[0] */ - "xorl %%r15d, %%r15d;" - "mulx 16(%1), %%r9, %%r10 ;" /* A[2]*A[0] */ - "adcx %%r14, %%r9 ;" - "mulx 24(%1), %%rax, %%rcx ;" /* A[3]*A[0] */ - "adcx %%rax, %%r10 ;" - "movq 24(%1), %%rdx ;" /* A[3] */ - "mulx 8(%1), %%r11, %%rbx ;" /* A[1]*A[3] */ - "adcx %%rcx, %%r11 ;" - "mulx 16(%1), %%rax, %%r13 ;" /* A[2]*A[3] */ - "adcx %%rax, %%rbx ;" - "movq 8(%1), %%rdx ;" /* A[1] */ - "adcx %%r15, %%r13 ;" - "mulx 16(%1), %%rax, %%rcx ;" /* A[2]*A[1] */ - "movq $0, %%r14 ;" - /******************************************/ - "adcx %%r15, %%r14 ;" - - "xorl %%r15d, %%r15d;" - "adox %%rax, %%r10 ;" - "adcx %%r8, %%r8 ;" - "adox %%rcx, %%r11 ;" - "adcx %%r9, %%r9 ;" - "adox %%r15, %%rbx ;" - "adcx %%r10, %%r10 ;" - "adox %%r15, %%r13 ;" - "adcx %%r11, %%r11 ;" - "adox %%r15, %%r14 ;" - "adcx %%rbx, %%rbx ;" - "adcx %%r13, %%r13 ;" - "adcx %%r14, %%r14 ;" - - "movq (%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* A[0]^2 */ - /*******************/ - "movq %%rax, 0(%0) ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, 8(%0) ;" - "movq 8(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* A[1]^2 */ - "adcq %%rax, %%r9 ;" - "movq %%r9, 16(%0) ;" - "adcq %%rcx, %%r10 ;" - "movq %%r10, 24(%0) ;" - "movq 16(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* A[2]^2 */ - "adcq %%rax, %%r11 ;" - "movq %%r11, 32(%0) ;" - "adcq %%rcx, %%rbx ;" - "movq %%rbx, 40(%0) ;" - "movq 24(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* A[3]^2 */ - "adcq %%rax, %%r13 ;" - "movq %%r13, 48(%0) ;" - "adcq %%rcx, %%r14 ;" - "movq %%r14, 56(%0) ;" - - - "movq 32(%1), %%rdx ;" /* B[0] */ - "mulx 40(%1), %%r8, %%r14 ;" /* B[1]*B[0] */ - "xorl %%r15d, %%r15d;" - "mulx 48(%1), %%r9, %%r10 ;" /* B[2]*B[0] */ - "adcx %%r14, %%r9 ;" - "mulx 56(%1), %%rax, %%rcx ;" /* B[3]*B[0] */ - "adcx %%rax, %%r10 ;" - "movq 56(%1), %%rdx ;" /* B[3] */ - "mulx 40(%1), %%r11, %%rbx ;" /* B[1]*B[3] */ - "adcx %%rcx, %%r11 ;" - "mulx 48(%1), %%rax, %%r13 ;" /* B[2]*B[3] */ - "adcx %%rax, %%rbx ;" - "movq 40(%1), %%rdx ;" /* B[1] */ - "adcx %%r15, %%r13 ;" - "mulx 48(%1), %%rax, %%rcx ;" /* B[2]*B[1] */ - "movq $0, %%r14 ;" - /******************************************/ - "adcx %%r15, %%r14 ;" - - "xorl %%r15d, %%r15d;" - "adox %%rax, %%r10 ;" - "adcx %%r8, %%r8 ;" - "adox %%rcx, %%r11 ;" - "adcx %%r9, %%r9 ;" - "adox %%r15, %%rbx ;" - "adcx %%r10, %%r10 ;" - "adox %%r15, %%r13 ;" - "adcx %%r11, %%r11 ;" - "adox %%r15, %%r14 ;" - "adcx %%rbx, %%rbx ;" - "adcx %%r13, %%r13 ;" - "adcx %%r14, %%r14 ;" - - "movq 32(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* B[0]^2 */ - /*******************/ - "movq %%rax, 64(%0) ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, 72(%0) ;" - "movq 40(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* B[1]^2 */ - "adcq %%rax, %%r9 ;" - "movq %%r9, 80(%0) ;" - "adcq %%rcx, %%r10 ;" - "movq %%r10, 88(%0) ;" - "movq 48(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* B[2]^2 */ - "adcq %%rax, %%r11 ;" - "movq %%r11, 96(%0) ;" - "adcq %%rcx, %%rbx ;" - "movq %%rbx, 104(%0) ;" - "movq 56(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* B[3]^2 */ - "adcq %%rax, %%r13 ;" - "movq %%r13, 112(%0) ;" - "adcq %%rcx, %%r14 ;" - "movq %%r14, 120(%0) ;" - : - : "r"(c), "r"(a) - : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", - "%r10", "%r11", "%r13", "%r14", "%r15"); + u64 x = a; + u64 y = b; + u64 x_xor_y = x ^ y; + u64 x_sub_y = x - y; + u64 x_sub_y_xor_y = x_sub_y ^ y; + u64 q = x_xor_y | x_sub_y_xor_y; + u64 x_xor_q = x ^ q; + u64 x_xor_q_ = x_xor_q >> (u32)63U; + return x_xor_q_ - (u64)1U; } -static void sqr2_256x256_integer_bmi2(u64 *const c, const u64 *const a) +/* Computes the addition of four-element f1 with value in f2 + * and returns the carry (if any) */ +static inline u64 add_scalar(u64 *out, const u64 *f1, u64 f2) { - asm volatile( - "movq 8(%1), %%rdx ;" /* A[1] */ - "mulx (%1), %%r8, %%r9 ;" /* A[0]*A[1] */ - "mulx 16(%1), %%r10, %%r11 ;" /* A[2]*A[1] */ - "mulx 24(%1), %%rcx, %%r14 ;" /* A[3]*A[1] */ - - "movq 16(%1), %%rdx ;" /* A[2] */ - "mulx 24(%1), %%r15, %%r13 ;" /* A[3]*A[2] */ - "mulx (%1), %%rax, %%rdx ;" /* A[0]*A[2] */ - - "addq %%rax, %%r9 ;" - "adcq %%rdx, %%r10 ;" - "adcq %%rcx, %%r11 ;" - "adcq %%r14, %%r15 ;" - "adcq $0, %%r13 ;" - "movq $0, %%r14 ;" - "adcq $0, %%r14 ;" - - "movq (%1), %%rdx ;" /* A[0] */ - "mulx 24(%1), %%rax, %%rcx ;" /* A[0]*A[3] */ - - "addq %%rax, %%r10 ;" - "adcq %%rcx, %%r11 ;" - "adcq $0, %%r15 ;" - "adcq $0, %%r13 ;" - "adcq $0, %%r14 ;" - - "shldq $1, %%r13, %%r14 ;" - "shldq $1, %%r15, %%r13 ;" - "shldq $1, %%r11, %%r15 ;" - "shldq $1, %%r10, %%r11 ;" - "shldq $1, %%r9, %%r10 ;" - "shldq $1, %%r8, %%r9 ;" - "shlq $1, %%r8 ;" - - /*******************/ - "mulx %%rdx, %%rax, %%rcx ; " /* A[0]^2 */ - /*******************/ - "movq %%rax, 0(%0) ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, 8(%0) ;" - "movq 8(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ; " /* A[1]^2 */ - "adcq %%rax, %%r9 ;" - "movq %%r9, 16(%0) ;" - "adcq %%rcx, %%r10 ;" - "movq %%r10, 24(%0) ;" - "movq 16(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ; " /* A[2]^2 */ - "adcq %%rax, %%r11 ;" - "movq %%r11, 32(%0) ;" - "adcq %%rcx, %%r15 ;" - "movq %%r15, 40(%0) ;" - "movq 24(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ; " /* A[3]^2 */ - "adcq %%rax, %%r13 ;" - "movq %%r13, 48(%0) ;" - "adcq %%rcx, %%r14 ;" - "movq %%r14, 56(%0) ;" - - "movq 40(%1), %%rdx ;" /* B[1] */ - "mulx 32(%1), %%r8, %%r9 ;" /* B[0]*B[1] */ - "mulx 48(%1), %%r10, %%r11 ;" /* B[2]*B[1] */ - "mulx 56(%1), %%rcx, %%r14 ;" /* B[3]*B[1] */ - - "movq 48(%1), %%rdx ;" /* B[2] */ - "mulx 56(%1), %%r15, %%r13 ;" /* B[3]*B[2] */ - "mulx 32(%1), %%rax, %%rdx ;" /* B[0]*B[2] */ - - "addq %%rax, %%r9 ;" - "adcq %%rdx, %%r10 ;" - "adcq %%rcx, %%r11 ;" - "adcq %%r14, %%r15 ;" - "adcq $0, %%r13 ;" - "movq $0, %%r14 ;" - "adcq $0, %%r14 ;" - - "movq 32(%1), %%rdx ;" /* B[0] */ - "mulx 56(%1), %%rax, %%rcx ;" /* B[0]*B[3] */ - - "addq %%rax, %%r10 ;" - "adcq %%rcx, %%r11 ;" - "adcq $0, %%r15 ;" - "adcq $0, %%r13 ;" - "adcq $0, %%r14 ;" - - "shldq $1, %%r13, %%r14 ;" - "shldq $1, %%r15, %%r13 ;" - "shldq $1, %%r11, %%r15 ;" - "shldq $1, %%r10, %%r11 ;" - "shldq $1, %%r9, %%r10 ;" - "shldq $1, %%r8, %%r9 ;" - "shlq $1, %%r8 ;" - - /*******************/ - "mulx %%rdx, %%rax, %%rcx ; " /* B[0]^2 */ - /*******************/ - "movq %%rax, 64(%0) ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, 72(%0) ;" - "movq 40(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ; " /* B[1]^2 */ - "adcq %%rax, %%r9 ;" - "movq %%r9, 80(%0) ;" - "adcq %%rcx, %%r10 ;" - "movq %%r10, 88(%0) ;" - "movq 48(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ; " /* B[2]^2 */ - "adcq %%rax, %%r11 ;" - "movq %%r11, 96(%0) ;" - "adcq %%rcx, %%r15 ;" - "movq %%r15, 104(%0) ;" - "movq 56(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ; " /* B[3]^2 */ - "adcq %%rax, %%r13 ;" - "movq %%r13, 112(%0) ;" - "adcq %%rcx, %%r14 ;" - "movq %%r14, 120(%0) ;" - : - : "r"(c), "r"(a) - : "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", - "%r11", "%r13", "%r14", "%r15"); -} + u64 carry_r; -static void red_eltfp25519_2w_adx(u64 *const c, const u64 *const a) -{ asm volatile( - "movl $38, %%edx; " /* 2*c = 38 = 2^256 */ - "mulx 32(%1), %%r8, %%r10; " /* c*C[4] */ - "xorl %%ebx, %%ebx ;" - "adox (%1), %%r8 ;" - "mulx 40(%1), %%r9, %%r11; " /* c*C[5] */ - "adcx %%r10, %%r9 ;" - "adox 8(%1), %%r9 ;" - "mulx 48(%1), %%r10, %%rax; " /* c*C[6] */ - "adcx %%r11, %%r10 ;" - "adox 16(%1), %%r10 ;" - "mulx 56(%1), %%r11, %%rcx; " /* c*C[7] */ - "adcx %%rax, %%r11 ;" - "adox 24(%1), %%r11 ;" - /***************************************/ - "adcx %%rbx, %%rcx ;" - "adox %%rbx, %%rcx ;" - "imul %%rdx, %%rcx ;" /* c*C[4], cf=0, of=0 */ - "adcx %%rcx, %%r8 ;" - "adcx %%rbx, %%r9 ;" - "movq %%r9, 8(%0) ;" - "adcx %%rbx, %%r10 ;" - "movq %%r10, 16(%0) ;" - "adcx %%rbx, %%r11 ;" - "movq %%r11, 24(%0) ;" - "mov $0, %%ecx ;" - "cmovc %%edx, %%ecx ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, (%0) ;" - - "mulx 96(%1), %%r8, %%r10; " /* c*C[4] */ - "xorl %%ebx, %%ebx ;" - "adox 64(%1), %%r8 ;" - "mulx 104(%1), %%r9, %%r11; " /* c*C[5] */ - "adcx %%r10, %%r9 ;" - "adox 72(%1), %%r9 ;" - "mulx 112(%1), %%r10, %%rax; " /* c*C[6] */ - "adcx %%r11, %%r10 ;" - "adox 80(%1), %%r10 ;" - "mulx 120(%1), %%r11, %%rcx; " /* c*C[7] */ - "adcx %%rax, %%r11 ;" - "adox 88(%1), %%r11 ;" - /****************************************/ - "adcx %%rbx, %%rcx ;" - "adox %%rbx, %%rcx ;" - "imul %%rdx, %%rcx ;" /* c*C[4], cf=0, of=0 */ - "adcx %%rcx, %%r8 ;" - "adcx %%rbx, %%r9 ;" - "movq %%r9, 40(%0) ;" - "adcx %%rbx, %%r10 ;" - "movq %%r10, 48(%0) ;" - "adcx %%rbx, %%r11 ;" - "movq %%r11, 56(%0) ;" - "mov $0, %%ecx ;" - "cmovc %%edx, %%ecx ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, 32(%0) ;" - : - : "r"(c), "r"(a) - : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", - "%r10", "%r11"); -} + /* Clear registers to propagate the carry bit */ + " xor %%r8, %%r8;" + " xor %%r9, %%r9;" + " xor %%r10, %%r10;" + " xor %%r11, %%r11;" + " xor %1, %1;" + + /* Begin addition chain */ + " addq 0(%3), %0;" + " movq %0, 0(%2);" + " adcxq 8(%3), %%r8;" + " movq %%r8, 8(%2);" + " adcxq 16(%3), %%r9;" + " movq %%r9, 16(%2);" + " adcxq 24(%3), %%r10;" + " movq %%r10, 24(%2);" + + /* Return the carry bit in a register */ + " adcx %%r11, %1;" + : "+&r" (f2), "=&r" (carry_r) + : "r" (out), "r" (f1) + : "%r8", "%r9", "%r10", "%r11", "memory", "cc" + ); -static void red_eltfp25519_2w_bmi2(u64 *const c, const u64 *const a) -{ - asm volatile( - "movl $38, %%edx ; " /* 2*c = 38 = 2^256 */ - "mulx 32(%1), %%r8, %%r10 ;" /* c*C[4] */ - "mulx 40(%1), %%r9, %%r11 ;" /* c*C[5] */ - "addq %%r10, %%r9 ;" - "mulx 48(%1), %%r10, %%rax ;" /* c*C[6] */ - "adcq %%r11, %%r10 ;" - "mulx 56(%1), %%r11, %%rcx ;" /* c*C[7] */ - "adcq %%rax, %%r11 ;" - /***************************************/ - "adcq $0, %%rcx ;" - "addq (%1), %%r8 ;" - "adcq 8(%1), %%r9 ;" - "adcq 16(%1), %%r10 ;" - "adcq 24(%1), %%r11 ;" - "adcq $0, %%rcx ;" - "imul %%rdx, %%rcx ;" /* c*C[4], cf=0 */ - "addq %%rcx, %%r8 ;" - "adcq $0, %%r9 ;" - "movq %%r9, 8(%0) ;" - "adcq $0, %%r10 ;" - "movq %%r10, 16(%0) ;" - "adcq $0, %%r11 ;" - "movq %%r11, 24(%0) ;" - "mov $0, %%ecx ;" - "cmovc %%edx, %%ecx ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, (%0) ;" - - "mulx 96(%1), %%r8, %%r10 ;" /* c*C[4] */ - "mulx 104(%1), %%r9, %%r11 ;" /* c*C[5] */ - "addq %%r10, %%r9 ;" - "mulx 112(%1), %%r10, %%rax ;" /* c*C[6] */ - "adcq %%r11, %%r10 ;" - "mulx 120(%1), %%r11, %%rcx ;" /* c*C[7] */ - "adcq %%rax, %%r11 ;" - /****************************************/ - "adcq $0, %%rcx ;" - "addq 64(%1), %%r8 ;" - "adcq 72(%1), %%r9 ;" - "adcq 80(%1), %%r10 ;" - "adcq 88(%1), %%r11 ;" - "adcq $0, %%rcx ;" - "imul %%rdx, %%rcx ;" /* c*C[4], cf=0 */ - "addq %%rcx, %%r8 ;" - "adcq $0, %%r9 ;" - "movq %%r9, 40(%0) ;" - "adcq $0, %%r10 ;" - "movq %%r10, 48(%0) ;" - "adcq $0, %%r11 ;" - "movq %%r11, 56(%0) ;" - "mov $0, %%ecx ;" - "cmovc %%edx, %%ecx ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, 32(%0) ;" - : - : "r"(c), "r"(a) - : "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", - "%r11"); + return carry_r; } -static void mul_256x256_integer_adx(u64 *const c, const u64 *const a, - const u64 *const b) +/* Computes the field addition of two field elements */ +static inline void fadd(u64 *out, const u64 *f1, const u64 *f2) { asm volatile( - "movq (%1), %%rdx; " /* A[0] */ - "mulx (%2), %%r8, %%r9; " /* A[0]*B[0] */ - "xorl %%r10d, %%r10d ;" - "movq %%r8, (%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[0]*B[1] */ - "adox %%r9, %%r10 ;" - "movq %%r10, 8(%0) ;" - "mulx 16(%2), %%r15, %%r13; " /* A[0]*B[2] */ - "adox %%r11, %%r15 ;" - "mulx 24(%2), %%r14, %%rdx; " /* A[0]*B[3] */ - "adox %%r13, %%r14 ;" - "movq $0, %%rax ;" - /******************************************/ - "adox %%rdx, %%rax ;" - - "movq 8(%1), %%rdx; " /* A[1] */ - "mulx (%2), %%r8, %%r9; " /* A[1]*B[0] */ - "xorl %%r10d, %%r10d ;" - "adcx 8(%0), %%r8 ;" - "movq %%r8, 8(%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[1]*B[1] */ - "adox %%r9, %%r10 ;" - "adcx %%r15, %%r10 ;" - "movq %%r10, 16(%0) ;" - "mulx 16(%2), %%r15, %%r13; " /* A[1]*B[2] */ - "adox %%r11, %%r15 ;" - "adcx %%r14, %%r15 ;" - "movq $0, %%r8 ;" - "mulx 24(%2), %%r14, %%rdx; " /* A[1]*B[3] */ - "adox %%r13, %%r14 ;" - "adcx %%rax, %%r14 ;" - "movq $0, %%rax ;" - /******************************************/ - "adox %%rdx, %%rax ;" - "adcx %%r8, %%rax ;" - - "movq 16(%1), %%rdx; " /* A[2] */ - "mulx (%2), %%r8, %%r9; " /* A[2]*B[0] */ - "xorl %%r10d, %%r10d ;" - "adcx 16(%0), %%r8 ;" - "movq %%r8, 16(%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[2]*B[1] */ - "adox %%r9, %%r10 ;" - "adcx %%r15, %%r10 ;" - "movq %%r10, 24(%0) ;" - "mulx 16(%2), %%r15, %%r13; " /* A[2]*B[2] */ - "adox %%r11, %%r15 ;" - "adcx %%r14, %%r15 ;" - "movq $0, %%r8 ;" - "mulx 24(%2), %%r14, %%rdx; " /* A[2]*B[3] */ - "adox %%r13, %%r14 ;" - "adcx %%rax, %%r14 ;" - "movq $0, %%rax ;" - /******************************************/ - "adox %%rdx, %%rax ;" - "adcx %%r8, %%rax ;" - - "movq 24(%1), %%rdx; " /* A[3] */ - "mulx (%2), %%r8, %%r9; " /* A[3]*B[0] */ - "xorl %%r10d, %%r10d ;" - "adcx 24(%0), %%r8 ;" - "movq %%r8, 24(%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[3]*B[1] */ - "adox %%r9, %%r10 ;" - "adcx %%r15, %%r10 ;" - "movq %%r10, 32(%0) ;" - "mulx 16(%2), %%r15, %%r13; " /* A[3]*B[2] */ - "adox %%r11, %%r15 ;" - "adcx %%r14, %%r15 ;" - "movq %%r15, 40(%0) ;" - "movq $0, %%r8 ;" - "mulx 24(%2), %%r14, %%rdx; " /* A[3]*B[3] */ - "adox %%r13, %%r14 ;" - "adcx %%rax, %%r14 ;" - "movq %%r14, 48(%0) ;" - "movq $0, %%rax ;" - /******************************************/ - "adox %%rdx, %%rax ;" - "adcx %%r8, %%rax ;" - "movq %%rax, 56(%0) ;" - : - : "r"(c), "r"(a), "r"(b) - : "memory", "cc", "%rax", "%rdx", "%r8", "%r9", "%r10", "%r11", - "%r13", "%r14", "%r15"); + /* Compute the raw addition of f1 + f2 */ + " movq 0(%0), %%r8;" + " addq 0(%2), %%r8;" + " movq 8(%0), %%r9;" + " adcxq 8(%2), %%r9;" + " movq 16(%0), %%r10;" + " adcxq 16(%2), %%r10;" + " movq 24(%0), %%r11;" + " adcxq 24(%2), %%r11;" + + /* Wrap the result back into the field */ + + /* Step 1: Compute carry*38 */ + " mov $0, %%rax;" + " mov $38, %0;" + " cmovc %0, %%rax;" + + /* Step 2: Add carry*38 to the original sum */ + " xor %%rcx, %%rcx;" + " add %%rax, %%r8;" + " adcx %%rcx, %%r9;" + " movq %%r9, 8(%1);" + " adcx %%rcx, %%r10;" + " movq %%r10, 16(%1);" + " adcx %%rcx, %%r11;" + " movq %%r11, 24(%1);" + + /* Step 3: Fold the carry bit back in; guaranteed not to carry at this point */ + " mov $0, %%rax;" + " cmovc %0, %%rax;" + " add %%rax, %%r8;" + " movq %%r8, 0(%1);" + : "+&r" (f2) + : "r" (out), "r" (f1) + : "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11", "memory", "cc" + ); } -static void mul_256x256_integer_bmi2(u64 *const c, const u64 *const a, - const u64 *const b) +/* Computes the field substraction of two field elements */ +static inline void fsub(u64 *out, const u64 *f1, const u64 *f2) { asm volatile( - "movq (%1), %%rdx; " /* A[0] */ - "mulx (%2), %%r8, %%r15; " /* A[0]*B[0] */ - "movq %%r8, (%0) ;" - "mulx 8(%2), %%r10, %%rax; " /* A[0]*B[1] */ - "addq %%r10, %%r15 ;" - "mulx 16(%2), %%r8, %%rbx; " /* A[0]*B[2] */ - "adcq %%r8, %%rax ;" - "mulx 24(%2), %%r10, %%rcx; " /* A[0]*B[3] */ - "adcq %%r10, %%rbx ;" - /******************************************/ - "adcq $0, %%rcx ;" - - "movq 8(%1), %%rdx; " /* A[1] */ - "mulx (%2), %%r8, %%r9; " /* A[1]*B[0] */ - "addq %%r15, %%r8 ;" - "movq %%r8, 8(%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[1]*B[1] */ - "adcq %%r10, %%r9 ;" - "mulx 16(%2), %%r8, %%r13; " /* A[1]*B[2] */ - "adcq %%r8, %%r11 ;" - "mulx 24(%2), %%r10, %%r15; " /* A[1]*B[3] */ - "adcq %%r10, %%r13 ;" - /******************************************/ - "adcq $0, %%r15 ;" - - "addq %%r9, %%rax ;" - "adcq %%r11, %%rbx ;" - "adcq %%r13, %%rcx ;" - "adcq $0, %%r15 ;" - - "movq 16(%1), %%rdx; " /* A[2] */ - "mulx (%2), %%r8, %%r9; " /* A[2]*B[0] */ - "addq %%rax, %%r8 ;" - "movq %%r8, 16(%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[2]*B[1] */ - "adcq %%r10, %%r9 ;" - "mulx 16(%2), %%r8, %%r13; " /* A[2]*B[2] */ - "adcq %%r8, %%r11 ;" - "mulx 24(%2), %%r10, %%rax; " /* A[2]*B[3] */ - "adcq %%r10, %%r13 ;" - /******************************************/ - "adcq $0, %%rax ;" - - "addq %%r9, %%rbx ;" - "adcq %%r11, %%rcx ;" - "adcq %%r13, %%r15 ;" - "adcq $0, %%rax ;" - - "movq 24(%1), %%rdx; " /* A[3] */ - "mulx (%2), %%r8, %%r9; " /* A[3]*B[0] */ - "addq %%rbx, %%r8 ;" - "movq %%r8, 24(%0) ;" - "mulx 8(%2), %%r10, %%r11; " /* A[3]*B[1] */ - "adcq %%r10, %%r9 ;" - "mulx 16(%2), %%r8, %%r13; " /* A[3]*B[2] */ - "adcq %%r8, %%r11 ;" - "mulx 24(%2), %%r10, %%rbx; " /* A[3]*B[3] */ - "adcq %%r10, %%r13 ;" - /******************************************/ - "adcq $0, %%rbx ;" - - "addq %%r9, %%rcx ;" - "movq %%rcx, 32(%0) ;" - "adcq %%r11, %%r15 ;" - "movq %%r15, 40(%0) ;" - "adcq %%r13, %%rax ;" - "movq %%rax, 48(%0) ;" - "adcq $0, %%rbx ;" - "movq %%rbx, 56(%0) ;" - : - : "r"(c), "r"(a), "r"(b) - : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", - "%r10", "%r11", "%r13", "%r15"); + /* Compute the raw substraction of f1-f2 */ + " movq 0(%1), %%r8;" + " subq 0(%2), %%r8;" + " movq 8(%1), %%r9;" + " sbbq 8(%2), %%r9;" + " movq 16(%1), %%r10;" + " sbbq 16(%2), %%r10;" + " movq 24(%1), %%r11;" + " sbbq 24(%2), %%r11;" + + /* Wrap the result back into the field */ + + /* Step 1: Compute carry*38 */ + " mov $0, %%rax;" + " mov $38, %%rcx;" + " cmovc %%rcx, %%rax;" + + /* Step 2: Substract carry*38 from the original difference */ + " sub %%rax, %%r8;" + " sbb $0, %%r9;" + " sbb $0, %%r10;" + " sbb $0, %%r11;" + + /* Step 3: Fold the carry bit back in; guaranteed not to carry at this point */ + " mov $0, %%rax;" + " cmovc %%rcx, %%rax;" + " sub %%rax, %%r8;" + + /* Store the result */ + " movq %%r8, 0(%0);" + " movq %%r9, 8(%0);" + " movq %%r10, 16(%0);" + " movq %%r11, 24(%0);" + : + : "r" (out), "r" (f1), "r" (f2) + : "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11", "memory", "cc" + ); } -static void sqr_256x256_integer_adx(u64 *const c, const u64 *const a) +/* Computes a field multiplication: out <- f1 * f2 + * Uses the 8-element buffer tmp for intermediate results */ +static inline void fmul(u64 *out, const u64 *f1, const u64 *f2, u64 *tmp) { asm volatile( - "movq (%1), %%rdx ;" /* A[0] */ - "mulx 8(%1), %%r8, %%r14 ;" /* A[1]*A[0] */ - "xorl %%r15d, %%r15d;" - "mulx 16(%1), %%r9, %%r10 ;" /* A[2]*A[0] */ - "adcx %%r14, %%r9 ;" - "mulx 24(%1), %%rax, %%rcx ;" /* A[3]*A[0] */ - "adcx %%rax, %%r10 ;" - "movq 24(%1), %%rdx ;" /* A[3] */ - "mulx 8(%1), %%r11, %%rbx ;" /* A[1]*A[3] */ - "adcx %%rcx, %%r11 ;" - "mulx 16(%1), %%rax, %%r13 ;" /* A[2]*A[3] */ - "adcx %%rax, %%rbx ;" - "movq 8(%1), %%rdx ;" /* A[1] */ - "adcx %%r15, %%r13 ;" - "mulx 16(%1), %%rax, %%rcx ;" /* A[2]*A[1] */ - "movq $0, %%r14 ;" - /******************************************/ - "adcx %%r15, %%r14 ;" - - "xorl %%r15d, %%r15d;" - "adox %%rax, %%r10 ;" - "adcx %%r8, %%r8 ;" - "adox %%rcx, %%r11 ;" - "adcx %%r9, %%r9 ;" - "adox %%r15, %%rbx ;" - "adcx %%r10, %%r10 ;" - "adox %%r15, %%r13 ;" - "adcx %%r11, %%r11 ;" - "adox %%r15, %%r14 ;" - "adcx %%rbx, %%rbx ;" - "adcx %%r13, %%r13 ;" - "adcx %%r14, %%r14 ;" - - "movq (%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* A[0]^2 */ - /*******************/ - "movq %%rax, 0(%0) ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, 8(%0) ;" - "movq 8(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* A[1]^2 */ - "adcq %%rax, %%r9 ;" - "movq %%r9, 16(%0) ;" - "adcq %%rcx, %%r10 ;" - "movq %%r10, 24(%0) ;" - "movq 16(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* A[2]^2 */ - "adcq %%rax, %%r11 ;" - "movq %%r11, 32(%0) ;" - "adcq %%rcx, %%rbx ;" - "movq %%rbx, 40(%0) ;" - "movq 24(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* A[3]^2 */ - "adcq %%rax, %%r13 ;" - "movq %%r13, 48(%0) ;" - "adcq %%rcx, %%r14 ;" - "movq %%r14, 56(%0) ;" - : - : "r"(c), "r"(a) - : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", - "%r10", "%r11", "%r13", "%r14", "%r15"); + /* Compute the raw multiplication: tmp <- src1 * src2 */ + + /* Compute src1[0] * src2 */ + " movq 0(%1), %%rdx;" + " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " movq %%r8, 0(%0);" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " movq %%r10, 8(%0);" + " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" + " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " mov $0, %%rax;" + " adox %%rdx, %%rax;" + /* Compute src1[1] * src2 */ + " movq 8(%1), %%rdx;" + " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 8(%0), %%r8;" " movq %%r8, 8(%0);" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 16(%0);" + " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " mov $0, %%r8;" + " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " mov $0, %%rax;" + " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" + /* Compute src1[2] * src2 */ + " movq 16(%1), %%rdx;" + " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 16(%0), %%r8;" " movq %%r8, 16(%0);" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 24(%0);" + " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " mov $0, %%r8;" + " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " mov $0, %%rax;" + " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" + /* Compute src1[3] * src2 */ + " movq 24(%1), %%rdx;" + " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 24(%0), %%r8;" " movq %%r8, 24(%0);" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 32(%0);" + " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " movq %%r12, 40(%0);" " mov $0, %%r8;" + " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " movq %%r14, 48(%0);" " mov $0, %%rax;" + " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" " movq %%rax, 56(%0);" + /* Line up pointers */ + " mov %0, %1;" + " mov %2, %0;" + + /* Wrap the result back into the field */ + + /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */ + " mov $38, %%rdx;" + " mulxq 32(%1), %%r8, %%r13;" + " xor %3, %3;" + " adoxq 0(%1), %%r8;" + " mulxq 40(%1), %%r9, %%r12;" + " adcx %%r13, %%r9;" + " adoxq 8(%1), %%r9;" + " mulxq 48(%1), %%r10, %%r13;" + " adcx %%r12, %%r10;" + " adoxq 16(%1), %%r10;" + " mulxq 56(%1), %%r11, %%rax;" + " adcx %%r13, %%r11;" + " adoxq 24(%1), %%r11;" + " adcx %3, %%rax;" + " adox %3, %%rax;" + " imul %%rdx, %%rax;" + + /* Step 2: Fold the carry back into dst */ + " add %%rax, %%r8;" + " adcx %3, %%r9;" + " movq %%r9, 8(%0);" + " adcx %3, %%r10;" + " movq %%r10, 16(%0);" + " adcx %3, %%r11;" + " movq %%r11, 24(%0);" + + /* Step 3: Fold the carry bit back in; guaranteed not to carry at this point */ + " mov $0, %%rax;" + " cmovc %%rdx, %%rax;" + " add %%rax, %%r8;" + " movq %%r8, 0(%0);" + : "+&r" (tmp), "+&r" (f1), "+&r" (out), "+&r" (f2) + : + : "%rax", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "memory", "cc" + ); } -static void sqr_256x256_integer_bmi2(u64 *const c, const u64 *const a) +/* Computes two field multiplications: + * out[0] <- f1[0] * f2[0] + * out[1] <- f1[1] * f2[1] + * Uses the 16-element buffer tmp for intermediate results. */ +static inline void fmul2(u64 *out, const u64 *f1, const u64 *f2, u64 *tmp) { asm volatile( - "movq 8(%1), %%rdx ;" /* A[1] */ - "mulx (%1), %%r8, %%r9 ;" /* A[0]*A[1] */ - "mulx 16(%1), %%r10, %%r11 ;" /* A[2]*A[1] */ - "mulx 24(%1), %%rcx, %%r14 ;" /* A[3]*A[1] */ - - "movq 16(%1), %%rdx ;" /* A[2] */ - "mulx 24(%1), %%r15, %%r13 ;" /* A[3]*A[2] */ - "mulx (%1), %%rax, %%rdx ;" /* A[0]*A[2] */ - - "addq %%rax, %%r9 ;" - "adcq %%rdx, %%r10 ;" - "adcq %%rcx, %%r11 ;" - "adcq %%r14, %%r15 ;" - "adcq $0, %%r13 ;" - "movq $0, %%r14 ;" - "adcq $0, %%r14 ;" - - "movq (%1), %%rdx ;" /* A[0] */ - "mulx 24(%1), %%rax, %%rcx ;" /* A[0]*A[3] */ - - "addq %%rax, %%r10 ;" - "adcq %%rcx, %%r11 ;" - "adcq $0, %%r15 ;" - "adcq $0, %%r13 ;" - "adcq $0, %%r14 ;" - - "shldq $1, %%r13, %%r14 ;" - "shldq $1, %%r15, %%r13 ;" - "shldq $1, %%r11, %%r15 ;" - "shldq $1, %%r10, %%r11 ;" - "shldq $1, %%r9, %%r10 ;" - "shldq $1, %%r8, %%r9 ;" - "shlq $1, %%r8 ;" - - /*******************/ - "mulx %%rdx, %%rax, %%rcx ;" /* A[0]^2 */ - /*******************/ - "movq %%rax, 0(%0) ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, 8(%0) ;" - "movq 8(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* A[1]^2 */ - "adcq %%rax, %%r9 ;" - "movq %%r9, 16(%0) ;" - "adcq %%rcx, %%r10 ;" - "movq %%r10, 24(%0) ;" - "movq 16(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* A[2]^2 */ - "adcq %%rax, %%r11 ;" - "movq %%r11, 32(%0) ;" - "adcq %%rcx, %%r15 ;" - "movq %%r15, 40(%0) ;" - "movq 24(%1), %%rdx ;" - "mulx %%rdx, %%rax, %%rcx ;" /* A[3]^2 */ - "adcq %%rax, %%r13 ;" - "movq %%r13, 48(%0) ;" - "adcq %%rcx, %%r14 ;" - "movq %%r14, 56(%0) ;" - : - : "r"(c), "r"(a) - : "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", - "%r11", "%r13", "%r14", "%r15"); + /* Compute the raw multiplication tmp[0] <- f1[0] * f2[0] */ + + /* Compute src1[0] * src2 */ + " movq 0(%1), %%rdx;" + " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " movq %%r8, 0(%0);" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " movq %%r10, 8(%0);" + " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" + " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " mov $0, %%rax;" + " adox %%rdx, %%rax;" + /* Compute src1[1] * src2 */ + " movq 8(%1), %%rdx;" + " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 8(%0), %%r8;" " movq %%r8, 8(%0);" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 16(%0);" + " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " mov $0, %%r8;" + " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " mov $0, %%rax;" + " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" + /* Compute src1[2] * src2 */ + " movq 16(%1), %%rdx;" + " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 16(%0), %%r8;" " movq %%r8, 16(%0);" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 24(%0);" + " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " mov $0, %%r8;" + " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " mov $0, %%rax;" + " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" + /* Compute src1[3] * src2 */ + " movq 24(%1), %%rdx;" + " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 24(%0), %%r8;" " movq %%r8, 24(%0);" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 32(%0);" + " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " movq %%r12, 40(%0);" " mov $0, %%r8;" + " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " movq %%r14, 48(%0);" " mov $0, %%rax;" + " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" " movq %%rax, 56(%0);" + + /* Compute the raw multiplication tmp[1] <- f1[1] * f2[1] */ + + /* Compute src1[0] * src2 */ + " movq 32(%1), %%rdx;" + " mulxq 32(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " movq %%r8, 64(%0);" + " mulxq 40(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " movq %%r10, 72(%0);" + " mulxq 48(%3), %%r12, %%r13;" " adox %%r11, %%r12;" + " mulxq 56(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " mov $0, %%rax;" + " adox %%rdx, %%rax;" + /* Compute src1[1] * src2 */ + " movq 40(%1), %%rdx;" + " mulxq 32(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 72(%0), %%r8;" " movq %%r8, 72(%0);" + " mulxq 40(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 80(%0);" + " mulxq 48(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " mov $0, %%r8;" + " mulxq 56(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " mov $0, %%rax;" + " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" + /* Compute src1[2] * src2 */ + " movq 48(%1), %%rdx;" + " mulxq 32(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 80(%0), %%r8;" " movq %%r8, 80(%0);" + " mulxq 40(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 88(%0);" + " mulxq 48(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " mov $0, %%r8;" + " mulxq 56(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " mov $0, %%rax;" + " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" + /* Compute src1[3] * src2 */ + " movq 56(%1), %%rdx;" + " mulxq 32(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 88(%0), %%r8;" " movq %%r8, 88(%0);" + " mulxq 40(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 96(%0);" + " mulxq 48(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " movq %%r12, 104(%0);" " mov $0, %%r8;" + " mulxq 56(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " movq %%r14, 112(%0);" " mov $0, %%rax;" + " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" " movq %%rax, 120(%0);" + /* Line up pointers */ + " mov %0, %1;" + " mov %2, %0;" + + /* Wrap the results back into the field */ + + /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */ + " mov $38, %%rdx;" + " mulxq 32(%1), %%r8, %%r13;" + " xor %3, %3;" + " adoxq 0(%1), %%r8;" + " mulxq 40(%1), %%r9, %%r12;" + " adcx %%r13, %%r9;" + " adoxq 8(%1), %%r9;" + " mulxq 48(%1), %%r10, %%r13;" + " adcx %%r12, %%r10;" + " adoxq 16(%1), %%r10;" + " mulxq 56(%1), %%r11, %%rax;" + " adcx %%r13, %%r11;" + " adoxq 24(%1), %%r11;" + " adcx %3, %%rax;" + " adox %3, %%rax;" + " imul %%rdx, %%rax;" + + /* Step 2: Fold the carry back into dst */ + " add %%rax, %%r8;" + " adcx %3, %%r9;" + " movq %%r9, 8(%0);" + " adcx %3, %%r10;" + " movq %%r10, 16(%0);" + " adcx %3, %%r11;" + " movq %%r11, 24(%0);" + + /* Step 3: Fold the carry bit back in; guaranteed not to carry at this point */ + " mov $0, %%rax;" + " cmovc %%rdx, %%rax;" + " add %%rax, %%r8;" + " movq %%r8, 0(%0);" + + /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */ + " mov $38, %%rdx;" + " mulxq 96(%1), %%r8, %%r13;" + " xor %3, %3;" + " adoxq 64(%1), %%r8;" + " mulxq 104(%1), %%r9, %%r12;" + " adcx %%r13, %%r9;" + " adoxq 72(%1), %%r9;" + " mulxq 112(%1), %%r10, %%r13;" + " adcx %%r12, %%r10;" + " adoxq 80(%1), %%r10;" + " mulxq 120(%1), %%r11, %%rax;" + " adcx %%r13, %%r11;" + " adoxq 88(%1), %%r11;" + " adcx %3, %%rax;" + " adox %3, %%rax;" + " imul %%rdx, %%rax;" + + /* Step 2: Fold the carry back into dst */ + " add %%rax, %%r8;" + " adcx %3, %%r9;" + " movq %%r9, 40(%0);" + " adcx %3, %%r10;" + " movq %%r10, 48(%0);" + " adcx %3, %%r11;" + " movq %%r11, 56(%0);" + + /* Step 3: Fold the carry bit back in; guaranteed not to carry at this point */ + " mov $0, %%rax;" + " cmovc %%rdx, %%rax;" + " add %%rax, %%r8;" + " movq %%r8, 32(%0);" + : "+&r" (tmp), "+&r" (f1), "+&r" (out), "+&r" (f2) + : + : "%rax", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "memory", "cc" + ); } -static void red_eltfp25519_1w_adx(u64 *const c, const u64 *const a) +/* Computes the field multiplication of four-element f1 with value in f2 */ +static inline void fmul_scalar(u64 *out, const u64 *f1, u64 f2) { - asm volatile( - "movl $38, %%edx ;" /* 2*c = 38 = 2^256 */ - "mulx 32(%1), %%r8, %%r10 ;" /* c*C[4] */ - "xorl %%ebx, %%ebx ;" - "adox (%1), %%r8 ;" - "mulx 40(%1), %%r9, %%r11 ;" /* c*C[5] */ - "adcx %%r10, %%r9 ;" - "adox 8(%1), %%r9 ;" - "mulx 48(%1), %%r10, %%rax ;" /* c*C[6] */ - "adcx %%r11, %%r10 ;" - "adox 16(%1), %%r10 ;" - "mulx 56(%1), %%r11, %%rcx ;" /* c*C[7] */ - "adcx %%rax, %%r11 ;" - "adox 24(%1), %%r11 ;" - /***************************************/ - "adcx %%rbx, %%rcx ;" - "adox %%rbx, %%rcx ;" - "imul %%rdx, %%rcx ;" /* c*C[4], cf=0, of=0 */ - "adcx %%rcx, %%r8 ;" - "adcx %%rbx, %%r9 ;" - "movq %%r9, 8(%0) ;" - "adcx %%rbx, %%r10 ;" - "movq %%r10, 16(%0) ;" - "adcx %%rbx, %%r11 ;" - "movq %%r11, 24(%0) ;" - "mov $0, %%ecx ;" - "cmovc %%edx, %%ecx ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, (%0) ;" - : - : "r"(c), "r"(a) - : "memory", "cc", "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", - "%r10", "%r11"); -} + register u64 f2_r asm("rdx") = f2; -static void red_eltfp25519_1w_bmi2(u64 *const c, const u64 *const a) -{ asm volatile( - "movl $38, %%edx ;" /* 2*c = 38 = 2^256 */ - "mulx 32(%1), %%r8, %%r10 ;" /* c*C[4] */ - "mulx 40(%1), %%r9, %%r11 ;" /* c*C[5] */ - "addq %%r10, %%r9 ;" - "mulx 48(%1), %%r10, %%rax ;" /* c*C[6] */ - "adcq %%r11, %%r10 ;" - "mulx 56(%1), %%r11, %%rcx ;" /* c*C[7] */ - "adcq %%rax, %%r11 ;" - /***************************************/ - "adcq $0, %%rcx ;" - "addq (%1), %%r8 ;" - "adcq 8(%1), %%r9 ;" - "adcq 16(%1), %%r10 ;" - "adcq 24(%1), %%r11 ;" - "adcq $0, %%rcx ;" - "imul %%rdx, %%rcx ;" /* c*C[4], cf=0 */ - "addq %%rcx, %%r8 ;" - "adcq $0, %%r9 ;" - "movq %%r9, 8(%0) ;" - "adcq $0, %%r10 ;" - "movq %%r10, 16(%0) ;" - "adcq $0, %%r11 ;" - "movq %%r11, 24(%0) ;" - "mov $0, %%ecx ;" - "cmovc %%edx, %%ecx ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, (%0) ;" - : - : "r"(c), "r"(a) - : "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", - "%r11"); + /* Compute the raw multiplication of f1*f2 */ + " mulxq 0(%2), %%r8, %%rcx;" /* f1[0]*f2 */ + " mulxq 8(%2), %%r9, %%r12;" /* f1[1]*f2 */ + " add %%rcx, %%r9;" + " mov $0, %%rcx;" + " mulxq 16(%2), %%r10, %%r13;" /* f1[2]*f2 */ + " adcx %%r12, %%r10;" + " mulxq 24(%2), %%r11, %%rax;" /* f1[3]*f2 */ + " adcx %%r13, %%r11;" + " adcx %%rcx, %%rax;" + + /* Wrap the result back into the field */ + + /* Step 1: Compute carry*38 */ + " mov $38, %%rdx;" + " imul %%rdx, %%rax;" + + /* Step 2: Fold the carry back into dst */ + " add %%rax, %%r8;" + " adcx %%rcx, %%r9;" + " movq %%r9, 8(%1);" + " adcx %%rcx, %%r10;" + " movq %%r10, 16(%1);" + " adcx %%rcx, %%r11;" + " movq %%r11, 24(%1);" + + /* Step 3: Fold the carry bit back in; guaranteed not to carry at this point */ + " mov $0, %%rax;" + " cmovc %%rdx, %%rax;" + " add %%rax, %%r8;" + " movq %%r8, 0(%1);" + : "+&r" (f2_r) + : "r" (out), "r" (f1) + : "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "memory", "cc" + ); } -static __always_inline void -add_eltfp25519_1w_adx(u64 *const c, const u64 *const a, const u64 *const b) +/* Computes p1 <- bit ? p2 : p1 in constant time */ +static inline void cswap2(u64 bit, const u64 *p1, const u64 *p2) { asm volatile( - "mov $38, %%eax ;" - "xorl %%ecx, %%ecx ;" - "movq (%2), %%r8 ;" - "adcx (%1), %%r8 ;" - "movq 8(%2), %%r9 ;" - "adcx 8(%1), %%r9 ;" - "movq 16(%2), %%r10 ;" - "adcx 16(%1), %%r10 ;" - "movq 24(%2), %%r11 ;" - "adcx 24(%1), %%r11 ;" - "cmovc %%eax, %%ecx ;" - "xorl %%eax, %%eax ;" - "adcx %%rcx, %%r8 ;" - "adcx %%rax, %%r9 ;" - "movq %%r9, 8(%0) ;" - "adcx %%rax, %%r10 ;" - "movq %%r10, 16(%0) ;" - "adcx %%rax, %%r11 ;" - "movq %%r11, 24(%0) ;" - "mov $38, %%ecx ;" - "cmovc %%ecx, %%eax ;" - "addq %%rax, %%r8 ;" - "movq %%r8, (%0) ;" - : - : "r"(c), "r"(a), "r"(b) - : "memory", "cc", "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11"); + /* Invert the polarity of bit to match cmov expectations */ + " add $18446744073709551615, %0;" + + /* cswap p1[0], p2[0] */ + " movq 0(%1), %%r8;" + " movq 0(%2), %%r9;" + " mov %%r8, %%r10;" + " cmovc %%r9, %%r8;" + " cmovc %%r10, %%r9;" + " movq %%r8, 0(%1);" + " movq %%r9, 0(%2);" + + /* cswap p1[1], p2[1] */ + " movq 8(%1), %%r8;" + " movq 8(%2), %%r9;" + " mov %%r8, %%r10;" + " cmovc %%r9, %%r8;" + " cmovc %%r10, %%r9;" + " movq %%r8, 8(%1);" + " movq %%r9, 8(%2);" + + /* cswap p1[2], p2[2] */ + " movq 16(%1), %%r8;" + " movq 16(%2), %%r9;" + " mov %%r8, %%r10;" + " cmovc %%r9, %%r8;" + " cmovc %%r10, %%r9;" + " movq %%r8, 16(%1);" + " movq %%r9, 16(%2);" + + /* cswap p1[3], p2[3] */ + " movq 24(%1), %%r8;" + " movq 24(%2), %%r9;" + " mov %%r8, %%r10;" + " cmovc %%r9, %%r8;" + " cmovc %%r10, %%r9;" + " movq %%r8, 24(%1);" + " movq %%r9, 24(%2);" + + /* cswap p1[4], p2[4] */ + " movq 32(%1), %%r8;" + " movq 32(%2), %%r9;" + " mov %%r8, %%r10;" + " cmovc %%r9, %%r8;" + " cmovc %%r10, %%r9;" + " movq %%r8, 32(%1);" + " movq %%r9, 32(%2);" + + /* cswap p1[5], p2[5] */ + " movq 40(%1), %%r8;" + " movq 40(%2), %%r9;" + " mov %%r8, %%r10;" + " cmovc %%r9, %%r8;" + " cmovc %%r10, %%r9;" + " movq %%r8, 40(%1);" + " movq %%r9, 40(%2);" + + /* cswap p1[6], p2[6] */ + " movq 48(%1), %%r8;" + " movq 48(%2), %%r9;" + " mov %%r8, %%r10;" + " cmovc %%r9, %%r8;" + " cmovc %%r10, %%r9;" + " movq %%r8, 48(%1);" + " movq %%r9, 48(%2);" + + /* cswap p1[7], p2[7] */ + " movq 56(%1), %%r8;" + " movq 56(%2), %%r9;" + " mov %%r8, %%r10;" + " cmovc %%r9, %%r8;" + " cmovc %%r10, %%r9;" + " movq %%r8, 56(%1);" + " movq %%r9, 56(%2);" + : "+&r" (bit) + : "r" (p1), "r" (p2) + : "%r8", "%r9", "%r10", "memory", "cc" + ); } -static __always_inline void -add_eltfp25519_1w_bmi2(u64 *const c, const u64 *const a, const u64 *const b) +/* Computes the square of a field element: out <- f * f + * Uses the 8-element buffer tmp for intermediate results */ +static inline void fsqr(u64 *out, const u64 *f, u64 *tmp) { asm volatile( - "mov $38, %%eax ;" - "movq (%2), %%r8 ;" - "addq (%1), %%r8 ;" - "movq 8(%2), %%r9 ;" - "adcq 8(%1), %%r9 ;" - "movq 16(%2), %%r10 ;" - "adcq 16(%1), %%r10 ;" - "movq 24(%2), %%r11 ;" - "adcq 24(%1), %%r11 ;" - "mov $0, %%ecx ;" - "cmovc %%eax, %%ecx ;" - "addq %%rcx, %%r8 ;" - "adcq $0, %%r9 ;" - "movq %%r9, 8(%0) ;" - "adcq $0, %%r10 ;" - "movq %%r10, 16(%0) ;" - "adcq $0, %%r11 ;" - "movq %%r11, 24(%0) ;" - "mov $0, %%ecx ;" - "cmovc %%eax, %%ecx ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, (%0) ;" - : - : "r"(c), "r"(a), "r"(b) - : "memory", "cc", "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11"); + /* Compute the raw multiplication: tmp <- f * f */ + + /* Step 1: Compute all partial products */ + " movq 0(%1), %%rdx;" /* f[0] */ + " mulxq 8(%1), %%r8, %%r14;" " xor %%r15, %%r15;" /* f[1]*f[0] */ + " mulxq 16(%1), %%r9, %%r10;" " adcx %%r14, %%r9;" /* f[2]*f[0] */ + " mulxq 24(%1), %%rax, %%rcx;" " adcx %%rax, %%r10;" /* f[3]*f[0] */ + " movq 24(%1), %%rdx;" /* f[3] */ + " mulxq 8(%1), %%r11, %%r12;" " adcx %%rcx, %%r11;" /* f[1]*f[3] */ + " mulxq 16(%1), %%rax, %%r13;" " adcx %%rax, %%r12;" /* f[2]*f[3] */ + " movq 8(%1), %%rdx;" " adcx %%r15, %%r13;" /* f1 */ + " mulxq 16(%1), %%rax, %%rcx;" " mov $0, %%r14;" /* f[2]*f[1] */ + + /* Step 2: Compute two parallel carry chains */ + " xor %%r15, %%r15;" + " adox %%rax, %%r10;" + " adcx %%r8, %%r8;" + " adox %%rcx, %%r11;" + " adcx %%r9, %%r9;" + " adox %%r15, %%r12;" + " adcx %%r10, %%r10;" + " adox %%r15, %%r13;" + " adcx %%r11, %%r11;" + " adox %%r15, %%r14;" + " adcx %%r12, %%r12;" + " adcx %%r13, %%r13;" + " adcx %%r14, %%r14;" + + /* Step 3: Compute intermediate squares */ + " movq 0(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[0]^2 */ + " movq %%rax, 0(%0);" + " add %%rcx, %%r8;" " movq %%r8, 8(%0);" + " movq 8(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[1]^2 */ + " adcx %%rax, %%r9;" " movq %%r9, 16(%0);" + " adcx %%rcx, %%r10;" " movq %%r10, 24(%0);" + " movq 16(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[2]^2 */ + " adcx %%rax, %%r11;" " movq %%r11, 32(%0);" + " adcx %%rcx, %%r12;" " movq %%r12, 40(%0);" + " movq 24(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[3]^2 */ + " adcx %%rax, %%r13;" " movq %%r13, 48(%0);" + " adcx %%rcx, %%r14;" " movq %%r14, 56(%0);" + + /* Line up pointers */ + " mov %0, %1;" + " mov %2, %0;" + + /* Wrap the result back into the field */ + + /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */ + " mov $38, %%rdx;" + " mulxq 32(%1), %%r8, %%r13;" + " xor %%rcx, %%rcx;" + " adoxq 0(%1), %%r8;" + " mulxq 40(%1), %%r9, %%r12;" + " adcx %%r13, %%r9;" + " adoxq 8(%1), %%r9;" + " mulxq 48(%1), %%r10, %%r13;" + " adcx %%r12, %%r10;" + " adoxq 16(%1), %%r10;" + " mulxq 56(%1), %%r11, %%rax;" + " adcx %%r13, %%r11;" + " adoxq 24(%1), %%r11;" + " adcx %%rcx, %%rax;" + " adox %%rcx, %%rax;" + " imul %%rdx, %%rax;" + + /* Step 2: Fold the carry back into dst */ + " add %%rax, %%r8;" + " adcx %%rcx, %%r9;" + " movq %%r9, 8(%0);" + " adcx %%rcx, %%r10;" + " movq %%r10, 16(%0);" + " adcx %%rcx, %%r11;" + " movq %%r11, 24(%0);" + + /* Step 3: Fold the carry bit back in; guaranteed not to carry at this point */ + " mov $0, %%rax;" + " cmovc %%rdx, %%rax;" + " add %%rax, %%r8;" + " movq %%r8, 0(%0);" + : "+&r" (tmp), "+&r" (f), "+&r" (out) + : + : "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "memory", "cc" + ); } -static __always_inline void -sub_eltfp25519_1w(u64 *const c, const u64 *const a, const u64 *const b) +/* Computes two field squarings: + * out[0] <- f[0] * f[0] + * out[1] <- f[1] * f[1] + * Uses the 16-element buffer tmp for intermediate results */ +static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) { asm volatile( - "mov $38, %%eax ;" - "movq (%1), %%r8 ;" - "subq (%2), %%r8 ;" - "movq 8(%1), %%r9 ;" - "sbbq 8(%2), %%r9 ;" - "movq 16(%1), %%r10 ;" - "sbbq 16(%2), %%r10 ;" - "movq 24(%1), %%r11 ;" - "sbbq 24(%2), %%r11 ;" - "mov $0, %%ecx ;" - "cmovc %%eax, %%ecx ;" - "subq %%rcx, %%r8 ;" - "sbbq $0, %%r9 ;" - "movq %%r9, 8(%0) ;" - "sbbq $0, %%r10 ;" - "movq %%r10, 16(%0) ;" - "sbbq $0, %%r11 ;" - "movq %%r11, 24(%0) ;" - "mov $0, %%ecx ;" - "cmovc %%eax, %%ecx ;" - "subq %%rcx, %%r8 ;" - "movq %%r8, (%0) ;" - : - : "r"(c), "r"(a), "r"(b) - : "memory", "cc", "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11"); + /* Step 1: Compute all partial products */ + " movq 0(%1), %%rdx;" /* f[0] */ + " mulxq 8(%1), %%r8, %%r14;" " xor %%r15, %%r15;" /* f[1]*f[0] */ + " mulxq 16(%1), %%r9, %%r10;" " adcx %%r14, %%r9;" /* f[2]*f[0] */ + " mulxq 24(%1), %%rax, %%rcx;" " adcx %%rax, %%r10;" /* f[3]*f[0] */ + " movq 24(%1), %%rdx;" /* f[3] */ + " mulxq 8(%1), %%r11, %%r12;" " adcx %%rcx, %%r11;" /* f[1]*f[3] */ + " mulxq 16(%1), %%rax, %%r13;" " adcx %%rax, %%r12;" /* f[2]*f[3] */ + " movq 8(%1), %%rdx;" " adcx %%r15, %%r13;" /* f1 */ + " mulxq 16(%1), %%rax, %%rcx;" " mov $0, %%r14;" /* f[2]*f[1] */ + + /* Step 2: Compute two parallel carry chains */ + " xor %%r15, %%r15;" + " adox %%rax, %%r10;" + " adcx %%r8, %%r8;" + " adox %%rcx, %%r11;" + " adcx %%r9, %%r9;" + " adox %%r15, %%r12;" + " adcx %%r10, %%r10;" + " adox %%r15, %%r13;" + " adcx %%r11, %%r11;" + " adox %%r15, %%r14;" + " adcx %%r12, %%r12;" + " adcx %%r13, %%r13;" + " adcx %%r14, %%r14;" + + /* Step 3: Compute intermediate squares */ + " movq 0(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[0]^2 */ + " movq %%rax, 0(%0);" + " add %%rcx, %%r8;" " movq %%r8, 8(%0);" + " movq 8(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[1]^2 */ + " adcx %%rax, %%r9;" " movq %%r9, 16(%0);" + " adcx %%rcx, %%r10;" " movq %%r10, 24(%0);" + " movq 16(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[2]^2 */ + " adcx %%rax, %%r11;" " movq %%r11, 32(%0);" + " adcx %%rcx, %%r12;" " movq %%r12, 40(%0);" + " movq 24(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[3]^2 */ + " adcx %%rax, %%r13;" " movq %%r13, 48(%0);" + " adcx %%rcx, %%r14;" " movq %%r14, 56(%0);" + + /* Step 1: Compute all partial products */ + " movq 32(%1), %%rdx;" /* f[0] */ + " mulxq 40(%1), %%r8, %%r14;" " xor %%r15, %%r15;" /* f[1]*f[0] */ + " mulxq 48(%1), %%r9, %%r10;" " adcx %%r14, %%r9;" /* f[2]*f[0] */ + " mulxq 56(%1), %%rax, %%rcx;" " adcx %%rax, %%r10;" /* f[3]*f[0] */ + " movq 56(%1), %%rdx;" /* f[3] */ + " mulxq 40(%1), %%r11, %%r12;" " adcx %%rcx, %%r11;" /* f[1]*f[3] */ + " mulxq 48(%1), %%rax, %%r13;" " adcx %%rax, %%r12;" /* f[2]*f[3] */ + " movq 40(%1), %%rdx;" " adcx %%r15, %%r13;" /* f1 */ + " mulxq 48(%1), %%rax, %%rcx;" " mov $0, %%r14;" /* f[2]*f[1] */ + + /* Step 2: Compute two parallel carry chains */ + " xor %%r15, %%r15;" + " adox %%rax, %%r10;" + " adcx %%r8, %%r8;" + " adox %%rcx, %%r11;" + " adcx %%r9, %%r9;" + " adox %%r15, %%r12;" + " adcx %%r10, %%r10;" + " adox %%r15, %%r13;" + " adcx %%r11, %%r11;" + " adox %%r15, %%r14;" + " adcx %%r12, %%r12;" + " adcx %%r13, %%r13;" + " adcx %%r14, %%r14;" + + /* Step 3: Compute intermediate squares */ + " movq 32(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[0]^2 */ + " movq %%rax, 64(%0);" + " add %%rcx, %%r8;" " movq %%r8, 72(%0);" + " movq 40(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[1]^2 */ + " adcx %%rax, %%r9;" " movq %%r9, 80(%0);" + " adcx %%rcx, %%r10;" " movq %%r10, 88(%0);" + " movq 48(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[2]^2 */ + " adcx %%rax, %%r11;" " movq %%r11, 96(%0);" + " adcx %%rcx, %%r12;" " movq %%r12, 104(%0);" + " movq 56(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[3]^2 */ + " adcx %%rax, %%r13;" " movq %%r13, 112(%0);" + " adcx %%rcx, %%r14;" " movq %%r14, 120(%0);" + + /* Line up pointers */ + " mov %0, %1;" + " mov %2, %0;" + + /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */ + " mov $38, %%rdx;" + " mulxq 32(%1), %%r8, %%r13;" + " xor %%rcx, %%rcx;" + " adoxq 0(%1), %%r8;" + " mulxq 40(%1), %%r9, %%r12;" + " adcx %%r13, %%r9;" + " adoxq 8(%1), %%r9;" + " mulxq 48(%1), %%r10, %%r13;" + " adcx %%r12, %%r10;" + " adoxq 16(%1), %%r10;" + " mulxq 56(%1), %%r11, %%rax;" + " adcx %%r13, %%r11;" + " adoxq 24(%1), %%r11;" + " adcx %%rcx, %%rax;" + " adox %%rcx, %%rax;" + " imul %%rdx, %%rax;" + + /* Step 2: Fold the carry back into dst */ + " add %%rax, %%r8;" + " adcx %%rcx, %%r9;" + " movq %%r9, 8(%0);" + " adcx %%rcx, %%r10;" + " movq %%r10, 16(%0);" + " adcx %%rcx, %%r11;" + " movq %%r11, 24(%0);" + + /* Step 3: Fold the carry bit back in; guaranteed not to carry at this point */ + " mov $0, %%rax;" + " cmovc %%rdx, %%rax;" + " add %%rax, %%r8;" + " movq %%r8, 0(%0);" + + /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */ + " mov $38, %%rdx;" + " mulxq 96(%1), %%r8, %%r13;" + " xor %%rcx, %%rcx;" + " adoxq 64(%1), %%r8;" + " mulxq 104(%1), %%r9, %%r12;" + " adcx %%r13, %%r9;" + " adoxq 72(%1), %%r9;" + " mulxq 112(%1), %%r10, %%r13;" + " adcx %%r12, %%r10;" + " adoxq 80(%1), %%r10;" + " mulxq 120(%1), %%r11, %%rax;" + " adcx %%r13, %%r11;" + " adoxq 88(%1), %%r11;" + " adcx %%rcx, %%rax;" + " adox %%rcx, %%rax;" + " imul %%rdx, %%rax;" + + /* Step 2: Fold the carry back into dst */ + " add %%rax, %%r8;" + " adcx %%rcx, %%r9;" + " movq %%r9, 40(%0);" + " adcx %%rcx, %%r10;" + " movq %%r10, 48(%0);" + " adcx %%rcx, %%r11;" + " movq %%r11, 56(%0);" + + /* Step 3: Fold the carry bit back in; guaranteed not to carry at this point */ + " mov $0, %%rax;" + " cmovc %%rdx, %%rax;" + " add %%rax, %%r8;" + " movq %%r8, 32(%0);" + : "+&r" (tmp), "+&r" (f), "+&r" (out) + : + : "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "memory", "cc" + ); } -/* Multiplication by a24 = (A+2)/4 = (486662+2)/4 = 121666 */ -static __always_inline void -mul_a24_eltfp25519_1w(u64 *const c, const u64 *const a) +static void point_add_and_double(u64 *q, u64 *p01_tmp1, u64 *tmp2) { - const u64 a24 = 121666; - asm volatile( - "movq %2, %%rdx ;" - "mulx (%1), %%r8, %%r10 ;" - "mulx 8(%1), %%r9, %%r11 ;" - "addq %%r10, %%r9 ;" - "mulx 16(%1), %%r10, %%rax ;" - "adcq %%r11, %%r10 ;" - "mulx 24(%1), %%r11, %%rcx ;" - "adcq %%rax, %%r11 ;" - /**************************/ - "adcq $0, %%rcx ;" - "movl $38, %%edx ;" /* 2*c = 38 = 2^256 mod 2^255-19*/ - "imul %%rdx, %%rcx ;" - "addq %%rcx, %%r8 ;" - "adcq $0, %%r9 ;" - "movq %%r9, 8(%0) ;" - "adcq $0, %%r10 ;" - "movq %%r10, 16(%0) ;" - "adcq $0, %%r11 ;" - "movq %%r11, 24(%0) ;" - "mov $0, %%ecx ;" - "cmovc %%edx, %%ecx ;" - "addq %%rcx, %%r8 ;" - "movq %%r8, (%0) ;" - : - : "r"(c), "r"(a), "r"(a24) - : "memory", "cc", "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", - "%r11"); + u64 *nq = p01_tmp1; + u64 *nq_p1 = p01_tmp1 + (u32)8U; + u64 *tmp1 = p01_tmp1 + (u32)16U; + u64 *x1 = q; + u64 *x2 = nq; + u64 *z2 = nq + (u32)4U; + u64 *z3 = nq_p1 + (u32)4U; + u64 *a = tmp1; + u64 *b = tmp1 + (u32)4U; + u64 *ab = tmp1; + u64 *dc = tmp1 + (u32)8U; + u64 *x3; + u64 *z31; + u64 *d0; + u64 *c0; + u64 *a1; + u64 *b1; + u64 *d; + u64 *c; + u64 *ab1; + u64 *dc1; + fadd(a, x2, z2); + fsub(b, x2, z2); + x3 = nq_p1; + z31 = nq_p1 + (u32)4U; + d0 = dc; + c0 = dc + (u32)4U; + fadd(c0, x3, z31); + fsub(d0, x3, z31); + fmul2(dc, dc, ab, tmp2); + fadd(x3, d0, c0); + fsub(z31, d0, c0); + a1 = tmp1; + b1 = tmp1 + (u32)4U; + d = tmp1 + (u32)8U; + c = tmp1 + (u32)12U; + ab1 = tmp1; + dc1 = tmp1 + (u32)8U; + fsqr2(dc1, ab1, tmp2); + fsqr2(nq_p1, nq_p1, tmp2); + a1[0U] = c[0U]; + a1[1U] = c[1U]; + a1[2U] = c[2U]; + a1[3U] = c[3U]; + fsub(c, d, c); + fmul_scalar(b1, c, (u64)121665U); + fadd(b1, b1, d); + fmul2(nq, dc1, ab1, tmp2); + fmul(z3, z3, x1, tmp2); } -static void inv_eltfp25519_1w_adx(u64 *const c, const u64 *const a) +static void point_double(u64 *nq, u64 *tmp1, u64 *tmp2) { - struct { - eltfp25519_1w_buffer buffer; - eltfp25519_1w x0, x1, x2; - } __aligned(32) m; - u64 *T[4]; - - T[0] = m.x0; - T[1] = c; /* x^(-1) */ - T[2] = m.x1; - T[3] = m.x2; - - copy_eltfp25519_1w(T[1], a); - sqrn_eltfp25519_1w_adx(T[1], 1); - copy_eltfp25519_1w(T[2], T[1]); - sqrn_eltfp25519_1w_adx(T[2], 2); - mul_eltfp25519_1w_adx(T[0], a, T[2]); - mul_eltfp25519_1w_adx(T[1], T[1], T[0]); - copy_eltfp25519_1w(T[2], T[1]); - sqrn_eltfp25519_1w_adx(T[2], 1); - mul_eltfp25519_1w_adx(T[0], T[0], T[2]); - copy_eltfp25519_1w(T[2], T[0]); - sqrn_eltfp25519_1w_adx(T[2], 5); - mul_eltfp25519_1w_adx(T[0], T[0], T[2]); - copy_eltfp25519_1w(T[2], T[0]); - sqrn_eltfp25519_1w_adx(T[2], 10); - mul_eltfp25519_1w_adx(T[2], T[2], T[0]); - copy_eltfp25519_1w(T[3], T[2]); - sqrn_eltfp25519_1w_adx(T[3], 20); - mul_eltfp25519_1w_adx(T[3], T[3], T[2]); - sqrn_eltfp25519_1w_adx(T[3], 10); - mul_eltfp25519_1w_adx(T[3], T[3], T[0]); - copy_eltfp25519_1w(T[0], T[3]); - sqrn_eltfp25519_1w_adx(T[0], 50); - mul_eltfp25519_1w_adx(T[0], T[0], T[3]); - copy_eltfp25519_1w(T[2], T[0]); - sqrn_eltfp25519_1w_adx(T[2], 100); - mul_eltfp25519_1w_adx(T[2], T[2], T[0]); - sqrn_eltfp25519_1w_adx(T[2], 50); - mul_eltfp25519_1w_adx(T[2], T[2], T[3]); - sqrn_eltfp25519_1w_adx(T[2], 5); - mul_eltfp25519_1w_adx(T[1], T[1], T[2]); - - memzero_explicit(&m, sizeof(m)); + u64 *x2 = nq; + u64 *z2 = nq + (u32)4U; + u64 *a = tmp1; + u64 *b = tmp1 + (u32)4U; + u64 *d = tmp1 + (u32)8U; + u64 *c = tmp1 + (u32)12U; + u64 *ab = tmp1; + u64 *dc = tmp1 + (u32)8U; + fadd(a, x2, z2); + fsub(b, x2, z2); + fsqr2(dc, ab, tmp2); + a[0U] = c[0U]; + a[1U] = c[1U]; + a[2U] = c[2U]; + a[3U] = c[3U]; + fsub(c, d, c); + fmul_scalar(b, c, (u64)121665U); + fadd(b, b, d); + fmul2(nq, dc, ab, tmp2); } -static void inv_eltfp25519_1w_bmi2(u64 *const c, const u64 *const a) +static void montgomery_ladder(u64 *out, const u8 *key, u64 *init1) { - struct { - eltfp25519_1w_buffer buffer; - eltfp25519_1w x0, x1, x2; - } __aligned(32) m; - u64 *T[5]; - - T[0] = m.x0; - T[1] = c; /* x^(-1) */ - T[2] = m.x1; - T[3] = m.x2; - - copy_eltfp25519_1w(T[1], a); - sqrn_eltfp25519_1w_bmi2(T[1], 1); - copy_eltfp25519_1w(T[2], T[1]); - sqrn_eltfp25519_1w_bmi2(T[2], 2); - mul_eltfp25519_1w_bmi2(T[0], a, T[2]); - mul_eltfp25519_1w_bmi2(T[1], T[1], T[0]); - copy_eltfp25519_1w(T[2], T[1]); - sqrn_eltfp25519_1w_bmi2(T[2], 1); - mul_eltfp25519_1w_bmi2(T[0], T[0], T[2]); - copy_eltfp25519_1w(T[2], T[0]); - sqrn_eltfp25519_1w_bmi2(T[2], 5); - mul_eltfp25519_1w_bmi2(T[0], T[0], T[2]); - copy_eltfp25519_1w(T[2], T[0]); - sqrn_eltfp25519_1w_bmi2(T[2], 10); - mul_eltfp25519_1w_bmi2(T[2], T[2], T[0]); - copy_eltfp25519_1w(T[3], T[2]); - sqrn_eltfp25519_1w_bmi2(T[3], 20); - mul_eltfp25519_1w_bmi2(T[3], T[3], T[2]); - sqrn_eltfp25519_1w_bmi2(T[3], 10); - mul_eltfp25519_1w_bmi2(T[3], T[3], T[0]); - copy_eltfp25519_1w(T[0], T[3]); - sqrn_eltfp25519_1w_bmi2(T[0], 50); - mul_eltfp25519_1w_bmi2(T[0], T[0], T[3]); - copy_eltfp25519_1w(T[2], T[0]); - sqrn_eltfp25519_1w_bmi2(T[2], 100); - mul_eltfp25519_1w_bmi2(T[2], T[2], T[0]); - sqrn_eltfp25519_1w_bmi2(T[2], 50); - mul_eltfp25519_1w_bmi2(T[2], T[2], T[3]); - sqrn_eltfp25519_1w_bmi2(T[2], 5); - mul_eltfp25519_1w_bmi2(T[1], T[1], T[2]); - - memzero_explicit(&m, sizeof(m)); + u64 tmp2[16U] = { 0U }; + u64 p01_tmp1_swap[33U] = { 0U }; + u64 *p0 = p01_tmp1_swap; + u64 *p01 = p01_tmp1_swap; + u64 *p03 = p01; + u64 *p11 = p01 + (u32)8U; + u64 *x0; + u64 *z0; + u64 *p01_tmp1; + u64 *p01_tmp11; + u64 *nq10; + u64 *nq_p11; + u64 *swap1; + u64 sw0; + u64 *nq1; + u64 *tmp1; + memcpy(p11, init1, (u32)8U * sizeof(init1[0U])); + x0 = p03; + z0 = p03 + (u32)4U; + x0[0U] = (u64)1U; + x0[1U] = (u64)0U; + x0[2U] = (u64)0U; + x0[3U] = (u64)0U; + z0[0U] = (u64)0U; + z0[1U] = (u64)0U; + z0[2U] = (u64)0U; + z0[3U] = (u64)0U; + p01_tmp1 = p01_tmp1_swap; + p01_tmp11 = p01_tmp1_swap; + nq10 = p01_tmp1_swap; + nq_p11 = p01_tmp1_swap + (u32)8U; + swap1 = p01_tmp1_swap + (u32)32U; + cswap2((u64)1U, nq10, nq_p11); + point_add_and_double(init1, p01_tmp11, tmp2); + swap1[0U] = (u64)1U; + { + u32 i; + for (i = (u32)0U; i < (u32)251U; i = i + (u32)1U) { + u64 *p01_tmp12 = p01_tmp1_swap; + u64 *swap2 = p01_tmp1_swap + (u32)32U; + u64 *nq2 = p01_tmp12; + u64 *nq_p12 = p01_tmp12 + (u32)8U; + u64 bit = (u64)(key[((u32)253U - i) / (u32)8U] >> ((u32)253U - i) % (u32)8U & (u8)1U); + u64 sw = swap2[0U] ^ bit; + cswap2(sw, nq2, nq_p12); + point_add_and_double(init1, p01_tmp12, tmp2); + swap2[0U] = bit; + } + } + sw0 = swap1[0U]; + cswap2(sw0, nq10, nq_p11); + nq1 = p01_tmp1; + tmp1 = p01_tmp1 + (u32)16U; + point_double(nq1, tmp1, tmp2); + point_double(nq1, tmp1, tmp2); + point_double(nq1, tmp1, tmp2); + memcpy(out, p0, (u32)8U * sizeof(p0[0U])); + + memzero_explicit(tmp2, sizeof(tmp2)); + memzero_explicit(p01_tmp1_swap, sizeof(p01_tmp1_swap)); } -/* Given c, a 256-bit number, fred_eltfp25519_1w updates c - * with a number such that 0 <= C < 2**255-19. - */ -static __always_inline void fred_eltfp25519_1w(u64 *const c) +static void fsquare_times(u64 *o, const u64 *inp, u64 *tmp, u32 n1) { - u64 tmp0 = 38, tmp1 = 19; - asm volatile( - "btrq $63, %3 ;" /* Put bit 255 in carry flag and clear */ - "cmovncl %k5, %k4 ;" /* c[255] ? 38 : 19 */ - - /* Add either 19 or 38 to c */ - "addq %4, %0 ;" - "adcq $0, %1 ;" - "adcq $0, %2 ;" - "adcq $0, %3 ;" - - /* Test for bit 255 again; only triggered on overflow modulo 2^255-19 */ - "movl $0, %k4 ;" - "cmovnsl %k5, %k4 ;" /* c[255] ? 0 : 19 */ - "btrq $63, %3 ;" /* Clear bit 255 */ - - /* Subtract 19 if necessary */ - "subq %4, %0 ;" - "sbbq $0, %1 ;" - "sbbq $0, %2 ;" - "sbbq $0, %3 ;" - - : "+r"(c[0]), "+r"(c[1]), "+r"(c[2]), "+r"(c[3]), "+r"(tmp0), - "+r"(tmp1) - : - : "memory", "cc"); + u32 i; + fsqr(o, inp, tmp); + for (i = (u32)0U; i < n1 - (u32)1U; i = i + (u32)1U) + fsqr(o, o, tmp); } -static __always_inline void cswap(u8 bit, u64 *const px, u64 *const py) +static void finv(u64 *o, const u64 *i, u64 *tmp) { - u64 temp; - asm volatile( - "test %9, %9 ;" - "movq %0, %8 ;" - "cmovnzq %4, %0 ;" - "cmovnzq %8, %4 ;" - "movq %1, %8 ;" - "cmovnzq %5, %1 ;" - "cmovnzq %8, %5 ;" - "movq %2, %8 ;" - "cmovnzq %6, %2 ;" - "cmovnzq %8, %6 ;" - "movq %3, %8 ;" - "cmovnzq %7, %3 ;" - "cmovnzq %8, %7 ;" - : "+r"(px[0]), "+r"(px[1]), "+r"(px[2]), "+r"(px[3]), - "+r"(py[0]), "+r"(py[1]), "+r"(py[2]), "+r"(py[3]), - "=r"(temp) - : "r"(bit) - : "cc" - ); + u64 t1[16U] = { 0U }; + u64 *a0 = t1; + u64 *b = t1 + (u32)4U; + u64 *c = t1 + (u32)8U; + u64 *t00 = t1 + (u32)12U; + u64 *tmp1 = tmp; + u64 *a; + u64 *t0; + fsquare_times(a0, i, tmp1, (u32)1U); + fsquare_times(t00, a0, tmp1, (u32)2U); + fmul(b, t00, i, tmp); + fmul(a0, b, a0, tmp); + fsquare_times(t00, a0, tmp1, (u32)1U); + fmul(b, t00, b, tmp); + fsquare_times(t00, b, tmp1, (u32)5U); + fmul(b, t00, b, tmp); + fsquare_times(t00, b, tmp1, (u32)10U); + fmul(c, t00, b, tmp); + fsquare_times(t00, c, tmp1, (u32)20U); + fmul(t00, t00, c, tmp); + fsquare_times(t00, t00, tmp1, (u32)10U); + fmul(b, t00, b, tmp); + fsquare_times(t00, b, tmp1, (u32)50U); + fmul(c, t00, b, tmp); + fsquare_times(t00, c, tmp1, (u32)100U); + fmul(t00, t00, c, tmp); + fsquare_times(t00, t00, tmp1, (u32)50U); + fmul(t00, t00, b, tmp); + fsquare_times(t00, t00, tmp1, (u32)5U); + a = t1; + t0 = t1 + (u32)12U; + fmul(o, t0, a, tmp); } -static __always_inline void cselect(u8 bit, u64 *const px, const u64 *const py) +static void store_felem(u64 *b, u64 *f) { - asm volatile( - "test %4, %4 ;" - "cmovnzq %5, %0 ;" - "cmovnzq %6, %1 ;" - "cmovnzq %7, %2 ;" - "cmovnzq %8, %3 ;" - : "+r"(px[0]), "+r"(px[1]), "+r"(px[2]), "+r"(px[3]) - : "r"(bit), "rm"(py[0]), "rm"(py[1]), "rm"(py[2]), "rm"(py[3]) - : "cc" - ); + u64 f30 = f[3U]; + u64 top_bit0 = f30 >> (u32)63U; + u64 carry0; + u64 f31; + u64 top_bit; + u64 carry; + u64 f0; + u64 f1; + u64 f2; + u64 f3; + u64 m0; + u64 m1; + u64 m2; + u64 m3; + u64 mask; + u64 f0_; + u64 f1_; + u64 f2_; + u64 f3_; + u64 o0; + u64 o1; + u64 o2; + u64 o3; + f[3U] = f30 & (u64)0x7fffffffffffffffU; + carry0 = add_scalar(f, f, (u64)19U * top_bit0); + f31 = f[3U]; + top_bit = f31 >> (u32)63U; + f[3U] = f31 & (u64)0x7fffffffffffffffU; + carry = add_scalar(f, f, (u64)19U * top_bit); + f0 = f[0U]; + f1 = f[1U]; + f2 = f[2U]; + f3 = f[3U]; + m0 = gte_mask(f0, (u64)0xffffffffffffffedU); + m1 = eq_mask(f1, (u64)0xffffffffffffffffU); + m2 = eq_mask(f2, (u64)0xffffffffffffffffU); + m3 = eq_mask(f3, (u64)0x7fffffffffffffffU); + mask = ((m0 & m1) & m2) & m3; + f0_ = f0 - (mask & (u64)0xffffffffffffffedU); + f1_ = f1 - (mask & (u64)0xffffffffffffffffU); + f2_ = f2 - (mask & (u64)0xffffffffffffffffU); + f3_ = f3 - (mask & (u64)0x7fffffffffffffffU); + o0 = f0_; + o1 = f1_; + o2 = f2_; + o3 = f3_; + b[0U] = o0; + b[1U] = o1; + b[2U] = o2; + b[3U] = o3; } -static void curve25519_adx(u8 shared[CURVE25519_KEY_SIZE], - const u8 private_key[CURVE25519_KEY_SIZE], - const u8 session_key[CURVE25519_KEY_SIZE]) +static void encode_point(u8 *o, const u64 *i) { - struct { - u64 buffer[4 * NUM_WORDS_ELTFP25519]; - u64 coordinates[4 * NUM_WORDS_ELTFP25519]; - u64 workspace[6 * NUM_WORDS_ELTFP25519]; - u8 session[CURVE25519_KEY_SIZE]; - u8 private[CURVE25519_KEY_SIZE]; - } __aligned(32) m; - - int i = 0, j = 0; - u64 prev = 0; - u64 *const X1 = (u64 *)m.session; - u64 *const key = (u64 *)m.private; - u64 *const Px = m.coordinates + 0; - u64 *const Pz = m.coordinates + 4; - u64 *const Qx = m.coordinates + 8; - u64 *const Qz = m.coordinates + 12; - u64 *const X2 = Qx; - u64 *const Z2 = Qz; - u64 *const X3 = Px; - u64 *const Z3 = Pz; - u64 *const X2Z2 = Qx; - u64 *const X3Z3 = Px; - - u64 *const A = m.workspace + 0; - u64 *const B = m.workspace + 4; - u64 *const D = m.workspace + 8; - u64 *const C = m.workspace + 12; - u64 *const DA = m.workspace + 16; - u64 *const CB = m.workspace + 20; - u64 *const AB = A; - u64 *const DC = D; - u64 *const DACB = DA; - - memcpy(m.private, private_key, sizeof(m.private)); - memcpy(m.session, session_key, sizeof(m.session)); - - curve25519_clamp_secret(m.private); - - /* As in the draft: - * When receiving such an array, implementations of curve25519 - * MUST mask the most-significant bit in the final byte. This - * is done to preserve compatibility with point formats which - * reserve the sign bit for use in other protocols and to - * increase resistance to implementation fingerprinting - */ - m.session[CURVE25519_KEY_SIZE - 1] &= (1 << (255 % 8)) - 1; - - copy_eltfp25519_1w(Px, X1); - setzero_eltfp25519_1w(Pz); - setzero_eltfp25519_1w(Qx); - setzero_eltfp25519_1w(Qz); - - Pz[0] = 1; - Qx[0] = 1; - - /* main-loop */ - prev = 0; - j = 62; - for (i = 3; i >= 0; --i) { - while (j >= 0) { - u64 bit = (key[i] >> j) & 0x1; - u64 swap = bit ^ prev; - prev = bit; - - add_eltfp25519_1w_adx(A, X2, Z2); /* A = (X2+Z2) */ - sub_eltfp25519_1w(B, X2, Z2); /* B = (X2-Z2) */ - add_eltfp25519_1w_adx(C, X3, Z3); /* C = (X3+Z3) */ - sub_eltfp25519_1w(D, X3, Z3); /* D = (X3-Z3) */ - mul_eltfp25519_2w_adx(DACB, AB, DC); /* [DA|CB] = [A|B]*[D|C] */ - - cselect(swap, A, C); - cselect(swap, B, D); - - sqr_eltfp25519_2w_adx(AB); /* [AA|BB] = [A^2|B^2] */ - add_eltfp25519_1w_adx(X3, DA, CB); /* X3 = (DA+CB) */ - sub_eltfp25519_1w(Z3, DA, CB); /* Z3 = (DA-CB) */ - sqr_eltfp25519_2w_adx(X3Z3); /* [X3|Z3] = [(DA+CB)|(DA+CB)]^2 */ - - copy_eltfp25519_1w(X2, B); /* X2 = B^2 */ - sub_eltfp25519_1w(Z2, A, B); /* Z2 = E = AA-BB */ - - mul_a24_eltfp25519_1w(B, Z2); /* B = a24*E */ - add_eltfp25519_1w_adx(B, B, X2); /* B = a24*E+B */ - mul_eltfp25519_2w_adx(X2Z2, X2Z2, AB); /* [X2|Z2] = [B|E]*[A|a24*E+B] */ - mul_eltfp25519_1w_adx(Z3, Z3, X1); /* Z3 = Z3*X1 */ - --j; - } - j = 63; - } - - inv_eltfp25519_1w_adx(A, Qz); - mul_eltfp25519_1w_adx((u64 *)shared, Qx, A); - fred_eltfp25519_1w((u64 *)shared); - - memzero_explicit(&m, sizeof(m)); + const u64 *x = i; + const u64 *z = i + (u32)4U; + u64 tmp[4U] = { 0U }; + u64 tmp_w[16U] = { 0U }; + finv(tmp, z, tmp_w); + fmul(tmp, tmp, x, tmp_w); + store_felem((u64 *)o, tmp); } -static void curve25519_adx_base(u8 session_key[CURVE25519_KEY_SIZE], - const u8 private_key[CURVE25519_KEY_SIZE]) +static void curve25519_ever64(u8 *out, const u8 *priv, const u8 *pub) { - struct { - u64 buffer[4 * NUM_WORDS_ELTFP25519]; - u64 coordinates[4 * NUM_WORDS_ELTFP25519]; - u64 workspace[4 * NUM_WORDS_ELTFP25519]; - u8 private[CURVE25519_KEY_SIZE]; - } __aligned(32) m; - - const int ite[4] = { 64, 64, 64, 63 }; - const int q = 3; - u64 swap = 1; - - int i = 0, j = 0, k = 0; - u64 *const key = (u64 *)m.private; - u64 *const Ur1 = m.coordinates + 0; - u64 *const Zr1 = m.coordinates + 4; - u64 *const Ur2 = m.coordinates + 8; - u64 *const Zr2 = m.coordinates + 12; - - u64 *const UZr1 = m.coordinates + 0; - u64 *const ZUr2 = m.coordinates + 8; - - u64 *const A = m.workspace + 0; - u64 *const B = m.workspace + 4; - u64 *const C = m.workspace + 8; - u64 *const D = m.workspace + 12; - - u64 *const AB = m.workspace + 0; - u64 *const CD = m.workspace + 8; - - const u64 *const P = table_ladder_8k; - - memcpy(m.private, private_key, sizeof(m.private)); - - curve25519_clamp_secret(m.private); - - setzero_eltfp25519_1w(Ur1); - setzero_eltfp25519_1w(Zr1); - setzero_eltfp25519_1w(Zr2); - Ur1[0] = 1; - Zr1[0] = 1; - Zr2[0] = 1; - - /* G-S */ - Ur2[3] = 0x1eaecdeee27cab34UL; - Ur2[2] = 0xadc7a0b9235d48e2UL; - Ur2[1] = 0xbbf095ae14b2edf8UL; - Ur2[0] = 0x7e94e1fec82faabdUL; - - /* main-loop */ - j = q; - for (i = 0; i < NUM_WORDS_ELTFP25519; ++i) { - while (j < ite[i]) { - u64 bit = (key[i] >> j) & 0x1; - k = (64 * i + j - q); - swap = swap ^ bit; - cswap(swap, Ur1, Ur2); - cswap(swap, Zr1, Zr2); - swap = bit; - /* Addition */ - sub_eltfp25519_1w(B, Ur1, Zr1); /* B = Ur1-Zr1 */ - add_eltfp25519_1w_adx(A, Ur1, Zr1); /* A = Ur1+Zr1 */ - mul_eltfp25519_1w_adx(C, &P[4 * k], B); /* C = M0-B */ - sub_eltfp25519_1w(B, A, C); /* B = (Ur1+Zr1) - M*(Ur1-Zr1) */ - add_eltfp25519_1w_adx(A, A, C); /* A = (Ur1+Zr1) + M*(Ur1-Zr1) */ - sqr_eltfp25519_2w_adx(AB); /* A = A^2 | B = B^2 */ - mul_eltfp25519_2w_adx(UZr1, ZUr2, AB); /* Ur1 = Zr2*A | Zr1 = Ur2*B */ - ++j; + u64 init1[8U] = { 0U }; + u64 tmp[4U] = { 0U }; + u64 tmp3; + u64 *x; + u64 *z; + { + u32 i; + for (i = (u32)0U; i < (u32)4U; i = i + (u32)1U) { + u64 *os = tmp; + const u8 *bj = pub + i * (u32)8U; + u64 u = *(u64 *)bj; + u64 r = u; + u64 x0 = r; + os[i] = x0; } - j = 0; } - - /* Doubling */ - for (i = 0; i < q; ++i) { - add_eltfp25519_1w_adx(A, Ur1, Zr1); /* A = Ur1+Zr1 */ - sub_eltfp25519_1w(B, Ur1, Zr1); /* B = Ur1-Zr1 */ - sqr_eltfp25519_2w_adx(AB); /* A = A**2 B = B**2 */ - copy_eltfp25519_1w(C, B); /* C = B */ - sub_eltfp25519_1w(B, A, B); /* B = A-B */ - mul_a24_eltfp25519_1w(D, B); /* D = my_a24*B */ - add_eltfp25519_1w_adx(D, D, C); /* D = D+C */ - mul_eltfp25519_2w_adx(UZr1, AB, CD); /* Ur1 = A*B Zr1 = Zr1*A */ - } - - /* Convert to affine coordinates */ - inv_eltfp25519_1w_adx(A, Zr1); - mul_eltfp25519_1w_adx((u64 *)session_key, Ur1, A); - fred_eltfp25519_1w((u64 *)session_key); - - memzero_explicit(&m, sizeof(m)); + tmp3 = tmp[3U]; + tmp[3U] = tmp3 & (u64)0x7fffffffffffffffU; + x = init1; + z = init1 + (u32)4U; + z[0U] = (u64)1U; + z[1U] = (u64)0U; + z[2U] = (u64)0U; + z[3U] = (u64)0U; + x[0U] = tmp[0U]; + x[1U] = tmp[1U]; + x[2U] = tmp[2U]; + x[3U] = tmp[3U]; + montgomery_ladder(init1, priv, init1); + encode_point(out, init1); } -static void curve25519_bmi2(u8 shared[CURVE25519_KEY_SIZE], - const u8 private_key[CURVE25519_KEY_SIZE], - const u8 session_key[CURVE25519_KEY_SIZE]) -{ - struct { - u64 buffer[4 * NUM_WORDS_ELTFP25519]; - u64 coordinates[4 * NUM_WORDS_ELTFP25519]; - u64 workspace[6 * NUM_WORDS_ELTFP25519]; - u8 session[CURVE25519_KEY_SIZE]; - u8 private[CURVE25519_KEY_SIZE]; - } __aligned(32) m; - - int i = 0, j = 0; - u64 prev = 0; - u64 *const X1 = (u64 *)m.session; - u64 *const key = (u64 *)m.private; - u64 *const Px = m.coordinates + 0; - u64 *const Pz = m.coordinates + 4; - u64 *const Qx = m.coordinates + 8; - u64 *const Qz = m.coordinates + 12; - u64 *const X2 = Qx; - u64 *const Z2 = Qz; - u64 *const X3 = Px; - u64 *const Z3 = Pz; - u64 *const X2Z2 = Qx; - u64 *const X3Z3 = Px; - - u64 *const A = m.workspace + 0; - u64 *const B = m.workspace + 4; - u64 *const D = m.workspace + 8; - u64 *const C = m.workspace + 12; - u64 *const DA = m.workspace + 16; - u64 *const CB = m.workspace + 20; - u64 *const AB = A; - u64 *const DC = D; - u64 *const DACB = DA; - - memcpy(m.private, private_key, sizeof(m.private)); - memcpy(m.session, session_key, sizeof(m.session)); - - curve25519_clamp_secret(m.private); - - /* As in the draft: - * When receiving such an array, implementations of curve25519 - * MUST mask the most-significant bit in the final byte. This - * is done to preserve compatibility with point formats which - * reserve the sign bit for use in other protocols and to - * increase resistance to implementation fingerprinting - */ - m.session[CURVE25519_KEY_SIZE - 1] &= (1 << (255 % 8)) - 1; - - copy_eltfp25519_1w(Px, X1); - setzero_eltfp25519_1w(Pz); - setzero_eltfp25519_1w(Qx); - setzero_eltfp25519_1w(Qz); - - Pz[0] = 1; - Qx[0] = 1; - - /* main-loop */ - prev = 0; - j = 62; - for (i = 3; i >= 0; --i) { - while (j >= 0) { - u64 bit = (key[i] >> j) & 0x1; - u64 swap = bit ^ prev; - prev = bit; - - add_eltfp25519_1w_bmi2(A, X2, Z2); /* A = (X2+Z2) */ - sub_eltfp25519_1w(B, X2, Z2); /* B = (X2-Z2) */ - add_eltfp25519_1w_bmi2(C, X3, Z3); /* C = (X3+Z3) */ - sub_eltfp25519_1w(D, X3, Z3); /* D = (X3-Z3) */ - mul_eltfp25519_2w_bmi2(DACB, AB, DC); /* [DA|CB] = [A|B]*[D|C] */ - - cselect(swap, A, C); - cselect(swap, B, D); - - sqr_eltfp25519_2w_bmi2(AB); /* [AA|BB] = [A^2|B^2] */ - add_eltfp25519_1w_bmi2(X3, DA, CB); /* X3 = (DA+CB) */ - sub_eltfp25519_1w(Z3, DA, CB); /* Z3 = (DA-CB) */ - sqr_eltfp25519_2w_bmi2(X3Z3); /* [X3|Z3] = [(DA+CB)|(DA+CB)]^2 */ - - copy_eltfp25519_1w(X2, B); /* X2 = B^2 */ - sub_eltfp25519_1w(Z2, A, B); /* Z2 = E = AA-BB */ - - mul_a24_eltfp25519_1w(B, Z2); /* B = a24*E */ - add_eltfp25519_1w_bmi2(B, B, X2); /* B = a24*E+B */ - mul_eltfp25519_2w_bmi2(X2Z2, X2Z2, AB); /* [X2|Z2] = [B|E]*[A|a24*E+B] */ - mul_eltfp25519_1w_bmi2(Z3, Z3, X1); /* Z3 = Z3*X1 */ - --j; - } - j = 63; - } - - inv_eltfp25519_1w_bmi2(A, Qz); - mul_eltfp25519_1w_bmi2((u64 *)shared, Qx, A); - fred_eltfp25519_1w((u64 *)shared); +/* The below constants were generated using this sage script: + * + * #!/usr/bin/env sage + * import sys + * from sage.all import * + * def limbs(n): + * n = int(n) + * l = ((n >> 0) % 2^64, (n >> 64) % 2^64, (n >> 128) % 2^64, (n >> 192) % 2^64) + * return "0x%016xULL, 0x%016xULL, 0x%016xULL, 0x%016xULL" % l + * ec = EllipticCurve(GF(2^255 - 19), [0, 486662, 0, 1, 0]) + * p_minus_s = (ec.lift_x(9) - ec.lift_x(1))[0] + * print("static const u64 p_minus_s[] = { %s };\n" % limbs(p_minus_s)) + * print("static const u64 table_ladder[] = {") + * p = ec.lift_x(9) + * for i in range(252): + * l = (p[0] + p[2]) / (p[0] - p[2]) + * print(("\t%s" + ("," if i != 251 else "")) % limbs(l)) + * p = p * 2 + * print("};") + * + */ - memzero_explicit(&m, sizeof(m)); -} +static const u64 p_minus_s[] = { 0x816b1e0137d48290ULL, 0x440f6a51eb4d1207ULL, 0x52385f46dca2b71dULL, 0x215132111d8354cbULL }; + +static const u64 table_ladder[] = { + 0xfffffffffffffff3ULL, 0xffffffffffffffffULL, 0xffffffffffffffffULL, 0x5fffffffffffffffULL, + 0x6b8220f416aafe96ULL, 0x82ebeb2b4f566a34ULL, 0xd5a9a5b075a5950fULL, 0x5142b2cf4b2488f4ULL, + 0x6aaebc750069680cULL, 0x89cf7820a0f99c41ULL, 0x2a58d9183b56d0f4ULL, 0x4b5aca80e36011a4ULL, + 0x329132348c29745dULL, 0xf4a2e616e1642fd7ULL, 0x1e45bb03ff67bc34ULL, 0x306912d0f42a9b4aULL, + 0xff886507e6af7154ULL, 0x04f50e13dfeec82fULL, 0xaa512fe82abab5ceULL, 0x174e251a68d5f222ULL, + 0xcf96700d82028898ULL, 0x1743e3370a2c02c5ULL, 0x379eec98b4e86eaaULL, 0x0c59888a51e0482eULL, + 0xfbcbf1d699b5d189ULL, 0xacaef0d58e9fdc84ULL, 0xc1c20d06231f7614ULL, 0x2938218da274f972ULL, + 0xf6af49beff1d7f18ULL, 0xcc541c22387ac9c2ULL, 0x96fcc9ef4015c56bULL, 0x69c1627c690913a9ULL, + 0x7a86fd2f4733db0eULL, 0xfdb8c4f29e087de9ULL, 0x095e4b1a8ea2a229ULL, 0x1ad7a7c829b37a79ULL, + 0x342d89cad17ea0c0ULL, 0x67bedda6cced2051ULL, 0x19ca31bf2bb42f74ULL, 0x3df7b4c84980acbbULL, + 0xa8c6444dc80ad883ULL, 0xb91e440366e3ab85ULL, 0xc215cda00164f6d8ULL, 0x3d867c6ef247e668ULL, + 0xc7dd582bcc3e658cULL, 0xfd2c4748ee0e5528ULL, 0xa0fd9b95cc9f4f71ULL, 0x7529d871b0675ddfULL, + 0xb8f568b42d3cbd78ULL, 0x1233011b91f3da82ULL, 0x2dce6ccd4a7c3b62ULL, 0x75e7fc8e9e498603ULL, + 0x2f4f13f1fcd0b6ecULL, 0xf1a8ca1f29ff7a45ULL, 0xc249c1a72981e29bULL, 0x6ebe0dbb8c83b56aULL, + 0x7114fa8d170bb222ULL, 0x65a2dcd5bf93935fULL, 0xbdc41f68b59c979aULL, 0x2f0eef79a2ce9289ULL, + 0x42ecbf0c083c37ceULL, 0x2930bc09ec496322ULL, 0xf294b0c19cfeac0dULL, 0x3780aa4bedfabb80ULL, + 0x56c17d3e7cead929ULL, 0xe7cb4beb2e5722c5ULL, 0x0ce931732dbfe15aULL, 0x41b883c7621052f8ULL, + 0xdbf75ca0c3d25350ULL, 0x2936be086eb1e351ULL, 0xc936e03cb4a9b212ULL, 0x1d45bf82322225aaULL, + 0xe81ab1036a024cc5ULL, 0xe212201c304c9a72ULL, 0xc5d73fba6832b1fcULL, 0x20ffdb5a4d839581ULL, + 0xa283d367be5d0fadULL, 0x6c2b25ca8b164475ULL, 0x9d4935467caaf22eULL, 0x5166408eee85ff49ULL, + 0x3c67baa2fab4e361ULL, 0xb3e433c67ef35cefULL, 0x5259729241159b1cULL, 0x6a621892d5b0ab33ULL, + 0x20b74a387555cdcbULL, 0x532aa10e1208923fULL, 0xeaa17b7762281dd1ULL, 0x61ab3443f05c44bfULL, + 0x257a6c422324def8ULL, 0x131c6c1017e3cf7fULL, 0x23758739f630a257ULL, 0x295a407a01a78580ULL, + 0xf8c443246d5da8d9ULL, 0x19d775450c52fa5dULL, 0x2afcfc92731bf83dULL, 0x7d10c8e81b2b4700ULL, + 0xc8e0271f70baa20bULL, 0x993748867ca63957ULL, 0x5412efb3cb7ed4bbULL, 0x3196d36173e62975ULL, + 0xde5bcad141c7dffcULL, 0x47cc8cd2b395c848ULL, 0xa34cd942e11af3cbULL, 0x0256dbf2d04ecec2ULL, + 0x875ab7e94b0e667fULL, 0xcad4dd83c0850d10ULL, 0x47f12e8f4e72c79fULL, 0x5f1a87bb8c85b19bULL, + 0x7ae9d0b6437f51b8ULL, 0x12c7ce5518879065ULL, 0x2ade09fe5cf77aeeULL, 0x23a05a2f7d2c5627ULL, + 0x5908e128f17c169aULL, 0xf77498dd8ad0852dULL, 0x74b4c4ceab102f64ULL, 0x183abadd10139845ULL, + 0xb165ba8daa92aaacULL, 0xd5c5ef9599386705ULL, 0xbe2f8f0cf8fc40d1ULL, 0x2701e635ee204514ULL, + 0x629fa80020156514ULL, 0xf223868764a8c1ceULL, 0x5b894fff0b3f060eULL, 0x60d9944cf708a3faULL, + 0xaeea001a1c7a201fULL, 0xebf16a633ee2ce63ULL, 0x6f7709594c7a07e1ULL, 0x79b958150d0208cbULL, + 0x24b55e5301d410e7ULL, 0xe3a34edff3fdc84dULL, 0xd88768e4904032d8ULL, 0x131384427b3aaeecULL, + 0x8405e51286234f14ULL, 0x14dc4739adb4c529ULL, 0xb8a2b5b250634ffdULL, 0x2fe2a94ad8a7ff93ULL, + 0xec5c57efe843faddULL, 0x2843ce40f0bb9918ULL, 0xa4b561d6cf3d6305ULL, 0x743629bde8fb777eULL, + 0x343edd46bbaf738fULL, 0xed981828b101a651ULL, 0xa401760b882c797aULL, 0x1fc223e28dc88730ULL, + 0x48604e91fc0fba0eULL, 0xb637f78f052c6fa4ULL, 0x91ccac3d09e9239cULL, 0x23f7eed4437a687cULL, + 0x5173b1118d9bd800ULL, 0x29d641b63189d4a7ULL, 0xfdbf177988bbc586ULL, 0x2959894fcad81df5ULL, + 0xaebc8ef3b4bbc899ULL, 0x4148995ab26992b9ULL, 0x24e20b0134f92cfbULL, 0x40d158894a05dee8ULL, + 0x46b00b1185af76f6ULL, 0x26bac77873187a79ULL, 0x3dc0bf95ab8fff5fULL, 0x2a608bd8945524d7ULL, + 0x26449588bd446302ULL, 0x7c4bc21c0388439cULL, 0x8e98a4f383bd11b2ULL, 0x26218d7bc9d876b9ULL, + 0xe3081542997c178aULL, 0x3c2d29a86fb6606fULL, 0x5c217736fa279374ULL, 0x7dde05734afeb1faULL, + 0x3bf10e3906d42babULL, 0xe4f7803e1980649cULL, 0xe6053bf89595bf7aULL, 0x394faf38da245530ULL, + 0x7a8efb58896928f4ULL, 0xfbc778e9cc6a113cULL, 0x72670ce330af596fULL, 0x48f222a81d3d6cf7ULL, + 0xf01fce410d72caa7ULL, 0x5a20ecc7213b5595ULL, 0x7bc21165c1fa1483ULL, 0x07f89ae31da8a741ULL, + 0x05d2c2b4c6830ff9ULL, 0xd43e330fc6316293ULL, 0xa5a5590a96d3a904ULL, 0x705edb91a65333b6ULL, + 0x048ee15e0bb9a5f7ULL, 0x3240cfca9e0aaf5dULL, 0x8f4b71ceedc4a40bULL, 0x621c0da3de544a6dULL, + 0x92872836a08c4091ULL, 0xce8375b010c91445ULL, 0x8a72eb524f276394ULL, 0x2667fcfa7ec83635ULL, + 0x7f4c173345e8752aULL, 0x061b47feee7079a5ULL, 0x25dd9afa9f86ff34ULL, 0x3780cef5425dc89cULL, + 0x1a46035a513bb4e9ULL, 0x3e1ef379ac575adaULL, 0xc78c5f1c5fa24b50ULL, 0x321a967634fd9f22ULL, + 0x946707b8826e27faULL, 0x3dca84d64c506fd0ULL, 0xc189218075e91436ULL, 0x6d9284169b3b8484ULL, + 0x3a67e840383f2ddfULL, 0x33eec9a30c4f9b75ULL, 0x3ec7c86fa783ef47ULL, 0x26ec449fbac9fbc4ULL, + 0x5c0f38cba09b9e7dULL, 0x81168cc762a3478cULL, 0x3e23b0d306fc121cULL, 0x5a238aa0a5efdcddULL, + 0x1ba26121c4ea43ffULL, 0x36f8c77f7c8832b5ULL, 0x88fbea0b0adcf99aULL, 0x5ca9938ec25bebf9ULL, + 0xd5436a5e51fccda0ULL, 0x1dbc4797c2cd893bULL, 0x19346a65d3224a08ULL, 0x0f5034e49b9af466ULL, + 0xf23c3967a1e0b96eULL, 0xe58b08fa867a4d88ULL, 0xfb2fabc6a7341679ULL, 0x2a75381eb6026946ULL, + 0xc80a3be4c19420acULL, 0x66b1f6c681f2b6dcULL, 0x7cf7036761e93388ULL, 0x25abbbd8a660a4c4ULL, + 0x91ea12ba14fd5198ULL, 0x684950fc4a3cffa9ULL, 0xf826842130f5ad28ULL, 0x3ea988f75301a441ULL, + 0xc978109a695f8c6fULL, 0x1746eb4a0530c3f3ULL, 0x444d6d77b4459995ULL, 0x75952b8c054e5cc7ULL, + 0xa3703f7915f4d6aaULL, 0x66c346202f2647d8ULL, 0xd01469df811d644bULL, 0x77fea47d81a5d71fULL, + 0xc5e9529ef57ca381ULL, 0x6eeeb4b9ce2f881aULL, 0xb6e91a28e8009bd6ULL, 0x4b80be3e9afc3fecULL, + 0x7e3773c526aed2c5ULL, 0x1b4afcb453c9a49dULL, 0xa920bdd7baffb24dULL, 0x7c54699f122d400eULL, + 0xef46c8e14fa94bc8ULL, 0xe0b074ce2952ed5eULL, 0xbea450e1dbd885d5ULL, 0x61b68649320f712cULL, + 0x8a485f7309ccbdd1ULL, 0xbd06320d7d4d1a2dULL, 0x25232973322dbef4ULL, 0x445dc4758c17f770ULL, + 0xdb0434177cc8933cULL, 0xed6fe82175ea059fULL, 0x1efebefdc053db34ULL, 0x4adbe867c65daf99ULL, + 0x3acd71a2a90609dfULL, 0xe5e991856dd04050ULL, 0x1ec69b688157c23cULL, 0x697427f6885cfe4dULL, + 0xd7be7b9b65e1a851ULL, 0xa03d28d522c536ddULL, 0x28399d658fd2b645ULL, 0x49e5b7e17c2641e1ULL, + 0x6f8c3a98700457a4ULL, 0x5078f0a25ebb6778ULL, 0xd13c3ccbc382960fULL, 0x2e003258a7df84b1ULL, + 0x8ad1f39be6296a1cULL, 0xc1eeaa652a5fbfb2ULL, 0x33ee0673fd26f3cbULL, 0x59256173a69d2cccULL, + 0x41ea07aa4e18fc41ULL, 0xd9fc19527c87a51eULL, 0xbdaacb805831ca6fULL, 0x445b652dc916694fULL, + 0xce92a3a7f2172315ULL, 0x1edc282de11b9964ULL, 0xa1823aafe04c314aULL, 0x790a2d94437cf586ULL, + 0x71c447fb93f6e009ULL, 0x8922a56722845276ULL, 0xbf70903b204f5169ULL, 0x2f7a89891ba319feULL, + 0x02a08eb577e2140cULL, 0xed9a4ed4427bdcf4ULL, 0x5253ec44e4323cd1ULL, 0x3e88363c14e9355bULL, + 0xaa66c14277110b8cULL, 0x1ae0391610a23390ULL, 0x2030bd12c93fc2a2ULL, 0x3ee141579555c7abULL, + 0x9214de3a6d6e7d41ULL, 0x3ccdd88607f17efeULL, 0x674f1288f8e11217ULL, 0x5682250f329f93d0ULL, + 0x6cf00b136d2e396eULL, 0x6e4cf86f1014debfULL, 0x5930b1b5bfcc4e83ULL, 0x047069b48aba16b6ULL, + 0x0d4ce4ab69b20793ULL, 0xb24db91a97d0fb9eULL, 0xcdfa50f54e00d01dULL, 0x221b1085368bddb5ULL, + 0xe7e59468b1e3d8d2ULL, 0x53c56563bd122f93ULL, 0xeee8a903e0663f09ULL, 0x61efa662cbbe3d42ULL, + 0x2cf8ddddde6eab2aULL, 0x9bf80ad51435f231ULL, 0x5deadacec9f04973ULL, 0x29275b5d41d29b27ULL, + 0xcfde0f0895ebf14fULL, 0xb9aab96b054905a7ULL, 0xcae80dd9a1c420fdULL, 0x0a63bf2f1673bbc7ULL, + 0x092f6e11958fbc8cULL, 0x672a81e804822fadULL, 0xcac8351560d52517ULL, 0x6f3f7722c8f192f8ULL, + 0xf8ba90ccc2e894b7ULL, 0x2c7557a438ff9f0dULL, 0x894d1d855ae52359ULL, 0x68e122157b743d69ULL, + 0xd87e5570cfb919f3ULL, 0x3f2cdecd95798db9ULL, 0x2121154710c0a2ceULL, 0x3c66a115246dc5b2ULL, + 0xcbedc562294ecb72ULL, 0xba7143c36a280b16ULL, 0x9610c2efd4078b67ULL, 0x6144735d946a4b1eULL, + 0x536f111ed75b3350ULL, 0x0211db8c2041d81bULL, 0xf93cb1000e10413cULL, 0x149dfd3c039e8876ULL, + 0xd479dde46b63155bULL, 0xb66e15e93c837976ULL, 0xdafde43b1f13e038ULL, 0x5fafda1a2e4b0b35ULL, + 0x3600bbdf17197581ULL, 0x3972050bbe3cd2c2ULL, 0x5938906dbdd5be86ULL, 0x34fce5e43f9b860fULL, + 0x75a8a4cd42d14d02ULL, 0x828dabc53441df65ULL, 0x33dcabedd2e131d3ULL, 0x3ebad76fb814d25fULL, + 0xd4906f566f70e10fULL, 0x5d12f7aa51690f5aULL, 0x45adb16e76cefcf2ULL, 0x01f768aead232999ULL, + 0x2b6cc77b6248febdULL, 0x3cd30628ec3aaffdULL, 0xce1c0b80d4ef486aULL, 0x4c3bff2ea6f66c23ULL, + 0x3f2ec4094aeaeb5fULL, 0x61b19b286e372ca7ULL, 0x5eefa966de2a701dULL, 0x23b20565de55e3efULL, + 0xe301ca5279d58557ULL, 0x07b2d4ce27c2874fULL, 0xa532cd8a9dcf1d67ULL, 0x2a52fee23f2bff56ULL, + 0x8624efb37cd8663dULL, 0xbbc7ac20ffbd7594ULL, 0x57b85e9c82d37445ULL, 0x7b3052cb86a6ec66ULL, + 0x3482f0ad2525e91eULL, 0x2cb68043d28edca0ULL, 0xaf4f6d052e1b003aULL, 0x185f8c2529781b0aULL, + 0xaa41de5bd80ce0d6ULL, 0x9407b2416853e9d6ULL, 0x563ec36e357f4c3aULL, 0x4cc4b8dd0e297bceULL, + 0xa2fc1a52ffb8730eULL, 0x1811f16e67058e37ULL, 0x10f9a366cddf4ee1ULL, 0x72f4a0c4a0b9f099ULL, + 0x8c16c06f663f4ea7ULL, 0x693b3af74e970fbaULL, 0x2102e7f1d69ec345ULL, 0x0ba53cbc968a8089ULL, + 0xca3d9dc7fea15537ULL, 0x4c6824bb51536493ULL, 0xb9886314844006b1ULL, 0x40d2a72ab454cc60ULL, + 0x5936a1b712570975ULL, 0x91b9d648debda657ULL, 0x3344094bb64330eaULL, 0x006ba10d12ee51d0ULL, + 0x19228468f5de5d58ULL, 0x0eb12f4c38cc05b0ULL, 0xa1039f9dd5601990ULL, 0x4502d4ce4fff0e0bULL, + 0xeb2054106837c189ULL, 0xd0f6544c6dd3b93cULL, 0x40727064c416d74fULL, 0x6e15c6114b502ef0ULL, + 0x4df2a398cfb1a76bULL, 0x11256c7419f2f6b1ULL, 0x4a497962066e6043ULL, 0x705b3aab41355b44ULL, + 0x365ef536d797b1d8ULL, 0x00076bd622ddf0dbULL, 0x3bbf33b0e0575a88ULL, 0x3777aa05c8e4ca4dULL, + 0x392745c85578db5fULL, 0x6fda4149dbae5ae2ULL, 0xb1f0b00b8adc9867ULL, 0x09963437d36f1da3ULL, + 0x7e824e90a5dc3853ULL, 0xccb5f6641f135cbdULL, 0x6736d86c87ce8fccULL, 0x625f3ce26604249fULL, + 0xaf8ac8059502f63fULL, 0x0c05e70a2e351469ULL, 0x35292e9c764b6305ULL, 0x1a394360c7e23ac3ULL, + 0xd5c6d53251183264ULL, 0x62065abd43c2b74fULL, 0xb5fbf5d03b973f9bULL, 0x13a3da3661206e5eULL, + 0xc6bd5837725d94e5ULL, 0x18e30912205016c5ULL, 0x2088ce1570033c68ULL, 0x7fba1f495c837987ULL, + 0x5a8c7423f2f9079dULL, 0x1735157b34023fc5ULL, 0xe4f9b49ad2fab351ULL, 0x6691ff72c878e33cULL, + 0x122c2adedc5eff3eULL, 0xf8dd4bf1d8956cf4ULL, 0xeb86205d9e9e5bdaULL, 0x049b92b9d975c743ULL, + 0xa5379730b0f6c05aULL, 0x72a0ffacc6f3a553ULL, 0xb0032c34b20dcd6dULL, 0x470e9dbc88d5164aULL, + 0xb19cf10ca237c047ULL, 0xb65466711f6c81a2ULL, 0xb3321bd16dd80b43ULL, 0x48c14f600c5fbe8eULL, + 0x66451c264aa6c803ULL, 0xb66e3904a4fa7da6ULL, 0xd45f19b0b3128395ULL, 0x31602627c3c9bc10ULL, + 0x3120dc4832e4e10dULL, 0xeb20c46756c717f7ULL, 0x00f52e3f67280294ULL, 0x566d4fc14730c509ULL, + 0x7e3a5d40fd837206ULL, 0xc1e926dc7159547aULL, 0x216730fba68d6095ULL, 0x22e8c3843f69cea7ULL, + 0x33d074e8930e4b2bULL, 0xb6e4350e84d15816ULL, 0x5534c26ad6ba2365ULL, 0x7773c12f89f1f3f3ULL, + 0x8cba404da57962aaULL, 0x5b9897a81999ce56ULL, 0x508e862f121692fcULL, 0x3a81907fa093c291ULL, + 0x0dded0ff4725a510ULL, 0x10d8cc10673fc503ULL, 0x5b9d151c9f1f4e89ULL, 0x32a5c1d5cb09a44cULL, + 0x1e0aa442b90541fbULL, 0x5f85eb7cc1b485dbULL, 0xbee595ce8a9df2e5ULL, 0x25e496c722422236ULL, + 0x5edf3c46cd0fe5b9ULL, 0x34e75a7ed2a43388ULL, 0xe488de11d761e352ULL, 0x0e878a01a085545cULL, + 0xba493c77e021bb04ULL, 0x2b4d1843c7df899aULL, 0x9ea37a487ae80d67ULL, 0x67a9958011e41794ULL, + 0x4b58051a6697b065ULL, 0x47e33f7d8d6ba6d4ULL, 0xbb4da8d483ca46c1ULL, 0x68becaa181c2db0dULL, + 0x8d8980e90b989aa5ULL, 0xf95eb14a2c93c99bULL, 0x51c6c7c4796e73a2ULL, 0x6e228363b5efb569ULL, + 0xc6bbc0b02dd624c8ULL, 0x777eb47dec8170eeULL, 0x3cde15a004cfafa9ULL, 0x1dc6bc087160bf9bULL, + 0x2e07e043eec34002ULL, 0x18e9fc677a68dc7fULL, 0xd8da03188bd15b9aULL, 0x48fbc3bb00568253ULL, + 0x57547d4cfb654ce1ULL, 0xd3565b82a058e2adULL, 0xf63eaf0bbf154478ULL, 0x47531ef114dfbb18ULL, + 0xe1ec630a4278c587ULL, 0x5507d546ca8e83f3ULL, 0x85e135c63adc0c2bULL, 0x0aa7efa85682844eULL, + 0x72691ba8b3e1f615ULL, 0x32b4e9701fbe3ffaULL, 0x97b6d92e39bb7868ULL, 0x2cfe53dea02e39e8ULL, + 0x687392cd85cd52b0ULL, 0x27ff66c910e29831ULL, 0x97134556a9832d06ULL, 0x269bb0360a84f8a0ULL, + 0x706e55457643f85cULL, 0x3734a48c9b597d1bULL, 0x7aee91e8c6efa472ULL, 0x5cd6abc198a9d9e0ULL, + 0x0e04de06cb3ce41aULL, 0xd8c6eb893402e138ULL, 0x904659bb686e3772ULL, 0x7215c371746ba8c8ULL, + 0xfd12a97eeae4a2d9ULL, 0x9514b7516394f2c5ULL, 0x266fd5809208f294ULL, 0x5c847085619a26b9ULL, + 0x52985410fed694eaULL, 0x3c905b934a2ed254ULL, 0x10bb47692d3be467ULL, 0x063b3d2d69e5e9e1ULL, + 0x472726eedda57debULL, 0xefb6c4ae10f41891ULL, 0x2b1641917b307614ULL, 0x117c554fc4f45b7cULL, + 0xc07cf3118f9d8812ULL, 0x01dbd82050017939ULL, 0xd7e803f4171b2827ULL, 0x1015e87487d225eaULL, + 0xc58de3fed23acc4dULL, 0x50db91c294a7be2dULL, 0x0b94d43d1c9cf457ULL, 0x6b1640fa6e37524aULL, + 0x692f346c5fda0d09ULL, 0x200b1c59fa4d3151ULL, 0xb8c46f760777a296ULL, 0x4b38395f3ffdfbcfULL, + 0x18d25e00be54d671ULL, 0x60d50582bec8aba6ULL, 0x87ad8f263b78b982ULL, 0x50fdf64e9cda0432ULL, + 0x90f567aac578dcf0ULL, 0xef1e9b0ef2a3133bULL, 0x0eebba9242d9de71ULL, 0x15473c9bf03101c7ULL, + 0x7c77e8ae56b78095ULL, 0xb678e7666e6f078eULL, 0x2da0b9615348ba1fULL, 0x7cf931c1ff733f0bULL, + 0x26b357f50a0a366cULL, 0xe9708cf42b87d732ULL, 0xc13aeea5f91cb2c0ULL, 0x35d90c991143bb4cULL, + 0x47c1c404a9a0d9dcULL, 0x659e58451972d251ULL, 0x3875a8c473b38c31ULL, 0x1fbd9ed379561f24ULL, + 0x11fabc6fd41ec28dULL, 0x7ef8dfe3cd2a2dcaULL, 0x72e73b5d8c404595ULL, 0x6135fa4954b72f27ULL, + 0xccfc32a2de24b69cULL, 0x3f55698c1f095d88ULL, 0xbe3350ed5ac3f929ULL, 0x5e9bf806ca477eebULL, + 0xe9ce8fb63c309f68ULL, 0x5376f63565e1f9f4ULL, 0xd1afcfb35a6393f1ULL, 0x6632a1ede5623506ULL, + 0x0b7d6c390c2ded4cULL, 0x56cb3281df04cb1fULL, 0x66305a1249ecc3c7ULL, 0x5d588b60a38ca72aULL, + 0xa6ecbf78e8e5f42dULL, 0x86eeb44b3c8a3eecULL, 0xec219c48fbd21604ULL, 0x1aaf1af517c36731ULL, + 0xc306a2836769bde7ULL, 0x208280622b1e2adbULL, 0x8027f51ffbff94a6ULL, 0x76cfa1ce1124f26bULL, + 0x18eb00562422abb6ULL, 0xf377c4d58f8c29c3ULL, 0x4dbbc207f531561aULL, 0x0253b7f082128a27ULL, + 0x3d1f091cb62c17e0ULL, 0x4860e1abd64628a9ULL, 0x52d17436309d4253ULL, 0x356f97e13efae576ULL, + 0xd351e11aa150535bULL, 0x3e6b45bb1dd878ccULL, 0x0c776128bed92c98ULL, 0x1d34ae93032885b8ULL, + 0x4ba0488ca85ba4c3ULL, 0x985348c33c9ce6ceULL, 0x66124c6f97bda770ULL, 0x0f81a0290654124aULL, + 0x9ed09ca6569b86fdULL, 0x811009fd18af9a2dULL, 0xff08d03f93d8c20aULL, 0x52a148199faef26bULL, + 0x3e03f9dc2d8d1b73ULL, 0x4205801873961a70ULL, 0xc0d987f041a35970ULL, 0x07aa1f15a1c0d549ULL, + 0xdfd46ce08cd27224ULL, 0x6d0a024f934e4239ULL, 0x808a7a6399897b59ULL, 0x0a4556e9e13d95a2ULL, + 0xd21a991fe9c13045ULL, 0x9b0e8548fe7751b8ULL, 0x5da643cb4bf30035ULL, 0x77db28d63940f721ULL, + 0xfc5eeb614adc9011ULL, 0x5229419ae8c411ebULL, 0x9ec3e7787d1dcf74ULL, 0x340d053e216e4cb5ULL, + 0xcac7af39b48df2b4ULL, 0xc0faec2871a10a94ULL, 0x140a69245ca575edULL, 0x0cf1c37134273a4cULL, + 0xc8ee306ac224b8a5ULL, 0x57eaee7ccb4930b0ULL, 0xa1e806bdaacbe74fULL, 0x7d9a62742eeb657dULL, + 0x9eb6b6ef546c4830ULL, 0x885cca1fddb36e2eULL, 0xe6b9f383ef0d7105ULL, 0x58654fef9d2e0412ULL, + 0xa905c4ffbe0e8e26ULL, 0x942de5df9b31816eULL, 0x497d723f802e88e1ULL, 0x30684dea602f408dULL, + 0x21e5a278a3e6cb34ULL, 0xaefb6e6f5b151dc4ULL, 0xb30b8e049d77ca15ULL, 0x28c3c9cf53b98981ULL, + 0x287fb721556cdd2aULL, 0x0d317ca897022274ULL, 0x7468c7423a543258ULL, 0x4a7f11464eb5642fULL, + 0xa237a4774d193aa6ULL, 0xd865986ea92129a1ULL, 0x24c515ecf87c1a88ULL, 0x604003575f39f5ebULL, + 0x47b9f189570a9b27ULL, 0x2b98cede465e4b78ULL, 0x026df551dbb85c20ULL, 0x74fcd91047e21901ULL, + 0x13e2a90a23c1bfa3ULL, 0x0cb0074e478519f6ULL, 0x5ff1cbbe3af6cf44ULL, 0x67fe5438be812dbeULL, + 0xd13cf64fa40f05b0ULL, 0x054dfb2f32283787ULL, 0x4173915b7f0d2aeaULL, 0x482f144f1f610d4eULL, + 0xf6210201b47f8234ULL, 0x5d0ae1929e70b990ULL, 0xdcd7f455b049567cULL, 0x7e93d0f1f0916f01ULL, + 0xdd79cbf18a7db4faULL, 0xbe8391bf6f74c62fULL, 0x027145d14b8291bdULL, 0x585a73ea2cbf1705ULL, + 0x485ca03e928a0db2ULL, 0x10fc01a5742857e7ULL, 0x2f482edbd6d551a7ULL, 0x0f0433b5048fdb8aULL, + 0x60da2e8dd7dc6247ULL, 0x88b4c9d38cd4819aULL, 0x13033ac001f66697ULL, 0x273b24fe3b367d75ULL, + 0xc6e8f66a31b3b9d4ULL, 0x281514a494df49d5ULL, 0xd1726fdfc8b23da7ULL, 0x4b3ae7d103dee548ULL, + 0xc6256e19ce4b9d7eULL, 0xff5c5cf186e3c61cULL, 0xacc63ca34b8ec145ULL, 0x74621888fee66574ULL, + 0x956f409645290a1eULL, 0xef0bf8e3263a962eULL, 0xed6a50eb5ec2647bULL, 0x0694283a9dca7502ULL, + 0x769b963643a2dcd1ULL, 0x42b7c8ea09fc5353ULL, 0x4f002aee13397eabULL, 0x63005e2c19b7d63aULL, + 0xca6736da63023beaULL, 0x966c7f6db12a99b7ULL, 0xace09390c537c5e1ULL, 0x0b696063a1aa89eeULL, + 0xebb03e97288c56e5ULL, 0x432a9f9f938c8be8ULL, 0xa6a5a93d5b717f71ULL, 0x1a5fb4c3e18f9d97ULL, + 0x1c94e7ad1c60cdceULL, 0xee202a43fc02c4a0ULL, 0x8dafe4d867c46a20ULL, 0x0a10263c8ac27b58ULL, + 0xd0dea9dfe4432a4aULL, 0x856af87bbe9277c5ULL, 0xce8472acc212c71aULL, 0x6f151b6d9bbb1e91ULL, + 0x26776c527ceed56aULL, 0x7d211cb7fbf8faecULL, 0x37ae66a6fd4609ccULL, 0x1f81b702d2770c42ULL, + 0x2fb0b057eac58392ULL, 0xe1dd89fe29744e9dULL, 0xc964f8eb17beb4f8ULL, 0x29571073c9a2d41eULL, + 0xa948a18981c0e254ULL, 0x2df6369b65b22830ULL, 0xa33eb2d75fcfd3c6ULL, 0x078cd6ec4199a01fULL, + 0x4a584a41ad900d2fULL, 0x32142b78e2c74c52ULL, 0x68c4e8338431c978ULL, 0x7f69ea9008689fc2ULL, + 0x52f2c81e46a38265ULL, 0xfd78072d04a832fdULL, 0x8cd7d5fa25359e94ULL, 0x4de71b7454cc29d2ULL, + 0x42eb60ad1eda6ac9ULL, 0x0aad37dfdbc09c3aULL, 0x81004b71e33cc191ULL, 0x44e6be345122803cULL, + 0x03fe8388ba1920dbULL, 0xf5d57c32150db008ULL, 0x49c8c4281af60c29ULL, 0x21edb518de701aeeULL, + 0x7fb63e418f06dc99ULL, 0xa4460d99c166d7b8ULL, 0x24dd5248ce520a83ULL, 0x5ec3ad712b928358ULL, + 0x15022a5fbd17930fULL, 0xa4f64a77d82570e3ULL, 0x12bc8d6915783712ULL, 0x498194c0fc620abbULL, + 0x38a2d9d255686c82ULL, 0x785c6bd9193e21f0ULL, 0xe4d5c81ab24a5484ULL, 0x56307860b2e20989ULL, + 0x429d55f78b4d74c4ULL, 0x22f1834643350131ULL, 0x1e60c24598c71fffULL, 0x59f2f014979983efULL, + 0x46a47d56eb494a44ULL, 0x3e22a854d636a18eULL, 0xb346e15274491c3bULL, 0x2ceafd4e5390cde7ULL, + 0xba8a8538be0d6675ULL, 0x4b9074bb50818e23ULL, 0xcbdab89085d304c3ULL, 0x61a24fe0e56192c4ULL, + 0xcb7615e6db525bcbULL, 0xdd7d8c35a567e4caULL, 0xe6b4153acafcdd69ULL, 0x2d668e097f3c9766ULL, + 0xa57e7e265ce55ef0ULL, 0x5d9f4e527cd4b967ULL, 0xfbc83606492fd1e5ULL, 0x090d52beb7c3f7aeULL, + 0x09b9515a1e7b4d7cULL, 0x1f266a2599da44c0ULL, 0xa1c49548e2c55504ULL, 0x7ef04287126f15ccULL, + 0xfed1659dbd30ef15ULL, 0x8b4ab9eec4e0277bULL, 0x884d6236a5df3291ULL, 0x1fd96ea6bf5cf788ULL, + 0x42a161981f190d9aULL, 0x61d849507e6052c1ULL, 0x9fe113bf285a2cd5ULL, 0x7c22d676dbad85d8ULL, + 0x82e770ed2bfbd27dULL, 0x4c05b2ece996f5a5ULL, 0xcd40a9c2b0900150ULL, 0x5895319213d9bf64ULL, + 0xe7cc5d703fea2e08ULL, 0xb50c491258e2188cULL, 0xcce30baa48205bf0ULL, 0x537c659ccfa32d62ULL, + 0x37b6623a98cfc088ULL, 0xfe9bed1fa4d6aca4ULL, 0x04d29b8e56a8d1b0ULL, 0x725f71c40b519575ULL, + 0x28c7f89cd0339ce6ULL, 0x8367b14469ddc18bULL, 0x883ada83a6a1652cULL, 0x585f1974034d6c17ULL, + 0x89cfb266f1b19188ULL, 0xe63b4863e7c35217ULL, 0xd88c9da6b4c0526aULL, 0x3e035c9df0954635ULL, + 0xdd9d5412fb45de9dULL, 0xdd684532e4cff40dULL, 0x4b5c999b151d671cULL, 0x2d8c2cc811e7f690ULL, + 0x7f54be1d90055d40ULL, 0xa464c5df464aaf40ULL, 0x33979624f0e917beULL, 0x2c018dc527356b30ULL, + 0xa5415024e330b3d4ULL, 0x73ff3d96691652d3ULL, 0x94ec42c4ef9b59f1ULL, 0x0747201618d08e5aULL, + 0x4d6ca48aca411c53ULL, 0x66415f2fcfa66119ULL, 0x9c4dd40051e227ffULL, 0x59810bc09a02f7ebULL, + 0x2a7eb171b3dc101dULL, 0x441c5ab99ffef68eULL, 0x32025c9b93b359eaULL, 0x5e8ce0a71e9d112fULL, + 0xbfcccb92429503fdULL, 0xd271ba752f095d55ULL, 0x345ead5e972d091eULL, 0x18c8df11a83103baULL, + 0x90cd949a9aed0f4cULL, 0xc5d1f4cb6660e37eULL, 0xb8cac52d56c52e0bULL, 0x6e42e400c5808e0dULL, + 0xa3b46966eeaefd23ULL, 0x0c4f1f0be39ecdcaULL, 0x189dc8c9d683a51dULL, 0x51f27f054c09351bULL, + 0x4c487ccd2a320682ULL, 0x587ea95bb3df1c96ULL, 0xc8ccf79e555cb8e8ULL, 0x547dc829a206d73dULL, + 0xb822a6cd80c39b06ULL, 0xe96d54732000d4c6ULL, 0x28535b6f91463b4dULL, 0x228f4660e2486e1dULL, + 0x98799538de8d3abfULL, 0x8cd8330045ebca6eULL, 0x79952a008221e738ULL, 0x4322e1a7535cd2bbULL, + 0xb114c11819d1801cULL, 0x2016e4d84f3f5ec7ULL, 0xdd0e2df409260f4cULL, 0x5ec362c0ae5f7266ULL, + 0xc0462b18b8b2b4eeULL, 0x7cc8d950274d1afbULL, 0xf25f7105436b02d2ULL, 0x43bbf8dcbff9ccd3ULL, + 0xb6ad1767a039e9dfULL, 0xb0714da8f69d3583ULL, 0x5e55fa18b42931f5ULL, 0x4ed5558f33c60961ULL, + 0x1fe37901c647a5ddULL, 0x593ddf1f8081d357ULL, 0x0249a4fd813fd7a6ULL, 0x69acca274e9caf61ULL, + 0x047ba3ea330721c9ULL, 0x83423fc20e7e1ea0ULL, 0x1df4c0af01314a60ULL, 0x09a62dab89289527ULL, + 0xa5b325a49cc6cb00ULL, 0xe94b5dc654b56cb6ULL, 0x3be28779adc994a0ULL, 0x4296e8f8ba3a4aadULL, + 0x328689761e451eabULL, 0x2e4d598bff59594aULL, 0x49b96853d7a7084aULL, 0x4980a319601420a8ULL, + 0x9565b9e12f552c42ULL, 0x8a5318db7100fe96ULL, 0x05c90b4d43add0d7ULL, 0x538b4cd66a5d4edaULL, + 0xf4e94fc3e89f039fULL, 0x592c9af26f618045ULL, 0x08a36eb5fd4b9550ULL, 0x25fffaf6c2ed1419ULL, + 0x34434459cc79d354ULL, 0xeeecbfb4b1d5476bULL, 0xddeb34a061615d99ULL, 0x5129cecceb64b773ULL, + 0xee43215894993520ULL, 0x772f9c7cf14c0b3bULL, 0xd2e2fce306bedad5ULL, 0x715f42b546f06a97ULL, + 0x434ecdceda5b5f1aULL, 0x0da17115a49741a9ULL, 0x680bd77c73edad2eULL, 0x487c02354edd9041ULL, + 0xb8efeff3a70ed9c4ULL, 0x56a32aa3e857e302ULL, 0xdf3a68bd48a2a5a0ULL, 0x07f650b73176c444ULL, + 0xe38b9b1626e0ccb1ULL, 0x79e053c18b09fb36ULL, 0x56d90319c9f94964ULL, 0x1ca941e7ac9ff5c4ULL, + 0x49c4df29162fa0bbULL, 0x8488cf3282b33305ULL, 0x95dfda14cabb437dULL, 0x3391f78264d5ad86ULL, + 0x729ae06ae2b5095dULL, 0xd58a58d73259a946ULL, 0xe9834262d13921edULL, 0x27fedafaa54bb592ULL, + 0xa99dc5b829ad48bbULL, 0x5f025742499ee260ULL, 0x802c8ecd5d7513fdULL, 0x78ceb3ef3f6dd938ULL, + 0xc342f44f8a135d94ULL, 0x7b9edb44828cdda3ULL, 0x9436d11a0537cfe7ULL, 0x5064b164ec1ab4c8ULL, + 0x7020eccfd37eb2fcULL, 0x1f31ea3ed90d25fcULL, 0x1b930d7bdfa1bb34ULL, 0x5344467a48113044ULL, + 0x70073170f25e6dfbULL, 0xe385dc1a50114cc8ULL, 0x2348698ac8fc4f00ULL, 0x2a77a55284dd40d8ULL, + 0xfe06afe0c98c6ce4ULL, 0xc235df96dddfd6e4ULL, 0x1428d01e33bf1ed3ULL, 0x785768ec9300bdafULL, + 0x9702e57a91deb63bULL, 0x61bdb8bfe5ce8b80ULL, 0x645b426f3d1d58acULL, 0x4804a82227a557bcULL, + 0x8e57048ab44d2601ULL, 0x68d6501a4b3a6935ULL, 0xc39c9ec3f9e1c293ULL, 0x4172f257d4de63e2ULL, + 0xd368b450330c6401ULL, 0x040d3017418f2391ULL, 0x2c34bb6090b7d90dULL, 0x16f649228fdfd51fULL, + 0xbea6818e2b928ef5ULL, 0xe28ccf91cdc11e72ULL, 0x594aaa68e77a36cdULL, 0x313034806c7ffd0fULL, + 0x8a9d27ac2249bd65ULL, 0x19a3b464018e9512ULL, 0xc26ccff352b37ec7ULL, 0x056f68341d797b21ULL, + 0x5e79d6757efd2327ULL, 0xfabdbcb6553afe15ULL, 0xd3e7222c6eaf5a60ULL, 0x7046c76d4dae743bULL, + 0x660be872b18d4a55ULL, 0x19992518574e1496ULL, 0xc103053a302bdcbbULL, 0x3ed8e9800b218e8eULL, + 0x7b0b9239fa75e03eULL, 0xefe9fb684633c083ULL, 0x98a35fbe391a7793ULL, 0x6065510fe2d0fe34ULL, + 0x55cb668548abad0cULL, 0xb4584548da87e527ULL, 0x2c43ecea0107c1ddULL, 0x526028809372de35ULL, + 0x3415c56af9213b1fULL, 0x5bee1a4d017e98dbULL, 0x13f6b105b5cf709bULL, 0x5ff20e3482b29ab6ULL, + 0x0aa29c75cc2e6c90ULL, 0xfc7d73ca3a70e206ULL, 0x899fc38fc4b5c515ULL, 0x250386b124ffc207ULL, + 0x54ea28d5ae3d2b56ULL, 0x9913149dd6de60ceULL, 0x16694fc58f06d6c1ULL, 0x46b23975eb018fc7ULL, + 0x470a6a0fb4b7b4e2ULL, 0x5d92475a8f7253deULL, 0xabeee5b52fbd3adbULL, 0x7fa20801a0806968ULL, + 0x76f3faf19f7714d2ULL, 0xb3e840c12f4660c3ULL, 0x0fb4cd8df212744eULL, 0x4b065a251d3a2dd2ULL, + 0x5cebde383d77cd4aULL, 0x6adf39df882c9cb1ULL, 0xa2dd242eb09af759ULL, 0x3147c0e50e5f6422ULL, + 0x164ca5101d1350dbULL, 0xf8d13479c33fc962ULL, 0xe640ce4d13e5da08ULL, 0x4bdee0c45061f8baULL, + 0xd7c46dc1a4edb1c9ULL, 0x5514d7b6437fd98aULL, 0x58942f6bb2a1c00bULL, 0x2dffb2ab1d70710eULL, + 0xccdfcf2fc18b6d68ULL, 0xa8ebcba8b7806167ULL, 0x980697f95e2937e3ULL, 0x02fbba1cd0126e8cULL +}; -static void curve25519_bmi2_base(u8 session_key[CURVE25519_KEY_SIZE], - const u8 private_key[CURVE25519_KEY_SIZE]) +static void curve25519_ever64_base(u8 *out, const u8 *priv) { - struct { - u64 buffer[4 * NUM_WORDS_ELTFP25519]; - u64 coordinates[4 * NUM_WORDS_ELTFP25519]; - u64 workspace[4 * NUM_WORDS_ELTFP25519]; - u8 private[CURVE25519_KEY_SIZE]; - } __aligned(32) m; - - const int ite[4] = { 64, 64, 64, 63 }; - const int q = 3; u64 swap = 1; - - int i = 0, j = 0, k = 0; - u64 *const key = (u64 *)m.private; - u64 *const Ur1 = m.coordinates + 0; - u64 *const Zr1 = m.coordinates + 4; - u64 *const Ur2 = m.coordinates + 8; - u64 *const Zr2 = m.coordinates + 12; - - u64 *const UZr1 = m.coordinates + 0; - u64 *const ZUr2 = m.coordinates + 8; - - u64 *const A = m.workspace + 0; - u64 *const B = m.workspace + 4; - u64 *const C = m.workspace + 8; - u64 *const D = m.workspace + 12; - - u64 *const AB = m.workspace + 0; - u64 *const CD = m.workspace + 8; - - const u64 *const P = table_ladder_8k; - - memcpy(m.private, private_key, sizeof(m.private)); - - curve25519_clamp_secret(m.private); - - setzero_eltfp25519_1w(Ur1); - setzero_eltfp25519_1w(Zr1); - setzero_eltfp25519_1w(Zr2); - Ur1[0] = 1; - Zr1[0] = 1; - Zr2[0] = 1; - - /* G-S */ - Ur2[3] = 0x1eaecdeee27cab34UL; - Ur2[2] = 0xadc7a0b9235d48e2UL; - Ur2[1] = 0xbbf095ae14b2edf8UL; - Ur2[0] = 0x7e94e1fec82faabdUL; - - /* main-loop */ - j = q; - for (i = 0; i < NUM_WORDS_ELTFP25519; ++i) { - while (j < ite[i]) { - u64 bit = (key[i] >> j) & 0x1; - k = (64 * i + j - q); + int i, j, k; + u64 tmp[16 + 32 + 4]; + u64 *x1 = &tmp[0]; + u64 *z1 = &tmp[4]; + u64 *x2 = &tmp[8]; + u64 *z2 = &tmp[12]; + u64 *xz1 = &tmp[0]; + u64 *xz2 = &tmp[8]; + u64 *a = &tmp[0 + 16]; + u64 *b = &tmp[4 + 16]; + u64 *c = &tmp[8 + 16]; + u64 *ab = &tmp[0 + 16]; + u64 *abcd = &tmp[0 + 16]; + u64 *ef = &tmp[16 + 16]; + u64 *efgh = &tmp[16 + 16]; + u64 *key = &tmp[0 + 16 + 32]; + + memcpy(key, priv, 32); + ((u8 *)key)[0] &= 248; + ((u8 *)key)[31] = (((u8 *)key)[31] & 127) | 64; + + x1[0] = 1, x1[1] = x1[2] = x1[3] = 0; + z1[0] = 1, z1[1] = z1[2] = z1[3] = 0; + z2[0] = 1, z2[1] = z2[2] = z2[3] = 0; + memcpy(x2, p_minus_s, sizeof(p_minus_s)); + + j = 3; + for (i = 0; i < 4; ++i) { + while (j < (const int[]){ 64, 64, 64, 63 }[i]) { + u64 bit = (key[i] >> j) & 1; + k = (64 * i + j - 3); swap = swap ^ bit; - cswap(swap, Ur1, Ur2); - cswap(swap, Zr1, Zr2); + cswap2(swap, xz1, xz2); swap = bit; - /* Addition */ - sub_eltfp25519_1w(B, Ur1, Zr1); /* B = Ur1-Zr1 */ - add_eltfp25519_1w_bmi2(A, Ur1, Zr1); /* A = Ur1+Zr1 */ - mul_eltfp25519_1w_bmi2(C, &P[4 * k], B);/* C = M0-B */ - sub_eltfp25519_1w(B, A, C); /* B = (Ur1+Zr1) - M*(Ur1-Zr1) */ - add_eltfp25519_1w_bmi2(A, A, C); /* A = (Ur1+Zr1) + M*(Ur1-Zr1) */ - sqr_eltfp25519_2w_bmi2(AB); /* A = A^2 | B = B^2 */ - mul_eltfp25519_2w_bmi2(UZr1, ZUr2, AB); /* Ur1 = Zr2*A | Zr1 = Ur2*B */ + fsub(b, x1, z1); + fadd(a, x1, z1); + fmul(c, &table_ladder[4 * k], b, ef); + fsub(b, a, c); + fadd(a, a, c); + fsqr2(ab, ab, efgh); + fmul2(xz1, xz2, ab, efgh); ++j; } j = 0; } - /* Doubling */ - for (i = 0; i < q; ++i) { - add_eltfp25519_1w_bmi2(A, Ur1, Zr1); /* A = Ur1+Zr1 */ - sub_eltfp25519_1w(B, Ur1, Zr1); /* B = Ur1-Zr1 */ - sqr_eltfp25519_2w_bmi2(AB); /* A = A**2 B = B**2 */ - copy_eltfp25519_1w(C, B); /* C = B */ - sub_eltfp25519_1w(B, A, B); /* B = A-B */ - mul_a24_eltfp25519_1w(D, B); /* D = my_a24*B */ - add_eltfp25519_1w_bmi2(D, D, C); /* D = D+C */ - mul_eltfp25519_2w_bmi2(UZr1, AB, CD); /* Ur1 = A*B Zr1 = Zr1*A */ - } + point_double(xz1, abcd, efgh); + point_double(xz1, abcd, efgh); + point_double(xz1, abcd, efgh); + encode_point(out, xz1); - /* Convert to affine coordinates */ - inv_eltfp25519_1w_bmi2(A, Zr1); - mul_eltfp25519_1w_bmi2((u64 *)session_key, Ur1, A); - fred_eltfp25519_1w((u64 *)session_key); - - memzero_explicit(&m, sizeof(m)); + memzero_explicit(tmp, sizeof(tmp)); } +static __ro_after_init DEFINE_STATIC_KEY_FALSE(curve25519_use_bmi2_adx); + void curve25519_arch(u8 mypublic[CURVE25519_KEY_SIZE], const u8 secret[CURVE25519_KEY_SIZE], const u8 basepoint[CURVE25519_KEY_SIZE]) { - if (static_branch_likely(&curve25519_use_adx)) - curve25519_adx(mypublic, secret, basepoint); - else if (static_branch_likely(&curve25519_use_bmi2)) - curve25519_bmi2(mypublic, secret, basepoint); + if (static_branch_likely(&curve25519_use_bmi2_adx)) + curve25519_ever64(mypublic, secret, basepoint); else curve25519_generic(mypublic, secret, basepoint); } @@ -2355,10 +1395,8 @@ EXPORT_SYMBOL(curve25519_arch); void curve25519_base_arch(u8 pub[CURVE25519_KEY_SIZE], const u8 secret[CURVE25519_KEY_SIZE]) { - if (static_branch_likely(&curve25519_use_adx)) - curve25519_adx_base(pub, secret); - else if (static_branch_likely(&curve25519_use_bmi2)) - curve25519_bmi2_base(pub, secret); + if (static_branch_likely(&curve25519_use_bmi2_adx)) + curve25519_ever64_base(pub, secret); else curve25519_generic(pub, secret, curve25519_base_point); } @@ -2449,12 +1487,11 @@ static struct kpp_alg curve25519_alg = { .max_size = curve25519_max_size, }; + static int __init curve25519_mod_init(void) { - if (boot_cpu_has(X86_FEATURE_BMI2)) - static_branch_enable(&curve25519_use_bmi2); - else if (boot_cpu_has(X86_FEATURE_ADX)) - static_branch_enable(&curve25519_use_adx); + if (boot_cpu_has(X86_FEATURE_BMI2) && boot_cpu_has(X86_FEATURE_ADX)) + static_branch_enable(&curve25519_use_bmi2_adx); else return 0; return IS_REACHABLE(CONFIG_CRYPTO_KPP) ? @@ -2474,3 +1511,4 @@ module_exit(curve25519_mod_exit); MODULE_ALIAS_CRYPTO("curve25519"); MODULE_ALIAS_CRYPTO("curve25519-x86"); MODULE_LICENSE("GPL v2"); +MODULE_AUTHOR("Jason A. Donenfeld "); -- cgit From fe66a17ecd4912401afd61517eadae2e9e4ce0ae Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 12 Feb 2020 00:10:00 +0100 Subject: x86: platform: iosf_mbi: Call cpu_latency_qos_*() instead of pm_qos_*() Call cpu_latency_qos_add/update/remove_request() instead of pm_qos_add/update/remove_request(), respectively, because the latter are going to be dropped. No intentional functional impact. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Andy Shevchenko Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- arch/x86/platform/intel/iosf_mbi.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/intel/iosf_mbi.c b/arch/x86/platform/intel/iosf_mbi.c index 9e2444500428..526f70f27c1c 100644 --- a/arch/x86/platform/intel/iosf_mbi.c +++ b/arch/x86/platform/intel/iosf_mbi.c @@ -265,7 +265,7 @@ static void iosf_mbi_reset_semaphore(void) iosf_mbi_sem_address, 0, PUNIT_SEMAPHORE_BIT)) dev_err(&mbi_pdev->dev, "Error P-Unit semaphore reset failed\n"); - pm_qos_update_request(&iosf_mbi_pm_qos, PM_QOS_DEFAULT_VALUE); + cpu_latency_qos_update_request(&iosf_mbi_pm_qos, PM_QOS_DEFAULT_VALUE); blocking_notifier_call_chain(&iosf_mbi_pmic_bus_access_notifier, MBI_PMIC_BUS_ACCESS_END, NULL); @@ -301,8 +301,8 @@ static void iosf_mbi_reset_semaphore(void) * 4) When CPU cores enter C6 or C7 the P-Unit needs to talk to the PMIC * if this happens while the kernel itself is accessing the PMIC I2C bus * the SoC hangs. - * As the third step we call pm_qos_update_request() to disallow the CPU - * to enter C6 or C7. + * As the third step we call cpu_latency_qos_update_request() to disallow the + * CPU to enter C6 or C7. * * 5) The P-Unit has a PMIC bus semaphore which we can request to stop * autonomous P-Unit tasks from accessing the PMIC I2C bus while we hold it. @@ -338,7 +338,7 @@ int iosf_mbi_block_punit_i2c_access(void) * requires the P-Unit to talk to the PMIC and if this happens while * we're holding the semaphore, the SoC hangs. */ - pm_qos_update_request(&iosf_mbi_pm_qos, 0); + cpu_latency_qos_update_request(&iosf_mbi_pm_qos, 0); /* host driver writes to side band semaphore register */ ret = iosf_mbi_write(BT_MBI_UNIT_PMC, MBI_REG_WRITE, @@ -547,8 +547,7 @@ static int __init iosf_mbi_init(void) { iosf_debugfs_init(); - pm_qos_add_request(&iosf_mbi_pm_qos, PM_QOS_CPU_DMA_LATENCY, - PM_QOS_DEFAULT_VALUE); + cpu_latency_qos_add_request(&iosf_mbi_pm_qos, PM_QOS_DEFAULT_VALUE); return pci_register_driver(&iosf_mbi_pci_driver); } @@ -561,7 +560,7 @@ static void __exit iosf_mbi_exit(void) pci_dev_put(mbi_pdev); mbi_pdev = NULL; - pm_qos_remove_request(&iosf_mbi_pm_qos); + cpu_latency_qos_remove_request(&iosf_mbi_pm_qos); } module_init(iosf_mbi_init); -- cgit From 7c805795307b40af50a45b7db44dd09ac1700947 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 23 Oct 2019 14:27:14 +0200 Subject: x86/entry: Remove _TIF_NOHZ from _TIF_WORK_SYSCALL_ENTRY Evaluating _TIF_NOHZ to decide whether to use the slow syscall entry path is not only pointless, it's actually counterproductive: 1) Context tracking code is invoked unconditionally before that flag is evaluated. 2) If the flag is set the slow path is invoked for nothing due to #1 Remove it. Signed-off-by: Thomas Gleixner Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Andy Lutomirski Signed-off-by: Frederic Weisbecker --- arch/x86/include/asm/thread_info.h | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index cf4327986e98..6cb9d1b0d1e6 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -133,14 +133,10 @@ struct thread_info { #define _TIF_X32 (1 << TIF_X32) #define _TIF_FSCHECK (1 << TIF_FSCHECK) -/* - * work to do in syscall_trace_enter(). Also includes TIF_NOHZ for - * enter_from_user_mode() - */ +/* Work to do before invoking the actual syscall. */ #define _TIF_WORK_SYSCALL_ENTRY \ (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \ - _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \ - _TIF_NOHZ) + _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT) /* flags to check in __switch_to() */ #define _TIF_WORK_CTXSW_BASE \ -- cgit From 490f561b783dac2c4825e288e6dbbf83481eea34 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Mon, 27 Jan 2020 16:41:52 +0100 Subject: context-tracking: Introduce CONFIG_HAVE_TIF_NOHZ A few archs (x86, arm, arm64) don't rely anymore on TIF_NOHZ to call into context tracking on user entry/exit but instead use static keys (or not) to optimize those calls. Ideally every arch should migrate to that behaviour in the long run. Settle a config option to let those archs remove their TIF_NOHZ definitions. Signed-off-by: Frederic Weisbecker Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Andy Lutomirski Cc: Russell King Cc: Catalin Marinas Cc: Will Deacon Cc: Ralf Baechle Cc: Paul Burton Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: David S. Miller --- arch/x86/Kconfig | 1 + 1 file changed, 1 insertion(+) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index beea77046f9b..549eed3460c9 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -211,6 +211,7 @@ config X86 select HAVE_STACK_VALIDATION if X86_64 select HAVE_RSEQ select HAVE_SYSCALL_TRACEPOINTS + select HAVE_TIF_NOHZ if X86_64 select HAVE_UNSTABLE_SCHED_CLOCK select HAVE_USER_RETURN_NOTIFIER select HAVE_GENERIC_VDSO -- cgit From 68d875131e43a8eed193810c1dfebb027fe2450e Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 28 Jan 2020 13:50:32 +0100 Subject: x86: Remove TIF_NOHZ Static keys have replaced TIF_NOHZ to optimize the calls to context tracking. We can now safely remove that thread flag. Signed-off-by: Frederic Weisbecker Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Andy Lutomirski --- arch/x86/Kconfig | 1 - arch/x86/include/asm/thread_info.h | 2 -- 2 files changed, 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 549eed3460c9..beea77046f9b 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -211,7 +211,6 @@ config X86 select HAVE_STACK_VALIDATION if X86_64 select HAVE_RSEQ select HAVE_SYSCALL_TRACEPOINTS - select HAVE_TIF_NOHZ if X86_64 select HAVE_UNSTABLE_SCHED_CLOCK select HAVE_USER_RETURN_NOTIFIER select HAVE_GENERIC_VDSO diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index 6cb9d1b0d1e6..384cdde10680 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -92,7 +92,6 @@ struct thread_info { #define TIF_NOCPUID 15 /* CPUID is not accessible in userland */ #define TIF_NOTSC 16 /* TSC is not accessible in userland */ #define TIF_IA32 17 /* IA32 compatibility process */ -#define TIF_NOHZ 19 /* in adaptive nohz mode */ #define TIF_MEMDIE 20 /* is terminating due to OOM killer */ #define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */ #define TIF_IO_BITMAP 22 /* uses I/O bitmap */ @@ -122,7 +121,6 @@ struct thread_info { #define _TIF_NOCPUID (1 << TIF_NOCPUID) #define _TIF_NOTSC (1 << TIF_NOTSC) #define _TIF_IA32 (1 << TIF_IA32) -#define _TIF_NOHZ (1 << TIF_NOHZ) #define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG) #define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP) #define _TIF_FORCED_TF (1 << TIF_FORCED_TF) -- cgit From c8e3dd86600a1a7b165478cc626d69bf07967c15 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 11:28:09 -0500 Subject: x86 user stack frame reads: switch to explicit __get_user() rather than relying upon the magic in raw_copy_from_user() Signed-off-by: Al Viro --- arch/x86/events/core.c | 27 +++++++-------------------- arch/x86/include/asm/uaccess.h | 9 --------- arch/x86/kernel/stacktrace.c | 6 ++++-- 3 files changed, 11 insertions(+), 31 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 3bb738f5a472..a619763e96e1 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2490,7 +2490,7 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent /* 32-bit process in 64-bit kernel. */ unsigned long ss_base, cs_base; struct stack_frame_ia32 frame; - const void __user *fp; + const struct stack_frame_ia32 __user *fp; if (!test_thread_flag(TIF_IA32)) return 0; @@ -2501,18 +2501,12 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent fp = compat_ptr(ss_base + regs->bp); pagefault_disable(); while (entry->nr < entry->max_stack) { - unsigned long bytes; - frame.next_frame = 0; - frame.return_address = 0; - if (!valid_user_frame(fp, sizeof(frame))) break; - bytes = __copy_from_user_nmi(&frame.next_frame, fp, 4); - if (bytes != 0) + if (__get_user(frame.next_frame, &fp->next_frame)) break; - bytes = __copy_from_user_nmi(&frame.return_address, fp+4, 4); - if (bytes != 0) + if (__get_user(frame.return_address, &fp->return_address)) break; perf_callchain_store(entry, cs_base + frame.return_address); @@ -2533,7 +2527,7 @@ void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) { struct stack_frame frame; - const unsigned long __user *fp; + const struct stack_frame __user *fp; if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { /* TODO: We don't support guest os callchain now */ @@ -2546,7 +2540,7 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs if (regs->flags & (X86_VM_MASK | PERF_EFLAGS_VM)) return; - fp = (unsigned long __user *)regs->bp; + fp = (void __user *)regs->bp; perf_callchain_store(entry, regs->ip); @@ -2558,19 +2552,12 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs pagefault_disable(); while (entry->nr < entry->max_stack) { - unsigned long bytes; - - frame.next_frame = NULL; - frame.return_address = 0; - if (!valid_user_frame(fp, sizeof(frame))) break; - bytes = __copy_from_user_nmi(&frame.next_frame, fp, sizeof(*fp)); - if (bytes != 0) + if (__get_user(frame.next_frame, &fp->next_frame)) break; - bytes = __copy_from_user_nmi(&frame.return_address, fp + 1, sizeof(*fp)); - if (bytes != 0) + if (__get_user(frame.return_address, &fp->return_address)) break; perf_callchain_store(entry, frame.return_address); diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 61d93f062a36..ab8eab43a8a2 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -694,15 +694,6 @@ extern struct movsl_mask { # include #endif -/* - * We rely on the nested NMI work to allow atomic faults from the NMI path; the - * nested NMI paths are careful to preserve CR2. - * - * Caller must use pagefault_enable/disable, or run in interrupt context, - * and also do a uaccess_ok() check - */ -#define __copy_from_user_nmi __copy_from_user_inatomic - /* * The "unsafe" user accesses aren't really "unsafe", but the naming * is a big fat warning: you have to not only do the access_ok() diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c index 2d6898c2cb64..6ad43fc44556 100644 --- a/arch/x86/kernel/stacktrace.c +++ b/arch/x86/kernel/stacktrace.c @@ -96,7 +96,8 @@ struct stack_frame_user { }; static int -copy_stack_frame(const void __user *fp, struct stack_frame_user *frame) +copy_stack_frame(const struct stack_frame_user __user *fp, + struct stack_frame_user *frame) { int ret; @@ -105,7 +106,8 @@ copy_stack_frame(const void __user *fp, struct stack_frame_user *frame) ret = 1; pagefault_disable(); - if (__copy_from_user_inatomic(frame, fp, sizeof(*frame))) + if (__get_user(frame->next_fp, &fp->next_fp) || + __get_user(frame->ret_addr, &fp->ret_addr)) ret = 0; pagefault_enable(); -- cgit From a4814443993c7c8686036bb8ca0df6abc499d63f Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 11:29:04 -0500 Subject: x86 kvm page table walks: switch to explicit __get_user() Signed-off-by: Al Viro --- arch/x86/kvm/mmu/paging_tmpl.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 4e1ef0473663..5bea4cfe8a15 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -400,7 +400,7 @@ retry_walk: goto error; ptep_user = (pt_element_t __user *)((void *)host_addr + offset); - if (unlikely(__copy_from_user(&pte, ptep_user, sizeof(pte)))) + if (unlikely(__get_user(pte, ptep_user))) goto error; walker->ptep_user[walker->level - 1] = ptep_user; -- cgit From 4d1d0977a2156a1dafe8f1cd890ab918c803485b Mon Sep 17 00:00:00 2001 From: Martin Molnar Date: Sun, 16 Feb 2020 16:17:39 +0100 Subject: x86: Fix a handful of typos Fix a couple of typos in code comments. [ bp: While at it: s/IRQ's/IRQs/. ] Signed-off-by: Martin Molnar Signed-off-by: Borislav Petkov Reviewed-by: Randy Dunlap Link: https://lkml.kernel.org/r/0819a044-c360-44a4-f0b6-3f5bafe2d35c@gmail.com --- arch/x86/kernel/irqinit.c | 2 +- arch/x86/kernel/nmi.c | 4 ++-- arch/x86/kernel/reboot.c | 2 +- arch/x86/kernel/smpboot.c | 2 +- arch/x86/kernel/tsc.c | 2 +- arch/x86/kernel/tsc_sync.c | 2 +- 6 files changed, 7 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index 16919a9671fa..1e5ad12f1061 100644 --- a/arch/x86/kernel/irqinit.c +++ b/arch/x86/kernel/irqinit.c @@ -84,7 +84,7 @@ void __init init_IRQ(void) * On cpu 0, Assign ISA_IRQ_VECTOR(irq) to IRQ 0..15. * If these IRQ's are handled by legacy interrupt-controllers like PIC, * then this configuration will likely be static after the boot. If - * these IRQ's are handled by more mordern controllers like IO-APIC, + * these IRQs are handled by more modern controllers like IO-APIC, * then this vector space can be freed and re-used dynamically as the * irq's migrate etc. */ diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 54c21d6abd5a..6407ea21fa1b 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -403,9 +403,9 @@ static void default_do_nmi(struct pt_regs *regs) * a 'real' unknown NMI. For example, while processing * a perf NMI another perf NMI comes in along with a * 'real' unknown NMI. These two NMIs get combined into - * one (as descibed above). When the next NMI gets + * one (as described above). When the next NMI gets * processed, it will be flagged by perf as handled, but - * noone will know that there was a 'real' unknown NMI sent + * no one will know that there was a 'real' unknown NMI sent * also. As a result it gets swallowed. Or if the first * perf NMI returns two events handled then the second * NMI will get eaten by the logic below, again losing a diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index 0cc7c0b106bb..3ca43be4f9cf 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -531,7 +531,7 @@ static void emergency_vmx_disable_all(void) /* * We need to disable VMX on all CPUs before rebooting, otherwise - * we risk hanging up the machine, because the CPU ignore INIT + * we risk hanging up the machine, because the CPU ignores INIT * signals when VMX is enabled. * * We can't take any locks and we may be on an inconsistent diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 69881b2d446c..3feaeee8a926 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1434,7 +1434,7 @@ early_param("possible_cpus", _setup_possible_cpus); /* * cpu_possible_mask should be static, it cannot change as cpu's * are onlined, or offlined. The reason is per-cpu data-structures - * are allocated by some modules at init time, and dont expect to + * are allocated by some modules at init time, and don't expect to * do this dynamically on cpu arrival/departure. * cpu_present_mask on the other hand can change dynamically. * In case when cpu_hotplug is not compiled, then we resort to current diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 7e322e2daaf5..0109b9d485c6 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -477,7 +477,7 @@ static unsigned long pit_calibrate_tsc(u32 latch, unsigned long ms, int loopmin) * transition from one expected value to another with a fairly * high accuracy, and we didn't miss any events. We can thus * use the TSC value at the transitions to calculate a pretty - * good value for the TSC frequencty. + * good value for the TSC frequency. */ static inline int pit_verify_msb(unsigned char val) { diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c index 32a818764e03..3d3c761eb74a 100644 --- a/arch/x86/kernel/tsc_sync.c +++ b/arch/x86/kernel/tsc_sync.c @@ -295,7 +295,7 @@ static cycles_t check_tsc_warp(unsigned int timeout) * But as the TSC is per-logical CPU and can potentially be modified wrongly * by the bios, TSC sync test for smaller duration should be able * to catch such errors. Also this will catch the condition where all the - * cores in the socket doesn't get reset at the same time. + * cores in the socket don't get reset at the same time. */ static inline unsigned int loop_timeout(int cpu) { -- cgit From 50e818715821b89c7abac90a97721f106e893d83 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 7 Feb 2020 13:38:48 +0100 Subject: x86/vdso: Mark the TSC clocksource path likely Jumping out of line for the TSC clcoksource read is creating awful code. TSC is likely to be the clocksource at least on bare metal and the PV interfaces are sufficiently more work that the jump over the TSC read is just in the noise. Signed-off-by: Thomas Gleixner Reviewed-by: Vincenzo Frascino Link: https://lkml.kernel.org/r/20200207124402.328922847@linutronix.de --- arch/x86/include/asm/vdso/gettimeofday.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/vdso/gettimeofday.h b/arch/x86/include/asm/vdso/gettimeofday.h index 6ee1f7dba34b..264d4fd3ff2c 100644 --- a/arch/x86/include/asm/vdso/gettimeofday.h +++ b/arch/x86/include/asm/vdso/gettimeofday.h @@ -243,7 +243,7 @@ static u64 vread_hvclock(void) static inline u64 __arch_get_hw_counter(s32 clock_mode) { - if (clock_mode == VCLOCK_TSC) + if (likely(clock_mode == VCLOCK_TSC)) return (u64)rdtsc_ordered(); /* * For any memory-mapped vclock type, we need to make sure that gcc -- cgit From eec399dd862762b9594df3659f15839a4e12f17a Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 7 Feb 2020 13:38:54 +0100 Subject: x86/vdso: Move VDSO clocksource state tracking to callback All architectures which use the generic VDSO code have their own storage for the VDSO clock mode. That's pointless and just requires duplicate code. X86 abuses the function which retrieves the architecture specific clock mode storage to mark the clocksource as used in the VDSO. That's silly because this is invoked on every tick when the VDSO data is updated. Move this functionality to the clocksource::enable() callback so it gets invoked once when the clocksource is installed. This allows to make the clock mode storage generic. Signed-off-by: Thomas Gleixner Reviewed-by: Michael Kelley (Hyper-V parts) Reviewed-by: Vincenzo Frascino (VDSO parts) Acked-by: Juergen Gross (Xen parts) Link: https://lkml.kernel.org/r/20200207124402.934519777@linutronix.de --- arch/x86/entry/vdso/vma.c | 4 ++++ arch/x86/include/asm/clocksource.h | 12 ++++++++++++ arch/x86/include/asm/mshyperv.h | 2 ++ arch/x86/include/asm/vdso/vsyscall.h | 10 +--------- arch/x86/include/asm/vgtod.h | 6 ------ arch/x86/kernel/kvmclock.c | 7 +++++++ arch/x86/kernel/tsc.c | 32 ++++++++++++++++++++------------ arch/x86/xen/time.c | 17 ++++++++++++----- 8 files changed, 58 insertions(+), 32 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index c1b8496b5606..cce3e809f17e 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -38,6 +38,8 @@ struct vdso_data *arch_get_vdso_data(void *vvar_page) } #undef EMIT_VVAR +unsigned int vclocks_used __read_mostly; + #if defined(CONFIG_X86_64) unsigned int __read_mostly vdso64_enabled = 1; #endif @@ -445,6 +447,8 @@ __setup("vdso=", vdso_setup); static int __init init_vdso(void) { + BUILD_BUG_ON(VCLOCK_MAX >= 32); + init_vdso_image(&vdso_image_64); #ifdef CONFIG_X86_X32_ABI diff --git a/arch/x86/include/asm/clocksource.h b/arch/x86/include/asm/clocksource.h index dc4cfc888d6d..2450d6e02a5d 100644 --- a/arch/x86/include/asm/clocksource.h +++ b/arch/x86/include/asm/clocksource.h @@ -14,4 +14,16 @@ struct arch_clocksource_data { int vclock_mode; }; +extern unsigned int vclocks_used; + +static inline bool vclock_was_used(int vclock) +{ + return READ_ONCE(vclocks_used) & (1U << vclock); +} + +static inline void vclocks_set_used(unsigned int which) +{ + WRITE_ONCE(vclocks_used, READ_ONCE(vclocks_used) | (1 << which)); +} + #endif /* _ASM_X86_CLOCKSOURCE_H */ diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h index 6b79515abb82..f7cbc01f128e 100644 --- a/arch/x86/include/asm/mshyperv.h +++ b/arch/x86/include/asm/mshyperv.h @@ -47,6 +47,8 @@ typedef int (*hyperv_fill_flush_list_func)( wrmsrl(HV_X64_MSR_REFERENCE_TSC, val) #define hv_set_clocksource_vdso(val) \ ((val).archdata.vclock_mode = VCLOCK_HVCLOCK) +#define hv_enable_vdso_clocksource() \ + vclocks_set_used(VCLOCK_HVCLOCK); #define hv_get_raw_timer() rdtsc_ordered() void hyperv_callback_vector(void); diff --git a/arch/x86/include/asm/vdso/vsyscall.h b/arch/x86/include/asm/vdso/vsyscall.h index 0026ab2123ce..01f5733d14d6 100644 --- a/arch/x86/include/asm/vdso/vsyscall.h +++ b/arch/x86/include/asm/vdso/vsyscall.h @@ -10,8 +10,6 @@ #include #include -int vclocks_used __read_mostly; - DEFINE_VVAR(struct vdso_data, _vdso_data); /* * Update the vDSO data page to keep in sync with kernel timekeeping. @@ -26,13 +24,7 @@ struct vdso_data *__x86_get_k_vdso_data(void) static __always_inline int __x86_get_clock_mode(struct timekeeper *tk) { - int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode; - - /* Mark the new vclock used. */ - BUILD_BUG_ON(VCLOCK_MAX >= 32); - WRITE_ONCE(vclocks_used, READ_ONCE(vclocks_used) | (1 << vclock_mode)); - - return vclock_mode; + return tk->tkr_mono.clock->archdata.vclock_mode; } #define __arch_get_clock_mode __x86_get_clock_mode diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index a2638c6124ed..fc8e4cd342cc 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -15,10 +15,4 @@ typedef u64 gtod_long_t; typedef unsigned long gtod_long_t; #endif -extern int vclocks_used; -static inline bool vclock_was_used(int vclock) -{ - return READ_ONCE(vclocks_used) & (1 << vclock); -} - #endif /* _ASM_X86_VGTOD_H */ diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index 904494b924c1..33f2cac25f13 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -159,12 +159,19 @@ bool kvm_check_and_clear_guest_paused(void) return ret; } +static int kvm_cs_enable(struct clocksource *cs) +{ + vclocks_set_used(VCLOCK_PVCLOCK); + return 0; +} + struct clocksource kvm_clock = { .name = "kvm-clock", .read = kvm_clock_get_cycles, .rating = 400, .mask = CLOCKSOURCE_MASK(64), .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .enable = kvm_cs_enable, }; EXPORT_SYMBOL_GPL(kvm_clock); diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 7e322e2daaf5..742da141a30a 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1108,17 +1108,24 @@ static void tsc_cs_tick_stable(struct clocksource *cs) sched_clock_tick_stable(); } +static int tsc_cs_enable(struct clocksource *cs) +{ + vclocks_set_used(VCLOCK_TSC); + return 0; +} + /* * .mask MUST be CLOCKSOURCE_MASK(64). See comment above read_tsc() */ static struct clocksource clocksource_tsc_early = { - .name = "tsc-early", - .rating = 299, - .read = read_tsc, - .mask = CLOCKSOURCE_MASK(64), - .flags = CLOCK_SOURCE_IS_CONTINUOUS | + .name = "tsc-early", + .rating = 299, + .read = read_tsc, + .mask = CLOCKSOURCE_MASK(64), + .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_MUST_VERIFY, - .archdata = { .vclock_mode = VCLOCK_TSC }, + .archdata = { .vclock_mode = VCLOCK_TSC }, + .enable = tsc_cs_enable, .resume = tsc_resume, .mark_unstable = tsc_cs_mark_unstable, .tick_stable = tsc_cs_tick_stable, @@ -1131,14 +1138,15 @@ static struct clocksource clocksource_tsc_early = { * been found good. */ static struct clocksource clocksource_tsc = { - .name = "tsc", - .rating = 300, - .read = read_tsc, - .mask = CLOCKSOURCE_MASK(64), - .flags = CLOCK_SOURCE_IS_CONTINUOUS | + .name = "tsc", + .rating = 300, + .read = read_tsc, + .mask = CLOCKSOURCE_MASK(64), + .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VALID_FOR_HRES | CLOCK_SOURCE_MUST_VERIFY, - .archdata = { .vclock_mode = VCLOCK_TSC }, + .archdata = { .vclock_mode = VCLOCK_TSC }, + .enable = tsc_cs_enable, .resume = tsc_resume, .mark_unstable = tsc_cs_mark_unstable, .tick_stable = tsc_cs_tick_stable, diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c index befbdd8b17f0..5d1568ff19ea 100644 --- a/arch/x86/xen/time.c +++ b/arch/x86/xen/time.c @@ -145,12 +145,19 @@ static struct notifier_block xen_pvclock_gtod_notifier = { .notifier_call = xen_pvclock_gtod_notify, }; +static int xen_cs_enable(struct clocksource *cs) +{ + vclocks_set_used(VCLOCK_PVCLOCK); + return 0; +} + static struct clocksource xen_clocksource __read_mostly = { - .name = "xen", - .rating = 400, - .read = xen_clocksource_get_cycles, - .mask = ~0, - .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .name = "xen", + .rating = 400, + .read = xen_clocksource_get_cycles, + .mask = CLOCKSOURCE_MASK(64), + .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .enable = xen_cs_enable, }; /* -- cgit From b95a8a27c300d1a39a4e36f63a518ef36e4b966c Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 7 Feb 2020 13:38:56 +0100 Subject: x86/vdso: Use generic VDSO clock mode storage Switch to the generic VDSO clock mode storage. Signed-off-by: Thomas Gleixner Reviewed-by: Vincenzo Frascino (VDSO parts) Acked-by: Juergen Gross (Xen parts) Acked-by: Paolo Bonzini (KVM parts) Link: https://lkml.kernel.org/r/20200207124403.152039903@linutronix.de --- arch/x86/Kconfig | 2 +- arch/x86/entry/vdso/vma.c | 6 +++--- arch/x86/include/asm/clocksource.h | 13 ++++--------- arch/x86/include/asm/mshyperv.h | 4 ++-- arch/x86/include/asm/vdso/gettimeofday.h | 6 +++--- arch/x86/include/asm/vdso/vsyscall.h | 7 ------- arch/x86/kernel/kvmclock.c | 4 ++-- arch/x86/kernel/pvclock.c | 2 +- arch/x86/kernel/time.c | 12 +++--------- arch/x86/kernel/tsc.c | 6 +++--- arch/x86/kvm/trace.h | 4 ++-- arch/x86/kvm/x86.c | 22 +++++++++++----------- arch/x86/xen/time.c | 21 +++++++++++---------- 13 files changed, 46 insertions(+), 63 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index beea77046f9b..698e9c835cc5 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -57,7 +57,6 @@ config X86 select ACPI_LEGACY_TABLES_LOOKUP if ACPI select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI select ARCH_32BIT_OFF_T if X86_32 - select ARCH_CLOCKSOURCE_DATA select ARCH_CLOCKSOURCE_INIT select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI select ARCH_HAS_DEBUG_VIRTUAL @@ -126,6 +125,7 @@ config X86 select GENERIC_STRNLEN_USER select GENERIC_TIME_VSYSCALL select GENERIC_GETTIMEOFDAY + select GENERIC_VDSO_CLOCK_MODE select GENERIC_VDSO_TIME_NS select GUP_GET_PTE_LOW_HIGH if X86_PAE select HARDLOCKUP_CHECK_TIMESTAMP if X86_64 diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index cce3e809f17e..43428cc514c8 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -221,7 +221,7 @@ static vm_fault_t vvar_fault(const struct vm_special_mapping *sm, } else if (sym_offset == image->sym_pvclock_page) { struct pvclock_vsyscall_time_info *pvti = pvclock_get_pvti_cpu0_va(); - if (pvti && vclock_was_used(VCLOCK_PVCLOCK)) { + if (pvti && vclock_was_used(VDSO_CLOCKMODE_PVCLOCK)) { return vmf_insert_pfn_prot(vma, vmf->address, __pa(pvti) >> PAGE_SHIFT, pgprot_decrypted(vma->vm_page_prot)); @@ -229,7 +229,7 @@ static vm_fault_t vvar_fault(const struct vm_special_mapping *sm, } else if (sym_offset == image->sym_hvclock_page) { struct ms_hyperv_tsc_page *tsc_pg = hv_get_tsc_page(); - if (tsc_pg && vclock_was_used(VCLOCK_HVCLOCK)) + if (tsc_pg && vclock_was_used(VDSO_CLOCKMODE_HVCLOCK)) return vmf_insert_pfn(vma, vmf->address, virt_to_phys(tsc_pg) >> PAGE_SHIFT); } else if (sym_offset == image->sym_timens_page) { @@ -447,7 +447,7 @@ __setup("vdso=", vdso_setup); static int __init init_vdso(void) { - BUILD_BUG_ON(VCLOCK_MAX >= 32); + BUILD_BUG_ON(VDSO_CLOCKMODE_MAX >= 32); init_vdso_image(&vdso_image_64); diff --git a/arch/x86/include/asm/clocksource.h b/arch/x86/include/asm/clocksource.h index 2450d6e02a5d..d561db67f96d 100644 --- a/arch/x86/include/asm/clocksource.h +++ b/arch/x86/include/asm/clocksource.h @@ -4,15 +4,10 @@ #ifndef _ASM_X86_CLOCKSOURCE_H #define _ASM_X86_CLOCKSOURCE_H -#define VCLOCK_NONE 0 /* No vDSO clock available. */ -#define VCLOCK_TSC 1 /* vDSO should use vread_tsc. */ -#define VCLOCK_PVCLOCK 2 /* vDSO should use vread_pvclock. */ -#define VCLOCK_HVCLOCK 3 /* vDSO should use vread_hvclock. */ -#define VCLOCK_MAX 3 - -struct arch_clocksource_data { - int vclock_mode; -}; +#define VDSO_ARCH_CLOCKMODES \ + VDSO_CLOCKMODE_TSC, \ + VDSO_CLOCKMODE_PVCLOCK, \ + VDSO_CLOCKMODE_HVCLOCK extern unsigned int vclocks_used; diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h index f7cbc01f128e..edc2c581704a 100644 --- a/arch/x86/include/asm/mshyperv.h +++ b/arch/x86/include/asm/mshyperv.h @@ -46,9 +46,9 @@ typedef int (*hyperv_fill_flush_list_func)( #define hv_set_reference_tsc(val) \ wrmsrl(HV_X64_MSR_REFERENCE_TSC, val) #define hv_set_clocksource_vdso(val) \ - ((val).archdata.vclock_mode = VCLOCK_HVCLOCK) + ((val).vdso_clock_mode = VDSO_CLOCKMODE_HVCLOCK) #define hv_enable_vdso_clocksource() \ - vclocks_set_used(VCLOCK_HVCLOCK); + vclocks_set_used(VDSO_CLOCKMODE_HVCLOCK); #define hv_get_raw_timer() rdtsc_ordered() void hyperv_callback_vector(void); diff --git a/arch/x86/include/asm/vdso/gettimeofday.h b/arch/x86/include/asm/vdso/gettimeofday.h index 264d4fd3ff2c..9a6dc9b4ec99 100644 --- a/arch/x86/include/asm/vdso/gettimeofday.h +++ b/arch/x86/include/asm/vdso/gettimeofday.h @@ -243,7 +243,7 @@ static u64 vread_hvclock(void) static inline u64 __arch_get_hw_counter(s32 clock_mode) { - if (likely(clock_mode == VCLOCK_TSC)) + if (likely(clock_mode == VDSO_CLOCKMODE_TSC)) return (u64)rdtsc_ordered(); /* * For any memory-mapped vclock type, we need to make sure that gcc @@ -252,13 +252,13 @@ static inline u64 __arch_get_hw_counter(s32 clock_mode) * question isn't enabled, which will segfault. Hence the barriers. */ #ifdef CONFIG_PARAVIRT_CLOCK - if (clock_mode == VCLOCK_PVCLOCK) { + if (clock_mode == VDSO_CLOCKMODE_PVCLOCK) { barrier(); return vread_pvclock(); } #endif #ifdef CONFIG_HYPERV_TIMER - if (clock_mode == VCLOCK_HVCLOCK) { + if (clock_mode == VDSO_CLOCKMODE_HVCLOCK) { barrier(); return vread_hvclock(); } diff --git a/arch/x86/include/asm/vdso/vsyscall.h b/arch/x86/include/asm/vdso/vsyscall.h index 01f5733d14d6..be199a9b2676 100644 --- a/arch/x86/include/asm/vdso/vsyscall.h +++ b/arch/x86/include/asm/vdso/vsyscall.h @@ -21,13 +21,6 @@ struct vdso_data *__x86_get_k_vdso_data(void) } #define __arch_get_k_vdso_data __x86_get_k_vdso_data -static __always_inline -int __x86_get_clock_mode(struct timekeeper *tk) -{ - return tk->tkr_mono.clock->archdata.vclock_mode; -} -#define __arch_get_clock_mode __x86_get_clock_mode - /* The asm-generic header needs to be included after the definitions above */ #include diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index 33f2cac25f13..34b18f6eeb2c 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -161,7 +161,7 @@ bool kvm_check_and_clear_guest_paused(void) static int kvm_cs_enable(struct clocksource *cs) { - vclocks_set_used(VCLOCK_PVCLOCK); + vclocks_set_used(VDSO_CLOCKMODE_PVCLOCK); return 0; } @@ -279,7 +279,7 @@ static int __init kvm_setup_vsyscall_timeinfo(void) if (!(flags & PVCLOCK_TSC_STABLE_BIT)) return 0; - kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK; + kvm_clock.vdso_clock_mode = VDSO_CLOCKMODE_PVCLOCK; #endif kvmclock_init_mem(); diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 10125358b9c4..11065dc03f5b 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -145,7 +145,7 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock, void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti) { - WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)); + WARN_ON(vclock_was_used(VDSO_CLOCKMODE_PVCLOCK)); pvti_cpu0_va = pvti; } diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c index d8673d8a779b..4d545db52efc 100644 --- a/arch/x86/kernel/time.c +++ b/arch/x86/kernel/time.c @@ -122,18 +122,12 @@ void __init time_init(void) */ void clocksource_arch_init(struct clocksource *cs) { - if (cs->archdata.vclock_mode == VCLOCK_NONE) + if (cs->vdso_clock_mode == VDSO_CLOCKMODE_NONE) return; - if (cs->archdata.vclock_mode > VCLOCK_MAX) { - pr_warn("clocksource %s registered with invalid vclock_mode %d. Disabling vclock.\n", - cs->name, cs->archdata.vclock_mode); - cs->archdata.vclock_mode = VCLOCK_NONE; - } - if (cs->mask != CLOCKSOURCE_MASK(64)) { - pr_warn("clocksource %s registered with invalid mask %016llx. Disabling vclock.\n", + pr_warn("clocksource %s registered with invalid mask %016llx for VDSO. Disabling VDSO support.\n", cs->name, cs->mask); - cs->archdata.vclock_mode = VCLOCK_NONE; + cs->vdso_clock_mode = VDSO_CLOCKMODE_NONE; } } diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 742da141a30a..971d6f0216df 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1110,7 +1110,7 @@ static void tsc_cs_tick_stable(struct clocksource *cs) static int tsc_cs_enable(struct clocksource *cs) { - vclocks_set_used(VCLOCK_TSC); + vclocks_set_used(VDSO_CLOCKMODE_TSC); return 0; } @@ -1124,7 +1124,7 @@ static struct clocksource clocksource_tsc_early = { .mask = CLOCKSOURCE_MASK(64), .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_MUST_VERIFY, - .archdata = { .vclock_mode = VCLOCK_TSC }, + .vdso_clock_mode = VDSO_CLOCKMODE_TSC, .enable = tsc_cs_enable, .resume = tsc_resume, .mark_unstable = tsc_cs_mark_unstable, @@ -1145,7 +1145,7 @@ static struct clocksource clocksource_tsc = { .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VALID_FOR_HRES | CLOCK_SOURCE_MUST_VERIFY, - .archdata = { .vclock_mode = VCLOCK_TSC }, + .vdso_clock_mode = VDSO_CLOCKMODE_TSC, .enable = tsc_cs_enable, .resume = tsc_resume, .mark_unstable = tsc_cs_mark_unstable, diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index f194dd058470..cef5a344fedb 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -815,8 +815,8 @@ TRACE_EVENT(kvm_write_tsc_offset, #ifdef CONFIG_X86_64 #define host_clocks \ - {VCLOCK_NONE, "none"}, \ - {VCLOCK_TSC, "tsc"} \ + {VDSO_CLOCKMODE_NONE, "none"}, \ + {VDSO_CLOCKMODE_TSC, "tsc"} \ TRACE_EVENT(kvm_update_master_clock, TP_PROTO(bool use_master_clock, unsigned int host_clock, bool offset_matched), diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fb5d64ebc35d..0e7ef2955db3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1631,7 +1631,7 @@ static void update_pvclock_gtod(struct timekeeper *tk) write_seqcount_begin(&vdata->seq); /* copy pvclock gtod data */ - vdata->clock.vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode; + vdata->clock.vclock_mode = tk->tkr_mono.clock->vdso_clock_mode; vdata->clock.cycle_last = tk->tkr_mono.cycle_last; vdata->clock.mask = tk->tkr_mono.mask; vdata->clock.mult = tk->tkr_mono.mult; @@ -1639,7 +1639,7 @@ static void update_pvclock_gtod(struct timekeeper *tk) vdata->clock.base_cycles = tk->tkr_mono.xtime_nsec; vdata->clock.offset = tk->tkr_mono.base; - vdata->raw_clock.vclock_mode = tk->tkr_raw.clock->archdata.vclock_mode; + vdata->raw_clock.vclock_mode = tk->tkr_raw.clock->vdso_clock_mode; vdata->raw_clock.cycle_last = tk->tkr_raw.cycle_last; vdata->raw_clock.mask = tk->tkr_raw.mask; vdata->raw_clock.mult = tk->tkr_raw.mult; @@ -1840,7 +1840,7 @@ static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns) static inline int gtod_is_based_on_tsc(int mode) { - return mode == VCLOCK_TSC || mode == VCLOCK_HVCLOCK; + return mode == VDSO_CLOCKMODE_TSC || mode == VDSO_CLOCKMODE_HVCLOCK; } static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu) @@ -1933,7 +1933,7 @@ static inline bool kvm_check_tsc_unstable(void) * TSC is marked unstable when we're running on Hyper-V, * 'TSC page' clocksource is good. */ - if (pvclock_gtod_data.clock.vclock_mode == VCLOCK_HVCLOCK) + if (pvclock_gtod_data.clock.vclock_mode == VDSO_CLOCKMODE_HVCLOCK) return false; #endif return check_tsc_unstable(); @@ -2088,30 +2088,30 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp, u64 tsc_pg_val; switch (clock->vclock_mode) { - case VCLOCK_HVCLOCK: + case VDSO_CLOCKMODE_HVCLOCK: tsc_pg_val = hv_read_tsc_page_tsc(hv_get_tsc_page(), tsc_timestamp); if (tsc_pg_val != U64_MAX) { /* TSC page valid */ - *mode = VCLOCK_HVCLOCK; + *mode = VDSO_CLOCKMODE_HVCLOCK; v = (tsc_pg_val - clock->cycle_last) & clock->mask; } else { /* TSC page invalid */ - *mode = VCLOCK_NONE; + *mode = VDSO_CLOCKMODE_NONE; } break; - case VCLOCK_TSC: - *mode = VCLOCK_TSC; + case VDSO_CLOCKMODE_TSC: + *mode = VDSO_CLOCKMODE_TSC; *tsc_timestamp = read_tsc(); v = (*tsc_timestamp - clock->cycle_last) & clock->mask; break; default: - *mode = VCLOCK_NONE; + *mode = VDSO_CLOCKMODE_NONE; } - if (*mode == VCLOCK_NONE) + if (*mode == VDSO_CLOCKMODE_NONE) *tsc_timestamp = v = 0; return v * clock->mult; diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c index 5d1568ff19ea..c8897aad13cd 100644 --- a/arch/x86/xen/time.c +++ b/arch/x86/xen/time.c @@ -147,7 +147,7 @@ static struct notifier_block xen_pvclock_gtod_notifier = { static int xen_cs_enable(struct clocksource *cs) { - vclocks_set_used(VCLOCK_PVCLOCK); + vclocks_set_used(VDSO_CLOCKMODE_PVCLOCK); return 0; } @@ -419,12 +419,13 @@ void xen_restore_time_memory_area(void) ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t); /* - * We don't disable VCLOCK_PVCLOCK entirely if it fails to register the - * secondary time info with Xen or if we migrated to a host without the - * necessary flags. On both of these cases what happens is either - * process seeing a zeroed out pvti or seeing no PVCLOCK_TSC_STABLE_BIT - * bit set. Userspace checks the latter and if 0, it discards the data - * in pvti and fallbacks to a system call for a reliable timestamp. + * We don't disable VDSO_CLOCKMODE_PVCLOCK entirely if it fails to + * register the secondary time info with Xen or if we migrated to a + * host without the necessary flags. On both of these cases what + * happens is either process seeing a zeroed out pvti or seeing no + * PVCLOCK_TSC_STABLE_BIT bit set. Userspace checks the latter and + * if 0, it discards the data in pvti and fallbacks to a system + * call for a reliable timestamp. */ if (ret != 0) pr_notice("Cannot restore secondary vcpu_time_info (err %d)", @@ -450,7 +451,7 @@ static void xen_setup_vsyscall_time_info(void) ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t); if (ret) { - pr_notice("xen: VCLOCK_PVCLOCK not supported (err %d)\n", ret); + pr_notice("xen: VDSO_CLOCKMODE_PVCLOCK not supported (err %d)\n", ret); free_page((unsigned long)ti); return; } @@ -467,14 +468,14 @@ static void xen_setup_vsyscall_time_info(void) if (!ret) free_page((unsigned long)ti); - pr_notice("xen: VCLOCK_PVCLOCK not supported (tsc unstable)\n"); + pr_notice("xen: VDSO_CLOCKMODE_PVCLOCK not supported (tsc unstable)\n"); return; } xen_clock = ti; pvclock_set_pvti_cpu0_va(xen_clock); - xen_clocksource.archdata.vclock_mode = VCLOCK_PVCLOCK; + xen_clocksource.vdso_clock_mode = VDSO_CLOCKMODE_PVCLOCK; } static void __init xen_time_init(void) -- cgit From cdcb58cc05ed9a7f74509a023762488c111cdfb3 Mon Sep 17 00:00:00 2001 From: Benjamin Thiel Date: Thu, 23 Jan 2020 14:30:51 +0100 Subject: x86/iopl: Include prototype header for ksys_ioperm() .. in order to fix a -Wmissing-prototype warning. No functional change. Signed-off-by: Benjamin Thiel Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200123133051.5974-1-b.thiel@posteo.de --- arch/x86/kernel/ioport.c | 1 + 1 file changed, 1 insertion(+) (limited to 'arch/x86') diff --git a/arch/x86/kernel/ioport.c b/arch/x86/kernel/ioport.c index 8abeee0dd7bf..a53e7b4a7419 100644 --- a/arch/x86/kernel/ioport.c +++ b/arch/x86/kernel/ioport.c @@ -13,6 +13,7 @@ #include #include +#include #ifdef CONFIG_X86_IOPL_IOPERM -- cgit From 99ce3255fddf08fca9249488793c9b5c909ff0ab Mon Sep 17 00:00:00 2001 From: Benjamin Thiel Date: Thu, 23 Jan 2020 16:27:54 +0100 Subject: x86/syscalls: Add prototypes for C syscall callbacks .. in order to fix a couple of -Wmissing-prototypes warnings. No functional change. [ bp: Massage commit message and drop newlines. ] Signed-off-by: Benjamin Thiel Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200123152754.20149-1-b.thiel@posteo.de --- arch/x86/entry/common.c | 1 + arch/x86/include/asm/syscall.h | 5 +++++ 2 files changed, 6 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 9747876980b5..ec167d8c41cb 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -34,6 +34,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS #include diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index 8db3fdb6102e..d20ffc518cf4 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -168,6 +168,11 @@ static inline int syscall_get_arch(struct task_struct *task) task->thread_info.status & TS_COMPAT) ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64; } + +void do_syscall_64(unsigned long nr, struct pt_regs *regs); +void do_int80_syscall_32(struct pt_regs *regs); +long do_fast_syscall_32(struct pt_regs *regs); + #endif /* CONFIG_X86_32 */ #endif /* _ASM_X86_SYSCALL_H */ -- cgit From b10c307f6f314c068814d0e23c86f06d5d57004b Mon Sep 17 00:00:00 2001 From: Benjamin Thiel Date: Thu, 23 Jan 2020 18:29:45 +0100 Subject: x86/cpu: Move prototype for get_umwait_control_msr() to a global location .. in order to fix a -Wmissing-prototypes warning. No functional change. Signed-off-by: Benjamin Thiel Signed-off-by: Borislav Petkov Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20200123172945.7235-1-b.thiel@posteo.de --- arch/x86/include/asm/mwait.h | 2 ++ arch/x86/kernel/cpu/umwait.c | 1 + arch/x86/kvm/vmx/vmx.c | 1 + arch/x86/kvm/vmx/vmx.h | 2 -- 4 files changed, 4 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h index 9d5252c9685c..b809f117f3f4 100644 --- a/arch/x86/include/asm/mwait.h +++ b/arch/x86/include/asm/mwait.h @@ -23,6 +23,8 @@ #define MWAITX_MAX_LOOPS ((u32)-1) #define MWAITX_DISABLE_CSTATES 0xf0 +u32 get_umwait_control_msr(void); + static inline void __monitor(const void *eax, unsigned long ecx, unsigned long edx) { diff --git a/arch/x86/kernel/cpu/umwait.c b/arch/x86/kernel/cpu/umwait.c index c222f283b456..300e3fd5ade3 100644 --- a/arch/x86/kernel/cpu/umwait.c +++ b/arch/x86/kernel/cpu/umwait.c @@ -4,6 +4,7 @@ #include #include +#include #define UMWAIT_C02_ENABLE 0 diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 9a6664886f2e..2068cdae3ede 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 7f42cf3dcd70..b4e14ed66ca9 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -14,8 +14,6 @@ extern const u32 vmx_msr_index[]; extern u64 host_efer; -extern u32 get_umwait_control_msr(void); - #define MSR_TYPE_R 1 #define MSR_TYPE_W 2 #define MSR_TYPE_RW 3 -- cgit From 1e5d8e1e47afde23e3249aed25d7d124feff5c1c Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Sun, 16 Feb 2020 12:01:04 -0800 Subject: x86/mm: Introduce CONFIG_NUMA_KEEP_MEMINFO Currently x86 numa_meminfo is marked __initdata in the CONFIG_MEMORY_HOTPLUG=n case. In support of a new facility to allow drivers to map reserved memory to a 'target_node' (phys_to_target_node()), add support for removing the __initdata designation for those users. Both memory hotplug and phys_to_target_node() users select CONFIG_NUMA_KEEP_MEMINFO to tell the arch to maintain its physical address to NUMA mapping infrastructure post init. Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Reviewed-by: Ingo Molnar Signed-off-by: Dan Williams Reviewed-by: Thomas Gleixner Link: https://lore.kernel.org/r/158188326422.894464.15742054998046628934.stgit@dwillia2-desk3.amr.corp.intel.com --- arch/x86/mm/numa.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 99f7a68738f0..2450b21cc28a 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -25,11 +25,7 @@ nodemask_t numa_nodes_parsed __initdata; struct pglist_data *node_data[MAX_NUMNODES] __read_mostly; EXPORT_SYMBOL(node_data); -static struct numa_meminfo numa_meminfo -#ifndef CONFIG_MEMORY_HOTPLUG -__initdata -#endif -; +static struct numa_meminfo numa_meminfo __initdata_or_meminfo; static int numa_distance_cnt; static u8 *numa_distance; -- cgit From f86fd32db706613fe8d0104057efa6e83e0d7e8f Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 7 Feb 2020 13:38:59 +0100 Subject: lib/vdso: Cleanup clock mode storage leftovers Now that all architectures are converted to use the generic storage the helpers and conditionals can be removed. Signed-off-by: Thomas Gleixner Tested-by: Vincenzo Frascino Reviewed-by: Vincenzo Frascino Link: https://lkml.kernel.org/r/20200207124403.470699892@linutronix.de --- arch/x86/Kconfig | 1 - 1 file changed, 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 698e9c835cc5..8b995db3d10f 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -125,7 +125,6 @@ config X86 select GENERIC_STRNLEN_USER select GENERIC_TIME_VSYSCALL select GENERIC_GETTIMEOFDAY - select GENERIC_VDSO_CLOCK_MODE select GENERIC_VDSO_TIME_NS select GUP_GET_PTE_LOW_HIGH if X86_PAE select HARDLOCKUP_CHECK_TIMESTAMP if X86_64 -- cgit From 5d30f92e7631286b8617777c5400c8eadcae50a1 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Sun, 16 Feb 2020 12:01:09 -0800 Subject: x86/NUMA: Provide a range-to-target_node lookup facility The DEV_DAX_KMEM facility is a generic mechanism to allow device-dax instances, fronting performance-differentiated-memory like pmem, to be added to the System RAM pool. The NUMA node for that hot-added memory is derived from the device-dax instance's 'target_node' attribute. Recall that the 'target_node' is the ACPI-PXM-to-node translation for memory when it comes online whereas the 'numa_node' attribute of the device represents the closest online cpu node. Presently useful target_node information from the ACPI SRAT is discarded with the expectation that "Reserved" memory will never be onlined. Now, DEV_DAX_KMEM violates that assumption, there is a need to retain the translation. Move, rather than discard, numa_memblk data to a secondary array that memory_add_physaddr_to_target_node() may consider at a later point in time. Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Reported-by: kbuild test robot Reviewed-by: Ingo Molnar Signed-off-by: Dan Williams Reviewed-by: Thomas Gleixner Link: https://lore.kernel.org/r/158188326978.894464.217282995221175417.stgit@dwillia2-desk3.amr.corp.intel.com --- arch/x86/mm/numa.c | 61 +++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 51 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 2450b21cc28a..59ba008504dc 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -26,6 +26,7 @@ struct pglist_data *node_data[MAX_NUMNODES] __read_mostly; EXPORT_SYMBOL(node_data); static struct numa_meminfo numa_meminfo __initdata_or_meminfo; +static struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; static int numa_distance_cnt; static u8 *numa_distance; @@ -164,6 +165,19 @@ void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi) (mi->nr_blks - idx) * sizeof(mi->blk[0])); } +/** + * numa_move_tail_memblk - Move a numa_memblk from one numa_meminfo to another + * @dst: numa_meminfo to append block to + * @idx: Index of memblk to remove + * @src: numa_meminfo to remove memblk from + */ +static void __init numa_move_tail_memblk(struct numa_meminfo *dst, int idx, + struct numa_meminfo *src) +{ + dst->blk[dst->nr_blks++] = src->blk[idx]; + numa_remove_memblk_from(idx, src); +} + /** * numa_add_memblk - Add one numa_memblk to numa_meminfo * @nid: NUMA node ID of the new memblk @@ -233,14 +247,19 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi) for (i = 0; i < mi->nr_blks; i++) { struct numa_memblk *bi = &mi->blk[i]; - /* make sure all blocks are inside the limits */ + /* move / save reserved memory ranges */ + if (!memblock_overlaps_region(&memblock.memory, + bi->start, bi->end - bi->start)) { + numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi); + continue; + } + + /* make sure all non-reserved blocks are inside the limits */ bi->start = max(bi->start, low); bi->end = min(bi->end, high); - /* and there's no empty or non-exist block */ - if (bi->start >= bi->end || - !memblock_overlaps_region(&memblock.memory, - bi->start, bi->end - bi->start)) + /* and there's no empty block */ + if (bi->start >= bi->end) numa_remove_memblk_from(i--, mi); } @@ -877,16 +896,38 @@ EXPORT_SYMBOL(cpumask_of_node); #endif /* !CONFIG_DEBUG_PER_CPU_MAPS */ -#ifdef CONFIG_MEMORY_HOTPLUG -int memory_add_physaddr_to_nid(u64 start) +#ifdef CONFIG_NUMA_KEEP_MEMINFO +static int meminfo_to_nid(struct numa_meminfo *mi, u64 start) { - struct numa_meminfo *mi = &numa_meminfo; - int nid = mi->blk[0].nid; int i; for (i = 0; i < mi->nr_blks; i++) if (mi->blk[i].start <= start && mi->blk[i].end > start) - nid = mi->blk[i].nid; + return mi->blk[i].nid; + return NUMA_NO_NODE; +} + +int phys_to_target_node(phys_addr_t start) +{ + int nid = meminfo_to_nid(&numa_meminfo, start); + + /* + * Prefer online nodes, but if reserved memory might be + * hot-added continue the search with reserved ranges. + */ + if (nid != NUMA_NO_NODE) + return nid; + + return meminfo_to_nid(&numa_reserved_meminfo, start); +} +EXPORT_SYMBOL_GPL(phys_to_target_node); + +int memory_add_physaddr_to_nid(u64 start) +{ + int nid = meminfo_to_nid(&numa_meminfo, start); + + if (nid == NUMA_NO_NODE) + nid = numa_meminfo.blk[0].nid; return nid; } EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); -- cgit From 7b27a8622f802761d5c6abd6c37b22312a35343c Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Sun, 16 Feb 2020 12:01:16 -0800 Subject: libnvdimm/e820: Retrieve and populate correct 'target_node' info Use the new phys_to_target_node() and numa_map_to_online_node() helpers to retrieve the correct id for the 'numa_node' ("local" / online initiator node) and 'target_node' (offline target memory node) sysfs attributes. Below is an example from a 4 NUMA node system where all the memory on node2 is pmem / reserved. It should be noted that with the arrival of the ACPI HMAT table and EFI Specific Purpose Memory the kernel will start to see more platforms with reserved / performance differentiated memory in its own NUMA node. Hence all the stakeholders on the Cc for what is ostensibly a libnvdimm local patch. === Before === /* Notice no online memory on node2 at start */ # numactl --hardware available: 3 nodes (0-1,3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 node 0 size: 3958 MB node 0 free: 3708 MB node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 node 1 size: 4027 MB node 1 free: 3871 MB node 3 cpus: node 3 size: 3994 MB node 3 free: 3971 MB node distances: node 0 1 3 0: 10 21 21 1: 21 10 21 3: 21 21 10 /* * Put the pmem namespace into devdax mode so it can be assigned to the * kmem driver */ # ndctl create-namespace -e namespace0.0 -m devdax -f { "dev":"namespace0.0", "mode":"devdax", "map":"dev", "size":"3.94 GiB (4.23 GB)", "uuid":"1650af9b-9ba3-4704-acd6-10178399d9a3", [..] } /* Online Persistent Memory as System RAM */ # daxctl reconfigure-device --mode=system-ram dax0.0 libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success [ { "chardev":"dax0.0", "size":4225761280, "target_node":0, "mode":"system-ram" } ] reconfigured 1 device /* Note that the memory is onlined by default to the wrong node, node0 */ # numactl --hardware available: 3 nodes (0-1,3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 node 0 size: 7926 MB node 0 free: 7655 MB node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 node 1 size: 4027 MB node 1 free: 3871 MB node 3 cpus: node 3 size: 3994 MB node 3 free: 3971 MB node distances: node 0 1 3 0: 10 21 21 1: 21 10 21 3: 21 21 10 === After === /* Notice that the "phys_index" error messages are gone */ # daxctl reconfigure-device --mode=system-ram dax0.0 [ { "chardev":"dax0.0", "size":4225761280, "target_node":2, "mode":"system-ram" } ] reconfigured 1 device /* Notice that node2 is now correctly populated */ # numactl --hardware available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 node 0 size: 3958 MB node 0 free: 3793 MB node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 node 1 size: 4027 MB node 1 free: 3851 MB node 2 cpus: node 2 size: 3968 MB node 2 free: 3968 MB node 3 cpus: node 3 size: 3994 MB node 3 free: 3908 MB node distances: node 0 1 2 3 0: 10 21 21 21 1: 21 10 21 21 2: 21 21 10 21 3: 21 21 21 10 Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Ira Weiny Cc: Vishal Verma Cc: Christoph Hellwig Reviewed-by: Ingo Molnar Signed-off-by: Dan Williams Link: https://lore.kernel.org/r/158188327614.894464.13122730362187722603.stgit@dwillia2-desk3.amr.corp.intel.com --- arch/x86/Kconfig | 1 + 1 file changed, 1 insertion(+) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index beea77046f9b..d4e446daf457 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1664,6 +1664,7 @@ config X86_PMEM_LEGACY depends on PHYS_ADDR_T_64BIT depends on BLK_DEV select X86_PMEM_LEGACY_DEVICE + select NUMA_KEEP_MEMINFO if NUMA select LIBNVDIMM help Treat memory marked using the non-standard e820 type of 12 as used -- cgit From 3ee372ccce4d4e7c610748d0583979d3ed3a0cf4 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Thu, 9 Jan 2020 10:02:17 -0500 Subject: x86/boot/compressed/64: Remove .bss/.pgtable from bzImage Commit 5b11f1cee579 ("x86, boot: straighten out ranges to copy/zero in compressed/head*.S") introduced a separate .pgtable section, splitting it out from the rest of .bss. This section was added without the writeable flag, marking it as read-only. This results in the linker putting the .rela.dyn section (containing bogus dynamic relocations from head_64.o) after the .bss and .pgtable sections. When objcopy is used to convert compressed/vmlinux into a binary for the bzImage: $ objcopy -O binary -R .note -R .comment -S arch/x86/boot/compressed/vmlinux \ arch/x86/boot/vmlinux.bin the .bss and .pgtable sections get materialized as ~176KiB of zero bytes in the binary in order to place .rela.dyn at the correct location. Fix this by marking .pgtable as writeable. This moves the .rela.dyn section up in the ELF image layout so that .bss and .pgtable are the last allocated sections and so don't appear in bzImage. [ bp: Massage commit message. ] Signed-off-by: Arvind Sankar Signed-off-by: Borislav Petkov Acked-by: Kees Cook Link: https://lkml.kernel.org/r/20200109150218.16544-1-nivedita@alum.mit.edu --- arch/x86/boot/compressed/head_64.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index 68f31c48d6c2..c8ee6eff13ef 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -645,7 +645,7 @@ SYM_DATA_END_LABEL(boot_stack, SYM_L_LOCAL, boot_stack_end) /* * Space for page tables (not in .bss so not zeroed) */ - .section ".pgtable","a",@nobits + .section ".pgtable","aw",@nobits .balign 4096 SYM_DATA_LOCAL(pgtable, .fill BOOT_PGT_SIZE, 1, 0) -- cgit From 2976908e4198aa02fc3f76802358f69396267189 Mon Sep 17 00:00:00 2001 From: Prarit Bhargava Date: Wed, 19 Feb 2020 08:16:11 -0500 Subject: x86/mce: Do not log spurious corrected mce errors A user has reported that they are seeing spurious corrected errors on their hardware. Intel Errata HSD131, HSM142, HSW131, and BDM48 report that "spurious corrected errors may be logged in the IA32_MC0_STATUS register with the valid field (bit 63) set, the uncorrected error field (bit 61) not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and an MCA Error Code (bits [15:0]) of 0x0005." The Errata PDFs are linked in the bugzilla below. Block these spurious errors from the console and logs. [ bp: Move the intel_filter_mce() header declarations into the already existing CONFIG_X86_MCE_INTEL ifdeffery. ] Co-developed-by: Alexander Krupp Signed-off-by: Alexander Krupp Signed-off-by: Prarit Bhargava Signed-off-by: Borislav Petkov Link: https://bugzilla.kernel.org/show_bug.cgi?id=206587 Link: https://lkml.kernel.org/r/20200219131611.36816-1-prarit@redhat.com --- arch/x86/kernel/cpu/mce/core.c | 2 ++ arch/x86/kernel/cpu/mce/intel.c | 17 +++++++++++++++++ arch/x86/kernel/cpu/mce/internal.h | 2 ++ 3 files changed, 21 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 2c4f949611e4..fe3983d551cc 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1877,6 +1877,8 @@ bool filter_mce(struct mce *m) { if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) return amd_filter_mce(m); + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + return intel_filter_mce(m); return false; } diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c index 5627b1091b85..989148e6746c 100644 --- a/arch/x86/kernel/cpu/mce/intel.c +++ b/arch/x86/kernel/cpu/mce/intel.c @@ -520,3 +520,20 @@ void mce_intel_feature_clear(struct cpuinfo_x86 *c) { intel_clear_lmce(); } + +bool intel_filter_mce(struct mce *m) +{ + struct cpuinfo_x86 *c = &boot_cpu_data; + + /* MCE errata HSD131, HSM142, HSW131, BDM48, and HSM142 */ + if ((c->x86 == 6) && + ((c->x86_model == INTEL_FAM6_HASWELL) || + (c->x86_model == INTEL_FAM6_HASWELL_L) || + (c->x86_model == INTEL_FAM6_BROADWELL) || + (c->x86_model == INTEL_FAM6_HASWELL_G)) && + (m->bank == 0) && + ((m->status & 0xa0000000ffffffff) == 0x80000000000f0005)) + return true; + + return false; +} diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h index b785c0d0b590..97db18441d2c 100644 --- a/arch/x86/kernel/cpu/mce/internal.h +++ b/arch/x86/kernel/cpu/mce/internal.h @@ -48,6 +48,7 @@ void cmci_disable_bank(int bank); void intel_init_cmci(void); void intel_init_lmce(void); void intel_clear_lmce(void); +bool intel_filter_mce(struct mce *m); #else # define cmci_intel_adjust_timer mce_adjust_timer_default static inline bool mce_intel_cmci_poll(void) { return false; } @@ -56,6 +57,7 @@ static inline void cmci_disable_bank(int bank) { } static inline void intel_init_cmci(void) { } static inline void intel_init_lmce(void) { } static inline void intel_clear_lmce(void) { } +static inline bool intel_filter_mce(struct mce *m) { return false; }; #endif void mce_timer_kick(unsigned long interval); -- cgit From 6650cdd9a8ccf00555dbbe743d58541ad8feb6a7 Mon Sep 17 00:00:00 2001 From: "Peter Zijlstra (Intel)" Date: Sun, 26 Jan 2020 12:05:35 -0800 Subject: x86/split_lock: Enable split lock detection by kernel A split-lock occurs when an atomic instruction operates on data that spans two cache lines. In order to maintain atomicity the core takes a global bus lock. This is typically >1000 cycles slower than an atomic operation within a cache line. It also disrupts performance on other cores (which must wait for the bus lock to be released before their memory operations can complete). For real-time systems this may mean missing deadlines. For other systems it may just be very annoying. Some CPUs have the capability to raise an #AC trap when a split lock is attempted. Provide a command line option to give the user choices on how to handle this: split_lock_detect= off - not enabled (no traps for split locks) warn - warn once when an application does a split lock, but allow it to continue running. fatal - Send SIGBUS to applications that cause split lock On systems that support split lock detection the default is "warn". Note that if the kernel hits a split lock in any mode other than "off" it will OOPs. One implementation wrinkle is that the MSR to control the split lock detection is per-core, not per thread. This might result in some short lived races on HT systems in "warn" mode if Linux tries to enable on one thread while disabling on the other. Race analysis by Sean Christopherson: - Toggling of split-lock is only done in "warn" mode. Worst case scenario of a race is that a misbehaving task will generate multiple #AC exceptions on the same instruction. And this race will only occur if both siblings are running tasks that generate split-lock #ACs, e.g. a race where sibling threads are writing different values will only occur if CPUx is disabling split-lock after an #AC and CPUy is re-enabling split-lock after *its* previous task generated an #AC. - Transitioning between off/warn/fatal modes at runtime isn't supported and disabling is tracked per task, so hardware will always reach a steady state that matches the configured mode. I.e. split-lock is guaranteed to be enabled in hardware once all _TIF_SLD threads have been scheduled out. Signed-off-by: Peter Zijlstra (Intel) Co-developed-by: Fenghua Yu Signed-off-by: Fenghua Yu Co-developed-by: Tony Luck Signed-off-by: Tony Luck Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Link: https://lore.kernel.org/r/20200126200535.GB30377@agluck-desk2.amr.corp.intel.com --- arch/x86/include/asm/cpu.h | 12 +++ arch/x86/include/asm/cpufeatures.h | 2 + arch/x86/include/asm/msr-index.h | 9 ++ arch/x86/include/asm/thread_info.h | 4 +- arch/x86/kernel/cpu/common.c | 2 + arch/x86/kernel/cpu/intel.c | 175 +++++++++++++++++++++++++++++++++++++ arch/x86/kernel/process.c | 3 + arch/x86/kernel/traps.c | 24 ++++- 8 files changed, 228 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h index adc6cc86b062..ff6f3ca649b3 100644 --- a/arch/x86/include/asm/cpu.h +++ b/arch/x86/include/asm/cpu.h @@ -40,4 +40,16 @@ int mwait_usable(const struct cpuinfo_x86 *); unsigned int x86_family(unsigned int sig); unsigned int x86_model(unsigned int sig); unsigned int x86_stepping(unsigned int sig); +#ifdef CONFIG_CPU_SUP_INTEL +extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c); +extern void switch_to_sld(unsigned long tifn); +extern bool handle_user_split_lock(struct pt_regs *regs, long error_code); +#else +static inline void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) {} +static inline void switch_to_sld(unsigned long tifn) {} +static inline bool handle_user_split_lock(struct pt_regs *regs, long error_code) +{ + return false; +} +#endif #endif /* _ASM_X86_CPU_H */ diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index f3327cb56edf..cd56ad5d308e 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -285,6 +285,7 @@ #define X86_FEATURE_CQM_MBM_LOCAL (11*32+ 3) /* LLC Local MBM monitoring */ #define X86_FEATURE_FENCE_SWAPGS_USER (11*32+ 4) /* "" LFENCE in user entry SWAPGS path */ #define X86_FEATURE_FENCE_SWAPGS_KERNEL (11*32+ 5) /* "" LFENCE in kernel entry SWAPGS path */ +#define X86_FEATURE_SPLIT_LOCK_DETECT (11*32+ 6) /* #AC for split lock */ /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */ #define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* AVX512 BFLOAT16 instructions */ @@ -367,6 +368,7 @@ #define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */ #define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */ #define X86_FEATURE_ARCH_CAPABILITIES (18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */ +#define X86_FEATURE_CORE_CAPABILITIES (18*32+30) /* "" IA32_CORE_CAPABILITIES MSR */ #define X86_FEATURE_SPEC_CTRL_SSBD (18*32+31) /* "" Speculative Store Bypass Disable */ /* diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index ebe1685e92dd..8821697a7549 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -41,6 +41,10 @@ /* Intel MSRs. Some also available on other CPUs */ +#define MSR_TEST_CTRL 0x00000033 +#define MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT 29 +#define MSR_TEST_CTRL_SPLIT_LOCK_DETECT BIT(MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT) + #define MSR_IA32_SPEC_CTRL 0x00000048 /* Speculation Control */ #define SPEC_CTRL_IBRS BIT(0) /* Indirect Branch Restricted Speculation */ #define SPEC_CTRL_STIBP_SHIFT 1 /* Single Thread Indirect Branch Predictor (STIBP) bit */ @@ -70,6 +74,11 @@ */ #define MSR_IA32_UMWAIT_CONTROL_TIME_MASK (~0x03U) +/* Abbreviated from Intel SDM name IA32_CORE_CAPABILITIES */ +#define MSR_IA32_CORE_CAPS 0x000000cf +#define MSR_IA32_CORE_CAPS_SPLIT_LOCK_DETECT_BIT 5 +#define MSR_IA32_CORE_CAPS_SPLIT_LOCK_DETECT BIT(MSR_IA32_CORE_CAPS_SPLIT_LOCK_DETECT_BIT) + #define MSR_PKG_CST_CONFIG_CONTROL 0x000000e2 #define NHM_C3_AUTO_DEMOTE (1UL << 25) #define NHM_C1_AUTO_DEMOTE (1UL << 26) diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index cf4327986e98..f807930bd763 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -92,6 +92,7 @@ struct thread_info { #define TIF_NOCPUID 15 /* CPUID is not accessible in userland */ #define TIF_NOTSC 16 /* TSC is not accessible in userland */ #define TIF_IA32 17 /* IA32 compatibility process */ +#define TIF_SLD 18 /* Restore split lock detection on context switch */ #define TIF_NOHZ 19 /* in adaptive nohz mode */ #define TIF_MEMDIE 20 /* is terminating due to OOM killer */ #define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */ @@ -122,6 +123,7 @@ struct thread_info { #define _TIF_NOCPUID (1 << TIF_NOCPUID) #define _TIF_NOTSC (1 << TIF_NOTSC) #define _TIF_IA32 (1 << TIF_IA32) +#define _TIF_SLD (1 << TIF_SLD) #define _TIF_NOHZ (1 << TIF_NOHZ) #define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG) #define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP) @@ -145,7 +147,7 @@ struct thread_info { /* flags to check in __switch_to() */ #define _TIF_WORK_CTXSW_BASE \ (_TIF_NOCPUID | _TIF_NOTSC | _TIF_BLOCKSTEP | \ - _TIF_SSBD | _TIF_SPEC_FORCE_UPDATE) + _TIF_SSBD | _TIF_SPEC_FORCE_UPDATE | _TIF_SLD) /* * Avoid calls to __switch_to_xtra() on UP as STIBP is not evaluated. diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 52c9bfbbdb2a..bf282d7805de 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1224,6 +1224,8 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c) cpu_set_bug_bits(c); + cpu_set_core_cap_bits(c); + fpu__init_system(c); #ifdef CONFIG_X86_32 diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index be82cd5841c3..db3e745e5d47 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -19,6 +19,8 @@ #include #include #include +#include +#include #ifdef CONFIG_X86_64 #include @@ -31,6 +33,19 @@ #include #endif +enum split_lock_detect_state { + sld_off = 0, + sld_warn, + sld_fatal, +}; + +/* + * Default to sld_off because most systems do not support split lock detection + * split_lock_setup() will switch this to sld_warn on systems that support + * split lock detect, unless there is a command line override. + */ +static enum split_lock_detect_state sld_state = sld_off; + /* * Processors which have self-snooping capability can handle conflicting * memory type across CPUs by snooping its own cache. However, there exists @@ -570,6 +585,8 @@ static void init_intel_misc_features(struct cpuinfo_x86 *c) wrmsrl(MSR_MISC_FEATURES_ENABLES, msr); } +static void split_lock_init(void); + static void init_intel(struct cpuinfo_x86 *c) { early_init_intel(c); @@ -684,6 +701,8 @@ static void init_intel(struct cpuinfo_x86 *c) tsx_enable(); if (tsx_ctrl_state == TSX_CTRL_DISABLE) tsx_disable(); + + split_lock_init(); } #ifdef CONFIG_X86_32 @@ -945,3 +964,159 @@ static const struct cpu_dev intel_cpu_dev = { }; cpu_dev_register(intel_cpu_dev); + +#undef pr_fmt +#define pr_fmt(fmt) "x86/split lock detection: " fmt + +static const struct { + const char *option; + enum split_lock_detect_state state; +} sld_options[] __initconst = { + { "off", sld_off }, + { "warn", sld_warn }, + { "fatal", sld_fatal }, +}; + +static inline bool match_option(const char *arg, int arglen, const char *opt) +{ + int len = strlen(opt); + + return len == arglen && !strncmp(arg, opt, len); +} + +static void __init split_lock_setup(void) +{ + char arg[20]; + int i, ret; + + setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT); + sld_state = sld_warn; + + ret = cmdline_find_option(boot_command_line, "split_lock_detect", + arg, sizeof(arg)); + if (ret >= 0) { + for (i = 0; i < ARRAY_SIZE(sld_options); i++) { + if (match_option(arg, ret, sld_options[i].option)) { + sld_state = sld_options[i].state; + break; + } + } + } + + switch (sld_state) { + case sld_off: + pr_info("disabled\n"); + break; + + case sld_warn: + pr_info("warning about user-space split_locks\n"); + break; + + case sld_fatal: + pr_info("sending SIGBUS on user-space split_locks\n"); + break; + } +} + +/* + * Locking is not required at the moment because only bit 29 of this + * MSR is implemented and locking would not prevent that the operation + * of one thread is immediately undone by the sibling thread. + * Use the "safe" versions of rdmsr/wrmsr here because although code + * checks CPUID and MSR bits to make sure the TEST_CTRL MSR should + * exist, there may be glitches in virtualization that leave a guest + * with an incorrect view of real h/w capabilities. + */ +static bool __sld_msr_set(bool on) +{ + u64 test_ctrl_val; + + if (rdmsrl_safe(MSR_TEST_CTRL, &test_ctrl_val)) + return false; + + if (on) + test_ctrl_val |= MSR_TEST_CTRL_SPLIT_LOCK_DETECT; + else + test_ctrl_val &= ~MSR_TEST_CTRL_SPLIT_LOCK_DETECT; + + return !wrmsrl_safe(MSR_TEST_CTRL, test_ctrl_val); +} + +static void split_lock_init(void) +{ + if (sld_state == sld_off) + return; + + if (__sld_msr_set(true)) + return; + + /* + * If this is anything other than the boot-cpu, you've done + * funny things and you get to keep whatever pieces. + */ + pr_warn("MSR fail -- disabled\n"); + sld_state = sld_off; +} + +bool handle_user_split_lock(struct pt_regs *regs, long error_code) +{ + if ((regs->flags & X86_EFLAGS_AC) || sld_state == sld_fatal) + return false; + + pr_warn_ratelimited("#AC: %s/%d took a split_lock trap at address: 0x%lx\n", + current->comm, current->pid, regs->ip); + + /* + * Disable the split lock detection for this task so it can make + * progress and set TIF_SLD so the detection is re-enabled via + * switch_to_sld() when the task is scheduled out. + */ + __sld_msr_set(false); + set_tsk_thread_flag(current, TIF_SLD); + return true; +} + +/* + * This function is called only when switching between tasks with + * different split-lock detection modes. It sets the MSR for the + * mode of the new task. This is right most of the time, but since + * the MSR is shared by hyperthreads on a physical core there can + * be glitches when the two threads need different modes. + */ +void switch_to_sld(unsigned long tifn) +{ + __sld_msr_set(!(tifn & _TIF_SLD)); +} + +#define SPLIT_LOCK_CPU(model) {X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY} + +/* + * The following processors have the split lock detection feature. But + * since they don't have the IA32_CORE_CAPABILITIES MSR, the feature cannot + * be enumerated. Enable it by family and model matching on these + * processors. + */ +static const struct x86_cpu_id split_lock_cpu_ids[] __initconst = { + SPLIT_LOCK_CPU(INTEL_FAM6_ICELAKE_X), + SPLIT_LOCK_CPU(INTEL_FAM6_ICELAKE_L), + {} +}; + +void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) +{ + u64 ia32_core_caps = 0; + + if (c->x86_vendor != X86_VENDOR_INTEL) + return; + if (cpu_has(c, X86_FEATURE_CORE_CAPABILITIES)) { + /* Enumerate features reported in IA32_CORE_CAPABILITIES MSR. */ + rdmsrl(MSR_IA32_CORE_CAPS, ia32_core_caps); + } else if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) { + /* Enumerate split lock detection by family and model. */ + if (x86_match_cpu(split_lock_cpu_ids)) + ia32_core_caps |= MSR_IA32_CORE_CAPS_SPLIT_LOCK_DETECT; + } + + if (ia32_core_caps & MSR_IA32_CORE_CAPS_SPLIT_LOCK_DETECT) + split_lock_setup(); +} diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 839b5244e3b7..a43c32868c3c 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -650,6 +650,9 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p) /* Enforce MSR update to ensure consistent state */ __speculation_ctrl_update(~tifn, tifn); } + + if ((tifp ^ tifn) & _TIF_SLD) + switch_to_sld(tifn); } /* diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 6ef00eb6fbb9..0ef5befaed7d 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -46,6 +46,7 @@ #include #include #include +#include #include #include #include @@ -242,7 +243,6 @@ do_trap(int trapnr, int signr, char *str, struct pt_regs *regs, { struct task_struct *tsk = current; - if (!do_trap_no_signal(tsk, trapnr, str, regs, error_code)) return; @@ -288,9 +288,29 @@ DO_ERROR(X86_TRAP_OLD_MF, SIGFPE, 0, NULL, "coprocessor segment overru DO_ERROR(X86_TRAP_TS, SIGSEGV, 0, NULL, "invalid TSS", invalid_TSS) DO_ERROR(X86_TRAP_NP, SIGBUS, 0, NULL, "segment not present", segment_not_present) DO_ERROR(X86_TRAP_SS, SIGBUS, 0, NULL, "stack segment", stack_segment) -DO_ERROR(X86_TRAP_AC, SIGBUS, BUS_ADRALN, NULL, "alignment check", alignment_check) #undef IP +dotraplinkage void do_alignment_check(struct pt_regs *regs, long error_code) +{ + char *str = "alignment check"; + + RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU"); + + if (notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_AC, SIGBUS) == NOTIFY_STOP) + return; + + if (!user_mode(regs)) + die("Split lock detected\n", regs, error_code); + + local_irq_enable(); + + if (handle_user_split_lock(regs, error_code)) + return; + + do_trap(X86_TRAP_AC, SIGBUS, "alignment check", regs, + error_code, BUS_ADRALN, NULL); +} + #ifdef CONFIG_VMAP_STACK __visible void __noreturn handle_stack_overflow(const char *message, struct pt_regs *regs, -- cgit From 67a6af7ad1d161dbc9c139e868d5549e632923f7 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 2 Feb 2020 12:13:47 -0500 Subject: x86/boot: Remove KEEP_SEGMENTS support Commit a24e785111a3 ("i386: paravirt boot sequence") added this flag for use by paravirtualized environments such as Xen. However, Xen never made use of this flag [1], and it was only ever used by lguest [2]. Commit ecda85e70277 ("x86/lguest: Remove lguest support") removed lguest, so KEEP_SEGMENTS has lost its last user. [1] https://lore.kernel.org/lkml/4D4B097C.5050405@goop.org [2] https://www.mail-archive.com/lguest@lists.ozlabs.org/msg00469.html Signed-off-by: Arvind Sankar Link: https://lore.kernel.org/r/20200202171353.3736319-2-nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel --- arch/x86/boot/compressed/head_32.S | 8 -------- arch/x86/boot/compressed/head_64.S | 8 -------- arch/x86/kernel/head_32.S | 6 ------ 3 files changed, 22 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S index 73f17d0544dd..cb2cb91fce45 100644 --- a/arch/x86/boot/compressed/head_32.S +++ b/arch/x86/boot/compressed/head_32.S @@ -63,13 +63,6 @@ __HEAD SYM_FUNC_START(startup_32) cld - /* - * Test KEEP_SEGMENTS flag to see if the bootloader is asking - * us to not reload segments - */ - testb $KEEP_SEGMENTS, BP_loadflags(%esi) - jnz 1f - cli movl $__BOOT_DS, %eax movl %eax, %ds @@ -77,7 +70,6 @@ SYM_FUNC_START(startup_32) movl %eax, %fs movl %eax, %gs movl %eax, %ss -1: /* * Calculate the delta between where we were compiled to run diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index 1f1f6c8139b3..bd44d89540d3 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -53,19 +53,11 @@ SYM_FUNC_START(startup_32) * all need to be under the 4G limit. */ cld - /* - * Test KEEP_SEGMENTS flag to see if the bootloader is asking - * us to not reload segments - */ - testb $KEEP_SEGMENTS, BP_loadflags(%esi) - jnz 1f - cli movl $(__BOOT_DS), %eax movl %eax, %ds movl %eax, %es movl %eax, %ss -1: /* * Calculate the delta between where we were compiled to run diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S index 3923ab4630d7..f66a6b90f954 100644 --- a/arch/x86/kernel/head_32.S +++ b/arch/x86/kernel/head_32.S @@ -67,11 +67,6 @@ __HEAD SYM_CODE_START(startup_32) movl pa(initial_stack),%ecx - /* test KEEP_SEGMENTS flag to see if the bootloader is asking - us to not reload segments */ - testb $KEEP_SEGMENTS, BP_loadflags(%esi) - jnz 2f - /* * Set segments to known values. */ @@ -82,7 +77,6 @@ SYM_CODE_START(startup_32) movl %eax,%fs movl %eax,%gs movl %eax,%ss -2: leal -__PAGE_OFFSET(%ecx),%esp /* -- cgit From 90ff226281e1083988a42cfc51f89d91734cc55e Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 2 Feb 2020 12:13:48 -0500 Subject: efi/x86: Don't depend on firmware GDT layout When booting in mixed mode, the firmware's GDT is still installed at handover entry in efi32_stub_entry. We save the GDTR for later use in __efi64_thunk but we are assuming that descriptor 2 (__KERNEL_CS) is a valid 32-bit code segment descriptor and that descriptor 3 (__KERNEL_DS/__BOOT_DS) is a valid data segment descriptor. This happens to be true for OVMF (it actually uses descriptor 1 for data segments, but descriptor 3 is also setup as data), but we shouldn't depend on this being the case. Fix this by saving the code and data selectors in addition to the GDTR in efi32_stub_entry, and restoring them in __efi64_thunk before calling the firmware. The UEFI specification guarantees that selectors will be flat, so using the DS selector for all the segment registers should be enough. We also need to install our own GDT before initializing segment registers in startup_32, so move the GDT load up to the beginning of the function. [ardb: mention mixed mode in the commit log] Signed-off-by: Arvind Sankar Link: https://lore.kernel.org/r/20200202171353.3736319-3-nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel --- arch/x86/boot/compressed/efi_thunk_64.S | 29 ++++++++++++++++++++++++----- arch/x86/boot/compressed/head_64.S | 30 ++++++++++++++++++------------ 2 files changed, 42 insertions(+), 17 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/efi_thunk_64.S b/arch/x86/boot/compressed/efi_thunk_64.S index 8fb7f6799c52..2b2049259619 100644 --- a/arch/x86/boot/compressed/efi_thunk_64.S +++ b/arch/x86/boot/compressed/efi_thunk_64.S @@ -54,11 +54,16 @@ SYM_FUNC_START(__efi64_thunk) * Switch to gdt with 32-bit segments. This is the firmware GDT * that was installed when the kernel started executing. This * pointer was saved at the EFI stub entry point in head_64.S. + * + * Pass the saved DS selector to the 32-bit code, and use far return to + * restore the saved CS selector. */ leaq efi32_boot_gdt(%rip), %rax lgdt (%rax) - pushq $__KERNEL_CS + movzwl efi32_boot_ds(%rip), %edx + movzwq efi32_boot_cs(%rip), %rax + pushq %rax leaq efi_enter32(%rip), %rax pushq %rax lretq @@ -73,6 +78,10 @@ SYM_FUNC_START(__efi64_thunk) movl %ebx, %es pop %rbx movl %ebx, %ds + /* Clear out 32-bit selector from FS and GS */ + xorl %ebx, %ebx + movl %ebx, %fs + movl %ebx, %gs /* * Convert 32-bit status code into 64-bit. @@ -92,10 +101,12 @@ SYM_FUNC_END(__efi64_thunk) * The stack should represent the 32-bit calling convention. */ SYM_FUNC_START_LOCAL(efi_enter32) - movl $__KERNEL_DS, %eax - movl %eax, %ds - movl %eax, %es - movl %eax, %ss + /* Load firmware selector into data and stack segment registers */ + movl %edx, %ds + movl %edx, %es + movl %edx, %fs + movl %edx, %gs + movl %edx, %ss /* Reload pgtables */ movl %cr3, %eax @@ -157,6 +168,14 @@ SYM_DATA_START(efi32_boot_gdt) .quad 0 SYM_DATA_END(efi32_boot_gdt) +SYM_DATA_START(efi32_boot_cs) + .word 0 +SYM_DATA_END(efi32_boot_cs) + +SYM_DATA_START(efi32_boot_ds) + .word 0 +SYM_DATA_END(efi32_boot_ds) + SYM_DATA_START(efi_gdt64) .word efi_gdt64_end - efi_gdt64 .long 0 /* Filled out by user */ diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index bd44d89540d3..c56b30bd9c7b 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -54,10 +54,6 @@ SYM_FUNC_START(startup_32) */ cld cli - movl $(__BOOT_DS), %eax - movl %eax, %ds - movl %eax, %es - movl %eax, %ss /* * Calculate the delta between where we were compiled to run @@ -72,10 +68,20 @@ SYM_FUNC_START(startup_32) 1: popl %ebp subl $1b, %ebp + /* Load new GDT with the 64bit segments using 32bit descriptor */ + addl %ebp, gdt+2(%ebp) + lgdt gdt(%ebp) + + /* Load segment registers with our descriptors */ + movl $__BOOT_DS, %eax + movl %eax, %ds + movl %eax, %es + movl %eax, %fs + movl %eax, %gs + movl %eax, %ss + /* setup a stack and make sure cpu supports long mode. */ - movl $boot_stack_end, %eax - addl %ebp, %eax - movl %eax, %esp + leal boot_stack_end(%ebp), %esp call verify_cpu testl %eax, %eax @@ -112,10 +118,6 @@ SYM_FUNC_START(startup_32) * Prepare for entering 64 bit mode */ - /* Load new GDT with the 64bit segments using 32bit descriptor */ - addl %ebp, gdt+2(%ebp) - lgdt gdt(%ebp) - /* Enable PAE mode */ movl %cr4, %eax orl $X86_CR4_PAE, %eax @@ -232,9 +234,13 @@ SYM_FUNC_START(efi32_stub_entry) movl %ecx, efi32_boot_args(%ebp) movl %edx, efi32_boot_args+4(%ebp) - sgdtl efi32_boot_gdt(%ebp) movb $0, efi_is64(%ebp) + /* Save firmware GDTR and code/data selectors */ + sgdtl efi32_boot_gdt(%ebp) + movw %cs, efi32_boot_cs(%ebp) + movw %ds, efi32_boot_ds(%ebp) + /* Disable paging */ movl %cr0, %eax btrl $X86_CR0_PG_BIT, %eax -- cgit From 32d009137a5646947d450da6fa641a1f4dc1e42c Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 2 Feb 2020 12:13:49 -0500 Subject: x86/boot: Reload GDTR after copying to the end of the buffer The GDT may get overwritten during the copy or during extract_kernel, which will cause problems if any segment register is touched before the GDTR is reloaded by the decompressed kernel. For safety update the GDTR to point to the GDT within the copied kernel. Signed-off-by: Arvind Sankar Link: https://lore.kernel.org/r/20200202171353.3736319-4-nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel --- arch/x86/boot/compressed/head_64.S | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index c56b30bd9c7b..27eb2a6786db 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -439,6 +439,16 @@ trampoline_return: cld popq %rsi + /* + * The GDT may get overwritten either during the copy we just did or + * during extract_kernel below. To avoid any issues, repoint the GDTR + * to the new copy of the GDT. + */ + leaq gdt64(%rbx), %rax + subq %rbp, 2(%rax) + addq %rbx, 2(%rax) + lgdt (%rax) + /* * Jump to the relocated address. */ -- cgit From cae0e431a02cd63fecaf677ae166f184644125a7 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 2 Feb 2020 12:13:50 -0500 Subject: x86/boot: Clear direction and interrupt flags in startup_64 startup_32 already clears these flags on entry, do it in startup_64 as well for consistency. The direction flag in particular is not specified to be cleared in the boot protocol documentation, and we currently call into C code (paging_prepare) without explicitly clearing it. Signed-off-by: Arvind Sankar Link: https://lore.kernel.org/r/20200202171353.3736319-5-nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel --- arch/x86/boot/compressed/head_64.S | 3 +++ 1 file changed, 3 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index 27eb2a6786db..69cc6c68741e 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -264,6 +264,9 @@ SYM_CODE_START(startup_64) * and command line. */ + cld + cli + /* Setup data segments. */ xorl %eax, %eax movl %eax, %ds -- cgit From ef5a7b5eb13ed88ba9690ab27def3a085332cc8c Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 2 Feb 2020 12:13:51 -0500 Subject: efi/x86: Remove GDT setup from efi_main The 64-bit kernel will already load a GDT in startup_64, which is the next function to execute after return from efi_main. Add GDT setup code to the 32-bit kernel's startup_32 as well. Doing it in the head code has the advantage that we can avoid potentially corrupting the GDT during copy/decompression. This also removes dependence on having a specific GDT layout setup by the bootloader. Both startup_32 and startup_64 now clear interrupts on entry, so we can remove that from efi_main as well. Signed-off-by: Arvind Sankar Link: https://lore.kernel.org/r/20200202171353.3736319-6-nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel --- arch/x86/boot/compressed/eboot.c | 103 ------------------------------------- arch/x86/boot/compressed/head_32.S | 40 +++++++++++--- 2 files changed, 34 insertions(+), 109 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c index 287393d725f0..c92fe0b75cec 100644 --- a/arch/x86/boot/compressed/eboot.c +++ b/arch/x86/boot/compressed/eboot.c @@ -712,10 +712,8 @@ struct boot_params *efi_main(efi_handle_t handle, efi_system_table_t *sys_table_arg, struct boot_params *boot_params) { - struct desc_ptr *gdt = NULL; struct setup_header *hdr = &boot_params->hdr; efi_status_t status; - struct desc_struct *desc; unsigned long cmdline_paddr; sys_table = sys_table_arg; @@ -753,20 +751,6 @@ struct boot_params *efi_main(efi_handle_t handle, setup_quirks(boot_params); - status = efi_bs_call(allocate_pool, EFI_LOADER_DATA, sizeof(*gdt), - (void **)&gdt); - if (status != EFI_SUCCESS) { - efi_printk("Failed to allocate memory for 'gdt' structure\n"); - goto fail; - } - - gdt->size = 0x800; - status = efi_low_alloc(gdt->size, 8, (unsigned long *)&gdt->address); - if (status != EFI_SUCCESS) { - efi_printk("Failed to allocate memory for 'gdt'\n"); - goto fail; - } - /* * If the kernel isn't already loaded at the preferred load * address, relocate it. @@ -793,93 +777,6 @@ struct boot_params *efi_main(efi_handle_t handle, goto fail; } - memset((char *)gdt->address, 0x0, gdt->size); - desc = (struct desc_struct *)gdt->address; - - /* The first GDT is a dummy. */ - desc++; - - if (IS_ENABLED(CONFIG_X86_64)) { - /* __KERNEL32_CS */ - desc->limit0 = 0xffff; - desc->base0 = 0x0000; - desc->base1 = 0x0000; - desc->type = SEG_TYPE_CODE | SEG_TYPE_EXEC_READ; - desc->s = DESC_TYPE_CODE_DATA; - desc->dpl = 0; - desc->p = 1; - desc->limit1 = 0xf; - desc->avl = 0; - desc->l = 0; - desc->d = SEG_OP_SIZE_32BIT; - desc->g = SEG_GRANULARITY_4KB; - desc->base2 = 0x00; - - desc++; - } else { - /* Second entry is unused on 32-bit */ - desc++; - } - - /* __KERNEL_CS */ - desc->limit0 = 0xffff; - desc->base0 = 0x0000; - desc->base1 = 0x0000; - desc->type = SEG_TYPE_CODE | SEG_TYPE_EXEC_READ; - desc->s = DESC_TYPE_CODE_DATA; - desc->dpl = 0; - desc->p = 1; - desc->limit1 = 0xf; - desc->avl = 0; - - if (IS_ENABLED(CONFIG_X86_64)) { - desc->l = 1; - desc->d = 0; - } else { - desc->l = 0; - desc->d = SEG_OP_SIZE_32BIT; - } - desc->g = SEG_GRANULARITY_4KB; - desc->base2 = 0x00; - desc++; - - /* __KERNEL_DS */ - desc->limit0 = 0xffff; - desc->base0 = 0x0000; - desc->base1 = 0x0000; - desc->type = SEG_TYPE_DATA | SEG_TYPE_READ_WRITE; - desc->s = DESC_TYPE_CODE_DATA; - desc->dpl = 0; - desc->p = 1; - desc->limit1 = 0xf; - desc->avl = 0; - desc->l = 0; - desc->d = SEG_OP_SIZE_32BIT; - desc->g = SEG_GRANULARITY_4KB; - desc->base2 = 0x00; - desc++; - - if (IS_ENABLED(CONFIG_X86_64)) { - /* Task segment value */ - desc->limit0 = 0x0000; - desc->base0 = 0x0000; - desc->base1 = 0x0000; - desc->type = SEG_TYPE_TSS; - desc->s = 0; - desc->dpl = 0; - desc->p = 1; - desc->limit1 = 0x0; - desc->avl = 0; - desc->l = 0; - desc->d = 0; - desc->g = SEG_GRANULARITY_4KB; - desc->base2 = 0x00; - desc++; - } - - asm volatile("cli"); - asm volatile ("lgdt %0" : : "m" (*gdt)); - return boot_params; fail: efi_printk("efi_main() failed!\n"); diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S index cb2cb91fce45..356060c5332c 100644 --- a/arch/x86/boot/compressed/head_32.S +++ b/arch/x86/boot/compressed/head_32.S @@ -64,12 +64,6 @@ SYM_FUNC_START(startup_32) cld cli - movl $__BOOT_DS, %eax - movl %eax, %ds - movl %eax, %es - movl %eax, %fs - movl %eax, %gs - movl %eax, %ss /* * Calculate the delta between where we were compiled to run @@ -84,6 +78,19 @@ SYM_FUNC_START(startup_32) 1: popl %ebp subl $1b, %ebp + /* Load new GDT */ + leal gdt(%ebp), %eax + movl %eax, 2(%eax) + lgdt (%eax) + + /* Load segment registers with our descriptors */ + movl $__BOOT_DS, %eax + movl %eax, %ds + movl %eax, %es + movl %eax, %fs + movl %eax, %gs + movl %eax, %ss + /* * %ebp contains the address we are loaded at by the boot loader and %ebx * contains the address where we should move the kernel image temporarily @@ -129,6 +136,16 @@ SYM_FUNC_START(startup_32) cld popl %esi + /* + * The GDT may get overwritten either during the copy we just did or + * during extract_kernel below. To avoid any issues, repoint the GDTR + * to the new copy of the GDT. EAX still contains the previously + * calculated relocation offset of init_size - _end. + */ + leal gdt(%ebx), %edx + addl %eax, 2(%edx) + lgdt (%edx) + /* * Jump to the relocated address. */ @@ -201,6 +218,17 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated) jmp *%eax SYM_FUNC_END(.Lrelocated) + .data + .balign 8 +SYM_DATA_START_LOCAL(gdt) + .word gdt_end - gdt - 1 + .long 0 + .word 0 + .quad 0x0000000000000000 /* Reserved */ + .quad 0x00cf9a000000ffff /* __KERNEL_CS */ + .quad 0x00cf92000000ffff /* __KERNEL_DS */ +SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end) + /* * Stack and heap for uncompression */ -- cgit From b75e2b076d00751579c73cfbbc8a7eac7d2a0468 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 2 Feb 2020 12:13:52 -0500 Subject: x86/boot: GDT limit value should be size - 1 The limit value for the GDTR should be such that adding it to the base address gives the address of the last byte of the GDT, i.e. it should be one less than the size, not the size. Signed-off-by: Arvind Sankar Link: https://lore.kernel.org/r/20200202171353.3736319-7-nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel --- arch/x86/boot/compressed/head_64.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index 69cc6c68741e..c36e6156b6a3 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -624,12 +624,12 @@ SYM_FUNC_END(.Lno_longmode) .data SYM_DATA_START_LOCAL(gdt64) - .word gdt_end - gdt + .word gdt_end - gdt - 1 .quad 0 SYM_DATA_END(gdt64) .balign 8 SYM_DATA_START_LOCAL(gdt) - .word gdt_end - gdt + .word gdt_end - gdt - 1 .long gdt .word 0 .quad 0x00cf9a000000ffff /* __KERNEL32_CS */ -- cgit From 8a3abe30de9fffec8b44adeb78f93ecb0f09b0c5 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 2 Feb 2020 12:13:53 -0500 Subject: x86/boot: Micro-optimize GDT loading instructions Rearrange the instructions a bit to use a 32-bit displacement once instead of 2/3 times. This saves 8 bytes of machine code. Signed-off-by: Arvind Sankar Link: https://lore.kernel.org/r/20200202171353.3736319-8-nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel --- arch/x86/boot/compressed/head_64.S | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index c36e6156b6a3..a4f5561c1c0e 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -69,8 +69,9 @@ SYM_FUNC_START(startup_32) subl $1b, %ebp /* Load new GDT with the 64bit segments using 32bit descriptor */ - addl %ebp, gdt+2(%ebp) - lgdt gdt(%ebp) + leal gdt(%ebp), %eax + movl %eax, 2(%eax) + lgdt (%eax) /* Load segment registers with our descriptors */ movl $__BOOT_DS, %eax @@ -355,9 +356,9 @@ SYM_CODE_START(startup_64) */ /* Make sure we have GDT with 32-bit code segment */ - leaq gdt(%rip), %rax - movq %rax, gdt64+2(%rip) - lgdt gdt64(%rip) + leaq gdt64(%rip), %rax + addq %rax, 2(%rax) + lgdt (%rax) /* * paging_prepare() sets up the trampoline and checks if we need to @@ -625,12 +626,12 @@ SYM_FUNC_END(.Lno_longmode) .data SYM_DATA_START_LOCAL(gdt64) .word gdt_end - gdt - 1 - .quad 0 + .quad gdt - gdt64 SYM_DATA_END(gdt64) .balign 8 SYM_DATA_START_LOCAL(gdt) .word gdt_end - gdt - 1 - .long gdt + .long 0 .word 0 .quad 0x00cf9a000000ffff /* __KERNEL32_CS */ .quad 0x00af9a000000ffff /* __KERNEL_CS */ -- cgit From f32ea1cd124c9a8b847e33123d156cb55699fa51 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Thu, 30 Jan 2020 17:20:04 -0500 Subject: efi/x86: Mark setup_graphics static This function is only called from efi_main in the same source file. Signed-off-by: Arvind Sankar Link: https://lore.kernel.org/r/20200130222004.1932152-1-nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel --- arch/x86/boot/compressed/eboot.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c index c92fe0b75cec..32423e83ba8f 100644 --- a/arch/x86/boot/compressed/eboot.c +++ b/arch/x86/boot/compressed/eboot.c @@ -315,7 +315,7 @@ free_handle: return status; } -void setup_graphics(struct boot_params *boot_params) +static void setup_graphics(struct boot_params *boot_params) { efi_guid_t graphics_proto = EFI_GRAPHICS_OUTPUT_PROTOCOL_GUID; struct screen_info *si; -- cgit From e6d832ea9ac63316ee72df5e9f21698cfd486698 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Mon, 10 Feb 2020 17:02:30 +0100 Subject: efi/libstub/x86: Remove pointless zeroing of apm_bios_info We have some code in the EFI stub entry point that takes the address of the apm_bios_info struct in the newly allocated and zeroed out boot_params structure, only to zero it out again. This is pointless so remove it. Signed-off-by: Ard Biesheuvel --- arch/x86/boot/compressed/eboot.c | 5 ----- 1 file changed, 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c index 32423e83ba8f..4745a1ee7953 100644 --- a/arch/x86/boot/compressed/eboot.c +++ b/arch/x86/boot/compressed/eboot.c @@ -358,7 +358,6 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle, efi_system_table_t *sys_table_arg) { struct boot_params *boot_params; - struct apm_bios_info *bi; struct setup_header *hdr; efi_loaded_image_t *image; efi_guid_t proto = LOADED_IMAGE_PROTOCOL_GUID; @@ -389,7 +388,6 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle, memset(boot_params, 0x0, 0x4000); hdr = &boot_params->hdr; - bi = &boot_params->apm_bios_info; /* Copy the second sector to boot_params */ memcpy(&hdr->jump, image->image_base + 512, 512); @@ -416,9 +414,6 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle, hdr->ramdisk_image = 0; hdr->ramdisk_size = 0; - /* Clear APM BIOS info */ - memset(bi, 0, sizeof(*bi)); - status = efi_parse_options(cmdline_ptr); if (status != EFI_SUCCESS) goto fail2; -- cgit From 04a7d0e15606769ef58d5cee912c5d08d93ded92 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Mon, 10 Feb 2020 17:02:31 +0100 Subject: efi/libstub/x86: Avoid overflowing code32_start on PE entry When using the native PE entry point (as opposed to the EFI handover protocol entry point that is used more widely), we set code32_start, which is a 32-bit wide field, to the effective symbol address of startup_32, which could overflow given that the EFI loader may have located the running image anywhere in memory, and we haven't reached the point yet where we relocate ourselves. Since we relocate ourselves if code32_start != pref_address, this isn't likely to lead to problems in practice, given how unlikely it is that the truncated effective address of startup_32 happens to equal pref_address. But it is better to defer the assignment of code32_start to after the relocation, when it is guaranteed to fit. While at it, move the call to efi_relocate_kernel() to an earlier stage so it is more likely that our preferred offset in memory has not been occupied by other memory allocations done in the mean time. Signed-off-by: Ard Biesheuvel --- arch/x86/boot/compressed/eboot.c | 40 ++++++++++++++++++---------------------- 1 file changed, 18 insertions(+), 22 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c index 4745a1ee7953..55637135b50c 100644 --- a/arch/x86/boot/compressed/eboot.c +++ b/arch/x86/boot/compressed/eboot.c @@ -439,8 +439,6 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle, boot_params->ext_ramdisk_image = (u64)ramdisk_addr >> 32; boot_params->ext_ramdisk_size = (u64)ramdisk_size >> 32; - hdr->code32_start = (u32)(unsigned long)startup_32; - efi_stub_entry(handle, sys_table, boot_params); /* not reached */ @@ -707,6 +705,7 @@ struct boot_params *efi_main(efi_handle_t handle, efi_system_table_t *sys_table_arg, struct boot_params *boot_params) { + unsigned long bzimage_addr = (unsigned long)startup_32; struct setup_header *hdr = &boot_params->hdr; efi_status_t status; unsigned long cmdline_paddr; @@ -717,6 +716,23 @@ struct boot_params *efi_main(efi_handle_t handle, if (sys_table->hdr.signature != EFI_SYSTEM_TABLE_SIGNATURE) goto fail; + /* + * If the kernel isn't already loaded at the preferred load + * address, relocate it. + */ + if (bzimage_addr != hdr->pref_address) { + status = efi_relocate_kernel(&bzimage_addr, + hdr->init_size, hdr->init_size, + hdr->pref_address, + hdr->kernel_alignment, + LOAD_PHYSICAL_ADDR); + if (status != EFI_SUCCESS) { + efi_printk("efi_relocate_kernel() failed!\n"); + goto fail; + } + } + hdr->code32_start = (u32)bzimage_addr; + /* * make_boot_params() may have been called before efi_main(), in which * case this is the second time we parse the cmdline. This is ok, @@ -746,26 +762,6 @@ struct boot_params *efi_main(efi_handle_t handle, setup_quirks(boot_params); - /* - * If the kernel isn't already loaded at the preferred load - * address, relocate it. - */ - if (hdr->pref_address != hdr->code32_start) { - unsigned long bzimage_addr = hdr->code32_start; - status = efi_relocate_kernel(&bzimage_addr, - hdr->init_size, hdr->init_size, - hdr->pref_address, - hdr->kernel_alignment, - LOAD_PHYSICAL_ADDR); - if (status != EFI_SUCCESS) { - efi_printk("efi_relocate_kernel() failed!\n"); - goto fail; - } - - hdr->pref_address = hdr->code32_start; - hdr->code32_start = bzimage_addr; - } - status = exit_boot(boot_params, handle); if (status != EFI_SUCCESS) { efi_printk("exit_boot() failed!\n"); -- cgit From c2d0b470154c5be39f253da7814742030635f300 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Mon, 10 Feb 2020 17:02:36 +0100 Subject: efi/libstub/x86: Incorporate eboot.c into libstub Most of the EFI stub source files of all architectures reside under drivers/firmware/efi/libstub, where they share a Makefile with special CFLAGS and an include file with declarations that are only relevant for stub code. Currently, we carry a lot of stub specific stuff in linux/efi.h only because eboot.c in arch/x86 needs them as well. So let's move eboot.c into libstub/, and move the contents of eboot.h that we still care about into efistub.h Signed-off-by: Ard Biesheuvel --- arch/x86/boot/compressed/Makefile | 5 +- arch/x86/boot/compressed/eboot.c | 777 -------------------------------------- arch/x86/boot/compressed/eboot.h | 31 -- 3 files changed, 1 insertion(+), 812 deletions(-) delete mode 100644 arch/x86/boot/compressed/eboot.c delete mode 100644 arch/x86/boot/compressed/eboot.h (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile index 26050ae0b27e..e51879bdc51c 100644 --- a/arch/x86/boot/compressed/Makefile +++ b/arch/x86/boot/compressed/Makefile @@ -87,10 +87,7 @@ endif vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o -$(obj)/eboot.o: KBUILD_CFLAGS += -fshort-wchar -mno-red-zone - -vmlinux-objs-$(CONFIG_EFI_STUB) += $(obj)/eboot.o \ - $(objtree)/drivers/firmware/efi/libstub/lib.a +vmlinux-objs-$(CONFIG_EFI_STUB) += $(objtree)/drivers/firmware/efi/libstub/lib.a vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o # The compressed kernel is built with -fPIC/-fPIE so that a boot loader diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c deleted file mode 100644 index 55637135b50c..000000000000 --- a/arch/x86/boot/compressed/eboot.c +++ /dev/null @@ -1,777 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-only - -/* ----------------------------------------------------------------------- - * - * Copyright 2011 Intel Corporation; author Matt Fleming - * - * ----------------------------------------------------------------------- */ - -#pragma GCC visibility push(hidden) - -#include -#include - -#include -#include -#include -#include -#include - -#include "../string.h" -#include "eboot.h" - -static efi_system_table_t *sys_table; -extern const bool efi_is64; - -__pure efi_system_table_t *efi_system_table(void) -{ - return sys_table; -} - -__attribute_const__ bool efi_is_64bit(void) -{ - if (IS_ENABLED(CONFIG_EFI_MIXED)) - return efi_is64; - return IS_ENABLED(CONFIG_X86_64); -} - -static efi_status_t -preserve_pci_rom_image(efi_pci_io_protocol_t *pci, struct pci_setup_rom **__rom) -{ - struct pci_setup_rom *rom = NULL; - efi_status_t status; - unsigned long size; - uint64_t romsize; - void *romimage; - - /* - * Some firmware images contain EFI function pointers at the place where - * the romimage and romsize fields are supposed to be. Typically the EFI - * code is mapped at high addresses, translating to an unrealistically - * large romsize. The UEFI spec limits the size of option ROMs to 16 - * MiB so we reject any ROMs over 16 MiB in size to catch this. - */ - romimage = efi_table_attr(pci, romimage); - romsize = efi_table_attr(pci, romsize); - if (!romimage || !romsize || romsize > SZ_16M) - return EFI_INVALID_PARAMETER; - - size = romsize + sizeof(*rom); - - status = efi_bs_call(allocate_pool, EFI_LOADER_DATA, size, - (void **)&rom); - if (status != EFI_SUCCESS) { - efi_printk("Failed to allocate memory for 'rom'\n"); - return status; - } - - memset(rom, 0, sizeof(*rom)); - - rom->data.type = SETUP_PCI; - rom->data.len = size - sizeof(struct setup_data); - rom->data.next = 0; - rom->pcilen = pci->romsize; - *__rom = rom; - - status = efi_call_proto(pci, pci.read, EfiPciIoWidthUint16, - PCI_VENDOR_ID, 1, &rom->vendor); - - if (status != EFI_SUCCESS) { - efi_printk("Failed to read rom->vendor\n"); - goto free_struct; - } - - status = efi_call_proto(pci, pci.read, EfiPciIoWidthUint16, - PCI_DEVICE_ID, 1, &rom->devid); - - if (status != EFI_SUCCESS) { - efi_printk("Failed to read rom->devid\n"); - goto free_struct; - } - - status = efi_call_proto(pci, get_location, &rom->segment, &rom->bus, - &rom->device, &rom->function); - - if (status != EFI_SUCCESS) - goto free_struct; - - memcpy(rom->romdata, romimage, romsize); - return status; - -free_struct: - efi_bs_call(free_pool, rom); - return status; -} - -/* - * There's no way to return an informative status from this function, - * because any analysis (and printing of error messages) needs to be - * done directly at the EFI function call-site. - * - * For example, EFI_INVALID_PARAMETER could indicate a bug or maybe we - * just didn't find any PCI devices, but there's no way to tell outside - * the context of the call. - */ -static void setup_efi_pci(struct boot_params *params) -{ - efi_status_t status; - void **pci_handle = NULL; - efi_guid_t pci_proto = EFI_PCI_IO_PROTOCOL_GUID; - unsigned long size = 0; - struct setup_data *data; - efi_handle_t h; - int i; - - status = efi_bs_call(locate_handle, EFI_LOCATE_BY_PROTOCOL, - &pci_proto, NULL, &size, pci_handle); - - if (status == EFI_BUFFER_TOO_SMALL) { - status = efi_bs_call(allocate_pool, EFI_LOADER_DATA, size, - (void **)&pci_handle); - - if (status != EFI_SUCCESS) { - efi_printk("Failed to allocate memory for 'pci_handle'\n"); - return; - } - - status = efi_bs_call(locate_handle, EFI_LOCATE_BY_PROTOCOL, - &pci_proto, NULL, &size, pci_handle); - } - - if (status != EFI_SUCCESS) - goto free_handle; - - data = (struct setup_data *)(unsigned long)params->hdr.setup_data; - - while (data && data->next) - data = (struct setup_data *)(unsigned long)data->next; - - for_each_efi_handle(h, pci_handle, size, i) { - efi_pci_io_protocol_t *pci = NULL; - struct pci_setup_rom *rom; - - status = efi_bs_call(handle_protocol, h, &pci_proto, - (void **)&pci); - if (status != EFI_SUCCESS || !pci) - continue; - - status = preserve_pci_rom_image(pci, &rom); - if (status != EFI_SUCCESS) - continue; - - if (data) - data->next = (unsigned long)rom; - else - params->hdr.setup_data = (unsigned long)rom; - - data = (struct setup_data *)rom; - } - -free_handle: - efi_bs_call(free_pool, pci_handle); -} - -static void retrieve_apple_device_properties(struct boot_params *boot_params) -{ - efi_guid_t guid = APPLE_PROPERTIES_PROTOCOL_GUID; - struct setup_data *data, *new; - efi_status_t status; - u32 size = 0; - apple_properties_protocol_t *p; - - status = efi_bs_call(locate_protocol, &guid, NULL, (void **)&p); - if (status != EFI_SUCCESS) - return; - - if (efi_table_attr(p, version) != 0x10000) { - efi_printk("Unsupported properties proto version\n"); - return; - } - - efi_call_proto(p, get_all, NULL, &size); - if (!size) - return; - - do { - status = efi_bs_call(allocate_pool, EFI_LOADER_DATA, - size + sizeof(struct setup_data), - (void **)&new); - if (status != EFI_SUCCESS) { - efi_printk("Failed to allocate memory for 'properties'\n"); - return; - } - - status = efi_call_proto(p, get_all, new->data, &size); - - if (status == EFI_BUFFER_TOO_SMALL) - efi_bs_call(free_pool, new); - } while (status == EFI_BUFFER_TOO_SMALL); - - new->type = SETUP_APPLE_PROPERTIES; - new->len = size; - new->next = 0; - - data = (struct setup_data *)(unsigned long)boot_params->hdr.setup_data; - if (!data) { - boot_params->hdr.setup_data = (unsigned long)new; - } else { - while (data->next) - data = (struct setup_data *)(unsigned long)data->next; - data->next = (unsigned long)new; - } -} - -static const efi_char16_t apple[] = L"Apple"; - -static void setup_quirks(struct boot_params *boot_params) -{ - efi_char16_t *fw_vendor = (efi_char16_t *)(unsigned long) - efi_table_attr(efi_system_table(), fw_vendor); - - if (!memcmp(fw_vendor, apple, sizeof(apple))) { - if (IS_ENABLED(CONFIG_APPLE_PROPERTIES)) - retrieve_apple_device_properties(boot_params); - } -} - -/* - * See if we have Universal Graphics Adapter (UGA) protocol - */ -static efi_status_t -setup_uga(struct screen_info *si, efi_guid_t *uga_proto, unsigned long size) -{ - efi_status_t status; - u32 width, height; - void **uga_handle = NULL; - efi_uga_draw_protocol_t *uga = NULL, *first_uga; - efi_handle_t handle; - int i; - - status = efi_bs_call(allocate_pool, EFI_LOADER_DATA, size, - (void **)&uga_handle); - if (status != EFI_SUCCESS) - return status; - - status = efi_bs_call(locate_handle, EFI_LOCATE_BY_PROTOCOL, - uga_proto, NULL, &size, uga_handle); - if (status != EFI_SUCCESS) - goto free_handle; - - height = 0; - width = 0; - - first_uga = NULL; - for_each_efi_handle(handle, uga_handle, size, i) { - efi_guid_t pciio_proto = EFI_PCI_IO_PROTOCOL_GUID; - u32 w, h, depth, refresh; - void *pciio; - - status = efi_bs_call(handle_protocol, handle, uga_proto, - (void **)&uga); - if (status != EFI_SUCCESS) - continue; - - pciio = NULL; - efi_bs_call(handle_protocol, handle, &pciio_proto, &pciio); - - status = efi_call_proto(uga, get_mode, &w, &h, &depth, &refresh); - if (status == EFI_SUCCESS && (!first_uga || pciio)) { - width = w; - height = h; - - /* - * Once we've found a UGA supporting PCIIO, - * don't bother looking any further. - */ - if (pciio) - break; - - first_uga = uga; - } - } - - if (!width && !height) - goto free_handle; - - /* EFI framebuffer */ - si->orig_video_isVGA = VIDEO_TYPE_EFI; - - si->lfb_depth = 32; - si->lfb_width = width; - si->lfb_height = height; - - si->red_size = 8; - si->red_pos = 16; - si->green_size = 8; - si->green_pos = 8; - si->blue_size = 8; - si->blue_pos = 0; - si->rsvd_size = 8; - si->rsvd_pos = 24; - -free_handle: - efi_bs_call(free_pool, uga_handle); - - return status; -} - -static void setup_graphics(struct boot_params *boot_params) -{ - efi_guid_t graphics_proto = EFI_GRAPHICS_OUTPUT_PROTOCOL_GUID; - struct screen_info *si; - efi_guid_t uga_proto = EFI_UGA_PROTOCOL_GUID; - efi_status_t status; - unsigned long size; - void **gop_handle = NULL; - void **uga_handle = NULL; - - si = &boot_params->screen_info; - memset(si, 0, sizeof(*si)); - - size = 0; - status = efi_bs_call(locate_handle, EFI_LOCATE_BY_PROTOCOL, - &graphics_proto, NULL, &size, gop_handle); - if (status == EFI_BUFFER_TOO_SMALL) - status = efi_setup_gop(si, &graphics_proto, size); - - if (status != EFI_SUCCESS) { - size = 0; - status = efi_bs_call(locate_handle, EFI_LOCATE_BY_PROTOCOL, - &uga_proto, NULL, &size, uga_handle); - if (status == EFI_BUFFER_TOO_SMALL) - setup_uga(si, &uga_proto, size); - } -} - -void startup_32(struct boot_params *boot_params); - -void __noreturn efi_stub_entry(efi_handle_t handle, - efi_system_table_t *sys_table_arg, - struct boot_params *boot_params); - -/* - * Because the x86 boot code expects to be passed a boot_params we - * need to create one ourselves (usually the bootloader would create - * one for us). - */ -efi_status_t __efiapi efi_pe_entry(efi_handle_t handle, - efi_system_table_t *sys_table_arg) -{ - struct boot_params *boot_params; - struct setup_header *hdr; - efi_loaded_image_t *image; - efi_guid_t proto = LOADED_IMAGE_PROTOCOL_GUID; - int options_size = 0; - efi_status_t status; - char *cmdline_ptr; - unsigned long ramdisk_addr; - unsigned long ramdisk_size; - - sys_table = sys_table_arg; - - /* Check if we were booted by the EFI firmware */ - if (sys_table->hdr.signature != EFI_SYSTEM_TABLE_SIGNATURE) - return EFI_INVALID_PARAMETER; - - status = efi_bs_call(handle_protocol, handle, &proto, (void *)&image); - if (status != EFI_SUCCESS) { - efi_printk("Failed to get handle for LOADED_IMAGE_PROTOCOL\n"); - return status; - } - - status = efi_low_alloc(0x4000, 1, (unsigned long *)&boot_params); - if (status != EFI_SUCCESS) { - efi_printk("Failed to allocate lowmem for boot params\n"); - return status; - } - - memset(boot_params, 0x0, 0x4000); - - hdr = &boot_params->hdr; - - /* Copy the second sector to boot_params */ - memcpy(&hdr->jump, image->image_base + 512, 512); - - /* - * Fill out some of the header fields ourselves because the - * EFI firmware loader doesn't load the first sector. - */ - hdr->root_flags = 1; - hdr->vid_mode = 0xffff; - hdr->boot_flag = 0xAA55; - - hdr->type_of_loader = 0x21; - - /* Convert unicode cmdline to ascii */ - cmdline_ptr = efi_convert_cmdline(image, &options_size); - if (!cmdline_ptr) - goto fail; - - hdr->cmd_line_ptr = (unsigned long)cmdline_ptr; - /* Fill in upper bits of command line address, NOP on 32 bit */ - boot_params->ext_cmd_line_ptr = (u64)(unsigned long)cmdline_ptr >> 32; - - hdr->ramdisk_image = 0; - hdr->ramdisk_size = 0; - - status = efi_parse_options(cmdline_ptr); - if (status != EFI_SUCCESS) - goto fail2; - - status = handle_cmdline_files(image, - (char *)(unsigned long)hdr->cmd_line_ptr, - "initrd=", hdr->initrd_addr_max, - &ramdisk_addr, &ramdisk_size); - - if (status != EFI_SUCCESS && - hdr->xloadflags & XLF_CAN_BE_LOADED_ABOVE_4G) { - efi_printk("Trying to load files to higher address\n"); - status = handle_cmdline_files(image, - (char *)(unsigned long)hdr->cmd_line_ptr, - "initrd=", -1UL, - &ramdisk_addr, &ramdisk_size); - } - - if (status != EFI_SUCCESS) - goto fail2; - hdr->ramdisk_image = ramdisk_addr & 0xffffffff; - hdr->ramdisk_size = ramdisk_size & 0xffffffff; - boot_params->ext_ramdisk_image = (u64)ramdisk_addr >> 32; - boot_params->ext_ramdisk_size = (u64)ramdisk_size >> 32; - - efi_stub_entry(handle, sys_table, boot_params); - /* not reached */ - -fail2: - efi_free(options_size, hdr->cmd_line_ptr); -fail: - efi_free(0x4000, (unsigned long)boot_params); - - return status; -} - -static void add_e820ext(struct boot_params *params, - struct setup_data *e820ext, u32 nr_entries) -{ - struct setup_data *data; - - e820ext->type = SETUP_E820_EXT; - e820ext->len = nr_entries * sizeof(struct boot_e820_entry); - e820ext->next = 0; - - data = (struct setup_data *)(unsigned long)params->hdr.setup_data; - - while (data && data->next) - data = (struct setup_data *)(unsigned long)data->next; - - if (data) - data->next = (unsigned long)e820ext; - else - params->hdr.setup_data = (unsigned long)e820ext; -} - -static efi_status_t -setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_size) -{ - struct boot_e820_entry *entry = params->e820_table; - struct efi_info *efi = ¶ms->efi_info; - struct boot_e820_entry *prev = NULL; - u32 nr_entries; - u32 nr_desc; - int i; - - nr_entries = 0; - nr_desc = efi->efi_memmap_size / efi->efi_memdesc_size; - - for (i = 0; i < nr_desc; i++) { - efi_memory_desc_t *d; - unsigned int e820_type = 0; - unsigned long m = efi->efi_memmap; - -#ifdef CONFIG_X86_64 - m |= (u64)efi->efi_memmap_hi << 32; -#endif - - d = efi_early_memdesc_ptr(m, efi->efi_memdesc_size, i); - switch (d->type) { - case EFI_RESERVED_TYPE: - case EFI_RUNTIME_SERVICES_CODE: - case EFI_RUNTIME_SERVICES_DATA: - case EFI_MEMORY_MAPPED_IO: - case EFI_MEMORY_MAPPED_IO_PORT_SPACE: - case EFI_PAL_CODE: - e820_type = E820_TYPE_RESERVED; - break; - - case EFI_UNUSABLE_MEMORY: - e820_type = E820_TYPE_UNUSABLE; - break; - - case EFI_ACPI_RECLAIM_MEMORY: - e820_type = E820_TYPE_ACPI; - break; - - case EFI_LOADER_CODE: - case EFI_LOADER_DATA: - case EFI_BOOT_SERVICES_CODE: - case EFI_BOOT_SERVICES_DATA: - case EFI_CONVENTIONAL_MEMORY: - if (efi_soft_reserve_enabled() && - (d->attribute & EFI_MEMORY_SP)) - e820_type = E820_TYPE_SOFT_RESERVED; - else - e820_type = E820_TYPE_RAM; - break; - - case EFI_ACPI_MEMORY_NVS: - e820_type = E820_TYPE_NVS; - break; - - case EFI_PERSISTENT_MEMORY: - e820_type = E820_TYPE_PMEM; - break; - - default: - continue; - } - - /* Merge adjacent mappings */ - if (prev && prev->type == e820_type && - (prev->addr + prev->size) == d->phys_addr) { - prev->size += d->num_pages << 12; - continue; - } - - if (nr_entries == ARRAY_SIZE(params->e820_table)) { - u32 need = (nr_desc - i) * sizeof(struct e820_entry) + - sizeof(struct setup_data); - - if (!e820ext || e820ext_size < need) - return EFI_BUFFER_TOO_SMALL; - - /* boot_params map full, switch to e820 extended */ - entry = (struct boot_e820_entry *)e820ext->data; - } - - entry->addr = d->phys_addr; - entry->size = d->num_pages << PAGE_SHIFT; - entry->type = e820_type; - prev = entry++; - nr_entries++; - } - - if (nr_entries > ARRAY_SIZE(params->e820_table)) { - u32 nr_e820ext = nr_entries - ARRAY_SIZE(params->e820_table); - - add_e820ext(params, e820ext, nr_e820ext); - nr_entries -= nr_e820ext; - } - - params->e820_entries = (u8)nr_entries; - - return EFI_SUCCESS; -} - -static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext, - u32 *e820ext_size) -{ - efi_status_t status; - unsigned long size; - - size = sizeof(struct setup_data) + - sizeof(struct e820_entry) * nr_desc; - - if (*e820ext) { - efi_bs_call(free_pool, *e820ext); - *e820ext = NULL; - *e820ext_size = 0; - } - - status = efi_bs_call(allocate_pool, EFI_LOADER_DATA, size, - (void **)e820ext); - if (status == EFI_SUCCESS) - *e820ext_size = size; - - return status; -} - -static efi_status_t allocate_e820(struct boot_params *params, - struct setup_data **e820ext, - u32 *e820ext_size) -{ - unsigned long map_size, desc_size, buff_size; - struct efi_boot_memmap boot_map; - efi_memory_desc_t *map; - efi_status_t status; - __u32 nr_desc; - - boot_map.map = ↦ - boot_map.map_size = &map_size; - boot_map.desc_size = &desc_size; - boot_map.desc_ver = NULL; - boot_map.key_ptr = NULL; - boot_map.buff_size = &buff_size; - - status = efi_get_memory_map(&boot_map); - if (status != EFI_SUCCESS) - return status; - - nr_desc = buff_size / desc_size; - - if (nr_desc > ARRAY_SIZE(params->e820_table)) { - u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table); - - status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size); - if (status != EFI_SUCCESS) - return status; - } - - return EFI_SUCCESS; -} - -struct exit_boot_struct { - struct boot_params *boot_params; - struct efi_info *efi; -}; - -static efi_status_t exit_boot_func(struct efi_boot_memmap *map, - void *priv) -{ - const char *signature; - struct exit_boot_struct *p = priv; - - signature = efi_is_64bit() ? EFI64_LOADER_SIGNATURE - : EFI32_LOADER_SIGNATURE; - memcpy(&p->efi->efi_loader_signature, signature, sizeof(__u32)); - - p->efi->efi_systab = (unsigned long)efi_system_table(); - p->efi->efi_memdesc_size = *map->desc_size; - p->efi->efi_memdesc_version = *map->desc_ver; - p->efi->efi_memmap = (unsigned long)*map->map; - p->efi->efi_memmap_size = *map->map_size; - -#ifdef CONFIG_X86_64 - p->efi->efi_systab_hi = (unsigned long)efi_system_table() >> 32; - p->efi->efi_memmap_hi = (unsigned long)*map->map >> 32; -#endif - - return EFI_SUCCESS; -} - -static efi_status_t exit_boot(struct boot_params *boot_params, void *handle) -{ - unsigned long map_sz, key, desc_size, buff_size; - efi_memory_desc_t *mem_map; - struct setup_data *e820ext = NULL; - __u32 e820ext_size = 0; - efi_status_t status; - __u32 desc_version; - struct efi_boot_memmap map; - struct exit_boot_struct priv; - - map.map = &mem_map; - map.map_size = &map_sz; - map.desc_size = &desc_size; - map.desc_ver = &desc_version; - map.key_ptr = &key; - map.buff_size = &buff_size; - priv.boot_params = boot_params; - priv.efi = &boot_params->efi_info; - - status = allocate_e820(boot_params, &e820ext, &e820ext_size); - if (status != EFI_SUCCESS) - return status; - - /* Might as well exit boot services now */ - status = efi_exit_boot_services(handle, &map, &priv, exit_boot_func); - if (status != EFI_SUCCESS) - return status; - - /* Historic? */ - boot_params->alt_mem_k = 32 * 1024; - - status = setup_e820(boot_params, e820ext, e820ext_size); - if (status != EFI_SUCCESS) - return status; - - return EFI_SUCCESS; -} - -/* - * On success we return a pointer to a boot_params structure, and NULL - * on failure. - */ -struct boot_params *efi_main(efi_handle_t handle, - efi_system_table_t *sys_table_arg, - struct boot_params *boot_params) -{ - unsigned long bzimage_addr = (unsigned long)startup_32; - struct setup_header *hdr = &boot_params->hdr; - efi_status_t status; - unsigned long cmdline_paddr; - - sys_table = sys_table_arg; - - /* Check if we were booted by the EFI firmware */ - if (sys_table->hdr.signature != EFI_SYSTEM_TABLE_SIGNATURE) - goto fail; - - /* - * If the kernel isn't already loaded at the preferred load - * address, relocate it. - */ - if (bzimage_addr != hdr->pref_address) { - status = efi_relocate_kernel(&bzimage_addr, - hdr->init_size, hdr->init_size, - hdr->pref_address, - hdr->kernel_alignment, - LOAD_PHYSICAL_ADDR); - if (status != EFI_SUCCESS) { - efi_printk("efi_relocate_kernel() failed!\n"); - goto fail; - } - } - hdr->code32_start = (u32)bzimage_addr; - - /* - * make_boot_params() may have been called before efi_main(), in which - * case this is the second time we parse the cmdline. This is ok, - * parsing the cmdline multiple times does not have side-effects. - */ - cmdline_paddr = ((u64)hdr->cmd_line_ptr | - ((u64)boot_params->ext_cmd_line_ptr << 32)); - efi_parse_options((char *)cmdline_paddr); - - /* - * If the boot loader gave us a value for secure_boot then we use that, - * otherwise we ask the BIOS. - */ - if (boot_params->secure_boot == efi_secureboot_mode_unset) - boot_params->secure_boot = efi_get_secureboot(); - - /* Ask the firmware to clear memory on unclean shutdown */ - efi_enable_reset_attack_mitigation(); - - efi_random_get_seed(); - - efi_retrieve_tpm2_eventlog(); - - setup_graphics(boot_params); - - setup_efi_pci(boot_params); - - setup_quirks(boot_params); - - status = exit_boot(boot_params, handle); - if (status != EFI_SUCCESS) { - efi_printk("exit_boot() failed!\n"); - goto fail; - } - - return boot_params; -fail: - efi_printk("efi_main() failed!\n"); - - for (;;) - asm("hlt"); -} diff --git a/arch/x86/boot/compressed/eboot.h b/arch/x86/boot/compressed/eboot.h deleted file mode 100644 index 99f35343d443..000000000000 --- a/arch/x86/boot/compressed/eboot.h +++ /dev/null @@ -1,31 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef BOOT_COMPRESSED_EBOOT_H -#define BOOT_COMPRESSED_EBOOT_H - -#define SEG_TYPE_DATA (0 << 3) -#define SEG_TYPE_READ_WRITE (1 << 1) -#define SEG_TYPE_CODE (1 << 3) -#define SEG_TYPE_EXEC_READ (1 << 1) -#define SEG_TYPE_TSS ((1 << 3) | (1 << 0)) -#define SEG_OP_SIZE_32BIT (1 << 0) -#define SEG_GRANULARITY_4KB (1 << 0) - -#define DESC_TYPE_CODE_DATA (1 << 0) - -typedef union efi_uga_draw_protocol efi_uga_draw_protocol_t; - -union efi_uga_draw_protocol { - struct { - efi_status_t (__efiapi *get_mode)(efi_uga_draw_protocol_t *, - u32*, u32*, u32*, u32*); - void *set_mode; - void *blt; - }; - struct { - u32 get_mode; - u32 set_mode; - u32 blt; - } mixed_mode; -}; - -#endif /* BOOT_COMPRESSED_EBOOT_H */ -- cgit From 1e45bf7372c48c78e1f7e538fd3e612946f6ad20 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Mon, 10 Feb 2020 17:02:40 +0100 Subject: efi/libstub/x86: Permit cmdline data to be allocated above 4 GB We now support cmdline data that is located in memory that is not 32-bit addressable, so relax the allocation limit on systems where this feature is enabled. Signed-off-by: Ard Biesheuvel --- arch/x86/include/asm/efi.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index 86169a24b0d8..85f79accd3f8 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -34,8 +34,6 @@ static inline bool efi_have_uv1_memmap(void) #define EFI32_LOADER_SIGNATURE "EL32" #define EFI64_LOADER_SIGNATURE "EL64" -#define MAX_CMDLINE_ADDRESS UINT_MAX - #define ARCH_EFI_IRQ_FLAGS_MASK X86_EFLAGS_IF /* -- cgit From abd268685a21cb5d0c991bb21a88ea0c1d2e15d8 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Mon, 10 Feb 2020 17:02:47 +0100 Subject: efi/libstub: Expose LocateDevicePath boot service We will be adding support for loading the initrd from a GUIDed device path in a subsequent patch, so update the prototype of the LocateDevicePath() boot service to make it callable from our code. Signed-off-by: Ard Biesheuvel --- arch/x86/include/asm/efi.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index 85f79accd3f8..9ced980b123b 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -283,6 +283,9 @@ static inline void *efi64_zero_upper(void *p) #define __efi64_argmap_locate_protocol(protocol, reg, interface) \ ((protocol), (reg), efi64_zero_upper(interface)) +#define __efi64_argmap_locate_device_path(protocol, path, handle) \ + ((protocol), (path), efi64_zero_upper(handle)) + /* PCI I/O */ #define __efi64_argmap_get_location(protocol, seg, bus, dev, func) \ ((protocol), efi64_zero_upper(seg), efi64_zero_upper(bus), \ -- cgit From 2931d526d5674940d916a4b513a681ee3562e574 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Mon, 10 Feb 2020 17:02:48 +0100 Subject: efi/libstub: Make the LoadFile EFI protocol accessible Add the protocol definitions, GUIDs and mixed mode glue so that the EFI loadfile protocol can be used from the stub. This will be used in a future patch to load the initrd. Signed-off-by: Ard Biesheuvel --- arch/x86/include/asm/efi.h | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index 9ced980b123b..fcb21e3d13c5 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -291,6 +291,10 @@ static inline void *efi64_zero_upper(void *p) ((protocol), efi64_zero_upper(seg), efi64_zero_upper(bus), \ efi64_zero_upper(dev), efi64_zero_upper(func)) +/* LoadFile */ +#define __efi64_argmap_load_file(protocol, path, policy, bufsize, buf) \ + ((protocol), (path), (policy), efi64_zero_upper(bufsize), (buf)) + /* * The macros below handle the plumbing for the argument mapping. To add a * mapping for a specific EFI method, simply define a macro -- cgit From 14b60cc8e0ead866cd25b4e51945c6267ca9936c Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Sun, 2 Feb 2020 11:30:14 +0100 Subject: efi/x86: Reindent struct initializer for legibility Reindent the efi_memory_map_data initializer so that all the = signs are aligned vertically, making the resulting code much easier to read. Suggested-by: Ingo Molnar Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index ae923ee8e2b4..293c47f9cb39 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -305,11 +305,11 @@ static void __init efi_clean_memmap(void) if (n_removal > 0) { struct efi_memory_map_data data = { - .phys_map = efi.memmap.phys_map, - .desc_version = efi.memmap.desc_version, - .desc_size = efi.memmap.desc_size, - .size = efi.memmap.desc_size * (efi.memmap.nr_map - n_removal), - .flags = 0, + .phys_map = efi.memmap.phys_map, + .desc_version = efi.memmap.desc_version, + .desc_size = efi.memmap.desc_size, + .size = efi.memmap.desc_size * (efi.memmap.nr_map - n_removal), + .flags = 0, }; pr_warn("Removing %d invalid memory map entries.\n", n_removal); -- cgit From a570b0624b3f73a6cc57c529e506300c639912e2 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Sun, 2 Feb 2020 11:22:41 +0100 Subject: efi/x86: Replace #ifdefs with IS_ENABLED() checks When possible, IS_ENABLED() conditionals are preferred over #ifdefs, given that the latter hide the code from the compiler entirely, which reduces build test coverage when the option is not enabled. So replace an instance in the x86 efi startup code. Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 293c47f9cb39..52067ed7fd59 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -214,16 +214,13 @@ int __init efi_memblock_x86_reserve_range(void) if (efi_enabled(EFI_PARAVIRT)) return 0; -#ifdef CONFIG_X86_32 - /* Can't handle data above 4GB at this time */ - if (e->efi_memmap_hi) { + /* Can't handle firmware tables above 4GB on i386 */ + if (IS_ENABLED(CONFIG_X86_32) && e->efi_memmap_hi > 0) { pr_err("Memory map is above 4GB, disabling EFI.\n"); return -EINVAL; } - pmap = e->efi_memmap; -#else - pmap = (e->efi_memmap | ((__u64)e->efi_memmap_hi << 32)); -#endif + pmap = (phys_addr_t)(e->efi_memmap | ((u64)e->efi_memmap_hi << 32)); + data.phys_map = pmap; data.size = e->efi_memmap_size; data.desc_size = e->efi_memdesc_size; -- cgit From 50d53c58dd77d3b0b6a5afe391eaac3722fc3153 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Sun, 19 Jan 2020 15:29:21 +0100 Subject: efi: Drop handling of 'boot_info' configuration table Some plumbing exists to handle a UEFI configuration table of type BOOT_INFO but since we never match it to a GUID anywhere, we never actually register such a table, or access it, for that matter. So simply drop all mentions of it. Tested-by: Tony Luck # arch/ia64 Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 1 - 1 file changed, 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 52067ed7fd59..4970229fd822 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -70,7 +70,6 @@ static const unsigned long * const efi_tables[] = { &efi.acpi20, &efi.smbios, &efi.smbios3, - &efi.boot_info, &efi.hcdp, &efi.uga, #ifdef CONFIG_X86_UV -- cgit From 120540f230d5d2d32846adc0156b58961c8c59d1 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Sun, 19 Jan 2020 15:43:53 +0100 Subject: efi/ia64: Move HCDP and MPS table handling into IA64 arch code The HCDP and MPS tables are Itanium specific EFI config tables, so move their handling to ia64 arch code. Tested-by: Tony Luck # arch/ia64 Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 4970229fd822..61ebaae62894 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -65,12 +65,10 @@ static efi_config_table_type_t arch_tables[] __initdata = { }; static const unsigned long * const efi_tables[] = { - &efi.mps, &efi.acpi, &efi.acpi20, &efi.smbios, &efi.smbios3, - &efi.hcdp, &efi.uga, #ifdef CONFIG_X86_UV &uv_systab_phys, -- cgit From fd506e0cf9fd4306aa0eb57cbff5f00514da8179 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Sun, 19 Jan 2020 16:17:59 +0100 Subject: efi: Move UGA and PROP table handling to x86 code The UGA table is x86 specific (its handling was introduced when the EFI support code was modified to accommodate IA32), so there is no need to handle it in generic code. The EFI properties table is not strictly x86 specific, but it was deprecated almost immediately after having been introduced, due to implementation difficulties. Only x86 takes it into account today, and this is not going to change, so make this table x86 only as well. Tested-by: Tony Luck # arch/ia64 Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 61ebaae62894..421f082535c5 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -57,7 +57,12 @@ static efi_system_table_t efi_systab __initdata; static u64 efi_systab_phys __initdata; +static unsigned long prop_phys = EFI_INVALID_TABLE_ADDR; +static unsigned long uga_phys = EFI_INVALID_TABLE_ADDR; + static efi_config_table_type_t arch_tables[] __initdata = { + {EFI_PROPERTIES_TABLE_GUID, "PROP", &prop_phys}, + {UGA_IO_PROTOCOL_GUID, "UGA", &uga_phys}, #ifdef CONFIG_X86_UV {UV_SYSTEM_TABLE_GUID, "UVsystab", &uv_systab_phys}, #endif @@ -69,7 +74,7 @@ static const unsigned long * const efi_tables[] = { &efi.acpi20, &efi.smbios, &efi.smbios3, - &efi.uga, + &uga_phys, #ifdef CONFIG_X86_UV &uv_systab_phys, #endif @@ -77,7 +82,7 @@ static const unsigned long * const efi_tables[] = { &efi.runtime, &efi.config_table, &efi.esrt, - &efi.properties_table, + &prop_phys, &efi.mem_attr_table, #ifdef CONFIG_EFI_RCI2_TABLE &rci2_table_phys, @@ -490,6 +495,22 @@ void __init efi_init(void) return; } + /* Parse the EFI Properties table if it exists */ + if (prop_phys != EFI_INVALID_TABLE_ADDR) { + efi_properties_table_t *tbl; + + tbl = early_memremap_ro(prop_phys, sizeof(*tbl)); + if (tbl == NULL) { + pr_err("Could not map Properties table!\n"); + } else { + if (tbl->memory_protection_attribute & + EFI_PROPERTIES_RUNTIME_MEMORY_PROTECTION_NON_EXECUTABLE_PE_DATA) + set_bit(EFI_NX_PE_DATA, &efi.flags); + + early_memunmap(tbl, sizeof(*tbl)); + } + } + set_bit(EFI_RUNTIME_SERVICES, &efi.flags); efi_clean_memmap(); @@ -993,3 +1014,10 @@ bool efi_is_table_address(unsigned long phys_addr) return false; } + +char *efi_systab_show_arch(char *str) +{ + if (uga_phys != EFI_INVALID_TABLE_ADDR) + str += sprintf(str, "UGA=0x%lx\n", uga_phys); + return str; +} -- cgit From a17e809ea573e69474064ba2bbff06d212861e19 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Wed, 22 Jan 2020 15:05:12 +0100 Subject: efi: Move mem_attr_table out of struct efi The memory attributes table is only used at init time by the core EFI code, so there is no need to carry its address in struct efi that is shared with the world. So move it out, and make it __ro_after_init as well, considering that the value is set during early boot. Tested-by: Tony Luck # arch/ia64 Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 421f082535c5..22dc3678cdba 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -83,7 +83,7 @@ static const unsigned long * const efi_tables[] = { &efi.config_table, &efi.esrt, &prop_phys, - &efi.mem_attr_table, + &efi_mem_attr_table, #ifdef CONFIG_EFI_RCI2_TABLE &rci2_table_phys, #endif -- cgit From 14fb4209094355928d5a742e35afabdf7b404c17 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Mon, 20 Jan 2020 10:49:11 +0100 Subject: efi: Merge EFI system table revision and vendor checks We have three different versions of the code that checks the EFI system table revision and copies the firmware vendor string, and they are mostly equivalent, with the exception of the use of early_memremap_ro vs. __va() and the lowest major revision to warn about. Let's move this into common code and factor out the commonalities. Tested-by: Tony Luck # arch/ia64 Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 46 ++++++++++----------------------------------- 1 file changed, 10 insertions(+), 36 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 22dc3678cdba..5bb53da48a4b 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -336,15 +336,23 @@ static int __init efi_systab_init(u64 phys) { int size = efi_enabled(EFI_64BIT) ? sizeof(efi_system_table_64_t) : sizeof(efi_system_table_32_t); + const efi_table_hdr_t *hdr; bool over4g = false; void *p; + int ret; - p = early_memremap_ro(phys, size); + hdr = p = early_memremap_ro(phys, size); if (p == NULL) { pr_err("Couldn't map the system table!\n"); return -ENOMEM; } + ret = efi_systab_check_header(hdr, 1); + if (ret) { + early_memunmap(p, size); + return ret; + } + if (efi_enabled(EFI_64BIT)) { const efi_system_table_64_t *systab64 = p; @@ -411,6 +419,7 @@ static int __init efi_systab_init(u64 phys) efi_systab.tables = systab32->tables; } + efi_systab_report_header(hdr, efi_systab.fw_vendor); early_memunmap(p, size); if (IS_ENABLED(CONFIG_X86_32) && over4g) { @@ -419,28 +428,11 @@ static int __init efi_systab_init(u64 phys) } efi.systab = &efi_systab; - - /* - * Verify the EFI Table - */ - if (efi.systab->hdr.signature != EFI_SYSTEM_TABLE_SIGNATURE) { - pr_err("System table signature incorrect!\n"); - return -EINVAL; - } - if ((efi.systab->hdr.revision >> 16) == 0) - pr_err("Warning: System table version %d.%02d, expected 1.00 or greater!\n", - efi.systab->hdr.revision >> 16, - efi.systab->hdr.revision & 0xffff); - return 0; } void __init efi_init(void) { - efi_char16_t *c16; - char vendor[100] = "unknown"; - int i = 0; - if (IS_ENABLED(CONFIG_X86_32) && (boot_params.efi_info.efi_systab_hi || boot_params.efi_info.efi_memmap_hi)) { @@ -458,24 +450,6 @@ void __init efi_init(void) efi.fw_vendor = (unsigned long)efi.systab->fw_vendor; efi.runtime = (unsigned long)efi.systab->runtime; - /* - * Show what we know for posterity - */ - c16 = early_memremap_ro(efi.systab->fw_vendor, - sizeof(vendor) * sizeof(efi_char16_t)); - if (c16) { - for (i = 0; i < sizeof(vendor) - 1 && c16[i]; ++i) - vendor[i] = c16[i]; - vendor[i] = '\0'; - early_memunmap(c16, sizeof(vendor) * sizeof(efi_char16_t)); - } else { - pr_err("Could not map the firmware vendor!\n"); - } - - pr_info("EFI v%u.%.02u by %s\n", - efi.systab->hdr.revision >> 16, - efi.systab->hdr.revision & 0xffff, vendor); - if (efi_reuse_config(efi.systab->tables, efi.systab->nr_tables)) return; -- cgit From 3a0701dc7ff8ebe1031a9f64c99c638929cd2d70 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Mon, 20 Jan 2020 17:17:27 +0100 Subject: efi: Make efi_config_init() x86 only The efi_config_init() routine is no longer shared with ia64 so let's move it into the x86 arch code before making further x86 specific changes to it. Tested-by: Tony Luck # arch/ia64 Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 5bb53da48a4b..f1033f7f9e39 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -431,6 +431,36 @@ static int __init efi_systab_init(u64 phys) return 0; } +static int __init efi_config_init(efi_config_table_type_t *arch_tables) +{ + void *config_tables; + int sz, ret; + + if (efi.systab->nr_tables == 0) + return 0; + + if (efi_enabled(EFI_64BIT)) + sz = sizeof(efi_config_table_64_t); + else + sz = sizeof(efi_config_table_32_t); + + /* + * Let's see what config tables the firmware passed to us. + */ + config_tables = early_memremap(efi.systab->tables, + efi.systab->nr_tables * sz); + if (config_tables == NULL) { + pr_err("Could not map Configuration table!\n"); + return -ENOMEM; + } + + ret = efi_config_parse_tables(config_tables, efi.systab->nr_tables, sz, + arch_tables); + + early_memunmap(config_tables, efi.systab->nr_tables * sz); + return ret; +} + void __init efi_init(void) { if (IS_ENABLED(CONFIG_X86_32) && -- cgit From 06c0bd93434c5b9b284773f90bb054aff591d5be Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Wed, 22 Jan 2020 14:40:57 +0100 Subject: efi: Clean up config_parse_tables() config_parse_tables() is a jumble of pointer arithmetic, due to the fact that on x86, we may be dealing with firmware whose native word size differs from the kernel's. This is not a concern on other architectures, and doesn't quite justify the state of the code, so let's clean it up by adding a non-x86 code path, constifying statically allocated tables and replacing preprocessor conditionals with IS_ENABLED() checks. Tested-by: Tony Luck # arch/ia64 Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index f1033f7f9e39..47367f4d82d0 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -60,7 +60,7 @@ static u64 efi_systab_phys __initdata; static unsigned long prop_phys = EFI_INVALID_TABLE_ADDR; static unsigned long uga_phys = EFI_INVALID_TABLE_ADDR; -static efi_config_table_type_t arch_tables[] __initdata = { +static const efi_config_table_type_t arch_tables[] __initconst = { {EFI_PROPERTIES_TABLE_GUID, "PROP", &prop_phys}, {UGA_IO_PROTOCOL_GUID, "UGA", &uga_phys}, #ifdef CONFIG_X86_UV @@ -431,7 +431,7 @@ static int __init efi_systab_init(u64 phys) return 0; } -static int __init efi_config_init(efi_config_table_type_t *arch_tables) +static int __init efi_config_init(const efi_config_table_type_t *arch_tables) { void *config_tables; int sz, ret; @@ -454,7 +454,7 @@ static int __init efi_config_init(efi_config_table_type_t *arch_tables) return -ENOMEM; } - ret = efi_config_parse_tables(config_tables, efi.systab->nr_tables, sz, + ret = efi_config_parse_tables(config_tables, efi.systab->nr_tables, arch_tables); early_memunmap(config_tables, efi.systab->nr_tables * sz); -- cgit From 0a67361dcdaa29dca1e77ebac919c62e93a8b3bc Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Mon, 20 Jan 2020 16:15:00 +0100 Subject: efi/x86: Remove runtime table address from kexec EFI setup data Since commit 33b85447fa61946b ("efi/x86: Drop two near identical versions of efi_runtime_init()"), we no longer map the EFI runtime services table before calling SetVirtualAddressMap(), which means we don't need the 1:1 mapped physical address of this table, and so there is no point in passing the address via EFI setup data on kexec boot. Note that the kexec tools will still look for this address in sysfs, so we still need to provide it. Tested-by: Tony Luck # arch/ia64 Signed-off-by: Ard Biesheuvel --- arch/x86/include/asm/efi.h | 1 - arch/x86/kernel/kexec-bzimage64.c | 1 - arch/x86/platform/efi/efi.c | 4 +--- 3 files changed, 1 insertion(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index fcb21e3d13c5..ee867f01b2f6 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -178,7 +178,6 @@ extern void __init efi_uv1_memmap_phys_epilog(pgd_t *save_pgd); struct efi_setup_data { u64 fw_vendor; - u64 runtime; u64 tables; u64 smbios; u64 reserved[8]; diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c index f293d872602a..f400678bd345 100644 --- a/arch/x86/kernel/kexec-bzimage64.c +++ b/arch/x86/kernel/kexec-bzimage64.c @@ -142,7 +142,6 @@ prepare_add_efi_setup_data(struct boot_params *params, struct efi_setup_data *esd = (void *)sd + sizeof(struct setup_data); esd->fw_vendor = efi.fw_vendor; - esd->runtime = efi.runtime; esd->tables = efi.config_table; esd->smbios = efi.smbios; diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 47367f4d82d0..7d932452a40f 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -376,6 +376,7 @@ static int __init efi_systab_init(u64 phys) systab64->con_out > U32_MAX || systab64->stderr_handle > U32_MAX || systab64->stderr > U32_MAX || + systab64->runtime > U32_MAX || systab64->boottime > U32_MAX; if (efi_setup) { @@ -388,17 +389,14 @@ static int __init efi_systab_init(u64 phys) } efi_systab.fw_vendor = (unsigned long)data->fw_vendor; - efi_systab.runtime = (void *)(unsigned long)data->runtime; efi_systab.tables = (unsigned long)data->tables; over4g |= data->fw_vendor > U32_MAX || - data->runtime > U32_MAX || data->tables > U32_MAX; early_memunmap(data, sizeof(*data)); } else { over4g |= systab64->fw_vendor > U32_MAX || - systab64->runtime > U32_MAX || systab64->tables > U32_MAX; } } else { -- cgit From 9cd437ac0ef4f324a92e2579784b03bb487ae7fb Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Mon, 20 Jan 2020 17:23:21 +0100 Subject: efi/x86: Make fw_vendor, config_table and runtime sysfs nodes x86 specific There is some code that exposes physical addresses of certain parts of the EFI firmware implementation via sysfs nodes. These nodes are only used on x86, and are of dubious value to begin with, so let's move their handling into the x86 arch code. Tested-by: Tony Luck # arch/ia64 Signed-off-by: Ard Biesheuvel --- arch/x86/include/asm/efi.h | 2 ++ arch/x86/kernel/kexec-bzimage64.c | 4 +-- arch/x86/platform/efi/efi.c | 60 +++++++++++++++++++++++++++++++-------- arch/x86/platform/efi/quirks.c | 2 +- 4 files changed, 53 insertions(+), 15 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index ee867f01b2f6..78fc28da2e29 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -10,6 +10,8 @@ #include #include +extern unsigned long efi_fw_vendor, efi_config_table; + /* * We map the EFI regions needed for runtime services non-contiguously, * with preserved alignment on virtual addresses starting from -4G down diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c index f400678bd345..db6578d45157 100644 --- a/arch/x86/kernel/kexec-bzimage64.c +++ b/arch/x86/kernel/kexec-bzimage64.c @@ -141,8 +141,8 @@ prepare_add_efi_setup_data(struct boot_params *params, struct setup_data *sd = (void *)params + efi_setup_data_offset; struct efi_setup_data *esd = (void *)sd + sizeof(struct setup_data); - esd->fw_vendor = efi.fw_vendor; - esd->tables = efi.config_table; + esd->fw_vendor = efi_fw_vendor; + esd->tables = efi_config_table; esd->smbios = efi.smbios; sd->type = SETUP_EFI; diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 7d932452a40f..6fa412e156c7 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -59,6 +59,9 @@ static u64 efi_systab_phys __initdata; static unsigned long prop_phys = EFI_INVALID_TABLE_ADDR; static unsigned long uga_phys = EFI_INVALID_TABLE_ADDR; +static unsigned long efi_runtime, efi_nr_tables; + +unsigned long efi_fw_vendor, efi_config_table; static const efi_config_table_type_t arch_tables[] __initconst = { {EFI_PROPERTIES_TABLE_GUID, "PROP", &prop_phys}, @@ -78,9 +81,9 @@ static const unsigned long * const efi_tables[] = { #ifdef CONFIG_X86_UV &uv_systab_phys, #endif - &efi.fw_vendor, - &efi.runtime, - &efi.config_table, + &efi_fw_vendor, + &efi_runtime, + &efi_config_table, &efi.esrt, &prop_phys, &efi_mem_attr_table, @@ -434,7 +437,7 @@ static int __init efi_config_init(const efi_config_table_type_t *arch_tables) void *config_tables; int sz, ret; - if (efi.systab->nr_tables == 0) + if (efi_nr_tables == 0) return 0; if (efi_enabled(EFI_64BIT)) @@ -445,17 +448,16 @@ static int __init efi_config_init(const efi_config_table_type_t *arch_tables) /* * Let's see what config tables the firmware passed to us. */ - config_tables = early_memremap(efi.systab->tables, - efi.systab->nr_tables * sz); + config_tables = early_memremap(efi_config_table, efi_nr_tables * sz); if (config_tables == NULL) { pr_err("Could not map Configuration table!\n"); return -ENOMEM; } - ret = efi_config_parse_tables(config_tables, efi.systab->nr_tables, + ret = efi_config_parse_tables(config_tables, efi_nr_tables, arch_tables); - early_memunmap(config_tables, efi.systab->nr_tables * sz); + early_memunmap(config_tables, efi_nr_tables * sz); return ret; } @@ -474,11 +476,12 @@ void __init efi_init(void) if (efi_systab_init(efi_systab_phys)) return; - efi.config_table = (unsigned long)efi.systab->tables; - efi.fw_vendor = (unsigned long)efi.systab->fw_vendor; - efi.runtime = (unsigned long)efi.systab->runtime; + efi_config_table = (unsigned long)efi.systab->tables; + efi_nr_tables = efi.systab->nr_tables; + efi_fw_vendor = (unsigned long)efi.systab->fw_vendor; + efi_runtime = (unsigned long)efi.systab->runtime; - if (efi_reuse_config(efi.systab->tables, efi.systab->nr_tables)) + if (efi_reuse_config(efi_config_table, efi_nr_tables)) return; if (efi_config_init(arch_tables)) @@ -1023,3 +1026,36 @@ char *efi_systab_show_arch(char *str) str += sprintf(str, "UGA=0x%lx\n", uga_phys); return str; } + +#define EFI_FIELD(var) efi_ ## var + +#define EFI_ATTR_SHOW(name) \ +static ssize_t name##_show(struct kobject *kobj, \ + struct kobj_attribute *attr, char *buf) \ +{ \ + return sprintf(buf, "0x%lx\n", EFI_FIELD(name)); \ +} + +EFI_ATTR_SHOW(fw_vendor); +EFI_ATTR_SHOW(runtime); +EFI_ATTR_SHOW(config_table); + +struct kobj_attribute efi_attr_fw_vendor = __ATTR_RO(fw_vendor); +struct kobj_attribute efi_attr_runtime = __ATTR_RO(runtime); +struct kobj_attribute efi_attr_config_table = __ATTR_RO(config_table); + +umode_t efi_attr_is_visible(struct kobject *kobj, struct attribute *attr, int n) +{ + if (attr == &efi_attr_fw_vendor.attr) { + if (efi_enabled(EFI_PARAVIRT) || + efi_fw_vendor == EFI_INVALID_TABLE_ADDR) + return 0; + } else if (attr == &efi_attr_runtime.attr) { + if (efi_runtime == EFI_INVALID_TABLE_ADDR) + return 0; + } else if (attr == &efi_attr_config_table.attr) { + if (efi_config_table == EFI_INVALID_TABLE_ADDR) + return 0; + } + return attr->mode; +} diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c index 88d32c06cffa..b0e0161e2e8e 100644 --- a/arch/x86/platform/efi/quirks.c +++ b/arch/x86/platform/efi/quirks.c @@ -537,7 +537,7 @@ int __init efi_reuse_config(u64 tables, int nr_tables) goto out_memremap; } - for (i = 0; i < efi.systab->nr_tables; i++) { + for (i = 0; i < nr_tables; i++) { efi_guid_t guid; guid = ((efi_config_table_64_t *)p)->guid; -- cgit From 09308012d8546dda75e96c02bed19e2ba1e875fd Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Mon, 20 Jan 2020 18:35:17 +0100 Subject: efi/x86: Merge assignments of efi.runtime_version efi.runtime_version is always set to the same value on both existing code paths, so just set it earlier from a shared one. Tested-by: Tony Luck # arch/ia64 Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 19 ++----------------- 1 file changed, 2 insertions(+), 17 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 6fa412e156c7..e4ee9a37254a 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -420,6 +420,8 @@ static int __init efi_systab_init(u64 phys) efi_systab.tables = systab32->tables; } + efi.runtime_version = hdr->revision; + efi_systab_report_header(hdr, efi_systab.fw_vendor); early_memunmap(p, size); @@ -870,15 +872,6 @@ static void __init kexec_enter_virtual_mode(void) } efi_sync_low_kernel_mappings(); - - /* - * Now that EFI is in virtual mode, update the function - * pointers in the runtime service table to the new virtual addresses. - * - * Call EFI services through wrapper functions. - */ - efi.runtime_version = efi_systab.hdr.revision; - efi_native_runtime_setup(); #endif } @@ -965,14 +958,6 @@ static void __init __efi_enter_virtual_mode(void) efi_free_boot_services(); - /* - * Now that EFI is in virtual mode, update the function - * pointers in the runtime service table to the new virtual addresses. - * - * Call EFI services through wrapper functions. - */ - efi.runtime_version = efi_systab.hdr.revision; - if (!efi_is_mixed()) efi_native_runtime_setup(); else -- cgit From 59f2a619a2db86111e8bb30f349aebff6eb75baa Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Tue, 21 Jan 2020 09:44:43 +0100 Subject: efi: Add 'runtime' pointer to struct efi Instead of going through the EFI system table each time, just copy the runtime services table pointer into struct efi directly. This is the last use of the system table pointer in struct efi, allowing us to drop it in a future patch, along with a fair amount of quirky handling of the translated address. Note that usually, the runtime services pointer changes value during the call to SetVirtualAddressMap(), so grab the updated value as soon as that call returns. (Mixed mode uses a 1:1 mapping, and kexec boot enters with the updated address in the system table, so in those cases, we don't need to do anything here) Tested-by: Tony Luck # arch/ia64 Signed-off-by: Ard Biesheuvel --- arch/x86/include/asm/efi.h | 3 ++- arch/x86/kernel/asm-offsets_32.c | 5 +++++ arch/x86/platform/efi/efi.c | 9 ++++++--- arch/x86/platform/efi/efi_32.c | 13 ++++++++----- arch/x86/platform/efi/efi_64.c | 14 ++++++++------ arch/x86/platform/efi/efi_stub_32.S | 21 ++++++++++++++++----- 6 files changed, 45 insertions(+), 20 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index 78fc28da2e29..0de57151c732 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -218,7 +218,8 @@ extern void efi_thunk_runtime_setup(void); efi_status_t efi_set_virtual_address_map(unsigned long memory_map_size, unsigned long descriptor_size, u32 descriptor_version, - efi_memory_desc_t *virtual_map); + efi_memory_desc_t *virtual_map, + unsigned long systab_phys); /* arch specific definitions used by the stub code */ diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c index 82826f2275cc..2b4256ebe86e 100644 --- a/arch/x86/kernel/asm-offsets_32.c +++ b/arch/x86/kernel/asm-offsets_32.c @@ -3,6 +3,8 @@ # error "Please do not build this file directly, build asm-offsets.c instead" #endif +#include + #include #define __SYSCALL_I386(nr, sym, qual) [nr] = 1, @@ -64,4 +66,7 @@ void foo(void) BLANK(); DEFINE(__NR_syscall_max, sizeof(syscalls) - 1); DEFINE(NR_syscalls, sizeof(syscalls)); + + BLANK(); + DEFINE(EFI_svam, offsetof(efi_runtime_services_t, set_virtual_address_map)); } diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index e4ee9a37254a..6bd8aae235d2 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -55,8 +55,8 @@ #include static efi_system_table_t efi_systab __initdata; -static u64 efi_systab_phys __initdata; +static unsigned long efi_systab_phys __initdata; static unsigned long prop_phys = EFI_INVALID_TABLE_ADDR; static unsigned long uga_phys = EFI_INVALID_TABLE_ADDR; static unsigned long efi_runtime, efi_nr_tables; @@ -335,7 +335,7 @@ void __init efi_print_memmap(void) } } -static int __init efi_systab_init(u64 phys) +static int __init efi_systab_init(unsigned long phys) { int size = efi_enabled(EFI_64BIT) ? sizeof(efi_system_table_64_t) : sizeof(efi_system_table_32_t); @@ -949,7 +949,8 @@ static void __init __efi_enter_virtual_mode(void) status = efi_set_virtual_address_map(efi.memmap.desc_size * count, efi.memmap.desc_size, efi.memmap.desc_version, - (efi_memory_desc_t *)pa); + (efi_memory_desc_t *)pa, + efi_systab_phys); if (status != EFI_SUCCESS) { pr_err("Unable to switch EFI into virtual mode (status=%lx)!\n", status); @@ -983,6 +984,8 @@ void __init efi_enter_virtual_mode(void) if (efi_enabled(EFI_PARAVIRT)) return; + efi.runtime = (efi_runtime_services_t *)efi_runtime; + if (efi_setup) kexec_enter_virtual_mode(); else diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c index 081d466002c9..c049c432745d 100644 --- a/arch/x86/platform/efi/efi_32.c +++ b/arch/x86/platform/efi/efi_32.c @@ -66,14 +66,16 @@ void __init efi_map_region(efi_memory_desc_t *md) void __init efi_map_region_fixed(efi_memory_desc_t *md) {} void __init parse_efi_setup(u64 phys_addr, u32 data_len) {} -efi_status_t efi_call_svam(efi_set_virtual_address_map_t *__efiapi *, - u32, u32, u32, void *); +efi_status_t efi_call_svam(efi_runtime_services_t * const *, + u32, u32, u32, void *, u32); efi_status_t __init efi_set_virtual_address_map(unsigned long memory_map_size, unsigned long descriptor_size, u32 descriptor_version, - efi_memory_desc_t *virtual_map) + efi_memory_desc_t *virtual_map, + unsigned long systab_phys) { + const efi_system_table_t *systab = (efi_system_table_t *)systab_phys; struct desc_ptr gdt_descr; efi_status_t status; unsigned long flags; @@ -90,9 +92,10 @@ efi_status_t __init efi_set_virtual_address_map(unsigned long memory_map_size, /* Disable interrupts around EFI calls: */ local_irq_save(flags); - status = efi_call_svam(&efi.systab->runtime->set_virtual_address_map, + status = efi_call_svam(&systab->runtime, memory_map_size, descriptor_size, - descriptor_version, virtual_map); + descriptor_version, virtual_map, + __pa(&efi.runtime)); local_irq_restore(flags); load_fixmap_gdt(0); diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c index fa8506e76bbe..f78f7da666fb 100644 --- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -500,12 +500,9 @@ static DEFINE_SPINLOCK(efi_runtime_lock); */ #define __efi_thunk(func, ...) \ ({ \ - efi_runtime_services_32_t *__rt; \ unsigned short __ds, __es; \ efi_status_t ____s; \ \ - __rt = (void *)(unsigned long)efi.systab->mixed_mode.runtime; \ - \ savesegment(ds, __ds); \ savesegment(es, __es); \ \ @@ -513,7 +510,7 @@ static DEFINE_SPINLOCK(efi_runtime_lock); loadsegment(ds, __KERNEL_DS); \ loadsegment(es, __KERNEL_DS); \ \ - ____s = efi64_thunk(__rt->func, __VA_ARGS__); \ + ____s = efi64_thunk(efi.runtime->mixed_mode.func, __VA_ARGS__); \ \ loadsegment(ds, __ds); \ loadsegment(es, __es); \ @@ -886,8 +883,10 @@ efi_status_t __init __no_sanitize_address efi_set_virtual_address_map(unsigned long memory_map_size, unsigned long descriptor_size, u32 descriptor_version, - efi_memory_desc_t *virtual_map) + efi_memory_desc_t *virtual_map, + unsigned long systab_phys) { + const efi_system_table_t *systab = (efi_system_table_t *)systab_phys; efi_status_t status; unsigned long flags; pgd_t *save_pgd = NULL; @@ -910,13 +909,16 @@ efi_set_virtual_address_map(unsigned long memory_map_size, /* Disable interrupts around EFI calls: */ local_irq_save(flags); - status = efi_call(efi.systab->runtime->set_virtual_address_map, + status = efi_call(efi.runtime->set_virtual_address_map, memory_map_size, descriptor_size, descriptor_version, virtual_map); local_irq_restore(flags); kernel_fpu_end(); + /* grab the virtually remapped EFI runtime services table pointer */ + efi.runtime = READ_ONCE(systab->runtime); + if (save_pgd) efi_uv1_memmap_phys_epilog(save_pgd); else diff --git a/arch/x86/platform/efi/efi_stub_32.S b/arch/x86/platform/efi/efi_stub_32.S index 75c46e7a809f..09237236fb25 100644 --- a/arch/x86/platform/efi/efi_stub_32.S +++ b/arch/x86/platform/efi/efi_stub_32.S @@ -8,14 +8,20 @@ #include #include +#include #include __INIT SYM_FUNC_START(efi_call_svam) - push 8(%esp) - push 8(%esp) + push %ebp + movl %esp, %ebp + push %ebx + + push 16(%esp) + push 16(%esp) push %ecx push %edx + movl %eax, %ebx // &systab_phys->runtime /* * Switch to the flat mapped alias of this routine, by jumping to the @@ -35,15 +41,20 @@ SYM_FUNC_START(efi_call_svam) subl $__PAGE_OFFSET, %esp /* call the EFI routine */ - call *(%eax) + movl (%eax), %eax + call *EFI_svam(%eax) - /* convert ESP back to a kernel VA, and pop the outgoing args */ - addl $__PAGE_OFFSET + 16, %esp + /* grab the virtually remapped EFI runtime services table pointer */ + movl (%ebx), %ecx + movl 36(%esp), %edx // &efi.runtime + movl %ecx, (%edx) /* re-enable paging */ movl %cr0, %edx orl $0x80000000, %edx movl %edx, %cr0 + pop %ebx + leave ret SYM_FUNC_END(efi_call_svam) -- cgit From fd26830423e5f7442001f090cd4a53f4b6c3d9fa Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Tue, 21 Jan 2020 10:16:32 +0100 Subject: efi/x86: Drop 'systab' member from struct efi The systab member in struct efi has outlived its usefulness, now that we have better ways to access the only piece of information we are interested in after init, which is the EFI runtime services table address. So instead of instantiating a doctored copy at early boot with lots of mangled values, and switching the pointer when switching into virtual mode, let's grab the values we need directly, and get rid of the systab pointer entirely. Tested-by: Tony Luck # arch/ia64 Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 87 ++++++++------------------------------------- 1 file changed, 14 insertions(+), 73 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 6bd8aae235d2..43b24e149312 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -54,8 +54,6 @@ #include #include -static efi_system_table_t efi_systab __initdata; - static unsigned long efi_systab_phys __initdata; static unsigned long prop_phys = EFI_INVALID_TABLE_ADDR; static unsigned long uga_phys = EFI_INVALID_TABLE_ADDR; @@ -359,28 +357,8 @@ static int __init efi_systab_init(unsigned long phys) if (efi_enabled(EFI_64BIT)) { const efi_system_table_64_t *systab64 = p; - efi_systab.hdr = systab64->hdr; - efi_systab.fw_vendor = systab64->fw_vendor; - efi_systab.fw_revision = systab64->fw_revision; - efi_systab.con_in_handle = systab64->con_in_handle; - efi_systab.con_in = systab64->con_in; - efi_systab.con_out_handle = systab64->con_out_handle; - efi_systab.con_out = (void *)(unsigned long)systab64->con_out; - efi_systab.stderr_handle = systab64->stderr_handle; - efi_systab.stderr = systab64->stderr; - efi_systab.runtime = (void *)(unsigned long)systab64->runtime; - efi_systab.boottime = (void *)(unsigned long)systab64->boottime; - efi_systab.nr_tables = systab64->nr_tables; - efi_systab.tables = systab64->tables; - - over4g = systab64->con_in_handle > U32_MAX || - systab64->con_in > U32_MAX || - systab64->con_out_handle > U32_MAX || - systab64->con_out > U32_MAX || - systab64->stderr_handle > U32_MAX || - systab64->stderr > U32_MAX || - systab64->runtime > U32_MAX || - systab64->boottime > U32_MAX; + efi_runtime = systab64->runtime; + over4g = systab64->runtime > U32_MAX; if (efi_setup) { struct efi_setup_data *data; @@ -391,38 +369,33 @@ static int __init efi_systab_init(unsigned long phys) return -ENOMEM; } - efi_systab.fw_vendor = (unsigned long)data->fw_vendor; - efi_systab.tables = (unsigned long)data->tables; + efi_fw_vendor = (unsigned long)data->fw_vendor; + efi_config_table = (unsigned long)data->tables; over4g |= data->fw_vendor > U32_MAX || data->tables > U32_MAX; early_memunmap(data, sizeof(*data)); } else { + efi_fw_vendor = systab64->fw_vendor; + efi_config_table = systab64->tables; + over4g |= systab64->fw_vendor > U32_MAX || systab64->tables > U32_MAX; } + efi_nr_tables = systab64->nr_tables; } else { const efi_system_table_32_t *systab32 = p; - efi_systab.hdr = systab32->hdr; - efi_systab.fw_vendor = systab32->fw_vendor; - efi_systab.fw_revision = systab32->fw_revision; - efi_systab.con_in_handle = systab32->con_in_handle; - efi_systab.con_in = systab32->con_in; - efi_systab.con_out_handle = systab32->con_out_handle; - efi_systab.con_out = (void *)(unsigned long)systab32->con_out; - efi_systab.stderr_handle = systab32->stderr_handle; - efi_systab.stderr = systab32->stderr; - efi_systab.runtime = (void *)(unsigned long)systab32->runtime; - efi_systab.boottime = (void *)(unsigned long)systab32->boottime; - efi_systab.nr_tables = systab32->nr_tables; - efi_systab.tables = systab32->tables; + efi_fw_vendor = systab32->fw_vendor; + efi_runtime = systab32->runtime; + efi_config_table = systab32->tables; + efi_nr_tables = systab32->nr_tables; } efi.runtime_version = hdr->revision; - efi_systab_report_header(hdr, efi_systab.fw_vendor); + efi_systab_report_header(hdr, efi_fw_vendor); early_memunmap(p, size); if (IS_ENABLED(CONFIG_X86_32) && over4g) { @@ -430,7 +403,6 @@ static int __init efi_systab_init(unsigned long phys) return -EINVAL; } - efi.systab = &efi_systab; return 0; } @@ -478,11 +450,6 @@ void __init efi_init(void) if (efi_systab_init(efi_systab_phys)) return; - efi_config_table = (unsigned long)efi.systab->tables; - efi_nr_tables = efi.systab->nr_tables; - efi_fw_vendor = (unsigned long)efi.systab->fw_vendor; - efi_runtime = (unsigned long)efi.systab->runtime; - if (efi_reuse_config(efi_config_table, efi_nr_tables)) return; @@ -624,20 +591,6 @@ static void __init efi_merge_regions(void) } } -static void __init get_systab_virt_addr(efi_memory_desc_t *md) -{ - unsigned long size; - u64 end, systab; - - size = md->num_pages << EFI_PAGE_SHIFT; - end = md->phys_addr + size; - systab = efi_systab_phys; - if (md->phys_addr <= systab && systab < end) { - systab += md->virt_addr - md->phys_addr; - efi.systab = (efi_system_table_t *)(unsigned long)systab; - } -} - static void *realloc_pages(void *old_memmap, int old_shift) { void *ret; @@ -793,7 +746,6 @@ static void * __init efi_map_regions(int *count, int *pg_shift) continue; efi_map_region(md); - get_systab_virt_addr(md); if (left < desc_size) { new_memmap = realloc_pages(new_memmap, *pg_shift); @@ -819,8 +771,6 @@ static void __init kexec_enter_virtual_mode(void) efi_memory_desc_t *md; unsigned int num_pages; - efi.systab = NULL; - /* * We don't do virtual mode, since we don't do runtime services, on * non-native EFI. With the UV1 memmap, we don't do runtime services in @@ -843,10 +793,8 @@ static void __init kexec_enter_virtual_mode(void) * Map efi regions which were passed via setup_data. The virt_addr is a * fixed addr which was used in first kernel of a kexec boot. */ - for_each_efi_memory_desc(md) { + for_each_efi_memory_desc(md) efi_map_region_fixed(md); /* FIXME: add error handling */ - get_systab_virt_addr(md); - } /* * Unregister the early EFI memmap from efi_init() and install @@ -861,8 +809,6 @@ static void __init kexec_enter_virtual_mode(void) return; } - BUG_ON(!efi.systab); - num_pages = ALIGN(efi.memmap.nr_map * efi.memmap.desc_size, PAGE_SIZE); num_pages >>= PAGE_SHIFT; @@ -905,8 +851,6 @@ static void __init __efi_enter_virtual_mode(void) efi_status_t status; unsigned long pa; - efi.systab = NULL; - if (efi_alloc_page_tables()) { pr_err("Failed to allocate EFI page tables\n"); goto err; @@ -938,9 +882,6 @@ static void __init __efi_enter_virtual_mode(void) efi_print_memmap(); } - if (WARN_ON(!efi.systab)) - goto err; - if (efi_setup_page_tables(pa, 1 << pg_shift)) goto err; -- cgit From 223e3ee56f77570157aba8cc550208af430a869b Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Sat, 22 Feb 2020 15:15:50 +0100 Subject: efi/x86: add headroom to decompressor BSS to account for setup block In the bootparams struct, init_size defines the static footprint of the bzImage, counted from the start of the kernel image, i.e., startup_32(). The PE/COFF metadata declares the same size for the entire image, but this time, the image includes the setup block as well, and so the space reserved by UEFI is a bit too small. This usually doesn't matter, since we normally relocate the kernel into a memory allocation of the correct size. But in the unlikely case that the image happens to be loaded at exactly the preferred offset, we skip this relocation, and execute the image in place, stepping on memory beyond the provided allocation, which may be in use for other purposes. Let's fix this by adding the size of the setup block to the image size as declared in the PE/COFF header. Signed-off-by: Ard Biesheuvel --- arch/x86/boot/tools/build.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/tools/build.c b/arch/x86/boot/tools/build.c index 55e669d29e54..c08db2ee4ba2 100644 --- a/arch/x86/boot/tools/build.c +++ b/arch/x86/boot/tools/build.c @@ -408,7 +408,7 @@ int main(int argc, char ** argv) update_pecoff_text(setup_sectors * 512, i + (sys_size * 16)); init_sz = get_unaligned_le32(&buf[0x260]); - update_pecoff_bss(i + (sys_size * 16), init_sz); + update_pecoff_bss(i + (sys_size * 16), init_sz + setup_sectors * 512); efi_stub_entry_update(); -- cgit From 832187f03994b03b6cc4a3b9130d82b1ec5cbec4 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Wed, 12 Feb 2020 11:26:10 +0100 Subject: efi/x86: Drop redundant .bss section In commit c7fb93ec51d462ec ("x86/efi: Include a .bss section within the PE/COFF headers") we added a separate .bss section to the PE/COFF header of the compressed kernel describing the static memory footprint of the decompressor, to ensure that it has enough headroom to decompress itself. We can achieve the exact same result by increasing the virtual size of the .text section, without changing the raw size, which, as per the PE/COFF specification, requires the loader to zero initialize the delta. Doing so frees up a slot in the section table, which we will use later to describe the mixed mode entrypoint. Signed-off-by: Ard Biesheuvel --- arch/x86/boot/header.S | 21 +-------------------- arch/x86/boot/tools/build.c | 36 ++++++++++++------------------------ 2 files changed, 13 insertions(+), 44 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S index 97d9b6d6c1af..d59f6604bb42 100644 --- a/arch/x86/boot/header.S +++ b/arch/x86/boot/header.S @@ -106,7 +106,7 @@ coff_header: #else .word 0x8664 # x86-64 #endif - .word 4 # nr_sections + .word 3 # nr_sections .long 0 # TimeDateStamp .long 0 # PointerToSymbolTable .long 1 # NumberOfSymbols @@ -248,25 +248,6 @@ section_table: .word 0 # NumberOfLineNumbers .long 0x60500020 # Characteristics (section flags) - # - # The offset & size fields are filled in by build.c. - # - .ascii ".bss" - .byte 0 - .byte 0 - .byte 0 - .byte 0 - .long 0 - .long 0x0 - .long 0 # Size of initialized data - # on disk - .long 0x0 - .long 0 # PointerToRelocations - .long 0 # PointerToLineNumbers - .word 0 # NumberOfRelocations - .word 0 # NumberOfLineNumbers - .long 0xc8000080 # Characteristics (section flags) - #endif /* CONFIG_EFI_STUB */ # Kernel attributes; used by setup. This is part 1 of the diff --git a/arch/x86/boot/tools/build.c b/arch/x86/boot/tools/build.c index c08db2ee4ba2..f9f5761c5d05 100644 --- a/arch/x86/boot/tools/build.c +++ b/arch/x86/boot/tools/build.c @@ -203,10 +203,12 @@ static void update_pecoff_setup_and_reloc(unsigned int size) put_unaligned_le32(10, &buf[reloc_offset + 4]); } -static void update_pecoff_text(unsigned int text_start, unsigned int file_sz) +static void update_pecoff_text(unsigned int text_start, unsigned int file_sz, + unsigned int init_sz) { unsigned int pe_header; unsigned int text_sz = file_sz - text_start; + unsigned int bss_sz = init_sz + text_start - file_sz; pe_header = get_unaligned_le32(&buf[0x3c]); @@ -214,30 +216,18 @@ static void update_pecoff_text(unsigned int text_start, unsigned int file_sz) * Size of code: Subtract the size of the first sector (512 bytes) * which includes the header. */ - put_unaligned_le32(file_sz - 512, &buf[pe_header + 0x1c]); + put_unaligned_le32(file_sz - 512 + bss_sz, &buf[pe_header + 0x1c]); + + /* Size of image */ + put_unaligned_le32(init_sz + text_start, &buf[pe_header + 0x50]); /* * Address of entry point for PE/COFF executable */ put_unaligned_le32(text_start + efi_pe_entry, &buf[pe_header + 0x28]); - update_pecoff_section_header(".text", text_start, text_sz); -} - -static void update_pecoff_bss(unsigned int file_sz, unsigned int init_sz) -{ - unsigned int pe_header; - unsigned int bss_sz = init_sz - file_sz; - - pe_header = get_unaligned_le32(&buf[0x3c]); - - /* Size of uninitialized data */ - put_unaligned_le32(bss_sz, &buf[pe_header + 0x24]); - - /* Size of image */ - put_unaligned_le32(init_sz, &buf[pe_header + 0x50]); - - update_pecoff_section_header_fields(".bss", file_sz, bss_sz, 0, 0); + update_pecoff_section_header_fields(".text", text_start, text_sz + bss_sz, + text_sz, text_start); } static int reserve_pecoff_reloc_section(int c) @@ -278,9 +268,8 @@ static void efi_stub_entry_update(void) static inline void update_pecoff_setup_and_reloc(unsigned int size) {} static inline void update_pecoff_text(unsigned int text_start, - unsigned int file_sz) {} -static inline void update_pecoff_bss(unsigned int file_sz, - unsigned int init_sz) {} + unsigned int file_sz, + unsigned int init_sz) {} static inline void efi_stub_defaults(void) {} static inline void efi_stub_entry_update(void) {} @@ -406,9 +395,8 @@ int main(int argc, char ** argv) buf[0x1f1] = setup_sectors-1; put_unaligned_le32(sys_size, &buf[0x1f4]); - update_pecoff_text(setup_sectors * 512, i + (sys_size * 16)); init_sz = get_unaligned_le32(&buf[0x260]); - update_pecoff_bss(i + (sys_size * 16), init_sz + setup_sectors * 512); + update_pecoff_text(setup_sectors * 512, i + (sys_size * 16), init_sz); efi_stub_entry_update(); -- cgit From 3b8f44fc0810d51b58837cf7fdba3f72f8cffcdc Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Sun, 16 Feb 2020 00:03:25 +0100 Subject: efi/libstub/x86: Use Exit() boot service to exit the stub on errors Currently, we either return with an error [from efi_pe_entry()] or enter a deadloop [in efi_main()] if any fatal errors occur during execution of the EFI stub. Let's switch to calling the Exit() EFI boot service instead in both cases, so that we a) can get rid of the deadloop, and simply return to the boot manager if any errors occur during execution of the stub, including during the call to ExitBootServices(), b) can also return cleanly from efi_pe_entry() or efi_main() in mixed mode, once we introduce support for LoadImage/StartImage based mixed mode in the next patch. Note that on systems running downstream GRUBs [which do not use LoadImage or StartImage to boot the kernel, and instead, pass their own image handle as the loaded image handle], calling Exit() will exit from GRUB rather than from the kernel, but this is a tolerable side effect. Signed-off-by: Ard Biesheuvel --- arch/x86/include/asm/efi.h | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index 0de57151c732..cdcf48d52a12 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -270,6 +270,11 @@ static inline void *efi64_zero_upper(void *p) return p; } +static inline u32 efi64_convert_status(efi_status_t status) +{ + return (u32)(status | (u64)status >> 32); +} + #define __efi64_argmap_free_pages(addr, size) \ ((addr), 0, (size)) @@ -288,6 +293,9 @@ static inline void *efi64_zero_upper(void *p) #define __efi64_argmap_locate_device_path(protocol, path, handle) \ ((protocol), (path), efi64_zero_upper(handle)) +#define __efi64_argmap_exit(handle, status, size, data) \ + ((handle), efi64_convert_status(status), (size), (data)) + /* PCI I/O */ #define __efi64_argmap_get_location(protocol, seg, bus, dev, func) \ ((protocol), efi64_zero_upper(seg), efi64_zero_upper(bus), \ -- cgit From 17054f492dfd4d91e093ebb87013807812ec42a4 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Wed, 12 Feb 2020 23:20:54 +0100 Subject: efi/x86: Implement mixed mode boot without the handover protocol Add support for booting 64-bit x86 kernels from 32-bit firmware running on 64-bit capable CPUs without requiring the bootloader to implement the EFI handover protocol or allocate the setup block, etc etc, all of which can be done by the stub itself, using code that already exists. Instead, create an ordinary EFI application entrypoint but implemented in 32-bit code [so that it can be invoked by 32-bit firmware], and stash the address of this 32-bit entrypoint in the .compat section where the bootloader can find it. Note that we use the setup block embedded in the binary to go through startup_32(), but it gets reallocated and copied in efi_pe_entry(), using the same code that runs when the x86 kernel is booted in EFI mode from native firmware. This requires the loaded image protocol to be installed on the kernel image's EFI handle, and point to the kernel image itself and not to its loader. This, in turn, requires the bootloader to use the LoadImage() boot service to load the 64-bit image from 32-bit firmware, which is in fact supported by firmware based on EDK2. (Only StartImage() will fail, and instead, the newly added entrypoint needs to be invoked) Signed-off-by: Ard Biesheuvel --- arch/x86/boot/compressed/head_64.S | 59 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 57 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index a4f5561c1c0e..f7bacc4c1494 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -207,8 +207,13 @@ SYM_FUNC_START(startup_32) cmp $0, %edi jz 1f leal efi64_stub_entry(%ebp), %eax - movl %esi, %edx movl efi32_boot_args+4(%ebp), %esi + movl efi32_boot_args+8(%ebp), %edx // saved bootparams pointer + cmpl $0, %edx + jnz 1f + leal efi_pe_entry(%ebp), %eax + movl %edi, %ecx // MS calling convention + movl %esi, %edx 1: #endif pushl %eax @@ -233,6 +238,8 @@ SYM_FUNC_START(efi32_stub_entry) 1: pop %ebp subl $1b, %ebp + movl %esi, efi32_boot_args+8(%ebp) +SYM_INNER_LABEL(efi32_pe_stub_entry, SYM_L_LOCAL) movl %ecx, efi32_boot_args(%ebp) movl %edx, efi32_boot_args+4(%ebp) movb $0, efi_is64(%ebp) @@ -641,8 +648,56 @@ SYM_DATA_START_LOCAL(gdt) SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end) #ifdef CONFIG_EFI_MIXED -SYM_DATA_LOCAL(efi32_boot_args, .long 0, 0) +SYM_DATA_LOCAL(efi32_boot_args, .long 0, 0, 0) SYM_DATA(efi_is64, .byte 1) + +#define ST32_boottime 60 // offsetof(efi_system_table_32_t, boottime) +#define BS32_handle_protocol 88 // offsetof(efi_boot_services_32_t, handle_protocol) +#define LI32_image_base 32 // offsetof(efi_loaded_image_32_t, image_base) + + .text + .code32 +SYM_FUNC_START(efi32_pe_entry) + pushl %ebp + + call verify_cpu // check for long mode support + testl %eax, %eax + movl $0x80000003, %eax // EFI_UNSUPPORTED + jnz 3f + + call 1f +1: pop %ebp + subl $1b, %ebp + + /* Get the loaded image protocol pointer from the image handle */ + subl $12, %esp // space for the loaded image pointer + pushl %esp // pass its address + leal 4f(%ebp), %eax + pushl %eax // pass the GUID address + pushl 28(%esp) // pass the image handle + + movl 36(%esp), %eax // sys_table + movl ST32_boottime(%eax), %eax // sys_table->boottime + call *BS32_handle_protocol(%eax) // sys_table->boottime->handle_protocol + cmp $0, %eax + jnz 2f + + movl 32(%esp), %ecx // image_handle + movl 36(%esp), %edx // sys_table + movl 12(%esp), %esi // loaded_image + movl LI32_image_base(%esi), %esi // loaded_image->image_base + jmp efi32_pe_stub_entry + +2: addl $24, %esp +3: popl %ebp + ret +SYM_FUNC_END(efi32_pe_entry) + + .section ".rodata" + /* EFI loaded image protocol GUID */ +4: .long 0x5B1B31A1 + .word 0x9562, 0x11d2 + .byte 0x8E, 0x3F, 0x00, 0xA0, 0xC9, 0x69, 0x72, 0x3B #endif /* -- cgit From 97aa276579b28b86f4a3e235b50762c0191c2ac3 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Wed, 12 Feb 2020 12:18:17 +0100 Subject: efi/x86: Add true mixed mode entry point into .compat section Currently, mixed mode is closely tied to the EFI handover protocol and relies on intimate knowledge of the bootparams structure, setup header etc, all of which are rather byzantine and entirely specific to x86. Even though no other EFI supported architectures are currently known that could support something like mixed mode, it still makes sense to abstract a bit from this, and make it part of a generic Linux on EFI boot protocol. To that end, add a .compat section to the mixed mode binary, and populate it with the PE machine type and entry point address, allowing firmware implementations to match it to their native machine type, and invoke non-native binaries using a secondary entry point. Signed-off-by: Ard Biesheuvel --- arch/x86/boot/Makefile | 2 +- arch/x86/boot/header.S | 20 +++++++++++++++++- arch/x86/boot/tools/build.c | 50 ++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 69 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile index 012b82fc8617..ef9e1f2c836c 100644 --- a/arch/x86/boot/Makefile +++ b/arch/x86/boot/Makefile @@ -88,7 +88,7 @@ $(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE SETUP_OBJS = $(addprefix $(obj)/,$(setup-y)) -sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [a-zA-Z] \(startup_32\|startup_64\|efi32_stub_entry\|efi64_stub_entry\|efi_pe_entry\|input_data\|kernel_info\|_end\|_ehead\|_text\|z_.*\)$$/\#define ZO_\2 0x\1/p' +sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [a-zA-Z] \(startup_32\|startup_64\|efi32_stub_entry\|efi64_stub_entry\|efi_pe_entry\|efi32_pe_entry\|input_data\|kernel_info\|_end\|_ehead\|_text\|z_.*\)$$/\#define ZO_\2 0x\1/p' quiet_cmd_zoffset = ZOFFSET $@ cmd_zoffset = $(NM) $< | sed -n $(sed-zoffset) > $@ diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S index d59f6604bb42..44aeb63ca6ae 100644 --- a/arch/x86/boot/header.S +++ b/arch/x86/boot/header.S @@ -106,7 +106,7 @@ coff_header: #else .word 0x8664 # x86-64 #endif - .word 3 # nr_sections + .word section_count # nr_sections .long 0 # TimeDateStamp .long 0 # PointerToSymbolTable .long 1 # NumberOfSymbols @@ -230,6 +230,23 @@ section_table: .word 0 # NumberOfLineNumbers .long 0x42100040 # Characteristics (section flags) +#ifdef CONFIG_EFI_MIXED + # + # The offset & size fields are filled in by build.c. + # + .asciz ".compat" + .long 0 + .long 0x0 + .long 0 # Size of initialized data + # on disk + .long 0x0 + .long 0 # PointerToRelocations + .long 0 # PointerToLineNumbers + .word 0 # NumberOfRelocations + .word 0 # NumberOfLineNumbers + .long 0x42100040 # Characteristics (section flags) +#endif + # # The offset & size fields are filled in by build.c. # @@ -248,6 +265,7 @@ section_table: .word 0 # NumberOfLineNumbers .long 0x60500020 # Characteristics (section flags) + .set section_count, (. - section_table) / 40 #endif /* CONFIG_EFI_STUB */ # Kernel attributes; used by setup. This is part 1 of the diff --git a/arch/x86/boot/tools/build.c b/arch/x86/boot/tools/build.c index f9f5761c5d05..90d403dfec80 100644 --- a/arch/x86/boot/tools/build.c +++ b/arch/x86/boot/tools/build.c @@ -53,9 +53,16 @@ u8 buf[SETUP_SECT_MAX*512]; #define PECOFF_RELOC_RESERVE 0x20 +#ifdef CONFIG_EFI_MIXED +#define PECOFF_COMPAT_RESERVE 0x20 +#else +#define PECOFF_COMPAT_RESERVE 0x0 +#endif + unsigned long efi32_stub_entry; unsigned long efi64_stub_entry; unsigned long efi_pe_entry; +unsigned long efi32_pe_entry; unsigned long kernel_info; unsigned long startup_64; @@ -189,7 +196,10 @@ static void update_pecoff_section_header(char *section_name, u32 offset, u32 siz static void update_pecoff_setup_and_reloc(unsigned int size) { u32 setup_offset = 0x200; - u32 reloc_offset = size - PECOFF_RELOC_RESERVE; + u32 reloc_offset = size - PECOFF_RELOC_RESERVE - PECOFF_COMPAT_RESERVE; +#ifdef CONFIG_EFI_MIXED + u32 compat_offset = reloc_offset + PECOFF_RELOC_RESERVE; +#endif u32 setup_size = reloc_offset - setup_offset; update_pecoff_section_header(".setup", setup_offset, setup_size); @@ -201,6 +211,20 @@ static void update_pecoff_setup_and_reloc(unsigned int size) */ put_unaligned_le32(reloc_offset + 10, &buf[reloc_offset]); put_unaligned_le32(10, &buf[reloc_offset + 4]); + +#ifdef CONFIG_EFI_MIXED + update_pecoff_section_header(".compat", compat_offset, PECOFF_COMPAT_RESERVE); + + /* + * Put the IA-32 machine type (0x14c) and the associated entry point + * address in the .compat section, so loaders can figure out which other + * execution modes this image supports. + */ + buf[compat_offset] = 0x1; + buf[compat_offset + 1] = 0x8; + put_unaligned_le16(0x14c, &buf[compat_offset + 2]); + put_unaligned_le32(efi32_pe_entry + size, &buf[compat_offset + 4]); +#endif } static void update_pecoff_text(unsigned int text_start, unsigned int file_sz, @@ -212,6 +236,22 @@ static void update_pecoff_text(unsigned int text_start, unsigned int file_sz, pe_header = get_unaligned_le32(&buf[0x3c]); +#ifdef CONFIG_EFI_MIXED + /* + * In mixed mode, we will execute startup_32() at whichever offset in + * memory it happened to land when the PE/COFF loader loaded the image, + * which may be misaligned with respect to the kernel_alignment field + * in the setup header. + * + * In order for startup_32 to safely execute in place at this offset, + * we need to ensure that the CONFIG_PHYSICAL_ALIGN aligned allocation + * it creates for the page tables does not extend beyond the declared + * size of the image in the PE/COFF header. So add the required slack. + */ + bss_sz += CONFIG_PHYSICAL_ALIGN; + init_sz += CONFIG_PHYSICAL_ALIGN; +#endif + /* * Size of code: Subtract the size of the first sector (512 bytes) * which includes the header. @@ -279,6 +319,12 @@ static inline int reserve_pecoff_reloc_section(int c) } #endif /* CONFIG_EFI_STUB */ +static int reserve_pecoff_compat_section(int c) +{ + /* Reserve 0x20 bytes for .compat section */ + memset(buf+c, 0, PECOFF_COMPAT_RESERVE); + return PECOFF_COMPAT_RESERVE; +} /* * Parse zoffset.h and find the entry points. We could just #include zoffset.h @@ -311,6 +357,7 @@ static void parse_zoffset(char *fname) PARSE_ZOFS(p, efi32_stub_entry); PARSE_ZOFS(p, efi64_stub_entry); PARSE_ZOFS(p, efi_pe_entry); + PARSE_ZOFS(p, efi32_pe_entry); PARSE_ZOFS(p, kernel_info); PARSE_ZOFS(p, startup_64); @@ -354,6 +401,7 @@ int main(int argc, char ** argv) die("Boot block hasn't got boot flag (0xAA55)"); fclose(file); + c += reserve_pecoff_compat_section(c); c += reserve_pecoff_reloc_section(c); /* Pad unused space with zeros */ -- cgit From 9a440391b560347bf5ee7cb96b63e7e91cedf66a Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Thu, 23 Jan 2020 13:09:35 +0100 Subject: x86/ima: Use EFI GetVariable only when available Replace the EFI runtime services check with one that tells us whether EFI GetVariable() is implemented by the firmware. Signed-off-by: Ard Biesheuvel --- arch/x86/kernel/ima_arch.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/ima_arch.c b/arch/x86/kernel/ima_arch.c index 4d4f5d9faac3..cb6ed616a543 100644 --- a/arch/x86/kernel/ima_arch.c +++ b/arch/x86/kernel/ima_arch.c @@ -19,7 +19,7 @@ static enum efi_secureboot_mode get_sb_mode(void) size = sizeof(secboot); - if (!efi_enabled(EFI_RUNTIME_SERVICES)) { + if (!efi_rt_services_supported(EFI_RT_SUPPORTED_GET_VARIABLE)) { pr_info("ima: secureboot mode unknown, no efi\n"); return efi_secureboot_mode_unknown; } -- cgit From a3326a0d878c433d69981c504f8c8ade60cd2b51 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Thu, 20 Feb 2020 10:56:59 +0100 Subject: efi/x86: Use symbolic constants in PE header instead of bare numbers Replace bare numbers in the PE/COFF header structure with symbolic constants so they become self documenting. Signed-off-by: Ard Biesheuvel --- arch/x86/boot/header.S | 62 ++++++++++++++++++++++++++------------------------ 1 file changed, 32 insertions(+), 30 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S index 44aeb63ca6ae..f65661a58cf8 100644 --- a/arch/x86/boot/header.S +++ b/arch/x86/boot/header.S @@ -15,7 +15,7 @@ * hex while segment addresses are written as segment:offset. * */ - +#include #include #include #include @@ -43,8 +43,7 @@ SYSSEG = 0x1000 /* historical load address >> 4 */ bootsect_start: #ifdef CONFIG_EFI_STUB # "MZ", MS-DOS header - .byte 0x4d - .byte 0x5a + .word MZ_MAGIC #endif # Normalize the start address @@ -97,39 +96,30 @@ bugger_off_msg: #ifdef CONFIG_EFI_STUB pe_header: - .ascii "PE" - .word 0 + .long PE_MAGIC coff_header: #ifdef CONFIG_X86_32 - .word 0x14c # i386 + .set image_file_add_flags, IMAGE_FILE_32BIT_MACHINE + .set pe_opt_magic, PE_OPT_MAGIC_PE32 + .word IMAGE_FILE_MACHINE_I386 #else - .word 0x8664 # x86-64 + .set image_file_add_flags, 0 + .set pe_opt_magic, PE_OPT_MAGIC_PE32PLUS + .word IMAGE_FILE_MACHINE_AMD64 #endif .word section_count # nr_sections .long 0 # TimeDateStamp .long 0 # PointerToSymbolTable .long 1 # NumberOfSymbols .word section_table - optional_header # SizeOfOptionalHeader -#ifdef CONFIG_X86_32 - .word 0x306 # Characteristics. - # IMAGE_FILE_32BIT_MACHINE | - # IMAGE_FILE_DEBUG_STRIPPED | - # IMAGE_FILE_EXECUTABLE_IMAGE | - # IMAGE_FILE_LINE_NUMS_STRIPPED -#else - .word 0x206 # Characteristics - # IMAGE_FILE_DEBUG_STRIPPED | - # IMAGE_FILE_EXECUTABLE_IMAGE | - # IMAGE_FILE_LINE_NUMS_STRIPPED -#endif + .word IMAGE_FILE_EXECUTABLE_IMAGE | \ + image_file_add_flags | \ + IMAGE_FILE_DEBUG_STRIPPED | \ + IMAGE_FILE_LINE_NUMS_STRIPPED # Characteristics optional_header: -#ifdef CONFIG_X86_32 - .word 0x10b # PE32 format -#else - .word 0x20b # PE32+ format -#endif + .word pe_opt_magic .byte 0x02 # MajorLinkerVersion .byte 0x14 # MinorLinkerVersion @@ -170,7 +160,7 @@ extra_header_fields: .long 0x200 # SizeOfHeaders .long 0 # CheckSum - .word 0xa # Subsystem (EFI application) + .word IMAGE_SUBSYSTEM_EFI_APPLICATION # Subsystem (EFI application) .word 0 # DllCharacteristics #ifdef CONFIG_X86_32 .long 0 # SizeOfStackReserve @@ -184,7 +174,7 @@ extra_header_fields: .quad 0 # SizeOfHeapCommit #endif .long 0 # LoaderFlags - .long 0x6 # NumberOfRvaAndSizes + .long (section_table - .) / 8 # NumberOfRvaAndSizes .quad 0 # ExportTable .quad 0 # ImportTable @@ -210,7 +200,10 @@ section_table: .long 0 # PointerToLineNumbers .word 0 # NumberOfRelocations .word 0 # NumberOfLineNumbers - .long 0x60500020 # Characteristics (section flags) + .long IMAGE_SCN_CNT_CODE | \ + IMAGE_SCN_MEM_READ | \ + IMAGE_SCN_MEM_EXECUTE | \ + IMAGE_SCN_ALIGN_16BYTES # Characteristics # # The EFI application loader requires a relocation section @@ -228,7 +221,10 @@ section_table: .long 0 # PointerToLineNumbers .word 0 # NumberOfRelocations .word 0 # NumberOfLineNumbers - .long 0x42100040 # Characteristics (section flags) + .long IMAGE_SCN_CNT_INITIALIZED_DATA | \ + IMAGE_SCN_MEM_READ | \ + IMAGE_SCN_MEM_DISCARDABLE | \ + IMAGE_SCN_ALIGN_1BYTES # Characteristics #ifdef CONFIG_EFI_MIXED # @@ -244,7 +240,10 @@ section_table: .long 0 # PointerToLineNumbers .word 0 # NumberOfRelocations .word 0 # NumberOfLineNumbers - .long 0x42100040 # Characteristics (section flags) + .long IMAGE_SCN_CNT_INITIALIZED_DATA | \ + IMAGE_SCN_MEM_READ | \ + IMAGE_SCN_MEM_DISCARDABLE | \ + IMAGE_SCN_ALIGN_1BYTES # Characteristics #endif # @@ -263,7 +262,10 @@ section_table: .long 0 # PointerToLineNumbers .word 0 # NumberOfRelocations .word 0 # NumberOfLineNumbers - .long 0x60500020 # Characteristics (section flags) + .long IMAGE_SCN_CNT_CODE | \ + IMAGE_SCN_MEM_READ | \ + IMAGE_SCN_MEM_EXECUTE | \ + IMAGE_SCN_ALIGN_16BYTES # Characteristics .set section_count, (. - section_table) / 40 #endif /* CONFIG_EFI_STUB */ -- cgit From 148d3f716c208738d2c7ff8c80be18d89ec8a06e Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Thu, 20 Feb 2020 11:06:00 +0100 Subject: efi/libstub: Introduce symbolic constants for the stub major/minor version Now that we have added new ways to load the initrd or the mixed mode kernel, we will also need a way to tell the loader about this. Add symbolic constants for the PE/COFF major/minor version numbers (which fortunately have always been 0x0 for all architectures), so that we can bump them later to document the capabilities of the stub. Signed-off-by: Ard Biesheuvel --- arch/x86/boot/header.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S index f65661a58cf8..4ee25e28996f 100644 --- a/arch/x86/boot/header.S +++ b/arch/x86/boot/header.S @@ -147,8 +147,8 @@ extra_header_fields: .long 0x20 # FileAlignment .word 0 # MajorOperatingSystemVersion .word 0 # MinorOperatingSystemVersion - .word 0 # MajorImageVersion - .word 0 # MinorImageVersion + .word LINUX_EFISTUB_MAJOR_VERSION # MajorImageVersion + .word LINUX_EFISTUB_MINOR_VERSION # MinorImageVersion .word 0 # MajorSubsystemVersion .word 0 # MinorSubsystemVersion .long 0 # Win32VersionValue -- cgit From 0eea39a234dc52063d14541fabcb2c64516a2328 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Thu, 9 Jan 2020 10:02:18 -0500 Subject: x86/boot/compressed: Remove .eh_frame section from bzImage Discarding unnecessary sections with "*(*)" (see thread at Link: below) works fine with the bfd linker but fails with lld: $ make -j$(nproc) -s CC=clang LD=ld.lld O=out.x86_64 distclean defconfig bzImage ld.lld: error: discarding .shstrtab section is not allowed lld tries to also discard essential sections like .shstrtab, .symtab and .strtab, which results in the link failing since .shstrtab is required by the ELF specification: the e_shstrndx field in the ELF header is the index of .shstrtab, and each section in the section table is required to have an sh_name that points into the .shstrtab. .symtab and .strtab are also necessary to generate the zoffset.h file for the bzImage header. Since the only sizeable section that can be discarded is .eh_frame, restrict the discard to only .eh_frame to be safe. [ bp: Flesh out commit message and replace offending commit with this one. ] Signed-off-by: Arvind Sankar Signed-off-by: Borislav Petkov Tested-by: Nathan Chancellor Link: https://lkml.kernel.org/r/20200109150218.16544-2-nivedita@alum.mit.edu --- arch/x86/boot/compressed/vmlinux.lds.S | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S index 508cfa6828c5..469dcf800a2c 100644 --- a/arch/x86/boot/compressed/vmlinux.lds.S +++ b/arch/x86/boot/compressed/vmlinux.lds.S @@ -73,4 +73,9 @@ SECTIONS #endif . = ALIGN(PAGE_SIZE); /* keep ZO size page aligned */ _end = .; + + /* Discard .eh_frame to save some space */ + /DISCARD/ : { + *(.eh_frame) + } } -- cgit From 16171bffc829272d5e6014bad48f680cb50943d9 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Wed, 22 Jan 2020 08:53:46 -0800 Subject: x86/pkeys: Add check for pkey "overflow" Alex Shi reported the pkey macros above arch_set_user_pkey_access() to be unused. They are unused, and even refer to a nonexistent CONFIG option. But, they might have served a good use, which was to ensure that the code does not try to set values that would not fit in the PKRU register. As it stands, a too-large 'pkey' value would be likely to silently overflow the u32 new_pkru_bits. Add a check to look for overflows. Also add a comment to remind any future developer to closely examine the types used to store pkey values if arch_max_pkey() ever changes. This boots and passes the x86 pkey selftests. Reported-by: Alex Shi Signed-off-by: Dave Hansen Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200122165346.AD4DA150@viggo.jf.intel.com --- arch/x86/include/asm/pkeys.h | 5 +++++ arch/x86/kernel/fpu/xstate.c | 9 +++++++-- 2 files changed, 12 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h index 19b137f1b3be..2ff9b98812b7 100644 --- a/arch/x86/include/asm/pkeys.h +++ b/arch/x86/include/asm/pkeys.h @@ -4,6 +4,11 @@ #define ARCH_DEFAULT_PKEY 0 +/* + * If more than 16 keys are ever supported, a thorough audit + * will be necessary to ensure that the types that store key + * numbers and masks have sufficient capacity. + */ #define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? 16 : 1) extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 73fe5979629c..32b153d38748 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -895,8 +895,6 @@ const void *get_xsave_field_ptr(int xfeature_nr) #ifdef CONFIG_ARCH_HAS_PKEYS -#define NR_VALID_PKRU_BITS (CONFIG_NR_PROTECTION_KEYS * 2) -#define PKRU_VALID_MASK (NR_VALID_PKRU_BITS - 1) /* * This will go out and modify PKRU register to set the access * rights for @pkey to @init_val. @@ -915,6 +913,13 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, if (!boot_cpu_has(X86_FEATURE_OSPKE)) return -EINVAL; + /* + * This code should only be called with valid 'pkey' + * values originating from in-kernel users. Complain + * if a bad value is observed. + */ + WARN_ON_ONCE(pkey >= arch_max_pkey()); + /* Set the bits we need in PKRU: */ if (init_val & PKEY_DISABLE_ACCESS) new_pkru_bits |= PKRU_AD_BIT; -- cgit From 679b2ec8e060ca7a90441aff5e7d384720a41b76 Mon Sep 17 00:00:00 2001 From: Diego Elio Pettenò Date: Sun, 23 Feb 2020 19:11:44 +0000 Subject: scsi: sr: remove references to BLK_DEV_SR_VENDOR, leave it enabled MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This kernel configuration is basically enabling/disabling sr driver quirks detection. While these quirks are for fairly rare devices (very old CD burners, and a glucometer), the additional detection of these models is a very minimal amount of code. The logic behind the quirks is always built into the sr driver. This also removes the config from all the defconfig files that are enabling this already. Link: https://lore.kernel.org/r/20200223191144.726-1-flameeyes@flameeyes.com Reviewed-by: Jens Axboe Signed-off-by: Diego Elio Pettenò Signed-off-by: Martin K. Petersen --- arch/x86/configs/i386_defconfig | 1 - arch/x86/configs/x86_64_defconfig | 1 - 2 files changed, 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig index 59ce9ed58430..18806b4fb26a 100644 --- a/arch/x86/configs/i386_defconfig +++ b/arch/x86/configs/i386_defconfig @@ -137,7 +137,6 @@ CONFIG_CONNECTOR=y CONFIG_BLK_DEV_LOOP=y CONFIG_BLK_DEV_SD=y CONFIG_BLK_DEV_SR=y -CONFIG_BLK_DEV_SR_VENDOR=y CONFIG_CHR_DEV_SG=y CONFIG_SCSI_CONSTANTS=y CONFIG_SCSI_SPI_ATTRS=y diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig index 0b9654c7a05c..7f5cb3bead37 100644 --- a/arch/x86/configs/x86_64_defconfig +++ b/arch/x86/configs/x86_64_defconfig @@ -135,7 +135,6 @@ CONFIG_CONNECTOR=y CONFIG_BLK_DEV_LOOP=y CONFIG_BLK_DEV_SD=y CONFIG_BLK_DEV_SR=y -CONFIG_BLK_DEV_SR_VENDOR=y CONFIG_CHR_DEV_SG=y CONFIG_SCSI_CONSTANTS=y CONFIG_SCSI_SPI_ATTRS=y -- cgit From 003602ad5516e59940de42e44c8d8033387bb363 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Mon, 24 Feb 2020 18:21:28 -0500 Subject: x86/*/Makefile: Use -fno-asynchronous-unwind-tables to suppress .eh_frame sections While discussing a patch to discard .eh_frame from the compressed vmlinux using the linker script, Fangrui Song pointed out [1] that these sections shouldn't exist in the first place because arch/x86/Makefile uses -fno-asynchronous-unwind-tables. It turns out this is because the Makefiles used to build the compressed kernel redefine KBUILD_CFLAGS, dropping this flag. Add the flag to the Makefile for the compressed kernel, as well as the EFI stub Makefile to fix this. Also add the flag to boot/Makefile and realmode/rm/Makefile so that the kernel's boot code (boot/setup.elf) and realmode trampoline (realmode/rm/realmode.elf) won't be compiled with .eh_frame sections, since their linker scripts also just discard them. [1] https://lore.kernel.org/lkml/20200222185806.ywnqhfqmy67akfsa@google.com/ Suggested-by: Fangrui Song Signed-off-by: Arvind Sankar Signed-off-by: Borislav Petkov Reviewed-by: Nathan Chancellor Reviewed-by: Nick Desaulniers Reviewed-by: Kees Cook Tested-by: Nathan Chancellor Link: https://lkml.kernel.org/r/20200224232129.597160-2-nivedita@alum.mit.edu --- arch/x86/boot/Makefile | 1 + arch/x86/boot/compressed/Makefile | 1 + arch/x86/realmode/rm/Makefile | 1 + 3 files changed, 3 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile index 012b82fc8617..24f011e0adf1 100644 --- a/arch/x86/boot/Makefile +++ b/arch/x86/boot/Makefile @@ -68,6 +68,7 @@ clean-files += cpustr.h KBUILD_CFLAGS := $(REALMODE_CFLAGS) -D_SETUP KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__ KBUILD_CFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=) +KBUILD_CFLAGS += -fno-asynchronous-unwind-tables GCOV_PROFILE := n UBSAN_SANITIZE := n diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile index 26050ae0b27e..c33111341325 100644 --- a/arch/x86/boot/compressed/Makefile +++ b/arch/x86/boot/compressed/Makefile @@ -39,6 +39,7 @@ KBUILD_CFLAGS += $(call cc-disable-warning, address-of-packed-member) KBUILD_CFLAGS += $(call cc-disable-warning, gnu) KBUILD_CFLAGS += -Wno-pointer-sign KBUILD_CFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=) +KBUILD_CFLAGS += -fno-asynchronous-unwind-tables KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__ GCOV_PROFILE := n diff --git a/arch/x86/realmode/rm/Makefile b/arch/x86/realmode/rm/Makefile index 99b6332ba540..b11ec5d8f8ac 100644 --- a/arch/x86/realmode/rm/Makefile +++ b/arch/x86/realmode/rm/Makefile @@ -71,5 +71,6 @@ $(obj)/realmode.relocs: $(obj)/realmode.elf FORCE KBUILD_CFLAGS := $(REALMODE_CFLAGS) -D_SETUP -D_WAKEUP \ -I$(srctree)/arch/x86/boot KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__ +KBUILD_CFLAGS += -fno-asynchronous-unwind-tables GCOV_PROFILE := n UBSAN_SANITIZE := n -- cgit From 6f8f0dc980028e98ae339876a8403edae4d20e39 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Mon, 24 Feb 2020 18:21:29 -0500 Subject: x86/vmlinux: Drop unneeded linker script discard of .eh_frame Now that .eh_frame sections for the files in setup.elf and realmode.elf are not generated anymore, the linker scripts don't need the special output section name /DISCARD/ any more. Remove the one in the main kernel linker script as well, since there are no .eh_frame sections already, and fix up a comment referencing .eh_frame. Update the comment in asm/dwarf2.h referring to .eh_frame so it continues to make sense, as well as being more specific. [ bp: Touch up commit message. ] Signed-off-by: Arvind Sankar Signed-off-by: Borislav Petkov Reviewed-by: Nathan Chancellor Reviewed-by: Nick Desaulniers Reviewed-by: Kees Cook Tested-by: Nathan Chancellor Link: https://lkml.kernel.org/r/20200224232129.597160-3-nivedita@alum.mit.edu --- arch/x86/boot/compressed/vmlinux.lds.S | 5 ----- arch/x86/boot/setup.ld | 1 - arch/x86/include/asm/dwarf2.h | 4 ++-- arch/x86/kernel/vmlinux.lds.S | 7 ++----- arch/x86/realmode/rm/realmode.lds.S | 1 - 5 files changed, 4 insertions(+), 14 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S index 469dcf800a2c..508cfa6828c5 100644 --- a/arch/x86/boot/compressed/vmlinux.lds.S +++ b/arch/x86/boot/compressed/vmlinux.lds.S @@ -73,9 +73,4 @@ SECTIONS #endif . = ALIGN(PAGE_SIZE); /* keep ZO size page aligned */ _end = .; - - /* Discard .eh_frame to save some space */ - /DISCARD/ : { - *(.eh_frame) - } } diff --git a/arch/x86/boot/setup.ld b/arch/x86/boot/setup.ld index 3da1c37c6dd5..24c95522f231 100644 --- a/arch/x86/boot/setup.ld +++ b/arch/x86/boot/setup.ld @@ -52,7 +52,6 @@ SECTIONS _end = .; /DISCARD/ : { - *(.eh_frame) *(.note*) } diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h index ae391f609840..f71a0cce9373 100644 --- a/arch/x86/include/asm/dwarf2.h +++ b/arch/x86/include/asm/dwarf2.h @@ -42,8 +42,8 @@ * Emit CFI data in .debug_frame sections, not .eh_frame sections. * The latter we currently just discard since we don't do DWARF * unwinding at runtime. So only the offline DWARF information is - * useful to anyone. Note we should not use this directive if - * vmlinux.lds.S gets changed so it doesn't discard .eh_frame. + * useful to anyone. Note we should not use this directive if we + * ever decide to enable DWARF unwinding at runtime. */ .cfi_sections .debug_frame #else diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S index e3296aa028fe..5cab3a29adcb 100644 --- a/arch/x86/kernel/vmlinux.lds.S +++ b/arch/x86/kernel/vmlinux.lds.S @@ -313,8 +313,8 @@ SECTIONS . = ALIGN(8); /* - * .exit.text is discard at runtime, not link time, to deal with - * references from .altinstructions and .eh_frame + * .exit.text is discarded at runtime, not link time, to deal with + * references from .altinstructions */ .exit.text : AT(ADDR(.exit.text) - LOAD_OFFSET) { EXIT_TEXT @@ -412,9 +412,6 @@ SECTIONS DWARF_DEBUG DISCARDS - /DISCARD/ : { - *(.eh_frame) - } } diff --git a/arch/x86/realmode/rm/realmode.lds.S b/arch/x86/realmode/rm/realmode.lds.S index 64d135d1ee63..63aa51875ba0 100644 --- a/arch/x86/realmode/rm/realmode.lds.S +++ b/arch/x86/realmode/rm/realmode.lds.S @@ -71,7 +71,6 @@ SECTIONS /DISCARD/ : { *(.note*) *(.debug*) - *(.eh_frame*) } #include "pasyms.h" -- cgit From 645e64662af4dba9eb4a80bbab2663dd22018312 Mon Sep 17 00:00:00 2001 From: Anders Roxell Date: Fri, 24 Jan 2020 12:46:15 +0100 Subject: x86/Kconfig: Make CMDLINE_OVERRIDE depend on non-empty CMDLINE When trying to boot an allmodconfig kernel that is built with KCONFIG_ALLCONFIG=$(pwd)/arch/x86/configs/x86_64_defconfig, it doesn't boot since CONFIG_CMDLINE_OVERRIDE gets enabled and that requires the user to pass the full cmdline to CONFIG_CMDLINE. Change so that CONFIG_CMDLINE_OVERRIDE gets set only if CONFIG_CMDLINE is set to something except an empty string. [ bp: touchup. ] Signed-off-by: Anders Roxell Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200124114615.11577-1-anders.roxell@linaro.org --- arch/x86/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index beea77046f9b..b41419231b90 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2418,7 +2418,7 @@ config CMDLINE config CMDLINE_OVERRIDE bool "Built-in command line overrides boot loader arguments" - depends on CMDLINE_BOOL + depends on CMDLINE_BOOL && CMDLINE != "" ---help--- Set this option to 'Y' to have the kernel ignore the boot loader command line, and use ONLY the built-in command line. -- cgit From 3d51507f29f2153a658df4a0674ec5b592b62085 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 25 Feb 2020 22:36:37 +0100 Subject: x86/entry/32: Add missing ASM_CLAC to general_protection entry All exception entry points must have ASM_CLAC right at the beginning. The general_protection entry is missing one. Fixes: e59d1b0a2419 ("x86-32, smap: Add STAC/CLAC instructions to 32-bit kernel entry") Signed-off-by: Thomas Gleixner Reviewed-by: Frederic Weisbecker Reviewed-by: Alexandre Chartre Reviewed-by: Andy Lutomirski Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200225220216.219537887@linutronix.de --- arch/x86/entry/entry_32.S | 1 + 1 file changed, 1 insertion(+) (limited to 'arch/x86') diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 7e0560442538..39243df98100 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -1694,6 +1694,7 @@ SYM_CODE_START(int3) SYM_CODE_END(int3) SYM_CODE_START(general_protection) + ASM_CLAC pushl $do_general_protection jmp common_exception SYM_CODE_END(general_protection) -- cgit From 55ba18d6ed37a28cf8b8ca79e9aef4cf98183bb7 Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Tue, 25 Feb 2020 22:36:38 +0100 Subject: x86/mce: Disable tracing and kprobes on do_machine_check() do_machine_check() can be raised in almost any context including the most fragile ones. Prevent kprobes and tracing. Signed-off-by: Andy Lutomirski Signed-off-by: Thomas Gleixner Reviewed-by: Borislav Petkov Reviewed-by: Alexandre Chartre Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200225220216.315548935@linutronix.de --- arch/x86/include/asm/traps.h | 3 --- arch/x86/kernel/cpu/mce/core.c | 12 ++++++++++-- 2 files changed, 10 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index ffa0dc8a535e..e1c660b9e8e6 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -88,9 +88,6 @@ dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code); dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code); dotraplinkage void do_alignment_check(struct pt_regs *regs, long error_code); -#ifdef CONFIG_X86_MCE -dotraplinkage void do_machine_check(struct pt_regs *regs, long error_code); -#endif dotraplinkage void do_simd_coprocessor_error(struct pt_regs *regs, long error_code); #ifdef CONFIG_X86_32 dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code); diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 2c4f949611e4..32ecc5969d59 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1213,8 +1213,14 @@ static void __mc_scan_banks(struct mce *m, struct mce *final, * On Intel systems this is entered on all CPUs in parallel through * MCE broadcast. However some CPUs might be broken beyond repair, * so be always careful when synchronizing with others. + * + * Tracing and kprobes are disabled: if we interrupted a kernel context + * with IF=1, we need to minimize stack usage. There are also recursion + * issues: if the machine check was due to a failure of the memory + * backing the user stack, tracing that reads the user stack will cause + * potentially infinite recursion. */ -void do_machine_check(struct pt_regs *regs, long error_code) +void notrace do_machine_check(struct pt_regs *regs, long error_code) { DECLARE_BITMAP(valid_banks, MAX_NR_BANKS); DECLARE_BITMAP(toclear, MAX_NR_BANKS); @@ -1360,6 +1366,7 @@ out_ist: ist_exit(regs); } EXPORT_SYMBOL_GPL(do_machine_check); +NOKPROBE_SYMBOL(do_machine_check); #ifndef CONFIG_MEMORY_FAILURE int memory_failure(unsigned long pfn, int flags) @@ -1892,10 +1899,11 @@ static void unexpected_machine_check(struct pt_regs *regs, long error_code) void (*machine_check_vector)(struct pt_regs *, long error_code) = unexpected_machine_check; -dotraplinkage void do_mce(struct pt_regs *regs, long error_code) +dotraplinkage notrace void do_mce(struct pt_regs *regs, long error_code) { machine_check_vector(regs, error_code); } +NOKPROBE_SYMBOL(do_mce); /* * Called for each booted CPU to set up machine checks. -- cgit From 840371bea19e85f30d19909171248cf8c5845fd7 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 25 Feb 2020 22:36:39 +0100 Subject: x86/entry/32: Force MCE through do_mce() Remove the pointless difference between 32 and 64 bit to make further unifications simpler. Signed-off-by: Thomas Gleixner Reviewed-by: Frederic Weisbecker Reviewed-by: Alexandre Chartre Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200225220216.428188397@linutronix.de --- arch/x86/entry/entry_32.S | 2 +- arch/x86/include/asm/mce.h | 3 --- arch/x86/kernel/cpu/mce/internal.h | 3 +++ 3 files changed, 4 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 39243df98100..a8b44384699e 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -1365,7 +1365,7 @@ SYM_CODE_END(divide_error) SYM_CODE_START(machine_check) ASM_CLAC pushl $0 - pushl machine_check_vector + pushl $do_mce jmp common_exception SYM_CODE_END(machine_check) #endif diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index 4359b955e0b7..a7e0b9d370a4 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -238,9 +238,6 @@ extern void mce_disable_bank(int bank); /* * Exception handler */ - -/* Call the installed machine check handler for this CPU setup. */ -extern void (*machine_check_vector)(struct pt_regs *, long error_code); void do_machine_check(struct pt_regs *, long); /* diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h index b785c0d0b590..c1cda0b7aa89 100644 --- a/arch/x86/kernel/cpu/mce/internal.h +++ b/arch/x86/kernel/cpu/mce/internal.h @@ -8,6 +8,9 @@ #include #include +/* Pointer to the installed machine check handler for this CPU setup. */ +extern void (*machine_check_vector)(struct pt_regs *, long error_code); + enum severity_level { MCE_NO_SEVERITY, MCE_DEFERRED_SEVERITY, -- cgit From e039dd815941e785203142261397da6ec64d20cc Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 25 Feb 2020 22:36:40 +0100 Subject: x86/traps: Remove pointless irq enable from do_spurious_interrupt_bug() That function returns immediately after conditionally reenabling interrupts which is more than pointless and requires the ASM code to disable interrupts again. Signed-off-by: Thomas Gleixner Reviewed-by: Sean Christopherson Reviewed-by: Alexandre Chartre Reviewed-by: Frederic Weisbecker Reviewed-by: Andy Lutomirski Acked-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/r/20191023123117.871608831@linutronix.de Link: https://lkml.kernel.org/r/20200225220216.518575042@linutronix.de --- arch/x86/kernel/traps.c | 1 - 1 file changed, 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 6ef00eb6fbb9..474b8cb54899 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -862,7 +862,6 @@ do_simd_coprocessor_error(struct pt_regs *regs, long error_code) dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code) { - cond_local_irq_enable(regs); } dotraplinkage void -- cgit From d244d0e195bc12964bcf3b8eef45e715a7f203b0 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 25 Feb 2020 22:36:41 +0100 Subject: x86/traps: Document do_spurious_interrupt_bug() Add a comment which explains why this empty handler for a reserved vector exists. Requested-by: Josh Poimboeuf Signed-off-by: Thomas Gleixner Reviewed-by: Frederic Weisbecker Reviewed-by: Alexandre Chartre Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200225220216.624165786@linutronix.de --- arch/x86/kernel/traps.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 474b8cb54899..7ffb6f4dea1c 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -862,6 +862,25 @@ do_simd_coprocessor_error(struct pt_regs *regs, long error_code) dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code) { + /* + * This addresses a Pentium Pro Erratum: + * + * PROBLEM: If the APIC subsystem is configured in mixed mode with + * Virtual Wire mode implemented through the local APIC, an + * interrupt vector of 0Fh (Intel reserved encoding) may be + * generated by the local APIC (Int 15). This vector may be + * generated upon receipt of a spurious interrupt (an interrupt + * which is removed before the system receives the INTA sequence) + * instead of the programmed 8259 spurious interrupt vector. + * + * IMPLICATION: The spurious interrupt vector programmed in the + * 8259 is normally handled by an operating system's spurious + * interrupt handler. However, a vector of 0Fh is unknown to some + * operating systems, which would crash if this erratum occurred. + * + * In theory this could be limited to 32bit, but the handler is not + * hurting and who knows which other CPUs suffer from this. + */ } dotraplinkage void -- cgit From 3ba4f0a633ca4bfeadb24eba19cb53ebe246eabf Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 25 Feb 2020 22:36:42 +0100 Subject: x86/traps: Remove redundant declaration of do_double_fault() Signed-off-by: Thomas Gleixner Reviewed-by: Alexandre Chartre Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200225220216.720335354@linutronix.de --- arch/x86/include/asm/traps.h | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index e1c660b9e8e6..c26a7e1d8a2c 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -76,13 +76,6 @@ dotraplinkage void do_coprocessor_segment_overrun(struct pt_regs *regs, long err dotraplinkage void do_invalid_TSS(struct pt_regs *regs, long error_code); dotraplinkage void do_segment_not_present(struct pt_regs *regs, long error_code); dotraplinkage void do_stack_segment(struct pt_regs *regs, long error_code); -#ifdef CONFIG_X86_64 -dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code, unsigned long address); -asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs); -asmlinkage __visible notrace -struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s); -void __init trap_init(void); -#endif dotraplinkage void do_general_protection(struct pt_regs *regs, long error_code); dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address); dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code); @@ -94,6 +87,13 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code); #endif dotraplinkage void do_mce(struct pt_regs *regs, long error_code); +#ifdef CONFIG_X86_64 +asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs); +asmlinkage __visible notrace +struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s); +void __init trap_init(void); +#endif + static inline int get_si_code(unsigned long condition) { if (condition & DR_STEP) -- cgit From 17dbedb5da184b501c5c51fc78d1071005139e48 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 25 Feb 2020 22:36:43 +0100 Subject: x86/irq: Remove useless return value from do_IRQ() Nothing is using it. Signed-off-by: Thomas Gleixner Reviewed-by: Alexandre Chartre Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200225220216.826870369@linutronix.de --- arch/x86/include/asm/irq.h | 2 +- arch/x86/kernel/irq.c | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h index a176f6165d85..72fba0eeeb30 100644 --- a/arch/x86/include/asm/irq.h +++ b/arch/x86/include/asm/irq.h @@ -36,7 +36,7 @@ extern void native_init_IRQ(void); extern void handle_irq(struct irq_desc *desc, struct pt_regs *regs); -extern __visible unsigned int do_IRQ(struct pt_regs *regs); +extern __visible void do_IRQ(struct pt_regs *regs); extern void init_ISA_irqs(void); diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index 21efee32e2b1..c7965ff429c5 100644 --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -230,7 +230,7 @@ u64 arch_irq_stat(void) * SMP cross-CPU interrupts have their own specific * handlers). */ -__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs) +__visible void __irq_entry do_IRQ(struct pt_regs *regs) { struct pt_regs *old_regs = set_irq_regs(regs); struct irq_desc * desc; @@ -263,7 +263,6 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs) exiting_irq(); set_irq_regs(old_regs); - return 1; } #ifdef CONFIG_X86_LOCAL_APIC -- cgit From ac3607f92f70c762e24d0ae731168f7584de51ec Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 25 Feb 2020 22:36:45 +0100 Subject: x86/entry/entry_32: Route int3 through common_exception int3 is not using the common_exception path for purely historical reasons, but there is no reason to keep it the only exception which is different. Make it use common_exception so the upcoming changes to autogenerate the entry stubs do not have to special case int3. Signed-off-by: Thomas Gleixner Reviewed-by: Frederic Weisbecker Reviewed-by: Alexandre Chartre Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200225220217.042369808@linutronix.de --- arch/x86/entry/entry_32.S | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index a8b44384699e..0753f4879a80 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -1683,14 +1683,8 @@ SYM_CODE_END(nmi) SYM_CODE_START(int3) ASM_CLAC pushl $-1 # mark this as an int - - SAVE_ALL switch_stacks=1 - ENCODE_FRAME_POINTER - TRACE_IRQS_OFF - xorl %edx, %edx # zero error code - movl %esp, %eax # pt_regs pointer - call do_int3 - jmp ret_from_exception + pushl $do_int3 + jmp common_exception SYM_CODE_END(int3) SYM_CODE_START(general_protection) -- cgit From 65c668f5faebf549db086b7a6841b6f4187b4e4f Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Tue, 25 Feb 2020 22:36:46 +0100 Subject: x86/traps: Stop using ist_enter/exit() in do_int3() #BP is not longer using IST and using ist_enter() and ist_exit() makes it harder to change ist_enter() and ist_exit()'s behavior. Instead open-code the very small amount of required logic. Signed-off-by: Andy Lutomirski Signed-off-by: Thomas Gleixner Reviewed-by: Alexandre Chartre Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200225220217.150607679@linutronix.de --- arch/x86/kernel/traps.c | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 7ffb6f4dea1c..c0bc9df8634d 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -572,14 +572,20 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) return; /* - * Use ist_enter despite the fact that we don't use an IST stack. - * We can be called from a kprobe in non-CONTEXT_KERNEL kernel - * mode or even during context tracking state changes. + * Unlike any other non-IST entry, we can be called from a kprobe in + * non-CONTEXT_KERNEL kernel mode or even during context tracking + * state changes. Make sure that we wake up RCU even if we're coming + * from kernel code. * - * This means that we can't schedule. That's okay. + * This means that we can't schedule even if we came from a + * preemptible kernel context. That's okay. */ - ist_enter(regs); + if (!user_mode(regs)) { + rcu_nmi_enter(); + preempt_disable(); + } RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU"); + #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP, SIGTRAP) == NOTIFY_STOP) @@ -600,7 +606,10 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) cond_local_irq_disable(regs); exit: - ist_exit(regs); + if (!user_mode(regs)) { + preempt_enable_no_resched(); + rcu_nmi_exit(); + } } NOKPROBE_SYMBOL(do_int3); -- cgit From f10e80a19b07b58fc2adad7945f8313b01503bae Mon Sep 17 00:00:00 2001 From: Tom Lendacky Date: Fri, 28 Feb 2020 13:14:03 +0100 Subject: efi/x86: Add TPM related EFI tables to unencrypted mapping checks When booting with SME active, EFI tables must be mapped unencrypted since they were built by UEFI in unencrypted memory. Update the list of tables to be checked during early_memremap() processing to account for the EFI TPM tables. This fixes a bug where an EFI TPM log table has been created by UEFI, but it lives in memory that has been marked as usable rather than reserved. Signed-off-by: Tom Lendacky Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Cc: linux-efi@vger.kernel.org Cc: Ingo Molnar Cc: Thomas Gleixner Cc: David Hildenbrand Cc: Heinrich Schuchardt Cc: # v5.4+ Link: https://lore.kernel.org/r/4144cd813f113c20cdfa511cf59500a64e6015be.1582662842.git.thomas.lendacky@amd.com Link: https://lore.kernel.org/r/20200228121408.9075-2-ardb@kernel.org --- arch/x86/platform/efi/efi.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 43b24e149312..0a8117865430 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -88,6 +88,8 @@ static const unsigned long * const efi_tables[] = { #ifdef CONFIG_EFI_RCI2_TABLE &rci2_table_phys, #endif + &efi.tpm_log, + &efi.tpm_final_log, }; u64 efi_setup; /* efi setup_data physical address */ -- cgit From badc61982adb6018a48ed8fe32087b9754cae14b Mon Sep 17 00:00:00 2001 From: Tom Lendacky Date: Fri, 28 Feb 2020 13:14:04 +0100 Subject: efi/x86: Add RNG seed EFI table to unencrypted mapping check When booting with SME active, EFI tables must be mapped unencrypted since they were built by UEFI in unencrypted memory. Update the list of tables to be checked during early_memremap() processing to account for the EFI RNG seed table. Signed-off-by: Tom Lendacky Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Cc: linux-efi@vger.kernel.org Cc: Ingo Molnar Cc: Thomas Gleixner Cc: David Hildenbrand Cc: Heinrich Schuchardt Link: https://lore.kernel.org/r/b64385fc13e5d7ad4b459216524f138e7879234f.1582662842.git.thomas.lendacky@amd.com Link: https://lore.kernel.org/r/20200228121408.9075-3-ardb@kernel.org --- arch/x86/platform/efi/efi.c | 1 + 1 file changed, 1 insertion(+) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 0a8117865430..aca9bdd87bca 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -90,6 +90,7 @@ static const unsigned long * const efi_tables[] = { #endif &efi.tpm_log, &efi.tpm_final_log, + &efi_rng_seed, }; u64 efi_setup; /* efi setup_data physical address */ -- cgit From c98a76eabbb6e7755f3d4a4c33f8fe869dda6383 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Wed, 26 Feb 2020 18:00:31 -0500 Subject: x86/boot/compressed: Fix reloading of GDTR post-relocation The following commit: ef5a7b5eb13e ("efi/x86: Remove GDT setup from efi_main") introduced GDT setup into the 32-bit kernel's startup_32, and reloads the GDTR after relocating the kernel for paranoia's sake. A followup commit: 32d009137a56 ("x86/boot: Reload GDTR after copying to the end of the buffer") introduced a similar GDTR reload in the 64-bit kernel as well. The GDTR is adjusted by (init_size-_end), however this may not be the correct offset to apply if the kernel was loaded at a misaligned address or below LOAD_PHYSICAL_ADDR, as in that case the decompression buffer has an additional offset from the original load address. This should never happen for a conformant bootloader, but we're being paranoid anyway, so just store the new GDT address in there instead of adding any offsets, which is simpler as well. Fixes: ef5a7b5eb13e ("efi/x86: Remove GDT setup from efi_main") Fixes: 32d009137a56 ("x86/boot: Reload GDTR after copying to the end of the buffer") Signed-off-by: Arvind Sankar Signed-off-by: Ingo Molnar Cc: Ingo Molnar Cc: Ard Biesheuvel Cc: linux-efi@vger.kernel.org Cc: Thomas Gleixner Cc: x86@kernel.org Link: https://lore.kernel.org/r/20200226230031.3011645-2-nivedita@alum.mit.edu --- arch/x86/boot/compressed/head_32.S | 9 ++++----- arch/x86/boot/compressed/head_64.S | 4 ++-- 2 files changed, 6 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S index 356060c5332c..2f8138b71ea9 100644 --- a/arch/x86/boot/compressed/head_32.S +++ b/arch/x86/boot/compressed/head_32.S @@ -139,12 +139,11 @@ SYM_FUNC_START(startup_32) /* * The GDT may get overwritten either during the copy we just did or * during extract_kernel below. To avoid any issues, repoint the GDTR - * to the new copy of the GDT. EAX still contains the previously - * calculated relocation offset of init_size - _end. + * to the new copy of the GDT. */ - leal gdt(%ebx), %edx - addl %eax, 2(%edx) - lgdt (%edx) + leal gdt(%ebx), %eax + movl %eax, 2(%eax) + lgdt (%eax) /* * Jump to the relocated address. diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index f7bacc4c1494..fcf8feaa57ea 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -456,8 +456,8 @@ trampoline_return: * to the new copy of the GDT. */ leaq gdt64(%rbx), %rax - subq %rbp, 2(%rax) - addq %rbx, 2(%rax) + leaq gdt(%rbx), %rdx + movq %rdx, 2(%rax) lgdt (%rax) /* -- cgit From e441a2ae0e9e9bb12fd3fbe2d59d923fadfe8ef7 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Thu, 27 Feb 2020 15:24:29 +0100 Subject: x86/entry/32: Remove the 0/-1 distinction from exception entries Nothing cares about the -1 "mark as interrupt" in the errorcode of exception entries. It's only used to fill the error code when a signal is delivered, but this is already inconsistent vs. 64 bit as there all exceptions which do not have an error code set it to 0. So if 32 bit applications would care about this, then they would have noticed more than a decade ago. Just use 0 for all excpetions which do not have an errorcode consistently. This does neither break /proc/$PID/syscall because this interface examines the error code / syscall number which is on the stack and that is set to -1 (no syscall) in common_exception unconditionally for all exceptions. The push in the entry stub is just there to fill the hardware error code slot on the stack for consistency of the stack layout. A transient observation of 0 is possible, but that's true for the other exceptions which use 0 already as well and that interface is an unreliable snapshot of dubious correctness anyway. Signed-off-by: Thomas Gleixner Reviewed-by: Alexandre Chartre Link: https://lkml.kernel.org/r/87mu94m7ky.fsf@nanos.tec.linutronix.de --- arch/x86/entry/entry_32.S | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 0753f4879a80..ddc87f232877 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -1290,7 +1290,7 @@ SYM_CODE_END(simd_coprocessor_error) SYM_CODE_START(device_not_available) ASM_CLAC - pushl $-1 # mark this as an int + pushl $0 pushl $do_device_not_available jmp common_exception SYM_CODE_END(device_not_available) @@ -1531,7 +1531,7 @@ SYM_CODE_START(debug) * Entry from sysenter is now handled in common_exception */ ASM_CLAC - pushl $-1 # mark this as an int + pushl $0 pushl $do_debug jmp common_exception SYM_CODE_END(debug) @@ -1682,7 +1682,7 @@ SYM_CODE_END(nmi) SYM_CODE_START(int3) ASM_CLAC - pushl $-1 # mark this as an int + pushl $0 pushl $do_int3 jmp common_exception SYM_CODE_END(int3) -- cgit From 0e72a6a3cfc3a32273f5e99bfaa4407f4917d343 Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Wed, 15 Jan 2020 17:35:45 +0100 Subject: efi: Export boot-services code and data as debugfs-blobs Sometimes it is useful to be able to dump the efi boot-services code and data. This commit adds these as debugfs-blobs to /sys/kernel/debug/efi, but only if efi=debug is passed on the kernel-commandline as this requires not freeing those memory-regions, which costs 20+ MB of RAM. Reviewed-by: Greg Kroah-Hartman Acked-by: Ard Biesheuvel Signed-off-by: Hans de Goede Link: https://lore.kernel.org/r/20200115163554.101315-2-hdegoede@redhat.com Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 1 + arch/x86/platform/efi/quirks.c | 4 ++++ 2 files changed, 5 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index ae923ee8e2b4..8391d3c658b4 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -243,6 +243,7 @@ int __init efi_memblock_x86_reserve_range(void) efi.memmap.desc_version); memblock_reserve(pmap, efi.memmap.nr_map * efi.memmap.desc_size); + set_bit(EFI_PRESERVE_BS_REGIONS, &efi.flags); return 0; } diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c index 88d32c06cffa..bada1037b711 100644 --- a/arch/x86/platform/efi/quirks.c +++ b/arch/x86/platform/efi/quirks.c @@ -410,6 +410,10 @@ void __init efi_free_boot_services(void) int num_entries = 0; void *new, *new_md; + /* Keep all regions for /sys/kernel/debug/efi */ + if (efi_enabled(EFI_DBG)) + return; + for_each_efi_memory_desc(md) { unsigned long long start = md->phys_addr; unsigned long long size = md->num_pages << EFI_PAGE_SHIFT; -- cgit From f0df68d5bae8825ee5b62f00af237ae82247f045 Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Wed, 15 Jan 2020 17:35:46 +0100 Subject: efi: Add embedded peripheral firmware support Just like with PCI options ROMs, which we save in the setup_efi_pci* functions from arch/x86/boot/compressed/eboot.c, the EFI code / ROM itself sometimes may contain data which is useful/necessary for peripheral drivers to have access to. Specifically the EFI code may contain an embedded copy of firmware which needs to be (re)loaded into the peripheral. Normally such firmware would be part of linux-firmware, but in some cases this is not feasible, for 2 reasons: 1) The firmware is customized for a specific use-case of the chipset / use with a specific hardware model, so we cannot have a single firmware file for the chipset. E.g. touchscreen controller firmwares are compiled specifically for the hardware model they are used with, as they are calibrated for a specific model digitizer. 2) Despite repeated attempts we have failed to get permission to redistribute the firmware. This is especially a problem with customized firmwares, these get created by the chip vendor for a specific ODM and the copyright may partially belong with the ODM, so the chip vendor cannot give a blanket permission to distribute these. This commit adds support for finding peripheral firmware embedded in the EFI code and makes the found firmware available through the new efi_get_embedded_fw() function. Support for loading these firmwares through the standard firmware loading mechanism is added in a follow-up commit in this patch-series. Note we check the EFI_BOOT_SERVICES_CODE for embedded firmware near the end of start_kernel(), just before calling rest_init(), this is on purpose because the typical EFI_BOOT_SERVICES_CODE memory-segment is too large for early_memremap(), so the check must be done after mm_init(). This relies on EFI_BOOT_SERVICES_CODE not being free-ed until efi_free_boot_services() is called, which means that this will only work on x86 for now. Reported-by: Dave Olsthoorn Suggested-by: Peter Jones Acked-by: Ard Biesheuvel Signed-off-by: Hans de Goede Link: https://lore.kernel.org/r/20200115163554.101315-3-hdegoede@redhat.com Signed-off-by: Ard Biesheuvel --- arch/x86/platform/efi/efi.c | 1 + 1 file changed, 1 insertion(+) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index 8391d3c658b4..a4ea04603af3 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -944,6 +944,7 @@ static void __init __efi_enter_virtual_mode(void) goto err; } + efi_check_for_embedded_firmwares(); efi_free_boot_services(); /* -- cgit From 2a86f6612164a10a3cdb6a499bddf329c4d69f84 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Fri, 28 Feb 2020 12:46:40 +0900 Subject: kbuild: use KBUILD_DEFCONFIG as the fallback for DEFCONFIG_LIST Most of the Kconfig commands (except defconfig and all*config) read the .config file as a base set of CONFIG options. When it does not exist, the files in DEFCONFIG_LIST are searched in this order and loaded if found. I do not see much sense in the last two lines in DEFCONFIG_LIST. [1] ARCH_DEFCONFIG The entry for DEFCONFIG_LIST is guarded by 'depends on !UML'. So, the ARCH_DEFCONFIG definition in arch/x86/um/Kconfig is meaningless. arch/{sh,sparc,x86}/Kconfig define ARCH_DEFCONFIG depending on 32 or 64 bit variant symbols. This is a little bit strange; ARCH_DEFCONFIG should be a fixed string because the base config file is loaded before the symbol evaluation stage. Using KBUILD_DEFCONFIG makes more sense because it is fixed before Kconfig is invoked. Fortunately, arch/{sh,sparc,x86}/Makefile define it in the same way, and it works as expected. Hence, replace ARCH_DEFCONFIG with "arch/$(SRCARCH)/configs/$(KBUILD_DEFCONFIG)". [2] arch/$(ARCH)/defconfig This file path is no longer valid. The defconfig files are always located in the arch configs/ directories. $ find arch -name defconfig | sort arch/alpha/configs/defconfig arch/arm64/configs/defconfig arch/csky/configs/defconfig arch/nds32/configs/defconfig arch/riscv/configs/defconfig arch/s390/configs/defconfig arch/unicore32/configs/defconfig The path arch/*/configs/defconfig is already covered by "arch/$(SRCARCH)/configs/$(KBUILD_DEFCONFIG)". So, this file path is not necessary. I moved the default KBUILD_DEFCONFIG to the top Makefile. Otherwise, the 7 architectures listed above would end up with endless loop of syncconfig. Signed-off-by: Masahiro Yamada --- arch/x86/Kconfig | 5 ----- arch/x86/um/Kconfig | 5 ----- 2 files changed, 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index beea77046f9b..98935f4387f9 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -240,11 +240,6 @@ config OUTPUT_FORMAT default "elf32-i386" if X86_32 default "elf64-x86-64" if X86_64 -config ARCH_DEFCONFIG - string - default "arch/x86/configs/i386_defconfig" if X86_32 - default "arch/x86/configs/x86_64_defconfig" if X86_64 - config LOCKDEP_SUPPORT def_bool y diff --git a/arch/x86/um/Kconfig b/arch/x86/um/Kconfig index a8985e1f7432..95d26a69088b 100644 --- a/arch/x86/um/Kconfig +++ b/arch/x86/um/Kconfig @@ -27,11 +27,6 @@ config X86_64 def_bool 64BIT select MODULES_USE_ELF_RELA -config ARCH_DEFCONFIG - string - default "arch/um/configs/i386_defconfig" if X86_32 - default "arch/um/configs/x86_64_defconfig" if X86_64 - config 3_LEVEL_PGTABLES bool "Three-level pagetables" if !64BIT default 64BIT -- cgit From 88fd9e5352fe05f7fe57778293aebd4cd106960b Mon Sep 17 00:00:00 2001 From: KP Singh Date: Wed, 4 Mar 2020 20:18:47 +0100 Subject: bpf: Refactor trampoline update code As we need to introduce a third type of attachment for trampolines, the flattened signature of arch_prepare_bpf_trampoline gets even more complicated. Refactor the prog and count argument to arch_prepare_bpf_trampoline to use bpf_tramp_progs to simplify the addition and accounting for new attachment types. Signed-off-by: KP Singh Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Acked-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200304191853.1529-2-kpsingh@chromium.org --- arch/x86/net/bpf_jit_comp.c | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 9ba08e9abc09..15c7d28bc05c 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1362,12 +1362,12 @@ static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args, } static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, - struct bpf_prog **progs, int prog_cnt, int stack_size) + struct bpf_tramp_progs *tp, int stack_size) { u8 *prog = *pprog; int cnt = 0, i; - for (i = 0; i < prog_cnt; i++) { + for (i = 0; i < tp->nr_progs; i++) { if (emit_call(&prog, __bpf_prog_enter, prog)) return -EINVAL; /* remember prog start time returned by __bpf_prog_enter */ @@ -1376,17 +1376,17 @@ static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, /* arg1: lea rdi, [rbp - stack_size] */ EMIT4(0x48, 0x8D, 0x7D, -stack_size); /* arg2: progs[i]->insnsi for interpreter */ - if (!progs[i]->jited) + if (!tp->progs[i]->jited) emit_mov_imm64(&prog, BPF_REG_2, - (long) progs[i]->insnsi >> 32, - (u32) (long) progs[i]->insnsi); + (long) tp->progs[i]->insnsi >> 32, + (u32) (long) tp->progs[i]->insnsi); /* call JITed bpf program or interpreter */ - if (emit_call(&prog, progs[i]->bpf_func, prog)) + if (emit_call(&prog, tp->progs[i]->bpf_func, prog)) return -EINVAL; /* arg1: mov rdi, progs[i] */ - emit_mov_imm64(&prog, BPF_REG_1, (long) progs[i] >> 32, - (u32) (long) progs[i]); + emit_mov_imm64(&prog, BPF_REG_1, (long) tp->progs[i] >> 32, + (u32) (long) tp->progs[i]); /* arg2: mov rsi, rbx <- start time in nsec */ emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6); if (emit_call(&prog, __bpf_prog_exit, prog)) @@ -1458,12 +1458,13 @@ static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, */ int arch_prepare_bpf_trampoline(void *image, void *image_end, const struct btf_func_model *m, u32 flags, - struct bpf_prog **fentry_progs, int fentry_cnt, - struct bpf_prog **fexit_progs, int fexit_cnt, + struct bpf_tramp_progs *tprogs, void *orig_call) { int cnt = 0, nr_args = m->nr_args; int stack_size = nr_args * 8; + struct bpf_tramp_progs *fentry = &tprogs[BPF_TRAMP_FENTRY]; + struct bpf_tramp_progs *fexit = &tprogs[BPF_TRAMP_FEXIT]; u8 *prog; /* x86-64 supports up to 6 arguments. 7+ can be added in the future */ @@ -1492,12 +1493,12 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, save_regs(m, &prog, nr_args, stack_size); - if (fentry_cnt) - if (invoke_bpf(m, &prog, fentry_progs, fentry_cnt, stack_size)) + if (fentry->nr_progs) + if (invoke_bpf(m, &prog, fentry, stack_size)) return -EINVAL; if (flags & BPF_TRAMP_F_CALL_ORIG) { - if (fentry_cnt) + if (fentry->nr_progs) restore_regs(m, &prog, nr_args, stack_size); /* call original function */ @@ -1507,8 +1508,8 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); } - if (fexit_cnt) - if (invoke_bpf(m, &prog, fexit_progs, fexit_cnt, stack_size)) + if (fexit->nr_progs) + if (invoke_bpf(m, &prog, fexit, stack_size)) return -EINVAL; if (flags & BPF_TRAMP_F_RESTORE_REGS) -- cgit From 7e639208e88d60abf83d48dfda4c0ad325a77b58 Mon Sep 17 00:00:00 2001 From: KP Singh Date: Wed, 4 Mar 2020 20:18:48 +0100 Subject: bpf: JIT helpers for fmod_ret progs * Split the invoke_bpf program to prepare for special handling of fmod_ret programs introduced in a subsequent patch. * Move the definition of emit_cond_near_jump and emit_nops as they are needed for fmod_ret. * Refactor branch target alignment into its own generic helper function i.e. emit_align. Signed-off-by: KP Singh Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Acked-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200304191853.1529-3-kpsingh@chromium.org --- arch/x86/net/bpf_jit_comp.c | 148 +++++++++++++++++++++++++------------------- 1 file changed, 85 insertions(+), 63 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 15c7d28bc05c..d6349e930b06 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1361,35 +1361,95 @@ static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args, -(stack_size - i * 8)); } +static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, + struct bpf_prog *p, int stack_size) +{ + u8 *prog = *pprog; + int cnt = 0; + + if (emit_call(&prog, __bpf_prog_enter, prog)) + return -EINVAL; + /* remember prog start time returned by __bpf_prog_enter */ + emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0); + + /* arg1: lea rdi, [rbp - stack_size] */ + EMIT4(0x48, 0x8D, 0x7D, -stack_size); + /* arg2: progs[i]->insnsi for interpreter */ + if (!p->jited) + emit_mov_imm64(&prog, BPF_REG_2, + (long) p->insnsi >> 32, + (u32) (long) p->insnsi); + /* call JITed bpf program or interpreter */ + if (emit_call(&prog, p->bpf_func, prog)) + return -EINVAL; + + /* arg1: mov rdi, progs[i] */ + emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32, + (u32) (long) p); + /* arg2: mov rsi, rbx <- start time in nsec */ + emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6); + if (emit_call(&prog, __bpf_prog_exit, prog)) + return -EINVAL; + + *pprog = prog; + return 0; +} + +static void emit_nops(u8 **pprog, unsigned int len) +{ + unsigned int i, noplen; + u8 *prog = *pprog; + int cnt = 0; + + while (len > 0) { + noplen = len; + + if (noplen > ASM_NOP_MAX) + noplen = ASM_NOP_MAX; + + for (i = 0; i < noplen; i++) + EMIT1(ideal_nops[noplen][i]); + len -= noplen; + } + + *pprog = prog; +} + +static void emit_align(u8 **pprog, u32 align) +{ + u8 *target, *prog = *pprog; + + target = PTR_ALIGN(prog, align); + if (target != prog) + emit_nops(&prog, target - prog); + + *pprog = prog; +} + +static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond) +{ + u8 *prog = *pprog; + int cnt = 0; + s64 offset; + + offset = func - (ip + 2 + 4); + if (!is_simm32(offset)) { + pr_err("Target %p is out of range\n", func); + return -EINVAL; + } + EMIT2_off32(0x0F, jmp_cond + 0x10, offset); + *pprog = prog; + return 0; +} + static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, struct bpf_tramp_progs *tp, int stack_size) { + int i; u8 *prog = *pprog; - int cnt = 0, i; for (i = 0; i < tp->nr_progs; i++) { - if (emit_call(&prog, __bpf_prog_enter, prog)) - return -EINVAL; - /* remember prog start time returned by __bpf_prog_enter */ - emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0); - - /* arg1: lea rdi, [rbp - stack_size] */ - EMIT4(0x48, 0x8D, 0x7D, -stack_size); - /* arg2: progs[i]->insnsi for interpreter */ - if (!tp->progs[i]->jited) - emit_mov_imm64(&prog, BPF_REG_2, - (long) tp->progs[i]->insnsi >> 32, - (u32) (long) tp->progs[i]->insnsi); - /* call JITed bpf program or interpreter */ - if (emit_call(&prog, tp->progs[i]->bpf_func, prog)) - return -EINVAL; - - /* arg1: mov rdi, progs[i] */ - emit_mov_imm64(&prog, BPF_REG_1, (long) tp->progs[i] >> 32, - (u32) (long) tp->progs[i]); - /* arg2: mov rsi, rbx <- start time in nsec */ - emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6); - if (emit_call(&prog, __bpf_prog_exit, prog)) + if (invoke_bpf_prog(m, &prog, tp->progs[i], stack_size)) return -EINVAL; } *pprog = prog; @@ -1531,42 +1591,6 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, return prog - (u8 *)image; } -static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond) -{ - u8 *prog = *pprog; - int cnt = 0; - s64 offset; - - offset = func - (ip + 2 + 4); - if (!is_simm32(offset)) { - pr_err("Target %p is out of range\n", func); - return -EINVAL; - } - EMIT2_off32(0x0F, jmp_cond + 0x10, offset); - *pprog = prog; - return 0; -} - -static void emit_nops(u8 **pprog, unsigned int len) -{ - unsigned int i, noplen; - u8 *prog = *pprog; - int cnt = 0; - - while (len > 0) { - noplen = len; - - if (noplen > ASM_NOP_MAX) - noplen = ASM_NOP_MAX; - - for (i = 0; i < noplen; i++) - EMIT1(ideal_nops[noplen][i]); - len -= noplen; - } - - *pprog = prog; -} - static int emit_fallback_jump(u8 **pprog) { u8 *prog = *pprog; @@ -1589,7 +1613,7 @@ static int emit_fallback_jump(u8 **pprog) static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs) { - u8 *jg_reloc, *jg_target, *prog = *pprog; + u8 *jg_reloc, *prog = *pprog; int pivot, err, jg_bytes = 1, cnt = 0; s64 jg_offset; @@ -1644,9 +1668,7 @@ static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs) * Coding Rule 11: All branch targets should be 16-byte * aligned. */ - jg_target = PTR_ALIGN(prog, 16); - if (jg_target != prog) - emit_nops(&prog, jg_target - prog); + emit_align(&prog, 16); jg_offset = prog - jg_reloc; emit_code(jg_reloc - jg_bytes, jg_offset, jg_bytes); -- cgit From ae24082331d9bbaae283aafbe930a8f0eb85605a Mon Sep 17 00:00:00 2001 From: KP Singh Date: Wed, 4 Mar 2020 20:18:49 +0100 Subject: bpf: Introduce BPF_MODIFY_RETURN When multiple programs are attached, each program receives the return value from the previous program on the stack and the last program provides the return value to the attached function. The fmod_ret bpf programs are run after the fentry programs and before the fexit programs. The original function is only called if all the fmod_ret programs return 0 to avoid any unintended side-effects. The success value, i.e. 0 is not currently configurable but can be made so where user-space can specify it at load time. For example: int func_to_be_attached(int a, int b) { <--- do_fentry do_fmod_ret: if (ret != 0) goto do_fexit; original_function: } <--- do_fexit The fmod_ret program attached to this function can be defined as: SEC("fmod_ret/func_to_be_attached") int BPF_PROG(func_name, int a, int b, int ret) { // This will skip the original function logic. return 1; } The first fmod_ret program is passed 0 in its return argument. Signed-off-by: KP Singh Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Acked-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200304191853.1529-4-kpsingh@chromium.org --- arch/x86/net/bpf_jit_comp.c | 130 ++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 119 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index d6349e930b06..b1fd000feb89 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1362,7 +1362,7 @@ static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args, } static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, - struct bpf_prog *p, int stack_size) + struct bpf_prog *p, int stack_size, bool mod_ret) { u8 *prog = *pprog; int cnt = 0; @@ -1383,6 +1383,13 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, if (emit_call(&prog, p->bpf_func, prog)) return -EINVAL; + /* BPF_TRAMP_MODIFY_RETURN trampolines can modify the return + * of the previous call which is then passed on the stack to + * the next BPF program. + */ + if (mod_ret) + emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); + /* arg1: mov rdi, progs[i] */ emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32, (u32) (long) p); @@ -1442,6 +1449,23 @@ static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond) return 0; } +static int emit_mod_ret_check_imm8(u8 **pprog, int value) +{ + u8 *prog = *pprog; + int cnt = 0; + + if (!is_imm8(value)) + return -EINVAL; + + if (value == 0) + EMIT2(0x85, add_2reg(0xC0, BPF_REG_0, BPF_REG_0)); + else + EMIT3(0x83, add_1reg(0xF8, BPF_REG_0), value); + + *pprog = prog; + return 0; +} + static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, struct bpf_tramp_progs *tp, int stack_size) { @@ -1449,9 +1473,49 @@ static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, u8 *prog = *pprog; for (i = 0; i < tp->nr_progs; i++) { - if (invoke_bpf_prog(m, &prog, tp->progs[i], stack_size)) + if (invoke_bpf_prog(m, &prog, tp->progs[i], stack_size, false)) + return -EINVAL; + } + *pprog = prog; + return 0; +} + +static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog, + struct bpf_tramp_progs *tp, int stack_size, + u8 **branches) +{ + u8 *prog = *pprog; + int i; + + /* The first fmod_ret program will receive a garbage return value. + * Set this to 0 to avoid confusing the program. + */ + emit_mov_imm32(&prog, false, BPF_REG_0, 0); + emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); + for (i = 0; i < tp->nr_progs; i++) { + if (invoke_bpf_prog(m, &prog, tp->progs[i], stack_size, true)) return -EINVAL; + + /* Generate a branch: + * + * if (ret != 0) + * goto do_fexit; + * + * If needed this can be extended to any integer value which can + * be passed by user-space when the program is loaded. + */ + if (emit_mod_ret_check_imm8(&prog, 0)) + return -EINVAL; + + /* Save the location of the branch and Generate 6 nops + * (4 bytes for an offset and 2 bytes for the jump) These nops + * are replaced with a conditional jump once do_fexit (i.e. the + * start of the fexit invocation) is finalized. + */ + branches[i] = prog; + emit_nops(&prog, 4 + 2); } + *pprog = prog; return 0; } @@ -1521,10 +1585,12 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, struct bpf_tramp_progs *tprogs, void *orig_call) { - int cnt = 0, nr_args = m->nr_args; + int ret, i, cnt = 0, nr_args = m->nr_args; int stack_size = nr_args * 8; struct bpf_tramp_progs *fentry = &tprogs[BPF_TRAMP_FENTRY]; struct bpf_tramp_progs *fexit = &tprogs[BPF_TRAMP_FEXIT]; + struct bpf_tramp_progs *fmod_ret = &tprogs[BPF_TRAMP_MODIFY_RETURN]; + u8 **branches = NULL; u8 *prog; /* x86-64 supports up to 6 arguments. 7+ can be added in the future */ @@ -1557,24 +1623,60 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, if (invoke_bpf(m, &prog, fentry, stack_size)) return -EINVAL; + if (fmod_ret->nr_progs) { + branches = kcalloc(fmod_ret->nr_progs, sizeof(u8 *), + GFP_KERNEL); + if (!branches) + return -ENOMEM; + + if (invoke_bpf_mod_ret(m, &prog, fmod_ret, stack_size, + branches)) { + ret = -EINVAL; + goto cleanup; + } + } + if (flags & BPF_TRAMP_F_CALL_ORIG) { - if (fentry->nr_progs) + if (fentry->nr_progs || fmod_ret->nr_progs) restore_regs(m, &prog, nr_args, stack_size); /* call original function */ - if (emit_call(&prog, orig_call, prog)) - return -EINVAL; + if (emit_call(&prog, orig_call, prog)) { + ret = -EINVAL; + goto cleanup; + } /* remember return value in a stack for bpf prog to access */ emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); } + if (fmod_ret->nr_progs) { + /* From Intel 64 and IA-32 Architectures Optimization + * Reference Manual, 3.4.1.4 Code Alignment, Assembly/Compiler + * Coding Rule 11: All branch targets should be 16-byte + * aligned. + */ + emit_align(&prog, 16); + /* Update the branches saved in invoke_bpf_mod_ret with the + * aligned address of do_fexit. + */ + for (i = 0; i < fmod_ret->nr_progs; i++) + emit_cond_near_jump(&branches[i], prog, branches[i], + X86_JNE); + } + if (fexit->nr_progs) - if (invoke_bpf(m, &prog, fexit, stack_size)) - return -EINVAL; + if (invoke_bpf(m, &prog, fexit, stack_size)) { + ret = -EINVAL; + goto cleanup; + } if (flags & BPF_TRAMP_F_RESTORE_REGS) restore_regs(m, &prog, nr_args, stack_size); + /* This needs to be done regardless. If there were fmod_ret programs, + * the return value is only updated on the stack and still needs to be + * restored to R0. + */ if (flags & BPF_TRAMP_F_CALL_ORIG) /* restore original return value back into RAX */ emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8); @@ -1586,9 +1688,15 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */ EMIT1(0xC3); /* ret */ /* Make sure the trampoline generation logic doesn't overflow */ - if (WARN_ON_ONCE(prog > (u8 *)image_end - BPF_INSN_SAFETY)) - return -EFAULT; - return prog - (u8 *)image; + if (WARN_ON_ONCE(prog > (u8 *)image_end - BPF_INSN_SAFETY)) { + ret = -EFAULT; + goto cleanup; + } + ret = prog - (u8 *)image; + +cleanup: + kfree(branches); + return ret; } static int emit_fallback_jump(u8 **pprog) -- cgit From 681ff0181bbfb183e32bc6beb6ec076304470479 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Thu, 5 Mar 2020 10:01:52 -0500 Subject: x86/mm/init/32: Stop printing the virtual memory layout For security reasons, don't display the kernel's virtual memory layout. Kees Cook points out: "These have been entirely removed on other architectures, so let's just do the same for ia32 and remove it unconditionally." 071929dbdd86 ("arm64: Stop printing the virtual memory layout") 1c31d4e96b8c ("ARM: 8820/1: mm: Stop printing the virtual memory layout") 31833332f798 ("m68k/mm: Stop printing the virtual memory layout") fd8d0ca25631 ("parisc: Hide virtual kernel memory layout") adb1fe9ae2ee ("mm/page_alloc: Remove kernel address exposure in free_reserved_area()") Signed-off-by: Arvind Sankar Signed-off-by: Thomas Gleixner Acked-by: Tycho Andersen Acked-by: Kees Cook Link: https://lkml.kernel.org/r/20200305150152.831697-1-nivedita@alum.mit.edu --- arch/x86/mm/init_32.c | 38 -------------------------------------- 1 file changed, 38 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index 23df4885bbed..8ae0272c1c51 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -788,44 +788,6 @@ void __init mem_init(void) x86_init.hyper.init_after_bootmem(); mem_init_print_info(NULL); - printk(KERN_INFO "virtual kernel memory layout:\n" - " fixmap : 0x%08lx - 0x%08lx (%4ld kB)\n" - " cpu_entry : 0x%08lx - 0x%08lx (%4ld kB)\n" -#ifdef CONFIG_HIGHMEM - " pkmap : 0x%08lx - 0x%08lx (%4ld kB)\n" -#endif - " vmalloc : 0x%08lx - 0x%08lx (%4ld MB)\n" - " lowmem : 0x%08lx - 0x%08lx (%4ld MB)\n" - " .init : 0x%08lx - 0x%08lx (%4ld kB)\n" - " .data : 0x%08lx - 0x%08lx (%4ld kB)\n" - " .text : 0x%08lx - 0x%08lx (%4ld kB)\n", - FIXADDR_START, FIXADDR_TOP, - (FIXADDR_TOP - FIXADDR_START) >> 10, - - CPU_ENTRY_AREA_BASE, - CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE, - CPU_ENTRY_AREA_MAP_SIZE >> 10, - -#ifdef CONFIG_HIGHMEM - PKMAP_BASE, PKMAP_BASE+LAST_PKMAP*PAGE_SIZE, - (LAST_PKMAP*PAGE_SIZE) >> 10, -#endif - - VMALLOC_START, VMALLOC_END, - (VMALLOC_END - VMALLOC_START) >> 20, - - (unsigned long)__va(0), (unsigned long)high_memory, - ((unsigned long)high_memory - (unsigned long)__va(0)) >> 20, - - (unsigned long)&__init_begin, (unsigned long)&__init_end, - ((unsigned long)&__init_end - - (unsigned long)&__init_begin) >> 10, - - (unsigned long)&_etext, (unsigned long)&_edata, - ((unsigned long)&_edata - (unsigned long)&_etext) >> 10, - - (unsigned long)&_text, (unsigned long)&_etext, - ((unsigned long)&_etext - (unsigned long)&_text) >> 10); /* * Check boundaries twice: Some fundamental inconsistencies can -- cgit From dc7fc3a53ae158263196b1892b672aedf67796c5 Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Sun, 1 Mar 2020 16:06:56 +0800 Subject: crypto: x86/curve25519 - leave r12 as spare register This updates to the newer register selection proved by HACL*, which leads to a more compact instruction encoding, and saves around 100 cycles. Signed-off-by: Jason A. Donenfeld Signed-off-by: Herbert Xu --- arch/x86/crypto/curve25519-x86_64.c | 110 ++++++++++++++++++------------------ 1 file changed, 55 insertions(+), 55 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/crypto/curve25519-x86_64.c b/arch/x86/crypto/curve25519-x86_64.c index e4e58b8e9afe..8a17621f7d3a 100644 --- a/arch/x86/crypto/curve25519-x86_64.c +++ b/arch/x86/crypto/curve25519-x86_64.c @@ -167,28 +167,28 @@ static inline void fmul(u64 *out, const u64 *f1, const u64 *f2, u64 *tmp) " movq 0(%1), %%rdx;" " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " movq %%r8, 0(%0);" " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " movq %%r10, 8(%0);" - " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" + " mulxq 16(%3), %%rbx, %%r13;" " adox %%r11, %%rbx;" " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " mov $0, %%rax;" " adox %%rdx, %%rax;" /* Compute src1[1] * src2 */ " movq 8(%1), %%rdx;" " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 8(%0), %%r8;" " movq %%r8, 8(%0);" - " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 16(%0);" - " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " mov $0, %%r8;" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%rbx, %%r10;" " movq %%r10, 16(%0);" + " mulxq 16(%3), %%rbx, %%r13;" " adox %%r11, %%rbx;" " adcx %%r14, %%rbx;" " mov $0, %%r8;" " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " mov $0, %%rax;" " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" /* Compute src1[2] * src2 */ " movq 16(%1), %%rdx;" " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 16(%0), %%r8;" " movq %%r8, 16(%0);" - " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 24(%0);" - " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " mov $0, %%r8;" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%rbx, %%r10;" " movq %%r10, 24(%0);" + " mulxq 16(%3), %%rbx, %%r13;" " adox %%r11, %%rbx;" " adcx %%r14, %%rbx;" " mov $0, %%r8;" " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " mov $0, %%rax;" " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" /* Compute src1[3] * src2 */ " movq 24(%1), %%rdx;" " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 24(%0), %%r8;" " movq %%r8, 24(%0);" - " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 32(%0);" - " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " movq %%r12, 40(%0);" " mov $0, %%r8;" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%rbx, %%r10;" " movq %%r10, 32(%0);" + " mulxq 16(%3), %%rbx, %%r13;" " adox %%r11, %%rbx;" " adcx %%r14, %%rbx;" " movq %%rbx, 40(%0);" " mov $0, %%r8;" " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " movq %%r14, 48(%0);" " mov $0, %%rax;" " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" " movq %%rax, 56(%0);" /* Line up pointers */ @@ -202,11 +202,11 @@ static inline void fmul(u64 *out, const u64 *f1, const u64 *f2, u64 *tmp) " mulxq 32(%1), %%r8, %%r13;" " xor %3, %3;" " adoxq 0(%1), %%r8;" - " mulxq 40(%1), %%r9, %%r12;" + " mulxq 40(%1), %%r9, %%rbx;" " adcx %%r13, %%r9;" " adoxq 8(%1), %%r9;" " mulxq 48(%1), %%r10, %%r13;" - " adcx %%r12, %%r10;" + " adcx %%rbx, %%r10;" " adoxq 16(%1), %%r10;" " mulxq 56(%1), %%r11, %%rax;" " adcx %%r13, %%r11;" @@ -231,7 +231,7 @@ static inline void fmul(u64 *out, const u64 *f1, const u64 *f2, u64 *tmp) " movq %%r8, 0(%0);" : "+&r" (tmp), "+&r" (f1), "+&r" (out), "+&r" (f2) : - : "%rax", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "memory", "cc" + : "%rax", "%rdx", "%r8", "%r9", "%r10", "%r11", "%rbx", "%r13", "%r14", "memory", "cc" ); } @@ -248,28 +248,28 @@ static inline void fmul2(u64 *out, const u64 *f1, const u64 *f2, u64 *tmp) " movq 0(%1), %%rdx;" " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " movq %%r8, 0(%0);" " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " movq %%r10, 8(%0);" - " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" + " mulxq 16(%3), %%rbx, %%r13;" " adox %%r11, %%rbx;" " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " mov $0, %%rax;" " adox %%rdx, %%rax;" /* Compute src1[1] * src2 */ " movq 8(%1), %%rdx;" " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 8(%0), %%r8;" " movq %%r8, 8(%0);" - " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 16(%0);" - " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " mov $0, %%r8;" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%rbx, %%r10;" " movq %%r10, 16(%0);" + " mulxq 16(%3), %%rbx, %%r13;" " adox %%r11, %%rbx;" " adcx %%r14, %%rbx;" " mov $0, %%r8;" " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " mov $0, %%rax;" " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" /* Compute src1[2] * src2 */ " movq 16(%1), %%rdx;" " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 16(%0), %%r8;" " movq %%r8, 16(%0);" - " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 24(%0);" - " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " mov $0, %%r8;" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%rbx, %%r10;" " movq %%r10, 24(%0);" + " mulxq 16(%3), %%rbx, %%r13;" " adox %%r11, %%rbx;" " adcx %%r14, %%rbx;" " mov $0, %%r8;" " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " mov $0, %%rax;" " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" /* Compute src1[3] * src2 */ " movq 24(%1), %%rdx;" " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 24(%0), %%r8;" " movq %%r8, 24(%0);" - " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 32(%0);" - " mulxq 16(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " movq %%r12, 40(%0);" " mov $0, %%r8;" + " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%rbx, %%r10;" " movq %%r10, 32(%0);" + " mulxq 16(%3), %%rbx, %%r13;" " adox %%r11, %%rbx;" " adcx %%r14, %%rbx;" " movq %%rbx, 40(%0);" " mov $0, %%r8;" " mulxq 24(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " movq %%r14, 48(%0);" " mov $0, %%rax;" " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" " movq %%rax, 56(%0);" @@ -279,28 +279,28 @@ static inline void fmul2(u64 *out, const u64 *f1, const u64 *f2, u64 *tmp) " movq 32(%1), %%rdx;" " mulxq 32(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " movq %%r8, 64(%0);" " mulxq 40(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " movq %%r10, 72(%0);" - " mulxq 48(%3), %%r12, %%r13;" " adox %%r11, %%r12;" + " mulxq 48(%3), %%rbx, %%r13;" " adox %%r11, %%rbx;" " mulxq 56(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " mov $0, %%rax;" " adox %%rdx, %%rax;" /* Compute src1[1] * src2 */ " movq 40(%1), %%rdx;" " mulxq 32(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 72(%0), %%r8;" " movq %%r8, 72(%0);" - " mulxq 40(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 80(%0);" - " mulxq 48(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " mov $0, %%r8;" + " mulxq 40(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%rbx, %%r10;" " movq %%r10, 80(%0);" + " mulxq 48(%3), %%rbx, %%r13;" " adox %%r11, %%rbx;" " adcx %%r14, %%rbx;" " mov $0, %%r8;" " mulxq 56(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " mov $0, %%rax;" " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" /* Compute src1[2] * src2 */ " movq 48(%1), %%rdx;" " mulxq 32(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 80(%0), %%r8;" " movq %%r8, 80(%0);" - " mulxq 40(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 88(%0);" - " mulxq 48(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " mov $0, %%r8;" + " mulxq 40(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%rbx, %%r10;" " movq %%r10, 88(%0);" + " mulxq 48(%3), %%rbx, %%r13;" " adox %%r11, %%rbx;" " adcx %%r14, %%rbx;" " mov $0, %%r8;" " mulxq 56(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " mov $0, %%rax;" " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" /* Compute src1[3] * src2 */ " movq 56(%1), %%rdx;" " mulxq 32(%3), %%r8, %%r9;" " xor %%r10, %%r10;" " adcxq 88(%0), %%r8;" " movq %%r8, 88(%0);" - " mulxq 40(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%r12, %%r10;" " movq %%r10, 96(%0);" - " mulxq 48(%3), %%r12, %%r13;" " adox %%r11, %%r12;" " adcx %%r14, %%r12;" " movq %%r12, 104(%0);" " mov $0, %%r8;" + " mulxq 40(%3), %%r10, %%r11;" " adox %%r9, %%r10;" " adcx %%rbx, %%r10;" " movq %%r10, 96(%0);" + " mulxq 48(%3), %%rbx, %%r13;" " adox %%r11, %%rbx;" " adcx %%r14, %%rbx;" " movq %%rbx, 104(%0);" " mov $0, %%r8;" " mulxq 56(%3), %%r14, %%rdx;" " adox %%r13, %%r14;" " adcx %%rax, %%r14;" " movq %%r14, 112(%0);" " mov $0, %%rax;" " adox %%rdx, %%rax;" " adcx %%r8, %%rax;" " movq %%rax, 120(%0);" /* Line up pointers */ @@ -314,11 +314,11 @@ static inline void fmul2(u64 *out, const u64 *f1, const u64 *f2, u64 *tmp) " mulxq 32(%1), %%r8, %%r13;" " xor %3, %3;" " adoxq 0(%1), %%r8;" - " mulxq 40(%1), %%r9, %%r12;" + " mulxq 40(%1), %%r9, %%rbx;" " adcx %%r13, %%r9;" " adoxq 8(%1), %%r9;" " mulxq 48(%1), %%r10, %%r13;" - " adcx %%r12, %%r10;" + " adcx %%rbx, %%r10;" " adoxq 16(%1), %%r10;" " mulxq 56(%1), %%r11, %%rax;" " adcx %%r13, %%r11;" @@ -347,11 +347,11 @@ static inline void fmul2(u64 *out, const u64 *f1, const u64 *f2, u64 *tmp) " mulxq 96(%1), %%r8, %%r13;" " xor %3, %3;" " adoxq 64(%1), %%r8;" - " mulxq 104(%1), %%r9, %%r12;" + " mulxq 104(%1), %%r9, %%rbx;" " adcx %%r13, %%r9;" " adoxq 72(%1), %%r9;" " mulxq 112(%1), %%r10, %%r13;" - " adcx %%r12, %%r10;" + " adcx %%rbx, %%r10;" " adoxq 80(%1), %%r10;" " mulxq 120(%1), %%r11, %%rax;" " adcx %%r13, %%r11;" @@ -376,7 +376,7 @@ static inline void fmul2(u64 *out, const u64 *f1, const u64 *f2, u64 *tmp) " movq %%r8, 32(%0);" : "+&r" (tmp), "+&r" (f1), "+&r" (out), "+&r" (f2) : - : "%rax", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "memory", "cc" + : "%rax", "%rdx", "%r8", "%r9", "%r10", "%r11", "%rbx", "%r13", "%r14", "memory", "cc" ); } @@ -388,11 +388,11 @@ static inline void fmul_scalar(u64 *out, const u64 *f1, u64 f2) asm volatile( /* Compute the raw multiplication of f1*f2 */ " mulxq 0(%2), %%r8, %%rcx;" /* f1[0]*f2 */ - " mulxq 8(%2), %%r9, %%r12;" /* f1[1]*f2 */ + " mulxq 8(%2), %%r9, %%rbx;" /* f1[1]*f2 */ " add %%rcx, %%r9;" " mov $0, %%rcx;" " mulxq 16(%2), %%r10, %%r13;" /* f1[2]*f2 */ - " adcx %%r12, %%r10;" + " adcx %%rbx, %%r10;" " mulxq 24(%2), %%r11, %%rax;" /* f1[3]*f2 */ " adcx %%r13, %%r11;" " adcx %%rcx, %%rax;" @@ -419,7 +419,7 @@ static inline void fmul_scalar(u64 *out, const u64 *f1, u64 f2) " movq %%r8, 0(%1);" : "+&r" (f2_r) : "r" (out), "r" (f1) - : "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "memory", "cc" + : "%rax", "%rcx", "%r8", "%r9", "%r10", "%r11", "%rbx", "%r13", "memory", "cc" ); } @@ -520,8 +520,8 @@ static inline void fsqr(u64 *out, const u64 *f, u64 *tmp) " mulxq 16(%1), %%r9, %%r10;" " adcx %%r14, %%r9;" /* f[2]*f[0] */ " mulxq 24(%1), %%rax, %%rcx;" " adcx %%rax, %%r10;" /* f[3]*f[0] */ " movq 24(%1), %%rdx;" /* f[3] */ - " mulxq 8(%1), %%r11, %%r12;" " adcx %%rcx, %%r11;" /* f[1]*f[3] */ - " mulxq 16(%1), %%rax, %%r13;" " adcx %%rax, %%r12;" /* f[2]*f[3] */ + " mulxq 8(%1), %%r11, %%rbx;" " adcx %%rcx, %%r11;" /* f[1]*f[3] */ + " mulxq 16(%1), %%rax, %%r13;" " adcx %%rax, %%rbx;" /* f[2]*f[3] */ " movq 8(%1), %%rdx;" " adcx %%r15, %%r13;" /* f1 */ " mulxq 16(%1), %%rax, %%rcx;" " mov $0, %%r14;" /* f[2]*f[1] */ @@ -531,12 +531,12 @@ static inline void fsqr(u64 *out, const u64 *f, u64 *tmp) " adcx %%r8, %%r8;" " adox %%rcx, %%r11;" " adcx %%r9, %%r9;" - " adox %%r15, %%r12;" + " adox %%r15, %%rbx;" " adcx %%r10, %%r10;" " adox %%r15, %%r13;" " adcx %%r11, %%r11;" " adox %%r15, %%r14;" - " adcx %%r12, %%r12;" + " adcx %%rbx, %%rbx;" " adcx %%r13, %%r13;" " adcx %%r14, %%r14;" @@ -549,7 +549,7 @@ static inline void fsqr(u64 *out, const u64 *f, u64 *tmp) " adcx %%rcx, %%r10;" " movq %%r10, 24(%0);" " movq 16(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[2]^2 */ " adcx %%rax, %%r11;" " movq %%r11, 32(%0);" - " adcx %%rcx, %%r12;" " movq %%r12, 40(%0);" + " adcx %%rcx, %%rbx;" " movq %%rbx, 40(%0);" " movq 24(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[3]^2 */ " adcx %%rax, %%r13;" " movq %%r13, 48(%0);" " adcx %%rcx, %%r14;" " movq %%r14, 56(%0);" @@ -565,11 +565,11 @@ static inline void fsqr(u64 *out, const u64 *f, u64 *tmp) " mulxq 32(%1), %%r8, %%r13;" " xor %%rcx, %%rcx;" " adoxq 0(%1), %%r8;" - " mulxq 40(%1), %%r9, %%r12;" + " mulxq 40(%1), %%r9, %%rbx;" " adcx %%r13, %%r9;" " adoxq 8(%1), %%r9;" " mulxq 48(%1), %%r10, %%r13;" - " adcx %%r12, %%r10;" + " adcx %%rbx, %%r10;" " adoxq 16(%1), %%r10;" " mulxq 56(%1), %%r11, %%rax;" " adcx %%r13, %%r11;" @@ -594,7 +594,7 @@ static inline void fsqr(u64 *out, const u64 *f, u64 *tmp) " movq %%r8, 0(%0);" : "+&r" (tmp), "+&r" (f), "+&r" (out) : - : "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "memory", "cc" + : "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%rbx", "%r13", "%r14", "%r15", "memory", "cc" ); } @@ -611,8 +611,8 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) " mulxq 16(%1), %%r9, %%r10;" " adcx %%r14, %%r9;" /* f[2]*f[0] */ " mulxq 24(%1), %%rax, %%rcx;" " adcx %%rax, %%r10;" /* f[3]*f[0] */ " movq 24(%1), %%rdx;" /* f[3] */ - " mulxq 8(%1), %%r11, %%r12;" " adcx %%rcx, %%r11;" /* f[1]*f[3] */ - " mulxq 16(%1), %%rax, %%r13;" " adcx %%rax, %%r12;" /* f[2]*f[3] */ + " mulxq 8(%1), %%r11, %%rbx;" " adcx %%rcx, %%r11;" /* f[1]*f[3] */ + " mulxq 16(%1), %%rax, %%r13;" " adcx %%rax, %%rbx;" /* f[2]*f[3] */ " movq 8(%1), %%rdx;" " adcx %%r15, %%r13;" /* f1 */ " mulxq 16(%1), %%rax, %%rcx;" " mov $0, %%r14;" /* f[2]*f[1] */ @@ -622,12 +622,12 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) " adcx %%r8, %%r8;" " adox %%rcx, %%r11;" " adcx %%r9, %%r9;" - " adox %%r15, %%r12;" + " adox %%r15, %%rbx;" " adcx %%r10, %%r10;" " adox %%r15, %%r13;" " adcx %%r11, %%r11;" " adox %%r15, %%r14;" - " adcx %%r12, %%r12;" + " adcx %%rbx, %%rbx;" " adcx %%r13, %%r13;" " adcx %%r14, %%r14;" @@ -640,7 +640,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) " adcx %%rcx, %%r10;" " movq %%r10, 24(%0);" " movq 16(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[2]^2 */ " adcx %%rax, %%r11;" " movq %%r11, 32(%0);" - " adcx %%rcx, %%r12;" " movq %%r12, 40(%0);" + " adcx %%rcx, %%rbx;" " movq %%rbx, 40(%0);" " movq 24(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[3]^2 */ " adcx %%rax, %%r13;" " movq %%r13, 48(%0);" " adcx %%rcx, %%r14;" " movq %%r14, 56(%0);" @@ -651,8 +651,8 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) " mulxq 48(%1), %%r9, %%r10;" " adcx %%r14, %%r9;" /* f[2]*f[0] */ " mulxq 56(%1), %%rax, %%rcx;" " adcx %%rax, %%r10;" /* f[3]*f[0] */ " movq 56(%1), %%rdx;" /* f[3] */ - " mulxq 40(%1), %%r11, %%r12;" " adcx %%rcx, %%r11;" /* f[1]*f[3] */ - " mulxq 48(%1), %%rax, %%r13;" " adcx %%rax, %%r12;" /* f[2]*f[3] */ + " mulxq 40(%1), %%r11, %%rbx;" " adcx %%rcx, %%r11;" /* f[1]*f[3] */ + " mulxq 48(%1), %%rax, %%r13;" " adcx %%rax, %%rbx;" /* f[2]*f[3] */ " movq 40(%1), %%rdx;" " adcx %%r15, %%r13;" /* f1 */ " mulxq 48(%1), %%rax, %%rcx;" " mov $0, %%r14;" /* f[2]*f[1] */ @@ -662,12 +662,12 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) " adcx %%r8, %%r8;" " adox %%rcx, %%r11;" " adcx %%r9, %%r9;" - " adox %%r15, %%r12;" + " adox %%r15, %%rbx;" " adcx %%r10, %%r10;" " adox %%r15, %%r13;" " adcx %%r11, %%r11;" " adox %%r15, %%r14;" - " adcx %%r12, %%r12;" + " adcx %%rbx, %%rbx;" " adcx %%r13, %%r13;" " adcx %%r14, %%r14;" @@ -680,7 +680,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) " adcx %%rcx, %%r10;" " movq %%r10, 88(%0);" " movq 48(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[2]^2 */ " adcx %%rax, %%r11;" " movq %%r11, 96(%0);" - " adcx %%rcx, %%r12;" " movq %%r12, 104(%0);" + " adcx %%rcx, %%rbx;" " movq %%rbx, 104(%0);" " movq 56(%1), %%rdx;" " mulx %%rdx, %%rax, %%rcx;" /* f[3]^2 */ " adcx %%rax, %%r13;" " movq %%r13, 112(%0);" " adcx %%rcx, %%r14;" " movq %%r14, 120(%0);" @@ -694,11 +694,11 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) " mulxq 32(%1), %%r8, %%r13;" " xor %%rcx, %%rcx;" " adoxq 0(%1), %%r8;" - " mulxq 40(%1), %%r9, %%r12;" + " mulxq 40(%1), %%r9, %%rbx;" " adcx %%r13, %%r9;" " adoxq 8(%1), %%r9;" " mulxq 48(%1), %%r10, %%r13;" - " adcx %%r12, %%r10;" + " adcx %%rbx, %%r10;" " adoxq 16(%1), %%r10;" " mulxq 56(%1), %%r11, %%rax;" " adcx %%r13, %%r11;" @@ -727,11 +727,11 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) " mulxq 96(%1), %%r8, %%r13;" " xor %%rcx, %%rcx;" " adoxq 64(%1), %%r8;" - " mulxq 104(%1), %%r9, %%r12;" + " mulxq 104(%1), %%r9, %%rbx;" " adcx %%r13, %%r9;" " adoxq 72(%1), %%r9;" " mulxq 112(%1), %%r10, %%r13;" - " adcx %%r12, %%r10;" + " adcx %%rbx, %%r10;" " adoxq 80(%1), %%r10;" " mulxq 120(%1), %%r11, %%rax;" " adcx %%r13, %%r11;" @@ -756,7 +756,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) " movq %%r8, 32(%0);" : "+&r" (tmp), "+&r" (f), "+&r" (out) : - : "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "memory", "cc" + : "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%rbx", "%r13", "%r14", "%r15", "memory", "cc" ); } -- cgit From 80f1f85036355e5581ec0b99913410345ad3491b Mon Sep 17 00:00:00 2001 From: Luke Nelson Date: Thu, 5 Mar 2020 15:44:12 -0800 Subject: bpf, x32: Fix bug with JMP32 JSET BPF_X checking upper bits The current x32 BPF JIT is incorrect for JMP32 JSET BPF_X when the upper 32 bits of operand registers are non-zero in certain situations. The problem is in the following code: case BPF_JMP | BPF_JSET | BPF_X: case BPF_JMP32 | BPF_JSET | BPF_X: ... /* and dreg_lo,sreg_lo */ EMIT2(0x23, add_2reg(0xC0, sreg_lo, dreg_lo)); /* and dreg_hi,sreg_hi */ EMIT2(0x23, add_2reg(0xC0, sreg_hi, dreg_hi)); /* or dreg_lo,dreg_hi */ EMIT2(0x09, add_2reg(0xC0, dreg_lo, dreg_hi)); This code checks the upper bits of the operand registers regardless if the BPF instruction is BPF_JMP32 or BPF_JMP64. Registers dreg_hi and dreg_lo are not loaded from the stack for BPF_JMP32, however, they can still be polluted with values from previous instructions. The following BPF program demonstrates the bug. The jset64 instruction loads the temporary registers and performs the jump, since ((u64)r7 & (u64)r8) is non-zero. The jset32 should _not_ be taken, as the lower 32 bits are all zero, however, the current JIT will take the branch due the pollution of temporary registers from the earlier jset64. mov64 r0, 0 ld64 r7, 0x8000000000000000 ld64 r8, 0x8000000000000000 jset64 r7, r8, 1 exit jset32 r7, r8, 1 mov64 r0, 2 exit The expected return value of this program is 2; under the buggy x32 JIT it returns 0. The fix is to skip using the upper 32 bits for jset32 and compare the upper 32 bits for jset64 only. All tests in test_bpf.ko and selftests/bpf/test_verifier continue to pass with this change. We found this bug using our automated verification tool, Serval. Fixes: 69f827eb6e14 ("x32: bpf: implement jitting of JMP32") Co-developed-by: Xi Wang Signed-off-by: Xi Wang Signed-off-by: Luke Nelson Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200305234416.31597-1-luke.r.nels@gmail.com --- arch/x86/net/bpf_jit_comp32.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c index 393d251798c0..4d2a7a764602 100644 --- a/arch/x86/net/bpf_jit_comp32.c +++ b/arch/x86/net/bpf_jit_comp32.c @@ -2039,10 +2039,12 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, } /* and dreg_lo,sreg_lo */ EMIT2(0x23, add_2reg(0xC0, sreg_lo, dreg_lo)); - /* and dreg_hi,sreg_hi */ - EMIT2(0x23, add_2reg(0xC0, sreg_hi, dreg_hi)); - /* or dreg_lo,dreg_hi */ - EMIT2(0x09, add_2reg(0xC0, dreg_lo, dreg_hi)); + if (is_jmp64) { + /* and dreg_hi,sreg_hi */ + EMIT2(0x23, add_2reg(0xC0, sreg_hi, dreg_hi)); + /* or dreg_lo,dreg_hi */ + EMIT2(0x09, add_2reg(0xC0, dreg_lo, dreg_hi)); + } goto emit_cond_jmp; } case BPF_JMP | BPF_JSET | BPF_K: -- cgit From 3cdcd6899eaf454b2539c624fff5daf63c175a7a Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 8 Mar 2020 09:08:40 +0100 Subject: efi/x86: Annotate the LOADED_IMAGE_PROTOCOL_GUID with SYM_DATA Use SYM_DATA*() macros to annotate this constant, and explicitly align it to 4-byte boundary. Use lower-case for hexadecimal data. Signed-off-by: Arvind Sankar Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200301230436.2246909-2-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-10-ardb@kernel.org --- arch/x86/boot/compressed/head_64.S | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index fcf8feaa57ea..8105e8348607 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -672,7 +672,7 @@ SYM_FUNC_START(efi32_pe_entry) /* Get the loaded image protocol pointer from the image handle */ subl $12, %esp // space for the loaded image pointer pushl %esp // pass its address - leal 4f(%ebp), %eax + leal loaded_image_proto(%ebp), %eax pushl %eax // pass the GUID address pushl 28(%esp) // pass the image handle @@ -695,9 +695,12 @@ SYM_FUNC_END(efi32_pe_entry) .section ".rodata" /* EFI loaded image protocol GUID */ -4: .long 0x5B1B31A1 + .balign 4 +SYM_DATA_START_LOCAL(loaded_image_proto) + .long 0x5b1b31a1 .word 0x9562, 0x11d2 - .byte 0x8E, 0x3F, 0x00, 0xA0, 0xC9, 0x69, 0x72, 0x3B + .byte 0x8e, 0x3f, 0x00, 0xa0, 0xc9, 0x69, 0x72, 0x3b +SYM_DATA_END(loaded_image_proto) #endif /* -- cgit From 71ff44ac6cfaa1d8d3dff6c73636f4fb97b2be10 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 8 Mar 2020 09:08:41 +0100 Subject: efi/x86: Respect 32-bit ABI in efi32_pe_entry() verify_cpu() clobbers BX and DI. In case we have to return error, we need to preserve them to respect the 32-bit calling convention. Signed-off-by: Arvind Sankar Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200301230436.2246909-3-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-11-ardb@kernel.org --- arch/x86/boot/compressed/head_64.S | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index 8105e8348607..920daf62dac2 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -660,7 +660,11 @@ SYM_DATA(efi_is64, .byte 1) SYM_FUNC_START(efi32_pe_entry) pushl %ebp + pushl %ebx + pushl %edi call verify_cpu // check for long mode support + popl %edi + popl %ebx testl %eax, %eax movl $0x80000003, %eax // EFI_UNSUPPORTED jnz 3f -- cgit From 3fab43318f0565ffb45925bea5903bc400ce2864 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 8 Mar 2020 09:08:42 +0100 Subject: efi/x86: Make efi32_pe_entry() more readable Set up a proper frame pointer in efi32_pe_entry() so that it's easier to calculate offsets for arguments. Signed-off-by: Arvind Sankar Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200301230436.2246909-4-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-12-ardb@kernel.org --- arch/x86/boot/compressed/head_64.S | 57 ++++++++++++++++++++++++++------------ 1 file changed, 40 insertions(+), 17 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index 920daf62dac2..fabbd4c2e9f2 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -658,42 +658,65 @@ SYM_DATA(efi_is64, .byte 1) .text .code32 SYM_FUNC_START(efi32_pe_entry) +/* + * efi_status_t efi32_pe_entry(efi_handle_t image_handle, + * efi_system_table_32_t *sys_table) + */ + pushl %ebp + movl %esp, %ebp + pushl %eax // dummy push to allocate loaded_image - pushl %ebx + pushl %ebx // save callee-save registers pushl %edi + call verify_cpu // check for long mode support - popl %edi - popl %ebx testl %eax, %eax movl $0x80000003, %eax // EFI_UNSUPPORTED - jnz 3f + jnz 2f call 1f -1: pop %ebp - subl $1b, %ebp +1: pop %ebx + subl $1b, %ebx /* Get the loaded image protocol pointer from the image handle */ - subl $12, %esp // space for the loaded image pointer - pushl %esp // pass its address - leal loaded_image_proto(%ebp), %eax + leal -4(%ebp), %eax + pushl %eax // &loaded_image + leal loaded_image_proto(%ebx), %eax pushl %eax // pass the GUID address - pushl 28(%esp) // pass the image handle + pushl 8(%ebp) // pass the image handle - movl 36(%esp), %eax // sys_table + /* + * Note the alignment of the stack frame. + * sys_table + * handle <-- 16-byte aligned on entry by ABI + * return address + * frame pointer + * loaded_image <-- local variable + * saved %ebx <-- 16-byte aligned here + * saved %edi + * &loaded_image + * &loaded_image_proto + * handle <-- 16-byte aligned for call to handle_protocol + */ + + movl 12(%ebp), %eax // sys_table movl ST32_boottime(%eax), %eax // sys_table->boottime call *BS32_handle_protocol(%eax) // sys_table->boottime->handle_protocol - cmp $0, %eax + addl $12, %esp // restore argument space + testl %eax, %eax jnz 2f - movl 32(%esp), %ecx // image_handle - movl 36(%esp), %edx // sys_table - movl 12(%esp), %esi // loaded_image + movl 8(%ebp), %ecx // image_handle + movl 12(%ebp), %edx // sys_table + movl -4(%ebp), %esi // loaded_image movl LI32_image_base(%esi), %esi // loaded_image->image_base + movl %ebx, %ebp // startup_32 for efi32_pe_stub_entry jmp efi32_pe_stub_entry -2: addl $24, %esp -3: popl %ebp +2: popl %edi // restore callee-save registers + popl %ebx + leave ret SYM_FUNC_END(efi32_pe_entry) -- cgit From 8acf63efa1712fa5495425a4224378bb3e1231e0 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 8 Mar 2020 09:08:43 +0100 Subject: efi/x86: Avoid using code32_start code32_start is meant for 16-bit real-mode bootloaders to inform the kernel where the 32-bit protected mode code starts. Nothing in the protected mode kernel except the EFI stub uses it. efi_main() currently returns boot_params, with code32_start set inside it to tell efi_stub_entry() where startup_32 is located. Since it was invoked by efi_stub_entry() in the first place, boot_params is already known. Return the address of startup_32 instead. This will allow a 64-bit kernel to live above 4Gb, for example, and it's cleaner as well. Signed-off-by: Arvind Sankar Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200301230436.2246909-5-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-13-ardb@kernel.org --- arch/x86/boot/compressed/head_32.S | 3 +-- arch/x86/boot/compressed/head_64.S | 4 ++-- arch/x86/kernel/asm-offsets.c | 1 - 3 files changed, 3 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S index 2f8138b71ea9..e013bdc1237b 100644 --- a/arch/x86/boot/compressed/head_32.S +++ b/arch/x86/boot/compressed/head_32.S @@ -156,9 +156,8 @@ SYM_FUNC_END(startup_32) SYM_FUNC_START(efi32_stub_entry) SYM_FUNC_START_ALIAS(efi_stub_entry) add $0x4, %esp + movl 8(%esp), %esi /* save boot_params pointer */ call efi_main - movl %eax, %esi - movl BP_code32_start(%esi), %eax leal startup_32(%eax), %eax jmp *%eax SYM_FUNC_END(efi32_stub_entry) diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index fabbd4c2e9f2..6a4ff919008c 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -472,9 +472,9 @@ SYM_CODE_END(startup_64) SYM_FUNC_START(efi64_stub_entry) SYM_FUNC_START_ALIAS(efi_stub_entry) and $~0xf, %rsp /* realign the stack */ + movq %rdx, %rbx /* save boot_params pointer */ call efi_main - movq %rax,%rsi - movl BP_code32_start(%esi), %eax + movq %rbx,%rsi leaq startup_64(%rax), %rax jmp *%rax SYM_FUNC_END(efi64_stub_entry) diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c index 5c7ee3df4d0b..3ca07ad552ae 100644 --- a/arch/x86/kernel/asm-offsets.c +++ b/arch/x86/kernel/asm-offsets.c @@ -88,7 +88,6 @@ static void __used common(void) OFFSET(BP_kernel_alignment, boot_params, hdr.kernel_alignment); OFFSET(BP_init_size, boot_params, hdr.init_size); OFFSET(BP_pref_address, boot_params, hdr.pref_address); - OFFSET(BP_code32_start, boot_params, hdr.code32_start); BLANK(); DEFINE(PTREGS_SIZE, sizeof(struct pt_regs)); -- cgit From 81a34892c2c7c809f9c4e22c5ac936ae673fb9a2 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 8 Mar 2020 09:08:44 +0100 Subject: x86/boot: Use unsigned comparison for addresses The load address is compared with LOAD_PHYSICAL_ADDR using a signed comparison currently (using jge instruction). When loading a 64-bit kernel using the new efi32_pe_entry() point added by: 97aa276579b2 ("efi/x86: Add true mixed mode entry point into .compat section") using Qemu with -m 3072, the firmware actually loads us above 2Gb, resulting in a very early crash. Use the JAE instruction to perform a unsigned comparison instead, as physical addresses should be considered unsigned. Signed-off-by: Arvind Sankar Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200301230436.2246909-6-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-14-ardb@kernel.org --- arch/x86/boot/compressed/head_32.S | 2 +- arch/x86/boot/compressed/head_64.S | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S index e013bdc1237b..46bbe7ab4adf 100644 --- a/arch/x86/boot/compressed/head_32.S +++ b/arch/x86/boot/compressed/head_32.S @@ -105,7 +105,7 @@ SYM_FUNC_START(startup_32) notl %eax andl %eax, %ebx cmpl $LOAD_PHYSICAL_ADDR, %ebx - jge 1f + jae 1f #endif movl $LOAD_PHYSICAL_ADDR, %ebx 1: diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index 6a4ff919008c..5d8338a693ce 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -105,7 +105,7 @@ SYM_FUNC_START(startup_32) notl %eax andl %eax, %ebx cmpl $LOAD_PHYSICAL_ADDR, %ebx - jge 1f + jae 1f #endif movl $LOAD_PHYSICAL_ADDR, %ebx 1: @@ -305,7 +305,7 @@ SYM_CODE_START(startup_64) notq %rax andq %rax, %rbp cmpq $LOAD_PHYSICAL_ADDR, %rbp - jge 1f + jae 1f #endif movq $LOAD_PHYSICAL_ADDR, %rbp 1: -- cgit From 8ef44be393113dca5cece65bc142ebb8ef013af0 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 8 Mar 2020 09:08:46 +0100 Subject: x86/boot/compressed/32: Save the output address instead of recalculating it In preparation for being able to decompress into a buffer starting at a different address than startup_32, save the calculated output address instead of recalculating it later. We now keep track of three addresses: %edx: startup_32 as we were loaded by bootloader %ebx: new location of compressed kernel %ebp: start of decompression buffer Signed-off-by: Arvind Sankar Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200303221205.4048668-2-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-16-ardb@kernel.org --- arch/x86/boot/compressed/head_32.S | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S index 46bbe7ab4adf..894182500606 100644 --- a/arch/x86/boot/compressed/head_32.S +++ b/arch/x86/boot/compressed/head_32.S @@ -75,11 +75,11 @@ SYM_FUNC_START(startup_32) */ leal (BP_scratch+4)(%esi), %esp call 1f -1: popl %ebp - subl $1b, %ebp +1: popl %edx + subl $1b, %edx /* Load new GDT */ - leal gdt(%ebp), %eax + leal gdt(%edx), %eax movl %eax, 2(%eax) lgdt (%eax) @@ -92,13 +92,14 @@ SYM_FUNC_START(startup_32) movl %eax, %ss /* - * %ebp contains the address we are loaded at by the boot loader and %ebx + * %edx contains the address we are loaded at by the boot loader and %ebx * contains the address where we should move the kernel image temporarily - * for safe in-place decompression. + * for safe in-place decompression. %ebp contains the address that the kernel + * will be decompressed to. */ #ifdef CONFIG_RELOCATABLE - movl %ebp, %ebx + movl %edx, %ebx movl BP_kernel_alignment(%esi), %eax decl %eax addl %eax, %ebx @@ -110,10 +111,10 @@ SYM_FUNC_START(startup_32) movl $LOAD_PHYSICAL_ADDR, %ebx 1: + movl %ebx, %ebp // Save the output address for later /* Target address to relocate to for decompression */ - movl BP_init_size(%esi), %eax - subl $_end, %eax - addl %eax, %ebx + addl BP_init_size(%esi), %ebx + subl $_end, %ebx /* Set up the stack */ leal boot_stack_end(%ebx), %esp @@ -127,7 +128,7 @@ SYM_FUNC_START(startup_32) * where decompression in place becomes safe. */ pushl %esi - leal (_bss-4)(%ebp), %esi + leal (_bss-4)(%edx), %esi leal (_bss-4)(%ebx), %edi movl $(_bss - startup_32), %ecx shrl $2, %ecx @@ -196,9 +197,7 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated) /* push arguments for extract_kernel: */ pushl $z_output_len /* decompressed length, end of relocs */ - leal _end(%ebx), %eax - subl BP_init_size(%esi), %eax - pushl %eax /* output address */ + pushl %ebp /* output address */ pushl $z_input_len /* input_len */ leal input_data(%ebx), %eax -- cgit From 1887c9b653f99577c0f8ec413b0921a32b6129e2 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 8 Mar 2020 09:08:47 +0100 Subject: efi/x86: Decompress at start of PE image load address When booted via PE loader, define image_offset to hold the offset of startup_32() from the start of the PE image, and use it as the start of the decompression buffer. [ mingo: Fixed the grammar in the comments. ] Signed-off-by: Arvind Sankar Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200303221205.4048668-3-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-17-ardb@kernel.org --- arch/x86/boot/compressed/head_32.S | 17 +++++++++++++++ arch/x86/boot/compressed/head_64.S | 42 +++++++++++++++++++++++++++++++++++--- 2 files changed, 56 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S index 894182500606..ab3307036ba4 100644 --- a/arch/x86/boot/compressed/head_32.S +++ b/arch/x86/boot/compressed/head_32.S @@ -100,6 +100,19 @@ SYM_FUNC_START(startup_32) #ifdef CONFIG_RELOCATABLE movl %edx, %ebx + +#ifdef CONFIG_EFI_STUB +/* + * If we were loaded via the EFI LoadImage service, startup_32() will be at an + * offset to the start of the space allocated for the image. efi_pe_entry() will + * set up image_offset to tell us where the image actually starts, so that we + * can use the full available buffer. + * image_offset = startup_32 - image_base + * Otherwise image_offset will be zero and has no effect on the calculations. + */ + subl image_offset(%edx), %ebx +#endif + movl BP_kernel_alignment(%esi), %eax decl %eax addl %eax, %ebx @@ -226,6 +239,10 @@ SYM_DATA_START_LOCAL(gdt) .quad 0x00cf92000000ffff /* __KERNEL_DS */ SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end) +#ifdef CONFIG_EFI_STUB +SYM_DATA(image_offset, .long 0) +#endif + /* * Stack and heap for uncompression */ diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index 5d8338a693ce..d4657d38e884 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -99,6 +99,19 @@ SYM_FUNC_START(startup_32) #ifdef CONFIG_RELOCATABLE movl %ebp, %ebx + +#ifdef CONFIG_EFI_STUB +/* + * If we were loaded via the EFI LoadImage service, startup_32 will be at an + * offset to the start of the space allocated for the image. efi_pe_entry will + * set up image_offset to tell us where the image actually starts, so that we + * can use the full available buffer. + * image_offset = startup_32 - image_base + * Otherwise image_offset will be zero and has no effect on the calculations. + */ + subl image_offset(%ebp), %ebx +#endif + movl BP_kernel_alignment(%esi), %eax decl %eax addl %eax, %ebx @@ -111,9 +124,8 @@ SYM_FUNC_START(startup_32) 1: /* Target address to relocate to for decompression */ - movl BP_init_size(%esi), %eax - subl $_end, %eax - addl %eax, %ebx + addl BP_init_size(%esi), %ebx + subl $_end, %ebx /* * Prepare for entering 64 bit mode @@ -299,6 +311,20 @@ SYM_CODE_START(startup_64) /* Start with the delta to where the kernel will run at. */ #ifdef CONFIG_RELOCATABLE leaq startup_32(%rip) /* - $startup_32 */, %rbp + +#ifdef CONFIG_EFI_STUB +/* + * If we were loaded via the EFI LoadImage service, startup_32 will be at an + * offset to the start of the space allocated for the image. efi_pe_entry will + * set up image_offset to tell us where the image actually starts, so that we + * can use the full available buffer. + * image_offset = startup_32 - image_base + * Otherwise image_offset will be zero and has no effect on the calculations. + */ + movl image_offset(%rip), %eax + subq %rax, %rbp +#endif + movl BP_kernel_alignment(%rsi), %eax decl %eax addq %rax, %rbp @@ -647,6 +673,10 @@ SYM_DATA_START_LOCAL(gdt) .quad 0x0000000000000000 /* TS continued */ SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end) +#ifdef CONFIG_EFI_STUB +SYM_DATA(image_offset, .long 0) +#endif + #ifdef CONFIG_EFI_MIXED SYM_DATA_LOCAL(efi32_boot_args, .long 0, 0, 0) SYM_DATA(efi_is64, .byte 1) @@ -712,6 +742,12 @@ SYM_FUNC_START(efi32_pe_entry) movl -4(%ebp), %esi // loaded_image movl LI32_image_base(%esi), %esi // loaded_image->image_base movl %ebx, %ebp // startup_32 for efi32_pe_stub_entry + /* + * We need to set the image_offset variable here since startup_32() will + * use it before we get to the 64-bit efi_pe_entry() in C code. + */ + subl %esi, %ebx + movl %ebx, image_offset(%ebp) // save image_offset jmp efi32_pe_stub_entry 2: popl %edi // restore callee-save registers -- cgit From 26725192c46e1e543ed86a06823fa591cd6faf58 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 8 Mar 2020 09:08:48 +0100 Subject: efi/x86: Add kernel preferred address to PE header Store the kernel's link address as ImageBase in the PE header. Note that the PE specification requires the ImageBase to be 64k aligned. The preferred address should almost always satisfy that, except for 32-bit kernel if the configuration has been customized. Signed-off-by: Arvind Sankar Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200303221205.4048668-4-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-18-ardb@kernel.org --- arch/x86/boot/header.S | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S index 4ee25e28996f..735ad7f21ab0 100644 --- a/arch/x86/boot/header.S +++ b/arch/x86/boot/header.S @@ -138,10 +138,12 @@ optional_header: #endif extra_header_fields: + # PE specification requires ImageBase to be 64k aligned + .set image_base, (LOAD_PHYSICAL_ADDR + 0xffff) & ~0xffff #ifdef CONFIG_X86_32 - .long 0 # ImageBase + .long image_base # ImageBase #else - .quad 0 # ImageBase + .quad image_base # ImageBase #endif .long 0x20 # SectionAlignment .long 0x20 # FileAlignment -- cgit From 964124a97b973555c475423fca0fceafdde01a17 Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 8 Mar 2020 09:08:49 +0100 Subject: efi/x86: Remove extra headroom for setup block The following commit: 223e3ee56f77 ("efi/x86: add headroom to decompressor BSS to account for setup block") added headroom to the PE image to account for the setup block, which wasn't used for the decompression buffer. Now that the decompression buffer is located at the start of the image, and includes the setup block, this is no longer required. Add a check to make sure that the head section of the compressed kernel won't overwrite itself while relocating. This is only for future-proofing as with current limits on the setup and the actual size of the head section, this can never happen. Signed-off-by: Arvind Sankar Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200303221205.4048668-5-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-19-ardb@kernel.org --- arch/x86/boot/tools/build.c | 28 ++++++++++++++++++++++++++-- 1 file changed, 26 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/tools/build.c b/arch/x86/boot/tools/build.c index 90d403dfec80..8cac5b6103db 100644 --- a/arch/x86/boot/tools/build.c +++ b/arch/x86/boot/tools/build.c @@ -65,6 +65,8 @@ unsigned long efi_pe_entry; unsigned long efi32_pe_entry; unsigned long kernel_info; unsigned long startup_64; +unsigned long _ehead; +unsigned long _end; /*----------------------------------------------------------------------*/ @@ -232,7 +234,7 @@ static void update_pecoff_text(unsigned int text_start, unsigned int file_sz, { unsigned int pe_header; unsigned int text_sz = file_sz - text_start; - unsigned int bss_sz = init_sz + text_start - file_sz; + unsigned int bss_sz = init_sz - file_sz; pe_header = get_unaligned_le32(&buf[0x3c]); @@ -259,7 +261,7 @@ static void update_pecoff_text(unsigned int text_start, unsigned int file_sz, put_unaligned_le32(file_sz - 512 + bss_sz, &buf[pe_header + 0x1c]); /* Size of image */ - put_unaligned_le32(init_sz + text_start, &buf[pe_header + 0x50]); + put_unaligned_le32(init_sz, &buf[pe_header + 0x50]); /* * Address of entry point for PE/COFF executable @@ -360,6 +362,8 @@ static void parse_zoffset(char *fname) PARSE_ZOFS(p, efi32_pe_entry); PARSE_ZOFS(p, kernel_info); PARSE_ZOFS(p, startup_64); + PARSE_ZOFS(p, _ehead); + PARSE_ZOFS(p, _end); p = strchr(p, '\n'); while (p && (*p == '\r' || *p == '\n')) @@ -444,6 +448,26 @@ int main(int argc, char ** argv) put_unaligned_le32(sys_size, &buf[0x1f4]); init_sz = get_unaligned_le32(&buf[0x260]); +#ifdef CONFIG_EFI_STUB + /* + * The decompression buffer will start at ImageBase. When relocating + * the compressed kernel to its end, we must ensure that the head + * section does not get overwritten. The head section occupies + * [i, i + _ehead), and the destination is [init_sz - _end, init_sz). + * + * At present these should never overlap, because 'i' is at most 32k + * because of SETUP_SECT_MAX, '_ehead' is less than 1k, and the + * calculation of INIT_SIZE in boot/header.S ensures that + * 'init_sz - _end' is at least 64k. + * + * For future-proofing, increase init_sz if necessary. + */ + + if (init_sz - _end < i + _ehead) { + init_sz = (i + _ehead + _end + 4095) & ~4095; + put_unaligned_le32(init_sz, &buf[0x260]); + } +#endif update_pecoff_text(setup_sectors * 512, i + (sys_size * 16), init_sz); efi_stub_entry_update(); -- cgit From d5cdf4cfeac914617ca22866bd4685fd7f876dec Mon Sep 17 00:00:00 2001 From: Arvind Sankar Date: Sun, 8 Mar 2020 09:08:50 +0100 Subject: efi/x86: Don't relocate the kernel unless necessary Add alignment slack to the PE image size, so that we can realign the decompression buffer within the space allocated for the image. Only relocate the kernel if it has been loaded at an unsuitable address: - Below LOAD_PHYSICAL_ADDR, or - Above 64T for 64-bit and 512MiB for 32-bit For 32-bit, the upper limit is conservative, but the exact limit can be difficult to calculate. Signed-off-by: Arvind Sankar Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200303221205.4048668-6-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-20-ardb@kernel.org --- arch/x86/boot/tools/build.c | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/tools/build.c b/arch/x86/boot/tools/build.c index 8cac5b6103db..8f8c8e386cea 100644 --- a/arch/x86/boot/tools/build.c +++ b/arch/x86/boot/tools/build.c @@ -238,21 +238,17 @@ static void update_pecoff_text(unsigned int text_start, unsigned int file_sz, pe_header = get_unaligned_le32(&buf[0x3c]); -#ifdef CONFIG_EFI_MIXED /* - * In mixed mode, we will execute startup_32() at whichever offset in - * memory it happened to land when the PE/COFF loader loaded the image, - * which may be misaligned with respect to the kernel_alignment field - * in the setup header. + * The PE/COFF loader may load the image at an address which is + * misaligned with respect to the kernel_alignment field in the setup + * header. * - * In order for startup_32 to safely execute in place at this offset, - * we need to ensure that the CONFIG_PHYSICAL_ALIGN aligned allocation - * it creates for the page tables does not extend beyond the declared - * size of the image in the PE/COFF header. So add the required slack. + * In order to avoid relocating the kernel to correct the misalignment, + * add slack to allow the buffer to be aligned within the declared size + * of the image. */ bss_sz += CONFIG_PHYSICAL_ALIGN; init_sz += CONFIG_PHYSICAL_ALIGN; -#endif /* * Size of code: Subtract the size of the first sector (512 bytes) -- cgit From 57648adb317c6f05d80a813130f8b88cecf1facf Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Sun, 8 Mar 2020 09:08:52 +0100 Subject: efi/x86: Preserve %ebx correctly in efi_set_virtual_address_map() Commit: 59f2a619a2db8611 ("efi: Add 'runtime' pointer to struct efi") modified the assembler routine called by efi_set_virtual_address_map(), to grab the 'runtime' EFI service pointer while running with paging disabled (which is tricky to do in C code) After the change, register %ebx is not restored correctly, resulting in all kinds of weird behavior, so fix that. Reported-by: Guenter Roeck Tested-by: Guenter Roeck Signed-off-by: Ard Biesheuvel Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200304133515.15035-1-ardb@kernel.org Link: https://lore.kernel.org/r/20200308080859.21568-22-ardb@kernel.org --- arch/x86/platform/efi/efi_stub_32.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/efi/efi_stub_32.S b/arch/x86/platform/efi/efi_stub_32.S index 09237236fb25..09ec84f6ef51 100644 --- a/arch/x86/platform/efi/efi_stub_32.S +++ b/arch/x86/platform/efi/efi_stub_32.S @@ -54,7 +54,7 @@ SYM_FUNC_START(efi_call_svam) orl $0x80000000, %edx movl %edx, %cr0 - pop %ebx + movl 16(%esp), %ebx leave ret SYM_FUNC_END(efi_call_svam) -- cgit From 008f1d60fe25810d4554916744b0975d76601b64 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 6 Mar 2020 14:03:44 +0100 Subject: x86/apic/vector: Force interupt handler invocation to irq context Sathyanarayanan reported that the PCI-E AER error injection mechanism can result in a NULL pointer dereference in apic_ack_edge(): BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 RIP: 0010:apic_ack_edge+0x1e/0x40 Call Trace: handle_edge_irq+0x7d/0x1e0 generic_handle_irq+0x27/0x30 aer_inject_write+0x53a/0x720 It crashes in irq_complete_move() which dereferences get_irq_regs() which is obviously NULL when this is called from non interrupt context. Of course the pointer could be checked, but that just papers over the real issue. Invoking the low level interrupt handling mechanism from random code can wreckage the fragile interrupt affinity mechanism of x86 as interrupts can only be moved in interrupt context or with special care when a CPU goes offline and the move has to be enforced. In the best case this triggers the warning in the MSI affinity setter, but if the call happens on the correct CPU it just corrupts state and might prevent further interrupt delivery for the affected device. Mark the APIC interrupts as unsuitable for being invoked in random contexts. This prevents the AER injection from proliferating the wreckage, but that's less broken than the current state of affairs and more correct than just papering over the problem by sprinkling random checks all over the place and silently corrupting state. Reported-by: sathyanarayanan.kuppuswamy@linux.intel.com Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200306130623.684591280@linutronix.de --- arch/x86/kernel/apic/vector.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 2c5676b0a6e7..6f6d98da05b9 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -556,6 +556,12 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq, irqd->chip_data = apicd; irqd->hwirq = virq + i; irqd_set_single_target(irqd); + /* + * Prevent that any of these interrupts is invoked in + * non interrupt context via e.g. generic_handle_irq() + * as that can corrupt the affinity move state. + */ + irqd_set_handle_enforce_irqctx(irqd); /* * Legacy vectors are already assigned when the IOAPIC * takes them over. They stay on the same vector. This is -- cgit From b00f80fcfaa098f987dde99585e73e8ed7edae51 Mon Sep 17 00:00:00 2001 From: Boqun Feng Date: Mon, 10 Feb 2020 11:39:51 +0800 Subject: PCI: hv: Move hypercall related definitions into tlfs header Currently HVCALL_RETARGET_INTERRUPT and HV_PARTITION_ID_SELF are defined in pci-hyperv.c. However, similar to other hypercall related definitions, it makes more sense to put them in the tlfs header file. Besides, these definitions are arch-dependent, so for the support of virtual PCI on non-x86 archs in the future, move them into arch-specific tlfs header file. Signed-off-by: Boqun Feng (Microsoft) Signed-off-by: Lorenzo Pieralisi Reviewed-by: Andrew Murray Reviewed-by: Dexuan Cui --- arch/x86/include/asm/hyperv-tlfs.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index 92abc1e42bfc..dffed0e10a68 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -376,6 +376,7 @@ struct hv_tsc_emulation_status { #define HVCALL_SEND_IPI_EX 0x0015 #define HVCALL_POST_MESSAGE 0x005c #define HVCALL_SIGNAL_EVENT 0x005d +#define HVCALL_RETARGET_INTERRUPT 0x007e #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0 @@ -405,6 +406,8 @@ enum HV_GENERIC_SET_FORMAT { HV_GENERIC_SET_ALL, }; +#define HV_PARTITION_ID_SELF ((u64)-1) + #define HV_HYPERCALL_RESULT_MASK GENMASK_ULL(15, 0) #define HV_HYPERCALL_FAST_BIT BIT(16) #define HV_HYPERCALL_VARHEAD_OFFSET 17 -- cgit From 61bfd920abbf2c8c9c3b10bb335475e707247573 Mon Sep 17 00:00:00 2001 From: Boqun Feng Date: Mon, 10 Feb 2020 11:39:52 +0800 Subject: PCI: hv: Move retarget related structures into tlfs header Currently, retarget_msi_interrupt and other structures it relys on are defined in pci-hyperv.c. However, those structures are actually defined in Hypervisor Top-Level Functional Specification [1] and may be different in sizes of fields or layout from architecture to architecture. Let's move those definitions into x86's tlfs header file to support virtual PCI on non-x86 architectures in the future. Note that "__packed" attribute is added to these structures during the movement for the same reason as we use the attribute for other TLFS structures in the header file: make sure the structures meet the specification and avoid anything unexpected from the compilers. Additionally, rename struct retarget_msi_interrupt to hv_retarget_msi_interrupt for the consistent naming convention, also mirroring the name in TLFS. [1]: https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs Signed-off-by: Boqun Feng (Microsoft) Signed-off-by: Lorenzo Pieralisi Reviewed-by: Dexuan Cui --- arch/x86/include/asm/hyperv-tlfs.h | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index dffed0e10a68..a0b6a88d2f05 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -912,4 +912,35 @@ struct hv_tlb_flush_ex { struct hv_partition_assist_pg { u32 tlb_lock_count; }; + +struct hv_interrupt_entry { + u32 source; /* 1 for MSI(-X) */ + u32 reserved1; + u32 address; + u32 data; +} __packed; + +/* + * flags for hv_device_interrupt_target.flags + */ +#define HV_DEVICE_INTERRUPT_TARGET_MULTICAST 1 +#define HV_DEVICE_INTERRUPT_TARGET_PROCESSOR_SET 2 + +struct hv_device_interrupt_target { + u32 vector; + u32 flags; + union { + u64 vp_mask; + struct hv_vpset vp_set; + }; +} __packed; + +/* HvRetargetDeviceInterrupt hypercall */ +struct hv_retarget_device_interrupt { + u64 partition_id; /* use "self" */ + u64 device_id; + struct hv_interrupt_entry int_entry; + u64 reserved2; + struct hv_device_interrupt_target int_target; +} __packed __aligned(8); #endif -- cgit From 1cf106d93245f436c10e73cd3d4b885067d4bbcc Mon Sep 17 00:00:00 2001 From: Boqun Feng Date: Mon, 10 Feb 2020 11:39:53 +0800 Subject: PCI: hv: Introduce hv_msi_entry Add a new structure (hv_msi_entry), which is also defined in the TLFS, to describe the msi entry for HVCALL_RETARGET_INTERRUPT. The structure is needed because its layout may be different from architecture to architecture. Also add a new generic interface hv_set_msi_entry_from_desc() to allow different archs to set the msi entry from msi_desc. No functional change, only preparation for the future support of virtual PCI on non-x86 architectures. Signed-off-by: Boqun Feng (Microsoft) Signed-off-by: Lorenzo Pieralisi Reviewed-by: Dexuan Cui --- arch/x86/include/asm/hyperv-tlfs.h | 11 +++++++++-- arch/x86/include/asm/mshyperv.h | 8 ++++++++ 2 files changed, 17 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index a0b6a88d2f05..29336574d0bc 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -913,11 +913,18 @@ struct hv_partition_assist_pg { u32 tlb_lock_count; }; +union hv_msi_entry { + u64 as_uint64; + struct { + u32 address; + u32 data; + } __packed; +}; + struct hv_interrupt_entry { u32 source; /* 1 for MSI(-X) */ u32 reserved1; - u32 address; - u32 data; + union hv_msi_entry msi_entry; } __packed; /* diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h index 6b79515abb82..81fc30240122 100644 --- a/arch/x86/include/asm/mshyperv.h +++ b/arch/x86/include/asm/mshyperv.h @@ -4,6 +4,7 @@ #include #include +#include #include #include #include @@ -240,6 +241,13 @@ bool hv_vcpu_is_preempted(int vcpu); static inline void hv_apic_init(void) {} #endif +static inline void hv_set_msi_entry_from_desc(union hv_msi_entry *msi_entry, + struct msi_desc *msi_desc) +{ + msi_entry->address = msi_desc->msg.address_lo; + msi_entry->data = msi_desc->msg.data; +} + #else /* CONFIG_HYPERV */ static inline void hyperv_init(void) {} static inline void hyperv_setup_mmu_ops(void) {} -- cgit From bdb04a1abbf92c998f1afb5f00a037f2edaec1f7 Mon Sep 17 00:00:00 2001 From: Tony W Wang-oc Date: Mon, 9 Mar 2020 14:06:30 +0800 Subject: x86/Kconfig: Drop vendor dependency for X86_UMIP Some Centaur family 7 CPUs and Zhaoxin family 7 CPUs support the UMIP feature too. The text size growth which UMIP adds is ~1K and distro kernels enable it anyway so remove the vendor dependency. [ bp: Rewrite commit message. ] Signed-off-by: Tony W Wang-oc Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/1583733990-2587-1-git-send-email-TonyWWang-oc@zhaoxin.com --- arch/x86/Kconfig | 1 - 1 file changed, 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index beea77046f9b..cb3633d243cb 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1875,7 +1875,6 @@ config X86_SMAP config X86_UMIP def_bool y - depends on CPU_SUP_INTEL || CPU_SUP_AMD prompt "User Mode Instruction Prevention" if EXPERT ---help--- User Mode Instruction Prevention (UMIP) is a security feature in -- cgit From d8ecca4043f2d9d89daab7915eca8c2ec6254d0f Mon Sep 17 00:00:00 2001 From: Tony Luck Date: Tue, 18 Feb 2020 10:44:08 -0800 Subject: x86/mce/dev-mcelog: Dynamically allocate space for machine check records We have had a hard coded limit of 32 machine check records since the dawn of time. But as numbers of cores increase, it is possible for more than 32 errors to be reported before a user process reads from /dev/mcelog. In this case the additional errors are lost. Keep 32 as the minimum. But tune the maximum value up based on the number of processors. Signed-off-by: Tony Luck Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200218184408.GA23048@agluck-desk2.amr.corp.intel.com --- arch/x86/include/asm/mce.h | 6 ++--- arch/x86/kernel/cpu/mce/dev-mcelog.c | 47 +++++++++++++++++++++--------------- 2 files changed, 30 insertions(+), 23 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index 4359b955e0b7..9d5b09913ef3 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -102,7 +102,7 @@ #define MCE_OVERFLOW 0 /* bit 0 in flags means overflow */ -#define MCE_LOG_LEN 32 +#define MCE_LOG_MIN_LEN 32U #define MCE_LOG_SIGNATURE "MACHINECHECK" /* AMD Scalable MCA */ @@ -135,11 +135,11 @@ */ struct mce_log_buffer { char signature[12]; /* "MACHINECHECK" */ - unsigned len; /* = MCE_LOG_LEN */ + unsigned len; /* = elements in .mce_entry[] */ unsigned next; unsigned flags; unsigned recordlen; /* length of struct mce */ - struct mce entry[MCE_LOG_LEN]; + struct mce entry[]; }; enum mce_notifier_prios { diff --git a/arch/x86/kernel/cpu/mce/dev-mcelog.c b/arch/x86/kernel/cpu/mce/dev-mcelog.c index 7c8958dee103..d089567a9ce8 100644 --- a/arch/x86/kernel/cpu/mce/dev-mcelog.c +++ b/arch/x86/kernel/cpu/mce/dev-mcelog.c @@ -29,11 +29,7 @@ static char *mce_helper_argv[2] = { mce_helper, NULL }; * separate MCEs from kernel messages to avoid bogus bug reports. */ -static struct mce_log_buffer mcelog = { - .signature = MCE_LOG_SIGNATURE, - .len = MCE_LOG_LEN, - .recordlen = sizeof(struct mce), -}; +static struct mce_log_buffer *mcelog; static DECLARE_WAIT_QUEUE_HEAD(mce_chrdev_wait); @@ -45,21 +41,21 @@ static int dev_mce_log(struct notifier_block *nb, unsigned long val, mutex_lock(&mce_chrdev_read_mutex); - entry = mcelog.next; + entry = mcelog->next; /* * When the buffer fills up discard new entries. Assume that the * earlier errors are the more interesting ones: */ - if (entry >= MCE_LOG_LEN) { - set_bit(MCE_OVERFLOW, (unsigned long *)&mcelog.flags); + if (entry >= mcelog->len) { + set_bit(MCE_OVERFLOW, (unsigned long *)&mcelog->flags); goto unlock; } - mcelog.next = entry + 1; + mcelog->next = entry + 1; - memcpy(mcelog.entry + entry, mce, sizeof(struct mce)); - mcelog.entry[entry].finished = 1; + memcpy(mcelog->entry + entry, mce, sizeof(struct mce)); + mcelog->entry[entry].finished = 1; /* wake processes polling /dev/mcelog */ wake_up_interruptible(&mce_chrdev_wait); @@ -214,21 +210,21 @@ static ssize_t mce_chrdev_read(struct file *filp, char __user *ubuf, /* Only supports full reads right now */ err = -EINVAL; - if (*off != 0 || usize < MCE_LOG_LEN*sizeof(struct mce)) + if (*off != 0 || usize < mcelog->len * sizeof(struct mce)) goto out; - next = mcelog.next; + next = mcelog->next; err = 0; for (i = 0; i < next; i++) { - struct mce *m = &mcelog.entry[i]; + struct mce *m = &mcelog->entry[i]; err |= copy_to_user(buf, m, sizeof(*m)); buf += sizeof(*m); } - memset(mcelog.entry, 0, next * sizeof(struct mce)); - mcelog.next = 0; + memset(mcelog->entry, 0, next * sizeof(struct mce)); + mcelog->next = 0; if (err) err = -EFAULT; @@ -242,7 +238,7 @@ out: static __poll_t mce_chrdev_poll(struct file *file, poll_table *wait) { poll_wait(file, &mce_chrdev_wait, wait); - if (READ_ONCE(mcelog.next)) + if (READ_ONCE(mcelog->next)) return EPOLLIN | EPOLLRDNORM; if (!mce_apei_read_done && apei_check_mce()) return EPOLLIN | EPOLLRDNORM; @@ -261,13 +257,13 @@ static long mce_chrdev_ioctl(struct file *f, unsigned int cmd, case MCE_GET_RECORD_LEN: return put_user(sizeof(struct mce), p); case MCE_GET_LOG_LEN: - return put_user(MCE_LOG_LEN, p); + return put_user(mcelog->len, p); case MCE_GETCLEAR_FLAGS: { unsigned flags; do { - flags = mcelog.flags; - } while (cmpxchg(&mcelog.flags, flags, 0) != flags); + flags = mcelog->flags; + } while (cmpxchg(&mcelog->flags, flags, 0) != flags); return put_user(flags, p); } @@ -339,8 +335,18 @@ static struct miscdevice mce_chrdev_device = { static __init int dev_mcelog_init_device(void) { + int mce_log_len; int err; + mce_log_len = max(MCE_LOG_MIN_LEN, num_online_cpus()); + mcelog = kzalloc(sizeof(*mcelog) + mce_log_len * sizeof(struct mce), GFP_KERNEL); + if (!mcelog) + return -ENOMEM; + + strncpy(mcelog->signature, MCE_LOG_SIGNATURE, sizeof(mcelog->signature)); + mcelog->len = mce_log_len; + mcelog->recordlen = sizeof(struct mce); + /* register character device /dev/mcelog */ err = misc_register(&mce_chrdev_device); if (err) { @@ -350,6 +356,7 @@ static __init int dev_mcelog_init_device(void) else pr_err("Unable to init device /dev/mcelog (rc: %d)\n", err); + kfree(mcelog); return err; } -- cgit From 74a4882d723a76cb3c72caf440ca5ef3f2bca6ab Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sun, 8 Mar 2020 23:24:02 +0100 Subject: x86/entry/32: Remove unused label restore_nocheck Signed-off-by: Thomas Gleixner Reviewed-by: Alexandre Chartre Link: https://lkml.kernel.org/r/20200308222609.219366430@linutronix.de --- arch/x86/entry/entry_32.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index ddc87f232877..80df781d4afb 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -1091,7 +1091,7 @@ restore_all: TRACE_IRQS_IRET SWITCH_TO_ENTRY_STACK CHECK_AND_APPLY_ESPFIX -.Lrestore_nocheck: + /* Switch back to user CR3 */ SWITCH_TO_USER_CR3 scratch_reg=%eax -- cgit From 810f80a61be8c1d4a574082737f7a18c7459fa7b Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sun, 8 Mar 2020 23:24:03 +0100 Subject: x86/entry/64: Trace irqflags unconditionally as ON when returning to user space User space cannot disable interrupts any longer so trace return to user space unconditionally as IRQS_ON. Signed-off-by: Thomas Gleixner Reviewed-by: Alexandre Chartre Link: https://lkml.kernel.org/r/20200308222609.314596327@linutronix.de --- arch/x86/entry/entry_32.S | 2 +- arch/x86/entry/entry_64.S | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 80df781d4afb..b67bae7091d7 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -1088,7 +1088,7 @@ SYM_FUNC_START(entry_INT80_32) STACKLEAK_ERASE restore_all: - TRACE_IRQS_IRET + TRACE_IRQS_ON SWITCH_TO_ENTRY_STACK CHECK_AND_APPLY_ESPFIX diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index f2bb91e87877..0e9504fabe52 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -174,7 +174,7 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL) movq %rsp, %rsi call do_syscall_64 /* returns with IRQs disabled */ - TRACE_IRQS_IRETQ /* we're about to change IF */ + TRACE_IRQS_ON /* return enables interrupts */ /* * Try to use SYSRET instead of IRET if we're returning to @@ -619,7 +619,7 @@ ret_from_intr: .Lretint_user: mov %rsp,%rdi call prepare_exit_to_usermode - TRACE_IRQS_IRETQ + TRACE_IRQS_ON SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) #ifdef CONFIG_DEBUG_ENTRY -- cgit From 13fac1d851e09109096b5862bf37c3da6908fb48 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Tue, 10 Mar 2020 17:39:06 -0700 Subject: bpf: Fix trampoline generation for fmod_ret programs fmod_ret progs are emitted as: start = __bpf_prog_enter(); call fmod_ret *(u64 *)(rbp - 8) = rax __bpf_prog_exit(, start); test eax, eax jne do_fexit That 'test eax, eax' is working by accident. The compiler is free to use rax inside __bpf_prog_exit() or inside functions that __bpf_prog_exit() is calling. Which caused "test_progs -t modify_return" to sporadically fail depending on compiler version and kconfig. Fix it by using 'cmp [rbp - 8], 0' instead of 'test eax, eax'. Fixes: ae24082331d9 ("bpf: Introduce BPF_MODIFY_RETURN") Reported-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann Acked-by: Andrii Nakryiko Acked-by: KP Singh Link: https://lore.kernel.org/bpf/20200311003906.3643037-1-ast@kernel.org --- arch/x86/net/bpf_jit_comp.c | 31 +++++-------------------------- 1 file changed, 5 insertions(+), 26 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index b1fd000feb89..5ea7c2cf7ab4 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1449,23 +1449,6 @@ static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond) return 0; } -static int emit_mod_ret_check_imm8(u8 **pprog, int value) -{ - u8 *prog = *pprog; - int cnt = 0; - - if (!is_imm8(value)) - return -EINVAL; - - if (value == 0) - EMIT2(0x85, add_2reg(0xC0, BPF_REG_0, BPF_REG_0)); - else - EMIT3(0x83, add_1reg(0xF8, BPF_REG_0), value); - - *pprog = prog; - return 0; -} - static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, struct bpf_tramp_progs *tp, int stack_size) { @@ -1485,7 +1468,7 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog, u8 **branches) { u8 *prog = *pprog; - int i; + int i, cnt = 0; /* The first fmod_ret program will receive a garbage return value. * Set this to 0 to avoid confusing the program. @@ -1496,16 +1479,12 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog, if (invoke_bpf_prog(m, &prog, tp->progs[i], stack_size, true)) return -EINVAL; - /* Generate a branch: - * - * if (ret != 0) + /* mod_ret prog stored return value into [rbp - 8]. Emit: + * if (*(u64 *)(rbp - 8) != 0) * goto do_fexit; - * - * If needed this can be extended to any integer value which can - * be passed by user-space when the program is loaded. */ - if (emit_mod_ret_check_imm8(&prog, 0)) - return -EINVAL; + /* cmp QWORD PTR [rbp - 0x8], 0x0 */ + EMIT4(0x48, 0x83, 0x7d, 0xf8); EMIT1(0x00); /* Save the location of the branch and Generate 6 nops * (4 bytes for an offset and 2 bytes for the jump) These nops -- cgit From 17e5888e4e180b45af7bafe7f3a86440d42717f3 Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Thu, 23 Jan 2020 22:02:42 +0100 Subject: x86: Select HARDIRQS_SW_RESEND on x86 Modern x86 laptops are starting to use GPIO pins as interrupts more and more, e.g. touchpads and touchscreens have almost all moved away from PS/2 and USB to using I2C with a GPIO pin as interrupt. Modern x86 laptops also have almost all moved to using s2idle instead of using the system S3 ACPI power state to suspend. The Intel and AMD pinctrl drivers do not define irq_retrigger handlers for the irqchips they register, this is causing edge triggered interrupts which happen while suspended using s2idle to get lost. One specific example of this is the lid switch on some devices, lid switches used to be handled by the embedded-controller, but now the lid open/closed sensor is sometimes directly connected to a GPIO pin. On most devices the ACPI code for this looks like this: Method (_E00, ...) { Notify (LID0, 0x80) // Status Change } Where _E00 is an ACPI event handler for changes on both edges of the GPIO connected to the lid sensor, this event handler is then combined with an _LID method which directly reads the pin. When the device is resumed by opening the lid, the GPIO interrupt will wake the system, but because the pinctrl irqchip doesn't have an irq_retrigger handler, the Notify will not happen. This is not a problem in the case the _LID method directly reads the GPIO, because the drivers/acpi/button.c code will call _LID on resume anyways. But some devices have an event handler for the GPIO connected to the lid sensor which looks like this: Method (_E00, ...) { if (LID_GPIO == One) LIDS = One else LIDS = Zero Notify (LID0, 0x80) // Status Change } And the _LID method returns the cached LIDS value, since on open we do not re-run the edge-interrupt handler when we re-enable IRQS on resume (because of the missing irq_retrigger handler), _LID now will keep reporting closed, as LIDS was never changed to reflect the open status, this causes userspace to re-resume the laptop again shortly after opening the lid. The Intel GPIO controllers do not allow implementing irq_retrigger without emulating it in software, at which point we are better of just using the generic HARDIRQS_SW_RESEND mechanism rather then re-implementing software emulation for this separately in aprox. 14 different pinctrl drivers. Select HARDIRQS_SW_RESEND to solve the problem of edge-triggered GPIO interrupts not being re-triggered on resume when they were triggered during suspend (s2idle) and/or when they were the cause of the wakeup. This requires 008f1d60fe25 ("x86/apic/vector: Force interupt handler invocation to irq context") c16816acd086 ("genirq: Add protection against unsafe usage of generic_handle_irq()") to protect the APIC based interrupts from being wreckaged by a software resend. Signed-off-by: Hans de Goede Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200123210242.53367-1-hdegoede@redhat.com --- arch/x86/Kconfig | 1 + 1 file changed, 1 insertion(+) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index beea77046f9b..912893299a30 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -128,6 +128,7 @@ config X86 select GENERIC_GETTIMEOFDAY select GENERIC_VDSO_TIME_NS select GUP_GET_PTE_LOW_HIGH if X86_PAE + select HARDIRQS_SW_RESEND select HARDLOCKUP_CHECK_TIMESTAMP if X86_64 select HAVE_ACPI_APEI if ACPI select HAVE_ACPI_APEI_NMI if ACPI -- cgit From 812c2d7506fde7cdf83cb2532810a65782b51741 Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Sun, 23 Feb 2020 15:06:08 +0100 Subject: x86/tsc_msr: Use named struct initializers Use named struct initializers for the freq_desc struct-s initialization and change the "u8 msr_plat" to a "bool use_msr_plat" to make its meaning more clear instead of relying on a comment to explain it. Signed-off-by: Hans de Goede Signed-off-by: Thomas Gleixner Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200223140610.59612-1-hdegoede@redhat.com --- arch/x86/kernel/tsc_msr.c | 28 ++++++++++++++++++---------- 1 file changed, 18 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c index e0cbe4f2af49..5fa41ac3feb1 100644 --- a/arch/x86/kernel/tsc_msr.c +++ b/arch/x86/kernel/tsc_msr.c @@ -22,10 +22,10 @@ * read in MSR_PLATFORM_ID[12:8], otherwise in MSR_PERF_STAT[44:40]. * Unfortunately some Intel Atom SoCs aren't quite compliant to this, * so we need manually differentiate SoC families. This is what the - * field msr_plat does. + * field use_msr_plat does. */ struct freq_desc { - u8 msr_plat; /* 1: use MSR_PLATFORM_INFO, 0: MSR_IA32_PERF_STATUS */ + bool use_msr_plat; u32 freqs[MAX_NUM_FREQS]; }; @@ -35,31 +35,39 @@ struct freq_desc { * by MSR based on SDM. */ static const struct freq_desc freq_desc_pnw = { - 0, { 0, 0, 0, 0, 0, 99840, 0, 83200 } + .use_msr_plat = false, + .freqs = { 0, 0, 0, 0, 0, 99840, 0, 83200 }, }; static const struct freq_desc freq_desc_clv = { - 0, { 0, 133200, 0, 0, 0, 99840, 0, 83200 } + .use_msr_plat = false, + .freqs = { 0, 133200, 0, 0, 0, 99840, 0, 83200 }, }; static const struct freq_desc freq_desc_byt = { - 1, { 83300, 100000, 133300, 116700, 80000, 0, 0, 0 } + .use_msr_plat = true, + .freqs = { 83300, 100000, 133300, 116700, 80000, 0, 0, 0 }, }; static const struct freq_desc freq_desc_cht = { - 1, { 83300, 100000, 133300, 116700, 80000, 93300, 90000, 88900, 87500 } + .use_msr_plat = true, + .freqs = { 83300, 100000, 133300, 116700, 80000, 93300, 90000, + 88900, 87500 }, }; static const struct freq_desc freq_desc_tng = { - 1, { 0, 100000, 133300, 0, 0, 0, 0, 0 } + .use_msr_plat = true, + .freqs = { 0, 100000, 133300, 0, 0, 0, 0, 0 }, }; static const struct freq_desc freq_desc_ann = { - 1, { 83300, 100000, 133300, 100000, 0, 0, 0, 0 } + .use_msr_plat = true, + .freqs = { 83300, 100000, 133300, 100000, 0, 0, 0, 0 }, }; static const struct freq_desc freq_desc_lgm = { - 1, { 78000, 78000, 78000, 78000, 78000, 78000, 78000, 78000 } + .use_msr_plat = true, + .freqs = { 78000, 78000, 78000, 78000, 78000, 78000, 78000, 78000 }, }; static const struct x86_cpu_id tsc_msr_cpu_ids[] = { @@ -91,7 +99,7 @@ unsigned long cpu_khz_from_msr(void) return 0; freq_desc = (struct freq_desc *)id->driver_data; - if (freq_desc->msr_plat) { + if (freq_desc->use_msr_plat) { rdmsr(MSR_PLATFORM_INFO, lo, hi); ratio = (lo >> 8) & 0xff; } else { -- cgit From c8810e2ffc30c7e1577f9c057c4b85d984bbc35a Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Sun, 23 Feb 2020 15:06:09 +0100 Subject: x86/tsc_msr: Fix MSR_FSB_FREQ mask for Cherry Trail devices According to the "Intel 64 and IA-32 Architectures Software Developer's Manual Volume 4: Model-Specific Registers" on Cherry Trail (Airmont) devices the 4 lowest bits of the MSR_FSB_FREQ mask indicate the bus freq unlike on e.g. Bay Trail where only the lowest 3 bits are used. This is also the reason why MAX_NUM_FREQS is defined as 9, since Cherry Trail SoCs have 9 possible frequencies, so the lo value from the MSR needs to be masked with 0x0f, not with 0x07 otherwise the 9th frequency will get interpreted as the 1st. Bump MAX_NUM_FREQS to 16 to avoid any possibility of addressing the array out of bounds and makes the mask part of the cpufreq struct so it can be set it per model. While at it also log an error when the index points to an uninitialized part of the freqs lookup-table. Signed-off-by: Hans de Goede Signed-off-by: Thomas Gleixner Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200223140610.59612-2-hdegoede@redhat.com --- arch/x86/kernel/tsc_msr.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c index 5fa41ac3feb1..95030895fffa 100644 --- a/arch/x86/kernel/tsc_msr.c +++ b/arch/x86/kernel/tsc_msr.c @@ -15,7 +15,7 @@ #include #include -#define MAX_NUM_FREQS 9 +#define MAX_NUM_FREQS 16 /* 4 bits to select the frequency */ /* * If MSR_PERF_STAT[31] is set, the maximum resolved bus ratio can be @@ -27,6 +27,7 @@ struct freq_desc { bool use_msr_plat; u32 freqs[MAX_NUM_FREQS]; + u32 mask; }; /* @@ -37,37 +38,44 @@ struct freq_desc { static const struct freq_desc freq_desc_pnw = { .use_msr_plat = false, .freqs = { 0, 0, 0, 0, 0, 99840, 0, 83200 }, + .mask = 0x07, }; static const struct freq_desc freq_desc_clv = { .use_msr_plat = false, .freqs = { 0, 133200, 0, 0, 0, 99840, 0, 83200 }, + .mask = 0x07, }; static const struct freq_desc freq_desc_byt = { .use_msr_plat = true, .freqs = { 83300, 100000, 133300, 116700, 80000, 0, 0, 0 }, + .mask = 0x07, }; static const struct freq_desc freq_desc_cht = { .use_msr_plat = true, .freqs = { 83300, 100000, 133300, 116700, 80000, 93300, 90000, 88900, 87500 }, + .mask = 0x0f, }; static const struct freq_desc freq_desc_tng = { .use_msr_plat = true, .freqs = { 0, 100000, 133300, 0, 0, 0, 0, 0 }, + .mask = 0x07, }; static const struct freq_desc freq_desc_ann = { .use_msr_plat = true, .freqs = { 83300, 100000, 133300, 100000, 0, 0, 0, 0 }, + .mask = 0x0f, }; static const struct freq_desc freq_desc_lgm = { .use_msr_plat = true, .freqs = { 78000, 78000, 78000, 78000, 78000, 78000, 78000, 78000 }, + .mask = 0x0f, }; static const struct x86_cpu_id tsc_msr_cpu_ids[] = { @@ -93,6 +101,7 @@ unsigned long cpu_khz_from_msr(void) const struct freq_desc *freq_desc; const struct x86_cpu_id *id; unsigned long res; + int index; id = x86_match_cpu(tsc_msr_cpu_ids); if (!id) @@ -109,13 +118,17 @@ unsigned long cpu_khz_from_msr(void) /* Get FSB FREQ ID */ rdmsr(MSR_FSB_FREQ, lo, hi); + index = lo & freq_desc->mask; /* Map CPU reference clock freq ID(0-7) to CPU reference clock freq(KHz) */ - freq = freq_desc->freqs[lo & 0x7]; + freq = freq_desc->freqs[index]; /* TSC frequency = maximum resolved freq * maximum resolved bus ratio */ res = freq * ratio; + if (freq == 0) + pr_err("Error MSR_FSB_FREQ index %d is unknown\n", index); + #ifdef CONFIG_X86_LOCAL_APIC lapic_timer_period = (freq * 1000) / HZ; #endif -- cgit From fac01d11722c92a186b27ee26cd429a8066adfb5 Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Sun, 23 Feb 2020 15:06:10 +0100 Subject: x86/tsc_msr: Make MSR derived TSC frequency more accurate MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 4: Model-Specific Registers" has the following table for the values from freq_desc_byt: 000B: 083.3 MHz 001B: 100.0 MHz 010B: 133.3 MHz 011B: 116.7 MHz 100B: 080.0 MHz Notice how for e.g the 83.3 MHz value there are 3 significant digits, which translates to an accuracy of a 1000 ppm, where as a typical crystal oscillator is 20 - 100 ppm, so the accuracy of the frequency format used in the Software Developer’s Manual is not really helpful. As far as we know Bay Trail SoCs use a 25 MHz crystal and Cherry Trail uses a 19.2 MHz crystal, the crystal is the source clock for a root PLL which outputs 1600 and 100 MHz. It is unclear if the root PLL outputs are used directly by the CPU clock PLL or if there is another PLL in between. This does not matter though, we can model the chain of PLLs as a single PLL with a quotient equal to the quotients of all PLLs in the chain multiplied. So we can create a simplified model of the CPU clock setup using a reference clock of 100 MHz plus a quotient which gets us as close to the frequency from the SDM as possible. For the 83.3 MHz example from above this would give 100 MHz * 5 / 6 = 83 and 1/3 MHz, which matches exactly what has been measured on actual hardware. Use a simplified PLL model with a reference clock of 100 MHz for all Bay and Cherry Trail models. This has been tested on the following models: CPU freq before: CPU freq after: Intel N2840 2165.800 MHz 2166.667 MHz Intel Z3736 1332.800 MHz 1333.333 MHz Intel Z3775 1466.300 MHz 1466.667 MHz Intel Z8350 1440.000 MHz 1440.000 MHz Intel Z8750 1600.000 MHz 1600.000 MHz This fixes the time drifting by about 1 second per hour (20 - 30 seconds per day) on (some) devices which rely on the tsc_msr.c code to determine the TSC frequency. Reported-by: Vipul Kumar Suggested-by: Thomas Gleixner Signed-off-by: Hans de Goede Signed-off-by: Thomas Gleixner Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200223140610.59612-3-hdegoede@redhat.com --- arch/x86/kernel/tsc_msr.c | 97 +++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 86 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c index 95030895fffa..c65adaf81384 100644 --- a/arch/x86/kernel/tsc_msr.c +++ b/arch/x86/kernel/tsc_msr.c @@ -17,6 +17,28 @@ #define MAX_NUM_FREQS 16 /* 4 bits to select the frequency */ +/* + * The frequency numbers in the SDM are e.g. 83.3 MHz, which does not contain a + * lot of accuracy which leads to clock drift. As far as we know Bay Trail SoCs + * use a 25 MHz crystal and Cherry Trail uses a 19.2 MHz crystal, the crystal + * is the source clk for a root PLL which outputs 1600 and 100 MHz. It is + * unclear if the root PLL outputs are used directly by the CPU clock PLL or + * if there is another PLL in between. + * This does not matter though, we can model the chain of PLLs as a single PLL + * with a quotient equal to the quotients of all PLLs in the chain multiplied. + * So we can create a simplified model of the CPU clock setup using a reference + * clock of 100 MHz plus a quotient which gets us as close to the frequency + * from the SDM as possible. + * For the 83.3 MHz example from above this would give us 100 MHz * 5 / 6 = + * 83 and 1/3 MHz, which matches exactly what has been measured on actual hw. + */ +#define TSC_REFERENCE_KHZ 100000 + +struct muldiv { + u32 multiplier; + u32 divider; +}; + /* * If MSR_PERF_STAT[31] is set, the maximum resolved bus ratio can be * read in MSR_PLATFORM_ID[12:8], otherwise in MSR_PERF_STAT[44:40]. @@ -26,6 +48,11 @@ */ struct freq_desc { bool use_msr_plat; + struct muldiv muldiv[MAX_NUM_FREQS]; + /* + * Some CPU frequencies in the SDM do not map to known PLL freqs, in + * that case the muldiv array is empty and the freqs array is used. + */ u32 freqs[MAX_NUM_FREQS]; u32 mask; }; @@ -47,31 +74,66 @@ static const struct freq_desc freq_desc_clv = { .mask = 0x07, }; +/* + * Bay Trail SDM MSR_FSB_FREQ frequencies simplified PLL model: + * 000: 100 * 5 / 6 = 83.3333 MHz + * 001: 100 * 1 / 1 = 100.0000 MHz + * 010: 100 * 4 / 3 = 133.3333 MHz + * 011: 100 * 7 / 6 = 116.6667 MHz + * 100: 100 * 4 / 5 = 80.0000 MHz + */ static const struct freq_desc freq_desc_byt = { .use_msr_plat = true, - .freqs = { 83300, 100000, 133300, 116700, 80000, 0, 0, 0 }, + .muldiv = { { 5, 6 }, { 1, 1 }, { 4, 3 }, { 7, 6 }, + { 4, 5 } }, .mask = 0x07, }; +/* + * Cherry Trail SDM MSR_FSB_FREQ frequencies simplified PLL model: + * 0000: 100 * 5 / 6 = 83.3333 MHz + * 0001: 100 * 1 / 1 = 100.0000 MHz + * 0010: 100 * 4 / 3 = 133.3333 MHz + * 0011: 100 * 7 / 6 = 116.6667 MHz + * 0100: 100 * 4 / 5 = 80.0000 MHz + * 0101: 100 * 14 / 15 = 93.3333 MHz + * 0110: 100 * 9 / 10 = 90.0000 MHz + * 0111: 100 * 8 / 9 = 88.8889 MHz + * 1000: 100 * 7 / 8 = 87.5000 MHz + */ static const struct freq_desc freq_desc_cht = { .use_msr_plat = true, - .freqs = { 83300, 100000, 133300, 116700, 80000, 93300, 90000, - 88900, 87500 }, + .muldiv = { { 5, 6 }, { 1, 1 }, { 4, 3 }, { 7, 6 }, + { 4, 5 }, { 14, 15 }, { 9, 10 }, { 8, 9 }, + { 7, 8 } }, .mask = 0x0f, }; +/* + * Merriefield SDM MSR_FSB_FREQ frequencies simplified PLL model: + * 0001: 100 * 1 / 1 = 100.0000 MHz + * 0010: 100 * 4 / 3 = 133.3333 MHz + */ static const struct freq_desc freq_desc_tng = { .use_msr_plat = true, - .freqs = { 0, 100000, 133300, 0, 0, 0, 0, 0 }, + .muldiv = { { 0, 0 }, { 1, 1 }, { 4, 3 } }, .mask = 0x07, }; +/* + * Moorefield SDM MSR_FSB_FREQ frequencies simplified PLL model: + * 0000: 100 * 5 / 6 = 83.3333 MHz + * 0001: 100 * 1 / 1 = 100.0000 MHz + * 0010: 100 * 4 / 3 = 133.3333 MHz + * 0011: 100 * 1 / 1 = 100.0000 MHz + */ static const struct freq_desc freq_desc_ann = { .use_msr_plat = true, - .freqs = { 83300, 100000, 133300, 100000, 0, 0, 0, 0 }, + .muldiv = { { 5, 6 }, { 1, 1 }, { 4, 3 }, { 1, 1 } }, .mask = 0x0f, }; +/* 24 MHz crystal? : 24 * 13 / 4 = 78 MHz */ static const struct freq_desc freq_desc_lgm = { .use_msr_plat = true, .freqs = { 78000, 78000, 78000, 78000, 78000, 78000, 78000, 78000 }, @@ -97,9 +159,10 @@ static const struct x86_cpu_id tsc_msr_cpu_ids[] = { */ unsigned long cpu_khz_from_msr(void) { - u32 lo, hi, ratio, freq; + u32 lo, hi, ratio, freq, tscref; const struct freq_desc *freq_desc; const struct x86_cpu_id *id; + const struct muldiv *md; unsigned long res; int index; @@ -119,12 +182,24 @@ unsigned long cpu_khz_from_msr(void) /* Get FSB FREQ ID */ rdmsr(MSR_FSB_FREQ, lo, hi); index = lo & freq_desc->mask; + md = &freq_desc->muldiv[index]; - /* Map CPU reference clock freq ID(0-7) to CPU reference clock freq(KHz) */ - freq = freq_desc->freqs[index]; - - /* TSC frequency = maximum resolved freq * maximum resolved bus ratio */ - res = freq * ratio; + /* + * Note this also catches cases where the index points to an unpopulated + * part of muldiv, in that case the else will set freq and res to 0. + */ + if (md->divider) { + tscref = TSC_REFERENCE_KHZ * md->multiplier; + freq = DIV_ROUND_CLOSEST(tscref, md->divider); + /* + * Multiplying by ratio before the division has better + * accuracy than just calculating freq * ratio. + */ + res = DIV_ROUND_CLOSEST(tscref * ratio, md->divider); + } else { + freq = freq_desc->freqs[index]; + res = freq * ratio; + } if (freq == 0) pr_err("Error MSR_FSB_FREQ index %d is unknown\n", index); -- cgit From 753039ef8b2f1078e5bff8cd42f80578bf6385b0 Mon Sep 17 00:00:00 2001 From: Kim Phillips Date: Wed, 11 Mar 2020 14:14:51 -0500 Subject: x86/cpu/amd: Call init_amd_zn() om Family 19h processors too Family 19h CPUs are Zen-based and still share most architectural features with Family 17h CPUs, and therefore still need to call init_amd_zn() e.g., to set the RECLAIM_DISTANCE override. init_amd_zn() also sets X86_FEATURE_ZEN, which today is only used in amd_set_core_ssb_state(), which isn't called on some late model Family 17h CPUs, nor on any Family 19h CPUs: X86_FEATURE_AMD_SSBD replaces X86_FEATURE_LS_CFG_SSBD on those later model CPUs, where the SSBD mitigation is done via the SPEC_CTRL MSR instead of the LS_CFG MSR. Family 19h CPUs also don't have the erratum where the CPB feature bit isn't set, but that code can stay unchanged and run safely on Family 19h. Signed-off-by: Kim Phillips Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200311191451.13221-1-kim.phillips@amd.com --- arch/x86/include/asm/cpufeatures.h | 2 +- arch/x86/kernel/cpu/amd.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index f3327cb56edf..f980efcacfeb 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -217,7 +217,7 @@ #define X86_FEATURE_IBRS ( 7*32+25) /* Indirect Branch Restricted Speculation */ #define X86_FEATURE_IBPB ( 7*32+26) /* Indirect Branch Prediction Barrier */ #define X86_FEATURE_STIBP ( 7*32+27) /* Single Thread Indirect Branch Predictors */ -#define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */ +#define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD family 0x17 or above (Zen) */ #define X86_FEATURE_L1TF_PTEINV ( 7*32+29) /* "" L1TF workaround PTE inversion */ #define X86_FEATURE_IBRS_ENHANCED ( 7*32+30) /* Enhanced IBRS */ #define X86_FEATURE_MSR_IA32_FEAT_CTL ( 7*32+31) /* "" MSR IA32_FEAT_CTL configured */ diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index ac83a0fef628..dc6894a3f22d 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -925,7 +925,8 @@ static void init_amd(struct cpuinfo_x86 *c) case 0x12: init_amd_ln(c); break; case 0x15: init_amd_bd(c); break; case 0x16: init_amd_jg(c); break; - case 0x17: init_amd_zn(c); break; + case 0x17: fallthrough; + case 0x19: init_amd_zn(c); break; } /* -- cgit From 9e2b4be377f0d715d9d910507890f9620cc22a9d Mon Sep 17 00:00:00 2001 From: Nayna Jain Date: Sun, 8 Mar 2020 20:57:51 -0400 Subject: ima: add a new CONFIG for loading arch-specific policies Every time a new architecture defines the IMA architecture specific functions - arch_ima_get_secureboot() and arch_ima_get_policy(), the IMA include file needs to be updated. To avoid this "noise", this patch defines a new IMA Kconfig IMA_SECURE_AND_OR_TRUSTED_BOOT option, allowing the different architectures to select it. Suggested-by: Linus Torvalds Signed-off-by: Nayna Jain Acked-by: Ard Biesheuvel Acked-by: Philipp Rudo (s390) Acked-by: Michael Ellerman (powerpc) Signed-off-by: Mimi Zohar --- arch/x86/Kconfig | 1 + arch/x86/kernel/Makefile | 4 +--- 2 files changed, 2 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index beea77046f9b..dcf5b1729f7c 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -230,6 +230,7 @@ config X86 select VIRT_TO_BUS select X86_FEATURE_NAMES if PROC_FS select PROC_PID_ARCH_STATUS if PROC_FS + imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI config INSTRUCTION_DECODER def_bool y diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 9b294c13809a..cfef37a27def 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -154,6 +154,4 @@ ifeq ($(CONFIG_X86_64),y) obj-y += vsmp_64.o endif -ifdef CONFIG_EFI -obj-$(CONFIG_IMA) += ima_arch.o -endif +obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT) += ima_arch.o -- cgit From 6a9feaa8774f3b8210dfe40626a75ca047e4ecae Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Wed, 5 Feb 2020 15:34:26 +0100 Subject: x86/mm/kmmio: Use this_cpu_ptr() instead get_cpu_var() for kmmio_ctx Both call sites that access kmmio_ctx, access kmmio_ctx with interrupts disabled. There is no need to use get_cpu_var() which additionally disables preemption. Use this_cpu_ptr() to access the kmmio_ctx variable of the current CPU. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200205143426.2592512-1-bigeasy@linutronix.de --- arch/x86/mm/kmmio.c | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/kmmio.c b/arch/x86/mm/kmmio.c index 49d7814b59a9..9994353fb75d 100644 --- a/arch/x86/mm/kmmio.c +++ b/arch/x86/mm/kmmio.c @@ -260,7 +260,7 @@ int kmmio_handler(struct pt_regs *regs, unsigned long addr) goto no_kmmio; } - ctx = &get_cpu_var(kmmio_ctx); + ctx = this_cpu_ptr(&kmmio_ctx); if (ctx->active) { if (page_base == ctx->addr) { /* @@ -285,7 +285,7 @@ int kmmio_handler(struct pt_regs *regs, unsigned long addr) pr_emerg("previous hit was at 0x%08lx.\n", ctx->addr); disarm_kmmio_fault_page(faultpage); } - goto no_kmmio_ctx; + goto no_kmmio; } ctx->active++; @@ -314,11 +314,8 @@ int kmmio_handler(struct pt_regs *regs, unsigned long addr) * the user should drop to single cpu before tracing. */ - put_cpu_var(kmmio_ctx); return 1; /* fault handled */ -no_kmmio_ctx: - put_cpu_var(kmmio_ctx); no_kmmio: rcu_read_unlock(); preempt_enable_no_resched(); @@ -333,7 +330,7 @@ no_kmmio: static int post_kmmio_handler(unsigned long condition, struct pt_regs *regs) { int ret = 0; - struct kmmio_context *ctx = &get_cpu_var(kmmio_ctx); + struct kmmio_context *ctx = this_cpu_ptr(&kmmio_ctx); if (!ctx->active) { /* @@ -371,7 +368,6 @@ static int post_kmmio_handler(unsigned long condition, struct pt_regs *regs) if (!(regs->flags & X86_EFLAGS_TF)) ret = 1; out: - put_cpu_var(kmmio_ctx); return ret; } -- cgit From b56cd05c55a10afd479a1877d7f6a50d92d8536e Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 12 Mar 2020 20:55:56 +0100 Subject: x86/mm: Rename is_kernel_text to __is_kernel_text The kbuild test robot reported compile issue on x86 in one of the following patches that adds include into , which is picked up by init_32.c object. The problem is that defines global function is_kernel_text which colides with the static function of the same name defined in init_32.c: $ make ARCH=i386 ... >> arch/x86/mm/init_32.c:241:19: error: redefinition of 'is_kernel_text' static inline int is_kernel_text(unsigned long addr) ^~~~~~~~~~~~~~ In file included from include/linux/bpf.h:21:0, from include/linux/bpf-cgroup.h:5, from include/linux/cgroup-defs.h:22, from include/linux/cgroup.h:28, from include/linux/hugetlb.h:9, from arch/x86/mm/init_32.c:18: include/linux/kallsyms.h:31:19: note: previous definition of 'is_kernel_text' was here static inline int is_kernel_text(unsigned long addr) Renaming the init_32.c is_kernel_text function to __is_kernel_text. Reported-by: kbuild test robot Signed-off-by: Jiri Olsa Signed-off-by: Alexei Starovoitov Acked-by: Song Liu Link: https://lore.kernel.org/bpf/20200312195610.346362-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- arch/x86/mm/init_32.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index 23df4885bbed..eb6ede2c3d43 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -238,7 +238,11 @@ page_table_range_init(unsigned long start, unsigned long end, pgd_t *pgd_base) } } -static inline int is_kernel_text(unsigned long addr) +/* + * The already defines is_kernel_text, + * using '__' prefix not to get in conflict. + */ +static inline int __is_kernel_text(unsigned long addr) { if (addr >= (unsigned long)_text && addr <= (unsigned long)__init_end) return 1; @@ -328,8 +332,8 @@ repeat: addr2 = (pfn + PTRS_PER_PTE-1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1; - if (is_kernel_text(addr) || - is_kernel_text(addr2)) + if (__is_kernel_text(addr) || + __is_kernel_text(addr2)) prot = PAGE_KERNEL_LARGE_EXEC; pages_2m++; @@ -354,7 +358,7 @@ repeat: */ pgprot_t init_prot = __pgprot(PTE_IDENT_ATTR); - if (is_kernel_text(addr)) + if (__is_kernel_text(addr)) prot = PAGE_KERNEL_EXEC; pages_4k++; @@ -881,7 +885,7 @@ static void mark_nxdata_nx(void) */ unsigned long start = PFN_ALIGN(_etext); /* - * This comes from is_kernel_text upper limit. Also HPAGE where used: + * This comes from __is_kernel_text upper limit. Also HPAGE where used: */ unsigned long size = (((unsigned long)__init_end + HPAGE_SIZE) & HPAGE_MASK) - start; -- cgit From fa0fca68e1e64bc53fcb7cfd4bfac27c1b14a955 Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Tue, 3 Mar 2020 23:41:44 +0300 Subject: x86/acpi: make "asmlinkage" part first thing in the function definition g++ insists that function declaration must start with extern "C" (which asmlinkage expands to). gcc doesn't care. Signed-off-by: Alexey Dobriyan Signed-off-by: Rafael J. Wysocki --- arch/x86/kernel/acpi/sleep.c | 2 +- arch/x86/kernel/acpi/sleep.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c index 26b7256f590f..ed3b04483972 100644 --- a/arch/x86/kernel/acpi/sleep.c +++ b/arch/x86/kernel/acpi/sleep.c @@ -43,7 +43,7 @@ unsigned long acpi_get_wakeup_address(void) * * Wrapper around acpi_enter_sleep_state() to be called by assmebly. */ -acpi_status asmlinkage __visible x86_acpi_enter_sleep_state(u8 state) +asmlinkage acpi_status __visible x86_acpi_enter_sleep_state(u8 state) { return acpi_enter_sleep_state(state); } diff --git a/arch/x86/kernel/acpi/sleep.h b/arch/x86/kernel/acpi/sleep.h index d06c2079b6c1..171a40c74db6 100644 --- a/arch/x86/kernel/acpi/sleep.h +++ b/arch/x86/kernel/acpi/sleep.h @@ -19,4 +19,4 @@ extern void do_suspend_lowlevel(void); extern int x86_acpi_suspend_lowlevel(void); -acpi_status asmlinkage x86_acpi_enter_sleep_state(u8 state); +asmlinkage acpi_status x86_acpi_enter_sleep_state(u8 state); -- cgit From 1ffb8d032d03d686e3b06378780944608cc77906 Mon Sep 17 00:00:00 2001 From: Alex Hung Date: Wed, 4 Mar 2020 15:55:29 -0700 Subject: acpi/x86: add a kernel parameter to disable ACPI BGRT BGRT is for displaying seamless OEM logo from booting to login screen; however, this mechanism does not always work well on all configurations and the OEM logo can be displayed multiple times. This looks worse than without BGRT enabled. This patch adds a kernel parameter to disable BGRT in boot time. This is easier than re-compiling a kernel with CONFIG_ACPI_BGRT disabled. Signed-off-by: Alex Hung Signed-off-by: Rafael J. Wysocki --- arch/x86/kernel/acpi/boot.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 04205ce127a1..d1757ceca8ae 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -45,6 +45,7 @@ EXPORT_SYMBOL(acpi_disabled); #define PREFIX "ACPI: " int acpi_noirq; /* skip ACPI IRQ initialization */ +int acpi_nobgrt; /* skip ACPI BGRT */ int acpi_pci_disabled; /* skip ACPI PCI scan and IRQ initialization */ EXPORT_SYMBOL(acpi_pci_disabled); @@ -1619,7 +1620,7 @@ int __init acpi_boot_init(void) acpi_process_madt(); acpi_table_parse(ACPI_SIG_HPET, acpi_parse_hpet); - if (IS_ENABLED(CONFIG_ACPI_BGRT)) + if (IS_ENABLED(CONFIG_ACPI_BGRT) && !acpi_nobgrt) acpi_table_parse(ACPI_SIG_BGRT, acpi_parse_bgrt); if (!acpi_noirq) @@ -1671,6 +1672,13 @@ static int __init parse_acpi(char *arg) } early_param("acpi", parse_acpi); +static int __init parse_acpi_bgrt(char *arg) +{ + acpi_nobgrt = true; + return 0; +} +early_param("bgrt_disable", parse_acpi_bgrt); + /* FIXME: Using pci= for an ACPI parameter is a travesty. */ static int __init parse_pci(char *arg) { -- cgit From ecb9c790999fd6c5af0f44783bd0217f0b89ec2b Mon Sep 17 00:00:00 2001 From: Jan Engelhardt Date: Thu, 5 Mar 2020 13:24:25 +0100 Subject: acpi/x86: ignore unspecified bit positions in the ACPI global lock field The value in "new" is constructed from "old" such that all bits defined as reserved by the ACPI spec[1] are left untouched. But if those bits do not happen to be all zero, "new < 3" will not evaluate to true. The firmware of the laptop(s) Medion MD63490 / Akoya P15648 comes with garbage inside the "FACS" ACPI table. The starting value is old=0x4944454d, therefore new=0x4944454e, which is >= 3. Mask off the reserved bits. [1] https://uefi.org/sites/default/files/resources/ACPI_6_2.pdf Link: https://bugzilla.kernel.org/show_bug.cgi?id=206553 Cc: All applicable Signed-off-by: Jan Engelhardt Signed-off-by: Rafael J. Wysocki --- arch/x86/kernel/acpi/boot.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index d1757ceca8ae..1ae5439a9a85 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -1748,7 +1748,7 @@ int __acpi_acquire_global_lock(unsigned int *lock) new = (((old & ~0x3) + 2) + ((old >> 1) & 0x1)); val = cmpxchg(lock, old, new); } while (unlikely (val != old)); - return (new < 3) ? -1 : 0; + return ((new & 0x3) < 3) ? -1 : 0; } int __acpi_release_global_lock(unsigned int *lock) -- cgit From 222f06e7cde57c38276be612428cc7e05d125a8c Mon Sep 17 00:00:00 2001 From: Chia-I Wu Date: Thu, 13 Feb 2020 13:30:34 -0800 Subject: KVM: vmx: rewrite the comment in vmx_get_mt_mask Better reflect the structure of the code and metion why we could not always honor the guest. Signed-off-by: Chia-I Wu Cc: Gurchetan Singh Cc: Gerd Hoffmann Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 27 +++++++++++++++++---------- 1 file changed, 17 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 63aaf44edd1f..5f6a26211183 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6898,17 +6898,24 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) u8 cache; u64 ipat = 0; - /* For VT-d and EPT combination - * 1. MMIO: always map as UC - * 2. EPT with VT-d: - * a. VT-d without snooping control feature: can't guarantee the - * result, try to trust guest. - * b. VT-d with snooping control feature: snooping control feature of - * VT-d engine can guarantee the cache correctness. Just set it - * to WB to keep consistent with host. So the same as item 3. - * 3. EPT without VT-d: always map as WB and set IPAT=1 to keep - * consistent with host MTRR + /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in + * memory aliases with conflicting memory types and sometimes MCEs. + * We have to be careful as to what are honored and when. + * + * For MMIO, guest CD/MTRR are ignored. The EPT memory type is set to + * UC. The effective memory type is UC or WC depending on guest PAT. + * This was historically the source of MCEs and we want to be + * conservative. + * + * When there is no need to deal with noncoherent DMA (e.g., no VT-d + * or VT-d has snoop control), guest CD/MTRR/PAT are all ignored. The + * EPT memory type is set to WB. The effective memory type is forced + * WB. + * + * Otherwise, we trust guest. Guest CD/MTRR/PAT are all honored. The + * EPT memory type is used to emulate guest CD/MTRR. */ + if (is_mmio) { cache = MTRR_TYPE_UNCACHABLE; goto exit; -- cgit From e630269841ab08ed54e1908eeee43ea739b74da2 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Sat, 15 Feb 2020 10:44:22 +0800 Subject: KVM: x86: Fix print format and coding style Use %u to print u32 var and correct some coding style. Signed-off-by: Miaohe Lin Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/i8254.c | 2 +- arch/x86/kvm/mmu/mmu.c | 3 +-- arch/x86/kvm/vmx/nested.c | 2 +- 3 files changed, 3 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c index b24c606ac04b..febca334c320 100644 --- a/arch/x86/kvm/i8254.c +++ b/arch/x86/kvm/i8254.c @@ -367,7 +367,7 @@ static void pit_load_count(struct kvm_pit *pit, int channel, u32 val) { struct kvm_kpit_state *ps = &pit->pit_state; - pr_debug("load_count val is %d, channel is %d\n", val, channel); + pr_debug("load_count val is %u, channel is %d\n", val, channel); /* * The largest possible initial count is 0; this is equivalent diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 87e9ba27ada1..50a78a01aa0f 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3568,8 +3568,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, * write-protected for dirty-logging or access tracking. */ if ((error_code & PFERR_WRITE_MASK) && - spte_can_locklessly_be_made_writable(spte)) - { + spte_can_locklessly_be_made_writable(spte)) { new_spte |= PT_WRITABLE_MASK; /* diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index e920d7834d73..0946122a8d3b 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -4423,7 +4423,7 @@ int get_vmx_mem_address(struct kvm_vcpu *vcpu, unsigned long exit_qualification, if (base_is_valid) off += kvm_register_read(vcpu, base_reg); if (index_is_valid) - off += kvm_register_read(vcpu, index_reg)< Date: Thu, 13 Feb 2020 10:53:25 +0800 Subject: KVM: x86: eliminate some unreachable code These code are unreachable, remove them. Signed-off-by: Miaohe Lin Reviewed-by: Krish Sadhukhan Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 1 - arch/x86/kvm/x86.c | 3 --- 2 files changed, 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 5f6a26211183..6d547b5de28f 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -4550,7 +4550,6 @@ static bool rmode_exception(struct kvm_vcpu *vcpu, int vec) case GP_VECTOR: case MF_VECTOR: return true; - break; } return false; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 359fcd395132..de7f3e3f277c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3077,7 +3077,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) break; case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff: return kvm_x2apic_msr_read(vcpu, msr_info->index, &msr_info->data); - break; case MSR_IA32_TSCDEADLINE: msr_info->data = kvm_get_lapic_tscdeadline_msr(vcpu); break; @@ -3160,7 +3159,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return kvm_hv_get_msr_common(vcpu, msr_info->index, &msr_info->data, msr_info->host_initiated); - break; case MSR_IA32_BBL_CR_CTL3: /* This legacy MSR exists but isn't fully documented in current * silicon. It is however accessed by winxp in very narrow @@ -8484,7 +8482,6 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu) break; default: return -EINTR; - break; } return 1; } -- cgit From d71f5e03257c6c62568e76ee204a678396a55722 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Mon, 17 Feb 2020 23:02:30 +0800 Subject: KVM: VMX: Add 'else' to split mutually exclusive case Each if branch in handle_external_interrupt_irqoff() is mutually exclusive. Add 'else' to make it clear and also avoid some unnecessary check. Signed-off-by: Miaohe Lin Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 6d547b5de28f..15ddaf857552 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6220,15 +6220,13 @@ static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx) vmx->exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); /* if exit due to PF check for async PF */ - if (is_page_fault(vmx->exit_intr_info)) + if (is_page_fault(vmx->exit_intr_info)) { vmx->vcpu.arch.apf.host_apf_reason = kvm_read_and_reset_pf_reason(); - /* Handle machine checks before interrupts are enabled */ - if (is_machine_check(vmx->exit_intr_info)) + } else if (is_machine_check(vmx->exit_intr_info)) { kvm_machine_check(); - /* We need to handle NMIs before interrupts are enabled */ - if (is_nmi(vmx->exit_intr_info)) { + } else if (is_nmi(vmx->exit_intr_info)) { kvm_before_interrupt(&vmx->vcpu); asm("int $2"); kvm_after_interrupt(&vmx->vcpu); -- cgit From 999eabcc89b0e99a699d2946086a97537c3eff14 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Thu, 13 Feb 2020 10:37:44 +0800 Subject: KVM: apic: remove unused function apic_lvt_vector() The function apic_lvt_vector() is unused now, remove it. Signed-off-by: Miaohe Lin Signed-off-by: Paolo Bonzini --- arch/x86/kvm/lapic.c | 5 ----- 1 file changed, 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index e3099c642fec..18a42dcab3c6 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -294,11 +294,6 @@ static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type) return !(kvm_lapic_get_reg(apic, lvt_type) & APIC_LVT_MASKED); } -static inline int apic_lvt_vector(struct kvm_lapic *apic, int lvt_type) -{ - return kvm_lapic_get_reg(apic, lvt_type) & APIC_VECTOR_MASK; -} - static inline int apic_lvtt_oneshot(struct kvm_lapic *apic) { return apic->lapic_timer.timer_mode == APIC_LVT_TIMER_ONESHOT; -- cgit From 92daa48b34d784748b575ae424def4ea7f024b2f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 15:03:08 -0800 Subject: KVM: x86: Add EMULTYPE_PF when emulation is triggered by a page fault Add a new emulation type flag to explicitly mark emulation related to a page fault. Move the propation of the GPA into the emulator from the page fault handler into x86_emulate_instruction, using EMULTYPE_PF as an indicator that cr2 is valid. Similarly, don't propagate cr2 into the exception.address when it's *not* valid. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 12 +++++++++--- arch/x86/kvm/mmu/mmu.c | 10 ++-------- arch/x86/kvm/x86.c | 25 +++++++++++++++++++------ 3 files changed, 30 insertions(+), 17 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 98959e8cd448..ba4cf00fdb42 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1381,8 +1381,9 @@ extern u64 kvm_mce_cap_supported; * decode the instruction length. For use *only* by * kvm_x86_ops->skip_emulated_instruction() implementations. * - * EMULTYPE_ALLOW_RETRY - Set when the emulator should resume the guest to - * retry native execution under certain conditions. + * EMULTYPE_ALLOW_RETRY_PF - Set when the emulator should resume the guest to + * retry native execution under certain conditions, + * Can only be set in conjunction with EMULTYPE_PF. * * EMULTYPE_TRAP_UD_FORCED - Set when emulating an intercepted #UD that was * triggered by KVM's magic "force emulation" prefix, @@ -1395,13 +1396,18 @@ extern u64 kvm_mce_cap_supported; * backdoor emulation, which is opt in via module param. * VMware backoor emulation handles select instructions * and reinjects the #GP for all other cases. + * + * EMULTYPE_PF - Set when emulating MMIO by way of an intercepted #PF, in which + * case the CR2/GPA value pass on the stack is valid. */ #define EMULTYPE_NO_DECODE (1 << 0) #define EMULTYPE_TRAP_UD (1 << 1) #define EMULTYPE_SKIP (1 << 2) -#define EMULTYPE_ALLOW_RETRY (1 << 3) +#define EMULTYPE_ALLOW_RETRY_PF (1 << 3) #define EMULTYPE_TRAP_UD_FORCED (1 << 4) #define EMULTYPE_VMWARE_GP (1 << 5) +#define EMULTYPE_PF (1 << 6) + int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type); int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu, void *insn, int insn_len); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 50a78a01aa0f..a1dbd13abcaf 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5415,18 +5415,12 @@ EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page_virt); int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, void *insn, int insn_len) { - int r, emulation_type = 0; + int r, emulation_type = EMULTYPE_PF; bool direct = vcpu->arch.mmu->direct_map; if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa))) return RET_PF_RETRY; - /* With shadow page tables, fault_address contains a GVA or nGPA. */ - if (vcpu->arch.mmu->direct_map) { - vcpu->arch.gpa_available = true; - vcpu->arch.gpa_val = cr2_or_gpa; - } - r = RET_PF_INVALID; if (unlikely(error_code & PFERR_RSVD_MASK)) { r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct); @@ -5470,7 +5464,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, * for L1 isn't going to magically fix whatever issue cause L2 to fail. */ if (!mmio_info_in_cache(vcpu, cr2_or_gpa, direct) && !is_guest_mode(vcpu)) - emulation_type = EMULTYPE_ALLOW_RETRY; + emulation_type |= EMULTYPE_ALLOW_RETRY_PF; emulate: /* * On AMD platforms, under certain conditions insn_len may be zero on #NPF. diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index de7f3e3f277c..1ce7bbc075e4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6492,10 +6492,11 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, gpa_t gpa = cr2_or_gpa; kvm_pfn_t pfn; - if (!(emulation_type & EMULTYPE_ALLOW_RETRY)) + if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF)) return false; - if (WARN_ON_ONCE(is_guest_mode(vcpu))) + if (WARN_ON_ONCE(is_guest_mode(vcpu)) || + WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF))) return false; if (!vcpu->arch.mmu->direct_map) { @@ -6583,10 +6584,11 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt, */ vcpu->arch.last_retry_eip = vcpu->arch.last_retry_addr = 0; - if (!(emulation_type & EMULTYPE_ALLOW_RETRY)) + if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF)) return false; - if (WARN_ON_ONCE(is_guest_mode(vcpu))) + if (WARN_ON_ONCE(is_guest_mode(vcpu)) || + WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF))) return false; if (x86_page_table_writing_insn(ctxt)) @@ -6839,8 +6841,19 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, } restart: - /* Save the faulting GPA (cr2) in the address field */ - ctxt->exception.address = cr2_or_gpa; + if (emulation_type & EMULTYPE_PF) { + /* Save the faulting GPA (cr2) in the address field */ + ctxt->exception.address = cr2_or_gpa; + + /* With shadow page tables, cr2 contains a GVA or nGPA. */ + if (vcpu->arch.mmu->direct_map) { + vcpu->arch.gpa_available = true; + vcpu->arch.gpa_val = cr2_or_gpa; + } + } else { + /* Sanitize the address out of an abundance of paranoia. */ + ctxt->exception.address = 0; + } r = x86_emulate_insn(ctxt); -- cgit From 744e699c7e9913a9b1f825f189ab547c4da0d182 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 15:03:09 -0800 Subject: KVM: x86: Move gpa_val and gpa_available into the emulator context Move the GPA tracking into the emulator context now that the context is guaranteed to be initialized via __init_emulate_ctxt() prior to dereferencing gpa_{available,val}, i.e. now that seeing a stale gpa_available will also trigger a WARN due to an invalid context. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_emulate.h | 4 ++++ arch/x86/include/asm/kvm_host.h | 4 ---- arch/x86/kvm/x86.c | 13 ++++++------- 3 files changed, 10 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 2a8f2bd2e5cf..bf5f5e476f65 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -319,6 +319,10 @@ struct x86_emulate_ctxt { bool have_exception; struct x86_exception exception; + /* GPA available */ + bool gpa_available; + gpa_t gpa_val; + /* * decode cache */ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ba4cf00fdb42..06c21f14298b 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -808,10 +808,6 @@ struct kvm_vcpu_arch { int pending_ioapic_eoi; int pending_external_vector; - /* GPA available */ - bool gpa_available; - gpa_t gpa_val; - /* be preempted when it's in kernel-mode(cpl=0) */ bool preempted_in_kernel; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1ce7bbc075e4..4f026f877fc1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5745,10 +5745,9 @@ static int emulator_read_write_onepage(unsigned long addr, void *val, * operation using rep will only have the initial GPA from the NPF * occurred. */ - if (vcpu->arch.gpa_available && - emulator_can_use_gpa(ctxt) && - (addr & ~PAGE_MASK) == (vcpu->arch.gpa_val & ~PAGE_MASK)) { - gpa = vcpu->arch.gpa_val; + if (ctxt->gpa_available && emulator_can_use_gpa(ctxt) && + (addr & ~PAGE_MASK) == (ctxt->gpa_val & ~PAGE_MASK)) { + gpa = ctxt->gpa_val; ret = vcpu_is_mmio_gpa(vcpu, addr, gpa, write); } else { ret = vcpu_mmio_gva_to_gpa(vcpu, addr, &gpa, exception, write); @@ -6417,6 +6416,7 @@ static void init_emulate_ctxt(struct kvm_vcpu *vcpu) kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); + ctxt->gpa_available = false; ctxt->eflags = kvm_get_rflags(vcpu); ctxt->tf = (ctxt->eflags & X86_EFLAGS_TF) != 0; @@ -6847,8 +6847,8 @@ restart: /* With shadow page tables, cr2 contains a GVA or nGPA. */ if (vcpu->arch.mmu->direct_map) { - vcpu->arch.gpa_available = true; - vcpu->arch.gpa_val = cr2_or_gpa; + ctxt->gpa_available = true; + ctxt->gpa_val = cr2_or_gpa; } } else { /* Sanitize the address out of an abundance of paranoia. */ @@ -8454,7 +8454,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (vcpu->arch.apic_attention) kvm_lapic_sync_from_vapic(vcpu); - vcpu->arch.gpa_available = false; r = kvm_x86_ops->handle_exit(vcpu, exit_fastpath); return r; -- cgit From edd4fa37baa6ee8e44dc65523b27bd6fe44c94de Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 13:07:15 -0800 Subject: KVM: x86: Allocate new rmap and large page tracking when moving memslot Reallocate a rmap array and recalcuate large page compatibility when moving an existing memslot to correctly handle the alignment properties of the new memslot. The number of rmap entries required at each level is dependent on the alignment of the memslot's base gfn with respect to that level, e.g. moving a large-page aligned memslot so that it becomes unaligned will increase the number of rmap entries needed at the now unaligned level. Not updating the rmap array is the most obvious bug, as KVM accesses garbage data beyond the end of the rmap. KVM interprets the bad data as pointers, leading to non-canonical #GPs, unexpected #PFs, etc... general protection fault: 0000 [#1] SMP CPU: 0 PID: 1909 Comm: move_memory_reg Not tainted 5.4.0-rc7+ #139 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:rmap_get_first+0x37/0x50 [kvm] Code: <48> 8b 3b 48 85 ff 74 ec e8 6c f4 ff ff 85 c0 74 e3 48 89 d8 5b c3 RSP: 0018:ffffc9000021bbc8 EFLAGS: 00010246 RAX: ffff00617461642e RBX: ffff00617461642e RCX: 0000000000000012 RDX: ffff88827400f568 RSI: ffffc9000021bbe0 RDI: ffff88827400f570 RBP: 0010000000000000 R08: ffffc9000021bd00 R09: ffffc9000021bda8 R10: ffffc9000021bc48 R11: 0000000000000000 R12: 0030000000000000 R13: 0000000000000000 R14: ffff88827427d700 R15: ffffc9000021bce8 FS: 00007f7eda014700(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7ed9216ff8 CR3: 0000000274391003 CR4: 0000000000162eb0 Call Trace: kvm_mmu_slot_set_dirty+0xa1/0x150 [kvm] __kvm_set_memory_region.part.64+0x559/0x960 [kvm] kvm_set_memory_region+0x45/0x60 [kvm] kvm_vm_ioctl+0x30f/0x920 [kvm] do_vfs_ioctl+0xa1/0x620 ksys_ioctl+0x66/0x70 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x4c/0x170 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f7ed9911f47 Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 6f 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc00937498 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000001ab0010 RCX: 00007f7ed9911f47 RDX: 0000000001ab1350 RSI: 000000004020ae46 RDI: 0000000000000004 RBP: 000000000000000a R08: 0000000000000000 R09: 00007f7ed9214700 R10: 00007f7ed92149d0 R11: 0000000000000246 R12: 00000000bffff000 R13: 0000000000000003 R14: 00007f7ed9215000 R15: 0000000000000000 Modules linked in: kvm_intel kvm irqbypass ---[ end trace 0c5f570b3358ca89 ]--- The disallow_lpage tracking is more subtle. Failure to update results in KVM creating large pages when it shouldn't, either due to stale data or again due to indexing beyond the end of the metadata arrays, which can lead to memory corruption and/or leaking data to guest/userspace. Note, the arrays for the old memslot are freed by the unconditional call to kvm_free_memslot() in __kvm_set_memory_region(). Fixes: 05da45583de9b ("KVM: MMU: large page support") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson Reviewed-by: Peter Xu Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4f026f877fc1..6387917d08ec 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9878,6 +9878,13 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, { int i; + /* + * Clear out the previous array pointers for the KVM_MR_MOVE case. The + * old arrays will be freed by __kvm_set_memory_region() if installing + * the new memslot is successful. + */ + memset(&slot->arch, 0, sizeof(slot->arch)); + for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) { struct kvm_lpage_info *linfo; unsigned long ugfn; @@ -9959,6 +9966,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, const struct kvm_userspace_memory_region *mem, enum kvm_mr_change change) { + if (change == KVM_MR_MOVE) + return kvm_arch_create_memslot(kvm, memslot, + mem->memory_size >> PAGE_SHIFT); + return 0; } -- cgit From 0dab98b7ade66598cab3b59931995ce91bd61258 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 13:07:19 -0800 Subject: KVM: x86: Allocate memslot resources during prepare_memory_region() Allocate the various metadata structures associated with a new memslot during kvm_arch_prepare_memory_region(), which paves the way for removing kvm_arch_create_memslot() altogether. Moving x86's memory allocation only changes the order of kernel memory allocations between x86 and common KVM code. Reviewed-by: Peter Xu Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6387917d08ec..94ea7b914fc7 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9875,6 +9875,12 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free, int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, unsigned long npages) +{ + return 0; +} + +static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot, + unsigned long npages) { int i; @@ -9966,10 +9972,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, const struct kvm_userspace_memory_region *mem, enum kvm_mr_change change) { - if (change == KVM_MR_MOVE) - return kvm_arch_create_memslot(kvm, memslot, - mem->memory_size >> PAGE_SHIFT); - + if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) + return kvm_alloc_memslot_metadata(memslot, + mem->memory_size >> PAGE_SHIFT); return 0; } -- cgit From 414de7abbf809f046511269797d9f2310b88e036 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 13:07:20 -0800 Subject: KVM: Drop kvm_arch_create_memslot() Remove kvm_arch_create_memslot() now that all arch implementations are effectively nops. Removing kvm_arch_create_memslot() eliminates the possibility for arch specific code to allocate memory prior to setting a memslot, which sets the stage for simplifying kvm_free_memslot(). Cc: Janosch Frank Acked-by: Christian Borntraeger Reviewed-by: Peter Xu Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 6 ------ 1 file changed, 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 94ea7b914fc7..c23b08f48079 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9873,12 +9873,6 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free, kvm_page_track_free_memslot(free, dont); } -int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, - unsigned long npages) -{ - return 0; -} - static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot, unsigned long npages) { -- cgit From 9d4c197c0e94c372ceffd2ffc53a23518f301ed9 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 13:07:24 -0800 Subject: KVM: Drop "const" attribute from old memslot in commit_memory_region() Drop the "const" attribute from @old in kvm_arch_commit_memory_region() to allow arch specific code to free arch specific resources in the old memslot without having to cast away the attribute. Freeing resources in kvm_arch_commit_memory_region() paves the way for simplifying kvm_free_memslot() by eliminating the last usage of its @dont param. Reviewed-by: Peter Xu Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c23b08f48079..a3c92c5077d0 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10024,7 +10024,7 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm, void kvm_arch_commit_memory_region(struct kvm *kvm, const struct kvm_userspace_memory_region *mem, - const struct kvm_memory_slot *old, + struct kvm_memory_slot *old, const struct kvm_memory_slot *new, enum kvm_mr_change change) { -- cgit From 21198846de1c348304280436caf3a5dc936d5c65 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 13:07:25 -0800 Subject: KVM: x86: Free arrays for old memslot when moving memslot's base gfn Explicitly free the metadata arrays (stored in slot->arch) in the old memslot structure when moving the memslot's base gfn is committed. This eliminates x86's dependency on kvm_free_memslot() being called when a memslot move is committed, and paves the way for removing the funky code in kvm_free_memslot() that conditionally frees structures based on its @dont param. Reviewed-by: Peter Xu Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a3c92c5077d0..6068208917dd 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10066,6 +10066,10 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, */ if (change != KVM_MR_DELETE) kvm_mmu_slot_apply_flags(kvm, (struct kvm_memory_slot *) new); + + /* Free the arrays associated with the old memslot. */ + if (change == KVM_MR_MOVE) + kvm_arch_free_memslot(kvm, old, NULL); } void kvm_arch_flush_shadow_all(struct kvm *kvm) -- cgit From e96c81ee89d80e1a0fe50a0e9be40c1b77e14aaa Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 13:07:27 -0800 Subject: KVM: Simplify kvm_free_memslot() and all its descendents Now that all callers of kvm_free_memslot() pass NULL for @dont, remove the param from the top-level routine and all arch's implementations. No functional change intended. Tested-by: Christoffer Dall Reviewed-by: Peter Xu Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_page_track.h | 3 +-- arch/x86/kvm/mmu/page_track.c | 15 ++++++--------- arch/x86/kvm/x86.c | 21 ++++++++------------- 3 files changed, 15 insertions(+), 24 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h index 172f9749dbb2..87bd6025d91d 100644 --- a/arch/x86/include/asm/kvm_page_track.h +++ b/arch/x86/include/asm/kvm_page_track.h @@ -49,8 +49,7 @@ struct kvm_page_track_notifier_node { void kvm_page_track_init(struct kvm *kvm); void kvm_page_track_cleanup(struct kvm *kvm); -void kvm_page_track_free_memslot(struct kvm_memory_slot *free, - struct kvm_memory_slot *dont); +void kvm_page_track_free_memslot(struct kvm_memory_slot *slot); int kvm_page_track_create_memslot(struct kvm_memory_slot *slot, unsigned long npages); diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c index 3521e2d176f2..d125ec379c79 100644 --- a/arch/x86/kvm/mmu/page_track.c +++ b/arch/x86/kvm/mmu/page_track.c @@ -19,17 +19,14 @@ #include "mmu.h" -void kvm_page_track_free_memslot(struct kvm_memory_slot *free, - struct kvm_memory_slot *dont) +void kvm_page_track_free_memslot(struct kvm_memory_slot *slot) { int i; - for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) - if (!dont || free->arch.gfn_track[i] != - dont->arch.gfn_track[i]) { - kvfree(free->arch.gfn_track[i]); - free->arch.gfn_track[i] = NULL; - } + for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) { + kvfree(slot->arch.gfn_track[i]); + slot->arch.gfn_track[i] = NULL; + } } int kvm_page_track_create_memslot(struct kvm_memory_slot *slot, @@ -48,7 +45,7 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot *slot, return 0; track_free: - kvm_page_track_free_memslot(slot, NULL); + kvm_page_track_free_memslot(slot); return -ENOMEM; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6068208917dd..88885e3147c4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9850,27 +9850,22 @@ void kvm_arch_destroy_vm(struct kvm *kvm) kvm_hv_destroy_vm(kvm); } -void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free, - struct kvm_memory_slot *dont) +void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) { int i; for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) { - if (!dont || free->arch.rmap[i] != dont->arch.rmap[i]) { - kvfree(free->arch.rmap[i]); - free->arch.rmap[i] = NULL; - } + kvfree(slot->arch.rmap[i]); + slot->arch.rmap[i] = NULL; + if (i == 0) continue; - if (!dont || free->arch.lpage_info[i - 1] != - dont->arch.lpage_info[i - 1]) { - kvfree(free->arch.lpage_info[i - 1]); - free->arch.lpage_info[i - 1] = NULL; - } + kvfree(slot->arch.lpage_info[i - 1]); + slot->arch.lpage_info[i - 1] = NULL; } - kvm_page_track_free_memslot(free, dont); + kvm_page_track_free_memslot(slot); } static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot, @@ -10069,7 +10064,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, /* Free the arrays associated with the old memslot. */ if (change == KVM_MR_MOVE) - kvm_arch_free_memslot(kvm, old, NULL); + kvm_arch_free_memslot(kvm, old); } void kvm_arch_flush_shadow_all(struct kvm *kvm) -- cgit From 0dff084607bd555d6f74db2af8406a9da9f0fc3a Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 13:07:29 -0800 Subject: KVM: Provide common implementation for generic dirty log functions Move the implementations of KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG for CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT into common KVM code. The arch specific implemenations are extremely similar, differing only in whether the dirty log needs to be sync'd from hardware (x86) and how the TLBs are flushed. Add new arch hooks to handle sync and TLB flush; the sync will also be used for non-generic dirty log support in a future patch (s390). The ulterior motive for providing a common implementation is to eliminate the dependency between arch and common code with respect to the memslot referenced by the dirty log, i.e. to make it obvious in the code that the validity of the memslot is guaranteed, as a future patch will rework memslot handling such that id_to_memslot() can return NULL. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 61 ++++-------------------------------------------------- 1 file changed, 4 insertions(+), 57 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 88885e3147c4..27b97e546980 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4759,77 +4759,24 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm, return 0; } -/** - * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot - * @kvm: kvm instance - * @log: slot id and address to which we copy the log - * - * Steps 1-4 below provide general overview of dirty page logging. See - * kvm_get_dirty_log_protect() function description for additional details. - * - * We call kvm_get_dirty_log_protect() to handle steps 1-3, upon return we - * always flush the TLB (step 4) even if previous step failed and the dirty - * bitmap may be corrupt. Regardless of previous outcome the KVM logging API - * does not preclude user space subsequent dirty log read. Flushing TLB ensures - * writes will be marked dirty for next log read. - * - * 1. Take a snapshot of the bit and clear it if needed. - * 2. Write protect the corresponding page. - * 3. Copy the snapshot to the userspace. - * 4. Flush TLB's if needed. - */ -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) +void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot) { - bool flush = false; - int r; - - mutex_lock(&kvm->slots_lock); - /* * Flush potentially hardware-cached dirty pages to dirty_bitmap. */ if (kvm_x86_ops->flush_log_dirty) kvm_x86_ops->flush_log_dirty(kvm); - - r = kvm_get_dirty_log_protect(kvm, log, &flush); - - /* - * All the TLBs can be flushed out of mmu lock, see the comments in - * kvm_mmu_slot_remove_write_access(). - */ - lockdep_assert_held(&kvm->slots_lock); - if (flush) - kvm_flush_remote_tlbs(kvm); - - mutex_unlock(&kvm->slots_lock); - return r; } -int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm, struct kvm_clear_dirty_log *log) +void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm, + struct kvm_memory_slot *memslot) { - bool flush = false; - int r; - - mutex_lock(&kvm->slots_lock); - - /* - * Flush potentially hardware-cached dirty pages to dirty_bitmap. - */ - if (kvm_x86_ops->flush_log_dirty) - kvm_x86_ops->flush_log_dirty(kvm); - - r = kvm_clear_dirty_log_protect(kvm, log, &flush); - /* * All the TLBs can be flushed out of mmu lock, see the comments in * kvm_mmu_slot_remove_write_access(). */ lockdep_assert_held(&kvm->slots_lock); - if (flush) - kvm_flush_remote_tlbs(kvm); - - mutex_unlock(&kvm->slots_lock); - return r; + kvm_flush_remote_tlbs(kvm); } int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event, -- cgit From 0577d1abe704c315bb5cdfc71f4ca7b9b5358f59 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 13:07:31 -0800 Subject: KVM: Terminate memslot walks via used_slots Refactor memslot handling to treat the number of used slots as the de facto size of the memslot array, e.g. return NULL from id_to_memslot() when an invalid index is provided instead of relying on npages==0 to detect an invalid memslot. Rework the sorting and walking of memslots in advance of dynamically sizing memslots to aid bisection and debug, e.g. with luck, a bug in the refactoring will bisect here and/or hit a WARN instead of randomly corrupting memory. Alternatively, a global null/invalid memslot could be returned, i.e. so callers of id_to_memslot() don't have to explicitly check for a NULL memslot, but that approach runs the risk of introducing difficult-to- debug issues, e.g. if the global null slot is modified. Constifying the return from id_to_memslot() to combat such issues is possible, but would require a massive refactoring of arch specific code and would still be susceptible to casting shenanigans. Add function comments to update_memslots() and search_memslots() to explicitly (and loudly) state how memslots are sorted. Opportunistically stuff @hva with a non-canonical value when deleting a private memslot on x86 to detect bogus usage of the freed slot. No functional change intended. Tested-by: Christoffer Dall Tested-by: Marc Zyngier Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 27b97e546980..13ac4d0f9794 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9715,9 +9715,9 @@ void kvm_arch_sync_events(struct kvm *kvm) int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size) { int i, r; - unsigned long hva; + unsigned long hva, uninitialized_var(old_npages); struct kvm_memslots *slots = kvm_memslots(kvm); - struct kvm_memory_slot *slot, old; + struct kvm_memory_slot *slot; /* Called with kvm->slots_lock held. */ if (WARN_ON(id >= KVM_MEM_SLOTS_NUM)) @@ -9725,7 +9725,7 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size) slot = id_to_memslot(slots, id); if (size) { - if (slot->npages) + if (slot && slot->npages) return -EEXIST; /* @@ -9737,13 +9737,14 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size) if (IS_ERR((void *)hva)) return PTR_ERR((void *)hva); } else { - if (!slot->npages) + if (!slot || !slot->npages) return 0; - hva = 0; + /* Stuff a non-canonical value to catch use-after-delete. */ + hva = 0xdeadull << 48; + old_npages = slot->npages; } - old = *slot; for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { struct kvm_userspace_memory_region m; @@ -9758,7 +9759,7 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size) } if (!size) - vm_munmap(old.userspace_addr, old.npages * PAGE_SIZE); + vm_munmap(hva, old_npages * PAGE_SIZE); return 0; } -- cgit From b3594ffbf932c8e8b23201cdc2c173708a4472dc Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 13:07:34 -0800 Subject: KVM: x86/mmu: Move kvm_arch_flush_remote_tlbs_memslot() to mmu.c Move kvm_arch_flush_remote_tlbs_memslot() from x86.c to mmu.c in preparation for calling kvm_flush_remote_tlbs_with_address() instead of kvm_flush_remote_tlbs(). The with_address() variant is statically defined in mmu.c, arguably kvm_arch_flush_remote_tlbs_memslot() belongs in mmu.c anyways, and defining kvm_arch_flush_remote_tlbs_memslot() in mmu.c will allow the compiler to inline said function when a future patch consolidates open coded variants of the function. No functional change intended. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 11 +++++++++++ arch/x86/kvm/x86.c | 11 ----------- 2 files changed, 11 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index a1dbd13abcaf..2ebac21ab024 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5934,6 +5934,17 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, spin_unlock(&kvm->mmu_lock); } +void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm, + struct kvm_memory_slot *memslot) +{ + /* + * All the TLBs can be flushed out of mmu lock, see the comments in + * kvm_mmu_slot_remove_write_access(). + */ + lockdep_assert_held(&kvm->slots_lock); + kvm_flush_remote_tlbs(kvm); +} + void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, struct kvm_memory_slot *memslot) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 13ac4d0f9794..4b4749768f3d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4768,17 +4768,6 @@ void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot) kvm_x86_ops->flush_log_dirty(kvm); } -void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm, - struct kvm_memory_slot *memslot) -{ - /* - * All the TLBs can be flushed out of mmu lock, see the comments in - * kvm_mmu_slot_remove_write_access(). - */ - lockdep_assert_held(&kvm->slots_lock); - kvm_flush_remote_tlbs(kvm); -} - int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event, bool line_status) { -- cgit From cec37648f40b0ef7e6260ffa588c51fd3754262b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 13:07:35 -0800 Subject: KVM: x86/mmu: Use range-based TLB flush for dirty log memslot flush Use the with_address() variant when performing a TLB flush for a specific memslot via kvm_arch_flush_remote_tlbs_memslot(), i.e. when flushing after clearing dirty bits during KVM_{GET,CLEAR}_DIRTY_LOG. This aligns all dirty log memslot-specific TLB flushes to use the with_address() variant and paves the way for consolidating the relevant code. Note, moving to the with_address() variant only affects functionality when running as a HyperV guest. Cc: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 2ebac21ab024..7268c6ffb643 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5942,7 +5942,8 @@ void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm, * kvm_mmu_slot_remove_write_access(). */ lockdep_assert_held(&kvm->slots_lock); - kvm_flush_remote_tlbs(kvm); + kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn, + memslot->npages); } void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, -- cgit From 7f42aa76d4a5587939ee98d541ea1b97b382f408 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 13:07:36 -0800 Subject: KVM: x86/mmu: Consolidate open coded variants of memslot TLB flushes Replace open coded instances of kvm_arch_flush_remote_tlbs_memslot()'s functionality with calls to the aforementioned function. Update the comment in kvm_arch_flush_remote_tlbs_memslot() to elaborate on how it is used and why it asserts that slots_lock is held. No functional change intended. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 34 +++++++++------------------------- 1 file changed, 9 insertions(+), 25 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 7268c6ffb643..c4e0b97f82ac 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5862,13 +5862,6 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, false); spin_unlock(&kvm->mmu_lock); - /* - * kvm_mmu_slot_remove_write_access() and kvm_vm_ioctl_get_dirty_log() - * which do tlb flush out of mmu-lock should be serialized by - * kvm->slots_lock otherwise tlb flush would be missed. - */ - lockdep_assert_held(&kvm->slots_lock); - /* * We can flush all the TLBs out of the mmu lock without TLB * corruption since we just change the spte from writable to @@ -5881,8 +5874,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, * on PT_WRITABLE_MASK anymore. */ if (flush) - kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn, - memslot->npages); + kvm_arch_flush_remote_tlbs_memslot(kvm, memslot); } static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, @@ -5938,8 +5930,11 @@ void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot) { /* - * All the TLBs can be flushed out of mmu lock, see the comments in - * kvm_mmu_slot_remove_write_access(). + * All current use cases for flushing the TLBs for a specific memslot + * are related to dirty logging, and do the TLB flush out of mmu_lock. + * The interaction between the various operations on memslot must be + * serialized by slots_locks to ensure the TLB flush from one operation + * is observed by any other operation on the same memslot. */ lockdep_assert_held(&kvm->slots_lock); kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn, @@ -5955,8 +5950,6 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false); spin_unlock(&kvm->mmu_lock); - lockdep_assert_held(&kvm->slots_lock); - /* * It's also safe to flush TLBs out of mmu lock here as currently this * function is only used for dirty logging, in which case flushing TLB @@ -5964,8 +5957,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, * dirty_bitmap. */ if (flush) - kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn, - memslot->npages); + kvm_arch_flush_remote_tlbs_memslot(kvm, memslot); } EXPORT_SYMBOL_GPL(kvm_mmu_slot_leaf_clear_dirty); @@ -5979,12 +5971,8 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm, false); spin_unlock(&kvm->mmu_lock); - /* see kvm_mmu_slot_remove_write_access */ - lockdep_assert_held(&kvm->slots_lock); - if (flush) - kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn, - memslot->npages); + kvm_arch_flush_remote_tlbs_memslot(kvm, memslot); } EXPORT_SYMBOL_GPL(kvm_mmu_slot_largepage_remove_write_access); @@ -5997,12 +5985,8 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm, flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false); spin_unlock(&kvm->mmu_lock); - lockdep_assert_held(&kvm->slots_lock); - - /* see kvm_mmu_slot_leaf_clear_dirty */ if (flush) - kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn, - memslot->npages); + kvm_arch_flush_remote_tlbs_memslot(kvm, memslot); } EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty); -- cgit From 168d918f2643d7d3f0240e768d40b4f8aba3540a Mon Sep 17 00:00:00 2001 From: Eric Hankland Date: Fri, 21 Feb 2020 18:34:13 -0800 Subject: KVM: x86: Adjust counter sample period after a wrmsr The sample_period of a counter tracks when that counter will overflow and set global status/trigger a PMI. However this currently only gets set when the initial counter is created or when a counter is resumed; this updates the sample period after a wrmsr so running counters will accurately reflect their new value. Signed-off-by: Eric Hankland Signed-off-by: Paolo Bonzini --- arch/x86/kvm/pmu.c | 4 ++-- arch/x86/kvm/pmu.h | 9 +++++++++ arch/x86/kvm/vmx/pmu_intel.c | 6 ++++++ 3 files changed, 17 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index bcc6a73d6628..d1f8ca57d354 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -111,7 +111,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, .config = config, }; - attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc); + attr.sample_period = get_sample_period(pmc, pmc->counter); if (in_tx) attr.config |= HSW_IN_TX; @@ -158,7 +158,7 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc) /* recalibrate sample period and check if it's accepted by perf core */ if (perf_event_period(pmc->perf_event, - (-pmc->counter) & pmc_bitmask(pmc))) + get_sample_period(pmc, pmc->counter))) return false; /* reuse perf_event to serve as pmc_reprogram_counter() does*/ diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 13332984b6d5..d7da2b9e0755 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -129,6 +129,15 @@ static inline struct kvm_pmc *get_fixed_pmc(struct kvm_pmu *pmu, u32 msr) return NULL; } +static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value) +{ + u64 sample_period = (-counter_value) & pmc_bitmask(pmc); + + if (!sample_period) + sample_period = pmc_bitmask(pmc) + 1; + return sample_period; +} + void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel); void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx); void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx); diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index fd21cdb10b79..e933541751fb 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -263,9 +263,15 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (!msr_info->host_initiated) data = (s64)(s32)data; pmc->counter += data - pmc_read_counter(pmc); + if (pmc->perf_event) + perf_event_period(pmc->perf_event, + get_sample_period(pmc, data)); return 0; } else if ((pmc = get_fixed_pmc(pmu, msr))) { pmc->counter += data - pmc_read_counter(pmc); + if (pmc->perf_event) + perf_event_period(pmc->perf_event, + get_sample_period(pmc, data)); return 0; } else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) { if (data == pmc->eventsel) -- cgit From d18b2f43b9147c8005ae0844fb445d8cc6a87e31 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sun, 26 Jan 2020 16:41:11 -0800 Subject: KVM: x86: Gracefully handle __vmalloc() failure during VM allocation Check the result of __vmalloc() to avoid dereferencing a NULL pointer in the event that allocation failres. Fixes: d1e5b0e98ea27 ("kvm: Make VM ioctl do valloc for some archs") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 4 ++++ arch/x86/kvm/vmx/vmx.c | 4 ++++ 2 files changed, 8 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ad3f5b178a03..0455bd105bbe 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1949,6 +1949,10 @@ static struct kvm *svm_vm_alloc(void) struct kvm_svm *kvm_svm = __vmalloc(sizeof(struct kvm_svm), GFP_KERNEL_ACCOUNT | __GFP_ZERO, PAGE_KERNEL); + + if (!kvm_svm) + return NULL; + return &kvm_svm->kvm; } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 15ddaf857552..933c72b97b50 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6684,6 +6684,10 @@ static struct kvm *vmx_vm_alloc(void) struct kvm_vmx *kvm_vmx = __vmalloc(sizeof(struct kvm_vmx), GFP_KERNEL_ACCOUNT | __GFP_ZERO, PAGE_KERNEL); + + if (!kvm_vmx) + return NULL; + return &kvm_vmx->kvm; } -- cgit From 1a625056cc57c1fb6fe9b4500ef07215e942c9bf Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sun, 26 Jan 2020 16:41:12 -0800 Subject: KVM: x86: Directly return __vmalloc() result in ->vm_alloc() Directly return the __vmalloc() result in {svm,vmx}_vm_alloc() to pave the way for handling VM alloc/free in common x86 code, and to obviate the need to check the result of __vmalloc() in vendor specific code. Add a build-time assertion to ensure each structs' "kvm" field stays at offset 0, which allows interpreting a "struct kvm_{svm,vmx}" as a "struct kvm". Signed-off-by: Sean Christopherson Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 12 ++++-------- arch/x86/kvm/vmx/vmx.c | 12 ++++-------- 2 files changed, 8 insertions(+), 16 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 0455bd105bbe..fdc88f7350c6 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1946,19 +1946,15 @@ static void __unregister_enc_region_locked(struct kvm *kvm, static struct kvm *svm_vm_alloc(void) { - struct kvm_svm *kvm_svm = __vmalloc(sizeof(struct kvm_svm), - GFP_KERNEL_ACCOUNT | __GFP_ZERO, - PAGE_KERNEL); + BUILD_BUG_ON(offsetof(struct kvm_svm, kvm) != 0); - if (!kvm_svm) - return NULL; - - return &kvm_svm->kvm; + return __vmalloc(sizeof(struct kvm_svm), + GFP_KERNEL_ACCOUNT | __GFP_ZERO, PAGE_KERNEL); } static void svm_vm_free(struct kvm *kvm) { - vfree(to_kvm_svm(kvm)); + vfree(kvm); } static void sev_vm_destroy(struct kvm *kvm) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 933c72b97b50..7d19ed7c13d8 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6681,20 +6681,16 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu) static struct kvm *vmx_vm_alloc(void) { - struct kvm_vmx *kvm_vmx = __vmalloc(sizeof(struct kvm_vmx), - GFP_KERNEL_ACCOUNT | __GFP_ZERO, - PAGE_KERNEL); + BUILD_BUG_ON(offsetof(struct kvm_vmx, kvm) != 0); - if (!kvm_vmx) - return NULL; - - return &kvm_vmx->kvm; + return __vmalloc(sizeof(struct kvm_vmx), + GFP_KERNEL_ACCOUNT | __GFP_ZERO, PAGE_KERNEL); } static void vmx_vm_free(struct kvm *kvm) { kfree(kvm->arch.hyperv.hv_pa_pg); - vfree(to_kvm_vmx(kvm)); + vfree(kvm); } static void vmx_free_vcpu(struct kvm_vcpu *vcpu) -- cgit From 562b6b089d64724278de61114da658fb0a516250 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sun, 26 Jan 2020 16:41:13 -0800 Subject: KVM: x86: Consolidate VM allocation and free for VMX and SVM Move the VM allocation and free code to common x86 as the logic is more or less identical across SVM and VMX. Note, although hyperv.hv_pa_pg is part of the common kvm->arch, it's (currently) only allocated by VMX VMs. But, since kfree() plays nice when passed a NULL pointer, the superfluous call for SVM is harmless and avoids future churn if SVM gains support for HyperV's direct TLB flush. Signed-off-by: Sean Christopherson Reviewed-by: Vitaly Kuznetsov [Make vm_size a field instead of a function. - Paolo] Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 12 ++++-------- arch/x86/kvm/svm.c | 16 +--------------- arch/x86/kvm/vmx/vmx.c | 17 +---------------- arch/x86/kvm/x86.c | 7 +++++++ 4 files changed, 13 insertions(+), 39 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 06c21f14298b..5edf6425c747 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1059,8 +1059,7 @@ struct kvm_x86_ops { bool (*has_emulated_msr)(int index); void (*cpuid_update)(struct kvm_vcpu *vcpu); - struct kvm *(*vm_alloc)(void); - void (*vm_free)(struct kvm *); + unsigned int vm_size; int (*vm_init)(struct kvm *kvm); void (*vm_destroy)(struct kvm *kvm); @@ -1278,13 +1277,10 @@ extern struct kmem_cache *x86_fpu_cache; #define __KVM_HAVE_ARCH_VM_ALLOC static inline struct kvm *kvm_arch_alloc_vm(void) { - return kvm_x86_ops->vm_alloc(); -} - -static inline void kvm_arch_free_vm(struct kvm *kvm) -{ - return kvm_x86_ops->vm_free(kvm); + return __vmalloc(kvm_x86_ops->vm_size, + GFP_KERNEL_ACCOUNT | __GFP_ZERO, PAGE_KERNEL); } +void kvm_arch_free_vm(struct kvm *kvm); #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLB static inline int kvm_arch_flush_remote_tlb(struct kvm *kvm) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index fdc88f7350c6..fd3fc9fbefff 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1944,19 +1944,6 @@ static void __unregister_enc_region_locked(struct kvm *kvm, kfree(region); } -static struct kvm *svm_vm_alloc(void) -{ - BUILD_BUG_ON(offsetof(struct kvm_svm, kvm) != 0); - - return __vmalloc(sizeof(struct kvm_svm), - GFP_KERNEL_ACCOUNT | __GFP_ZERO, PAGE_KERNEL); -} - -static void svm_vm_free(struct kvm *kvm) -{ - vfree(kvm); -} - static void sev_vm_destroy(struct kvm *kvm) { struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; @@ -7395,8 +7382,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .vcpu_free = svm_free_vcpu, .vcpu_reset = svm_vcpu_reset, - .vm_alloc = svm_vm_alloc, - .vm_free = svm_vm_free, + .vm_size = sizeof(struct kvm_svm), .vm_init = svm_vm_init, .vm_destroy = svm_vm_destroy, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 7d19ed7c13d8..a04017bdae05 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6679,20 +6679,6 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu) vmx_complete_interrupts(vmx); } -static struct kvm *vmx_vm_alloc(void) -{ - BUILD_BUG_ON(offsetof(struct kvm_vmx, kvm) != 0); - - return __vmalloc(sizeof(struct kvm_vmx), - GFP_KERNEL_ACCOUNT | __GFP_ZERO, PAGE_KERNEL); -} - -static void vmx_vm_free(struct kvm *kvm) -{ - kfree(kvm->arch.hyperv.hv_pa_pg); - vfree(kvm); -} - static void vmx_free_vcpu(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -7835,9 +7821,8 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .cpu_has_accelerated_tpr = report_flexpriority, .has_emulated_msr = vmx_has_emulated_msr, + .vm_size = sizeof(struct kvm_vmx), .vm_init = vmx_vm_init, - .vm_alloc = vmx_vm_alloc, - .vm_free = vmx_vm_free, .vcpu_create = vmx_create_vcpu, .vcpu_free = vmx_free_vcpu, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4b4749768f3d..ddd1d296bd20 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9622,6 +9622,13 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) kvm_x86_ops->sched_in(vcpu, cpu); } +void kvm_arch_free_vm(struct kvm *kvm) +{ + kfree(kvm->arch.hyperv.hv_pa_pg); + vfree(kvm); +} + + int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) { if (type) -- cgit From 4d395762599dbab1eb29d9011d5b75ca3cc4f70a Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 28 Feb 2020 13:30:20 -0500 Subject: KVM: Remove unnecessary asm/kvm_host.h includes Remove includes of asm/kvm_host.h from files that already include linux/kvm_host.h to make it more obvious that there is no ordering issue between the two headers. linux/kvm_host.h includes asm/kvm_host.h to pick up architecture specific settings, and this will never change, i.e. including asm/kvm_host.h after linux/kvm_host.h may seem problematic, but in practice is simply redundant. Signed-off-by: Peter Xu Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/page_track.c | 1 - 1 file changed, 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c index d125ec379c79..ddc1ec3bdacd 100644 --- a/arch/x86/kvm/mmu/page_track.c +++ b/arch/x86/kvm/mmu/page_track.c @@ -14,7 +14,6 @@ #include #include -#include #include #include "mmu.h" -- cgit From cc7f5577adfc766de8613b71e9ae52c053fcca01 Mon Sep 17 00:00:00 2001 From: Oliver Upton Date: Fri, 28 Feb 2020 00:59:04 -0800 Subject: KVM: SVM: Inhibit APIC virtualization for X2APIC guest The AVIC does not support guest use of the x2APIC interface. Currently, KVM simply chooses to squash the x2APIC feature in the guest's CPUID If the AVIC is enabled. Doing so prevents KVM from running a guest with greater than 255 vCPUs, as such a guest necessitates the use of the x2APIC interface. Instead, inhibit AVIC enablement on a per-VM basis whenever the x2APIC feature is set in the guest's CPUID. Signed-off-by: Oliver Upton Reviewed-by: Jim Mattson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/svm.c | 15 +++++++++------ 2 files changed, 10 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 5edf6425c747..f58861e2ece5 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -886,6 +886,7 @@ enum kvm_irqchip_mode { #define APICV_INHIBIT_REASON_NESTED 2 #define APICV_INHIBIT_REASON_IRQWIN 3 #define APICV_INHIBIT_REASON_PIT_REINJ 4 +#define APICV_INHIBIT_REASON_X2APIC 5 struct kvm_arch { unsigned long n_used_mmu_pages; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index fd3fc9fbefff..0d417276653b 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6014,7 +6014,13 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu) if (!kvm_vcpu_apicv_active(vcpu)) return; - guest_cpuid_clear(vcpu, X86_FEATURE_X2APIC); + /* + * AVIC does not work with an x2APIC mode guest. If the X2APIC feature + * is exposed to the guest, disable AVIC. + */ + if (guest_cpuid_has(vcpu, X86_FEATURE_X2APIC)) + kvm_request_apicv_update(vcpu->kvm, false, + APICV_INHIBIT_REASON_X2APIC); /* * Currently, AVIC does not work with nested virtualization. @@ -6030,10 +6036,6 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu) static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry) { switch (func) { - case 0x1: - if (avic) - entry->ecx &= ~F(X2APIC); - break; case 0x80000001: if (nested) entry->ecx |= (1 << 2); /* Set SVM bit */ @@ -7357,7 +7359,8 @@ static bool svm_check_apicv_inhibit_reasons(ulong bit) BIT(APICV_INHIBIT_REASON_HYPERV) | BIT(APICV_INHIBIT_REASON_NESTED) | BIT(APICV_INHIBIT_REASON_IRQWIN) | - BIT(APICV_INHIBIT_REASON_PIT_REINJ); + BIT(APICV_INHIBIT_REASON_PIT_REINJ) | + BIT(APICV_INHIBIT_REASON_X2APIC); return supported & BIT(bit); } -- cgit From 3651c7fc2bf63b62c009b8470aaf096b3b4b01e8 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 28 Feb 2020 14:52:39 -0800 Subject: KVM: x86/mmu: Ignore guest CR3 on fast root switch for direct MMU Ignore the guest's CR3 when looking for a cached root for a direct MMU, the guest's CR3 has no impact on the direct MMU's shadow pages (the role check ensures compatibility with CR0.WP, etc...). Zero out root_cr3 when allocating the direct roots to make it clear that it's ignored. Cc: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index c4e0b97f82ac..9d617b9dc78f 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3730,7 +3730,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) vcpu->arch.mmu->root_hpa = __pa(vcpu->arch.mmu->pae_root); } else BUG(); - vcpu->arch.mmu->root_cr3 = vcpu->arch.mmu->get_cr3(vcpu); + + /* root_cr3 is ignored for direct MMUs. */ + vcpu->arch.mmu->root_cr3 = 0; return 0; } @@ -4272,8 +4274,8 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_cr3, for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) { swap(root, mmu->prev_roots[i]); - if (new_cr3 == root.cr3 && VALID_PAGE(root.hpa) && - page_header(root.hpa) != NULL && + if ((new_role.direct || new_cr3 == root.cr3) && + VALID_PAGE(root.hpa) && page_header(root.hpa) && new_role.word == page_header(root.hpa)->role.word) break; } -- cgit From 0be44352071dc87a4f9bf879642b1d44876971d9 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 28 Feb 2020 14:52:40 -0800 Subject: KVM: x86/mmu: Reuse the current root if possible for fast switch Reuse the current root when possible instead of grabbing a different root from the array of cached roots. Doing so avoids unnecessary MMU switches and also fixes a quirk where KVM can't reuse roots without creating multiple roots since the cache is a victim cache, i.e. roots are added to the cache when they're "evicted", not when they are created. The quirk could be fixed by adding roots to the cache on creation, but that would reduce the effective size of the cache as one of its entries would be burned to track the current root. Reusing the current root is especially helpful for nested virt as the current root is almost always usable for the "new" MMU on nested VM-entry/VM-exit. Cc: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 9d617b9dc78f..53b776dfc949 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4253,6 +4253,14 @@ static void nonpaging_init_context(struct kvm_vcpu *vcpu, context->nx = false; } +static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t cr3, + union kvm_mmu_page_role role) +{ + return (role.direct || cr3 == root->cr3) && + VALID_PAGE(root->hpa) && page_header(root->hpa) && + role.word == page_header(root->hpa)->role.word; +} + /* * Find out if a previously cached root matching the new CR3/role is available. * The current root is also inserted into the cache. @@ -4271,12 +4279,13 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_cr3, root.cr3 = mmu->root_cr3; root.hpa = mmu->root_hpa; + if (is_root_usable(&root, new_cr3, new_role)) + return true; + for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) { swap(root, mmu->prev_roots[i]); - if ((new_role.direct || new_cr3 == root.cr3) && - VALID_PAGE(root.hpa) && page_header(root.hpa) && - new_role.word == page_header(root.hpa)->role.word) + if (is_root_usable(&root, new_cr3, new_role)) break; } -- cgit From 3c9bd4006bfc2dccda1823db61b3f470ef91cfaa Mon Sep 17 00:00:00 2001 From: Jay Zhou Date: Thu, 27 Feb 2020 09:32:27 +0800 Subject: KVM: x86: enable dirty log gradually in small chunks It could take kvm->mmu_lock for an extended period of time when enabling dirty log for the first time. The main cost is to clear all the D-bits of last level SPTEs. This situation can benefit from manual dirty log protect as well, which can reduce the mmu_lock time taken. The sequence is like this: 1. Initialize all the bits of the dirty bitmap to 1 when enabling dirty log for the first time 2. Only write protect the huge pages 3. KVM_GET_DIRTY_LOG returns the dirty bitmap info 4. KVM_CLEAR_DIRTY_LOG will clear D-bit for each of the leaf level SPTEs gradually in small chunks Under the Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz environment, I did some tests with a 128G windows VM and counted the time taken of memory_global_dirty_log_start, here is the numbers: VM Size Before After optimization 128G 460ms 10ms Signed-off-by: Jay Zhou Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 6 +++++- arch/x86/kvm/mmu/mmu.c | 7 ++++--- arch/x86/kvm/vmx/vmx.c | 3 ++- arch/x86/kvm/x86.c | 21 +++++++++++++++++---- 4 files changed, 28 insertions(+), 9 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f58861e2ece5..681e23071847 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -49,6 +49,9 @@ #define KVM_IRQCHIP_NUM_PINS KVM_IOAPIC_NUM_PINS +#define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \ + KVM_DIRTY_LOG_INITIALLY_SET) + /* x86-specific vcpu->requests bit members */ #define KVM_REQ_MIGRATE_TIMER KVM_ARCH_REQ(0) #define KVM_REQ_REPORT_TPR_ACCESS KVM_ARCH_REQ(1) @@ -1306,7 +1309,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, void kvm_mmu_reset_context(struct kvm_vcpu *vcpu); void kvm_mmu_slot_remove_write_access(struct kvm *kvm, - struct kvm_memory_slot *memslot); + struct kvm_memory_slot *memslot, + int start_level); void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, const struct kvm_memory_slot *memslot); void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 53b776dfc949..db0b66e153a2 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5864,13 +5864,14 @@ static bool slot_rmap_write_protect(struct kvm *kvm, } void kvm_mmu_slot_remove_write_access(struct kvm *kvm, - struct kvm_memory_slot *memslot) + struct kvm_memory_slot *memslot, + int start_level) { bool flush; spin_lock(&kvm->mmu_lock); - flush = slot_handle_all_level(kvm, memslot, slot_rmap_write_protect, - false); + flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect, + start_level, PT_MAX_HUGEPAGE_LEVEL, false); spin_unlock(&kvm->mmu_lock); /* diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index a04017bdae05..2bb4c4e21076 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7280,7 +7280,8 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu) static void vmx_slot_enable_log_dirty(struct kvm *kvm, struct kvm_memory_slot *slot) { - kvm_mmu_slot_leaf_clear_dirty(kvm, slot); + if (!kvm_dirty_log_manual_protect_and_init_set(kvm)) + kvm_mmu_slot_leaf_clear_dirty(kvm, slot); kvm_mmu_slot_largepage_remove_write_access(kvm, slot); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ddd1d296bd20..864d0aded0b8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9916,7 +9916,7 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm, { /* Still write protect RO slot */ if (new->flags & KVM_MEM_READONLY) { - kvm_mmu_slot_remove_write_access(kvm, new); + kvm_mmu_slot_remove_write_access(kvm, new, PT_PAGE_TABLE_LEVEL); return; } @@ -9951,10 +9951,23 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm, * See the comments in fast_page_fault(). */ if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) { - if (kvm_x86_ops->slot_enable_log_dirty) + if (kvm_x86_ops->slot_enable_log_dirty) { kvm_x86_ops->slot_enable_log_dirty(kvm, new); - else - kvm_mmu_slot_remove_write_access(kvm, new); + } else { + int level = + kvm_dirty_log_manual_protect_and_init_set(kvm) ? + PT_DIRECTORY_LEVEL : PT_PAGE_TABLE_LEVEL; + + /* + * If we're with initial-all-set, we don't need + * to write protect any small page because + * they're reported as dirty already. However + * we still need to write-protect huge pages + * so that the page split can happen lazily on + * the first write to the huge page. + */ + kvm_mmu_slot_remove_write_access(kvm, new, level); + } } else { if (kvm_x86_ops->slot_disable_log_dirty) kvm_x86_ops->slot_disable_log_dirty(kvm, new); -- cgit From 49f933d445b611a237a4687ec7135b57b6b2cfba Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Thu, 27 Feb 2020 11:20:54 +0800 Subject: KVM: Fix some obsolete comments Remove some obsolete comments, fix wrong function name and description. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Miaohe Lin Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 4 ++-- arch/x86/kvm/vmx/vmx.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 0946122a8d3b..966a5c394c06 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2960,7 +2960,7 @@ static int nested_vmx_check_vmentry_hw(struct kvm_vcpu *vcpu) /* * Induce a consistency check VMExit by clearing bit 1 in GUEST_RFLAGS, * which is reserved to '1' by hardware. GUEST_RFLAGS is guaranteed to - * be written (by preparve_vmcs02()) before the "real" VMEnter, i.e. + * be written (by prepare_vmcs02()) before the "real" VMEnter, i.e. * there is no need to preserve other bits or save/restore the field. */ vmcs_writel(GUEST_RFLAGS, 0); @@ -4382,7 +4382,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, * Decode the memory-address operand of a vmx instruction, as recorded on an * exit caused by such an instruction (run by a guest hypervisor). * On success, returns 0. When the operand is invalid, returns 1 and throws - * #UD or #GP. + * #UD, #GP, or #SS. */ int get_vmx_mem_address(struct kvm_vcpu *vcpu, unsigned long exit_qualification, u32 vmx_instruction_info, bool wr, int len, gva_t *ret) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 2bb4c4e21076..a7314b28ee3a 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -808,7 +808,7 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu) if (to_vmx(vcpu)->rmode.vm86_active) eb = ~0; if (enable_ept) - eb &= ~(1u << PF_VECTOR); /* bypass_guest_pf = 0 */ + eb &= ~(1u << PF_VECTOR); /* When we are running a nested L2 guest and L1 specified for it a * certain exception bitmap, we must trap the same exceptions and pass -- cgit From 4abaffce4d25aa41392d2e81835592726d757857 Mon Sep 17 00:00:00 2001 From: Wanpeng Li Date: Wed, 26 Feb 2020 10:41:02 +0800 Subject: KVM: LAPIC: Recalculate apic map in batch In the vCPU reset and set APIC_BASE MSR path, the apic map will be recalculated several times, each time it will consume 10+ us observed by ftrace in my non-overcommit environment since the expensive memory allocate/mutex/rcu etc operations. This patch optimizes it by recaluating apic map in batch, I hope this can benefit the serverless scenario which can frequently create/destroy VMs. Before patch: kvm_lapic_reset ~27us After patch: kvm_lapic_reset ~14us Observed by ftrace, improve ~48%. Signed-off-by: Wanpeng Li Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/lapic.c | 46 ++++++++++++++++++++++++++++++++--------- arch/x86/kvm/lapic.h | 1 + arch/x86/kvm/x86.c | 1 + 4 files changed, 39 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 681e23071847..642273712e6a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -920,6 +920,7 @@ struct kvm_arch { atomic_t vapics_in_nmi_mode; struct mutex apic_map_lock; struct kvm_apic_map *apic_map; + bool apic_map_dirty; bool apic_access_page_done; unsigned long apicv_inhibit_reasons; diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 18a42dcab3c6..41bd49ebe355 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -164,14 +164,28 @@ static void kvm_apic_map_free(struct rcu_head *rcu) kvfree(map); } -static void recalculate_apic_map(struct kvm *kvm) +void kvm_recalculate_apic_map(struct kvm *kvm) { struct kvm_apic_map *new, *old = NULL; struct kvm_vcpu *vcpu; int i; u32 max_id = 255; /* enough space for any xAPIC ID */ + if (!kvm->arch.apic_map_dirty) { + /* + * Read kvm->arch.apic_map_dirty before + * kvm->arch.apic_map + */ + smp_rmb(); + return; + } + mutex_lock(&kvm->arch.apic_map_lock); + if (!kvm->arch.apic_map_dirty) { + /* Someone else has updated the map. */ + mutex_unlock(&kvm->arch.apic_map_lock); + return; + } kvm_for_each_vcpu(i, vcpu, kvm) if (kvm_apic_present(vcpu)) @@ -236,6 +250,12 @@ out: old = rcu_dereference_protected(kvm->arch.apic_map, lockdep_is_held(&kvm->arch.apic_map_lock)); rcu_assign_pointer(kvm->arch.apic_map, new); + /* + * Write kvm->arch.apic_map before + * clearing apic->apic_map_dirty + */ + smp_wmb(); + kvm->arch.apic_map_dirty = false; mutex_unlock(&kvm->arch.apic_map_lock); if (old) @@ -257,20 +277,20 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val) else static_key_slow_inc(&apic_sw_disabled.key); - recalculate_apic_map(apic->vcpu->kvm); + apic->vcpu->kvm->arch.apic_map_dirty = true; } } static inline void kvm_apic_set_xapic_id(struct kvm_lapic *apic, u8 id) { kvm_lapic_set_reg(apic, APIC_ID, id << 24); - recalculate_apic_map(apic->vcpu->kvm); + apic->vcpu->kvm->arch.apic_map_dirty = true; } static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id) { kvm_lapic_set_reg(apic, APIC_LDR, id); - recalculate_apic_map(apic->vcpu->kvm); + apic->vcpu->kvm->arch.apic_map_dirty = true; } static inline u32 kvm_apic_calc_x2apic_ldr(u32 id) @@ -286,7 +306,7 @@ static inline void kvm_apic_set_x2apic_id(struct kvm_lapic *apic, u32 id) kvm_lapic_set_reg(apic, APIC_ID, id); kvm_lapic_set_reg(apic, APIC_LDR, ldr); - recalculate_apic_map(apic->vcpu->kvm); + apic->vcpu->kvm->arch.apic_map_dirty = true; } static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type) @@ -1906,7 +1926,7 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) case APIC_DFR: if (!apic_x2apic_mode(apic)) { kvm_lapic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF); - recalculate_apic_map(apic->vcpu->kvm); + apic->vcpu->kvm->arch.apic_map_dirty = true; } else ret = 1; break; @@ -2012,6 +2032,8 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) break; } + kvm_recalculate_apic_map(apic->vcpu->kvm); + return ret; } EXPORT_SYMBOL_GPL(kvm_lapic_reg_write); @@ -2160,7 +2182,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) static_key_slow_dec_deferred(&apic_hw_disabled); } else { static_key_slow_inc(&apic_hw_disabled.key); - recalculate_apic_map(vcpu->kvm); + vcpu->kvm->arch.apic_map_dirty = true; } } @@ -2201,6 +2223,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) if (!apic) return; + vcpu->kvm->arch.apic_map_dirty = false; /* Stop the timer in case it's a reset to an active apic */ hrtimer_cancel(&apic->lapic_timer.timer); @@ -2252,6 +2275,8 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) vcpu->arch.apic_arb_prio = 0; vcpu->arch.apic_attention = 0; + + kvm_recalculate_apic_map(vcpu->kvm); } /* @@ -2473,17 +2498,18 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s) struct kvm_lapic *apic = vcpu->arch.apic; int r; - kvm_lapic_set_base(vcpu, vcpu->arch.apic_base); /* set SPIV separately to get count of SW disabled APICs right */ apic_set_spiv(apic, *((u32 *)(s->regs + APIC_SPIV))); r = kvm_apic_state_fixup(vcpu, s, true); - if (r) + if (r) { + kvm_recalculate_apic_map(vcpu->kvm); return r; + } memcpy(vcpu->arch.apic->regs, s->regs, sizeof(*s)); - recalculate_apic_map(vcpu->kvm); + kvm_recalculate_apic_map(vcpu->kvm); kvm_apic_set_version(vcpu); apic_update_ppr(apic); diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index ec6fbfe325cf..7581bc2dcd47 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -78,6 +78,7 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8); void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu); void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value); u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu); +void kvm_recalculate_apic_map(struct kvm *kvm); void kvm_apic_set_version(struct kvm_vcpu *vcpu); int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val); int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 offset, int len, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 864d0aded0b8..c5762c031b0c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -350,6 +350,7 @@ int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info) } kvm_lapic_set_base(vcpu, msr_info->data); + kvm_recalculate_apic_map(vcpu->kvm); return 0; } EXPORT_SYMBOL_GPL(kvm_set_apic_base); -- cgit From b34de572a863b5a453dece431eac0da59b5aec0a Mon Sep 17 00:00:00 2001 From: Wanpeng Li Date: Fri, 28 Feb 2020 11:18:41 +0800 Subject: KVM: X86: trigger kvmclock sync request just once on VM creation In the progress of vCPUs creation, it queues a kvmclock sync worker to the global workqueue before each vCPU creation completes. The workqueue subsystem guarantees not to queue the already queued work; however, we can make the logic more clear by making just one leader to trigger this kvmclock sync request, and also save on cacheline bouncing caused by test_and_set_bit. Signed-off-by: Wanpeng Li Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c5762c031b0c..78e34c968ab5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9335,11 +9335,9 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) mutex_unlock(&vcpu->mutex); - if (!kvmclock_periodic_sync) - return; - - schedule_delayed_work(&kvm->arch.kvmclock_sync_work, - KVMCLOCK_SYNC_PERIOD); + if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0) + schedule_delayed_work(&kvm->arch.kvmclock_sync_work, + KVMCLOCK_SYNC_PERIOD); } void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu) -- cgit From a1c77abb8d93381e25a8d2df3a917388244ba776 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 22:27:35 -0800 Subject: KVM: nVMX: Properly handle userspace interrupt window request Return true for vmx_interrupt_allowed() if the vCPU is in L2 and L1 has external interrupt exiting enabled. IRQs are never blocked in hardware if the CPU is in the guest (L2 from L1's perspective) when IRQs trigger VM-Exit. The new check percolates up to kvm_vcpu_ready_for_interrupt_injection() and thus vcpu_run(), and so KVM will exit to userspace if userspace has requested an interrupt window (to inject an IRQ into L1). Remove the @external_intr param from vmx_check_nested_events(), which is actually an indicator that userspace wants an interrupt window, e.g. it's named @req_int_win further up the stack. Injecting a VM-Exit into L1 to try and bounce out to L0 userspace is all kinds of broken and is no longer necessary. Remove the hack in nested_vmx_vmexit() that attempted to workaround the breakage in vmx_check_nested_events() by only filling interrupt info if there's an actual interrupt pending. The hack actually made things worse because it caused KVM to _never_ fill interrupt info when the LAPIC resides in userspace (kvm_cpu_has_interrupt() queries interrupt.injected, which is always cleared by prepare_vmcs12() before reaching the hack in nested_vmx_vmexit()). Fixes: 6550c4df7e50 ("KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit"") Cc: stable@vger.kernel.org Cc: Liran Alon Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/vmx/nested.c | 18 ++++-------------- arch/x86/kvm/vmx/vmx.c | 9 +++++++-- arch/x86/kvm/x86.c | 10 +++++----- 4 files changed, 17 insertions(+), 22 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 642273712e6a..6093ff1f3bdb 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1180,7 +1180,7 @@ struct kvm_x86_ops { bool (*pt_supported)(void); bool (*pku_supported)(void); - int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr); + int (*check_nested_events)(struct kvm_vcpu *vcpu); void (*request_immediate_exit)(struct kvm_vcpu *vcpu); void (*sched_in)(struct kvm_vcpu *kvm, int cpu); diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 966a5c394c06..300d87ff23c2 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -3603,7 +3603,7 @@ static void nested_vmx_update_pending_dbg(struct kvm_vcpu *vcpu) vcpu->arch.exception.payload); } -static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) +static int vmx_check_nested_events(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); unsigned long exit_qual; @@ -3679,8 +3679,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) return 0; } - if ((kvm_cpu_has_interrupt(vcpu) || external_intr) && - nested_exit_on_intr(vcpu)) { + if (kvm_cpu_has_interrupt(vcpu) && nested_exit_on_intr(vcpu)) { if (block_nested_events) return -EBUSY; nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0); @@ -4328,17 +4327,8 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; if (likely(!vmx->fail)) { - /* - * TODO: SDM says that with acknowledge interrupt on - * exit, bit 31 of the VM-exit interrupt information - * (valid interrupt) is always set to 1 on - * EXIT_REASON_EXTERNAL_INTERRUPT, so we shouldn't - * need kvm_cpu_has_interrupt(). See the commit - * message for details. - */ - if (nested_exit_intr_ack_set(vcpu) && - exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT && - kvm_cpu_has_interrupt(vcpu)) { + if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT && + nested_exit_intr_ack_set(vcpu)) { int irq = kvm_cpu_get_interrupt(vcpu); WARN_ON(irq < 0); vmcs12->vm_exit_intr_info = irq | diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index a7314b28ee3a..de4bf790b3f4 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -4493,8 +4493,13 @@ static int vmx_nmi_allowed(struct kvm_vcpu *vcpu) static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu) { - return (!to_vmx(vcpu)->nested.nested_run_pending && - vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) && + if (to_vmx(vcpu)->nested.nested_run_pending) + return false; + + if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) + return true; + + return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) && !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS)); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 78e34c968ab5..b15092c9593d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7579,7 +7579,7 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu) kvm_x86_ops->update_cr8_intercept(vcpu, tpr, max_irr); } -static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) +static int inject_pending_event(struct kvm_vcpu *vcpu) { int r; @@ -7615,7 +7615,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) * from L2 to L1. */ if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { - r = kvm_x86_ops->check_nested_events(vcpu, req_int_win); + r = kvm_x86_ops->check_nested_events(vcpu); if (r != 0) return r; } @@ -7677,7 +7677,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) * KVM_REQ_EVENT only on certain events and not unconditionally? */ if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { - r = kvm_x86_ops->check_nested_events(vcpu, req_int_win); + r = kvm_x86_ops->check_nested_events(vcpu); if (r != 0) return r; } @@ -8210,7 +8210,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) goto out; } - if (inject_pending_event(vcpu, req_int_win) != 0) + if (inject_pending_event(vcpu) != 0) req_immediate_exit = true; else { /* Enable SMI/NMI/IRQ window open exits if needed. @@ -8438,7 +8438,7 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu) static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu) { if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) - kvm_x86_ops->check_nested_events(vcpu, false); + kvm_x86_ops->check_nested_events(vcpu); return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE && !vcpu->arch.apf.halted); -- cgit From a102a674e423f41eed3bcdfe887287b7a68fb735 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 18:02:34 -0800 Subject: KVM: x86/mmu: Don't drop level/direct from MMU role calculation Use the calculated role as-is when propagating it to kvm_mmu.mmu_role, i.e. stop masking off meaningful fields. The concept of masking off fields came from kvm_mmu_pte_write(), which (correctly) ignores certain fields when comparing kvm_mmu_page.role against kvm_mmu.mmu_role, e.g. the current mmu's access and level have no relation to a shadow page's access and level. Masking off the level causes problems for 5-level paging, e.g. CR4.LA57 has its own redundant flag in the extended role, and nested EPT would need a similar hack to support 5-level paging for L2. Opportunistically rework the mask for kvm_mmu_pte_write() to define the fields that should be ignored as opposed to the fields that should be checked, i.e. make it opt-out instead of opt-in so that new fields are automatically picked up. While doing so, stop ignoring "direct". The field is effectively ignored anyways because kvm_mmu_pte_write() is only reached with an indirect mmu and the loop only walks indirect shadow pages, but double checking "direct" literally costs nothing. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 35 ++++++++++++++++++----------------- 1 file changed, 18 insertions(+), 17 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index db0b66e153a2..8b22fc0f0973 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -215,17 +215,6 @@ struct kvm_shadow_walk_iterator { unsigned index; }; -static const union kvm_mmu_page_role mmu_base_role_mask = { - .cr0_wp = 1, - .gpte_is_8_bytes = 1, - .nxe = 1, - .smep_andnot_wp = 1, - .smap_andnot_wp = 1, - .smm = 1, - .guest_mode = 1, - .ad_disabled = 1, -}; - #define for_each_shadow_entry_using_root(_vcpu, _root, _addr, _walker) \ for (shadow_walk_init_using_root(&(_walker), (_vcpu), \ (_root), (_addr)); \ @@ -4930,7 +4919,6 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu) union kvm_mmu_role new_role = kvm_calc_tdp_mmu_root_page_role(vcpu, false); - new_role.base.word &= mmu_base_role_mask.word; if (new_role.as_u64 == context->mmu_role.as_u64) return; @@ -5002,7 +4990,6 @@ void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu) union kvm_mmu_role new_role = kvm_calc_shadow_mmu_root_page_role(vcpu, false); - new_role.base.word &= mmu_base_role_mask.word; if (new_role.as_u64 == context->mmu_role.as_u64) return; @@ -5059,7 +5046,6 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, __kvm_mmu_new_cr3(vcpu, new_eptp, new_role.base, false); - new_role.base.word &= mmu_base_role_mask.word; if (new_role.as_u64 == context->mmu_role.as_u64) return; @@ -5100,7 +5086,6 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu) union kvm_mmu_role new_role = kvm_calc_mmu_role_common(vcpu, false); struct kvm_mmu *g_context = &vcpu->arch.nested_mmu; - new_role.base.word &= mmu_base_role_mask.word; if (new_role.as_u64 == g_context->mmu_role.as_u64) return; @@ -5339,6 +5324,22 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte) return spte; } +/* + * Ignore various flags when determining if a SPTE can be immediately + * overwritten for the current MMU. + * - level: explicitly checked in mmu_pte_write_new_pte(), and will never + * match the current MMU role, as MMU's level tracks the root level. + * - access: updated based on the new guest PTE + * - quadrant: handled by get_written_sptes() + * - invalid: always false (loop only walks valid shadow pages) + */ +static const union kvm_mmu_page_role role_ign = { + .level = 0xf, + .access = 0x7, + .quadrant = 0x3, + .invalid = 0x1, +}; + static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes, struct kvm_page_track_notifier_node *node) @@ -5394,8 +5395,8 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, entry = *spte; mmu_page_zap_pte(vcpu->kvm, sp, spte); if (gentry && - !((sp->role.word ^ base_role) - & mmu_base_role_mask.word) && rmap_can_add(vcpu)) + !((sp->role.word ^ base_role) & ~role_ign.word) && + rmap_can_add(vcpu)) mmu_pte_write_new_pte(vcpu, sp, spte, &gentry); if (need_remote_flush(entry, *spte)) remote_flush = true; -- cgit From 8053f924cad30bf9f9a24e02b6c8ddfabf5202ea Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 18:02:35 -0800 Subject: KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack Drop kvm_mmu_extended_role.cr4_la57 now that mmu_role doesn't mask off level, which already incorporates the guest's CR4.LA57 for a shadow MMU by querying is_la57_mode(). Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 1 - arch/x86/kvm/mmu/mmu.c | 1 - 2 files changed, 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 6093ff1f3bdb..327cfce91185 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -300,7 +300,6 @@ union kvm_mmu_extended_role { unsigned int cr4_pke:1; unsigned int cr4_smap:1; unsigned int cr4_smep:1; - unsigned int cr4_la57:1; unsigned int maxphyaddr:6; }; }; diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 8b22fc0f0973..374ccbcb92e1 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4873,7 +4873,6 @@ static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu) ext.cr4_smap = !!kvm_read_cr4_bits(vcpu, X86_CR4_SMAP); ext.cr4_pse = !!is_pse(vcpu); ext.cr4_pke = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE); - ext.cr4_la57 = !!kvm_read_cr4_bits(vcpu, X86_CR4_LA57); ext.maxphyaddr = cpuid_maxphyaddr(vcpu); ext.valid = 1; -- cgit From bb1fcc70d98f040e3cc469b079d3f38fc541cb95 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 18:02:36 -0800 Subject: KVM: nVMX: Allow L1 to use 5-level page walks for nested EPT Add support for 5-level nested EPT, and advertise said support in the EPT capabilities MSR. KVM's MMU can already handle 5-level legacy page tables, there's no reason to force an L1 VMM to use shadow paging if it wants to employ 5-level page tables. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/vmx.h | 12 ++++++++++++ arch/x86/kvm/mmu/mmu.c | 11 ++++++----- arch/x86/kvm/mmu/paging_tmpl.h | 2 +- arch/x86/kvm/vmx/nested.c | 21 +++++++++++++++++---- arch/x86/kvm/vmx/vmx.c | 3 +-- 5 files changed, 37 insertions(+), 12 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 8521af3fef27..5e090d1f03f8 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -500,6 +500,18 @@ enum vmcs_field { VMX_EPT_EXECUTABLE_MASK) #define VMX_EPT_MT_MASK (7ull << VMX_EPT_MT_EPTE_SHIFT) +static inline u8 vmx_eptp_page_walk_level(u64 eptp) +{ + u64 encoded_level = eptp & VMX_EPTP_PWL_MASK; + + if (encoded_level == VMX_EPTP_PWL_5) + return 5; + + /* @eptp must be pre-validated by the caller. */ + WARN_ON_ONCE(encoded_level != VMX_EPTP_PWL_4); + return 4; +} + /* The mask to use to trigger an EPT Misconfiguration in order to track MMIO */ #define VMX_EPT_MISCONFIG_WX_VALUE (VMX_EPT_WRITABLE_MASK | \ VMX_EPT_EXECUTABLE_MASK) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 374ccbcb92e1..a214e103af07 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5008,14 +5008,14 @@ EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu); static union kvm_mmu_role kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty, - bool execonly) + bool execonly, u8 level) { union kvm_mmu_role role = {0}; /* SMM flag is inherited from root_mmu */ role.base.smm = vcpu->arch.root_mmu.mmu_role.base.smm; - role.base.level = PT64_ROOT_4LEVEL; + role.base.level = level; role.base.gpte_is_8_bytes = true; role.base.direct = false; role.base.ad_disabled = !accessed_dirty; @@ -5039,16 +5039,17 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, bool accessed_dirty, gpa_t new_eptp) { struct kvm_mmu *context = vcpu->arch.mmu; + u8 level = vmx_eptp_page_walk_level(new_eptp); union kvm_mmu_role new_role = kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty, - execonly); + execonly, level); __kvm_mmu_new_cr3(vcpu, new_eptp, new_role.base, false); if (new_role.as_u64 == context->mmu_role.as_u64) return; - context->shadow_root_level = PT64_ROOT_4LEVEL; + context->shadow_root_level = level; context->nx = true; context->ept_ad = accessed_dirty; @@ -5057,7 +5058,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, context->sync_page = ept_sync_page; context->invlpg = ept_invlpg; context->update_pte = ept_update_pte; - context->root_level = PT64_ROOT_4LEVEL; + context->root_level = level; context->direct_map = false; context->mmu_role.as_u64 = new_role.as_u64; diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index e4c8a4cbf407..6b15b58f3ecc 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -66,7 +66,7 @@ #define PT_GUEST_ACCESSED_SHIFT 8 #define PT_HAVE_ACCESSED_DIRTY(mmu) ((mmu)->ept_ad) #define CMPXCHG cmpxchg64 - #define PT_MAX_FULL_LEVELS 4 + #define PT_MAX_FULL_LEVELS PT64_ROOT_MAX_LEVEL #else #error Invalid PTTYPE value #endif diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 300d87ff23c2..5f19b906d94a 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2582,9 +2582,19 @@ static bool valid_ept_address(struct kvm_vcpu *vcpu, u64 address) return false; } - /* only 4 levels page-walk length are valid */ - if (CC((address & VMX_EPTP_PWL_MASK) != VMX_EPTP_PWL_4)) + /* Page-walk levels validity. */ + switch (address & VMX_EPTP_PWL_MASK) { + case VMX_EPTP_PWL_5: + if (CC(!(vmx->nested.msrs.ept_caps & VMX_EPT_PAGE_WALK_5_BIT))) + return false; + break; + case VMX_EPTP_PWL_4: + if (CC(!(vmx->nested.msrs.ept_caps & VMX_EPT_PAGE_WALK_4_BIT))) + return false; + break; + default: return false; + } /* Reserved bits should not be set */ if (CC(address >> maxphyaddr || ((address >> 7) & 0x1f))) @@ -6119,8 +6129,11 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) /* nested EPT: emulate EPT also to L1 */ msrs->secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT; - msrs->ept_caps = VMX_EPT_PAGE_WALK_4_BIT | - VMX_EPTP_WB_BIT | VMX_EPT_INVEPT_BIT; + msrs->ept_caps = + VMX_EPT_PAGE_WALK_4_BIT | + VMX_EPT_PAGE_WALK_5_BIT | + VMX_EPTP_WB_BIT | + VMX_EPT_INVEPT_BIT; if (cpu_has_vmx_ept_execute_only()) msrs->ept_caps |= VMX_EPT_EXECUTE_ONLY_BIT; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index de4bf790b3f4..c369ab30d4e6 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2985,9 +2985,8 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) static int get_ept_level(struct kvm_vcpu *vcpu) { - /* Nested EPT currently only supports 4-level walks. */ if (is_guest_mode(vcpu) && nested_cpu_has_ept(get_vmcs12(vcpu))) - return 4; + return vmx_eptp_page_walk_level(nested_ept_get_cr3(vcpu)); if (cpu_has_vmx_ept_5levels() && (cpuid_maxphyaddr(vcpu) > 48)) return 5; return 4; -- cgit From ac69dfaacee8ec6e3fb291f15ec7735a7772eab2 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 18:02:37 -0800 Subject: KVM: nVMX: Rename nested_ept_get_cr3() to nested_ept_get_eptp() Rename the accessor for vmcs12.EPTP to use "eptp" instead of "cr3". The accessor has no relation to cr3 whatsoever, other than it being assigned to the also poorly named kvm_mmu->get_cr3() hook. No functional change intended. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 4 ++-- arch/x86/kvm/vmx/nested.h | 4 ++-- arch/x86/kvm/vmx/vmx.c | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 5f19b906d94a..59b7214b1284 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -353,9 +353,9 @@ static void nested_ept_init_mmu_context(struct kvm_vcpu *vcpu) to_vmx(vcpu)->nested.msrs.ept_caps & VMX_EPT_EXECUTE_ONLY_BIT, nested_ept_ad_enabled(vcpu), - nested_ept_get_cr3(vcpu)); + nested_ept_get_eptp(vcpu)); vcpu->arch.mmu->set_cr3 = vmx_set_cr3; - vcpu->arch.mmu->get_cr3 = nested_ept_get_cr3; + vcpu->arch.mmu->get_cr3 = nested_ept_get_eptp; vcpu->arch.mmu->inject_page_fault = nested_ept_inject_page_fault; vcpu->arch.mmu->get_pdptr = kvm_pdptr_read; diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h index 9aeda46f473e..21d36652f213 100644 --- a/arch/x86/kvm/vmx/nested.h +++ b/arch/x86/kvm/vmx/nested.h @@ -60,7 +60,7 @@ static inline int vmx_has_valid_vmcs12(struct kvm_vcpu *vcpu) vmx->nested.hv_evmcs; } -static inline unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu) +static inline unsigned long nested_ept_get_eptp(struct kvm_vcpu *vcpu) { /* return the page table to be shadowed - in our case, EPT12 */ return get_vmcs12(vcpu)->ept_pointer; @@ -68,7 +68,7 @@ static inline unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu) static inline bool nested_ept_ad_enabled(struct kvm_vcpu *vcpu) { - return nested_ept_get_cr3(vcpu) & VMX_EPTP_AD_ENABLE_BIT; + return nested_ept_get_eptp(vcpu) & VMX_EPTP_AD_ENABLE_BIT; } /* diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index c369ab30d4e6..743b81642ce2 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2986,7 +2986,7 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) static int get_ept_level(struct kvm_vcpu *vcpu) { if (is_guest_mode(vcpu) && nested_cpu_has_ept(get_vmcs12(vcpu))) - return vmx_eptp_page_walk_level(nested_ept_get_cr3(vcpu)); + return vmx_eptp_page_walk_level(nested_ept_get_eptp(vcpu)); if (cpu_has_vmx_ept_5levels() && (cpuid_maxphyaddr(vcpu) > 48)) return 5; return 4; -- cgit From ac6389ab2c7ad900cc5b02bf4058d2e431963ea9 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 18:02:38 -0800 Subject: KVM: nVMX: Rename EPTP validity helper and associated variables Rename valid_ept_address() to nested_vmx_check_eptp() to follow the nVMX nomenclature and to reflect that the function now checks a lot more than just the address contained in the EPTP. Rename address to new_eptp in associated code. No functional change intended. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 59b7214b1284..c1eae446d5da 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2563,13 +2563,13 @@ static int nested_vmx_check_nmi_controls(struct vmcs12 *vmcs12) return 0; } -static bool valid_ept_address(struct kvm_vcpu *vcpu, u64 address) +static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp) { struct vcpu_vmx *vmx = to_vmx(vcpu); int maxphyaddr = cpuid_maxphyaddr(vcpu); /* Check for memory type validity */ - switch (address & VMX_EPTP_MT_MASK) { + switch (new_eptp & VMX_EPTP_MT_MASK) { case VMX_EPTP_MT_UC: if (CC(!(vmx->nested.msrs.ept_caps & VMX_EPTP_UC_BIT))) return false; @@ -2583,7 +2583,7 @@ static bool valid_ept_address(struct kvm_vcpu *vcpu, u64 address) } /* Page-walk levels validity. */ - switch (address & VMX_EPTP_PWL_MASK) { + switch (new_eptp & VMX_EPTP_PWL_MASK) { case VMX_EPTP_PWL_5: if (CC(!(vmx->nested.msrs.ept_caps & VMX_EPT_PAGE_WALK_5_BIT))) return false; @@ -2597,11 +2597,11 @@ static bool valid_ept_address(struct kvm_vcpu *vcpu, u64 address) } /* Reserved bits should not be set */ - if (CC(address >> maxphyaddr || ((address >> 7) & 0x1f))) + if (CC(new_eptp >> maxphyaddr || ((new_eptp >> 7) & 0x1f))) return false; /* AD, if set, should be supported */ - if (address & VMX_EPTP_AD_ENABLE_BIT) { + if (new_eptp & VMX_EPTP_AD_ENABLE_BIT) { if (CC(!(vmx->nested.msrs.ept_caps & VMX_EPT_AD_BIT))) return false; } @@ -2650,7 +2650,7 @@ static int nested_check_vm_execution_controls(struct kvm_vcpu *vcpu, return -EINVAL; if (nested_cpu_has_ept(vmcs12) && - CC(!valid_ept_address(vcpu, vmcs12->ept_pointer))) + CC(!nested_vmx_check_eptp(vcpu, vmcs12->ept_pointer))) return -EINVAL; if (nested_cpu_has_vmfunc(vmcs12)) { @@ -5234,7 +5234,7 @@ static int nested_vmx_eptp_switching(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { u32 index = kvm_rcx_read(vcpu); - u64 address; + u64 new_eptp; bool accessed_dirty; struct kvm_mmu *mmu = vcpu->arch.walk_mmu; @@ -5247,23 +5247,23 @@ static int nested_vmx_eptp_switching(struct kvm_vcpu *vcpu, if (kvm_vcpu_read_guest_page(vcpu, vmcs12->eptp_list_address >> PAGE_SHIFT, - &address, index * 8, 8)) + &new_eptp, index * 8, 8)) return 1; - accessed_dirty = !!(address & VMX_EPTP_AD_ENABLE_BIT); + accessed_dirty = !!(new_eptp & VMX_EPTP_AD_ENABLE_BIT); /* * If the (L2) guest does a vmfunc to the currently * active ept pointer, we don't have to do anything else */ - if (vmcs12->ept_pointer != address) { - if (!valid_ept_address(vcpu, address)) + if (vmcs12->ept_pointer != new_eptp) { + if (!nested_vmx_check_eptp(vcpu, new_eptp)) return 1; kvm_mmu_unload(vcpu); mmu->ept_ad = accessed_dirty; mmu->mmu_role.base.ad_disabled = !accessed_dirty; - vmcs12->ept_pointer = address; + vmcs12->ept_pointer = new_eptp; /* * TODO: Check what's the correct approach in case * mmu reload fails. Currently, we just let the next -- cgit From d8dd54e06348c43b97e5c0d488e5ee4e004bfb6f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 18:02:39 -0800 Subject: KVM: x86/mmu: Rename kvm_mmu->get_cr3() to ->get_guest_pgd() Rename kvm_mmu->get_cr3() to call out that it is retrieving a guest value, as opposed to kvm_mmu->set_cr3(), which sets a host value, and to note that it will return something other than CR3 when nested EPT is in use. Hopefully the new name will also make it more obvious that L1's nested_cr3 is returned in SVM's nested NPT case. No functional change intended. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/mmu/mmu.c | 10 +++++----- arch/x86/kvm/mmu/paging_tmpl.h | 2 +- arch/x86/kvm/svm.c | 2 +- arch/x86/kvm/vmx/nested.c | 2 +- arch/x86/kvm/x86.c | 2 +- 6 files changed, 10 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 327cfce91185..316ec6cf532b 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -385,7 +385,7 @@ struct kvm_mmu_root_info { */ struct kvm_mmu { void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root); - unsigned long (*get_cr3)(struct kvm_vcpu *vcpu); + unsigned long (*get_guest_pgd)(struct kvm_vcpu *vcpu); u64 (*get_pdptr)(struct kvm_vcpu *vcpu, int index); int (*page_fault)(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u32 err, bool prefault); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index a214e103af07..a1f4e325420e 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3733,7 +3733,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) gfn_t root_gfn, root_cr3; int i; - root_cr3 = vcpu->arch.mmu->get_cr3(vcpu); + root_cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu); root_gfn = root_cr3 >> PAGE_SHIFT; if (mmu_check_root(vcpu, root_gfn)) @@ -4070,7 +4070,7 @@ static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id; arch.gfn = gfn; arch.direct_map = vcpu->arch.mmu->direct_map; - arch.cr3 = vcpu->arch.mmu->get_cr3(vcpu); + arch.cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu); return kvm_setup_async_pf(vcpu, cr2_or_gpa, kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch); @@ -4929,7 +4929,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu) context->shadow_root_level = kvm_x86_ops->get_tdp_level(vcpu); context->direct_map = true; context->set_cr3 = kvm_x86_ops->set_tdp_cr3; - context->get_cr3 = get_cr3; + context->get_guest_pgd = get_cr3; context->get_pdptr = kvm_pdptr_read; context->inject_page_fault = kvm_inject_page_fault; @@ -5076,7 +5076,7 @@ static void init_kvm_softmmu(struct kvm_vcpu *vcpu) kvm_init_shadow_mmu(vcpu); context->set_cr3 = kvm_x86_ops->set_cr3; - context->get_cr3 = get_cr3; + context->get_guest_pgd = get_cr3; context->get_pdptr = kvm_pdptr_read; context->inject_page_fault = kvm_inject_page_fault; } @@ -5090,7 +5090,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu) return; g_context->mmu_role.as_u64 = new_role.as_u64; - g_context->get_cr3 = get_cr3; + g_context->get_guest_pgd = get_cr3; g_context->get_pdptr = kvm_pdptr_read; g_context->inject_page_fault = kvm_inject_page_fault; diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 6b15b58f3ecc..1ddbfff64ccc 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -333,7 +333,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, trace_kvm_mmu_pagetable_walk(addr, access); retry_walk: walker->level = mmu->root_level; - pte = mmu->get_cr3(vcpu); + pte = mmu->get_guest_pgd(vcpu); have_ad = PT_HAVE_ACCESSED_DIRTY(mmu); #if PTTYPE == 64 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 0d417276653b..48c9390011d0 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3012,7 +3012,7 @@ static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu) vcpu->arch.mmu = &vcpu->arch.guest_mmu; kvm_init_shadow_mmu(vcpu); vcpu->arch.mmu->set_cr3 = nested_svm_set_tdp_cr3; - vcpu->arch.mmu->get_cr3 = nested_svm_get_tdp_cr3; + vcpu->arch.mmu->get_guest_pgd = nested_svm_get_tdp_cr3; vcpu->arch.mmu->get_pdptr = nested_svm_get_tdp_pdptr; vcpu->arch.mmu->inject_page_fault = nested_svm_inject_npf_exit; vcpu->arch.mmu->shadow_root_level = get_npt_level(vcpu); diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index c1eae446d5da..b6719d7f4a34 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -355,7 +355,7 @@ static void nested_ept_init_mmu_context(struct kvm_vcpu *vcpu) nested_ept_ad_enabled(vcpu), nested_ept_get_eptp(vcpu)); vcpu->arch.mmu->set_cr3 = vmx_set_cr3; - vcpu->arch.mmu->get_cr3 = nested_ept_get_eptp; + vcpu->arch.mmu->get_guest_pgd = nested_ept_get_eptp; vcpu->arch.mmu->inject_page_fault = nested_ept_inject_page_fault; vcpu->arch.mmu->get_pdptr = kvm_pdptr_read; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b15092c9593d..ba4d476b79ad 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10165,7 +10165,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) return; if (!vcpu->arch.mmu->direct_map && - work->arch.cr3 != vcpu->arch.mmu->get_cr3(vcpu)) + work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu)) return; kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); -- cgit From 96d4701049a7b888e7c5837bf61e49d9bd5f889c Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 18:02:40 -0800 Subject: KVM: nVMX: Drop unnecessary check on ept caps for execute-only Drop the call to cpu_has_vmx_ept_execute_only() when calculating which EPT capabilities will be exposed to L1 for nested EPT. The resulting configuration is immediately sanitized by the passed in @ept_caps, and except for the call from vmx_check_processor_compat(), @ept_caps is the capabilities that are queried by cpu_has_vmx_ept_execute_only(). For vmx_check_processor_compat(), KVM *wants* to ignore vmx_capability.ept so that a divergence in EPT capabilities between CPUs is detected. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index b6719d7f4a34..79c7764c77b1 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -6133,10 +6133,9 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) VMX_EPT_PAGE_WALK_4_BIT | VMX_EPT_PAGE_WALK_5_BIT | VMX_EPTP_WB_BIT | - VMX_EPT_INVEPT_BIT; - if (cpu_has_vmx_ept_execute_only()) - msrs->ept_caps |= - VMX_EPT_EXECUTE_ONLY_BIT; + VMX_EPT_INVEPT_BIT | + VMX_EPT_EXECUTE_ONLY_BIT; + msrs->ept_caps &= ept_caps; msrs->ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT | VMX_EPT_EXTENT_CONTEXT_BIT | VMX_EPT_2MB_PAGE_BIT | -- cgit From abbed4fa94f69d2046c6f7c12f6ecabf195c553e Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 4 Mar 2020 16:24:22 -0800 Subject: KVM: x86: Fix warning due to implicit truncation on 32-bit KVM Explicitly cast the integer literal to an unsigned long when stuffing a non-canonical value into the host virtual address during private memslot deletion. The explicit cast fixes a warning that gets promoted to an error when running with KVM's newfangled -Werror setting. arch/x86/kvm/x86.c:9739:9: error: large integer implicitly truncated to unsigned type [-Werror=overflow] Fixes: a3e967c0b87d3 ("KVM: Terminate memslot walks via used_slots" Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ba4d476b79ad..fa03f31ab33c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9735,8 +9735,12 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size) if (!slot || !slot->npages) return 0; - /* Stuff a non-canonical value to catch use-after-delete. */ - hva = 0xdeadull << 48; + /* + * Stuff a non-canonical value to catch use-after-delete. This + * ends up being 0 on 32-bit KVM, but there's no better + * alternative. + */ + hva = (unsigned long)(0xdeadull << 48); old_npages = slot->npages; } -- cgit From 2e3bb4d886c72c0d5336a957ed66882e08f5c14b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 15:29:41 -0800 Subject: KVM: x86: Refactor I/O emulation helpers to provide vcpu-only variant Add variants of the I/O helpers that take a vCPU instead of an emulation context. This will eventually allow KVM to limit use of the emulation context to the full emulation path. Signed-off-by: Sean Christopherson Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 39 ++++++++++++++++++++++++--------------- 1 file changed, 24 insertions(+), 15 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa03f31ab33c..fbf68c305556 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5904,11 +5904,9 @@ static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size, return 0; } -static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt, - int size, unsigned short port, void *val, - unsigned int count) +static int emulator_pio_in(struct kvm_vcpu *vcpu, int size, + unsigned short port, void *val, unsigned int count) { - struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); int ret; if (vcpu->arch.pio.count) @@ -5928,17 +5926,30 @@ data_avail: return 0; } -static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt, - int size, unsigned short port, - const void *val, unsigned int count) +static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt, + int size, unsigned short port, void *val, + unsigned int count) { - struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); + return emulator_pio_in(emul_to_vcpu(ctxt), size, port, val, count); +} + +static int emulator_pio_out(struct kvm_vcpu *vcpu, int size, + unsigned short port, const void *val, + unsigned int count) +{ memcpy(vcpu->arch.pio_data, val, size * count); trace_kvm_pio(KVM_PIO_OUT, port, size, count, vcpu->arch.pio_data); return emulator_pio_in_out(vcpu, size, port, (void *)val, count, false); } +static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt, + int size, unsigned short port, + const void *val, unsigned int count) +{ + return emulator_pio_out(emul_to_vcpu(ctxt), size, port, val, count); +} + static unsigned long get_segment_base(struct kvm_vcpu *vcpu, int seg) { return kvm_x86_ops->get_segment_base(vcpu, seg); @@ -6891,8 +6902,8 @@ static int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port) { unsigned long val = kvm_rax_read(vcpu); - int ret = emulator_pio_out_emulated(&vcpu->arch.emulate_ctxt, - size, port, &val, 1); + int ret = emulator_pio_out(vcpu, size, port, &val, 1); + if (ret) return ret; @@ -6928,11 +6939,10 @@ static int complete_fast_pio_in(struct kvm_vcpu *vcpu) val = (vcpu->arch.pio.size < 4) ? kvm_rax_read(vcpu) : 0; /* - * Since vcpu->arch.pio.count == 1 let emulator_pio_in_emulated perform + * Since vcpu->arch.pio.count == 1 let emulator_pio_in perform * the copy and tracing */ - emulator_pio_in_emulated(&vcpu->arch.emulate_ctxt, vcpu->arch.pio.size, - vcpu->arch.pio.port, &val, 1); + emulator_pio_in(vcpu, vcpu->arch.pio.size, vcpu->arch.pio.port, &val, 1); kvm_rax_write(vcpu, val); return kvm_skip_emulated_instruction(vcpu); @@ -6947,8 +6957,7 @@ static int kvm_fast_pio_in(struct kvm_vcpu *vcpu, int size, /* For size less than 4 we merge, else we zero extend */ val = (size < 4) ? kvm_rax_read(vcpu) : 0; - ret = emulator_pio_in_emulated(&vcpu->arch.emulate_ctxt, size, port, - &val, 1); + ret = emulator_pio_in(vcpu, size, port, &val, 1); if (ret) { kvm_rax_write(vcpu, val); return ret; -- cgit From 21f1b8f29ea5b2301af7f2cc41a20b7b87a22bec Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 15:29:42 -0800 Subject: KVM: x86: Explicitly pass an exception struct to check_intercept Explicitly pass an exception struct when checking for intercept from the emulator, which eliminates the last reference to arch.emulate_ctxt in vendor specific code. Signed-off-by: Sean Christopherson Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 3 ++- arch/x86/kvm/svm.c | 3 ++- arch/x86/kvm/vmx/vmx.c | 8 ++++---- arch/x86/kvm/x86.c | 3 ++- 4 files changed, 10 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 316ec6cf532b..af4264498554 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1170,7 +1170,8 @@ struct kvm_x86_ops { int (*check_intercept)(struct kvm_vcpu *vcpu, struct x86_instruction_info *info, - enum x86_intercept_stage stage); + enum x86_intercept_stage stage, + struct x86_exception *exception); void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu, enum exit_fastpath_completion *exit_fastpath); bool (*mpx_supported)(void); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 48c9390011d0..7f32c407d682 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6175,7 +6175,8 @@ static const struct __x86_intercept { static int svm_check_intercept(struct kvm_vcpu *vcpu, struct x86_instruction_info *info, - enum x86_intercept_stage stage) + enum x86_intercept_stage stage, + struct x86_exception *exception) { struct vcpu_svm *svm = to_svm(vcpu); int vmexit, ret = X86EMUL_CONTINUE; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 743b81642ce2..57742ddfd854 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7174,10 +7174,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu, static int vmx_check_intercept(struct kvm_vcpu *vcpu, struct x86_instruction_info *info, - enum x86_intercept_stage stage) + enum x86_intercept_stage stage, + struct x86_exception *exception) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); - struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt; switch (info->intercept) { /* @@ -7186,8 +7186,8 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu, */ case x86_intercept_rdtscp: if (!nested_cpu_has2(vmcs12, SECONDARY_EXEC_RDTSCP)) { - ctxt->exception.vector = UD_VECTOR; - ctxt->exception.error_code_valid = false; + exception->vector = UD_VECTOR; + exception->error_code_valid = false; return X86EMUL_PROPAGATE_FAULT; } break; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fbf68c305556..762a68200b46 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6212,7 +6212,8 @@ static int emulator_intercept(struct x86_emulate_ctxt *ctxt, struct x86_instruction_info *info, enum x86_intercept_stage stage) { - return kvm_x86_ops->check_intercept(emul_to_vcpu(ctxt), info, stage); + return kvm_x86_ops->check_intercept(emul_to_vcpu(ctxt), info, stage, + &ctxt->exception); } static bool emulator_get_cpuid(struct x86_emulate_ctxt *ctxt, -- cgit From f0ed4760ed216fa0de52347289ded52be9a2c725 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 15:29:43 -0800 Subject: KVM: x86: Move emulation-only helpers to emulate.c Move ctxt_virt_addr_bits() and emul_is_noncanonical_address() from x86.h to emulate.c. This eliminates all references to struct x86_emulate_ctxt from x86.h, and sets the stage for a future patch to stop including kvm_emulate.h in asm/kvm_host.h. Signed-off-by: Sean Christopherson Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/emulate.c | 11 +++++++++++ arch/x86/kvm/x86.h | 11 ----------- 2 files changed, 11 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index dd19fb3539e0..929cd0bff5d0 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -665,6 +665,17 @@ static void set_segment_selector(struct x86_emulate_ctxt *ctxt, u16 selector, ctxt->ops->set_segment(ctxt, selector, &desc, base3, seg); } +static inline u8 ctxt_virt_addr_bits(struct x86_emulate_ctxt *ctxt) +{ + return (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_LA57) ? 57 : 48; +} + +static inline bool emul_is_noncanonical_address(u64 la, + struct x86_emulate_ctxt *ctxt) +{ + return get_canonical(la, ctxt_virt_addr_bits(ctxt)) != la; +} + /* * x86 defines three classes of vector instructions: explicitly * aligned, explicitly unaligned, and the rest, which change behaviour diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 3624665acee4..8409842a25d9 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -149,11 +149,6 @@ static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu) return kvm_read_cr4_bits(vcpu, X86_CR4_LA57) ? 57 : 48; } -static inline u8 ctxt_virt_addr_bits(struct x86_emulate_ctxt *ctxt) -{ - return (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_LA57) ? 57 : 48; -} - static inline u64 get_canonical(u64 la, u8 vaddr_bits) { return ((int64_t)la << (64 - vaddr_bits)) >> (64 - vaddr_bits); @@ -164,12 +159,6 @@ static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu) return get_canonical(la, vcpu_virt_addr_bits(vcpu)) != la; } -static inline bool emul_is_noncanonical_address(u64 la, - struct x86_emulate_ctxt *ctxt) -{ - return get_canonical(la, ctxt_virt_addr_bits(ctxt)) != la; -} - static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn, unsigned access) { -- cgit From c9b8b07cded58c55ad2bf67e68b9bfae96092293 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 15:29:48 -0800 Subject: KVM: x86: Dynamically allocate per-vCPU emulation context Allocate the emulation context instead of embedding it in struct kvm_vcpu_arch. Dynamic allocation provides several benefits: - Shrinks the size x86 vcpus by ~2.5k bytes, dropping them back below the PAGE_ALLOC_COSTLY_ORDER threshold. - Allows for dropping the include of kvm_emulate.h from asm/kvm_host.h and moving kvm_emulate.h into KVM's private directory. - Allows a reducing KVM's attack surface by shrinking the amount of vCPU data that is exposed to usercopy. - Allows a future patch to disable the emulator entirely, which may or may not be a realistic endeavor. Mark the entire struct as valid for usercopy to maintain existing behavior with respect to hardened usercopy. Future patches can shrink the usercopy range to cover only what is necessary. Signed-off-by: Sean Christopherson Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_emulate.h | 1 + arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/trace.h | 10 +++--- arch/x86/kvm/x86.c | 65 +++++++++++++++++++++++++++++++------- 4 files changed, 61 insertions(+), 17 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index bf5f5e476f65..3a66f80d7d00 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -301,6 +301,7 @@ struct fastop; typedef void (*fastop_t)(struct fastop *); struct x86_emulate_ctxt { + void *vcpu; const struct x86_emulate_ops *ops; /* Register state before/after emulation. */ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index af4264498554..03887ec21dd8 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -680,7 +680,7 @@ struct kvm_vcpu_arch { /* emulate context */ - struct x86_emulate_ctxt emulate_ctxt; + struct x86_emulate_ctxt *emulate_ctxt; bool emulate_regs_need_sync_to_vcpu; bool emulate_regs_need_sync_from_vcpu; int (*complete_userspace_io)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index f194dd058470..f5b8814d9f83 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -745,13 +745,13 @@ TRACE_EVENT(kvm_emulate_insn, TP_fast_assign( __entry->csbase = kvm_x86_ops->get_segment_base(vcpu, VCPU_SREG_CS); - __entry->len = vcpu->arch.emulate_ctxt.fetch.ptr - - vcpu->arch.emulate_ctxt.fetch.data; - __entry->rip = vcpu->arch.emulate_ctxt._eip - __entry->len; + __entry->len = vcpu->arch.emulate_ctxt->fetch.ptr + - vcpu->arch.emulate_ctxt->fetch.data; + __entry->rip = vcpu->arch.emulate_ctxt->_eip - __entry->len; memcpy(__entry->insn, - vcpu->arch.emulate_ctxt.fetch.data, + vcpu->arch.emulate_ctxt->fetch.data, 15); - __entry->flags = kei_decode_mode(vcpu->arch.emulate_ctxt.mode); + __entry->flags = kei_decode_mode(vcpu->arch.emulate_ctxt->mode); __entry->failed = failed; ), diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 762a68200b46..4465975f034b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -81,7 +81,7 @@ u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P; EXPORT_SYMBOL_GPL(kvm_mce_cap_supported); #define emul_to_vcpu(ctxt) \ - container_of(ctxt, struct kvm_vcpu, arch.emulate_ctxt) + ((struct kvm_vcpu *)(ctxt)->vcpu) /* EFER defaults: * - enable syscall per default because its emulated by KVM @@ -230,6 +230,19 @@ u64 __read_mostly host_xcr0; struct kmem_cache *x86_fpu_cache; EXPORT_SYMBOL_GPL(x86_fpu_cache); +static struct kmem_cache *x86_emulator_cache; + +static struct kmem_cache *kvm_alloc_emulator_cache(void) +{ + return kmem_cache_create_usercopy("x86_emulator", + sizeof(struct x86_emulate_ctxt), + __alignof__(struct x86_emulate_ctxt), + SLAB_ACCOUNT, + 0, + sizeof(struct x86_emulate_ctxt), + NULL); +} + static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt); static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu) @@ -5673,7 +5686,7 @@ static int emulator_read_write_onepage(unsigned long addr, void *val, int handled, ret; bool write = ops->write; struct kvm_mmio_fragment *frag; - struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt; + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; /* * If the exit was due to a NPF we may already have a GPA. @@ -6346,7 +6359,7 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask) static bool inject_emulated_exception(struct kvm_vcpu *vcpu) { - struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt; + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; if (ctxt->exception.vector == PF_VECTOR) return kvm_propagate_fault(vcpu, &ctxt->exception); @@ -6358,9 +6371,26 @@ static bool inject_emulated_exception(struct kvm_vcpu *vcpu) return false; } +static struct x86_emulate_ctxt *alloc_emulate_ctxt(struct kvm_vcpu *vcpu) +{ + struct x86_emulate_ctxt *ctxt; + + ctxt = kmem_cache_zalloc(x86_emulator_cache, GFP_KERNEL_ACCOUNT); + if (!ctxt) { + pr_err("kvm: failed to allocate vcpu's emulator\n"); + return NULL; + } + + ctxt->vcpu = vcpu; + ctxt->ops = &emulate_ops; + vcpu->arch.emulate_ctxt = ctxt; + + return ctxt; +} + static void init_emulate_ctxt(struct kvm_vcpu *vcpu) { - struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt; + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; int cs_db, cs_l; kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); @@ -6385,7 +6415,7 @@ static void init_emulate_ctxt(struct kvm_vcpu *vcpu) void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip) { - struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt; + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; int ret; init_emulate_ctxt(vcpu); @@ -6700,7 +6730,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int emulation_type, void *insn, int insn_len) { int r; - struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt; + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; bool writeback = true; bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable; @@ -7296,10 +7326,16 @@ int kvm_arch_init(void *opaque) goto out; } + x86_emulator_cache = kvm_alloc_emulator_cache(); + if (!x86_emulator_cache) { + pr_err("kvm: failed to allocate cache for x86 emulator\n"); + goto out_free_x86_fpu_cache; + } + shared_msrs = alloc_percpu(struct kvm_shared_msrs); if (!shared_msrs) { printk(KERN_ERR "kvm: failed to allocate percpu kvm_shared_msrs\n"); - goto out_free_x86_fpu_cache; + goto out_free_x86_emulator_cache; } r = kvm_mmu_module_init(); @@ -7332,6 +7368,8 @@ int kvm_arch_init(void *opaque) out_free_percpu: free_percpu(shared_msrs); +out_free_x86_emulator_cache: + kmem_cache_destroy(x86_emulator_cache); out_free_x86_fpu_cache: kmem_cache_destroy(x86_fpu_cache); out: @@ -8709,7 +8747,7 @@ static void __get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) * that usually, but some bad designed PV devices (vmware * backdoor interface) need this to work */ - emulator_writeback_register_cache(&vcpu->arch.emulate_ctxt); + emulator_writeback_register_cache(vcpu->arch.emulate_ctxt); vcpu->arch.emulate_regs_need_sync_to_vcpu = false; } regs->rax = kvm_rax_read(vcpu); @@ -8895,7 +8933,7 @@ out: int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index, int reason, bool has_error_code, u32 error_code) { - struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt; + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; int ret; init_emulate_ctxt(vcpu); @@ -9227,7 +9265,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) struct page *page; int r; - vcpu->arch.emulate_ctxt.ops = &emulate_ops; if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu)) vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; else @@ -9265,11 +9302,14 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) GFP_KERNEL_ACCOUNT)) goto fail_free_mce_banks; + if (!alloc_emulate_ctxt(vcpu)) + goto free_wbinvd_dirty_mask; + vcpu->arch.user_fpu = kmem_cache_zalloc(x86_fpu_cache, GFP_KERNEL_ACCOUNT); if (!vcpu->arch.user_fpu) { pr_err("kvm: failed to allocate userspace's fpu\n"); - goto free_wbinvd_dirty_mask; + goto free_emulate_ctxt; } vcpu->arch.guest_fpu = kmem_cache_zalloc(x86_fpu_cache, @@ -9311,6 +9351,8 @@ free_guest_fpu: kmem_cache_free(x86_fpu_cache, vcpu->arch.guest_fpu); free_user_fpu: kmem_cache_free(x86_fpu_cache, vcpu->arch.user_fpu); +free_emulate_ctxt: + kmem_cache_free(x86_emulator_cache, vcpu->arch.emulate_ctxt); free_wbinvd_dirty_mask: free_cpumask_var(vcpu->arch.wbinvd_dirty_mask); fail_free_mce_banks: @@ -9361,6 +9403,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu) kvm_x86_ops->vcpu_free(vcpu); + kmem_cache_free(x86_emulator_cache, vcpu->arch.emulate_ctxt); free_cpumask_var(vcpu->arch.wbinvd_dirty_mask); kmem_cache_free(x86_fpu_cache, vcpu->arch.user_fpu); kmem_cache_free(x86_fpu_cache, vcpu->arch.guest_fpu); -- cgit From 2f728d66e8a7d89d7cb141bf0acb30c61ae7ded5 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 15:29:49 -0800 Subject: KVM: x86: Move kvm_emulate.h into KVM's private directory Now that the emulation context is dynamically allocated and not embedded in struct kvm_vcpu, move its header, kvm_emulate.h, out of the public asm directory and into KVM's private x86 directory. Signed-off-by: Sean Christopherson Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_emulate.h | 480 ------------------------------------- arch/x86/include/asm/kvm_host.h | 7 +- arch/x86/kvm/emulate.c | 2 +- arch/x86/kvm/kvm_emulate.h | 480 +++++++++++++++++++++++++++++++++++++ arch/x86/kvm/mmu/mmu.c | 1 + arch/x86/kvm/x86.c | 1 + arch/x86/kvm/x86.h | 1 + 7 files changed, 488 insertions(+), 484 deletions(-) delete mode 100644 arch/x86/include/asm/kvm_emulate.h create mode 100644 arch/x86/kvm/kvm_emulate.h (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h deleted file mode 100644 index 3a66f80d7d00..000000000000 --- a/arch/x86/include/asm/kvm_emulate.h +++ /dev/null @@ -1,480 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -/****************************************************************************** - * x86_emulate.h - * - * Generic x86 (32-bit and 64-bit) instruction decoder and emulator. - * - * Copyright (c) 2005 Keir Fraser - * - * From: xen-unstable 10676:af9809f51f81a3c43f276f00c81a52ef558afda4 - */ - -#ifndef _ASM_X86_KVM_X86_EMULATE_H -#define _ASM_X86_KVM_X86_EMULATE_H - -#include - -struct x86_emulate_ctxt; -enum x86_intercept; -enum x86_intercept_stage; - -struct x86_exception { - u8 vector; - bool error_code_valid; - u16 error_code; - bool nested_page_fault; - u64 address; /* cr2 or nested page fault gpa */ - u8 async_page_fault; -}; - -/* - * This struct is used to carry enough information from the instruction - * decoder to main KVM so that a decision can be made whether the - * instruction needs to be intercepted or not. - */ -struct x86_instruction_info { - u8 intercept; /* which intercept */ - u8 rep_prefix; /* rep prefix? */ - u8 modrm_mod; /* mod part of modrm */ - u8 modrm_reg; /* index of register used */ - u8 modrm_rm; /* rm part of modrm */ - u64 src_val; /* value of source operand */ - u64 dst_val; /* value of destination operand */ - u8 src_bytes; /* size of source operand */ - u8 dst_bytes; /* size of destination operand */ - u8 ad_bytes; /* size of src/dst address */ - u64 next_rip; /* rip following the instruction */ -}; - -/* - * x86_emulate_ops: - * - * These operations represent the instruction emulator's interface to memory. - * There are two categories of operation: those that act on ordinary memory - * regions (*_std), and those that act on memory regions known to require - * special treatment or emulation (*_emulated). - * - * The emulator assumes that an instruction accesses only one 'emulated memory' - * location, that this location is the given linear faulting address (cr2), and - * that this is one of the instruction's data operands. Instruction fetches and - * stack operations are assumed never to access emulated memory. The emulator - * automatically deduces which operand of a string-move operation is accessing - * emulated memory, and assumes that the other operand accesses normal memory. - * - * NOTES: - * 1. The emulator isn't very smart about emulated vs. standard memory. - * 'Emulated memory' access addresses should be checked for sanity. - * 'Normal memory' accesses may fault, and the caller must arrange to - * detect and handle reentrancy into the emulator via recursive faults. - * Accesses may be unaligned and may cross page boundaries. - * 2. If the access fails (cannot emulate, or a standard access faults) then - * it is up to the memop to propagate the fault to the guest VM via - * some out-of-band mechanism, unknown to the emulator. The memop signals - * failure by returning X86EMUL_PROPAGATE_FAULT to the emulator, which will - * then immediately bail. - * 3. Valid access sizes are 1, 2, 4 and 8 bytes. On x86/32 systems only - * cmpxchg8b_emulated need support 8-byte accesses. - * 4. The emulator cannot handle 64-bit mode emulation on an x86/32 system. - */ -/* Access completed successfully: continue emulation as normal. */ -#define X86EMUL_CONTINUE 0 -/* Access is unhandleable: bail from emulation and return error to caller. */ -#define X86EMUL_UNHANDLEABLE 1 -/* Terminate emulation but return success to the caller. */ -#define X86EMUL_PROPAGATE_FAULT 2 /* propagate a generated fault to guest */ -#define X86EMUL_RETRY_INSTR 3 /* retry the instruction for some reason */ -#define X86EMUL_CMPXCHG_FAILED 4 /* cmpxchg did not see expected value */ -#define X86EMUL_IO_NEEDED 5 /* IO is needed to complete emulation */ -#define X86EMUL_INTERCEPTED 6 /* Intercepted by nested VMCB/VMCS */ - -struct x86_emulate_ops { - /* - * read_gpr: read a general purpose register (rax - r15) - * - * @reg: gpr number. - */ - ulong (*read_gpr)(struct x86_emulate_ctxt *ctxt, unsigned reg); - /* - * write_gpr: write a general purpose register (rax - r15) - * - * @reg: gpr number. - * @val: value to write. - */ - void (*write_gpr)(struct x86_emulate_ctxt *ctxt, unsigned reg, ulong val); - /* - * read_std: Read bytes of standard (non-emulated/special) memory. - * Used for descriptor reading. - * @addr: [IN ] Linear address from which to read. - * @val: [OUT] Value read from memory, zero-extended to 'u_long'. - * @bytes: [IN ] Number of bytes to read from memory. - * @system:[IN ] Whether the access is forced to be at CPL0. - */ - int (*read_std)(struct x86_emulate_ctxt *ctxt, - unsigned long addr, void *val, - unsigned int bytes, - struct x86_exception *fault, bool system); - - /* - * read_phys: Read bytes of standard (non-emulated/special) memory. - * Used for descriptor reading. - * @addr: [IN ] Physical address from which to read. - * @val: [OUT] Value read from memory. - * @bytes: [IN ] Number of bytes to read from memory. - */ - int (*read_phys)(struct x86_emulate_ctxt *ctxt, unsigned long addr, - void *val, unsigned int bytes); - - /* - * write_std: Write bytes of standard (non-emulated/special) memory. - * Used for descriptor writing. - * @addr: [IN ] Linear address to which to write. - * @val: [OUT] Value write to memory, zero-extended to 'u_long'. - * @bytes: [IN ] Number of bytes to write to memory. - * @system:[IN ] Whether the access is forced to be at CPL0. - */ - int (*write_std)(struct x86_emulate_ctxt *ctxt, - unsigned long addr, void *val, unsigned int bytes, - struct x86_exception *fault, bool system); - /* - * fetch: Read bytes of standard (non-emulated/special) memory. - * Used for instruction fetch. - * @addr: [IN ] Linear address from which to read. - * @val: [OUT] Value read from memory, zero-extended to 'u_long'. - * @bytes: [IN ] Number of bytes to read from memory. - */ - int (*fetch)(struct x86_emulate_ctxt *ctxt, - unsigned long addr, void *val, unsigned int bytes, - struct x86_exception *fault); - - /* - * read_emulated: Read bytes from emulated/special memory area. - * @addr: [IN ] Linear address from which to read. - * @val: [OUT] Value read from memory, zero-extended to 'u_long'. - * @bytes: [IN ] Number of bytes to read from memory. - */ - int (*read_emulated)(struct x86_emulate_ctxt *ctxt, - unsigned long addr, void *val, unsigned int bytes, - struct x86_exception *fault); - - /* - * write_emulated: Write bytes to emulated/special memory area. - * @addr: [IN ] Linear address to which to write. - * @val: [IN ] Value to write to memory (low-order bytes used as - * required). - * @bytes: [IN ] Number of bytes to write to memory. - */ - int (*write_emulated)(struct x86_emulate_ctxt *ctxt, - unsigned long addr, const void *val, - unsigned int bytes, - struct x86_exception *fault); - - /* - * cmpxchg_emulated: Emulate an atomic (LOCKed) CMPXCHG operation on an - * emulated/special memory area. - * @addr: [IN ] Linear address to access. - * @old: [IN ] Value expected to be current at @addr. - * @new: [IN ] Value to write to @addr. - * @bytes: [IN ] Number of bytes to access using CMPXCHG. - */ - int (*cmpxchg_emulated)(struct x86_emulate_ctxt *ctxt, - unsigned long addr, - const void *old, - const void *new, - unsigned int bytes, - struct x86_exception *fault); - void (*invlpg)(struct x86_emulate_ctxt *ctxt, ulong addr); - - int (*pio_in_emulated)(struct x86_emulate_ctxt *ctxt, - int size, unsigned short port, void *val, - unsigned int count); - - int (*pio_out_emulated)(struct x86_emulate_ctxt *ctxt, - int size, unsigned short port, const void *val, - unsigned int count); - - bool (*get_segment)(struct x86_emulate_ctxt *ctxt, u16 *selector, - struct desc_struct *desc, u32 *base3, int seg); - void (*set_segment)(struct x86_emulate_ctxt *ctxt, u16 selector, - struct desc_struct *desc, u32 base3, int seg); - unsigned long (*get_cached_segment_base)(struct x86_emulate_ctxt *ctxt, - int seg); - void (*get_gdt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt); - void (*get_idt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt); - void (*set_gdt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt); - void (*set_idt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt); - ulong (*get_cr)(struct x86_emulate_ctxt *ctxt, int cr); - int (*set_cr)(struct x86_emulate_ctxt *ctxt, int cr, ulong val); - int (*cpl)(struct x86_emulate_ctxt *ctxt); - int (*get_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong *dest); - int (*set_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong value); - u64 (*get_smbase)(struct x86_emulate_ctxt *ctxt); - void (*set_smbase)(struct x86_emulate_ctxt *ctxt, u64 smbase); - int (*set_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 data); - int (*get_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 *pdata); - int (*check_pmc)(struct x86_emulate_ctxt *ctxt, u32 pmc); - int (*read_pmc)(struct x86_emulate_ctxt *ctxt, u32 pmc, u64 *pdata); - void (*halt)(struct x86_emulate_ctxt *ctxt); - void (*wbinvd)(struct x86_emulate_ctxt *ctxt); - int (*fix_hypercall)(struct x86_emulate_ctxt *ctxt); - int (*intercept)(struct x86_emulate_ctxt *ctxt, - struct x86_instruction_info *info, - enum x86_intercept_stage stage); - - bool (*get_cpuid)(struct x86_emulate_ctxt *ctxt, u32 *eax, u32 *ebx, - u32 *ecx, u32 *edx, bool check_limit); - bool (*guest_has_long_mode)(struct x86_emulate_ctxt *ctxt); - bool (*guest_has_movbe)(struct x86_emulate_ctxt *ctxt); - bool (*guest_has_fxsr)(struct x86_emulate_ctxt *ctxt); - - void (*set_nmi_mask)(struct x86_emulate_ctxt *ctxt, bool masked); - - unsigned (*get_hflags)(struct x86_emulate_ctxt *ctxt); - void (*set_hflags)(struct x86_emulate_ctxt *ctxt, unsigned hflags); - int (*pre_leave_smm)(struct x86_emulate_ctxt *ctxt, - const char *smstate); - void (*post_leave_smm)(struct x86_emulate_ctxt *ctxt); - int (*set_xcr)(struct x86_emulate_ctxt *ctxt, u32 index, u64 xcr); -}; - -typedef u32 __attribute__((vector_size(16))) sse128_t; - -/* Type, address-of, and value of an instruction's operand. */ -struct operand { - enum { OP_REG, OP_MEM, OP_MEM_STR, OP_IMM, OP_XMM, OP_MM, OP_NONE } type; - unsigned int bytes; - unsigned int count; - union { - unsigned long orig_val; - u64 orig_val64; - }; - union { - unsigned long *reg; - struct segmented_address { - ulong ea; - unsigned seg; - } mem; - unsigned xmm; - unsigned mm; - } addr; - union { - unsigned long val; - u64 val64; - char valptr[sizeof(sse128_t)]; - sse128_t vec_val; - u64 mm_val; - void *data; - }; -}; - -struct fetch_cache { - u8 data[15]; - u8 *ptr; - u8 *end; -}; - -struct read_cache { - u8 data[1024]; - unsigned long pos; - unsigned long end; -}; - -/* Execution mode, passed to the emulator. */ -enum x86emul_mode { - X86EMUL_MODE_REAL, /* Real mode. */ - X86EMUL_MODE_VM86, /* Virtual 8086 mode. */ - X86EMUL_MODE_PROT16, /* 16-bit protected mode. */ - X86EMUL_MODE_PROT32, /* 32-bit protected mode. */ - X86EMUL_MODE_PROT64, /* 64-bit (long) mode. */ -}; - -/* These match some of the HF_* flags defined in kvm_host.h */ -#define X86EMUL_GUEST_MASK (1 << 5) /* VCPU is in guest-mode */ -#define X86EMUL_SMM_MASK (1 << 6) -#define X86EMUL_SMM_INSIDE_NMI_MASK (1 << 7) - -/* - * fastop functions are declared as taking a never-defined fastop parameter, - * so they can't be called from C directly. - */ -struct fastop; - -typedef void (*fastop_t)(struct fastop *); - -struct x86_emulate_ctxt { - void *vcpu; - const struct x86_emulate_ops *ops; - - /* Register state before/after emulation. */ - unsigned long eflags; - unsigned long eip; /* eip before instruction emulation */ - /* Emulated execution mode, represented by an X86EMUL_MODE value. */ - enum x86emul_mode mode; - - /* interruptibility state, as a result of execution of STI or MOV SS */ - int interruptibility; - - bool perm_ok; /* do not check permissions if true */ - bool ud; /* inject an #UD if host doesn't support insn */ - bool tf; /* TF value before instruction (after for syscall/sysret) */ - - bool have_exception; - struct x86_exception exception; - - /* GPA available */ - bool gpa_available; - gpa_t gpa_val; - - /* - * decode cache - */ - - /* current opcode length in bytes */ - u8 opcode_len; - u8 b; - u8 intercept; - u8 op_bytes; - u8 ad_bytes; - struct operand src; - struct operand src2; - struct operand dst; - union { - int (*execute)(struct x86_emulate_ctxt *ctxt); - fastop_t fop; - }; - int (*check_perm)(struct x86_emulate_ctxt *ctxt); - /* - * The following six fields are cleared together, - * the rest are initialized unconditionally in x86_decode_insn - * or elsewhere - */ - bool rip_relative; - u8 rex_prefix; - u8 lock_prefix; - u8 rep_prefix; - /* bitmaps of registers in _regs[] that can be read */ - u32 regs_valid; - /* bitmaps of registers in _regs[] that have been written */ - u32 regs_dirty; - /* modrm */ - u8 modrm; - u8 modrm_mod; - u8 modrm_reg; - u8 modrm_rm; - u8 modrm_seg; - u8 seg_override; - u64 d; - unsigned long _eip; - struct operand memop; - /* Fields above regs are cleared together. */ - unsigned long _regs[NR_VCPU_REGS]; - struct operand *memopp; - struct fetch_cache fetch; - struct read_cache io_read; - struct read_cache mem_read; -}; - -/* Repeat String Operation Prefix */ -#define REPE_PREFIX 0xf3 -#define REPNE_PREFIX 0xf2 - -/* CPUID vendors */ -#define X86EMUL_CPUID_VENDOR_AuthenticAMD_ebx 0x68747541 -#define X86EMUL_CPUID_VENDOR_AuthenticAMD_ecx 0x444d4163 -#define X86EMUL_CPUID_VENDOR_AuthenticAMD_edx 0x69746e65 - -#define X86EMUL_CPUID_VENDOR_AMDisbetterI_ebx 0x69444d41 -#define X86EMUL_CPUID_VENDOR_AMDisbetterI_ecx 0x21726574 -#define X86EMUL_CPUID_VENDOR_AMDisbetterI_edx 0x74656273 - -#define X86EMUL_CPUID_VENDOR_HygonGenuine_ebx 0x6f677948 -#define X86EMUL_CPUID_VENDOR_HygonGenuine_ecx 0x656e6975 -#define X86EMUL_CPUID_VENDOR_HygonGenuine_edx 0x6e65476e - -#define X86EMUL_CPUID_VENDOR_GenuineIntel_ebx 0x756e6547 -#define X86EMUL_CPUID_VENDOR_GenuineIntel_ecx 0x6c65746e -#define X86EMUL_CPUID_VENDOR_GenuineIntel_edx 0x49656e69 - -enum x86_intercept_stage { - X86_ICTP_NONE = 0, /* Allow zero-init to not match anything */ - X86_ICPT_PRE_EXCEPT, - X86_ICPT_POST_EXCEPT, - X86_ICPT_POST_MEMACCESS, -}; - -enum x86_intercept { - x86_intercept_none, - x86_intercept_cr_read, - x86_intercept_cr_write, - x86_intercept_clts, - x86_intercept_lmsw, - x86_intercept_smsw, - x86_intercept_dr_read, - x86_intercept_dr_write, - x86_intercept_lidt, - x86_intercept_sidt, - x86_intercept_lgdt, - x86_intercept_sgdt, - x86_intercept_lldt, - x86_intercept_sldt, - x86_intercept_ltr, - x86_intercept_str, - x86_intercept_rdtsc, - x86_intercept_rdpmc, - x86_intercept_pushf, - x86_intercept_popf, - x86_intercept_cpuid, - x86_intercept_rsm, - x86_intercept_iret, - x86_intercept_intn, - x86_intercept_invd, - x86_intercept_pause, - x86_intercept_hlt, - x86_intercept_invlpg, - x86_intercept_invlpga, - x86_intercept_vmrun, - x86_intercept_vmload, - x86_intercept_vmsave, - x86_intercept_vmmcall, - x86_intercept_stgi, - x86_intercept_clgi, - x86_intercept_skinit, - x86_intercept_rdtscp, - x86_intercept_icebp, - x86_intercept_wbinvd, - x86_intercept_monitor, - x86_intercept_mwait, - x86_intercept_rdmsr, - x86_intercept_wrmsr, - x86_intercept_in, - x86_intercept_ins, - x86_intercept_out, - x86_intercept_outs, - x86_intercept_xsetbv, - - nr_x86_intercepts -}; - -/* Host execution mode. */ -#if defined(CONFIG_X86_32) -#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT32 -#elif defined(CONFIG_X86_64) -#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT64 -#endif - -int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len); -bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt); -#define EMULATION_FAILED -1 -#define EMULATION_OK 0 -#define EMULATION_RESTART 1 -#define EMULATION_INTERCEPTED 2 -void init_decode_cache(struct x86_emulate_ctxt *ctxt); -int x86_emulate_insn(struct x86_emulate_ctxt *ctxt); -int emulator_task_switch(struct x86_emulate_ctxt *ctxt, - u16 tss_selector, int idt_index, int reason, - bool has_error_code, u32 error_code); -int emulate_int_real(struct x86_emulate_ctxt *ctxt, int irq); -void emulator_invalidate_register_cache(struct x86_emulate_ctxt *ctxt); -void emulator_writeback_register_cache(struct x86_emulate_ctxt *ctxt); -bool emulator_can_use_gpa(struct x86_emulate_ctxt *ctxt); - -#endif /* _ASM_X86_KVM_X86_EMULATE_H */ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 03887ec21dd8..6ac67b6d6692 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -185,7 +185,10 @@ enum exit_fastpath_completion { EXIT_FASTPATH_SKIP_EMUL_INS, }; -#include +struct x86_emulate_ctxt; +struct x86_exception; +enum x86_intercept; +enum x86_intercept_stage; #define KVM_NR_MEM_OBJS 40 @@ -1418,8 +1421,6 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data); int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu); int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu); -struct x86_emulate_ctxt; - int kvm_fast_pio(struct kvm_vcpu *vcpu, int size, unsigned short port, int in); int kvm_emulate_cpuid(struct kvm_vcpu *vcpu); int kvm_emulate_halt(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 929cd0bff5d0..0bef7762c502 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -20,7 +20,7 @@ #include #include "kvm_cache_regs.h" -#include +#include "kvm_emulate.h" #include #include #include diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h new file mode 100644 index 000000000000..3a66f80d7d00 --- /dev/null +++ b/arch/x86/kvm/kvm_emulate.h @@ -0,0 +1,480 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/****************************************************************************** + * x86_emulate.h + * + * Generic x86 (32-bit and 64-bit) instruction decoder and emulator. + * + * Copyright (c) 2005 Keir Fraser + * + * From: xen-unstable 10676:af9809f51f81a3c43f276f00c81a52ef558afda4 + */ + +#ifndef _ASM_X86_KVM_X86_EMULATE_H +#define _ASM_X86_KVM_X86_EMULATE_H + +#include + +struct x86_emulate_ctxt; +enum x86_intercept; +enum x86_intercept_stage; + +struct x86_exception { + u8 vector; + bool error_code_valid; + u16 error_code; + bool nested_page_fault; + u64 address; /* cr2 or nested page fault gpa */ + u8 async_page_fault; +}; + +/* + * This struct is used to carry enough information from the instruction + * decoder to main KVM so that a decision can be made whether the + * instruction needs to be intercepted or not. + */ +struct x86_instruction_info { + u8 intercept; /* which intercept */ + u8 rep_prefix; /* rep prefix? */ + u8 modrm_mod; /* mod part of modrm */ + u8 modrm_reg; /* index of register used */ + u8 modrm_rm; /* rm part of modrm */ + u64 src_val; /* value of source operand */ + u64 dst_val; /* value of destination operand */ + u8 src_bytes; /* size of source operand */ + u8 dst_bytes; /* size of destination operand */ + u8 ad_bytes; /* size of src/dst address */ + u64 next_rip; /* rip following the instruction */ +}; + +/* + * x86_emulate_ops: + * + * These operations represent the instruction emulator's interface to memory. + * There are two categories of operation: those that act on ordinary memory + * regions (*_std), and those that act on memory regions known to require + * special treatment or emulation (*_emulated). + * + * The emulator assumes that an instruction accesses only one 'emulated memory' + * location, that this location is the given linear faulting address (cr2), and + * that this is one of the instruction's data operands. Instruction fetches and + * stack operations are assumed never to access emulated memory. The emulator + * automatically deduces which operand of a string-move operation is accessing + * emulated memory, and assumes that the other operand accesses normal memory. + * + * NOTES: + * 1. The emulator isn't very smart about emulated vs. standard memory. + * 'Emulated memory' access addresses should be checked for sanity. + * 'Normal memory' accesses may fault, and the caller must arrange to + * detect and handle reentrancy into the emulator via recursive faults. + * Accesses may be unaligned and may cross page boundaries. + * 2. If the access fails (cannot emulate, or a standard access faults) then + * it is up to the memop to propagate the fault to the guest VM via + * some out-of-band mechanism, unknown to the emulator. The memop signals + * failure by returning X86EMUL_PROPAGATE_FAULT to the emulator, which will + * then immediately bail. + * 3. Valid access sizes are 1, 2, 4 and 8 bytes. On x86/32 systems only + * cmpxchg8b_emulated need support 8-byte accesses. + * 4. The emulator cannot handle 64-bit mode emulation on an x86/32 system. + */ +/* Access completed successfully: continue emulation as normal. */ +#define X86EMUL_CONTINUE 0 +/* Access is unhandleable: bail from emulation and return error to caller. */ +#define X86EMUL_UNHANDLEABLE 1 +/* Terminate emulation but return success to the caller. */ +#define X86EMUL_PROPAGATE_FAULT 2 /* propagate a generated fault to guest */ +#define X86EMUL_RETRY_INSTR 3 /* retry the instruction for some reason */ +#define X86EMUL_CMPXCHG_FAILED 4 /* cmpxchg did not see expected value */ +#define X86EMUL_IO_NEEDED 5 /* IO is needed to complete emulation */ +#define X86EMUL_INTERCEPTED 6 /* Intercepted by nested VMCB/VMCS */ + +struct x86_emulate_ops { + /* + * read_gpr: read a general purpose register (rax - r15) + * + * @reg: gpr number. + */ + ulong (*read_gpr)(struct x86_emulate_ctxt *ctxt, unsigned reg); + /* + * write_gpr: write a general purpose register (rax - r15) + * + * @reg: gpr number. + * @val: value to write. + */ + void (*write_gpr)(struct x86_emulate_ctxt *ctxt, unsigned reg, ulong val); + /* + * read_std: Read bytes of standard (non-emulated/special) memory. + * Used for descriptor reading. + * @addr: [IN ] Linear address from which to read. + * @val: [OUT] Value read from memory, zero-extended to 'u_long'. + * @bytes: [IN ] Number of bytes to read from memory. + * @system:[IN ] Whether the access is forced to be at CPL0. + */ + int (*read_std)(struct x86_emulate_ctxt *ctxt, + unsigned long addr, void *val, + unsigned int bytes, + struct x86_exception *fault, bool system); + + /* + * read_phys: Read bytes of standard (non-emulated/special) memory. + * Used for descriptor reading. + * @addr: [IN ] Physical address from which to read. + * @val: [OUT] Value read from memory. + * @bytes: [IN ] Number of bytes to read from memory. + */ + int (*read_phys)(struct x86_emulate_ctxt *ctxt, unsigned long addr, + void *val, unsigned int bytes); + + /* + * write_std: Write bytes of standard (non-emulated/special) memory. + * Used for descriptor writing. + * @addr: [IN ] Linear address to which to write. + * @val: [OUT] Value write to memory, zero-extended to 'u_long'. + * @bytes: [IN ] Number of bytes to write to memory. + * @system:[IN ] Whether the access is forced to be at CPL0. + */ + int (*write_std)(struct x86_emulate_ctxt *ctxt, + unsigned long addr, void *val, unsigned int bytes, + struct x86_exception *fault, bool system); + /* + * fetch: Read bytes of standard (non-emulated/special) memory. + * Used for instruction fetch. + * @addr: [IN ] Linear address from which to read. + * @val: [OUT] Value read from memory, zero-extended to 'u_long'. + * @bytes: [IN ] Number of bytes to read from memory. + */ + int (*fetch)(struct x86_emulate_ctxt *ctxt, + unsigned long addr, void *val, unsigned int bytes, + struct x86_exception *fault); + + /* + * read_emulated: Read bytes from emulated/special memory area. + * @addr: [IN ] Linear address from which to read. + * @val: [OUT] Value read from memory, zero-extended to 'u_long'. + * @bytes: [IN ] Number of bytes to read from memory. + */ + int (*read_emulated)(struct x86_emulate_ctxt *ctxt, + unsigned long addr, void *val, unsigned int bytes, + struct x86_exception *fault); + + /* + * write_emulated: Write bytes to emulated/special memory area. + * @addr: [IN ] Linear address to which to write. + * @val: [IN ] Value to write to memory (low-order bytes used as + * required). + * @bytes: [IN ] Number of bytes to write to memory. + */ + int (*write_emulated)(struct x86_emulate_ctxt *ctxt, + unsigned long addr, const void *val, + unsigned int bytes, + struct x86_exception *fault); + + /* + * cmpxchg_emulated: Emulate an atomic (LOCKed) CMPXCHG operation on an + * emulated/special memory area. + * @addr: [IN ] Linear address to access. + * @old: [IN ] Value expected to be current at @addr. + * @new: [IN ] Value to write to @addr. + * @bytes: [IN ] Number of bytes to access using CMPXCHG. + */ + int (*cmpxchg_emulated)(struct x86_emulate_ctxt *ctxt, + unsigned long addr, + const void *old, + const void *new, + unsigned int bytes, + struct x86_exception *fault); + void (*invlpg)(struct x86_emulate_ctxt *ctxt, ulong addr); + + int (*pio_in_emulated)(struct x86_emulate_ctxt *ctxt, + int size, unsigned short port, void *val, + unsigned int count); + + int (*pio_out_emulated)(struct x86_emulate_ctxt *ctxt, + int size, unsigned short port, const void *val, + unsigned int count); + + bool (*get_segment)(struct x86_emulate_ctxt *ctxt, u16 *selector, + struct desc_struct *desc, u32 *base3, int seg); + void (*set_segment)(struct x86_emulate_ctxt *ctxt, u16 selector, + struct desc_struct *desc, u32 base3, int seg); + unsigned long (*get_cached_segment_base)(struct x86_emulate_ctxt *ctxt, + int seg); + void (*get_gdt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt); + void (*get_idt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt); + void (*set_gdt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt); + void (*set_idt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt); + ulong (*get_cr)(struct x86_emulate_ctxt *ctxt, int cr); + int (*set_cr)(struct x86_emulate_ctxt *ctxt, int cr, ulong val); + int (*cpl)(struct x86_emulate_ctxt *ctxt); + int (*get_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong *dest); + int (*set_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong value); + u64 (*get_smbase)(struct x86_emulate_ctxt *ctxt); + void (*set_smbase)(struct x86_emulate_ctxt *ctxt, u64 smbase); + int (*set_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 data); + int (*get_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 *pdata); + int (*check_pmc)(struct x86_emulate_ctxt *ctxt, u32 pmc); + int (*read_pmc)(struct x86_emulate_ctxt *ctxt, u32 pmc, u64 *pdata); + void (*halt)(struct x86_emulate_ctxt *ctxt); + void (*wbinvd)(struct x86_emulate_ctxt *ctxt); + int (*fix_hypercall)(struct x86_emulate_ctxt *ctxt); + int (*intercept)(struct x86_emulate_ctxt *ctxt, + struct x86_instruction_info *info, + enum x86_intercept_stage stage); + + bool (*get_cpuid)(struct x86_emulate_ctxt *ctxt, u32 *eax, u32 *ebx, + u32 *ecx, u32 *edx, bool check_limit); + bool (*guest_has_long_mode)(struct x86_emulate_ctxt *ctxt); + bool (*guest_has_movbe)(struct x86_emulate_ctxt *ctxt); + bool (*guest_has_fxsr)(struct x86_emulate_ctxt *ctxt); + + void (*set_nmi_mask)(struct x86_emulate_ctxt *ctxt, bool masked); + + unsigned (*get_hflags)(struct x86_emulate_ctxt *ctxt); + void (*set_hflags)(struct x86_emulate_ctxt *ctxt, unsigned hflags); + int (*pre_leave_smm)(struct x86_emulate_ctxt *ctxt, + const char *smstate); + void (*post_leave_smm)(struct x86_emulate_ctxt *ctxt); + int (*set_xcr)(struct x86_emulate_ctxt *ctxt, u32 index, u64 xcr); +}; + +typedef u32 __attribute__((vector_size(16))) sse128_t; + +/* Type, address-of, and value of an instruction's operand. */ +struct operand { + enum { OP_REG, OP_MEM, OP_MEM_STR, OP_IMM, OP_XMM, OP_MM, OP_NONE } type; + unsigned int bytes; + unsigned int count; + union { + unsigned long orig_val; + u64 orig_val64; + }; + union { + unsigned long *reg; + struct segmented_address { + ulong ea; + unsigned seg; + } mem; + unsigned xmm; + unsigned mm; + } addr; + union { + unsigned long val; + u64 val64; + char valptr[sizeof(sse128_t)]; + sse128_t vec_val; + u64 mm_val; + void *data; + }; +}; + +struct fetch_cache { + u8 data[15]; + u8 *ptr; + u8 *end; +}; + +struct read_cache { + u8 data[1024]; + unsigned long pos; + unsigned long end; +}; + +/* Execution mode, passed to the emulator. */ +enum x86emul_mode { + X86EMUL_MODE_REAL, /* Real mode. */ + X86EMUL_MODE_VM86, /* Virtual 8086 mode. */ + X86EMUL_MODE_PROT16, /* 16-bit protected mode. */ + X86EMUL_MODE_PROT32, /* 32-bit protected mode. */ + X86EMUL_MODE_PROT64, /* 64-bit (long) mode. */ +}; + +/* These match some of the HF_* flags defined in kvm_host.h */ +#define X86EMUL_GUEST_MASK (1 << 5) /* VCPU is in guest-mode */ +#define X86EMUL_SMM_MASK (1 << 6) +#define X86EMUL_SMM_INSIDE_NMI_MASK (1 << 7) + +/* + * fastop functions are declared as taking a never-defined fastop parameter, + * so they can't be called from C directly. + */ +struct fastop; + +typedef void (*fastop_t)(struct fastop *); + +struct x86_emulate_ctxt { + void *vcpu; + const struct x86_emulate_ops *ops; + + /* Register state before/after emulation. */ + unsigned long eflags; + unsigned long eip; /* eip before instruction emulation */ + /* Emulated execution mode, represented by an X86EMUL_MODE value. */ + enum x86emul_mode mode; + + /* interruptibility state, as a result of execution of STI or MOV SS */ + int interruptibility; + + bool perm_ok; /* do not check permissions if true */ + bool ud; /* inject an #UD if host doesn't support insn */ + bool tf; /* TF value before instruction (after for syscall/sysret) */ + + bool have_exception; + struct x86_exception exception; + + /* GPA available */ + bool gpa_available; + gpa_t gpa_val; + + /* + * decode cache + */ + + /* current opcode length in bytes */ + u8 opcode_len; + u8 b; + u8 intercept; + u8 op_bytes; + u8 ad_bytes; + struct operand src; + struct operand src2; + struct operand dst; + union { + int (*execute)(struct x86_emulate_ctxt *ctxt); + fastop_t fop; + }; + int (*check_perm)(struct x86_emulate_ctxt *ctxt); + /* + * The following six fields are cleared together, + * the rest are initialized unconditionally in x86_decode_insn + * or elsewhere + */ + bool rip_relative; + u8 rex_prefix; + u8 lock_prefix; + u8 rep_prefix; + /* bitmaps of registers in _regs[] that can be read */ + u32 regs_valid; + /* bitmaps of registers in _regs[] that have been written */ + u32 regs_dirty; + /* modrm */ + u8 modrm; + u8 modrm_mod; + u8 modrm_reg; + u8 modrm_rm; + u8 modrm_seg; + u8 seg_override; + u64 d; + unsigned long _eip; + struct operand memop; + /* Fields above regs are cleared together. */ + unsigned long _regs[NR_VCPU_REGS]; + struct operand *memopp; + struct fetch_cache fetch; + struct read_cache io_read; + struct read_cache mem_read; +}; + +/* Repeat String Operation Prefix */ +#define REPE_PREFIX 0xf3 +#define REPNE_PREFIX 0xf2 + +/* CPUID vendors */ +#define X86EMUL_CPUID_VENDOR_AuthenticAMD_ebx 0x68747541 +#define X86EMUL_CPUID_VENDOR_AuthenticAMD_ecx 0x444d4163 +#define X86EMUL_CPUID_VENDOR_AuthenticAMD_edx 0x69746e65 + +#define X86EMUL_CPUID_VENDOR_AMDisbetterI_ebx 0x69444d41 +#define X86EMUL_CPUID_VENDOR_AMDisbetterI_ecx 0x21726574 +#define X86EMUL_CPUID_VENDOR_AMDisbetterI_edx 0x74656273 + +#define X86EMUL_CPUID_VENDOR_HygonGenuine_ebx 0x6f677948 +#define X86EMUL_CPUID_VENDOR_HygonGenuine_ecx 0x656e6975 +#define X86EMUL_CPUID_VENDOR_HygonGenuine_edx 0x6e65476e + +#define X86EMUL_CPUID_VENDOR_GenuineIntel_ebx 0x756e6547 +#define X86EMUL_CPUID_VENDOR_GenuineIntel_ecx 0x6c65746e +#define X86EMUL_CPUID_VENDOR_GenuineIntel_edx 0x49656e69 + +enum x86_intercept_stage { + X86_ICTP_NONE = 0, /* Allow zero-init to not match anything */ + X86_ICPT_PRE_EXCEPT, + X86_ICPT_POST_EXCEPT, + X86_ICPT_POST_MEMACCESS, +}; + +enum x86_intercept { + x86_intercept_none, + x86_intercept_cr_read, + x86_intercept_cr_write, + x86_intercept_clts, + x86_intercept_lmsw, + x86_intercept_smsw, + x86_intercept_dr_read, + x86_intercept_dr_write, + x86_intercept_lidt, + x86_intercept_sidt, + x86_intercept_lgdt, + x86_intercept_sgdt, + x86_intercept_lldt, + x86_intercept_sldt, + x86_intercept_ltr, + x86_intercept_str, + x86_intercept_rdtsc, + x86_intercept_rdpmc, + x86_intercept_pushf, + x86_intercept_popf, + x86_intercept_cpuid, + x86_intercept_rsm, + x86_intercept_iret, + x86_intercept_intn, + x86_intercept_invd, + x86_intercept_pause, + x86_intercept_hlt, + x86_intercept_invlpg, + x86_intercept_invlpga, + x86_intercept_vmrun, + x86_intercept_vmload, + x86_intercept_vmsave, + x86_intercept_vmmcall, + x86_intercept_stgi, + x86_intercept_clgi, + x86_intercept_skinit, + x86_intercept_rdtscp, + x86_intercept_icebp, + x86_intercept_wbinvd, + x86_intercept_monitor, + x86_intercept_mwait, + x86_intercept_rdmsr, + x86_intercept_wrmsr, + x86_intercept_in, + x86_intercept_ins, + x86_intercept_out, + x86_intercept_outs, + x86_intercept_xsetbv, + + nr_x86_intercepts +}; + +/* Host execution mode. */ +#if defined(CONFIG_X86_32) +#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT32 +#elif defined(CONFIG_X86_64) +#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT64 +#endif + +int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len); +bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt); +#define EMULATION_FAILED -1 +#define EMULATION_OK 0 +#define EMULATION_RESTART 1 +#define EMULATION_INTERCEPTED 2 +void init_decode_cache(struct x86_emulate_ctxt *ctxt); +int x86_emulate_insn(struct x86_emulate_ctxt *ctxt); +int emulator_task_switch(struct x86_emulate_ctxt *ctxt, + u16 tss_selector, int idt_index, int reason, + bool has_error_code, u32 error_code); +int emulate_int_real(struct x86_emulate_ctxt *ctxt, int irq); +void emulator_invalidate_register_cache(struct x86_emulate_ctxt *ctxt); +void emulator_writeback_register_cache(struct x86_emulate_ctxt *ctxt); +bool emulator_can_use_gpa(struct x86_emulate_ctxt *ctxt); + +#endif /* _ASM_X86_KVM_X86_EMULATE_H */ diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index a1f4e325420e..050da022f96c 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -19,6 +19,7 @@ #include "mmu.h" #include "x86.h" #include "kvm_cache_regs.h" +#include "kvm_emulate.h" #include "cpuid.h" #include diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4465975f034b..f55899055467 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -22,6 +22,7 @@ #include "i8254.h" #include "tss.h" #include "kvm_cache_regs.h" +#include "kvm_emulate.h" #include "x86.h" #include "cpuid.h" #include "pmu.h" diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 8409842a25d9..f3c6e55eb5d9 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -5,6 +5,7 @@ #include #include #include "kvm_cache_regs.h" +#include "kvm_emulate.h" #define KVM_DEFAULT_PLE_GAP 128 #define KVM_VMX_DEFAULT_PLE_WINDOW 4096 -- cgit From 06add254c7f3b7f6fdfe04eb028aaabe5b27a734 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 18 Feb 2020 15:29:50 -0800 Subject: KVM: x86: Shrink the usercopy region of the emulation context Shuffle a few operand structs to the end of struct x86_emulate_ctxt and update the cache creation to whitelist only the region of the emulation context that is expected to be copied to/from user memory, e.g. the instruction operands, registers, and fetch/io/mem caches. Signed-off-by: Sean Christopherson Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/kvm_emulate.h | 8 +++++--- arch/x86/kvm/x86.c | 12 ++++++------ 2 files changed, 11 insertions(+), 9 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h index 3a66f80d7d00..f86fe6bf170d 100644 --- a/arch/x86/kvm/kvm_emulate.h +++ b/arch/x86/kvm/kvm_emulate.h @@ -334,9 +334,6 @@ struct x86_emulate_ctxt { u8 intercept; u8 op_bytes; u8 ad_bytes; - struct operand src; - struct operand src2; - struct operand dst; union { int (*execute)(struct x86_emulate_ctxt *ctxt); fastop_t fop; @@ -364,6 +361,11 @@ struct x86_emulate_ctxt { u8 seg_override; u64 d; unsigned long _eip; + + /* Here begins the usercopy section. */ + struct operand src; + struct operand src2; + struct operand dst; struct operand memop; /* Fields above regs are cleared together. */ unsigned long _regs[NR_VCPU_REGS]; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f55899055467..a69f7bf020d9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -235,13 +235,13 @@ static struct kmem_cache *x86_emulator_cache; static struct kmem_cache *kvm_alloc_emulator_cache(void) { - return kmem_cache_create_usercopy("x86_emulator", - sizeof(struct x86_emulate_ctxt), + unsigned int useroffset = offsetof(struct x86_emulate_ctxt, src); + unsigned int size = sizeof(struct x86_emulate_ctxt); + + return kmem_cache_create_usercopy("x86_emulator", size, __alignof__(struct x86_emulate_ctxt), - SLAB_ACCOUNT, - 0, - sizeof(struct x86_emulate_ctxt), - NULL); + SLAB_ACCOUNT, useroffset, + size - useroffset, NULL); } static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt); -- cgit From 68c9a46e9ee8e9f673c7e0d1bbf483a289b5a86f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:04 -0800 Subject: KVM: x86: Return -E2BIG when KVM_GET_SUPPORTED_CPUID hits max entries Fix a long-standing bug that causes KVM to return 0 instead of -E2BIG when userspace's array is insufficiently sized. This technically breaks backwards compatibility, e.g. a userspace with a hardcoded cpuid->nent could theoretically be broken as it would see an error instead of success if cpuid->nent is less than the number of entries required to fully enumerate the host CPU. But, the lowest known cpuid->nent hardcoded by a VMM is 100 (lkvm and selftests), and the limit for current processors on Intel and AMD is well under a 100. E.g. Intel's Icelake server with all the bells and whistles tops out at ~60 entries (variable due to SGX sub-leafs), and AMD's CPUID documentation allows for less than 50. CPUID 0xD sub-leaves on current kernels are capped by the value of KVM_SUPPORTED_XCR0, and therefore so many subleaves cannot have appeared on current kernels. Note, while the Fixes: tag is accurate with respect to the immediate bug, it's likely that similar bugs in KVM_GET_SUPPORTED_CPUID existed prior to the refactoring, e.g. Qemu contains a workaround for the broken KVM_GET_SUPPORTED_CPUID behavior that predates the buggy commit by over two years. The Qemu workaround is also likely the main reason the bug has gone unreported for so long. Qemu hack: commit 76ae317f7c16aec6b469604b1764094870a75470 Author: Mark McLoughlin Date: Tue May 19 18:55:21 2009 +0100 kvm: work around supported cpuid ioctl() brokenness KVM_GET_SUPPORTED_CPUID has been known to fail to return -E2BIG when it runs out of entries. Detect this by always trying again with a bigger table if the ioctl() fills the table. Fixes: 831bf664e9c1f ("KVM: Refactor and simplify kvm_dev_ioctl_get_supported_cpuid") Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index b1c469446b07..47ce04762c20 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -908,9 +908,14 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, goto out_free; limit = cpuid_entries[nent - 1].eax; - for (func = ent->func + 1; func <= limit && nent < cpuid->nent && r == 0; ++func) + for (func = ent->func + 1; func <= limit && r == 0; ++func) { + if (nent >= cpuid->nent) { + r = -E2BIG; + goto out_free; + } r = do_cpuid_func(&cpuid_entries[nent], func, &nent, cpuid->nent, type); + } if (r) goto out_free; -- cgit From 619a17f11069ce3d338e00a621f57dd8dec164c2 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:05 -0800 Subject: KVM: x86: Refactor loop around do_cpuid_func() to separate helper Move the guts of kvm_dev_ioctl_get_cpuid()'s CPUID func loop to a separate helper to improve code readability and pave the way for future cleanup. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 45 +++++++++++++++++++++++++++------------------ 1 file changed, 27 insertions(+), 18 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 47ce04762c20..f49fdd06f511 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -839,6 +839,29 @@ static bool is_centaur_cpu(const struct kvm_cpuid_param *param) return boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR; } +static int get_cpuid_func(struct kvm_cpuid_entry2 *entries, u32 func, + int *nent, int maxnent, unsigned int type) +{ + u32 limit; + int r; + + r = do_cpuid_func(&entries[*nent], func, nent, maxnent, type); + if (r) + return r; + + limit = entries[*nent - 1].eax; + for (func = func + 1; func <= limit; ++func) { + if (*nent >= maxnent) + return -E2BIG; + + r = do_cpuid_func(&entries[*nent], func, nent, maxnent, type); + if (r) + break; + } + + return r; +} + static bool sanity_check_entries(struct kvm_cpuid_entry2 __user *entries, __u32 num_entries, unsigned int ioctl_type) { @@ -871,8 +894,8 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, unsigned int type) { struct kvm_cpuid_entry2 *cpuid_entries; - int limit, nent = 0, r = -E2BIG, i; - u32 func; + int nent = 0, r = -E2BIG, i; + static const struct kvm_cpuid_param param[] = { { .func = 0 }, { .func = 0x80000000 }, @@ -901,22 +924,8 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, if (ent->qualifier && !ent->qualifier(ent)) continue; - r = do_cpuid_func(&cpuid_entries[nent], ent->func, - &nent, cpuid->nent, type); - - if (r) - goto out_free; - - limit = cpuid_entries[nent - 1].eax; - for (func = ent->func + 1; func <= limit && r == 0; ++func) { - if (nent >= cpuid->nent) { - r = -E2BIG; - goto out_free; - } - r = do_cpuid_func(&cpuid_entries[nent], func, - &nent, cpuid->nent, type); - } - + r = get_cpuid_func(cpuid_entries, ent->func, &nent, + cpuid->nent, type); if (r) goto out_free; } -- cgit From 8b86079cc339a7f1f94e92d8469d7fb69f8879c2 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:06 -0800 Subject: KVM: x86: Simplify handling of Centaur CPUID leafs Refactor the handling of the Centaur-only CPUID leaf to detect the leaf via a runtime query instead of adding a one-off callback in the static array. When the callback was introduced, there were additional fields in the array's structs, and more importantly, retpoline wasn't a thing. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 32 ++++++++++---------------------- 1 file changed, 10 insertions(+), 22 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index f49fdd06f511..de52cbb46171 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -829,15 +829,7 @@ static int do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 func, return __do_cpuid_func(entry, func, nent, maxnent); } -struct kvm_cpuid_param { - u32 func; - bool (*qualifier)(const struct kvm_cpuid_param *param); -}; - -static bool is_centaur_cpu(const struct kvm_cpuid_param *param) -{ - return boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR; -} +#define CENTAUR_CPUID_SIGNATURE 0xC0000000 static int get_cpuid_func(struct kvm_cpuid_entry2 *entries, u32 func, int *nent, int maxnent, unsigned int type) @@ -845,6 +837,10 @@ static int get_cpuid_func(struct kvm_cpuid_entry2 *entries, u32 func, u32 limit; int r; + if (func == CENTAUR_CPUID_SIGNATURE && + boot_cpu_data.x86_vendor != X86_VENDOR_CENTAUR) + return 0; + r = do_cpuid_func(&entries[*nent], func, nent, maxnent, type); if (r) return r; @@ -896,11 +892,8 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, struct kvm_cpuid_entry2 *cpuid_entries; int nent = 0, r = -E2BIG, i; - static const struct kvm_cpuid_param param[] = { - { .func = 0 }, - { .func = 0x80000000 }, - { .func = 0xC0000000, .qualifier = is_centaur_cpu }, - { .func = KVM_CPUID_SIGNATURE }, + static const u32 funcs[] = { + 0, 0x80000000, CENTAUR_CPUID_SIGNATURE, KVM_CPUID_SIGNATURE, }; if (cpuid->nent < 1) @@ -918,14 +911,9 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, goto out; r = 0; - for (i = 0; i < ARRAY_SIZE(param); i++) { - const struct kvm_cpuid_param *ent = ¶m[i]; - - if (ent->qualifier && !ent->qualifier(ent)) - continue; - - r = get_cpuid_func(cpuid_entries, ent->func, &nent, - cpuid->nent, type); + for (i = 0; i < ARRAY_SIZE(funcs); i++) { + r = get_cpuid_func(cpuid_entries, funcs[i], &nent, cpuid->nent, + type); if (r) goto out_free; } -- cgit From d5a661d19df15cc618bb69e43cab3edb06d2021f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:07 -0800 Subject: KVM: x86: Clean up error handling in kvm_dev_ioctl_get_cpuid() Clean up the error handling in kvm_dev_ioctl_get_cpuid(), which has gotten a bit crusty as the function has evolved over the years. Opportunistically hoist the static @funcs declaration to the top of the function to make it more obvious that it's a "static const". No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 19 +++++++------------ 1 file changed, 7 insertions(+), 12 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index de52cbb46171..11d5f311ef10 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -889,45 +889,40 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, struct kvm_cpuid_entry2 __user *entries, unsigned int type) { - struct kvm_cpuid_entry2 *cpuid_entries; - int nent = 0, r = -E2BIG, i; - static const u32 funcs[] = { 0, 0x80000000, CENTAUR_CPUID_SIGNATURE, KVM_CPUID_SIGNATURE, }; + struct kvm_cpuid_entry2 *cpuid_entries; + int nent = 0, r, i; + if (cpuid->nent < 1) - goto out; + return -E2BIG; if (cpuid->nent > KVM_MAX_CPUID_ENTRIES) cpuid->nent = KVM_MAX_CPUID_ENTRIES; if (sanity_check_entries(entries, cpuid->nent, type)) return -EINVAL; - r = -ENOMEM; cpuid_entries = vzalloc(array_size(sizeof(struct kvm_cpuid_entry2), cpuid->nent)); if (!cpuid_entries) - goto out; + return -ENOMEM; - r = 0; for (i = 0; i < ARRAY_SIZE(funcs); i++) { r = get_cpuid_func(cpuid_entries, funcs[i], &nent, cpuid->nent, type); if (r) goto out_free; } + cpuid->nent = nent; - r = -EFAULT; if (copy_to_user(entries, cpuid_entries, nent * sizeof(struct kvm_cpuid_entry2))) - goto out_free; - cpuid->nent = nent; - r = 0; + r = -EFAULT; out_free: vfree(cpuid_entries); -out: return r; } -- cgit From 0fc62671876c29a80c16d41cbce782ff10795bef Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:08 -0800 Subject: KVM: x86: Check userspace CPUID array size after validating sub-leaf Verify that the next sub-leaf of CPUID 0x4 (or 0x8000001d) is valid before rejecting the entire KVM_GET_SUPPORTED_CPUID due to insufficent space in the userspace array. Note, although this is technically a bug, it's not visible to userspace as KVM_GET_SUPPORTED_CPUID is guaranteed to fail on KVM_CPUID_SIGNATURE, which is hardcoded to be added after the affected leafs. The real motivation for the change is to tightly couple the nent/maxnent and do_host_cpuid() sequences in preparation for future cleanup. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 11d5f311ef10..e5cf1e0cf84a 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -552,12 +552,12 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, /* read more entries until cache_type is zero */ for (i = 1; ; ++i) { - if (*nent >= maxnent) - goto out; - cache_type = entry[i - 1].eax & 0x1f; if (!cache_type) break; + + if (*nent >= maxnent) + goto out; do_host_cpuid(&entry[i], function, i); ++*nent; } -- cgit From 3dc4a9cf05e53c358816f733e383ec42c86f3d4c Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:09 -0800 Subject: KVM: x86: Move CPUID 0xD.1 handling out of the index>0 loop Mov the sub-leaf 1 handling for CPUID 0xD out of the index>0 loop so that the loop only handles index>2. Sub-leafs 2+ have identical semantics, whereas sub-leaf 1 is effectively a feature sub-leaf. Moving sub-leaf 1 out of the loop does duplicate a bit of code, but the nent/maxnent code will be consolidated in a future patch, and duplicating the clear of ECX/EDX is arguably a good thing as the reasons for clearing said registers are completely different. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 37 ++++++++++++++++++++++--------------- 1 file changed, 22 insertions(+), 15 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index e5cf1e0cf84a..fc8540596386 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -653,26 +653,33 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, if (!supported) break; - for (idx = 1, i = 1; idx < 64; ++idx) { + if (*nent >= maxnent) + goto out; + + do_host_cpuid(&entry[1], function, 1); + ++*nent; + + entry[1].eax &= kvm_cpuid_D_1_eax_x86_features; + cpuid_mask(&entry[1].eax, CPUID_D_1_EAX); + if (entry[1].eax & (F(XSAVES)|F(XSAVEC))) + entry[1].ebx = xstate_required_size(supported, true); + else + entry[1].ebx = 0; + /* Saving XSS controlled state via XSAVES isn't supported. */ + entry[1].ecx = 0; + entry[1].edx = 0; + + for (idx = 2, i = 2; idx < 64; ++idx) { u64 mask = ((u64)1 << idx); + if (*nent >= maxnent) goto out; do_host_cpuid(&entry[i], function, idx); - if (idx == 1) { - entry[i].eax &= kvm_cpuid_D_1_eax_x86_features; - cpuid_mask(&entry[i].eax, CPUID_D_1_EAX); - entry[i].ebx = 0; - if (entry[i].eax & (F(XSAVES)|F(XSAVEC))) - entry[i].ebx = - xstate_required_size(supported, - true); - } else { - if (entry[i].eax == 0 || !(supported & mask)) - continue; - if (WARN_ON_ONCE(entry[i].ecx & 1)) - continue; - } + if (entry[i].eax == 0 || !(supported & mask)) + continue; + if (WARN_ON_ONCE(entry[i].ecx & 1)) + continue; entry[i].ecx = 0; entry[i].edx = 0; ++*nent; -- cgit From 1893c9415ae8cad20da863c41bdf308e56a2dd67 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:10 -0800 Subject: KVM: x86: Check for CPUID 0xD.N support before validating array size Now that sub-leaf 1 is handled separately, verify the next sub-leaf is needed before rejecting KVM_GET_SUPPORTED_CPUID due to an insufficiently sized userspace array. Note, although this is technically a bug, it's not visible to userspace as KVM_GET_SUPPORTED_CPUID is guaranteed to fail on KVM_CPUID_SIGNATURE, which is hardcoded to be added after leaf 0xD. The real motivation for the change is to tightly couple the nent/maxnent and do_host_cpuid() sequences in preparation for future cleanup. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index fc8540596386..fd9b29aa7abc 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -670,13 +670,14 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, entry[1].edx = 0; for (idx = 2, i = 2; idx < 64; ++idx) { - u64 mask = ((u64)1 << idx); + if (!(supported & BIT_ULL(idx))) + continue; if (*nent >= maxnent) goto out; do_host_cpuid(&entry[i], function, idx); - if (entry[i].eax == 0 || !(supported & mask)) + if (entry[i].eax == 0) continue; if (WARN_ON_ONCE(entry[i].ecx & 1)) continue; -- cgit From 91001d403ad39b9c800a3b805bf48c5b027ecac5 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:11 -0800 Subject: KVM: x86: Warn on zero-size save state for valid CPUID 0xD.N sub-leaf WARN if the save state size for a valid XCR0-managed sub-leaf is zero, which would indicate a KVM or CPU bug. Add a comment to explain why KVM WARNs so the reader doesn't have to tease out the relevant bits from Intel's SDM and KVM's XCR0/XSS code. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index fd9b29aa7abc..424dde41cb5d 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -677,10 +677,17 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, goto out; do_host_cpuid(&entry[i], function, idx); - if (entry[i].eax == 0) - continue; - if (WARN_ON_ONCE(entry[i].ecx & 1)) + + /* + * The @supported check above should have filtered out + * invalid sub-leafs as well as sub-leafs managed by + * IA32_XSS MSR. Only XCR0-managed sub-leafs should + * reach this point, and they should have a non-zero + * save state size. + */ + if (WARN_ON_ONCE(!entry[i].eax || (entry[i].ecx & 1))) continue; + entry[i].ecx = 0; entry[i].edx = 0; ++*nent; -- cgit From 8b2fc445a7619fec55740d0a5785aae4bd1058f3 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:12 -0800 Subject: KVM: x86: Refactor CPUID 0xD.N sub-leaf entry creation Increment the number of CPUID entries immediately after do_host_cpuid() in preparation for moving the logic into do_host_cpuid(). Handle the rare/impossible case of encountering a bogus sub-leaf by decrementing the number entries on failure. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 424dde41cb5d..6e1685a16cca 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -677,6 +677,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, goto out; do_host_cpuid(&entry[i], function, idx); + ++*nent; /* * The @supported check above should have filtered out @@ -685,12 +686,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, * reach this point, and they should have a non-zero * save state size. */ - if (WARN_ON_ONCE(!entry[i].eax || (entry[i].ecx & 1))) + if (WARN_ON_ONCE(!entry[i].eax || (entry[i].ecx & 1))) { + --*nent; continue; + } entry[i].ecx = 0; entry[i].edx = 0; - ++*nent; ++i; } break; -- cgit From 87849b1ccbd404eff78b54d8e4dd831e4a905cbb Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:13 -0800 Subject: KVM: x86: Clean up CPUID 0x7 sub-leaf loop Refactor the sub-leaf loop for CPUID 0x7 to move the main leaf out of said loop. The emitted code savings is basically a mirage, as the handling of the main leaf can easily be split to its own helper to avoid code bloat. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 6e1685a16cca..b626893a11d5 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -573,16 +573,16 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, case 7: { int i; - for (i = 0; ; ) { - do_cpuid_7_mask(&entry[i], i); - if (i == entry->eax) - break; + do_cpuid_7_mask(entry, 0); + + for (i = 1; i <= entry->eax; i++) { if (*nent >= maxnent) goto out; - ++i; do_host_cpuid(&entry[i], function, i); ++*nent; + + do_cpuid_7_mask(&entry[i], i); } break; } -- cgit From aceac6e5700f4bdfb35fd558ae60176be24aa482 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:14 -0800 Subject: KVM: x86: Drop the explicit @index from do_cpuid_7_mask() Drop the index param from do_cpuid_7_mask() and instead switch on the entry's index, which is guaranteed to be set by do_host_cpuid(). No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index b626893a11d5..fd04f17d1836 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -346,7 +346,7 @@ static int __do_cpuid_func_emulated(struct kvm_cpuid_entry2 *entry, return 0; } -static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry, int index) +static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) { unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0; unsigned f_mpx = kvm_mpx_supported() ? F(MPX) : 0; @@ -380,7 +380,7 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry, int index) const u32 kvm_cpuid_7_1_eax_x86_features = F(AVX512_BF16); - switch (index) { + switch (entry->index) { case 0: entry->eax = min(entry->eax, 1u); entry->ebx &= kvm_cpuid_7_0_ebx_x86_features; @@ -573,7 +573,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, case 7: { int i; - do_cpuid_7_mask(entry, 0); + do_cpuid_7_mask(entry); for (i = 1; i <= entry->eax; i++) { if (*nent >= maxnent) @@ -582,7 +582,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, do_host_cpuid(&entry[i], function, i); ++*nent; - do_cpuid_7_mask(&entry[i], i); + do_cpuid_7_mask(&entry[i]); } break; } -- cgit From acfad336ecf97ccad43346d7a82c60c3b21dbde0 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:15 -0800 Subject: KVM: x86: Drop redundant boot cpu checks on SSBD feature bits Drop redundant checks when "emulating" SSBD feature across vendors, i.e. advertising the AMD variant when running on an Intel CPU and vice versa. Both SPEC_CTRL_SSBD and AMD_SSBD are already defined in the leaf-specific feature masks and are *not* forcefully set by the kernel, i.e. will already be set in the entry when supported by the host. Functionally, this changes nothing, but the redundant check is confusing, especially when considering future patches that will further differentiate between "real" and "emulated" feature bits. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index fd04f17d1836..52f0af4e10d5 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -405,8 +405,7 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) entry->edx |= F(SPEC_CTRL); if (boot_cpu_has(X86_FEATURE_STIBP)) entry->edx |= F(INTEL_STIBP); - if (boot_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD) || - boot_cpu_has(X86_FEATURE_AMD_SSBD)) + if (boot_cpu_has(X86_FEATURE_AMD_SSBD)) entry->edx |= F(SPEC_CTRL_SSBD); /* * We emulate ARCH_CAPABILITIES in software even @@ -780,8 +779,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, entry->ebx |= F(AMD_IBRS); if (boot_cpu_has(X86_FEATURE_STIBP)) entry->ebx |= F(AMD_STIBP); - if (boot_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD) || - boot_cpu_has(X86_FEATURE_AMD_SSBD)) + if (boot_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD)) entry->ebx |= F(AMD_SSBD); if (!boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS)) entry->ebx |= F(AMD_SSB_NO); -- cgit From aa10a7dc8858f6cbd1ef5cb86693592896ee27c7 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:16 -0800 Subject: KVM: x86: Consolidate CPUID array max num entries checking Move the nent vs. maxnent check and nent increment into do_host_cpuid() to consolidate what is now identical code. To signal success vs. failure, return the entry and NULL respectively. A future patch will build on this to also move the entry retrieval into do_host_cpuid(). No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 49 +++++++++++++++++-------------------------------- 1 file changed, 17 insertions(+), 32 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 52f0af4e10d5..1ae3b2502333 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -287,9 +287,14 @@ static __always_inline void cpuid_mask(u32 *word, int wordnum) *word &= boot_cpu_data.x86_capability[wordnum]; } -static void do_host_cpuid(struct kvm_cpuid_entry2 *entry, u32 function, - u32 index) +static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_entry2 *entry, + int *nent, int maxnent, + u32 function, u32 index) { + if (*nent >= maxnent) + return NULL; + ++*nent; + entry->function = function; entry->index = index; entry->flags = 0; @@ -316,6 +321,8 @@ static void do_host_cpuid(struct kvm_cpuid_entry2 *entry, u32 function, entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX; break; } + + return entry; } static int __do_cpuid_func_emulated(struct kvm_cpuid_entry2 *entry, @@ -507,12 +514,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, r = -E2BIG; - if (WARN_ON(*nent >= maxnent)) + if (WARN_ON(!do_host_cpuid(entry, nent, maxnent, function, 0))) goto out; - do_host_cpuid(entry, function, 0); - ++*nent; - switch (function) { case 0: /* Limited to the highest leaf implemented in KVM. */ @@ -536,11 +540,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, entry->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT; for (t = 1; t < times; ++t) { - if (*nent >= maxnent) + if (!do_host_cpuid(&entry[t], nent, maxnent, function, 0)) goto out; - - do_host_cpuid(&entry[t], function, 0); - ++*nent; } break; } @@ -555,10 +556,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, if (!cache_type) break; - if (*nent >= maxnent) + if (!do_host_cpuid(&entry[i], nent, maxnent, function, i)) goto out; - do_host_cpuid(&entry[i], function, i); - ++*nent; } break; } @@ -575,12 +574,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, do_cpuid_7_mask(entry); for (i = 1; i <= entry->eax; i++) { - if (*nent >= maxnent) + if (!do_host_cpuid(&entry[i], nent, maxnent, function, i)) goto out; - do_host_cpuid(&entry[i], function, i); - ++*nent; - do_cpuid_7_mask(&entry[i]); } break; @@ -633,11 +629,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, * added entry is zero. */ for (i = 1; entry[i - 1].ecx & 0xff00; ++i) { - if (*nent >= maxnent) + if (!do_host_cpuid(&entry[i], nent, maxnent, function, i)) goto out; - - do_host_cpuid(&entry[i], function, i); - ++*nent; } break; } @@ -652,12 +645,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, if (!supported) break; - if (*nent >= maxnent) + if (!do_host_cpuid(&entry[1], nent, maxnent, function, 1)) goto out; - do_host_cpuid(&entry[1], function, 1); - ++*nent; - entry[1].eax &= kvm_cpuid_D_1_eax_x86_features; cpuid_mask(&entry[1].eax, CPUID_D_1_EAX); if (entry[1].eax & (F(XSAVES)|F(XSAVEC))) @@ -672,12 +662,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, if (!(supported & BIT_ULL(idx))) continue; - if (*nent >= maxnent) + if (!do_host_cpuid(&entry[i], nent, maxnent, function, idx)) goto out; - do_host_cpuid(&entry[i], function, idx); - ++*nent; - /* * The @supported check above should have filtered out * invalid sub-leafs as well as sub-leafs managed by @@ -704,10 +691,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, break; for (t = 1; t <= times; ++t) { - if (*nent >= maxnent) + if (!do_host_cpuid(&entry[t], nent, maxnent, function, t)) goto out; - do_host_cpuid(&entry[t], function, t); - ++*nent; } break; } -- cgit From 74fa0bc7f083fbcce1ee0b9c3c2716fcb068fce4 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:17 -0800 Subject: KVM: x86: Hoist loop counter and terminator to top of __do_cpuid_func() Declare "i" and "max_idx" at the top of __do_cpuid_func() to consolidate a handful of declarations in various case statements. More importantly, establish the pattern of using max_idx instead of e.g. entry->eax as the loop terminator in preparation for refactoring how entry is handled in __do_cpuid_func(). No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 37 +++++++++++++------------------------ 1 file changed, 13 insertions(+), 24 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 1ae3b2502333..5044a595799f 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -439,7 +439,7 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, int *nent, int maxnent) { - int r; + int r, i, max_idx; unsigned f_nx = is_efer_nx() ? F(NX) : 0; #ifdef CONFIG_X86_64 unsigned f_gbpages = (kvm_x86_ops->get_lpage_level() == PT_PDPE_LEVEL) @@ -535,20 +535,18 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, * may return different values. This forces us to get_cpu() before * issuing the first command, and also to emulate this annoying behavior * in kvm_emulate_cpuid() using KVM_CPUID_FLAG_STATE_READ_NEXT */ - case 2: { - int t, times = entry->eax & 0xff; - + case 2: entry->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT; - for (t = 1; t < times; ++t) { - if (!do_host_cpuid(&entry[t], nent, maxnent, function, 0)) + + for (i = 1, max_idx = entry->eax & 0xff; i < max_idx; ++i) { + if (!do_host_cpuid(&entry[i], nent, maxnent, function, 0)) goto out; } break; - } /* functions 4 and 0x8000001d have additional index. */ case 4: case 0x8000001d: { - int i, cache_type; + int cache_type; /* read more entries until cache_type is zero */ for (i = 1; ; ++i) { @@ -568,19 +566,16 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, entry->edx = 0; break; /* function 7 has additional index. */ - case 7: { - int i; - + case 7: do_cpuid_7_mask(entry); - for (i = 1; i <= entry->eax; i++) { + for (i = 1, max_idx = entry->eax; i <= max_idx; i++) { if (!do_host_cpuid(&entry[i], nent, maxnent, function, i)) goto out; do_cpuid_7_mask(&entry[i]); } break; - } case 9: break; case 0xa: { /* Architectural Performance Monitoring */ @@ -617,9 +612,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, * thus they can be handled by common code. */ case 0x1f: - case 0xb: { - int i; - + case 0xb: /* * We filled in entry[0] for CPUID(EAX=, * ECX=00H) above. If its level type (ECX[15:8]) is @@ -633,9 +626,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, goto out; } break; - } case 0xd: { - int idx, i; + int idx; u64 supported = kvm_supported_xcr0(); entry->eax &= supported; @@ -684,18 +676,15 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, break; } /* Intel PT */ - case 0x14: { - int t, times = entry->eax; - + case 0x14: if (!f_intel_pt) break; - for (t = 1; t <= times; ++t) { - if (!do_host_cpuid(&entry[t], nent, maxnent, function, t)) + for (i = 1, max_idx = entry->eax; i <= max_idx; ++i) { + if (!do_host_cpuid(&entry[i], nent, maxnent, function, i)) goto out; } break; - } case KVM_CPUID_SIGNATURE: { static const char signature[12] = "KVMKVMKVM\0\0"; const u32 *sigptr = (const u32 *)signature; -- cgit From c862903963bbbf8dfcbb46204584f2330bb8df42 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:18 -0800 Subject: KVM: x86: Refactor CPUID 0x4 and 0x8000001d handling Refactoring the sub-leaf handling for CPUID 0x4/0x8000001d to eliminate a one-off variable and its associated brackets. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 5044a595799f..d75d539da759 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -545,20 +545,16 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, break; /* functions 4 and 0x8000001d have additional index. */ case 4: - case 0x8000001d: { - int cache_type; - - /* read more entries until cache_type is zero */ - for (i = 1; ; ++i) { - cache_type = entry[i - 1].eax & 0x1f; - if (!cache_type) - break; - + case 0x8000001d: + /* + * Read entries until the cache type in the previous entry is + * zero, i.e. indicates an invalid entry. + */ + for (i = 1; entry[i - 1].eax & 0x1f; ++i) { if (!do_host_cpuid(&entry[i], nent, maxnent, function, i)) goto out; } break; - } case 6: /* Thermal management */ entry->eax = 0x4; /* allow ARAT */ entry->ebx = 0; -- cgit From e53c95e8d41ef9927d1d5ff031db1fd6989ec16b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:19 -0800 Subject: KVM: x86: Encapsulate CPUID entries and metadata in struct Add a struct to hold the array of CPUID entries and its associated metadata when handling KVM_GET_SUPPORTED_CPUID. Lookup and provide the correct entry in do_host_cpuid(), which eliminates the majority of array indexing shenanigans, e.g. entries[i -1], and generally makes the code more readable. The last array indexing holdout is kvm_get_cpuid(), which can't really be avoided without throwing the baby out with the bathwater. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 138 ++++++++++++++++++++++++++++----------------------- 1 file changed, 76 insertions(+), 62 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index d75d539da759..59195de22d8f 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -287,13 +287,21 @@ static __always_inline void cpuid_mask(u32 *word, int wordnum) *word &= boot_cpu_data.x86_capability[wordnum]; } -static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_entry2 *entry, - int *nent, int maxnent, +struct kvm_cpuid_array { + struct kvm_cpuid_entry2 *entries; + const int maxnent; + int nent; +}; + +static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array, u32 function, u32 index) { - if (*nent >= maxnent) + struct kvm_cpuid_entry2 *entry; + + if (array->nent >= array->maxnent) return NULL; - ++*nent; + + entry = &array->entries[array->nent++]; entry->function = function; entry->index = index; @@ -325,9 +333,10 @@ static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_entry2 *entry, return entry; } -static int __do_cpuid_func_emulated(struct kvm_cpuid_entry2 *entry, - u32 func, int *nent, int maxnent) +static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func) { + struct kvm_cpuid_entry2 *entry = &array->entries[array->nent]; + entry->function = func; entry->index = 0; entry->flags = 0; @@ -335,17 +344,17 @@ static int __do_cpuid_func_emulated(struct kvm_cpuid_entry2 *entry, switch (func) { case 0: entry->eax = 7; - ++*nent; + ++array->nent; break; case 1: entry->ecx = F(MOVBE); - ++*nent; + ++array->nent; break; case 7: entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX; entry->eax = 0; entry->ecx = F(RDPID); - ++*nent; + ++array->nent; default: break; } @@ -436,9 +445,9 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) } } -static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, - int *nent, int maxnent) +static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) { + struct kvm_cpuid_entry2 *entry; int r, i, max_idx; unsigned f_nx = is_efer_nx() ? F(NX) : 0; #ifdef CONFIG_X86_64 @@ -514,7 +523,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, r = -E2BIG; - if (WARN_ON(!do_host_cpuid(entry, nent, maxnent, function, 0))) + entry = do_host_cpuid(array, function, 0); + if (WARN_ON(!entry)) goto out; switch (function) { @@ -539,7 +549,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, entry->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT; for (i = 1, max_idx = entry->eax & 0xff; i < max_idx; ++i) { - if (!do_host_cpuid(&entry[i], nent, maxnent, function, 0)) + entry = do_host_cpuid(array, function, 0); + if (!entry) goto out; } break; @@ -550,8 +561,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, * Read entries until the cache type in the previous entry is * zero, i.e. indicates an invalid entry. */ - for (i = 1; entry[i - 1].eax & 0x1f; ++i) { - if (!do_host_cpuid(&entry[i], nent, maxnent, function, i)) + for (i = 1; entry->eax & 0x1f; ++i) { + entry = do_host_cpuid(array, function, i); + if (!entry) goto out; } break; @@ -566,10 +578,11 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, do_cpuid_7_mask(entry); for (i = 1, max_idx = entry->eax; i <= max_idx; i++) { - if (!do_host_cpuid(&entry[i], nent, maxnent, function, i)) + entry = do_host_cpuid(array, function, i); + if (!entry) goto out; - do_cpuid_7_mask(&entry[i]); + do_cpuid_7_mask(entry); } break; case 9: @@ -610,15 +623,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, case 0x1f: case 0xb: /* - * We filled in entry[0] for CPUID(EAX=, - * ECX=00H) above. If its level type (ECX[15:8]) is - * zero, then the leaf is unimplemented, and we're - * done. Otherwise, continue to populate entries - * until the level type (ECX[15:8]) of the previously - * added entry is zero. + * Populate entries until the level type (ECX[15:8]) of the + * previous entry is zero. Note, CPUID EAX.{0x1f,0xb}.0 is + * the starting entry, filled by the primary do_host_cpuid(). */ - for (i = 1; entry[i - 1].ecx & 0xff00; ++i) { - if (!do_host_cpuid(&entry[i], nent, maxnent, function, i)) + for (i = 1; entry->ecx & 0xff00; ++i) { + entry = do_host_cpuid(array, function, i); + if (!entry) goto out; } break; @@ -633,24 +644,26 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, if (!supported) break; - if (!do_host_cpuid(&entry[1], nent, maxnent, function, 1)) + entry = do_host_cpuid(array, function, 1); + if (!entry) goto out; - entry[1].eax &= kvm_cpuid_D_1_eax_x86_features; - cpuid_mask(&entry[1].eax, CPUID_D_1_EAX); - if (entry[1].eax & (F(XSAVES)|F(XSAVEC))) - entry[1].ebx = xstate_required_size(supported, true); + entry->eax &= kvm_cpuid_D_1_eax_x86_features; + cpuid_mask(&entry->eax, CPUID_D_1_EAX); + if (entry->eax & (F(XSAVES)|F(XSAVEC))) + entry->ebx = xstate_required_size(supported, true); else - entry[1].ebx = 0; + entry->ebx = 0; /* Saving XSS controlled state via XSAVES isn't supported. */ - entry[1].ecx = 0; - entry[1].edx = 0; + entry->ecx = 0; + entry->edx = 0; - for (idx = 2, i = 2; idx < 64; ++idx) { + for (idx = 2; idx < 64; ++idx) { if (!(supported & BIT_ULL(idx))) continue; - if (!do_host_cpuid(&entry[i], nent, maxnent, function, idx)) + entry = do_host_cpuid(array, function, idx); + if (!entry) goto out; /* @@ -660,14 +673,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, * reach this point, and they should have a non-zero * save state size. */ - if (WARN_ON_ONCE(!entry[i].eax || (entry[i].ecx & 1))) { - --*nent; + if (WARN_ON_ONCE(!entry->eax || (entry->ecx & 1))) { + --array->nent; continue; } - entry[i].ecx = 0; - entry[i].edx = 0; - ++i; + entry->ecx = 0; + entry->edx = 0; } break; } @@ -677,7 +689,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, break; for (i = 1, max_idx = entry->eax; i <= max_idx; ++i) { - if (!do_host_cpuid(&entry[i], nent, maxnent, function, i)) + if (!do_host_cpuid(array, function, i)) goto out; } break; @@ -802,22 +814,22 @@ out: return r; } -static int do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 func, - int *nent, int maxnent, unsigned int type) +static int do_cpuid_func(struct kvm_cpuid_array *array, u32 func, + unsigned int type) { - if (*nent >= maxnent) + if (array->nent >= array->maxnent) return -E2BIG; if (type == KVM_GET_EMULATED_CPUID) - return __do_cpuid_func_emulated(entry, func, nent, maxnent); + return __do_cpuid_func_emulated(array, func); - return __do_cpuid_func(entry, func, nent, maxnent); + return __do_cpuid_func(array, func); } #define CENTAUR_CPUID_SIGNATURE 0xC0000000 -static int get_cpuid_func(struct kvm_cpuid_entry2 *entries, u32 func, - int *nent, int maxnent, unsigned int type) +static int get_cpuid_func(struct kvm_cpuid_array *array, u32 func, + unsigned int type) { u32 limit; int r; @@ -826,16 +838,16 @@ static int get_cpuid_func(struct kvm_cpuid_entry2 *entries, u32 func, boot_cpu_data.x86_vendor != X86_VENDOR_CENTAUR) return 0; - r = do_cpuid_func(&entries[*nent], func, nent, maxnent, type); + r = do_cpuid_func(array, func, type); if (r) return r; - limit = entries[*nent - 1].eax; + limit = array->entries[array->nent - 1].eax; for (func = func + 1; func <= limit; ++func) { - if (*nent >= maxnent) + if (array->nent >= array->maxnent) return -E2BIG; - r = do_cpuid_func(&entries[*nent], func, nent, maxnent, type); + r = do_cpuid_func(array, func, type); if (r) break; } @@ -878,8 +890,11 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, 0, 0x80000000, CENTAUR_CPUID_SIGNATURE, KVM_CPUID_SIGNATURE, }; - struct kvm_cpuid_entry2 *cpuid_entries; - int nent = 0, r, i; + struct kvm_cpuid_array array = { + .nent = 0, + .maxnent = cpuid->nent, + }; + int r, i; if (cpuid->nent < 1) return -E2BIG; @@ -889,25 +904,24 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, if (sanity_check_entries(entries, cpuid->nent, type)) return -EINVAL; - cpuid_entries = vzalloc(array_size(sizeof(struct kvm_cpuid_entry2), + array.entries = vzalloc(array_size(sizeof(struct kvm_cpuid_entry2), cpuid->nent)); - if (!cpuid_entries) + if (!array.entries) return -ENOMEM; for (i = 0; i < ARRAY_SIZE(funcs); i++) { - r = get_cpuid_func(cpuid_entries, funcs[i], &nent, cpuid->nent, - type); + r = get_cpuid_func(&array, funcs[i], type); if (r) goto out_free; } - cpuid->nent = nent; + cpuid->nent = array.nent; - if (copy_to_user(entries, cpuid_entries, - nent * sizeof(struct kvm_cpuid_entry2))) + if (copy_to_user(entries, array.entries, + array.nent * sizeof(struct kvm_cpuid_entry2))) r = -EFAULT; out_free: - vfree(cpuid_entries); + vfree(array.entries); return r; } -- cgit From 695538aa21c057578a0c70e1aa22fd97fd407930 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:20 -0800 Subject: KVM: x86: Drop redundant array size check Drop a "nent >= maxnent" check in kvm_get_cpuid() that's fully redundant now that kvm_get_cpuid() isn't indexing the array to pass an entry to do_cpuid_func(). Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 59195de22d8f..4bf4f7d7741e 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -844,9 +844,6 @@ static int get_cpuid_func(struct kvm_cpuid_array *array, u32 func, limit = array->entries[array->nent - 1].eax; for (func = func + 1; func <= limit; ++func) { - if (array->nent >= array->maxnent) - return -E2BIG; - r = do_cpuid_func(array, func, type); if (r) break; -- cgit From 0eee8f9d9d3b2838fdd383328129df524cea4ba7 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:21 -0800 Subject: KVM: x86: Use common loop iterator when handling CPUID 0xD.N Use __do_cpuid_func()'s common loop iterator, "i", when enumerating the sub-leafs for CPUID 0xD now that the CPUID 0xD loop doesn't need to manual maintain separate counts for the entries index and CPUID index. No functional changed intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 4bf4f7d7741e..85f292088d91 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -634,7 +634,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) } break; case 0xd: { - int idx; u64 supported = kvm_supported_xcr0(); entry->eax &= supported; @@ -658,11 +657,11 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->ecx = 0; entry->edx = 0; - for (idx = 2; idx < 64; ++idx) { - if (!(supported & BIT_ULL(idx))) + for (i = 2; i < 64; ++i) { + if (!(supported & BIT_ULL(i))) continue; - entry = do_host_cpuid(array, function, idx); + entry = do_host_cpuid(array, function, i); if (!entry) goto out; -- cgit From 2ef7619d43731b6eaa7cc2e03d000e4bbc1bf612 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:22 -0800 Subject: KVM: VMX: Add helpers to query Intel PT mode Add helpers to query which of the (two) supported PT modes is active. The primary motivation is to help document that there is a third PT mode (host-only) that's currently not supported by KVM. As is, it's not obvious that PT_MODE_SYSTEM != !PT_MODE_HOST_GUEST and vice versa, e.g. that "pt_mode == PT_MODE_SYSTEM" and "pt_mode != PT_MODE_HOST_GUEST" are two distinct checks. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/capabilities.h | 18 ++++++++++++++++++ arch/x86/kvm/vmx/nested.c | 2 +- arch/x86/kvm/vmx/vmx.c | 26 +++++++++++++------------- arch/x86/kvm/vmx/vmx.h | 4 ++-- 4 files changed, 34 insertions(+), 16 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index f486e2606247..80eec8cffbe2 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -354,4 +354,22 @@ static inline bool cpu_has_vmx_intel_pt(void) (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_RTIT_CTL); } +/* + * Processor Trace can operate in one of three modes: + * a. system-wide: trace both host/guest and output to host buffer + * b. host-only: only trace host and output to host buffer + * c. host-guest: trace host and guest simultaneously and output to their + * respective buffer + * + * KVM currently only supports (a) and (c). + */ +static inline bool vmx_pt_mode_is_system(void) +{ + return pt_mode == PT_MODE_SYSTEM; +} +static inline bool vmx_pt_mode_is_host_guest(void) +{ + return pt_mode == PT_MODE_HOST_GUEST; +} + #endif /* __KVM_X86_VMX_CAPS_H */ diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 79c7764c77b1..e47eb7c0fbae 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -4602,7 +4602,7 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu) vmx->nested.vmcs02_initialized = false; vmx->nested.vmxon = true; - if (pt_mode == PT_MODE_HOST_GUEST) { + if (vmx_pt_mode_is_host_guest()) { vmx->pt_desc.guest.ctl = 0; pt_update_intercept_for_msr(vmx); } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 57742ddfd854..af2acf15a067 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -1059,7 +1059,7 @@ static unsigned long segment_base(u16 selector) static inline bool pt_can_write_msr(struct vcpu_vmx *vmx) { - return (pt_mode == PT_MODE_HOST_GUEST) && + return vmx_pt_mode_is_host_guest() && !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN); } @@ -1093,7 +1093,7 @@ static inline void pt_save_msr(struct pt_ctx *ctx, u32 addr_range) static void pt_guest_enter(struct vcpu_vmx *vmx) { - if (pt_mode == PT_MODE_SYSTEM) + if (vmx_pt_mode_is_system()) return; /* @@ -1110,7 +1110,7 @@ static void pt_guest_enter(struct vcpu_vmx *vmx) static void pt_guest_exit(struct vcpu_vmx *vmx) { - if (pt_mode == PT_MODE_SYSTEM) + if (vmx_pt_mode_is_system()) return; if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) { @@ -1904,24 +1904,24 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) &msr_info->data); break; case MSR_IA32_RTIT_CTL: - if (pt_mode != PT_MODE_HOST_GUEST) + if (!vmx_pt_mode_is_host_guest()) return 1; msr_info->data = vmx->pt_desc.guest.ctl; break; case MSR_IA32_RTIT_STATUS: - if (pt_mode != PT_MODE_HOST_GUEST) + if (!vmx_pt_mode_is_host_guest()) return 1; msr_info->data = vmx->pt_desc.guest.status; break; case MSR_IA32_RTIT_CR3_MATCH: - if ((pt_mode != PT_MODE_HOST_GUEST) || + if (!vmx_pt_mode_is_host_guest() || !intel_pt_validate_cap(vmx->pt_desc.caps, PT_CAP_cr3_filtering)) return 1; msr_info->data = vmx->pt_desc.guest.cr3_match; break; case MSR_IA32_RTIT_OUTPUT_BASE: - if ((pt_mode != PT_MODE_HOST_GUEST) || + if (!vmx_pt_mode_is_host_guest() || (!intel_pt_validate_cap(vmx->pt_desc.caps, PT_CAP_topa_output) && !intel_pt_validate_cap(vmx->pt_desc.caps, @@ -1930,7 +1930,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) msr_info->data = vmx->pt_desc.guest.output_base; break; case MSR_IA32_RTIT_OUTPUT_MASK: - if ((pt_mode != PT_MODE_HOST_GUEST) || + if (!vmx_pt_mode_is_host_guest() || (!intel_pt_validate_cap(vmx->pt_desc.caps, PT_CAP_topa_output) && !intel_pt_validate_cap(vmx->pt_desc.caps, @@ -1940,7 +1940,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) break; case MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B: index = msr_info->index - MSR_IA32_RTIT_ADDR0_A; - if ((pt_mode != PT_MODE_HOST_GUEST) || + if (!vmx_pt_mode_is_host_guest() || (index >= 2 * intel_pt_validate_cap(vmx->pt_desc.caps, PT_CAP_num_address_ranges))) return 1; @@ -2146,7 +2146,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 1; return vmx_set_vmx_msr(vcpu, msr_index, data); case MSR_IA32_RTIT_CTL: - if ((pt_mode != PT_MODE_HOST_GUEST) || + if (!vmx_pt_mode_is_host_guest() || vmx_rtit_ctl_check(vcpu, data) || vmx->nested.vmxon) return 1; @@ -4023,7 +4023,7 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx) u32 exec_control = vmcs_config.cpu_based_2nd_exec_ctrl; - if (pt_mode == PT_MODE_SYSTEM) + if (vmx_pt_mode_is_system()) exec_control &= ~(SECONDARY_EXEC_PT_USE_GPA | SECONDARY_EXEC_PT_CONCEAL_VMX); if (!cpu_need_virtualize_apic_accesses(vcpu)) exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; @@ -4264,7 +4264,7 @@ static void init_vmcs(struct vcpu_vmx *vmx) if (cpu_has_vmx_encls_vmexit()) vmcs_write64(ENCLS_EXITING_BITMAP, -1ull); - if (pt_mode == PT_MODE_HOST_GUEST) { + if (vmx_pt_mode_is_host_guest()) { memset(&vmx->pt_desc, 0, sizeof(vmx->pt_desc)); /* Bit[6~0] are forced to 1, writes are ignored. */ vmx->pt_desc.guest.output_mask = 0x7F; @@ -6318,7 +6318,7 @@ static bool vmx_has_emulated_msr(int index) static bool vmx_pt_supported(void) { - return pt_mode == PT_MODE_HOST_GUEST; + return vmx_pt_mode_is_host_guest(); } static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx) diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index e64da06c7009..9a51a3a77233 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -452,7 +452,7 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx) static inline u32 vmx_vmentry_ctrl(void) { u32 vmentry_ctrl = vmcs_config.vmentry_ctrl; - if (pt_mode == PT_MODE_SYSTEM) + if (vmx_pt_mode_is_system()) vmentry_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP | VM_ENTRY_LOAD_IA32_RTIT_CTL); /* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */ @@ -463,7 +463,7 @@ static inline u32 vmx_vmentry_ctrl(void) static inline u32 vmx_vmexit_ctrl(void) { u32 vmexit_ctrl = vmcs_config.vmexit_ctrl; - if (pt_mode == PT_MODE_SYSTEM) + if (vmx_pt_mode_is_system()) vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP | VM_EXIT_CLEAR_IA32_RTIT_CTL); /* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */ -- cgit From cfc481810c903a5f74e5c7bf50ca8e28318dbc44 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:23 -0800 Subject: KVM: x86: Calculate the supported xcr0 mask at load time Add a new global variable, supported_xcr0, to track which xcr0 bits can be exposed to the guest instead of calculating the mask on every call. The supported bits are constant for a given instance of KVM. This paves the way toward eliminating the ->mpx_supported() call in kvm_mpx_supported(), e.g. eliminates multiple retpolines in VMX's nested VM-Enter path, and eventually toward eliminating ->mpx_supported() altogether. No functional change intended. Reviewed-by: Xiaoyao Li Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 32 +++++++++----------------------- arch/x86/kvm/svm.c | 2 ++ arch/x86/kvm/vmx/vmx.c | 4 ++++ arch/x86/kvm/x86.c | 14 +++++++++++--- arch/x86/kvm/x86.h | 7 +------ 5 files changed, 27 insertions(+), 32 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 85f292088d91..1eb775c33c4e 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -52,16 +52,6 @@ bool kvm_mpx_supported(void) } EXPORT_SYMBOL_GPL(kvm_mpx_supported); -u64 kvm_supported_xcr0(void) -{ - u64 xcr0 = KVM_SUPPORTED_XCR0 & host_xcr0; - - if (!kvm_mpx_supported()) - xcr0 &= ~(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR); - - return xcr0; -} - #define F feature_bit int kvm_update_cpuid(struct kvm_vcpu *vcpu) @@ -107,8 +97,7 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.guest_xstate_size = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET; } else { vcpu->arch.guest_supported_xcr0 = - (best->eax | ((u64)best->edx << 32)) & - kvm_supported_xcr0(); + (best->eax | ((u64)best->edx << 32)) & supported_xcr0; vcpu->arch.guest_xstate_size = best->ebx = xstate_required_size(vcpu->arch.xcr0, false); } @@ -633,14 +622,12 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) goto out; } break; - case 0xd: { - u64 supported = kvm_supported_xcr0(); - - entry->eax &= supported; - entry->ebx = xstate_required_size(supported, false); + case 0xd: + entry->eax &= supported_xcr0; + entry->ebx = xstate_required_size(supported_xcr0, false); entry->ecx = entry->ebx; - entry->edx &= supported >> 32; - if (!supported) + entry->edx &= supported_xcr0 >> 32; + if (!supported_xcr0) break; entry = do_host_cpuid(array, function, 1); @@ -650,7 +637,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->eax &= kvm_cpuid_D_1_eax_x86_features; cpuid_mask(&entry->eax, CPUID_D_1_EAX); if (entry->eax & (F(XSAVES)|F(XSAVEC))) - entry->ebx = xstate_required_size(supported, true); + entry->ebx = xstate_required_size(supported_xcr0, true); else entry->ebx = 0; /* Saving XSS controlled state via XSAVES isn't supported. */ @@ -658,7 +645,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->edx = 0; for (i = 2; i < 64; ++i) { - if (!(supported & BIT_ULL(i))) + if (!(supported_xcr0 & BIT_ULL(i))) continue; entry = do_host_cpuid(array, function, i); @@ -666,7 +653,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) goto out; /* - * The @supported check above should have filtered out + * The supported check above should have filtered out * invalid sub-leafs as well as sub-leafs managed by * IA32_XSS MSR. Only XCR0-managed sub-leafs should * reach this point, and they should have a non-zero @@ -681,7 +668,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->edx = 0; } break; - } /* Intel PT */ case 0x14: if (!f_intel_pt) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 7f32c407d682..5ba2ef1b4577 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1385,6 +1385,8 @@ static __init int svm_hardware_setup(void) init_msrpm_offsets(); + supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR); + if (boot_cpu_has(X86_FEATURE_NX)) kvm_enable_efer_bits(EFER_NX); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index af2acf15a067..e03b4d037079 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7660,6 +7660,10 @@ static __init int hardware_setup(void) WARN_ONCE(host_bndcfgs, "KVM: BNDCFGS in host will be lost"); } + if (!kvm_mpx_supported()) + supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS | + XFEATURE_MASK_BNDCSR); + if (!cpu_has_vmx_vpid() || !cpu_has_vmx_invvpid() || !(cpu_has_vmx_invvpid_single() || cpu_has_vmx_invvpid_global())) enable_vpid = 0; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a69f7bf020d9..849957f3afb2 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -181,6 +181,11 @@ struct kvm_shared_msrs { static struct kvm_shared_msrs_global __read_mostly shared_msrs_global; static struct kvm_shared_msrs __percpu *shared_msrs; +#define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \ + | XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \ + | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \ + | XFEATURE_MASK_PKRU) + static u64 __read_mostly host_xss; struct kvm_stats_debugfs_item debugfs_entries[] = { @@ -227,6 +232,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = { }; u64 __read_mostly host_xcr0; +u64 __read_mostly supported_xcr0; +EXPORT_SYMBOL_GPL(supported_xcr0); struct kmem_cache *x86_fpu_cache; EXPORT_SYMBOL_GPL(x86_fpu_cache); @@ -4114,8 +4121,7 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu, * CPUID leaf 0xD, index 0, EDX:EAX. This is for compatibility * with old userspace. */ - if (xstate_bv & ~kvm_supported_xcr0() || - mxcsr & ~mxcsr_feature_mask) + if (xstate_bv & ~supported_xcr0 || mxcsr & ~mxcsr_feature_mask) return -EINVAL; load_xsave(vcpu, (u8 *)guest_xsave->region); } else { @@ -7352,8 +7358,10 @@ int kvm_arch_init(void *opaque) perf_register_guest_info_callbacks(&kvm_guest_cbs); - if (boot_cpu_has(X86_FEATURE_XSAVE)) + if (boot_cpu_has(X86_FEATURE_XSAVE)) { host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK); + supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0; + } kvm_lapic_init(); if (pi_inject_timer == -1) diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index f3c6e55eb5d9..7a7dd5aa8586 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -270,13 +270,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int emulation_type, void *insn, int insn_len); enum exit_fastpath_completion handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu); -#define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \ - | XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \ - | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \ - | XFEATURE_MASK_PKRU) extern u64 host_xcr0; - -extern u64 kvm_supported_xcr0(void); +extern u64 supported_xcr0; extern unsigned int min_timer_period_us; -- cgit From 7f5581f592984901620d34aa86a730092ae65092 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:24 -0800 Subject: KVM: x86: Use supported_xcr0 to detect MPX support Query supported_xcr0 when checking for MPX support instead of invoking ->mpx_supported() and drop ->mpx_supported() as kvm_mpx_supported() was its last user. Rename vmx_mpx_supported() to cpu_has_vmx_mpx() to better align with VMX/VMCS nomenclature. Modify VMX's adjustment of xcr0 to call cpus_has_vmx_mpx() (renamed from vmx_mpx_supported()) directly to avoid reading supported_xcr0 before it's fully configured. No functional change intended. Reviewed-by: Xiaoyao Li Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson [Test that *all* bits are set. - Paolo] Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/cpuid.c | 4 ++-- arch/x86/kvm/svm.c | 6 ------ arch/x86/kvm/vmx/capabilities.h | 2 +- arch/x86/kvm/vmx/vmx.c | 3 +-- 5 files changed, 5 insertions(+), 12 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 6ac67b6d6692..f817ddf876b5 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1177,7 +1177,7 @@ struct kvm_x86_ops { struct x86_exception *exception); void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu, enum exit_fastpath_completion *exit_fastpath); - bool (*mpx_supported)(void); + bool (*xsaves_supported)(void); bool (*umip_emulated)(void); bool (*pt_supported)(void); diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 1eb775c33c4e..7f3f1a683853 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -47,8 +47,8 @@ static u32 xstate_required_size(u64 xstate_bv, bool compacted) bool kvm_mpx_supported(void) { - return ((host_xcr0 & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR)) - && kvm_x86_ops->mpx_supported()); + return (supported_xcr0 & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR)) + == (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR); } EXPORT_SYMBOL_GPL(kvm_mpx_supported); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 5ba2ef1b4577..7521f72b1067 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6081,11 +6081,6 @@ static bool svm_invpcid_supported(void) return false; } -static bool svm_mpx_supported(void) -{ - return false; -} - static bool svm_xsaves_supported(void) { return boot_cpu_has(X86_FEATURE_XSAVES); @@ -7469,7 +7464,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .rdtscp_supported = svm_rdtscp_supported, .invpcid_supported = svm_invpcid_supported, - .mpx_supported = svm_mpx_supported, .xsaves_supported = svm_xsaves_supported, .umip_emulated = svm_umip_emulated, .pt_supported = svm_pt_supported, diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index 80eec8cffbe2..c00e26570198 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -101,7 +101,7 @@ static inline bool cpu_has_load_perf_global_ctrl(void) (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL); } -static inline bool vmx_mpx_supported(void) +static inline bool cpu_has_vmx_mpx(void) { return (vmcs_config.vmexit_ctrl & VM_EXIT_CLEAR_BNDCFGS) && (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_BNDCFGS); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e03b4d037079..59571cdd62fd 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7660,7 +7660,7 @@ static __init int hardware_setup(void) WARN_ONCE(host_bndcfgs, "KVM: BNDCFGS in host will be lost"); } - if (!kvm_mpx_supported()) + if (!cpu_has_vmx_mpx()) supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR); @@ -7927,7 +7927,6 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .check_intercept = vmx_check_intercept, .handle_exit_irqoff = vmx_handle_exit_irqoff, - .mpx_supported = vmx_mpx_supported, .xsaves_supported = vmx_xsaves_supported, .umip_emulated = vmx_umip_emulated, .pt_supported = vmx_pt_supported, -- cgit From 615a4ae1c74c2997f19189412fae0326baaf1bff Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:25 -0800 Subject: KVM: x86: Make kvm_mpx_supported() an inline function Expose kvm_mpx_supported() as a static inline so that it can be inlined in kvm_intel.ko. No functional change intended. Reviewed-by: Xiaoyao Li Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 7 ------- arch/x86/kvm/cpuid.h | 1 - arch/x86/kvm/x86.h | 6 ++++++ 3 files changed, 6 insertions(+), 8 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 7f3f1a683853..1ff16300a468 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -45,13 +45,6 @@ static u32 xstate_required_size(u64 xstate_bv, bool compacted) return ret; } -bool kvm_mpx_supported(void) -{ - return (supported_xcr0 & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR)) - == (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR); -} -EXPORT_SYMBOL_GPL(kvm_mpx_supported); - #define F feature_bit int kvm_update_cpuid(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 7366c618aa04..c1ac0995843d 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -7,7 +7,6 @@ #include int kvm_update_cpuid(struct kvm_vcpu *vcpu); -bool kvm_mpx_supported(void); struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, u32 function, u32 index); int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid, diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 7a7dd5aa8586..4d890bf827e8 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -273,6 +273,12 @@ enum exit_fastpath_completion handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vc extern u64 host_xcr0; extern u64 supported_xcr0; +static inline bool kvm_mpx_supported(void) +{ + return (supported_xcr0 & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR)) + == (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR); +} + extern unsigned int min_timer_period_us; extern bool enable_vmware_backdoor; -- cgit From 7392079c4e740572a5af72a7223ef5e162e442c9 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:26 -0800 Subject: KVM: x86: Clear output regs for CPUID 0x14 if PT isn't exposed to guest Clear the output regs for the main CPUID 0x14 leaf (index=0) if Intel PT isn't exposed to the guest. Leaf 0x14 enumerates Intel PT capabilities and should return zeroes if PT is not supported. Incorrectly reporting PT capabilities is essentially a cosmetic error, i.e. doesn't negatively affect any known userspace/kernel, as the existence of PT itself is correctly enumerated via CPUID 0x7. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 1ff16300a468..c194e49622b9 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -663,8 +663,10 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) break; /* Intel PT */ case 0x14: - if (!f_intel_pt) + if (!f_intel_pt) { + entry->eax = entry->ebx = entry->ecx = entry->edx = 0; break; + } for (i = 1, max_idx = entry->eax; i <= max_idx; ++i) { if (!do_host_cpuid(array, function, i)) -- cgit From 160b486f65ff89be7f90ff9297bb4bb0da446d91 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:27 -0800 Subject: KVM: x86: Drop explicit @func param from ->set_supported_cpuid() Drop the explicit @func param from ->set_supported_cpuid() and instead pull the CPUID function from the relevant entry. This sets the stage for hardening guest CPUID updates in future patches, e.g. allows adding run-time assertions that the CPUID feature being changed is actually a bit in the referenced CPUID entry. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/cpuid.c | 2 +- arch/x86/kvm/svm.c | 4 ++-- arch/x86/kvm/vmx/vmx.c | 4 ++-- 4 files changed, 6 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f817ddf876b5..626cbe161c57 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1161,7 +1161,7 @@ struct kvm_x86_ops { void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3); - void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry); + void (*set_supported_cpuid)(struct kvm_cpuid_entry2 *entry); bool (*has_wbinvd_exit)(void); diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index c194e49622b9..ffcf647b8fb4 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -784,7 +784,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) break; } - kvm_x86_ops->set_supported_cpuid(function, entry); + kvm_x86_ops->set_supported_cpuid(entry); r = 0; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 7521f72b1067..46d9b8ea04f1 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6035,9 +6035,9 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu) #define F feature_bit -static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry) +static void svm_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) { - switch (func) { + switch (entry->function) { case 0x80000001: if (nested) entry->ecx |= (1 << 2); /* Set SVM bit */ diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 59571cdd62fd..082ccefc4348 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7128,9 +7128,9 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu) } } -static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry) +static void vmx_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) { - if (func == 1 && nested) + if (entry->function == 1 && nested) entry->ecx |= feature_bit(VMX); } -- cgit From 3be5a60b454a19979963a246a6286255ce965fcf Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:28 -0800 Subject: KVM: x86: Use u32 for holding CPUID register value in helpers Change the intermediate CPUID output register values from "int" to "u32" to match both hardware and the storage type in struct cpuid_reg. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index c1ac0995843d..72a79bdfed6b 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -95,7 +95,7 @@ static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned x86_feature) return reverse_cpuid[x86_leaf]; } -static __always_inline int *guest_cpuid_get_register(struct kvm_vcpu *vcpu, unsigned x86_feature) +static __always_inline u32 *guest_cpuid_get_register(struct kvm_vcpu *vcpu, unsigned x86_feature) { struct kvm_cpuid_entry2 *entry; const struct cpuid_reg cpuid = x86_feature_cpuid(x86_feature); @@ -121,7 +121,7 @@ static __always_inline int *guest_cpuid_get_register(struct kvm_vcpu *vcpu, unsi static __always_inline bool guest_cpuid_has(struct kvm_vcpu *vcpu, unsigned x86_feature) { - int *reg; + u32 *reg; reg = guest_cpuid_get_register(vcpu, x86_feature); if (!reg) @@ -132,7 +132,7 @@ static __always_inline bool guest_cpuid_has(struct kvm_vcpu *vcpu, unsigned x86_ static __always_inline void guest_cpuid_clear(struct kvm_vcpu *vcpu, unsigned x86_feature) { - int *reg; + u32 *reg; reg = guest_cpuid_get_register(vcpu, x86_feature); if (reg) -- cgit From 5e12b2bb34e9fb81d6942f9d300e6d259cc15431 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:29 -0800 Subject: KVM: x86: Replace bare "unsigned" with "unsigned int" in cpuid helpers Replace "unsigned" with "unsigned int" to make checkpatch and people everywhere a little bit happier, and to avoid propagating the filth when future patches add more cpuid helpers that work with unsigned (ints). No functional change intended. Suggested-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.h | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 72a79bdfed6b..46b4b61b6cf8 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -63,7 +63,7 @@ static const struct cpuid_reg reverse_cpuid[] = { * and can't be used by KVM to query/control guest capabilities. And obviously * the leaf being queried must have an entry in the lookup table. */ -static __always_inline void reverse_cpuid_check(unsigned x86_leaf) +static __always_inline void reverse_cpuid_check(unsigned int x86_leaf) { BUILD_BUG_ON(x86_leaf == CPUID_LNX_1); BUILD_BUG_ON(x86_leaf == CPUID_LNX_2); @@ -87,15 +87,16 @@ static __always_inline u32 __feature_bit(int x86_feature) #define feature_bit(name) __feature_bit(X86_FEATURE_##name) -static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned x86_feature) +static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned int x86_feature) { - unsigned x86_leaf = x86_feature / 32; + unsigned int x86_leaf = x86_feature / 32; reverse_cpuid_check(x86_leaf); return reverse_cpuid[x86_leaf]; } -static __always_inline u32 *guest_cpuid_get_register(struct kvm_vcpu *vcpu, unsigned x86_feature) +static __always_inline u32 *guest_cpuid_get_register(struct kvm_vcpu *vcpu, + unsigned int x86_feature) { struct kvm_cpuid_entry2 *entry; const struct cpuid_reg cpuid = x86_feature_cpuid(x86_feature); @@ -119,7 +120,8 @@ static __always_inline u32 *guest_cpuid_get_register(struct kvm_vcpu *vcpu, unsi } } -static __always_inline bool guest_cpuid_has(struct kvm_vcpu *vcpu, unsigned x86_feature) +static __always_inline bool guest_cpuid_has(struct kvm_vcpu *vcpu, + unsigned int x86_feature) { u32 *reg; @@ -130,7 +132,8 @@ static __always_inline bool guest_cpuid_has(struct kvm_vcpu *vcpu, unsigned x86_ return *reg & __feature_bit(x86_feature); } -static __always_inline void guest_cpuid_clear(struct kvm_vcpu *vcpu, unsigned x86_feature) +static __always_inline void guest_cpuid_clear(struct kvm_vcpu *vcpu, + unsigned int x86_feature) { u32 *reg; -- cgit From 4c61534aaae2a170bf0abd72676624fd3fabc655 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:30 -0800 Subject: KVM: x86: Introduce cpuid_entry_{get,has}() accessors Introduce accessors to retrieve feature bits from CPUID entries and use the new accessors where applicable. Using the accessors eliminates the need to manually specify the register to be queried at no extra cost (binary output is identical) and will allow adding runtime consistency checks on the function and index in a future patch. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 9 +++++---- arch/x86/kvm/cpuid.h | 48 ++++++++++++++++++++++++++++++++++++++---------- 2 files changed, 43 insertions(+), 14 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index ffcf647b8fb4..81bf6555987f 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -68,7 +68,7 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) best->edx |= F(APIC); if (apic) { - if (best->ecx & F(TSC_DEADLINE_TIMER)) + if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) apic->lapic_timer.timer_mode_mask = 3 << 17; else apic->lapic_timer.timer_mode_mask = 1 << 17; @@ -96,7 +96,8 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) } best = kvm_find_cpuid_entry(vcpu, 0xD, 1); - if (best && (best->eax & (F(XSAVES) | F(XSAVEC)))) + if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) || + cpuid_entry_has(best, X86_FEATURE_XSAVEC))) best->ebx = xstate_required_size(vcpu->arch.xcr0, true); /* @@ -155,7 +156,7 @@ static void cpuid_fix_nx_cap(struct kvm_vcpu *vcpu) break; } } - if (entry && (entry->edx & F(NX)) && !is_efer_nx()) { + if (entry && cpuid_entry_has(entry, X86_FEATURE_NX) && !is_efer_nx()) { entry->edx &= ~F(NX); printk(KERN_INFO "kvm: guest NX capability removed\n"); } @@ -387,7 +388,7 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) entry->ebx |= F(TSC_ADJUST); entry->ecx &= kvm_cpuid_7_0_ecx_x86_features; - f_la57 = entry->ecx & F(LA57); + f_la57 = cpuid_entry_get(entry, X86_FEATURE_LA57); cpuid_mask(&entry->ecx, CPUID_7_ECX); /* Set LA57 based on hardware capability. */ entry->ecx |= f_la57; diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 46b4b61b6cf8..bf95428ddf4e 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -95,17 +95,10 @@ static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned int x86_featu return reverse_cpuid[x86_leaf]; } -static __always_inline u32 *guest_cpuid_get_register(struct kvm_vcpu *vcpu, - unsigned int x86_feature) +static __always_inline u32 *__cpuid_entry_get_reg(struct kvm_cpuid_entry2 *entry, + const struct cpuid_reg *cpuid) { - struct kvm_cpuid_entry2 *entry; - const struct cpuid_reg cpuid = x86_feature_cpuid(x86_feature); - - entry = kvm_find_cpuid_entry(vcpu, cpuid.function, cpuid.index); - if (!entry) - return NULL; - - switch (cpuid.reg) { + switch (cpuid->reg) { case CPUID_EAX: return &entry->eax; case CPUID_EBX: @@ -120,6 +113,41 @@ static __always_inline u32 *guest_cpuid_get_register(struct kvm_vcpu *vcpu, } } +static __always_inline u32 *cpuid_entry_get_reg(struct kvm_cpuid_entry2 *entry, + unsigned int x86_feature) +{ + const struct cpuid_reg cpuid = x86_feature_cpuid(x86_feature); + + return __cpuid_entry_get_reg(entry, &cpuid); +} + +static __always_inline u32 cpuid_entry_get(struct kvm_cpuid_entry2 *entry, + unsigned int x86_feature) +{ + u32 *reg = cpuid_entry_get_reg(entry, x86_feature); + + return *reg & __feature_bit(x86_feature); +} + +static __always_inline bool cpuid_entry_has(struct kvm_cpuid_entry2 *entry, + unsigned int x86_feature) +{ + return cpuid_entry_get(entry, x86_feature); +} + +static __always_inline u32 *guest_cpuid_get_register(struct kvm_vcpu *vcpu, + unsigned int x86_feature) +{ + const struct cpuid_reg cpuid = x86_feature_cpuid(x86_feature); + struct kvm_cpuid_entry2 *entry; + + entry = kvm_find_cpuid_entry(vcpu, cpuid.function, cpuid.index); + if (!entry) + return NULL; + + return __cpuid_entry_get_reg(entry, &cpuid); +} + static __always_inline bool guest_cpuid_has(struct kvm_vcpu *vcpu, unsigned int x86_feature) { -- cgit From b32666b13a72a1fd9f5078d8142bd7325022520f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:31 -0800 Subject: KVM: x86: Introduce cpuid_entry_{change,set,clear}() mutators Introduce mutators to modify feature bits in CPUID entries and use the new mutators where applicable. Using the mutators eliminates the need to manually specify the register to modify query at no extra cost and will allow adding runtime consistency checks on the function/index in a future patch. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 62 ++++++++++++++++++++++------------------------------ arch/x86/kvm/cpuid.h | 32 +++++++++++++++++++++++++++ arch/x86/kvm/svm.c | 11 ++++------ 3 files changed, 62 insertions(+), 43 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 81bf6555987f..14b5fb24c6be 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -57,15 +57,12 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) return 0; /* Update OSXSAVE bit */ - if (boot_cpu_has(X86_FEATURE_XSAVE) && best->function == 0x1) { - best->ecx &= ~F(OSXSAVE); - if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) - best->ecx |= F(OSXSAVE); - } + if (boot_cpu_has(X86_FEATURE_XSAVE) && best->function == 0x1) + cpuid_entry_change(best, X86_FEATURE_OSXSAVE, + kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)); - best->edx &= ~F(APIC); - if (vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE) - best->edx |= F(APIC); + cpuid_entry_change(best, X86_FEATURE_APIC, + vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE); if (apic) { if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER)) @@ -75,14 +72,9 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) } best = kvm_find_cpuid_entry(vcpu, 7, 0); - if (best) { - /* Update OSPKE bit */ - if (boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) { - best->ecx &= ~F(OSPKE); - if (kvm_read_cr4_bits(vcpu, X86_CR4_PKE)) - best->ecx |= F(OSPKE); - } - } + if (best && boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) + cpuid_entry_change(best, X86_FEATURE_OSPKE, + kvm_read_cr4_bits(vcpu, X86_CR4_PKE)); best = kvm_find_cpuid_entry(vcpu, 0xD, 0); if (!best) { @@ -119,12 +111,10 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT)) { best = kvm_find_cpuid_entry(vcpu, 0x1, 0); - if (best) { - if (vcpu->arch.ia32_misc_enable_msr & MSR_IA32_MISC_ENABLE_MWAIT) - best->ecx |= F(MWAIT); - else - best->ecx &= ~F(MWAIT); - } + if (best) + cpuid_entry_change(best, X86_FEATURE_MWAIT, + vcpu->arch.ia32_misc_enable_msr & + MSR_IA32_MISC_ENABLE_MWAIT); } /* Update physical-address width */ @@ -157,7 +147,7 @@ static void cpuid_fix_nx_cap(struct kvm_vcpu *vcpu) } } if (entry && cpuid_entry_has(entry, X86_FEATURE_NX) && !is_efer_nx()) { - entry->edx &= ~F(NX); + cpuid_entry_clear(entry, X86_FEATURE_NX); printk(KERN_INFO "kvm: guest NX capability removed\n"); } } @@ -385,7 +375,7 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) entry->ebx &= kvm_cpuid_7_0_ebx_x86_features; cpuid_mask(&entry->ebx, CPUID_7_0_EBX); /* TSC_ADJUST is emulated */ - entry->ebx |= F(TSC_ADJUST); + cpuid_entry_set(entry, X86_FEATURE_TSC_ADJUST); entry->ecx &= kvm_cpuid_7_0_ecx_x86_features; f_la57 = cpuid_entry_get(entry, X86_FEATURE_LA57); @@ -396,21 +386,21 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) entry->ecx |= f_pku; /* PKU is not yet implemented for shadow paging. */ if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE)) - entry->ecx &= ~F(PKU); + cpuid_entry_clear(entry, X86_FEATURE_PKU); entry->edx &= kvm_cpuid_7_0_edx_x86_features; cpuid_mask(&entry->edx, CPUID_7_EDX); if (boot_cpu_has(X86_FEATURE_IBPB) && boot_cpu_has(X86_FEATURE_IBRS)) - entry->edx |= F(SPEC_CTRL); + cpuid_entry_set(entry, X86_FEATURE_SPEC_CTRL); if (boot_cpu_has(X86_FEATURE_STIBP)) - entry->edx |= F(INTEL_STIBP); + cpuid_entry_set(entry, X86_FEATURE_INTEL_STIBP); if (boot_cpu_has(X86_FEATURE_AMD_SSBD)) - entry->edx |= F(SPEC_CTRL_SSBD); + cpuid_entry_set(entry, X86_FEATURE_SPEC_CTRL_SSBD); /* * We emulate ARCH_CAPABILITIES in software even * if the host doesn't support it. */ - entry->edx |= F(ARCH_CAPABILITIES); + cpuid_entry_set(entry, X86_FEATURE_ARCH_CAPABILITIES); break; case 1: entry->eax &= kvm_cpuid_7_1_eax_x86_features; @@ -522,7 +512,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) cpuid_mask(&entry->ecx, CPUID_1_ECX); /* we support x2apic emulation even if host does not support * it since we emulate x2apic in software */ - entry->ecx |= F(X2APIC); + cpuid_entry_set(entry, X86_FEATURE_X2APIC); break; /* function 2 entries are STATEFUL. That is, repeated cpuid commands * may return different values. This forces us to get_cpu() before @@ -737,22 +727,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) * record that in cpufeatures so use them. */ if (boot_cpu_has(X86_FEATURE_IBPB)) - entry->ebx |= F(AMD_IBPB); + cpuid_entry_set(entry, X86_FEATURE_AMD_IBPB); if (boot_cpu_has(X86_FEATURE_IBRS)) - entry->ebx |= F(AMD_IBRS); + cpuid_entry_set(entry, X86_FEATURE_AMD_IBRS); if (boot_cpu_has(X86_FEATURE_STIBP)) - entry->ebx |= F(AMD_STIBP); + cpuid_entry_set(entry, X86_FEATURE_AMD_STIBP); if (boot_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD)) - entry->ebx |= F(AMD_SSBD); + cpuid_entry_set(entry, X86_FEATURE_AMD_SSBD); if (!boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS)) - entry->ebx |= F(AMD_SSB_NO); + cpuid_entry_set(entry, X86_FEATURE_AMD_SSB_NO); /* * The preference is to use SPEC CTRL MSR instead of the * VIRT_SPEC MSR. */ if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) && !boot_cpu_has(X86_FEATURE_AMD_SSBD)) - entry->ebx |= F(VIRT_SSBD); + cpuid_entry_set(entry, X86_FEATURE_VIRT_SSBD); break; } case 0x80000019: diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index bf95428ddf4e..de3c6c365a5a 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -135,6 +135,38 @@ static __always_inline bool cpuid_entry_has(struct kvm_cpuid_entry2 *entry, return cpuid_entry_get(entry, x86_feature); } +static __always_inline void cpuid_entry_clear(struct kvm_cpuid_entry2 *entry, + unsigned int x86_feature) +{ + u32 *reg = cpuid_entry_get_reg(entry, x86_feature); + + *reg &= ~__feature_bit(x86_feature); +} + +static __always_inline void cpuid_entry_set(struct kvm_cpuid_entry2 *entry, + unsigned int x86_feature) +{ + u32 *reg = cpuid_entry_get_reg(entry, x86_feature); + + *reg |= __feature_bit(x86_feature); +} + +static __always_inline void cpuid_entry_change(struct kvm_cpuid_entry2 *entry, + unsigned int x86_feature, + bool set) +{ + u32 *reg = cpuid_entry_get_reg(entry, x86_feature); + + /* + * Open coded instead of using cpuid_entry_{clear,set}() to coerce the + * compiler into using CMOV instead of Jcc when possible. + */ + if (set) + *reg |= __feature_bit(x86_feature); + else + *reg &= ~__feature_bit(x86_feature); +} + static __always_inline u32 *guest_cpuid_get_register(struct kvm_vcpu *vcpu, unsigned int x86_feature) { diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 46d9b8ea04f1..9ec7952fecb0 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6033,19 +6033,17 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu) APICV_INHIBIT_REASON_NESTED); } -#define F feature_bit - static void svm_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) { switch (entry->function) { case 0x80000001: if (nested) - entry->ecx |= (1 << 2); /* Set SVM bit */ + cpuid_entry_set(entry, X86_FEATURE_SVM); break; case 0x80000008: if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) || boot_cpu_has(X86_FEATURE_AMD_SSBD)) - entry->ebx |= F(VIRT_SSBD); + cpuid_entry_set(entry, X86_FEATURE_VIRT_SSBD); break; case 0x8000000A: entry->eax = 1; /* SVM revision 1 */ @@ -6057,12 +6055,11 @@ static void svm_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) /* Support next_rip if host supports it */ if (boot_cpu_has(X86_FEATURE_NRIPS)) - entry->edx |= F(NRIPS); + cpuid_entry_set(entry, X86_FEATURE_NRIPS); /* Support NPT for the guest if enabled */ if (npt_enabled) - entry->edx |= F(NPT); - + cpuid_entry_set(entry, X86_FEATURE_NPT); } } -- cgit From e745e37d49771b8a06955f18c7a435830c0a4a5c Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:32 -0800 Subject: KVM: x86: Refactor cpuid_mask() to auto-retrieve the register Use the recently introduced cpuid_entry_get_reg() to automatically get the appropriate register when masking a CPUID entry. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 26 ++++++++++---------------- arch/x86/kvm/cpuid.h | 8 ++++++++ 2 files changed, 18 insertions(+), 16 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 14b5fb24c6be..b906ed316788 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -254,12 +254,6 @@ out: return r; } -static __always_inline void cpuid_mask(u32 *word, int wordnum) -{ - reverse_cpuid_check(wordnum); - *word &= boot_cpu_data.x86_capability[wordnum]; -} - struct kvm_cpuid_array { struct kvm_cpuid_entry2 *entries; const int maxnent; @@ -373,13 +367,13 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) case 0: entry->eax = min(entry->eax, 1u); entry->ebx &= kvm_cpuid_7_0_ebx_x86_features; - cpuid_mask(&entry->ebx, CPUID_7_0_EBX); + cpuid_entry_mask(entry, CPUID_7_0_EBX); /* TSC_ADJUST is emulated */ cpuid_entry_set(entry, X86_FEATURE_TSC_ADJUST); entry->ecx &= kvm_cpuid_7_0_ecx_x86_features; f_la57 = cpuid_entry_get(entry, X86_FEATURE_LA57); - cpuid_mask(&entry->ecx, CPUID_7_ECX); + cpuid_entry_mask(entry, CPUID_7_ECX); /* Set LA57 based on hardware capability. */ entry->ecx |= f_la57; entry->ecx |= f_umip; @@ -389,7 +383,7 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) cpuid_entry_clear(entry, X86_FEATURE_PKU); entry->edx &= kvm_cpuid_7_0_edx_x86_features; - cpuid_mask(&entry->edx, CPUID_7_EDX); + cpuid_entry_mask(entry, CPUID_7_EDX); if (boot_cpu_has(X86_FEATURE_IBPB) && boot_cpu_has(X86_FEATURE_IBRS)) cpuid_entry_set(entry, X86_FEATURE_SPEC_CTRL); if (boot_cpu_has(X86_FEATURE_STIBP)) @@ -507,9 +501,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) break; case 1: entry->edx &= kvm_cpuid_1_edx_x86_features; - cpuid_mask(&entry->edx, CPUID_1_EDX); + cpuid_entry_mask(entry, CPUID_1_EDX); entry->ecx &= kvm_cpuid_1_ecx_x86_features; - cpuid_mask(&entry->ecx, CPUID_1_ECX); + cpuid_entry_mask(entry, CPUID_1_ECX); /* we support x2apic emulation even if host does not support * it since we emulate x2apic in software */ cpuid_entry_set(entry, X86_FEATURE_X2APIC); @@ -619,7 +613,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) goto out; entry->eax &= kvm_cpuid_D_1_eax_x86_features; - cpuid_mask(&entry->eax, CPUID_D_1_EAX); + cpuid_entry_mask(entry, CPUID_D_1_EAX); if (entry->eax & (F(XSAVES)|F(XSAVEC))) entry->ebx = xstate_required_size(supported_xcr0, true); else @@ -699,9 +693,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) break; case 0x80000001: entry->edx &= kvm_cpuid_8000_0001_edx_x86_features; - cpuid_mask(&entry->edx, CPUID_8000_0001_EDX); + cpuid_entry_mask(entry, CPUID_8000_0001_EDX); entry->ecx &= kvm_cpuid_8000_0001_ecx_x86_features; - cpuid_mask(&entry->ecx, CPUID_8000_0001_ECX); + cpuid_entry_mask(entry, CPUID_8000_0001_ECX); break; case 0x80000007: /* Advanced power management */ /* invariant TSC is CPUID.80000007H:EDX[8] */ @@ -720,7 +714,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->eax = g_phys_as | (virt_as << 8); entry->edx = 0; entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features; - cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX); + cpuid_entry_mask(entry, CPUID_8000_0008_EBX); /* * AMD has separate bits for each SPEC_CTRL bit. * arch/x86/kernel/cpu/bugs.c is kind enough to @@ -763,7 +757,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) break; case 0xC0000001: entry->edx &= kvm_cpuid_C000_0001_edx_x86_features; - cpuid_mask(&entry->edx, CPUID_C000_0001_EDX); + cpuid_entry_mask(entry, CPUID_C000_0001_EDX); break; case 3: /* Processor serial number */ case 5: /* MONITOR/MWAIT */ diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index de3c6c365a5a..407dc26c0633 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -167,6 +167,14 @@ static __always_inline void cpuid_entry_change(struct kvm_cpuid_entry2 *entry, *reg &= ~__feature_bit(x86_feature); } +static __always_inline void cpuid_entry_mask(struct kvm_cpuid_entry2 *entry, + enum cpuid_leafs leaf) +{ + u32 *reg = cpuid_entry_get_reg(entry, leaf * 32); + + *reg &= boot_cpu_data.x86_capability[leaf]; +} + static __always_inline u32 *guest_cpuid_get_register(struct kvm_vcpu *vcpu, unsigned int x86_feature) { -- cgit From 6c7ea4b56bfe7a11ecbaef9c521a7974fc1171cf Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:33 -0800 Subject: KVM: x86: Handle MPX CPUID adjustment in VMX code Move the MPX CPUID adjustments into VMX to eliminate an instance of the undesirable "unsigned f_* = *_supported ? F(*) : 0" pattern in the common CPUID handling code. Note, to maintain existing behavior, VMX must manually check for kernel support for MPX by querying boot_cpu_has(X86_FEATURE_MPX). Previously, do_cpuid_7_mask() masked MPX based on boot_cpu_data by invoking cpuid_mask() on the associated cpufeatures word, but cpuid_mask() runs prior to executing vmx_set_supported_cpuid(). No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 3 +-- arch/x86/kvm/vmx/vmx.c | 14 ++++++++++++-- 2 files changed, 13 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index b906ed316788..06d3015c9946 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -332,7 +332,6 @@ static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func) static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) { unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0; - unsigned f_mpx = kvm_mpx_supported() ? F(MPX) : 0; unsigned f_umip = kvm_x86_ops->umip_emulated() ? F(UMIP) : 0; unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? F(INTEL_PT) : 0; unsigned f_la57; @@ -341,7 +340,7 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) /* cpuid 7.0.ebx */ const u32 kvm_cpuid_7_0_ebx_x86_features = F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) | - F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | f_mpx | F(RDSEED) | + F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | 0 /*MPX*/ | F(RDSEED) | F(ADX) | F(SMAP) | F(AVX512IFMA) | F(AVX512F) | F(AVX512PF) | F(AVX512ER) | F(AVX512CD) | F(CLFLUSHOPT) | F(CLWB) | F(AVX512DQ) | F(SHA_NI) | F(AVX512BW) | F(AVX512VL) | f_intel_pt; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 082ccefc4348..d8be060d8829 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7130,8 +7130,18 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu) static void vmx_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) { - if (entry->function == 1 && nested) - entry->ecx |= feature_bit(VMX); + switch (entry->function) { + case 0x1: + if (nested) + cpuid_entry_set(entry, X86_FEATURE_VMX); + break; + case 0x7: + if (boot_cpu_has(X86_FEATURE_MPX) && kvm_mpx_supported()) + cpuid_entry_set(entry, X86_FEATURE_MPX); + break; + default: + break; + } } static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu) -- cgit From 5ffec6f910dc8998c9da9320550ffddebe2e7afc Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:34 -0800 Subject: KVM: x86: Handle INVPCID CPUID adjustment in VMX code Move the INVPCID CPUID adjustments into VMX to eliminate an instance of the undesirable "unsigned f_* = *_supported ? F(*) : 0" pattern in the common CPUID handling code. Drop ->invpcid_supported(), CPUID adjustment was the only user. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 1 - arch/x86/kvm/cpuid.c | 3 +-- arch/x86/kvm/svm.c | 6 ------ arch/x86/kvm/vmx/vmx.c | 10 +++------- 4 files changed, 4 insertions(+), 16 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 626cbe161c57..52470ccde7f7 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1157,7 +1157,6 @@ struct kvm_x86_ops { u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio); int (*get_lpage_level)(void); bool (*rdtscp_supported)(void); - bool (*invpcid_supported)(void); void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3); diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 06d3015c9946..99c4748e5a0d 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -331,7 +331,6 @@ static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func) static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) { - unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0; unsigned f_umip = kvm_x86_ops->umip_emulated() ? F(UMIP) : 0; unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? F(INTEL_PT) : 0; unsigned f_la57; @@ -340,7 +339,7 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) /* cpuid 7.0.ebx */ const u32 kvm_cpuid_7_0_ebx_x86_features = F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) | - F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | 0 /*MPX*/ | F(RDSEED) | + F(BMI2) | F(ERMS) | 0 /*INVPCID*/ | F(RTM) | 0 /*MPX*/ | F(RDSEED) | F(ADX) | F(SMAP) | F(AVX512IFMA) | F(AVX512F) | F(AVX512PF) | F(AVX512ER) | F(AVX512CD) | F(CLFLUSHOPT) | F(CLWB) | F(AVX512DQ) | F(SHA_NI) | F(AVX512BW) | F(AVX512VL) | f_intel_pt; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 9ec7952fecb0..9ef4ebf8475a 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6073,11 +6073,6 @@ static bool svm_rdtscp_supported(void) return boot_cpu_has(X86_FEATURE_RDTSCP); } -static bool svm_invpcid_supported(void) -{ - return false; -} - static bool svm_xsaves_supported(void) { return boot_cpu_has(X86_FEATURE_XSAVES); @@ -7460,7 +7455,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .cpuid_update = svm_cpuid_update, .rdtscp_supported = svm_rdtscp_supported, - .invpcid_supported = svm_invpcid_supported, .xsaves_supported = svm_xsaves_supported, .umip_emulated = svm_umip_emulated, .pt_supported = svm_pt_supported, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index d8be060d8829..040c7e8f3345 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -1692,11 +1692,6 @@ static bool vmx_rdtscp_supported(void) return cpu_has_vmx_rdtscp(); } -static bool vmx_invpcid_supported(void) -{ - return cpu_has_vmx_invpcid(); -} - /* * Swap MSR entry in host/guest MSR entry array. */ @@ -4093,7 +4088,7 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx) } } - if (vmx_invpcid_supported()) { + if (cpu_has_vmx_invpcid()) { /* Exposing INVPCID only when PCID is exposed */ bool invpcid_enabled = guest_cpuid_has(vcpu, X86_FEATURE_INVPCID) && @@ -7138,6 +7133,8 @@ static void vmx_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) case 0x7: if (boot_cpu_has(X86_FEATURE_MPX) && kvm_mpx_supported()) cpuid_entry_set(entry, X86_FEATURE_MPX); + if (boot_cpu_has(X86_FEATURE_INVPCID) && cpu_has_vmx_invpcid()) + cpuid_entry_set(entry, X86_FEATURE_INVPCID); break; default: break; @@ -7924,7 +7921,6 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .cpuid_update = vmx_cpuid_update, .rdtscp_supported = vmx_rdtscp_supported, - .invpcid_supported = vmx_invpcid_supported, .set_supported_cpuid = vmx_set_supported_cpuid, -- cgit From e574768f841ba600009da06b3027d9413dba868f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:35 -0800 Subject: KVM: x86: Handle UMIP emulation CPUID adjustment in VMX code Move the CPUID adjustment for UMIP emulation into VMX code to eliminate an instance of the undesirable "unsigned f_* = *_supported ? F(*) : 0" pattern in the common CPUID handling code. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 2 -- arch/x86/kvm/vmx/vmx.c | 2 ++ 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 99c4748e5a0d..d7b3db024edc 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -331,7 +331,6 @@ static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func) static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) { - unsigned f_umip = kvm_x86_ops->umip_emulated() ? F(UMIP) : 0; unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? F(INTEL_PT) : 0; unsigned f_la57; unsigned f_pku = kvm_x86_ops->pku_supported() ? F(PKU) : 0; @@ -374,7 +373,6 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) cpuid_entry_mask(entry, CPUID_7_ECX); /* Set LA57 based on hardware capability. */ entry->ecx |= f_la57; - entry->ecx |= f_umip; entry->ecx |= f_pku; /* PKU is not yet implemented for shadow paging. */ if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE)) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 040c7e8f3345..2f0897f296fb 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7135,6 +7135,8 @@ static void vmx_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) cpuid_entry_set(entry, X86_FEATURE_MPX); if (boot_cpu_has(X86_FEATURE_INVPCID) && cpu_has_vmx_invpcid()) cpuid_entry_set(entry, X86_FEATURE_INVPCID); + if (vmx_umip_emulated()) + cpuid_entry_set(entry, X86_FEATURE_UMIP); break; default: break; -- cgit From d64d83d1e026f9fea9c8f18bf97b9529f7e4189c Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:36 -0800 Subject: KVM: x86: Handle PKU CPUID adjustment in VMX code Move the setting of the PKU CPUID bit into VMX to eliminate an instance of the undesirable "unsigned f_* = *_supported ? F(*) : 0" pattern in the common CPUID handling code. Drop ->pku_supported(), CPUID adjustment was the only user. Note, some AMD CPUs now support PKU, but SVM doesn't yet support exposing it to a guest. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 1 - arch/x86/kvm/cpuid.c | 5 ----- arch/x86/kvm/svm.c | 6 ------ arch/x86/kvm/vmx/capabilities.h | 5 ----- arch/x86/kvm/vmx/vmx.c | 6 +++++- 5 files changed, 5 insertions(+), 18 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 52470ccde7f7..5b7848e55efb 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1180,7 +1180,6 @@ struct kvm_x86_ops { bool (*xsaves_supported)(void); bool (*umip_emulated)(void); bool (*pt_supported)(void); - bool (*pku_supported)(void); int (*check_nested_events)(struct kvm_vcpu *vcpu); void (*request_immediate_exit)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index d7b3db024edc..3413fed22289 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -333,7 +333,6 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) { unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? F(INTEL_PT) : 0; unsigned f_la57; - unsigned f_pku = kvm_x86_ops->pku_supported() ? F(PKU) : 0; /* cpuid 7.0.ebx */ const u32 kvm_cpuid_7_0_ebx_x86_features = @@ -373,10 +372,6 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) cpuid_entry_mask(entry, CPUID_7_ECX); /* Set LA57 based on hardware capability. */ entry->ecx |= f_la57; - entry->ecx |= f_pku; - /* PKU is not yet implemented for shadow paging. */ - if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE)) - cpuid_entry_clear(entry, X86_FEATURE_PKU); entry->edx &= kvm_cpuid_7_0_edx_x86_features; cpuid_entry_mask(entry, CPUID_7_EDX); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 9ef4ebf8475a..8984ae140689 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6093,11 +6093,6 @@ static bool svm_has_wbinvd_exit(void) return true; } -static bool svm_pku_supported(void) -{ - return false; -} - #define PRE_EX(exit) { .exit_code = (exit), \ .stage = X86_ICPT_PRE_EXCEPT, } #define POST_EX(exit) { .exit_code = (exit), \ @@ -7458,7 +7453,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .xsaves_supported = svm_xsaves_supported, .umip_emulated = svm_umip_emulated, .pt_supported = svm_pt_supported, - .pku_supported = svm_pku_supported, .set_supported_cpuid = svm_set_supported_cpuid, diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index c00e26570198..8903475f751e 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -146,11 +146,6 @@ static inline bool vmx_umip_emulated(void) SECONDARY_EXEC_DESC; } -static inline bool vmx_pku_supported(void) -{ - return boot_cpu_has(X86_FEATURE_PKU); -} - static inline bool cpu_has_vmx_rdtscp(void) { return vmcs_config.cpu_based_2nd_exec_ctrl & diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 2f0897f296fb..60a6ef3ee3b4 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7137,6 +7137,11 @@ static void vmx_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) cpuid_entry_set(entry, X86_FEATURE_INVPCID); if (vmx_umip_emulated()) cpuid_entry_set(entry, X86_FEATURE_UMIP); + + /* PKU is not yet implemented for shadow paging. */ + if (enable_ept && boot_cpu_has(X86_FEATURE_PKU) && + boot_cpu_has(X86_FEATURE_OSPKE)) + cpuid_entry_set(entry, X86_FEATURE_PKU); break; default: break; @@ -7938,7 +7943,6 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .xsaves_supported = vmx_xsaves_supported, .umip_emulated = vmx_umip_emulated, .pt_supported = vmx_pt_supported, - .pku_supported = vmx_pku_supported, .request_immediate_exit = vmx_request_immediate_exit, -- cgit From 733deafc00df1dda5130fc14f87a1d3993913243 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:37 -0800 Subject: KVM: x86: Handle RDTSCP CPUID adjustment in VMX code Move the clearing of the RDTSCP CPUID bit into VMX, which has a separate VMCS control to enable RDTSCP in non-root, to eliminate an instance of the undesirable "unsigned f_* = *_supported ? F(*) : 0" pattern in the common CPUID handling code. Drop ->rdtscp_supported() since CPUID adjustment was the last remaining user. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 3 +-- arch/x86/kvm/vmx/vmx.c | 4 ++++ 2 files changed, 5 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 3413fed22289..9a930ec6912d 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -416,7 +416,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) unsigned f_gbpages = 0; unsigned f_lm = 0; #endif - unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0; unsigned f_xsaves = kvm_x86_ops->xsaves_supported() ? F(XSAVES) : 0; unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? F(INTEL_PT) : 0; @@ -438,7 +437,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) F(MTRR) | F(PGE) | F(MCA) | F(CMOV) | F(PAT) | F(PSE36) | 0 /* Reserved */ | f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) | - F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp | + F(FXSR) | F(FXSR_OPT) | f_gbpages | F(RDTSCP) | 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW); /* cpuid 1.ecx */ const u32 kvm_cpuid_1_ecx_x86_features = diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 60a6ef3ee3b4..fae2d4438086 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7143,6 +7143,10 @@ static void vmx_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) boot_cpu_has(X86_FEATURE_OSPKE)) cpuid_entry_set(entry, X86_FEATURE_PKU); break; + case 0x80000001: + if (!cpu_has_vmx_rdtscp()) + cpuid_entry_clear(entry, X86_FEATURE_RDTSCP); + break; default: break; } -- cgit From dbd068040c64162fbbfa278eb63ef64704190612 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:38 -0800 Subject: KVM: x86: Handle Intel PT CPUID adjustment in VMX code Move the Processor Trace CPUID adjustment into VMX code to eliminate an instance of the undesirable "unsigned f_* = *_supported ? F(*) : 0" pattern in the common CPUID handling code, and to pave the way toward eventually removing ->pt_supported(). No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 3 +-- arch/x86/kvm/vmx/vmx.c | 3 +++ 2 files changed, 4 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 9a930ec6912d..26955c724571 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -331,7 +331,6 @@ static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func) static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) { - unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? F(INTEL_PT) : 0; unsigned f_la57; /* cpuid 7.0.ebx */ @@ -340,7 +339,7 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) F(BMI2) | F(ERMS) | 0 /*INVPCID*/ | F(RTM) | 0 /*MPX*/ | F(RDSEED) | F(ADX) | F(SMAP) | F(AVX512IFMA) | F(AVX512F) | F(AVX512PF) | F(AVX512ER) | F(AVX512CD) | F(CLFLUSHOPT) | F(CLWB) | F(AVX512DQ) | - F(SHA_NI) | F(AVX512BW) | F(AVX512VL) | f_intel_pt; + F(SHA_NI) | F(AVX512BW) | F(AVX512VL) | 0 /*INTEL_PT*/; /* cpuid 7.0.ecx*/ const u32 kvm_cpuid_7_0_ecx_x86_features = diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index fae2d4438086..bf27cb8ac3fc 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7135,6 +7135,9 @@ static void vmx_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) cpuid_entry_set(entry, X86_FEATURE_MPX); if (boot_cpu_has(X86_FEATURE_INVPCID) && cpu_has_vmx_invpcid()) cpuid_entry_set(entry, X86_FEATURE_INVPCID); + if (boot_cpu_has(X86_FEATURE_INTEL_PT) && + vmx_pt_mode_is_host_guest()) + cpuid_entry_set(entry, X86_FEATURE_INTEL_PT); if (vmx_umip_emulated()) cpuid_entry_set(entry, X86_FEATURE_UMIP); -- cgit From fb7d4377d513145303c1d0a192cb4b33d72be2d9 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Tue, 3 Mar 2020 15:54:39 +0100 Subject: KVM: x86: handle GBPAGE CPUID adjustment for EPT with generic code The clearing of the GBPAGE CPUID bit for VMX is wrong; support for 1GB pages in EPT has no relationship to whether 1GB pages should be marked as supported in CPUID. This has no ill effect because we're only clearing the bit, but we're not marking 1GB pages as available when EPT is disabled (even though they are actually supported thanks to shadowing). Instead, forcibly enable 1GB pages in the shadow paging case. This also eliminates an instance of the undesirable "unsigned f_* = *_supported ? F(*) : 0" pattern in the common CPUID handling code, and paves the way toward eliminating ->get_lpage_level(). Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 26955c724571..aeab6573fe0f 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -408,8 +408,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) int r, i, max_idx; unsigned f_nx = is_efer_nx() ? F(NX) : 0; #ifdef CONFIG_X86_64 - unsigned f_gbpages = (kvm_x86_ops->get_lpage_level() == PT_PDPE_LEVEL) - ? F(GBPAGES) : 0; + unsigned f_gbpages = F(GBPAGES); unsigned f_lm = F(LM); #else unsigned f_gbpages = 0; @@ -683,6 +682,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) case 0x80000001: entry->edx &= kvm_cpuid_8000_0001_edx_x86_features; cpuid_entry_mask(entry, CPUID_8000_0001_EDX); + if (!tdp_enabled) + cpuid_entry_set(entry, X86_FEATURE_GBPAGES); entry->ecx &= kvm_cpuid_8000_0001_ecx_x86_features; cpuid_entry_mask(entry, CPUID_8000_0001_ECX); break; -- cgit From 9e6d01c2d9088efb8326997cafa8580295a49435 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:40 -0800 Subject: KVM: x86: Refactor handling of XSAVES CPUID adjustment Invert the handling of XSAVES, i.e. set it based on boot_cpu_has() by default, in preparation for adding KVM cpu caps, which will generate the mask at load time before ->xsaves_supported() is ready. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index aeab6573fe0f..3e4b03c8ec12 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -414,7 +414,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) unsigned f_gbpages = 0; unsigned f_lm = 0; #endif - unsigned f_xsaves = kvm_x86_ops->xsaves_supported() ? F(XSAVES) : 0; unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? F(INTEL_PT) : 0; /* cpuid 1.edx */ @@ -471,7 +470,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) /* cpuid 0xD.1.eax */ const u32 kvm_cpuid_D_1_eax_x86_features = - F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | f_xsaves; + F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES); /* all calls to cpuid_count() should be made on the same cpu */ get_cpu(); @@ -602,6 +601,10 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->eax &= kvm_cpuid_D_1_eax_x86_features; cpuid_entry_mask(entry, CPUID_D_1_EAX); + + if (!kvm_x86_ops->xsaves_supported()) + cpuid_entry_clear(entry, X86_FEATURE_XSAVES); + if (entry->eax & (F(XSAVES)|F(XSAVEC))) entry->ebx = xstate_required_size(supported_xcr0, true); else -- cgit From 66a6950f99950c77e2898e48d668eca1dac10a1e Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:41 -0800 Subject: KVM: x86: Introduce kvm_cpu_caps to replace runtime CPUID masking Calculate the CPUID masks for KVM_GET_SUPPORTED_CPUID at load time using what is effectively a KVM-adjusted copy of boot_cpu_data, or more precisely, the x86_capability array in boot_cpu_data. In terms of KVM support, the vast majority of CPUID feature bits are constant, and *all* feature support is known at KVM load time. Rather than apply boot_cpu_data, which is effectively read-only after init, at runtime, copy it into a KVM-specific array and use *that* to mask CPUID registers. In additional to consolidating the masking, kvm_cpu_caps can be adjusted by SVM/VMX at load time and thus eliminate all feature bit manipulation in ->set_supported_cpuid(). Opportunistically clean up a few warts: - Replace bare "unsigned" with "unsigned int" when a feature flag is captured in a local variable, e.g. f_nx. - Sort the CPUID masks by function, index and register (alphabetically for registers, i.e. EBX comes before ECX/EDX). - Remove the superfluous /* cpuid 7.0.ecx */ comments. No functional change intended. Signed-off-by: Sean Christopherson [Call kvm_set_cpu_caps from kvm_x86_ops->hardware_setup due to fixed GBPAGES patch. - Paolo] Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 233 ++++++++++++++++++++++++++----------------------- arch/x86/kvm/cpuid.h | 22 ++++- arch/x86/kvm/svm.c | 2 + arch/x86/kvm/vmx/vmx.c | 2 + 4 files changed, 151 insertions(+), 108 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 3e4b03c8ec12..31ea934d9b49 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -24,6 +24,13 @@ #include "trace.h" #include "pmu.h" +/* + * Unlike "struct cpuinfo_x86.x86_capability", kvm_cpu_caps doesn't need to be + * aligned to sizeof(unsigned long) because it's not accessed via bitops. + */ +u32 kvm_cpu_caps[NCAPINTS] __read_mostly; +EXPORT_SYMBOL_GPL(kvm_cpu_caps); + static u32 xstate_required_size(u64 xstate_bv, bool compacted) { int feature_bit = 0; @@ -254,6 +261,123 @@ out: return r; } +static __always_inline void kvm_cpu_cap_mask(enum cpuid_leafs leaf, u32 mask) +{ + reverse_cpuid_check(leaf); + kvm_cpu_caps[leaf] &= mask; +} + +void kvm_set_cpu_caps(void) +{ + unsigned int f_nx = is_efer_nx() ? F(NX) : 0; +#ifdef CONFIG_X86_64 + unsigned int f_gbpages = F(GBPAGES); + unsigned int f_lm = F(LM); +#else + unsigned int f_gbpages = 0; + unsigned int f_lm = 0; +#endif + + BUILD_BUG_ON(sizeof(kvm_cpu_caps) > + sizeof(boot_cpu_data.x86_capability)); + + memcpy(&kvm_cpu_caps, &boot_cpu_data.x86_capability, + sizeof(kvm_cpu_caps)); + + kvm_cpu_cap_mask(CPUID_1_ECX, + /* + * NOTE: MONITOR (and MWAIT) are emulated as NOP, but *not* + * advertised to guests via CPUID! + */ + F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | + 0 /* DS-CPL, VMX, SMX, EST */ | + 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | + F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ | + F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) | + F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) | + 0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) | + F(F16C) | F(RDRAND) + ); + + kvm_cpu_cap_mask(CPUID_1_EDX, + F(FPU) | F(VME) | F(DE) | F(PSE) | + F(TSC) | F(MSR) | F(PAE) | F(MCE) | + F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) | + F(MTRR) | F(PGE) | F(MCA) | F(CMOV) | + F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLUSH) | + 0 /* Reserved, DS, ACPI */ | F(MMX) | + F(FXSR) | F(XMM) | F(XMM2) | F(SELFSNOOP) | + 0 /* HTT, TM, Reserved, PBE */ + ); + + kvm_cpu_cap_mask(CPUID_7_0_EBX, + F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) | + F(BMI2) | F(ERMS) | 0 /*INVPCID*/ | F(RTM) | 0 /*MPX*/ | F(RDSEED) | + F(ADX) | F(SMAP) | F(AVX512IFMA) | F(AVX512F) | F(AVX512PF) | + F(AVX512ER) | F(AVX512CD) | F(CLFLUSHOPT) | F(CLWB) | F(AVX512DQ) | + F(SHA_NI) | F(AVX512BW) | F(AVX512VL) | 0 /*INTEL_PT*/ + ); + + kvm_cpu_cap_mask(CPUID_7_ECX, + F(AVX512VBMI) | F(LA57) | 0 /*PKU*/ | 0 /*OSPKE*/ | F(RDPID) | + F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) | + F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) | + F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ + ); + /* Set LA57 based on hardware capability. */ + if (cpuid_ecx(7) & F(LA57)) + kvm_cpu_cap_set(X86_FEATURE_LA57); + + kvm_cpu_cap_mask(CPUID_7_EDX, + F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) | + F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) | + F(MD_CLEAR) + ); + + kvm_cpu_cap_mask(CPUID_7_1_EAX, + F(AVX512_BF16) + ); + + kvm_cpu_cap_mask(CPUID_D_1_EAX, + F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES) + ); + + kvm_cpu_cap_mask(CPUID_8000_0001_ECX, + F(LAHF_LM) | F(CMP_LEGACY) | 0 /*SVM*/ | 0 /* ExtApicSpace */ | + F(CR8_LEGACY) | F(ABM) | F(SSE4A) | F(MISALIGNSSE) | + F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) | + 0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM) | + F(TOPOEXT) | F(PERFCTR_CORE) + ); + + kvm_cpu_cap_mask(CPUID_8000_0001_EDX, + F(FPU) | F(VME) | F(DE) | F(PSE) | + F(TSC) | F(MSR) | F(PAE) | F(MCE) | + F(CX8) | F(APIC) | 0 /* Reserved */ | F(SYSCALL) | + F(MTRR) | F(PGE) | F(MCA) | F(CMOV) | + F(PAT) | F(PSE36) | 0 /* Reserved */ | + f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) | + F(FXSR) | F(FXSR_OPT) | f_gbpages | F(RDTSCP) | + 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW) + ); + + if (!tdp_enabled && IS_ENABLED(CONFIG_X86_64)) + kvm_cpu_cap_set(X86_FEATURE_GBPAGES); + + kvm_cpu_cap_mask(CPUID_8000_0008_EBX, + F(CLZERO) | F(XSAVEERPTR) | + F(WBNOINVD) | F(AMD_IBPB) | F(AMD_IBRS) | F(AMD_SSBD) | F(VIRT_SSBD) | + F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON) + ); + + kvm_cpu_cap_mask(CPUID_C000_0001_EDX, + F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) | + F(ACE2) | F(ACE2_EN) | F(PHE) | F(PHE_EN) | + F(PMM) | F(PMM_EN) + ); +} +EXPORT_SYMBOL_GPL(kvm_set_cpu_caps); + struct kvm_cpuid_array { struct kvm_cpuid_entry2 *entries; const int maxnent; @@ -331,48 +455,13 @@ static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func) static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) { - unsigned f_la57; - - /* cpuid 7.0.ebx */ - const u32 kvm_cpuid_7_0_ebx_x86_features = - F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) | - F(BMI2) | F(ERMS) | 0 /*INVPCID*/ | F(RTM) | 0 /*MPX*/ | F(RDSEED) | - F(ADX) | F(SMAP) | F(AVX512IFMA) | F(AVX512F) | F(AVX512PF) | - F(AVX512ER) | F(AVX512CD) | F(CLFLUSHOPT) | F(CLWB) | F(AVX512DQ) | - F(SHA_NI) | F(AVX512BW) | F(AVX512VL) | 0 /*INTEL_PT*/; - - /* cpuid 7.0.ecx*/ - const u32 kvm_cpuid_7_0_ecx_x86_features = - F(AVX512VBMI) | F(LA57) | 0 /*PKU*/ | 0 /*OSPKE*/ | F(RDPID) | - F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) | - F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) | - F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/; - - /* cpuid 7.0.edx*/ - const u32 kvm_cpuid_7_0_edx_x86_features = - F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) | - F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) | - F(MD_CLEAR); - - /* cpuid 7.1.eax */ - const u32 kvm_cpuid_7_1_eax_x86_features = - F(AVX512_BF16); - switch (entry->index) { case 0: entry->eax = min(entry->eax, 1u); - entry->ebx &= kvm_cpuid_7_0_ebx_x86_features; cpuid_entry_mask(entry, CPUID_7_0_EBX); /* TSC_ADJUST is emulated */ cpuid_entry_set(entry, X86_FEATURE_TSC_ADJUST); - - entry->ecx &= kvm_cpuid_7_0_ecx_x86_features; - f_la57 = cpuid_entry_get(entry, X86_FEATURE_LA57); cpuid_entry_mask(entry, CPUID_7_ECX); - /* Set LA57 based on hardware capability. */ - entry->ecx |= f_la57; - - entry->edx &= kvm_cpuid_7_0_edx_x86_features; cpuid_entry_mask(entry, CPUID_7_EDX); if (boot_cpu_has(X86_FEATURE_IBPB) && boot_cpu_has(X86_FEATURE_IBRS)) cpuid_entry_set(entry, X86_FEATURE_SPEC_CTRL); @@ -387,7 +476,7 @@ static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) cpuid_entry_set(entry, X86_FEATURE_ARCH_CAPABILITIES); break; case 1: - entry->eax &= kvm_cpuid_7_1_eax_x86_features; + cpuid_entry_mask(entry, CPUID_7_1_EAX); entry->ebx = 0; entry->ecx = 0; entry->edx = 0; @@ -406,72 +495,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) { struct kvm_cpuid_entry2 *entry; int r, i, max_idx; - unsigned f_nx = is_efer_nx() ? F(NX) : 0; -#ifdef CONFIG_X86_64 - unsigned f_gbpages = F(GBPAGES); - unsigned f_lm = F(LM); -#else - unsigned f_gbpages = 0; - unsigned f_lm = 0; -#endif unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? F(INTEL_PT) : 0; - /* cpuid 1.edx */ - const u32 kvm_cpuid_1_edx_x86_features = - F(FPU) | F(VME) | F(DE) | F(PSE) | - F(TSC) | F(MSR) | F(PAE) | F(MCE) | - F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) | - F(MTRR) | F(PGE) | F(MCA) | F(CMOV) | - F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLUSH) | - 0 /* Reserved, DS, ACPI */ | F(MMX) | - F(FXSR) | F(XMM) | F(XMM2) | F(SELFSNOOP) | - 0 /* HTT, TM, Reserved, PBE */; - /* cpuid 0x80000001.edx */ - const u32 kvm_cpuid_8000_0001_edx_x86_features = - F(FPU) | F(VME) | F(DE) | F(PSE) | - F(TSC) | F(MSR) | F(PAE) | F(MCE) | - F(CX8) | F(APIC) | 0 /* Reserved */ | F(SYSCALL) | - F(MTRR) | F(PGE) | F(MCA) | F(CMOV) | - F(PAT) | F(PSE36) | 0 /* Reserved */ | - f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) | - F(FXSR) | F(FXSR_OPT) | f_gbpages | F(RDTSCP) | - 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW); - /* cpuid 1.ecx */ - const u32 kvm_cpuid_1_ecx_x86_features = - /* NOTE: MONITOR (and MWAIT) are emulated as NOP, - * but *not* advertised to guests via CPUID ! */ - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | - 0 /* DS-CPL, VMX, SMX, EST */ | - 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | - F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ | - F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) | - F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) | - 0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) | - F(F16C) | F(RDRAND); - /* cpuid 0x80000001.ecx */ - const u32 kvm_cpuid_8000_0001_ecx_x86_features = - F(LAHF_LM) | F(CMP_LEGACY) | 0 /*SVM*/ | 0 /* ExtApicSpace */ | - F(CR8_LEGACY) | F(ABM) | F(SSE4A) | F(MISALIGNSSE) | - F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) | - 0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM) | - F(TOPOEXT) | F(PERFCTR_CORE); - - /* cpuid 0x80000008.ebx */ - const u32 kvm_cpuid_8000_0008_ebx_x86_features = - F(CLZERO) | F(XSAVEERPTR) | - F(WBNOINVD) | F(AMD_IBPB) | F(AMD_IBRS) | F(AMD_SSBD) | F(VIRT_SSBD) | - F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON); - - /* cpuid 0xC0000001.edx */ - const u32 kvm_cpuid_C000_0001_edx_x86_features = - F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) | - F(ACE2) | F(ACE2_EN) | F(PHE) | F(PHE_EN) | - F(PMM) | F(PMM_EN); - - /* cpuid 0xD.1.eax */ - const u32 kvm_cpuid_D_1_eax_x86_features = - F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES); - /* all calls to cpuid_count() should be made on the same cpu */ get_cpu(); @@ -487,9 +512,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->eax = min(entry->eax, 0x1fU); break; case 1: - entry->edx &= kvm_cpuid_1_edx_x86_features; cpuid_entry_mask(entry, CPUID_1_EDX); - entry->ecx &= kvm_cpuid_1_ecx_x86_features; cpuid_entry_mask(entry, CPUID_1_ECX); /* we support x2apic emulation even if host does not support * it since we emulate x2apic in software */ @@ -599,7 +622,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) if (!entry) goto out; - entry->eax &= kvm_cpuid_D_1_eax_x86_features; cpuid_entry_mask(entry, CPUID_D_1_EAX); if (!kvm_x86_ops->xsaves_supported()) @@ -683,11 +705,10 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->eax = min(entry->eax, 0x8000001f); break; case 0x80000001: - entry->edx &= kvm_cpuid_8000_0001_edx_x86_features; cpuid_entry_mask(entry, CPUID_8000_0001_EDX); + /* Add it manually because it may not be in host CPUID. */ if (!tdp_enabled) cpuid_entry_set(entry, X86_FEATURE_GBPAGES); - entry->ecx &= kvm_cpuid_8000_0001_ecx_x86_features; cpuid_entry_mask(entry, CPUID_8000_0001_ECX); break; case 0x80000007: /* Advanced power management */ @@ -706,7 +727,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) g_phys_as = phys_as; entry->eax = g_phys_as | (virt_as << 8); entry->edx = 0; - entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features; cpuid_entry_mask(entry, CPUID_8000_0008_EBX); /* * AMD has separate bits for each SPEC_CTRL bit. @@ -749,7 +769,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->eax = min(entry->eax, 0xC0000004); break; case 0xC0000001: - entry->edx &= kvm_cpuid_C000_0001_edx_x86_features; cpuid_entry_mask(entry, CPUID_C000_0001_EDX); break; case 3: /* Processor serial number */ diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 407dc26c0633..13374f885c81 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -6,6 +6,9 @@ #include #include +extern u32 kvm_cpu_caps[NCAPINTS] __read_mostly; +void kvm_set_cpu_caps(void); + int kvm_update_cpuid(struct kvm_vcpu *vcpu); struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, u32 function, u32 index); @@ -172,7 +175,8 @@ static __always_inline void cpuid_entry_mask(struct kvm_cpuid_entry2 *entry, { u32 *reg = cpuid_entry_get_reg(entry, leaf * 32); - *reg &= boot_cpu_data.x86_capability[leaf]; + BUILD_BUG_ON(leaf >= ARRAY_SIZE(kvm_cpu_caps)); + *reg &= kvm_cpu_caps[leaf]; } static __always_inline u32 *guest_cpuid_get_register(struct kvm_vcpu *vcpu, @@ -262,4 +266,20 @@ static inline bool cpuid_fault_enabled(struct kvm_vcpu *vcpu) MSR_MISC_FEATURES_ENABLES_CPUID_FAULT; } +static __always_inline void kvm_cpu_cap_clear(unsigned int x86_feature) +{ + unsigned int x86_leaf = x86_feature / 32; + + reverse_cpuid_check(x86_leaf); + kvm_cpu_caps[x86_leaf] &= ~__feature_bit(x86_feature); +} + +static __always_inline void kvm_cpu_cap_set(unsigned int x86_feature) +{ + unsigned int x86_leaf = x86_feature / 32; + + reverse_cpuid_check(x86_leaf); + kvm_cpu_caps[x86_leaf] |= __feature_bit(x86_feature); +} + #endif diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 8984ae140689..aae5e3eff48d 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1479,6 +1479,8 @@ static __init int svm_hardware_setup(void) pr_info("Virtual GIF supported\n"); } + kvm_set_cpu_caps(); + return 0; err: diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index bf27cb8ac3fc..ae482f4f5678 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7818,6 +7818,8 @@ static __init int hardware_setup(void) return r; } + kvm_set_cpu_caps(); + r = alloc_kvm_area(); if (r) nested_vmx_hardware_unsetup(); -- cgit From 9b58b9857f221e4f7149a22727ef61d0c141f56b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:42 -0800 Subject: KVM: SVM: Convert feature updates from CPUID to KVM cpu caps Use the recently introduced KVM CPU caps to propagate SVM-only (kernel) settings to supported CPUID flags. Note, there are a few subtleties: - Setting a flag based on a *different* feature is effectively emulation, and must be done at runtime via ->set_supported_cpuid(). - CPUID 0x8000000A.EDX is a feature leaf that was previously not adjusted by kvm_cpu_cap_mask() because all features are hidden by default. Opportunistically add a technically unnecessary break and fix an indentation issue in svm_set_supported_cpuid(). No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 6 ++++++ arch/x86/kvm/svm.c | 46 ++++++++++++++++++++++++++++++---------------- 2 files changed, 36 insertions(+), 16 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 31ea934d9b49..a5aed82d9ffa 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -370,6 +370,12 @@ void kvm_set_cpu_caps(void) F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON) ); + /* + * Hide all SVM features by default, SVM will set the cap bits for + * features it emulates and/or exposes for L1. + */ + kvm_cpu_cap_mask(CPUID_8000_000A_EDX, 0); + kvm_cpu_cap_mask(CPUID_C000_0001_EDX, F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) | F(ACE2) | F(ACE2_EN) | F(PHE) | F(PHE_EN) | diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index aae5e3eff48d..61333c8306b0 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1367,6 +1367,23 @@ static void svm_hardware_teardown(void) iopm_base = 0; } +static __init void svm_set_cpu_caps(void) +{ + kvm_set_cpu_caps(); + + /* CPUID 0x80000001 */ + if (nested) + kvm_cpu_cap_set(X86_FEATURE_SVM); + + /* CPUID 0x8000000A */ + /* Support next_rip if host supports it */ + if (boot_cpu_has(X86_FEATURE_NRIPS)) + kvm_cpu_cap_set(X86_FEATURE_NRIPS); + + if (npt_enabled) + kvm_cpu_cap_set(X86_FEATURE_NPT); +} + static __init int svm_hardware_setup(void) { int cpu; @@ -1479,7 +1496,7 @@ static __init int svm_hardware_setup(void) pr_info("Virtual GIF supported\n"); } - kvm_set_cpu_caps(); + svm_set_cpu_caps(); return 0; @@ -6035,16 +6052,20 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu) APICV_INHIBIT_REASON_NESTED); } +/* + * Vendor specific emulation must be handled via ->set_supported_cpuid(), not + * svm_set_cpu_caps(), as capabilities configured during hardware_setup() are + * masked against hardware/kernel support, i.e. they'd be lost. + * + * Note, setting a flag based on a *different* feature, e.g. setting VIRT_SSBD + * if LS_CFG_SSBD or AMD_SSBD is supported, is effectively emulation. + */ static void svm_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) { switch (entry->function) { - case 0x80000001: - if (nested) - cpuid_entry_set(entry, X86_FEATURE_SVM); - break; case 0x80000008: if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) || - boot_cpu_has(X86_FEATURE_AMD_SSBD)) + boot_cpu_has(X86_FEATURE_AMD_SSBD)) cpuid_entry_set(entry, X86_FEATURE_VIRT_SSBD); break; case 0x8000000A: @@ -6052,16 +6073,9 @@ static void svm_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) entry->ebx = 8; /* Lets support 8 ASIDs in case we add proper ASID emulation to nested SVM */ entry->ecx = 0; /* Reserved */ - entry->edx = 0; /* Per default do not support any - additional features */ - - /* Support next_rip if host supports it */ - if (boot_cpu_has(X86_FEATURE_NRIPS)) - cpuid_entry_set(entry, X86_FEATURE_NRIPS); - - /* Support NPT for the guest if enabled */ - if (npt_enabled) - cpuid_entry_set(entry, X86_FEATURE_NPT); + /* Note, 0x8000000A.EDX is managed via kvm_cpu_caps. */; + cpuid_entry_mask(entry, CPUID_8000_000A_EDX); + break; } } -- cgit From 3ec6fd8cf0ba6bb3ded5cdee88319c9af90e14c8 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:43 -0800 Subject: KVM: VMX: Convert feature updates from CPUID to KVM cpu caps Use the recently introduced KVM CPU caps to propagate VMX-only (kernel) settings to supported CPUID flags. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 54 ++++++++++++++++++++++++++++++-------------------- 1 file changed, 33 insertions(+), 21 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index ae482f4f5678..19507222414d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7123,38 +7123,50 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu) } } +/* + * Vendor specific emulation must be handled via ->set_supported_cpuid(), not + * vmx_set_cpu_caps(), as capabilities configured during hardware_setup() are + * masked against hardware/kernel support, i.e. they'd be lost. + */ static void vmx_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) { switch (entry->function) { - case 0x1: - if (nested) - cpuid_entry_set(entry, X86_FEATURE_VMX); - break; case 0x7: - if (boot_cpu_has(X86_FEATURE_MPX) && kvm_mpx_supported()) - cpuid_entry_set(entry, X86_FEATURE_MPX); - if (boot_cpu_has(X86_FEATURE_INVPCID) && cpu_has_vmx_invpcid()) - cpuid_entry_set(entry, X86_FEATURE_INVPCID); - if (boot_cpu_has(X86_FEATURE_INTEL_PT) && - vmx_pt_mode_is_host_guest()) - cpuid_entry_set(entry, X86_FEATURE_INTEL_PT); if (vmx_umip_emulated()) cpuid_entry_set(entry, X86_FEATURE_UMIP); - - /* PKU is not yet implemented for shadow paging. */ - if (enable_ept && boot_cpu_has(X86_FEATURE_PKU) && - boot_cpu_has(X86_FEATURE_OSPKE)) - cpuid_entry_set(entry, X86_FEATURE_PKU); - break; - case 0x80000001: - if (!cpu_has_vmx_rdtscp()) - cpuid_entry_clear(entry, X86_FEATURE_RDTSCP); break; default: break; } } +static __init void vmx_set_cpu_caps(void) +{ + kvm_set_cpu_caps(); + + /* CPUID 0x1 */ + if (nested) + kvm_cpu_cap_set(X86_FEATURE_VMX); + + /* CPUID 0x7 */ + if (boot_cpu_has(X86_FEATURE_MPX) && kvm_mpx_supported()) + kvm_cpu_cap_set(X86_FEATURE_MPX); + if (boot_cpu_has(X86_FEATURE_INVPCID) && cpu_has_vmx_invpcid()) + kvm_cpu_cap_set(X86_FEATURE_INVPCID); + if (boot_cpu_has(X86_FEATURE_INTEL_PT) && + vmx_pt_mode_is_host_guest()) + kvm_cpu_cap_set(X86_FEATURE_INTEL_PT); + + /* PKU is not yet implemented for shadow paging. */ + if (enable_ept && boot_cpu_has(X86_FEATURE_PKU) && + boot_cpu_has(X86_FEATURE_OSPKE)) + kvm_cpu_cap_set(X86_FEATURE_PKU); + + /* CPUID 0x80000001 */ + if (!cpu_has_vmx_rdtscp()) + kvm_cpu_cap_clear(X86_FEATURE_RDTSCP); +} + static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu) { to_vmx(vcpu)->req_immediate_exit = true; @@ -7818,7 +7830,7 @@ static __init int hardware_setup(void) return r; } - kvm_set_cpu_caps(); + vmx_set_cpu_caps(); r = alloc_kvm_area(); if (r) -- cgit From b3d895d5c4154156894fd1df2158d82f94fb5527 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:44 -0800 Subject: KVM: x86: Move XSAVES CPUID adjust to VMX's KVM cpu cap update Move the clearing of the XSAVES CPUID bit into VMX, which has a separate VMCS control to enable XSAVES in non-root, to eliminate the last ugly renmant of the undesirable "unsigned f_* = *_supported ? F(*) : 0" pattern in the common CPUID handling code. Drop ->xsaves_supported(), CPUID adjustment was the only user. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 1 - arch/x86/kvm/cpuid.c | 4 ---- arch/x86/kvm/svm.c | 6 ------ arch/x86/kvm/vmx/vmx.c | 5 ++++- 4 files changed, 4 insertions(+), 12 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 5b7848e55efb..d05138058a07 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1177,7 +1177,6 @@ struct kvm_x86_ops { void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu, enum exit_fastpath_completion *exit_fastpath); - bool (*xsaves_supported)(void); bool (*umip_emulated)(void); bool (*pt_supported)(void); diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index a5aed82d9ffa..d0121199a231 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -629,10 +629,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) goto out; cpuid_entry_mask(entry, CPUID_D_1_EAX); - - if (!kvm_x86_ops->xsaves_supported()) - cpuid_entry_clear(entry, X86_FEATURE_XSAVES); - if (entry->eax & (F(XSAVES)|F(XSAVEC))) entry->ebx = xstate_required_size(supported_xcr0, true); else diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 61333c8306b0..30e1745c04a1 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6089,11 +6089,6 @@ static bool svm_rdtscp_supported(void) return boot_cpu_has(X86_FEATURE_RDTSCP); } -static bool svm_xsaves_supported(void) -{ - return boot_cpu_has(X86_FEATURE_XSAVES); -} - static bool svm_umip_emulated(void) { return false; @@ -7466,7 +7461,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .cpuid_update = svm_cpuid_update, .rdtscp_supported = svm_rdtscp_supported, - .xsaves_supported = svm_xsaves_supported, .umip_emulated = svm_umip_emulated, .pt_supported = svm_pt_supported, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 19507222414d..b68660916ec1 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7162,6 +7162,10 @@ static __init void vmx_set_cpu_caps(void) boot_cpu_has(X86_FEATURE_OSPKE)) kvm_cpu_cap_set(X86_FEATURE_PKU); + /* CPUID 0xD.1 */ + if (!vmx_xsaves_supported()) + kvm_cpu_cap_clear(X86_FEATURE_XSAVES); + /* CPUID 0x80000001 */ if (!cpu_has_vmx_rdtscp()) kvm_cpu_cap_clear(X86_FEATURE_RDTSCP); @@ -7961,7 +7965,6 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .check_intercept = vmx_check_intercept, .handle_exit_irqoff = vmx_handle_exit_irqoff, - .xsaves_supported = vmx_xsaves_supported, .umip_emulated = vmx_umip_emulated, .pt_supported = vmx_pt_supported, -- cgit From 8721f5b061eb18c4bb3b77be3ec1c2811ca574ba Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:45 -0800 Subject: KVM: x86: Add a helper to check kernel support when setting cpu cap Add a helper, kvm_cpu_cap_check_and_set(), to query boot_cpu_has() as part of setting a KVM cpu capability. VMX in particular has a number of features that are dependent on both a VMCS capability and kernel support. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.h | 6 ++++++ arch/x86/kvm/svm.c | 3 +-- arch/x86/kvm/vmx/vmx.c | 18 ++++++++---------- 3 files changed, 15 insertions(+), 12 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 13374f885c81..fd29b646916a 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -282,4 +282,10 @@ static __always_inline void kvm_cpu_cap_set(unsigned int x86_feature) kvm_cpu_caps[x86_leaf] |= __feature_bit(x86_feature); } +static __always_inline void kvm_cpu_cap_check_and_set(unsigned int x86_feature) +{ + if (boot_cpu_has(x86_feature)) + kvm_cpu_cap_set(x86_feature); +} + #endif diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 30e1745c04a1..997a471d4704 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1377,8 +1377,7 @@ static __init void svm_set_cpu_caps(void) /* CPUID 0x8000000A */ /* Support next_rip if host supports it */ - if (boot_cpu_has(X86_FEATURE_NRIPS)) - kvm_cpu_cap_set(X86_FEATURE_NRIPS); + kvm_cpu_cap_check_and_set(X86_FEATURE_NRIPS); if (npt_enabled) kvm_cpu_cap_set(X86_FEATURE_NPT); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index b68660916ec1..16a32d37aff9 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7149,18 +7149,16 @@ static __init void vmx_set_cpu_caps(void) kvm_cpu_cap_set(X86_FEATURE_VMX); /* CPUID 0x7 */ - if (boot_cpu_has(X86_FEATURE_MPX) && kvm_mpx_supported()) - kvm_cpu_cap_set(X86_FEATURE_MPX); - if (boot_cpu_has(X86_FEATURE_INVPCID) && cpu_has_vmx_invpcid()) - kvm_cpu_cap_set(X86_FEATURE_INVPCID); - if (boot_cpu_has(X86_FEATURE_INTEL_PT) && - vmx_pt_mode_is_host_guest()) - kvm_cpu_cap_set(X86_FEATURE_INTEL_PT); + if (kvm_mpx_supported()) + kvm_cpu_cap_check_and_set(X86_FEATURE_MPX); + if (cpu_has_vmx_invpcid()) + kvm_cpu_cap_check_and_set(X86_FEATURE_INVPCID); + if (vmx_pt_mode_is_host_guest()) + kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT); /* PKU is not yet implemented for shadow paging. */ - if (enable_ept && boot_cpu_has(X86_FEATURE_PKU) && - boot_cpu_has(X86_FEATURE_OSPKE)) - kvm_cpu_cap_set(X86_FEATURE_PKU); + if (enable_ept && boot_cpu_has(X86_FEATURE_OSPKE)) + kvm_cpu_cap_check_and_set(X86_FEATURE_PKU); /* CPUID 0xD.1 */ if (!vmx_xsaves_supported()) -- cgit From c10398b6d0ddb9c8234890828ab83341e11f9840 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:46 -0800 Subject: KVM: x86: Use KVM cpu caps to mark CR4.LA57 as not-reserved Add accessor(s) for KVM cpu caps and use said accessor to detect hardware support for LA57 instead of manually querying CPUID. Note, the explicit conversion to bool via '!!' in kvm_cpu_cap_has() is technically unnecessary, but it gives people a warm fuzzy feeling. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.h | 13 +++++++++++++ arch/x86/kvm/x86.c | 2 +- 2 files changed, 14 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index fd29b646916a..e3dc0f02ad5c 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -282,6 +282,19 @@ static __always_inline void kvm_cpu_cap_set(unsigned int x86_feature) kvm_cpu_caps[x86_leaf] |= __feature_bit(x86_feature); } +static __always_inline u32 kvm_cpu_cap_get(unsigned int x86_feature) +{ + unsigned int x86_leaf = x86_feature / 32; + + reverse_cpuid_check(x86_leaf); + return kvm_cpu_caps[x86_leaf] & __feature_bit(x86_feature); +} + +static __always_inline bool kvm_cpu_cap_has(unsigned int x86_feature) +{ + return !!kvm_cpu_cap_get(x86_feature); +} + static __always_inline void kvm_cpu_cap_check_and_set(unsigned int x86_feature) { if (boot_cpu_has(x86_feature)) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 849957f3afb2..389bc80f684c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -925,7 +925,7 @@ static u64 kvm_host_cr4_reserved_bits(struct cpuinfo_x86 *c) { u64 reserved_bits = __cr4_reserved_bits(cpu_has, c); - if (cpuid_ecx(0x7) & feature_bit(LA57)) + if (kvm_cpu_cap_has(X86_FEATURE_LA57)) reserved_bits &= ~X86_CR4_LA57; if (kvm_x86_ops->umip_emulated()) -- cgit From 90d2f60f41f73b90768554e5a30b1cfedd167731 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:47 -0800 Subject: KVM: x86: Use KVM cpu caps to track UMIP emulation Set UMIP in kvm_cpu_caps when it is emulated by VMX, even though the bit will effectively be dropped by do_host_cpuid(). This allows checking for UMIP emulation via kvm_cpu_caps instead of a dedicated kvm_x86_ops callback. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 1 - arch/x86/kvm/svm.c | 6 ------ arch/x86/kvm/vmx/vmx.c | 8 +++++++- arch/x86/kvm/x86.c | 2 +- 4 files changed, 8 insertions(+), 9 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index d05138058a07..c46373016574 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1177,7 +1177,6 @@ struct kvm_x86_ops { void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu, enum exit_fastpath_completion *exit_fastpath); - bool (*umip_emulated)(void); bool (*pt_supported)(void); int (*check_nested_events)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 997a471d4704..26d2e170e9fd 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6088,11 +6088,6 @@ static bool svm_rdtscp_supported(void) return boot_cpu_has(X86_FEATURE_RDTSCP); } -static bool svm_umip_emulated(void) -{ - return false; -} - static bool svm_pt_supported(void) { return false; @@ -7460,7 +7455,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .cpuid_update = svm_cpuid_update, .rdtscp_supported = svm_rdtscp_supported, - .umip_emulated = svm_umip_emulated, .pt_supported = svm_pt_supported, .set_supported_cpuid = svm_set_supported_cpuid, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 16a32d37aff9..bea247fabca0 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7132,6 +7132,10 @@ static void vmx_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) { switch (entry->function) { case 0x7: + /* + * UMIP needs to be manually set even though vmx_set_cpu_caps() + * also sets UMIP since do_host_cpuid() will drop it. + */ if (vmx_umip_emulated()) cpuid_entry_set(entry, X86_FEATURE_UMIP); break; @@ -7160,6 +7164,9 @@ static __init void vmx_set_cpu_caps(void) if (enable_ept && boot_cpu_has(X86_FEATURE_OSPKE)) kvm_cpu_cap_check_and_set(X86_FEATURE_PKU); + if (vmx_umip_emulated()) + kvm_cpu_cap_set(X86_FEATURE_UMIP); + /* CPUID 0xD.1 */ if (!vmx_xsaves_supported()) kvm_cpu_cap_clear(X86_FEATURE_XSAVES); @@ -7963,7 +7970,6 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .check_intercept = vmx_check_intercept, .handle_exit_irqoff = vmx_handle_exit_irqoff, - .umip_emulated = vmx_umip_emulated, .pt_supported = vmx_pt_supported, .request_immediate_exit = vmx_request_immediate_exit, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 389bc80f684c..51a49a6ed070 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -928,7 +928,7 @@ static u64 kvm_host_cr4_reserved_bits(struct cpuinfo_x86 *c) if (kvm_cpu_cap_has(X86_FEATURE_LA57)) reserved_bits &= ~X86_CR4_LA57; - if (kvm_x86_ops->umip_emulated()) + if (kvm_cpu_cap_has(X86_FEATURE_UMIP)) reserved_bits &= ~X86_CR4_UMIP; return reserved_bits; -- cgit From 09f628a0b49cf489faada00254569813a2143894 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:48 -0800 Subject: KVM: x86: Fold CPUID 0x7 masking back into __do_cpuid_func() Move the CPUID 0x7 masking back into __do_cpuid_func() now that the size of the code has been trimmed down significantly. Tweak the WARN case, which is impossible to hit unless the CPU is completely broken, to break the loop before creating the bogus entry. Opportunustically reorder the cpuid_entry_set() calls and shorten the comment about emulation to further reduce the footprint of CPUID 0x7. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 62 +++++++++++++++++++--------------------------------- 1 file changed, 22 insertions(+), 40 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index d0121199a231..f6d3121656cc 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -459,44 +459,6 @@ static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func) return 0; } -static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry) -{ - switch (entry->index) { - case 0: - entry->eax = min(entry->eax, 1u); - cpuid_entry_mask(entry, CPUID_7_0_EBX); - /* TSC_ADJUST is emulated */ - cpuid_entry_set(entry, X86_FEATURE_TSC_ADJUST); - cpuid_entry_mask(entry, CPUID_7_ECX); - cpuid_entry_mask(entry, CPUID_7_EDX); - if (boot_cpu_has(X86_FEATURE_IBPB) && boot_cpu_has(X86_FEATURE_IBRS)) - cpuid_entry_set(entry, X86_FEATURE_SPEC_CTRL); - if (boot_cpu_has(X86_FEATURE_STIBP)) - cpuid_entry_set(entry, X86_FEATURE_INTEL_STIBP); - if (boot_cpu_has(X86_FEATURE_AMD_SSBD)) - cpuid_entry_set(entry, X86_FEATURE_SPEC_CTRL_SSBD); - /* - * We emulate ARCH_CAPABILITIES in software even - * if the host doesn't support it. - */ - cpuid_entry_set(entry, X86_FEATURE_ARCH_CAPABILITIES); - break; - case 1: - cpuid_entry_mask(entry, CPUID_7_1_EAX); - entry->ebx = 0; - entry->ecx = 0; - entry->edx = 0; - break; - default: - WARN_ON_ONCE(1); - entry->eax = 0; - entry->ebx = 0; - entry->ecx = 0; - entry->edx = 0; - break; - } -} - static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) { struct kvm_cpuid_entry2 *entry; @@ -558,14 +520,34 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) break; /* function 7 has additional index. */ case 7: - do_cpuid_7_mask(entry); + entry->eax = min(entry->eax, 1u); + cpuid_entry_mask(entry, CPUID_7_0_EBX); + cpuid_entry_mask(entry, CPUID_7_ECX); + cpuid_entry_mask(entry, CPUID_7_EDX); + + /* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */ + cpuid_entry_set(entry, X86_FEATURE_TSC_ADJUST); + cpuid_entry_set(entry, X86_FEATURE_ARCH_CAPABILITIES); + + if (boot_cpu_has(X86_FEATURE_IBPB) && boot_cpu_has(X86_FEATURE_IBRS)) + cpuid_entry_set(entry, X86_FEATURE_SPEC_CTRL); + if (boot_cpu_has(X86_FEATURE_STIBP)) + cpuid_entry_set(entry, X86_FEATURE_INTEL_STIBP); + if (boot_cpu_has(X86_FEATURE_AMD_SSBD)) + cpuid_entry_set(entry, X86_FEATURE_SPEC_CTRL_SSBD); for (i = 1, max_idx = entry->eax; i <= max_idx; i++) { + if (WARN_ON_ONCE(i > 1)) + break; + entry = do_host_cpuid(array, function, i); if (!entry) goto out; - do_cpuid_7_mask(entry); + cpuid_entry_mask(entry, CPUID_7_1_EAX); + entry->ebx = 0; + entry->ecx = 0; + entry->edx = 0; } break; case 9: -- cgit From bcf600ca8d21e49af21fec5d2cf3927d6f62d048 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:49 -0800 Subject: KVM: x86: Remove the unnecessary loop on CPUID 0x7 sub-leafs Explicitly handle CPUID 0x7 sub-leaf 1. The kernel is currently aware of exactly one feature in CPUID 0x7.1, which means there is room for another 127 features before CPUID 0x7.2 will see the light of day, i.e. the looping is likely to be dead code for years to come. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index f6d3121656cc..cce78723376c 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -536,11 +536,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) if (boot_cpu_has(X86_FEATURE_AMD_SSBD)) cpuid_entry_set(entry, X86_FEATURE_SPEC_CTRL_SSBD); - for (i = 1, max_idx = entry->eax; i <= max_idx; i++) { - if (WARN_ON_ONCE(i > 1)) - break; - - entry = do_host_cpuid(array, function, i); + /* KVM only supports 0x7.0 and 0x7.1, capped above via min(). */ + if (entry->eax == 1) { + entry = do_host_cpuid(array, function, 1); if (!entry) goto out; -- cgit From c571a144ef17217fab0db44a1d88675b7df388ea Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:50 -0800 Subject: KVM: x86: Squash CPUID 0x2.0 insanity for modern CPUs Rework CPUID 0x2.0 to be a normal CPUID leaf if it returns "01" in AL, i.e. EAX & 0xff, as a step towards removing KVM's stateful CPUID code altogether. Long ago, Intel documented CPUID 0x2.0 as being a stateful leaf, e.g. a version of the SDM circa 1995 states: The least-significant byte in register EAX (register AL) indicates the number of times the CPUID instruction must be executed with an input value of 2 to get a complete description of the processors's caches and TLBs. The Pentium Pro family of processors will return a 1. A 2000 version of the SDM only updated the paragraph to reference Intel's new processory family: The first member of the family of Pentium 4 processors will return a 1. Fast forward to the present, and Intel's SDM now states: The least-significant byte in register EAX (register AL) will always return 01H. Software should ignore this value and not interpret it as an information descriptor. AMD's APM simply states that CPUID 0x2 is reserved. Given that CPUID itself was introduced in the Pentium, odds are good that the only Intel CPU family that *maybe* implemented a stateful CPUID was the P5. Which obviously did not support VMX, or KVM. In other words, KVM's emulation of a stateful CPUID 0x2.0 has likely been dead code from the day it was introduced. This is backed up by commit 0fdf8e59faa5c ("KVM: Fix cpuid iteration on multiple leaves per eac"), which shows that the stateful iteration code was completely broken when it was introduced by commit 0771671749b59 ("KVM: Enhance guest cpuid management"), i.e. not actually tested. Annotate all stateful code paths as "unlikely", but defer its removal to a future patch to simplify reinstating the code if by some miracle there is someone running KVM on a CPU with a stateful CPUID 0x2. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 31 +++++++++++++++++++++---------- 1 file changed, 21 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index cce78723376c..870995d13644 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -408,9 +408,6 @@ static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array, &entry->eax, &entry->ebx, &entry->ecx, &entry->edx); switch (function) { - case 2: - entry->flags |= KVM_CPUID_FLAG_STATEFUL_FUNC; - break; case 4: case 7: case 0xb: @@ -486,17 +483,31 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) * it since we emulate x2apic in software */ cpuid_entry_set(entry, X86_FEATURE_X2APIC); break; - /* function 2 entries are STATEFUL. That is, repeated cpuid commands - * may return different values. This forces us to get_cpu() before - * issuing the first command, and also to emulate this annoying behavior - * in kvm_emulate_cpuid() using KVM_CPUID_FLAG_STATE_READ_NEXT */ case 2: + /* + * On ancient CPUs, function 2 entries are STATEFUL. That is, + * CPUID(function=2, index=0) may return different results each + * time, with the least-significant byte in EAX enumerating the + * number of times software should do CPUID(2, 0). + * + * Modern CPUs (quite likely every CPU KVM has *ever* run on) + * are less idiotic. Intel's SDM states that EAX & 0xff "will + * always return 01H. Software should ignore this value and not + * interpret it as an informational descriptor", while AMD's + * APM states that CPUID(2) is reserved. + */ + max_idx = entry->eax & 0xff; + if (likely(max_idx <= 1)) + break; + + entry->flags |= KVM_CPUID_FLAG_STATEFUL_FUNC; entry->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT; - for (i = 1, max_idx = entry->eax & 0xff; i < max_idx; ++i) { + for (i = 1; i < max_idx; ++i) { entry = do_host_cpuid(array, function, 0); if (!entry) goto out; + entry->flags |= KVM_CPUID_FLAG_STATEFUL_FUNC; } break; /* functions 4 and 0x8000001d have additional index. */ @@ -909,7 +920,7 @@ static int is_matching_cpuid_entry(struct kvm_cpuid_entry2 *e, return 0; if ((e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) && e->index != index) return 0; - if ((e->flags & KVM_CPUID_FLAG_STATEFUL_FUNC) && + if (unlikely(e->flags & KVM_CPUID_FLAG_STATEFUL_FUNC) && !(e->flags & KVM_CPUID_FLAG_STATE_READ_NEXT)) return 0; return 1; @@ -926,7 +937,7 @@ struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, e = &vcpu->arch.cpuid_entries[i]; if (is_matching_cpuid_entry(e, function, index)) { - if (e->flags & KVM_CPUID_FLAG_STATEFUL_FUNC) + if (unlikely(e->flags & KVM_CPUID_FLAG_STATEFUL_FUNC)) move_to_next_stateful_cpuid_entry(vcpu, i); best = e; break; -- cgit From 7ff6c0350315248bb58b074c90c4683f0e415669 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:51 -0800 Subject: KVM: x86: Remove stateful CPUID handling Remove the code for handling stateful CPUID 0x2 and mark the associated flags as deprecated. WARN if host CPUID 0x2.0.AL > 1, i.e. if by some miracle a host with stateful CPUID 0x2 is encountered. No known CPU exists that supports hardware accelerated virtualization _and_ a stateful CPUID 0x2. Barring an extremely contrived nested virtualization scenario, stateful CPUID support is dead code. Suggested-by: Vitaly Kuznetsov Suggested-by: Paolo Bonzini Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 73 ++++++++++------------------------------------------ 1 file changed, 13 insertions(+), 60 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 870995d13644..d4ada8f604a9 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -490,25 +490,16 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) * time, with the least-significant byte in EAX enumerating the * number of times software should do CPUID(2, 0). * - * Modern CPUs (quite likely every CPU KVM has *ever* run on) - * are less idiotic. Intel's SDM states that EAX & 0xff "will - * always return 01H. Software should ignore this value and not + * Modern CPUs, i.e. every CPU KVM has *ever* run on are less + * idiotic. Intel's SDM states that EAX & 0xff "will always + * return 01H. Software should ignore this value and not * interpret it as an informational descriptor", while AMD's * APM states that CPUID(2) is reserved. + * + * WARN if a frankenstein CPU that supports virtualization and + * a stateful CPUID.0x2 is encountered. */ - max_idx = entry->eax & 0xff; - if (likely(max_idx <= 1)) - break; - - entry->flags |= KVM_CPUID_FLAG_STATEFUL_FUNC; - entry->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT; - - for (i = 1; i < max_idx; ++i) { - entry = do_host_cpuid(array, function, 0); - if (!entry) - goto out; - entry->flags |= KVM_CPUID_FLAG_STATEFUL_FUNC; - } + WARN_ON_ONCE((entry->eax & 0xff) > 1); break; /* functions 4 and 0x8000001d have additional index. */ case 4: @@ -892,58 +883,20 @@ out_free: return r; } -static int move_to_next_stateful_cpuid_entry(struct kvm_vcpu *vcpu, int i) -{ - struct kvm_cpuid_entry2 *e = &vcpu->arch.cpuid_entries[i]; - struct kvm_cpuid_entry2 *ej; - int j = i; - int nent = vcpu->arch.cpuid_nent; - - e->flags &= ~KVM_CPUID_FLAG_STATE_READ_NEXT; - /* when no next entry is found, the current entry[i] is reselected */ - do { - j = (j + 1) % nent; - ej = &vcpu->arch.cpuid_entries[j]; - } while (ej->function != e->function); - - ej->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT; - - return j; -} - -/* find an entry with matching function, matching index (if needed), and that - * should be read next (if it's stateful) */ -static int is_matching_cpuid_entry(struct kvm_cpuid_entry2 *e, - u32 function, u32 index) -{ - if (e->function != function) - return 0; - if ((e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) && e->index != index) - return 0; - if (unlikely(e->flags & KVM_CPUID_FLAG_STATEFUL_FUNC) && - !(e->flags & KVM_CPUID_FLAG_STATE_READ_NEXT)) - return 0; - return 1; -} - struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, u32 function, u32 index) { + struct kvm_cpuid_entry2 *e; int i; - struct kvm_cpuid_entry2 *best = NULL; for (i = 0; i < vcpu->arch.cpuid_nent; ++i) { - struct kvm_cpuid_entry2 *e; - e = &vcpu->arch.cpuid_entries[i]; - if (is_matching_cpuid_entry(e, function, index)) { - if (unlikely(e->flags & KVM_CPUID_FLAG_STATEFUL_FUNC)) - move_to_next_stateful_cpuid_entry(vcpu, i); - best = e; - break; - } + + if (e->function == function && (e->index == index || + !(e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX))) + return e; } - return best; + return NULL; } EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry); -- cgit From d8577a4c238f8bd3089bfb428fece97f14eaabad Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:52 -0800 Subject: KVM: x86: Do host CPUID at load time to mask KVM cpu caps Mask kvm_cpu_caps based on host CPUID in preparation for overriding the CPUID results during KVM_GET_SUPPORTED_CPUID instead of doing the masking at runtime. Note, masking may or may not be necessary, e.g. the kernel rarely, if ever, sets real CPUID bits that are not supported by hardware. But, the code is cheap and only runs once at load, so an abundance of caution is warranted. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index d4ada8f604a9..80fccfd937bb 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -263,8 +263,16 @@ out: static __always_inline void kvm_cpu_cap_mask(enum cpuid_leafs leaf, u32 mask) { + const struct cpuid_reg cpuid = x86_feature_cpuid(leaf * 32); + struct kvm_cpuid_entry2 entry; + reverse_cpuid_check(leaf); kvm_cpu_caps[leaf] &= mask; + + cpuid_count(cpuid.function, cpuid.index, + &entry.eax, &entry.ebx, &entry.ecx, &entry.edx); + + kvm_cpu_caps[leaf] &= *__cpuid_entry_get_reg(&entry, &cpuid); } void kvm_set_cpu_caps(void) -- cgit From bd79199990477bf0c316b32bfcbd9862dc0f08ec Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:53 -0800 Subject: KVM: x86: Override host CPUID results with kvm_cpu_caps Override CPUID entries with kvm_cpu_caps during KVM_GET_SUPPORTED_CPUID instead of masking the host CPUID result, which is redundant now that the host CPUID is incorporated into kvm_cpu_caps at runtime. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 25 +++++++++++-------------- arch/x86/kvm/cpuid.h | 6 +++--- arch/x86/kvm/svm.c | 3 +-- arch/x86/kvm/vmx/vmx.c | 12 ------------ 4 files changed, 15 insertions(+), 31 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 80fccfd937bb..493ea0e29450 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -485,8 +485,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->eax = min(entry->eax, 0x1fU); break; case 1: - cpuid_entry_mask(entry, CPUID_1_EDX); - cpuid_entry_mask(entry, CPUID_1_ECX); + cpuid_entry_override(entry, CPUID_1_EDX); + cpuid_entry_override(entry, CPUID_1_ECX); /* we support x2apic emulation even if host does not support * it since we emulate x2apic in software */ cpuid_entry_set(entry, X86_FEATURE_X2APIC); @@ -531,9 +531,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) /* function 7 has additional index. */ case 7: entry->eax = min(entry->eax, 1u); - cpuid_entry_mask(entry, CPUID_7_0_EBX); - cpuid_entry_mask(entry, CPUID_7_ECX); - cpuid_entry_mask(entry, CPUID_7_EDX); + cpuid_entry_override(entry, CPUID_7_0_EBX); + cpuid_entry_override(entry, CPUID_7_ECX); + cpuid_entry_override(entry, CPUID_7_EDX); /* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */ cpuid_entry_set(entry, X86_FEATURE_TSC_ADJUST); @@ -552,7 +552,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) if (!entry) goto out; - cpuid_entry_mask(entry, CPUID_7_1_EAX); + cpuid_entry_override(entry, CPUID_7_1_EAX); entry->ebx = 0; entry->ecx = 0; entry->edx = 0; @@ -618,7 +618,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) if (!entry) goto out; - cpuid_entry_mask(entry, CPUID_D_1_EAX); + cpuid_entry_override(entry, CPUID_D_1_EAX); if (entry->eax & (F(XSAVES)|F(XSAVEC))) entry->ebx = xstate_required_size(supported_xcr0, true); else @@ -697,11 +697,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->eax = min(entry->eax, 0x8000001f); break; case 0x80000001: - cpuid_entry_mask(entry, CPUID_8000_0001_EDX); - /* Add it manually because it may not be in host CPUID. */ - if (!tdp_enabled) - cpuid_entry_set(entry, X86_FEATURE_GBPAGES); - cpuid_entry_mask(entry, CPUID_8000_0001_ECX); + cpuid_entry_override(entry, CPUID_8000_0001_EDX); + cpuid_entry_override(entry, CPUID_8000_0001_ECX); break; case 0x80000007: /* Advanced power management */ /* invariant TSC is CPUID.80000007H:EDX[8] */ @@ -719,7 +716,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) g_phys_as = phys_as; entry->eax = g_phys_as | (virt_as << 8); entry->edx = 0; - cpuid_entry_mask(entry, CPUID_8000_0008_EBX); + cpuid_entry_override(entry, CPUID_8000_0008_EBX); /* * AMD has separate bits for each SPEC_CTRL bit. * arch/x86/kernel/cpu/bugs.c is kind enough to @@ -761,7 +758,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->eax = min(entry->eax, 0xC0000004); break; case 0xC0000001: - cpuid_entry_mask(entry, CPUID_C000_0001_EDX); + cpuid_entry_override(entry, CPUID_C000_0001_EDX); break; case 3: /* Processor serial number */ case 5: /* MONITOR/MWAIT */ diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index e3dc0f02ad5c..e9a3277ce256 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -170,13 +170,13 @@ static __always_inline void cpuid_entry_change(struct kvm_cpuid_entry2 *entry, *reg &= ~__feature_bit(x86_feature); } -static __always_inline void cpuid_entry_mask(struct kvm_cpuid_entry2 *entry, - enum cpuid_leafs leaf) +static __always_inline void cpuid_entry_override(struct kvm_cpuid_entry2 *entry, + enum cpuid_leafs leaf) { u32 *reg = cpuid_entry_get_reg(entry, leaf * 32); BUILD_BUG_ON(leaf >= ARRAY_SIZE(kvm_cpu_caps)); - *reg &= kvm_cpu_caps[leaf]; + *reg = kvm_cpu_caps[leaf]; } static __always_inline u32 *guest_cpuid_get_register(struct kvm_vcpu *vcpu, diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 26d2e170e9fd..e897752524f2 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6072,8 +6072,7 @@ static void svm_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) entry->ebx = 8; /* Lets support 8 ASIDs in case we add proper ASID emulation to nested SVM */ entry->ecx = 0; /* Reserved */ - /* Note, 0x8000000A.EDX is managed via kvm_cpu_caps. */; - cpuid_entry_mask(entry, CPUID_8000_000A_EDX); + cpuid_entry_override(entry, CPUID_8000_000A_EDX); break; } } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index bea247fabca0..4ba2c3e9fbcf 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7130,18 +7130,6 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu) */ static void vmx_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) { - switch (entry->function) { - case 0x7: - /* - * UMIP needs to be manually set even though vmx_set_cpu_caps() - * also sets UMIP since do_host_cpuid() will drop it. - */ - if (vmx_umip_emulated()) - cpuid_entry_set(entry, X86_FEATURE_UMIP); - break; - default: - break; - } } static __init void vmx_set_cpu_caps(void) -- cgit From 93c380e7b528882396ca463971012222bad7d82e Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:54 -0800 Subject: KVM: x86: Set emulated/transmuted feature bits via kvm_cpu_caps Set emulated and transmuted (set based on other features) feature bits via kvm_cpu_caps now that the CPUID output for KVM_GET_SUPPORTED_CPUID is direcly overidden with kvm_cpu_caps. Note, VMX emulation of UMIP already sets kvm_cpu_caps. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 72 +++++++++++++++++++++++++------------------------- arch/x86/kvm/svm.c | 18 ++++--------- arch/x86/kvm/vmx/vmx.c | 5 ---- 3 files changed, 41 insertions(+), 54 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 493ea0e29450..dedf30fedbcb 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -306,6 +306,8 @@ void kvm_set_cpu_caps(void) 0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) | F(F16C) | F(RDRAND) ); + /* KVM emulates x2apic in software irrespective of host support. */ + kvm_cpu_cap_set(X86_FEATURE_X2APIC); kvm_cpu_cap_mask(CPUID_1_EDX, F(FPU) | F(VME) | F(DE) | F(PSE) | @@ -342,6 +344,17 @@ void kvm_set_cpu_caps(void) F(MD_CLEAR) ); + /* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */ + kvm_cpu_cap_set(X86_FEATURE_TSC_ADJUST); + kvm_cpu_cap_set(X86_FEATURE_ARCH_CAPABILITIES); + + if (boot_cpu_has(X86_FEATURE_IBPB) && boot_cpu_has(X86_FEATURE_IBRS)) + kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL); + if (boot_cpu_has(X86_FEATURE_STIBP)) + kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP); + if (boot_cpu_has(X86_FEATURE_AMD_SSBD)) + kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD); + kvm_cpu_cap_mask(CPUID_7_1_EAX, F(AVX512_BF16) ); @@ -378,6 +391,29 @@ void kvm_set_cpu_caps(void) F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON) ); + /* + * AMD has separate bits for each SPEC_CTRL bit. + * arch/x86/kernel/cpu/bugs.c is kind enough to + * record that in cpufeatures so use them. + */ + if (boot_cpu_has(X86_FEATURE_IBPB)) + kvm_cpu_cap_set(X86_FEATURE_AMD_IBPB); + if (boot_cpu_has(X86_FEATURE_IBRS)) + kvm_cpu_cap_set(X86_FEATURE_AMD_IBRS); + if (boot_cpu_has(X86_FEATURE_STIBP)) + kvm_cpu_cap_set(X86_FEATURE_AMD_STIBP); + if (boot_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD)) + kvm_cpu_cap_set(X86_FEATURE_AMD_SSBD); + if (!boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS)) + kvm_cpu_cap_set(X86_FEATURE_AMD_SSB_NO); + /* + * The preference is to use SPEC CTRL MSR instead of the + * VIRT_SPEC MSR. + */ + if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) && + !boot_cpu_has(X86_FEATURE_AMD_SSBD)) + kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD); + /* * Hide all SVM features by default, SVM will set the cap bits for * features it emulates and/or exposes for L1. @@ -487,9 +523,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) case 1: cpuid_entry_override(entry, CPUID_1_EDX); cpuid_entry_override(entry, CPUID_1_ECX); - /* we support x2apic emulation even if host does not support - * it since we emulate x2apic in software */ - cpuid_entry_set(entry, X86_FEATURE_X2APIC); break; case 2: /* @@ -535,17 +568,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) cpuid_entry_override(entry, CPUID_7_ECX); cpuid_entry_override(entry, CPUID_7_EDX); - /* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */ - cpuid_entry_set(entry, X86_FEATURE_TSC_ADJUST); - cpuid_entry_set(entry, X86_FEATURE_ARCH_CAPABILITIES); - - if (boot_cpu_has(X86_FEATURE_IBPB) && boot_cpu_has(X86_FEATURE_IBRS)) - cpuid_entry_set(entry, X86_FEATURE_SPEC_CTRL); - if (boot_cpu_has(X86_FEATURE_STIBP)) - cpuid_entry_set(entry, X86_FEATURE_INTEL_STIBP); - if (boot_cpu_has(X86_FEATURE_AMD_SSBD)) - cpuid_entry_set(entry, X86_FEATURE_SPEC_CTRL_SSBD); - /* KVM only supports 0x7.0 and 0x7.1, capped above via min(). */ if (entry->eax == 1) { entry = do_host_cpuid(array, function, 1); @@ -717,28 +739,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->eax = g_phys_as | (virt_as << 8); entry->edx = 0; cpuid_entry_override(entry, CPUID_8000_0008_EBX); - /* - * AMD has separate bits for each SPEC_CTRL bit. - * arch/x86/kernel/cpu/bugs.c is kind enough to - * record that in cpufeatures so use them. - */ - if (boot_cpu_has(X86_FEATURE_IBPB)) - cpuid_entry_set(entry, X86_FEATURE_AMD_IBPB); - if (boot_cpu_has(X86_FEATURE_IBRS)) - cpuid_entry_set(entry, X86_FEATURE_AMD_IBRS); - if (boot_cpu_has(X86_FEATURE_STIBP)) - cpuid_entry_set(entry, X86_FEATURE_AMD_STIBP); - if (boot_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD)) - cpuid_entry_set(entry, X86_FEATURE_AMD_SSBD); - if (!boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS)) - cpuid_entry_set(entry, X86_FEATURE_AMD_SSB_NO); - /* - * The preference is to use SPEC CTRL MSR instead of the - * VIRT_SPEC MSR. - */ - if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) && - !boot_cpu_has(X86_FEATURE_AMD_SSBD)) - cpuid_entry_set(entry, X86_FEATURE_VIRT_SSBD); break; } case 0x80000019: diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index e897752524f2..0236f2f98cbd 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1375,6 +1375,11 @@ static __init void svm_set_cpu_caps(void) if (nested) kvm_cpu_cap_set(X86_FEATURE_SVM); + /* CPUID 0x80000008 */ + if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) || + boot_cpu_has(X86_FEATURE_AMD_SSBD)) + kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD); + /* CPUID 0x8000000A */ /* Support next_rip if host supports it */ kvm_cpu_cap_check_and_set(X86_FEATURE_NRIPS); @@ -6051,22 +6056,9 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu) APICV_INHIBIT_REASON_NESTED); } -/* - * Vendor specific emulation must be handled via ->set_supported_cpuid(), not - * svm_set_cpu_caps(), as capabilities configured during hardware_setup() are - * masked against hardware/kernel support, i.e. they'd be lost. - * - * Note, setting a flag based on a *different* feature, e.g. setting VIRT_SSBD - * if LS_CFG_SSBD or AMD_SSBD is supported, is effectively emulation. - */ static void svm_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) { switch (entry->function) { - case 0x80000008: - if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) || - boot_cpu_has(X86_FEATURE_AMD_SSBD)) - cpuid_entry_set(entry, X86_FEATURE_VIRT_SSBD); - break; case 0x8000000A: entry->eax = 1; /* SVM revision 1 */ entry->ebx = 8; /* Lets support 8 ASIDs in case we add proper diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 4ba2c3e9fbcf..fe7b4ae867d8 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7123,11 +7123,6 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu) } } -/* - * Vendor specific emulation must be handled via ->set_supported_cpuid(), not - * vmx_set_cpu_caps(), as capabilities configured during hardware_setup() are - * masked against hardware/kernel support, i.e. they'd be lost. - */ static void vmx_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) { } -- cgit From dd69cc2542f728179d5a0ae1c972813f88ab14aa Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:55 -0800 Subject: KVM: x86: Use kvm_cpu_caps to detect Intel PT support Check for Intel PT using kvm_cpu_cap_has() to pave the way toward eliminating ->pt_supported(). No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index dedf30fedbcb..214bcb3a5c1e 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -504,7 +504,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) { struct kvm_cpuid_entry2 *entry; int r, i, max_idx; - unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? F(INTEL_PT) : 0; /* all calls to cpuid_count() should be made on the same cpu */ get_cpu(); @@ -675,7 +674,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) break; /* Intel PT */ case 0x14: - if (!f_intel_pt) { + if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) { entry->eax = entry->ebx = entry->ecx = entry->edx = 0; break; } -- cgit From 7c7f9548108926dbf6776b1cfe748f7c617cfd36 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:56 -0800 Subject: KVM: x86: Do kvm_cpuid_array capacity checks in terminal functions Perform the capacity checks on the userspace provided kvm_cpuid_array in the lower __do_cpuid_func() and __do_cpuid_func_emulated(). Pre-checking the array in do_cpuid_func() no longer adds value now that __do_cpuid_func() has been trimmed down to size, i.e. doesn't invoke a big pile of retpolined functions before doing anything useful. Note, __do_cpuid_func() already checks the array capacity via do_host_cpuid(), "moving" the check to __do_cpuid_func() simply means removing a WARN_ON(). Suggested-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 214bcb3a5c1e..1934f5d6b731 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -473,8 +473,12 @@ static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array, static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func) { - struct kvm_cpuid_entry2 *entry = &array->entries[array->nent]; + struct kvm_cpuid_entry2 *entry; + + if (array->nent >= array->maxnent) + return -E2BIG; + entry = &array->entries[array->nent]; entry->function = func; entry->index = 0; entry->flags = 0; @@ -511,7 +515,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) r = -E2BIG; entry = do_host_cpuid(array, function, 0); - if (WARN_ON(!entry)) + if (!entry) goto out; switch (function) { @@ -782,9 +786,6 @@ out: static int do_cpuid_func(struct kvm_cpuid_array *array, u32 func, unsigned int type) { - if (array->nent >= array->maxnent) - return -E2BIG; - if (type == KVM_GET_EMULATED_CPUID) return __do_cpuid_func_emulated(array, func); -- cgit From 139085101f8500b09c681b1e52c3839df681a0d2 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:57 -0800 Subject: KVM: x86: Use KVM cpu caps to detect MSR_TSC_AUX virt support Check for MSR_TSC_AUX virtualization via kvm_cpu_cap_has() and drop ->rdtscp_supported(). Note, vmx_rdtscp_supported() needs to hang around a tiny bit longer due other usage in VMX code. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 1 - arch/x86/kvm/svm.c | 6 ------ arch/x86/kvm/vmx/vmx.c | 3 --- arch/x86/kvm/x86.c | 2 +- 4 files changed, 1 insertion(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c46373016574..00a1be55e90a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1156,7 +1156,6 @@ struct kvm_x86_ops { int (*get_tdp_level)(struct kvm_vcpu *vcpu); u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio); int (*get_lpage_level)(void); - bool (*rdtscp_supported)(void); void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 0236f2f98cbd..f802d9c196e9 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6074,11 +6074,6 @@ static int svm_get_lpage_level(void) return PT_PDPE_LEVEL; } -static bool svm_rdtscp_supported(void) -{ - return boot_cpu_has(X86_FEATURE_RDTSCP); -} - static bool svm_pt_supported(void) { return false; @@ -7445,7 +7440,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .cpuid_update = svm_cpuid_update, - .rdtscp_supported = svm_rdtscp_supported, .pt_supported = svm_pt_supported, .set_supported_cpuid = svm_set_supported_cpuid, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index fe7b4ae867d8..80e8b8423104 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7939,9 +7939,6 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .get_lpage_level = vmx_get_lpage_level, .cpuid_update = vmx_cpuid_update, - - .rdtscp_supported = vmx_rdtscp_supported, - .set_supported_cpuid = vmx_set_supported_cpuid, .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 51a49a6ed070..fd0889f2f37f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5215,7 +5215,7 @@ static void kvm_init_msr_list(void) continue; break; case MSR_TSC_AUX: - if (!kvm_x86_ops->rdtscp_supported()) + if (!kvm_cpu_cap_has(X86_FEATURE_RDTSCP)) continue; break; case MSR_IA32_RTIT_CTL: -- cgit From a7a200eb4c698b70f60a1b8e4e2c6e1c3fcb982e Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:58 -0800 Subject: KVM: VMX: Directly use VMX capabilities helper to detect RDTSCP support Use cpu_has_vmx_rdtscp() directly when computing secondary exec controls and drop the now defunct vmx_rdtscp_supported(). No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 80e8b8423104..e5aeb6f038e6 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -1687,11 +1687,6 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu) vmx_clear_hlt(vcpu); } -static bool vmx_rdtscp_supported(void) -{ - return cpu_has_vmx_rdtscp(); -} - /* * Swap MSR entry in host/guest MSR entry array. */ @@ -4073,7 +4068,7 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx) } } - if (vmx_rdtscp_supported()) { + if (cpu_has_vmx_rdtscp()) { bool rdtscp_enabled = guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP); if (!rdtscp_enabled) exec_control &= ~SECONDARY_EXEC_RDTSCP; -- cgit From 7b874c26a62487acaf2e7e179715991c70db25db Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:56:59 -0800 Subject: KVM: x86: Check for Intel PT MSR virtualization using KVM cpu caps Use kvm_cpu_cap_has() to check for Intel PT when processing the list of virtualized MSRs to pave the way toward removing ->pt_supported(). No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fd0889f2f37f..f3fac68f0612 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5220,23 +5220,23 @@ static void kvm_init_msr_list(void) break; case MSR_IA32_RTIT_CTL: case MSR_IA32_RTIT_STATUS: - if (!kvm_x86_ops->pt_supported()) + if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) continue; break; case MSR_IA32_RTIT_CR3_MATCH: - if (!kvm_x86_ops->pt_supported() || + if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT) || !intel_pt_validate_hw_cap(PT_CAP_cr3_filtering)) continue; break; case MSR_IA32_RTIT_OUTPUT_BASE: case MSR_IA32_RTIT_OUTPUT_MASK: - if (!kvm_x86_ops->pt_supported() || + if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT) || (!intel_pt_validate_hw_cap(PT_CAP_topa_output) && !intel_pt_validate_hw_cap(PT_CAP_single_range_output))) continue; break; case MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B: { - if (!kvm_x86_ops->pt_supported() || + if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT) || msrs_to_save_all[i] - MSR_IA32_RTIT_ADDR0_A >= intel_pt_validate_hw_cap(PT_CAP_num_address_ranges) * 2) continue; -- cgit From a1bead2abaa162e5e67ad258a06c9d71dddad00d Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:57:00 -0800 Subject: KVM: VMX: Directly query Intel PT mode when refreshing PMUs Use vmx_pt_mode_is_host_guest() in intel_pmu_refresh() instead of bouncing through kvm_x86_ops->pt_supported, and remove ->pt_supported() as the PMU code was the last remaining user. Opportunistically clean up the wording of a comment that referenced kvm_x86_ops->pt_supported(). No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 2 -- arch/x86/kvm/svm.c | 7 ------- arch/x86/kvm/vmx/pmu_intel.c | 2 +- arch/x86/kvm/vmx/vmx.c | 6 ------ arch/x86/kvm/x86.c | 7 +++---- 5 files changed, 4 insertions(+), 20 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 00a1be55e90a..143d0ce493d5 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1176,8 +1176,6 @@ struct kvm_x86_ops { void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu, enum exit_fastpath_completion *exit_fastpath); - bool (*pt_supported)(void); - int (*check_nested_events)(struct kvm_vcpu *vcpu); void (*request_immediate_exit)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index f802d9c196e9..e0be6d0bca9e 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6074,11 +6074,6 @@ static int svm_get_lpage_level(void) return PT_PDPE_LEVEL; } -static bool svm_pt_supported(void) -{ - return false; -} - static bool svm_has_wbinvd_exit(void) { return true; @@ -7440,8 +7435,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .cpuid_update = svm_cpuid_update, - .pt_supported = svm_pt_supported, - .set_supported_cpuid = svm_set_supported_cpuid, .has_wbinvd_exit = svm_has_wbinvd_exit, diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index e933541751fb..7c857737b438 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -335,7 +335,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) pmu->global_ovf_ctrl_mask = pmu->global_ctrl_mask & ~(MSR_CORE_PERF_GLOBAL_OVF_CTRL_OVF_BUF | MSR_CORE_PERF_GLOBAL_OVF_CTRL_COND_CHGD); - if (kvm_x86_ops->pt_supported()) + if (vmx_pt_mode_is_host_guest()) pmu->global_ovf_ctrl_mask &= ~MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e5aeb6f038e6..75f61fb9c3b2 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6306,11 +6306,6 @@ static bool vmx_has_emulated_msr(int index) } } -static bool vmx_pt_supported(void) -{ - return vmx_pt_mode_is_host_guest(); -} - static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx) { u32 exit_intr_info; @@ -7945,7 +7940,6 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .check_intercept = vmx_check_intercept, .handle_exit_irqoff = vmx_handle_exit_irqoff, - .pt_supported = vmx_pt_supported, .request_immediate_exit = vmx_request_immediate_exit, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f3fac68f0612..5be4961d49dd 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2820,10 +2820,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES)) return 1; /* - * We do support PT if kvm_x86_ops->pt_supported(), but we do - * not support IA32_XSS[bit 8]. Guests will have to use - * RDMSR/WRMSR rather than XSAVES/XRSTORS to save/restore PT - * MSRs. + * KVM supports exposing PT to the guest, but does not support + * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than + * XSAVES/XRSTORS to save/restore PT MSRs. */ if (data != 0) return 1; -- cgit From 213e0e1f500b5f035d4c8441025a74478818a64f Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:57:01 -0800 Subject: KVM: SVM: Refactor logging of NPT enabled/disabled Tweak SVM's logging of NPT enabled/disabled to handle the logging in a single pr_info() in preparation for merging kvm_enable_tdp() and kvm_disable_tdp() into a single function. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index e0be6d0bca9e..6b7b0f8caa52 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1455,16 +1455,14 @@ static __init int svm_hardware_setup(void) if (!boot_cpu_has(X86_FEATURE_NPT)) npt_enabled = false; - if (npt_enabled && !npt) { - printk(KERN_INFO "kvm: Nested Paging disabled\n"); + if (npt_enabled && !npt) npt_enabled = false; - } - if (npt_enabled) { - printk(KERN_INFO "kvm: Nested Paging enabled\n"); + if (npt_enabled) kvm_enable_tdp(); - } else + else kvm_disable_tdp(); + pr_info("kvm: Nested Paging %sabled\n", npt_enabled ? "en" : "dis"); if (nrips) { if (!boot_cpu_has(X86_FEATURE_NRIPS)) -- cgit From bde7723559586d6afd18fa1717fc143531d4c77d Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:57:02 -0800 Subject: KVM: x86/mmu: Merge kvm_{enable,disable}_tdp() into a common function Combine kvm_enable_tdp() and kvm_disable_tdp() into a single function, kvm_configure_mmu(), in preparation for doing additional configuration during hardware setup. And because having separate helpers is silly. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 3 +-- arch/x86/kvm/mmu/mmu.c | 13 +++---------- arch/x86/kvm/svm.c | 5 +---- arch/x86/kvm/vmx/vmx.c | 4 +--- 4 files changed, 6 insertions(+), 19 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 143d0ce493d5..c9b721140f59 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1510,8 +1510,7 @@ void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva); void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid); void kvm_mmu_new_cr3(struct kvm_vcpu *vcpu, gpa_t new_cr3, bool skip_tlb_flush); -void kvm_enable_tdp(void); -void kvm_disable_tdp(void); +void kvm_configure_mmu(bool enable_tdp); static inline gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, struct x86_exception *exception) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 050da022f96c..724bc8051117 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5559,18 +5559,11 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid) } EXPORT_SYMBOL_GPL(kvm_mmu_invpcid_gva); -void kvm_enable_tdp(void) +void kvm_configure_mmu(bool enable_tdp) { - tdp_enabled = true; + tdp_enabled = enable_tdp; } -EXPORT_SYMBOL_GPL(kvm_enable_tdp); - -void kvm_disable_tdp(void) -{ - tdp_enabled = false; -} -EXPORT_SYMBOL_GPL(kvm_disable_tdp); - +EXPORT_SYMBOL_GPL(kvm_configure_mmu); /* The return value indicates if tlb flush on all vcpus is needed. */ typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head *rmap_head); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 6b7b0f8caa52..422ee02afe8c 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1458,10 +1458,7 @@ static __init int svm_hardware_setup(void) if (npt_enabled && !npt) npt_enabled = false; - if (npt_enabled) - kvm_enable_tdp(); - else - kvm_disable_tdp(); + kvm_configure_mmu(npt_enabled); pr_info("kvm: Nested Paging %sabled\n", npt_enabled ? "en" : "dis"); if (nrips) { diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 75f61fb9c3b2..b4ae305646fe 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5320,7 +5320,6 @@ static void vmx_enable_tdp(void) VMX_EPT_RWX_MASK, 0ull); ept_set_mmio_spte_mask(); - kvm_enable_tdp(); } /* @@ -7747,8 +7746,7 @@ static __init int hardware_setup(void) if (enable_ept) vmx_enable_tdp(); - else - kvm_disable_tdp(); + kvm_configure_mmu(enable_ept); /* * Only enable PML when hardware supports PML feature, and both EPT -- cgit From 703c335d06934401763863cf24fee61a13de055b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:57:03 -0800 Subject: KVM: x86/mmu: Configure max page level during hardware setup Configure the max page level during hardware setup to avoid a retpoline in the page fault handler. Drop ->get_lpage_level() as the page fault handler was the last user. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 3 +-- arch/x86/kvm/mmu/mmu.c | 20 ++++++++++++++++++-- arch/x86/kvm/svm.c | 9 +-------- arch/x86/kvm/vmx/vmx.c | 24 +++++++++++------------- 4 files changed, 31 insertions(+), 25 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c9b721140f59..c817987c599e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1155,7 +1155,6 @@ struct kvm_x86_ops { int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr); int (*get_tdp_level)(struct kvm_vcpu *vcpu); u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio); - int (*get_lpage_level)(void); void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3); @@ -1510,7 +1509,7 @@ void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva); void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid); void kvm_mmu_new_cr3(struct kvm_vcpu *vcpu, gpa_t new_cr3, bool skip_tlb_flush); -void kvm_configure_mmu(bool enable_tdp); +void kvm_configure_mmu(bool enable_tdp, int tdp_page_level); static inline gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, struct x86_exception *exception) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 724bc8051117..554546948e87 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -87,6 +87,8 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_ratio, "uint"); */ bool tdp_enabled = false; +static int max_page_level __read_mostly; + enum { AUDIT_PRE_PAGE_FAULT, AUDIT_POST_PAGE_FAULT, @@ -3282,7 +3284,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn, if (!slot) return PT_PAGE_TABLE_LEVEL; - max_level = min(max_level, kvm_x86_ops->get_lpage_level()); + max_level = min(max_level, max_page_level); for ( ; max_level > PT_PAGE_TABLE_LEVEL; max_level--) { linfo = lpage_info_slot(gfn, slot, max_level); if (!linfo->disallow_lpage) @@ -5559,9 +5561,23 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid) } EXPORT_SYMBOL_GPL(kvm_mmu_invpcid_gva); -void kvm_configure_mmu(bool enable_tdp) +void kvm_configure_mmu(bool enable_tdp, int tdp_page_level) { tdp_enabled = enable_tdp; + + /* + * max_page_level reflects the capabilities of KVM's MMU irrespective + * of kernel support, e.g. KVM may be capable of using 1GB pages when + * the kernel is not. But, KVM never creates a page size greater than + * what is used by the kernel for any given HVA, i.e. the kernel's + * capabilities are ultimately consulted by kvm_mmu_hugepage_adjust(). + */ + if (tdp_enabled) + max_page_level = tdp_page_level; + else if (boot_cpu_has(X86_FEATURE_GBPAGES)) + max_page_level = PT_PDPE_LEVEL; + else + max_page_level = PT_DIRECTORY_LEVEL; } EXPORT_SYMBOL_GPL(kvm_configure_mmu); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 422ee02afe8c..5e3261ec8c59 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1458,7 +1458,7 @@ static __init int svm_hardware_setup(void) if (npt_enabled && !npt) npt_enabled = false; - kvm_configure_mmu(npt_enabled); + kvm_configure_mmu(npt_enabled, PT_PDPE_LEVEL); pr_info("kvm: Nested Paging %sabled\n", npt_enabled ? "en" : "dis"); if (nrips) { @@ -6064,11 +6064,6 @@ static void svm_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) } } -static int svm_get_lpage_level(void) -{ - return PT_PDPE_LEVEL; -} - static bool svm_has_wbinvd_exit(void) { return true; @@ -7426,8 +7421,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .get_exit_info = svm_get_exit_info, - .get_lpage_level = svm_get_lpage_level, - .cpuid_update = svm_cpuid_update, .set_supported_cpuid = svm_set_supported_cpuid, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index b4ae305646fe..066c97ceebbf 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6913,15 +6913,6 @@ exit: return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat; } -static int vmx_get_lpage_level(void) -{ - if (enable_ept && !cpu_has_vmx_ept_1g_page()) - return PT_DIRECTORY_LEVEL; - else - /* For shadow and EPT supported 1GB page */ - return PT_PDPE_LEVEL; -} - static void vmcs_set_secondary_exec_control(struct vcpu_vmx *vmx) { /* @@ -7653,7 +7644,7 @@ static __init int hardware_setup(void) { unsigned long host_bndcfgs; struct desc_ptr dt; - int r, i; + int r, i, ept_lpage_level; rdmsrl_safe(MSR_EFER, &host_efer); @@ -7746,7 +7737,16 @@ static __init int hardware_setup(void) if (enable_ept) vmx_enable_tdp(); - kvm_configure_mmu(enable_ept); + + if (!enable_ept) + ept_lpage_level = 0; + else if (cpu_has_vmx_ept_1g_page()) + ept_lpage_level = PT_PDPE_LEVEL; + else if (cpu_has_vmx_ept_2m_page()) + ept_lpage_level = PT_DIRECTORY_LEVEL; + else + ept_lpage_level = PT_PAGE_TABLE_LEVEL; + kvm_configure_mmu(enable_ept, ept_lpage_level); /* * Only enable PML when hardware supports PML feature, and both EPT @@ -7924,8 +7924,6 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .get_exit_info = vmx_get_exit_info, - .get_lpage_level = vmx_get_lpage_level, - .cpuid_update = vmx_cpuid_update, .set_supported_cpuid = vmx_set_supported_cpuid, -- cgit From e884b854ee18f17ba14d0221287286050e100749 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:57:04 -0800 Subject: KVM: x86: Don't propagate MMU lpage support to memslot.disallow_lpage Stop propagating MMU large page support into a memslot's disallow_lpage now that the MMU's max_page_level handles the scenario where VMX's EPT is enabled and EPT doesn't support 2M pages. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 066c97ceebbf..0c6a621a43df 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7702,9 +7702,6 @@ static __init int hardware_setup(void) if (!cpu_has_vmx_tpr_shadow()) kvm_x86_ops->update_cr8_intercept = NULL; - if (enable_ept && !cpu_has_vmx_ept_2m_page()) - kvm_disable_largepages(); - #if IS_ENABLED(CONFIG_HYPERV) if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH && enable_ept) { -- cgit From 600087b6146764999949b4a12ce5f7627602c33a Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:57:05 -0800 Subject: KVM: Drop largepages_enabled and its accessor/mutator Drop largepages_enabled, kvm_largepages_enabled() and kvm_disable_largepages() now that all users are gone. Note, largepages_enabled was an x86-only flag that got left in common KVM code when KVM gained support for multiple architectures. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 5be4961d49dd..935cd40cbae2 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9918,11 +9918,9 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot, ugfn = slot->userspace_addr >> PAGE_SHIFT; /* * If the gfn and userspace address are not aligned wrt each - * other, or if explicitly asked to, disable large page - * support for this slot + * other, disable large page support for this slot. */ - if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1) || - !kvm_largepages_enabled()) { + if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) { unsigned long j; for (j = 0; j < lpages; ++j) -- cgit From 91661989d17ccec17bca199e7cb1f463ba4e5b78 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:57:06 -0800 Subject: KVM: x86: Move VMX's host_efer to common x86 code Move host_efer to common x86 code and use it for CPUID's is_efer_nx() to avoid constantly re-reading the MSR. No functional change intended. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/cpuid.c | 5 +---- arch/x86/kvm/vmx/vmx.c | 3 --- arch/x86/kvm/vmx/vmx.h | 1 - arch/x86/kvm/x86.c | 5 +++++ 5 files changed, 8 insertions(+), 8 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c817987c599e..531b5a96df33 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1271,6 +1271,8 @@ struct kvm_arch_async_pf { bool direct_map; }; +extern u64 __read_mostly host_efer; + extern struct kvm_x86_ops *kvm_x86_ops; extern struct kmem_cache *x86_fpu_cache; diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 1934f5d6b731..5dd67b124a68 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -134,10 +134,7 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) static int is_efer_nx(void) { - unsigned long long efer = 0; - - rdmsrl_safe(MSR_EFER, &efer); - return efer & EFER_NX; + return host_efer & EFER_NX; } static void cpuid_fix_nx_cap(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 0c6a621a43df..3ee5f75dd1e1 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -433,7 +433,6 @@ static const struct kvm_vmx_segment_field { VMX_SEGMENT_FIELD(LDTR), }; -u64 host_efer; static unsigned long host_idt_base; /* @@ -7646,8 +7645,6 @@ static __init int hardware_setup(void) struct desc_ptr dt; int r, i, ept_lpage_level; - rdmsrl_safe(MSR_EFER, &host_efer); - store_idt(&dt); host_idt_base = dt.address; diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 9a51a3a77233..fc45bdb5a62f 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -12,7 +12,6 @@ #include "vmcs.h" extern const u32 vmx_msr_index[]; -extern u64 host_efer; extern u32 get_umwait_control_msr(void); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 935cd40cbae2..d2f1b4746903 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -186,6 +186,9 @@ static struct kvm_shared_msrs __percpu *shared_msrs; | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \ | XFEATURE_MASK_PKRU) +u64 __read_mostly host_efer; +EXPORT_SYMBOL_GPL(host_efer); + static u64 __read_mostly host_xss; struct kvm_stats_debugfs_item debugfs_entries[] = { @@ -9612,6 +9615,8 @@ int kvm_arch_hardware_setup(void) { int r; + rdmsrl_safe(MSR_EFER, &host_efer); + r = kvm_x86_ops->hardware_setup(); if (r != 0) return r; -- cgit From a50718cc3f43f12e6e33b098b5e2a9eb19f13158 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:57:07 -0800 Subject: KVM: nSVM: Expose SVM features to L1 iff nested is enabled Set SVM feature bits in KVM capabilities if and only if nested=true, KVM shouldn't advertise features that realistically can't be used. Use kvm_cpu_cap_has(X86_FEATURE_SVM) to indirectly query "nested" in svm_set_supported_cpuid() in anticipation of moving CPUID 0x8000000A adjustments into common x86 code. Suggested-by: Paolo Bonzini Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 5e3261ec8c59..76a480a37f1d 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1371,21 +1371,21 @@ static __init void svm_set_cpu_caps(void) { kvm_set_cpu_caps(); - /* CPUID 0x80000001 */ - if (nested) + /* CPUID 0x80000001 and 0x8000000A (SVM features) */ + if (nested) { kvm_cpu_cap_set(X86_FEATURE_SVM); + if (boot_cpu_has(X86_FEATURE_NRIPS)) + kvm_cpu_cap_set(X86_FEATURE_NRIPS); + + if (npt_enabled) + kvm_cpu_cap_set(X86_FEATURE_NPT); + } + /* CPUID 0x80000008 */ if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) || boot_cpu_has(X86_FEATURE_AMD_SSBD)) kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD); - - /* CPUID 0x8000000A */ - /* Support next_rip if host supports it */ - kvm_cpu_cap_check_and_set(X86_FEATURE_NRIPS); - - if (npt_enabled) - kvm_cpu_cap_set(X86_FEATURE_NPT); } static __init int svm_hardware_setup(void) @@ -6055,6 +6055,10 @@ static void svm_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) { switch (entry->function) { case 0x8000000A: + if (!kvm_cpu_cap_has(X86_FEATURE_SVM)) { + entry->eax = entry->ebx = entry->ecx = entry->edx = 0; + break; + } entry->eax = 1; /* SVM revision 1 */ entry->ebx = 8; /* Lets support 8 ASIDs in case we add proper ASID emulation to nested SVM */ -- cgit From 4eb87460c4740030086411c3b7a7e167fb7e57bd Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:57:08 -0800 Subject: KVM: nSVM: Advertise and enable NRIPS for L1 iff nrips is enabled Set NRIPS in KVM capabilities if and only if nrips=true, which naturally incorporates the boot_cpu_has() check, and set nrips_enabled only if the KVM capability is enabled. Note, previously KVM would set nrips_enabled based purely on userspace input, but at worst that would cause KVM to propagate garbage into L1, i.e. userspace would simply be hosing its VM. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 76a480a37f1d..9b173d5fdc52 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1375,7 +1375,7 @@ static __init void svm_set_cpu_caps(void) if (nested) { kvm_cpu_cap_set(X86_FEATURE_SVM); - if (boot_cpu_has(X86_FEATURE_NRIPS)) + if (nrips) kvm_cpu_cap_set(X86_FEATURE_NRIPS); if (npt_enabled) @@ -6029,7 +6029,8 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu) boot_cpu_has(X86_FEATURE_XSAVES); /* Update nrips enabled cache */ - svm->nrips_enabled = !!guest_cpuid_has(&svm->vcpu, X86_FEATURE_NRIPS); + svm->nrips_enabled = kvm_cpu_cap_has(X86_FEATURE_NRIPS) && + guest_cpuid_has(&svm->vcpu, X86_FEATURE_NRIPS); if (!kvm_vcpu_apicv_active(vcpu)) return; -- cgit From 257038745cae1fdaa3948013a22eba3b1d610174 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 2 Mar 2020 15:57:09 -0800 Subject: KVM: x86: Move nSVM CPUID 0x8000000A handling into common x86 code Handle CPUID 0x8000000A in the main switch in __do_cpuid_func() and drop ->set_supported_cpuid() now that both VMX and SVM implementations are empty. Like leaf 0x14 (Intel PT) and leaf 0x8000001F (SEV), leaf 0x8000000A is is (obviously) vendor specific but can be queried in common code while respecting SVM's wishes by querying kvm_cpu_cap_has(). Suggested-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 2 -- arch/x86/kvm/cpuid.c | 13 +++++++++++-- arch/x86/kvm/svm.c | 19 ------------------- arch/x86/kvm/vmx/vmx.c | 5 ----- 4 files changed, 11 insertions(+), 28 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 531b5a96df33..24c90ea5ddbd 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1158,8 +1158,6 @@ struct kvm_x86_ops { void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3); - void (*set_supported_cpuid)(struct kvm_cpuid_entry2 *entry); - bool (*has_wbinvd_exit)(void); u64 (*read_l1_tsc_offset)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 5dd67b124a68..cc8b24b4d8f3 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -741,6 +741,17 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) cpuid_entry_override(entry, CPUID_8000_0008_EBX); break; } + case 0x8000000A: + if (!kvm_cpu_cap_has(X86_FEATURE_SVM)) { + entry->eax = entry->ebx = entry->ecx = entry->edx = 0; + break; + } + entry->eax = 1; /* SVM revision 1 */ + entry->ebx = 8; /* Lets support 8 ASIDs in case we add proper + ASID emulation to nested SVM */ + entry->ecx = 0; /* Reserved */ + cpuid_entry_override(entry, CPUID_8000_000A_EDX); + break; case 0x80000019: entry->ecx = entry->edx = 0; break; @@ -770,8 +781,6 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) break; } - kvm_x86_ops->set_supported_cpuid(entry); - r = 0; out: diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 9b173d5fdc52..c6e9910d1149 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -6052,23 +6052,6 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu) APICV_INHIBIT_REASON_NESTED); } -static void svm_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) -{ - switch (entry->function) { - case 0x8000000A: - if (!kvm_cpu_cap_has(X86_FEATURE_SVM)) { - entry->eax = entry->ebx = entry->ecx = entry->edx = 0; - break; - } - entry->eax = 1; /* SVM revision 1 */ - entry->ebx = 8; /* Lets support 8 ASIDs in case we add proper - ASID emulation to nested SVM */ - entry->ecx = 0; /* Reserved */ - cpuid_entry_override(entry, CPUID_8000_000A_EDX); - break; - } -} - static bool svm_has_wbinvd_exit(void) { return true; @@ -7428,8 +7411,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .cpuid_update = svm_cpuid_update, - .set_supported_cpuid = svm_set_supported_cpuid, - .has_wbinvd_exit = svm_has_wbinvd_exit, .read_l1_tsc_offset = svm_read_l1_tsc_offset, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 3ee5f75dd1e1..e91a84bb251c 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7102,10 +7102,6 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu) } } -static void vmx_set_supported_cpuid(struct kvm_cpuid_entry2 *entry) -{ -} - static __init void vmx_set_cpu_caps(void) { kvm_set_cpu_caps(); @@ -7919,7 +7915,6 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .get_exit_info = vmx_get_exit_info, .cpuid_update = vmx_cpuid_update, - .set_supported_cpuid = vmx_set_supported_cpuid, .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, -- cgit From 408e9a318f57ba8be82ba01e98cc271b97392187 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Thu, 5 Mar 2020 16:11:56 +0100 Subject: KVM: CPUID: add support for supervisor states Current CPUID 0xd enumeration code does not support supervisor states, because KVM only supports setting IA32_XSS to zero. Change it instead to use a new variable supported_xss, to be set from the hardware_setup callback which is in charge of CPU capabilities. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 30 ++++++++++++++++++------------ arch/x86/kvm/svm.c | 2 ++ arch/x86/kvm/vmx/vmx.c | 1 + arch/x86/kvm/x86.c | 13 +++++++++---- arch/x86/kvm/x86.h | 1 + 5 files changed, 31 insertions(+), 16 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index cc8b24b4d8f3..78d461be2102 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -642,15 +642,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) cpuid_entry_override(entry, CPUID_D_1_EAX); if (entry->eax & (F(XSAVES)|F(XSAVEC))) - entry->ebx = xstate_required_size(supported_xcr0, true); - else + entry->ebx = xstate_required_size(supported_xcr0 | supported_xss, + true); + else { + WARN_ON_ONCE(supported_xss != 0); entry->ebx = 0; - /* Saving XSS controlled state via XSAVES isn't supported. */ - entry->ecx = 0; - entry->edx = 0; + } + entry->ecx &= supported_xss; + entry->edx &= supported_xss >> 32; for (i = 2; i < 64; ++i) { - if (!(supported_xcr0 & BIT_ULL(i))) + bool s_state; + if (supported_xcr0 & BIT_ULL(i)) + s_state = false; + else if (supported_xss & BIT_ULL(i)) + s_state = true; + else continue; entry = do_host_cpuid(array, function, i); @@ -659,17 +666,16 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) /* * The supported check above should have filtered out - * invalid sub-leafs as well as sub-leafs managed by - * IA32_XSS MSR. Only XCR0-managed sub-leafs should + * invalid sub-leafs. Only valid sub-leafs should * reach this point, and they should have a non-zero - * save state size. + * save state size. Furthermore, check whether the + * processor agrees with supported_xcr0/supported_xss + * on whether this is an XCR0- or IA32_XSS-managed area. */ - if (WARN_ON_ONCE(!entry->eax || (entry->ecx & 1))) { + if (WARN_ON_ONCE(!entry->eax || (entry->ecx & 0x1) != s_state)) { --array->nent; continue; } - - entry->ecx = 0; entry->edx = 0; } break; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index c6e9910d1149..4dca3579e740 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1371,6 +1371,8 @@ static __init void svm_set_cpu_caps(void) { kvm_set_cpu_caps(); + supported_xss = 0; + /* CPUID 0x80000001 and 0x8000000A (SVM features) */ if (nested) { kvm_cpu_cap_set(X86_FEATURE_SVM); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e91a84bb251c..8001070b209c 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7126,6 +7126,7 @@ static __init void vmx_set_cpu_caps(void) kvm_cpu_cap_set(X86_FEATURE_UMIP); /* CPUID 0xD.1 */ + supported_xss = 0; if (!vmx_xsaves_supported()) kvm_cpu_cap_clear(X86_FEATURE_XSAVES); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d2f1b4746903..96e897d38a63 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -190,6 +190,8 @@ u64 __read_mostly host_efer; EXPORT_SYMBOL_GPL(host_efer); static u64 __read_mostly host_xss; +u64 __read_mostly supported_xss; +EXPORT_SYMBOL_GPL(supported_xss); struct kvm_stats_debugfs_item debugfs_entries[] = { { "pf_fixed", VCPU_STAT(pf_fixed) }, @@ -2827,7 +2829,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than * XSAVES/XRSTORS to save/restore PT MSRs. */ - if (data != 0) + if (data & ~supported_xss) return 1; vcpu->arch.ia32_xss = data; break; @@ -9617,10 +9619,16 @@ int kvm_arch_hardware_setup(void) rdmsrl_safe(MSR_EFER, &host_efer); + if (boot_cpu_has(X86_FEATURE_XSAVES)) + rdmsrl(MSR_IA32_XSS, host_xss); + r = kvm_x86_ops->hardware_setup(); if (r != 0) return r; + if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) + supported_xss = 0; + cr4_reserved_bits = kvm_host_cr4_reserved_bits(&boot_cpu_data); if (kvm_has_tsc_control) { @@ -9637,9 +9645,6 @@ int kvm_arch_hardware_setup(void) kvm_default_tsc_scaling_ratio = 1ULL << kvm_tsc_scaling_ratio_frac_bits; } - if (boot_cpu_has(X86_FEATURE_XSAVES)) - rdmsrl(MSR_IA32_XSS, host_xss); - kvm_init_msr_list(); return 0; } diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 4d890bf827e8..c1954e216b41 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -272,6 +272,7 @@ enum exit_fastpath_completion handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vc extern u64 host_xcr0; extern u64 supported_xcr0; +extern u64 supported_xss; static inline bool kvm_mpx_supported(void) { -- cgit From b7fb8488c85f2b5304e90d954a54d980adde60a4 Mon Sep 17 00:00:00 2001 From: Jan Kiszka Date: Wed, 4 Mar 2020 17:34:31 -0800 Subject: KVM: x86: Trace the original requested CPUID function in kvm_cpuid() Trace the requested CPUID function instead of the effective function, e.g. if the requested function is out-of-range and KVM is emulating an Intel CPU, as the intent of the tracepoint is to show if the output came from the actual leaf as opposed to the max basic leaf via redirection. Similarly, leave "found" as is, i.e. report that an entry was found if and only if the requested entry was found. Fixes: 43561123ab37 ("kvm: x86: Improve emulation of CPUID leaves 0BH and 1FH") Signed-off-by: Jan Kiszka [Sean: Drop "found" semantic change, reword changelong accordingly ] Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 78d461be2102..a25b520d26c9 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -933,7 +933,7 @@ static bool cpuid_function_in_range(struct kvm_vcpu *vcpu, u32 function) bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, u32 *ecx, u32 *edx, bool check_limit) { - u32 function = *eax, index = *ecx; + u32 orig_function = *eax, function = *eax, index = *ecx; struct kvm_cpuid_entry2 *entry; struct kvm_cpuid_entry2 *max; bool found; @@ -982,7 +982,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, } } } - trace_kvm_cpuid(function, *eax, *ebx, *ecx, *edx, found); + trace_kvm_cpuid(orig_function, *eax, *ebx, *ecx, *edx, found); return found; } EXPORT_SYMBOL_GPL(kvm_cpuid); -- cgit From 15608ed03f10a1414188612b0ee9653e3c78bbd9 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 4 Mar 2020 17:34:32 -0800 Subject: KVM: x86: Add helpers to perform CPUID-based guest vendor check Add helpers to provide CPUID-based guest vendor checks, i.e. to do the ugly register comparisons. Use the new helpers to check for an AMD guest vendor in guest_cpuid_is_amd() as well as in the existing emulator flows. Using the new helpers fixes a _very_ theoretical bug where guest_cpuid_is_amd() would get a false positive on a non-AMD virtual CPU with a vendor string beginning with "Auth" due to the previous logic only checking EBX. It also fixes a marginally less theoretically bug where guest_cpuid_is_amd() would incorrectly return false for a guest CPU with "AMDisbetter!" as its vendor string. Fixes: a0c0feb57992c ("KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD") Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.h | 2 +- arch/x86/kvm/emulate.c | 36 ++++++++---------------------------- arch/x86/kvm/kvm_emulate.h | 24 ++++++++++++++++++++++++ 3 files changed, 33 insertions(+), 29 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index e9a3277ce256..cccfd854aacf 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -219,7 +219,7 @@ static inline bool guest_cpuid_is_amd(struct kvm_vcpu *vcpu) struct kvm_cpuid_entry2 *best; best = kvm_find_cpuid_entry(vcpu, 0, 0); - return best && best->ebx == X86EMUL_CPUID_VENDOR_AuthenticAMD_ebx; + return best && is_guest_vendor_amd(best->ebx, best->ecx, best->edx); } static inline int guest_cpuid_family(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 0bef7762c502..6663f6887d2c 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2723,9 +2723,7 @@ static bool vendor_intel(struct x86_emulate_ctxt *ctxt) eax = ecx = 0; ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx, false); - return ebx == X86EMUL_CPUID_VENDOR_GenuineIntel_ebx - && ecx == X86EMUL_CPUID_VENDOR_GenuineIntel_ecx - && edx == X86EMUL_CPUID_VENDOR_GenuineIntel_edx; + return is_guest_vendor_intel(ebx, ecx, edx); } static bool em_syscall_is_enabled(struct x86_emulate_ctxt *ctxt) @@ -2744,34 +2742,16 @@ static bool em_syscall_is_enabled(struct x86_emulate_ctxt *ctxt) ecx = 0x00000000; ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx, false); /* - * Intel ("GenuineIntel") - * remark: Intel CPUs only support "syscall" in 64bit - * longmode. Also an 64bit guest with a - * 32bit compat-app running will #UD !! While this - * behaviour can be fixed (by emulating) into AMD - * response - CPUs of AMD can't behave like Intel. + * remark: Intel CPUs only support "syscall" in 64bit longmode. Also a + * 64bit guest with a 32bit compat-app running will #UD !! While this + * behaviour can be fixed (by emulating) into AMD response - CPUs of + * AMD can't behave like Intel. */ - if (ebx == X86EMUL_CPUID_VENDOR_GenuineIntel_ebx && - ecx == X86EMUL_CPUID_VENDOR_GenuineIntel_ecx && - edx == X86EMUL_CPUID_VENDOR_GenuineIntel_edx) + if (is_guest_vendor_intel(ebx, ecx, edx)) return false; - /* AMD ("AuthenticAMD") */ - if (ebx == X86EMUL_CPUID_VENDOR_AuthenticAMD_ebx && - ecx == X86EMUL_CPUID_VENDOR_AuthenticAMD_ecx && - edx == X86EMUL_CPUID_VENDOR_AuthenticAMD_edx) - return true; - - /* AMD ("AMDisbetter!") */ - if (ebx == X86EMUL_CPUID_VENDOR_AMDisbetterI_ebx && - ecx == X86EMUL_CPUID_VENDOR_AMDisbetterI_ecx && - edx == X86EMUL_CPUID_VENDOR_AMDisbetterI_edx) - return true; - - /* Hygon ("HygonGenuine") */ - if (ebx == X86EMUL_CPUID_VENDOR_HygonGenuine_ebx && - ecx == X86EMUL_CPUID_VENDOR_HygonGenuine_ecx && - edx == X86EMUL_CPUID_VENDOR_HygonGenuine_edx) + if (is_guest_vendor_amd(ebx, ecx, edx) || + is_guest_vendor_hygon(ebx, ecx, edx)) return true; /* diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h index f86fe6bf170d..b03a6a56e46e 100644 --- a/arch/x86/kvm/kvm_emulate.h +++ b/arch/x86/kvm/kvm_emulate.h @@ -396,6 +396,30 @@ struct x86_emulate_ctxt { #define X86EMUL_CPUID_VENDOR_GenuineIntel_ecx 0x6c65746e #define X86EMUL_CPUID_VENDOR_GenuineIntel_edx 0x49656e69 +static inline bool is_guest_vendor_intel(u32 ebx, u32 ecx, u32 edx) +{ + return ebx == X86EMUL_CPUID_VENDOR_GenuineIntel_ebx && + ecx == X86EMUL_CPUID_VENDOR_GenuineIntel_ecx && + edx == X86EMUL_CPUID_VENDOR_GenuineIntel_edx; +} + +static inline bool is_guest_vendor_amd(u32 ebx, u32 ecx, u32 edx) +{ + return (ebx == X86EMUL_CPUID_VENDOR_AuthenticAMD_ebx && + ecx == X86EMUL_CPUID_VENDOR_AuthenticAMD_ecx && + edx == X86EMUL_CPUID_VENDOR_AuthenticAMD_edx) || + (ebx == X86EMUL_CPUID_VENDOR_AMDisbetterI_ebx && + ecx == X86EMUL_CPUID_VENDOR_AMDisbetterI_ecx && + edx == X86EMUL_CPUID_VENDOR_AMDisbetterI_edx); +} + +static inline bool is_guest_vendor_hygon(u32 ebx, u32 ecx, u32 edx) +{ + return ebx == X86EMUL_CPUID_VENDOR_HygonGenuine_ebx && + ecx == X86EMUL_CPUID_VENDOR_HygonGenuine_ecx && + edx == X86EMUL_CPUID_VENDOR_HygonGenuine_edx; +} + enum x86_intercept_stage { X86_ICTP_NONE = 0, /* Allow zero-init to not match anything */ X86_ICPT_PRE_EXCEPT, -- cgit From 23493d0a1731205a4bccaa0e283da642ca6f6e29 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 4 Mar 2020 17:34:33 -0800 Subject: KVM x86: Extend AMD specific guest behavior to Hygon virtual CPUs Extend guest_cpuid_is_amd() to cover Hygon virtual CPUs and rename it accordingly. Hygon CPUs use an AMD-based core and so have the same basic behavior as AMD CPUs. Fixes: b8f4abb652146 ("x86/kvm: Add Hygon Dhyana support to KVM") Cc: Pu Wen Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 2 +- arch/x86/kvm/cpuid.h | 6 ++++-- arch/x86/kvm/mmu/mmu.c | 3 ++- arch/x86/kvm/x86.c | 2 +- 4 files changed, 8 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index a25b520d26c9..ec91b9b127fa 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -946,7 +946,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, * requested. AMD CPUID semantics returns all zeroes for any * undefined leaf, whether or not the leaf is in range. */ - if (!entry && check_limit && !guest_cpuid_is_amd(vcpu) && + if (!entry && check_limit && !guest_cpuid_is_amd_or_hygon(vcpu) && !cpuid_function_in_range(vcpu, function)) { max = kvm_find_cpuid_entry(vcpu, 0, 0); if (max) { diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index cccfd854aacf..a0988609f620 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -214,12 +214,14 @@ static __always_inline void guest_cpuid_clear(struct kvm_vcpu *vcpu, *reg &= ~__feature_bit(x86_feature); } -static inline bool guest_cpuid_is_amd(struct kvm_vcpu *vcpu) +static inline bool guest_cpuid_is_amd_or_hygon(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; best = kvm_find_cpuid_entry(vcpu, 0, 0); - return best && is_guest_vendor_amd(best->ebx, best->ecx, best->edx); + return best && + (is_guest_vendor_amd(best->ebx, best->ecx, best->edx) || + is_guest_vendor_hygon(best->ebx, best->ecx, best->edx)); } static inline int guest_cpuid_family(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 554546948e87..f1eb2d3e1f40 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4510,7 +4510,8 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, cpuid_maxphyaddr(vcpu), context->root_level, context->nx, guest_cpuid_has(vcpu, X86_FEATURE_GBPAGES), - is_pse(vcpu), guest_cpuid_is_amd(vcpu)); + is_pse(vcpu), + guest_cpuid_is_amd_or_hygon(vcpu)); } static void diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 96e897d38a63..cda0b787b2e3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2550,7 +2550,7 @@ static void kvmclock_sync_fn(struct work_struct *work) static bool can_set_mci_status(struct kvm_vcpu *vcpu) { /* McStatusWrEn enabled? */ - if (guest_cpuid_is_amd(vcpu)) + if (guest_cpuid_is_amd_or_hygon(vcpu)) return !!(vcpu->arch.msr_hwcr & BIT_ULL(18)); return false; -- cgit From 8d8923115f1bc8f10b3309b9245cc3a7f39b5aa7 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 4 Mar 2020 17:34:34 -0800 Subject: KVM: x86: Fix CPUID range checks for Hypervisor and Centaur classes Rework the masking in the out-of-range CPUID logic to handle the Hypervisor sub-classes, as well as the Centaur class if the guest virtual CPU vendor is Centaur. Masking against 0x80000000 only handles basic and extended leafs, which results in Hypervisor range checks being performed against the basic CPUID class, and Centuar range checks being performed against the Extended class. E.g. if CPUID.0x40000000.EAX returns 0x4000000A and there is no entry for CPUID.0x40000006, then function 0x40000006 would be incorrectly reported as out of bounds. While there is no official definition of what constitutes a class, the convention established for Hypervisor classes effectively uses bits 31:8 as the mask by virtue of checking for different bases in increments of 0x100, e.g. KVM advertises its CPUID functions starting at 0x40000100 when HyperV features are advertised at the default base of 0x40000000. The bad range check doesn't cause functional problems for any known VMM because out-of-range semantics only come into play if the exact entry isn't found, and VMMs either support a very limited Hypervisor range, e.g. the official KVM range is 0x40000000-0x40000001 (effectively no room for undefined leafs) or explicitly defines gaps to be zero, e.g. Qemu explicitly creates zeroed entries up to the Centaur and Hypervisor limits (the latter comes into play when providing HyperV features). The bad behavior can be visually confirmed by dumping CPUID output in the guest when running Qemu with a stable TSC, as Qemu extends the limit of range 0x40000000 to 0x40000010 to advertise VMware's cpuid_freq, without defining zeroed entries for 0x40000002 - 0x4000000f. Note, documentation of Centaur/VIA CPUs is hard to come by. Designating 0xc0000000 - 0xcfffffff as the Centaur class is a best guess as to the behavior of a real Centaur/VIA CPU. Fixes: 43561123ab37 ("kvm: x86: Improve emulation of CPUID leaves 0BH and 1FH") Cc: Jim Mattson Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 45 +++++++++++++++++++++++++++++++++++++++------ arch/x86/kvm/kvm_emulate.h | 4 ++++ 2 files changed, 43 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index ec91b9b127fa..299ed439af25 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -918,16 +918,49 @@ struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry); /* - * If the basic or extended CPUID leaf requested is higher than the - * maximum supported basic or extended leaf, respectively, then it is - * out of range. + * Intel CPUID semantics treats any query for an out-of-range leaf as if the + * highest basic leaf (i.e. CPUID.0H:EAX) were requested. AMD CPUID semantics + * returns all zeroes for any undefined leaf, whether or not the leaf is in + * range. Centaur/VIA follows Intel semantics. + * + * A leaf is considered out-of-range if its function is higher than the maximum + * supported leaf of its associated class or if its associated class does not + * exist. + * + * There are three primary classes to be considered, with their respective + * ranges described as " - [, - ] inclusive. A primary + * class exists if a guest CPUID entry for its leaf exists. For a given + * class, CPUID..EAX contains the max supported leaf for the class. + * + * - Basic: 0x00000000 - 0x3fffffff, 0x50000000 - 0x7fffffff + * - Hypervisor: 0x40000000 - 0x4fffffff + * - Extended: 0x80000000 - 0xbfffffff + * - Centaur: 0xc0000000 - 0xcfffffff + * + * The Hypervisor class is further subdivided into sub-classes that each act as + * their own indepdent class associated with a 0x100 byte range. E.g. if Qemu + * is advertising support for both HyperV and KVM, the resulting Hypervisor + * CPUID sub-classes are: + * + * - HyperV: 0x40000000 - 0x400000ff + * - KVM: 0x40000100 - 0x400001ff */ static bool cpuid_function_in_range(struct kvm_vcpu *vcpu, u32 function) { - struct kvm_cpuid_entry2 *max; + struct kvm_cpuid_entry2 *basic, *class; + + basic = kvm_find_cpuid_entry(vcpu, 0, 0); + if (!basic) + return true; + + if (function >= 0x40000000 && function <= 0x4fffffff) + class = kvm_find_cpuid_entry(vcpu, function & 0xffffff00, 0); + else if (function >= 0xc0000000) + class = kvm_find_cpuid_entry(vcpu, 0xc0000000, 0); + else + class = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0); - max = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0); - return max && function <= max->eax; + return class && function <= class->eax; } bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h index b03a6a56e46e..3cb50eda606d 100644 --- a/arch/x86/kvm/kvm_emulate.h +++ b/arch/x86/kvm/kvm_emulate.h @@ -396,6 +396,10 @@ struct x86_emulate_ctxt { #define X86EMUL_CPUID_VENDOR_GenuineIntel_ecx 0x6c65746e #define X86EMUL_CPUID_VENDOR_GenuineIntel_edx 0x49656e69 +#define X86EMUL_CPUID_VENDOR_CentaurHauls_ebx 0x746e6543 +#define X86EMUL_CPUID_VENDOR_CentaurHauls_ecx 0x736c7561 +#define X86EMUL_CPUID_VENDOR_CentaurHauls_edx 0x48727561 + static inline bool is_guest_vendor_intel(u32 ebx, u32 ecx, u32 edx) { return ebx == X86EMUL_CPUID_VENDOR_GenuineIntel_ebx && -- cgit From 09c7431ed31f3cc71aba2c35a6d11a742db47b2a Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 4 Mar 2020 17:34:36 -0800 Subject: KVM: x86: Refactor out-of-range logic to contain the madness Move all of the out-of-range logic into a single helper, get_out_of_range_cpuid_entry(), to avoid an extra lookup of CPUID.0.0 and to provide a single location for documenting the out-of-range behavior. No functional change intended. Cc: Jim Mattson Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 47 +++++++++++++++++++++++++++++------------------ 1 file changed, 29 insertions(+), 18 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 299ed439af25..afa2062b5a79 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -945,13 +945,19 @@ EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry); * - HyperV: 0x40000000 - 0x400000ff * - KVM: 0x40000100 - 0x400001ff */ -static bool cpuid_function_in_range(struct kvm_vcpu *vcpu, u32 function) +static struct kvm_cpuid_entry2 * +get_out_of_range_cpuid_entry(struct kvm_vcpu *vcpu, u32 *fn_ptr, u32 index) { struct kvm_cpuid_entry2 *basic, *class; + u32 function = *fn_ptr; basic = kvm_find_cpuid_entry(vcpu, 0, 0); if (!basic) - return true; + return NULL; + + if (is_guest_vendor_amd(basic->ebx, basic->ecx, basic->edx) || + is_guest_vendor_hygon(basic->ebx, basic->ecx, basic->edx)) + return NULL; if (function >= 0x40000000 && function <= 0x4fffffff) class = kvm_find_cpuid_entry(vcpu, function & 0xffffff00, 0); @@ -960,7 +966,23 @@ static bool cpuid_function_in_range(struct kvm_vcpu *vcpu, u32 function) else class = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0); - return class && function <= class->eax; + if (class && function <= class->eax) + return NULL; + + /* + * Leaf specific adjustments are also applied when redirecting to the + * max basic entry, e.g. if the max basic leaf is 0xb but there is no + * entry for CPUID.0xb.index (see below), then the output value for EDX + * needs to be pulled from CPUID.0xb.1. + */ + *fn_ptr = basic->eax; + + /* + * The class does not exist or the requested function is out of range; + * the effective CPUID entry is the max basic leaf. Note, the index of + * the original requested leaf is observed! + */ + return kvm_find_cpuid_entry(vcpu, basic->eax, index); } bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, @@ -968,25 +990,14 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, { u32 orig_function = *eax, function = *eax, index = *ecx; struct kvm_cpuid_entry2 *entry; - struct kvm_cpuid_entry2 *max; bool found; entry = kvm_find_cpuid_entry(vcpu, function, index); found = entry; - /* - * Intel CPUID semantics treats any query for an out-of-range - * leaf as if the highest basic leaf (i.e. CPUID.0H:EAX) were - * requested. AMD CPUID semantics returns all zeroes for any - * undefined leaf, whether or not the leaf is in range. - */ - if (!entry && check_limit && !guest_cpuid_is_amd_or_hygon(vcpu) && - !cpuid_function_in_range(vcpu, function)) { - max = kvm_find_cpuid_entry(vcpu, 0, 0); - if (max) { - function = max->eax; - entry = kvm_find_cpuid_entry(vcpu, function, index); - } - } + + if (!entry && check_limit) + entry = get_out_of_range_cpuid_entry(vcpu, &function, index); + if (entry) { *eax = entry->eax; *ebx = entry->ebx; -- cgit From f91af5176cce77bb0d3292e46665c30af0792dcd Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 4 Mar 2020 17:34:37 -0800 Subject: KVM: x86: Refactor kvm_cpuid() param that controls out-of-range logic Invert and rename the kvm_cpuid() param that controls out-of-range logic to better reflect the semantics of the affected callers, i.e. callers that bypass the out-of-range logic do so because they are looking up an exact guest CPUID entry, e.g. to query the maxphyaddr. Similarly, rename kvm_cpuid()'s internal "found" to "exact" to clarify that it tracks whether or not the exact requested leaf was found, as opposed to any usable leaf being found. No functional change intended. Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 14 +++++++------- arch/x86/kvm/cpuid.h | 2 +- arch/x86/kvm/emulate.c | 8 ++++---- arch/x86/kvm/kvm_emulate.h | 2 +- arch/x86/kvm/svm.c | 2 +- arch/x86/kvm/x86.c | 5 +++-- 6 files changed, 17 insertions(+), 16 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index afa2062b5a79..08280d8a2ac9 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -986,16 +986,16 @@ get_out_of_range_cpuid_entry(struct kvm_vcpu *vcpu, u32 *fn_ptr, u32 index) } bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, - u32 *ecx, u32 *edx, bool check_limit) + u32 *ecx, u32 *edx, bool exact_only) { u32 orig_function = *eax, function = *eax, index = *ecx; struct kvm_cpuid_entry2 *entry; - bool found; + bool exact; entry = kvm_find_cpuid_entry(vcpu, function, index); - found = entry; + exact = !!entry; - if (!entry && check_limit) + if (!entry && !exact_only) entry = get_out_of_range_cpuid_entry(vcpu, &function, index); if (entry) { @@ -1026,8 +1026,8 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, } } } - trace_kvm_cpuid(orig_function, *eax, *ebx, *ecx, *edx, found); - return found; + trace_kvm_cpuid(orig_function, *eax, *ebx, *ecx, *edx, exact); + return exact; } EXPORT_SYMBOL_GPL(kvm_cpuid); @@ -1040,7 +1040,7 @@ int kvm_emulate_cpuid(struct kvm_vcpu *vcpu) eax = kvm_rax_read(vcpu); ecx = kvm_rcx_read(vcpu); - kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, true); + kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, false); kvm_rax_write(vcpu, eax); kvm_rbx_write(vcpu, ebx); kvm_rcx_write(vcpu, ecx); diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index a0988609f620..23b4cd1ad986 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -25,7 +25,7 @@ int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, struct kvm_cpuid_entry2 __user *entries); bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, - u32 *ecx, u32 *edx, bool check_limit); + u32 *ecx, u32 *edx, bool exact_only); int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 6663f6887d2c..fefa32d6af00 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2722,7 +2722,7 @@ static bool vendor_intel(struct x86_emulate_ctxt *ctxt) u32 eax, ebx, ecx, edx; eax = ecx = 0; - ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx, false); + ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx, true); return is_guest_vendor_intel(ebx, ecx, edx); } @@ -2740,7 +2740,7 @@ static bool em_syscall_is_enabled(struct x86_emulate_ctxt *ctxt) eax = 0x00000000; ecx = 0x00000000; - ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx, false); + ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx, true); /* * remark: Intel CPUs only support "syscall" in 64bit longmode. Also a * 64bit guest with a 32bit compat-app running will #UD !! While this @@ -3971,7 +3971,7 @@ static int em_cpuid(struct x86_emulate_ctxt *ctxt) eax = reg_read(ctxt, VCPU_REGS_RAX); ecx = reg_read(ctxt, VCPU_REGS_RCX); - ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx, true); + ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx, false); *reg_write(ctxt, VCPU_REGS_RAX) = eax; *reg_write(ctxt, VCPU_REGS_RBX) = ebx; *reg_write(ctxt, VCPU_REGS_RCX) = ecx; @@ -4241,7 +4241,7 @@ static int check_cr_write(struct x86_emulate_ctxt *ctxt) eax = 0x80000008; ecx = 0; if (ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, - &edx, false)) + &edx, true)) maxphyaddr = eax & 0xff; else maxphyaddr = 36; diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h index 3cb50eda606d..4688b26c17ee 100644 --- a/arch/x86/kvm/kvm_emulate.h +++ b/arch/x86/kvm/kvm_emulate.h @@ -221,7 +221,7 @@ struct x86_emulate_ops { enum x86_intercept_stage stage); bool (*get_cpuid)(struct x86_emulate_ctxt *ctxt, u32 *eax, u32 *ebx, - u32 *ecx, u32 *edx, bool check_limit); + u32 *ecx, u32 *edx, bool exact_only); bool (*guest_has_long_mode)(struct x86_emulate_ctxt *ctxt); bool (*guest_has_movbe)(struct x86_emulate_ctxt *ctxt); bool (*guest_has_fxsr)(struct x86_emulate_ctxt *ctxt); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 4dca3579e740..95d0d1841483 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -2193,7 +2193,7 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) } init_vmcb(svm); - kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true); + kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, false); kvm_rdx_write(vcpu, eax); if (kvm_vcpu_apicv_active(vcpu) && !init_event) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cda0b787b2e3..62cf170b31d4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6241,9 +6241,10 @@ static int emulator_intercept(struct x86_emulate_ctxt *ctxt, } static bool emulator_get_cpuid(struct x86_emulate_ctxt *ctxt, - u32 *eax, u32 *ebx, u32 *ecx, u32 *edx, bool check_limit) + u32 *eax, u32 *ebx, u32 *ecx, u32 *edx, + bool exact_only) { - return kvm_cpuid(emul_to_vcpu(ctxt), eax, ebx, ecx, edx, check_limit); + return kvm_cpuid(emul_to_vcpu(ctxt), eax, ebx, ecx, edx, exact_only); } static bool emulator_guest_has_long_mode(struct x86_emulate_ctxt *ctxt) -- cgit From 689f3bf2162895cf0b847f36584309064887c966 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Tue, 3 Mar 2020 10:11:10 +0100 Subject: KVM: x86: unify callbacks to load paging root Similar to what kvm-intel.ko is doing, provide a single callback that merges svm_set_cr3, set_tdp_cr3 and nested_svm_set_tdp_cr3. This lets us unify the set_cr3 and set_tdp_cr3 entries in kvm_x86_ops. I'm doing that in this same patch because splitting it adds quite a bit of churn due to the need for forward declarations. For the same reason the assignment to vcpu->arch.mmu->set_cr3 is moved to kvm_init_shadow_mmu from init_kvm_softmmu and nested_svm_init_mmu_context. Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 3 --- arch/x86/kvm/mmu.h | 6 +++--- arch/x86/kvm/mmu/mmu.c | 2 -- arch/x86/kvm/svm.c | 42 ++++++++++++++++++----------------------- arch/x86/kvm/vmx/nested.c | 1 - arch/x86/kvm/vmx/vmx.c | 2 -- 6 files changed, 21 insertions(+), 35 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 24c90ea5ddbd..c3e4e764a291 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -387,7 +387,6 @@ struct kvm_mmu_root_info { * current mmu mode. */ struct kvm_mmu { - void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root); unsigned long (*get_guest_pgd)(struct kvm_vcpu *vcpu); u64 (*get_pdptr)(struct kvm_vcpu *vcpu, int index); int (*page_fault)(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u32 err, @@ -1156,8 +1155,6 @@ struct kvm_x86_ops { int (*get_tdp_level)(struct kvm_vcpu *vcpu); u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio); - void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3); - bool (*has_wbinvd_exit)(void); u64 (*read_l1_tsc_offset)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index a647601c9e1c..27d2c892bdbf 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -95,11 +95,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu) return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu)); } -static inline void kvm_mmu_load_cr3(struct kvm_vcpu *vcpu) +static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu) { if (VALID_PAGE(vcpu->arch.mmu->root_hpa)) - vcpu->arch.mmu->set_cr3(vcpu, vcpu->arch.mmu->root_hpa | - kvm_get_active_pcid(vcpu)); + kvm_x86_ops->set_cr3(vcpu, vcpu->arch.mmu->root_hpa | + kvm_get_active_pcid(vcpu)); } int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index f1eb2d3e1f40..b60215975073 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4932,7 +4932,6 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu) context->update_pte = nonpaging_update_pte; context->shadow_root_level = kvm_x86_ops->get_tdp_level(vcpu); context->direct_map = true; - context->set_cr3 = kvm_x86_ops->set_tdp_cr3; context->get_guest_pgd = get_cr3; context->get_pdptr = kvm_pdptr_read; context->inject_page_fault = kvm_inject_page_fault; @@ -5079,7 +5078,6 @@ static void init_kvm_softmmu(struct kvm_vcpu *vcpu) struct kvm_mmu *context = vcpu->arch.mmu; kvm_init_shadow_mmu(vcpu); - context->set_cr3 = kvm_x86_ops->set_cr3; context->get_guest_pgd = get_cr3; context->get_pdptr = kvm_pdptr_read; context->inject_page_fault = kvm_inject_page_fault; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 95d0d1841483..091615eeead9 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -2989,15 +2989,6 @@ static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index) return pdpte; } -static void nested_svm_set_tdp_cr3(struct kvm_vcpu *vcpu, - unsigned long root) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - svm->vmcb->control.nested_cr3 = __sme_set(root); - mark_dirty(svm->vmcb, VMCB_NPT); -} - static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu, struct x86_exception *fault) { @@ -3033,7 +3024,6 @@ static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu) vcpu->arch.mmu = &vcpu->arch.guest_mmu; kvm_init_shadow_mmu(vcpu); - vcpu->arch.mmu->set_cr3 = nested_svm_set_tdp_cr3; vcpu->arch.mmu->get_guest_pgd = nested_svm_get_tdp_cr3; vcpu->arch.mmu->get_pdptr = nested_svm_get_tdp_pdptr; vcpu->arch.mmu->inject_page_fault = nested_svm_inject_npf_exit; @@ -5955,21 +5945,27 @@ STACK_FRAME_NON_STANDARD(svm_vcpu_run); static void svm_set_cr3(struct kvm_vcpu *vcpu, unsigned long root) { struct vcpu_svm *svm = to_svm(vcpu); + bool update_guest_cr3 = true; + unsigned long cr3; - svm->vmcb->save.cr3 = __sme_set(root); - mark_dirty(svm->vmcb, VMCB_CR); -} - -static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned long root) -{ - struct vcpu_svm *svm = to_svm(vcpu); + cr3 = __sme_set(root); + if (npt_enabled) { + svm->vmcb->control.nested_cr3 = cr3; + mark_dirty(svm->vmcb, VMCB_NPT); - svm->vmcb->control.nested_cr3 = __sme_set(root); - mark_dirty(svm->vmcb, VMCB_NPT); + /* Loading L2's CR3 is handled by enter_svm_guest_mode. */ + if (is_guest_mode(vcpu)) + update_guest_cr3 = false; + else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail)) + cr3 = vcpu->arch.cr3; + else /* CR3 is already up-to-date. */ + update_guest_cr3 = false; + } - /* Also sync guest cr3 here in case we live migrate */ - svm->vmcb->save.cr3 = kvm_read_cr3(vcpu); - mark_dirty(svm->vmcb, VMCB_CR); + if (update_guest_cr3) { + svm->vmcb->save.cr3 = cr3; + mark_dirty(svm->vmcb, VMCB_CR); + } } static int is_disabled(void) @@ -7418,8 +7414,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .read_l1_tsc_offset = svm_read_l1_tsc_offset, .write_l1_tsc_offset = svm_write_l1_tsc_offset, - .set_tdp_cr3 = set_tdp_cr3, - .check_intercept = svm_check_intercept, .handle_exit_irqoff = svm_handle_exit_irqoff, diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index e47eb7c0fbae..cf3d95c99089 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -354,7 +354,6 @@ static void nested_ept_init_mmu_context(struct kvm_vcpu *vcpu) VMX_EPT_EXECUTE_ONLY_BIT, nested_ept_ad_enabled(vcpu), nested_ept_get_eptp(vcpu)); - vcpu->arch.mmu->set_cr3 = vmx_set_cr3; vcpu->arch.mmu->get_guest_pgd = nested_ept_get_eptp; vcpu->arch.mmu->inject_page_fault = nested_ept_inject_page_fault; vcpu->arch.mmu->get_pdptr = kvm_pdptr_read; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 8001070b209c..815e5e9b05c7 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7922,8 +7922,6 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .read_l1_tsc_offset = vmx_read_l1_tsc_offset, .write_l1_tsc_offset = vmx_write_l1_tsc_offset, - .set_tdp_cr3 = vmx_set_cr3, - .check_intercept = vmx_check_intercept, .handle_exit_irqoff = vmx_handle_exit_irqoff, -- cgit From 727a7e27cf88a261c5a0f14f4f9ee4d767352766 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Thu, 5 Mar 2020 03:52:50 -0500 Subject: KVM: x86: rename set_cr3 callback and related flags to load_mmu_pgd The set_cr3 callback is not setting the guest CR3, it is setting the root of the guest page tables, either shadow or two-dimensional. To make this clearer as well as to indicate that the MMU calls it via kvm_mmu_load_cr3, rename it to load_mmu_pgd. Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 5 +++-- arch/x86/kvm/mmu.h | 4 ++-- arch/x86/kvm/mmu/mmu.c | 4 ++-- arch/x86/kvm/svm.c | 5 +++-- arch/x86/kvm/vmx/nested.c | 8 ++++---- arch/x86/kvm/vmx/vmx.c | 5 +++-- arch/x86/kvm/vmx/vmx.h | 2 +- arch/x86/kvm/x86.c | 4 ++-- 8 files changed, 20 insertions(+), 17 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c3e4e764a291..9a183e9d4cb1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -58,7 +58,7 @@ #define KVM_REQ_TRIPLE_FAULT KVM_ARCH_REQ(2) #define KVM_REQ_MMU_SYNC KVM_ARCH_REQ(3) #define KVM_REQ_CLOCK_UPDATE KVM_ARCH_REQ(4) -#define KVM_REQ_LOAD_CR3 KVM_ARCH_REQ(5) +#define KVM_REQ_LOAD_MMU_PGD KVM_ARCH_REQ(5) #define KVM_REQ_EVENT KVM_ARCH_REQ(6) #define KVM_REQ_APF_HALT KVM_ARCH_REQ(7) #define KVM_REQ_STEAL_UPDATE KVM_ARCH_REQ(8) @@ -1091,7 +1091,6 @@ struct kvm_x86_ops { void (*decache_cr0_guest_bits)(struct kvm_vcpu *vcpu); void (*decache_cr4_guest_bits)(struct kvm_vcpu *vcpu); void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0); - void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3); int (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4); void (*set_efer)(struct kvm_vcpu *vcpu, u64 efer); void (*get_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt); @@ -1155,6 +1154,8 @@ struct kvm_x86_ops { int (*get_tdp_level)(struct kvm_vcpu *vcpu); u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio); + void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, unsigned long cr3); + bool (*has_wbinvd_exit)(void); u64 (*read_l1_tsc_offset)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 27d2c892bdbf..e6bfe79e94d8 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -98,8 +98,8 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu) static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu) { if (VALID_PAGE(vcpu->arch.mmu->root_hpa)) - kvm_x86_ops->set_cr3(vcpu, vcpu->arch.mmu->root_hpa | - kvm_get_active_pcid(vcpu)); + kvm_x86_ops->load_mmu_pgd(vcpu, vcpu->arch.mmu->root_hpa | + kvm_get_active_pcid(vcpu)); } int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index b60215975073..560e85ebdf22 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4311,7 +4311,7 @@ static bool fast_cr3_switch(struct kvm_vcpu *vcpu, gpa_t new_cr3, * accompanied by KVM_REQ_MMU_RELOAD, which will free * the root set here and allocate a new one. */ - kvm_make_request(KVM_REQ_LOAD_CR3, vcpu); + kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu); if (!skip_tlb_flush) { kvm_make_request(KVM_REQ_MMU_SYNC, vcpu); kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); @@ -5182,7 +5182,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) kvm_mmu_sync_roots(vcpu); if (r) goto out; - kvm_mmu_load_cr3(vcpu); + kvm_mmu_load_pgd(vcpu); kvm_x86_ops->tlb_flush(vcpu, true); out: return r; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 091615eeead9..9b983162af73 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -5942,7 +5942,7 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu) } STACK_FRAME_NON_STANDARD(svm_vcpu_run); -static void svm_set_cr3(struct kvm_vcpu *vcpu, unsigned long root) +static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long root) { struct vcpu_svm *svm = to_svm(vcpu); bool update_guest_cr3 = true; @@ -7354,7 +7354,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .decache_cr0_guest_bits = svm_decache_cr0_guest_bits, .decache_cr4_guest_bits = svm_decache_cr4_guest_bits, .set_cr0 = svm_set_cr0, - .set_cr3 = svm_set_cr3, .set_cr4 = svm_set_cr4, .set_efer = svm_set_efer, .get_idt = svm_get_idt, @@ -7414,6 +7413,8 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .read_l1_tsc_offset = svm_read_l1_tsc_offset, .write_l1_tsc_offset = svm_write_l1_tsc_offset, + .load_mmu_pgd = svm_load_mmu_pgd, + .check_intercept = svm_check_intercept, .handle_exit_irqoff = svm_handle_exit_irqoff, diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index cf3d95c99089..06585c1346ca 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2473,9 +2473,9 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, * If L1 use EPT, then L0 needs to execute INVEPT on * EPTP02 instead of EPTP01. Therefore, delay TLB * flush until vmcs02->eptp is fully updated by - * KVM_REQ_LOAD_CR3. Note that this assumes + * KVM_REQ_LOAD_MMU_PGD. Note that this assumes * KVM_REQ_TLB_FLUSH is evaluated after - * KVM_REQ_LOAD_CR3 in vcpu_enter_guest(). + * KVM_REQ_LOAD_MMU_PGD in vcpu_enter_guest(). */ kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); } @@ -2520,7 +2520,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, /* * Immediately write vmcs02.GUEST_CR3. It will be propagated to vmcs12 * on nested VM-Exit, which can occur without actually running L2 and - * thus without hitting vmx_set_cr3(), e.g. if L1 is entering L2 with + * thus without hitting vmx_load_mmu_pgd(), e.g. if L1 is entering L2 with * vmcs12.GUEST_ACTIVITYSTATE=HLT, in which case KVM will intercept the * transition to HLT instead of running L2. */ @@ -4031,7 +4031,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, * * If vmcs12 uses EPT, we need to execute this flush on EPTP01 * and therefore we request the TLB flush to happen only after VMCS EPTP - * has been set by KVM_REQ_LOAD_CR3. + * has been set by KVM_REQ_LOAD_MMU_PGD. */ if (enable_vpid && (!nested_cpu_has_vpid(vmcs12) || !nested_has_guest_tlb_tag(vcpu))) { diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 815e5e9b05c7..e961633182f8 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2995,7 +2995,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa) return eptp; } -void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3) +void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long cr3) { struct kvm *kvm = vcpu->kvm; bool update_guest_cr3 = true; @@ -7859,7 +7859,6 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .decache_cr0_guest_bits = vmx_decache_cr0_guest_bits, .decache_cr4_guest_bits = vmx_decache_cr4_guest_bits, .set_cr0 = vmx_set_cr0, - .set_cr3 = vmx_set_cr3, .set_cr4 = vmx_set_cr4, .set_efer = vmx_set_efer, .get_idt = vmx_get_idt, @@ -7922,6 +7921,8 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .read_l1_tsc_offset = vmx_read_l1_tsc_offset, .write_l1_tsc_offset = vmx_write_l1_tsc_offset, + .load_mmu_pgd = vmx_load_mmu_pgd, + .check_intercept = vmx_check_intercept, .handle_exit_irqoff = vmx_handle_exit_irqoff, diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index fc45bdb5a62f..be93d597306c 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -334,9 +334,9 @@ u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu); void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask); void vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer); void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0); -void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3); int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4); void set_cr4_guest_host_mask(struct vcpu_vmx *vmx); +void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long cr3); void ept_save_pdptrs(struct kvm_vcpu *vcpu); void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 62cf170b31d4..c67324138e17 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8186,8 +8186,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) } if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu)) kvm_mmu_sync_roots(vcpu); - if (kvm_check_request(KVM_REQ_LOAD_CR3, vcpu)) - kvm_mmu_load_cr3(vcpu); + if (kvm_check_request(KVM_REQ_LOAD_MMU_PGD, vcpu)) + kvm_mmu_load_pgd(vcpu); if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu)) kvm_vcpu_flush_tlb(vcpu, true); if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) { -- cgit From b5ec2e020b7015618781d313a4a6f93c2d2b8144 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Wed, 4 Mar 2020 12:57:49 -0500 Subject: KVM: nSVM: do not change host intercepts while nested VM is running Instead of touching the host intercepts so that the bitwise OR in recalc_intercepts just works, mask away uninteresting intercepts directly in recalc_intercepts. This is cleaner and keeps the logic in one place even for intercepts that can change even while L2 is running. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 31 ++++++++++++++++++------------- 1 file changed, 18 insertions(+), 13 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 9b983162af73..b1ccbcf6e751 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -519,10 +519,24 @@ static void recalc_intercepts(struct vcpu_svm *svm) h = &svm->nested.hsave->control; g = &svm->nested; - c->intercept_cr = h->intercept_cr | g->intercept_cr; - c->intercept_dr = h->intercept_dr | g->intercept_dr; - c->intercept_exceptions = h->intercept_exceptions | g->intercept_exceptions; - c->intercept = h->intercept | g->intercept; + c->intercept_cr = h->intercept_cr; + c->intercept_dr = h->intercept_dr; + c->intercept_exceptions = h->intercept_exceptions; + c->intercept = h->intercept; + + if (svm->vcpu.arch.hflags & HF_VINTR_MASK) { + /* We only want the cr8 intercept bits of L1 */ + c->intercept_cr &= ~(1U << INTERCEPT_CR8_READ); + c->intercept_cr &= ~(1U << INTERCEPT_CR8_WRITE); + } + + /* We don't want to see VMMCALLs from a nested guest */ + c->intercept &= ~(1ULL << INTERCEPT_VMMCALL); + + c->intercept_cr |= g->intercept_cr; + c->intercept_dr |= g->intercept_dr; + c->intercept_exceptions |= g->intercept_exceptions; + c->intercept |= g->intercept; } static inline struct vmcb *get_host_vmcb(struct vcpu_svm *svm) @@ -3592,15 +3606,6 @@ static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, else svm->vcpu.arch.hflags &= ~HF_VINTR_MASK; - if (svm->vcpu.arch.hflags & HF_VINTR_MASK) { - /* We only want the cr8 intercept bits of the guest */ - clr_cr_intercept(svm, INTERCEPT_CR8_READ); - clr_cr_intercept(svm, INTERCEPT_CR8_WRITE); - } - - /* We don't want to see VMMCALLs from a nested guest */ - clr_intercept(svm, INTERCEPT_VMMCALL); - svm->vcpu.arch.tsc_offset += nested_vmcb->control.tsc_offset; svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset; -- cgit From 64b5bd27042639dfcc1534f01771b7b871a02ffe Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Wed, 4 Mar 2020 13:12:35 -0500 Subject: KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1 If a nested VM is started while an IRQ was pending and with V_INTR_MASKING=1, the behavior of the guest depends on host IF. If it is 1, the VM should exit immediately, before executing the first instruction of the guest, because VMRUN sets GIF back to 1. If it is 0 and the host has VGIF, however, at the time of the VMRUN instruction L0 is running the guest with a pending interrupt window request. This interrupt window request is completely irrelevant to L2, since IF only controls virtual interrupts, so this patch drops INTERCEPT_VINTR from the VMCB while running L2 under these circumstances. To simplify the code, both steps of enabling the interrupt window (setting the VINTR intercept and requesting a fake virtual interrupt in svm_inject_irq) are grouped in the svm_set_vintr function, and likewise for dismissing the interrupt window request in svm_clear_vintr. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 55 ++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 37 insertions(+), 18 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index b1ccbcf6e751..0bcac23210e2 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -528,6 +528,13 @@ static void recalc_intercepts(struct vcpu_svm *svm) /* We only want the cr8 intercept bits of L1 */ c->intercept_cr &= ~(1U << INTERCEPT_CR8_READ); c->intercept_cr &= ~(1U << INTERCEPT_CR8_WRITE); + + /* + * Once running L2 with HF_VINTR_MASK, EFLAGS.IF does not + * affect any interrupt we may want to inject; therefore, + * interrupt window vmexits are irrelevant to L0. + */ + c->intercept &= ~(1ULL << INTERCEPT_VINTR); } /* We don't want to see VMMCALLs from a nested guest */ @@ -641,6 +648,11 @@ static inline void clr_intercept(struct vcpu_svm *svm, int bit) recalc_intercepts(svm); } +static inline bool is_intercept(struct vcpu_svm *svm, int bit) +{ + return (svm->vmcb->control.intercept & (1ULL << bit)) != 0; +} + static inline bool vgif_enabled(struct vcpu_svm *svm) { return !!(svm->vmcb->control.int_ctl & V_GIF_ENABLE_MASK); @@ -2440,14 +2452,38 @@ static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) } } +static inline void svm_enable_vintr(struct vcpu_svm *svm) +{ + struct vmcb_control_area *control; + + /* The following fields are ignored when AVIC is enabled */ + WARN_ON(kvm_vcpu_apicv_active(&svm->vcpu)); + + /* + * This is just a dummy VINTR to actually cause a vmexit to happen. + * Actual injection of virtual interrupts happens through EVENTINJ. + */ + control = &svm->vmcb->control; + control->int_vector = 0x0; + control->int_ctl &= ~V_INTR_PRIO_MASK; + control->int_ctl |= V_IRQ_MASK | + ((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT); + mark_dirty(svm->vmcb, VMCB_INTR); +} + static void svm_set_vintr(struct vcpu_svm *svm) { set_intercept(svm, INTERCEPT_VINTR); + if (is_intercept(svm, INTERCEPT_VINTR)) + svm_enable_vintr(svm); } static void svm_clear_vintr(struct vcpu_svm *svm) { clr_intercept(svm, INTERCEPT_VINTR); + + svm->vmcb->control.int_ctl &= ~V_IRQ_MASK; + mark_dirty(svm->vmcb, VMCB_INTR); } static struct vmcb_seg *svm_seg(struct kvm_vcpu *vcpu, int seg) @@ -3835,11 +3871,8 @@ static int clgi_interception(struct vcpu_svm *svm) disable_gif(svm); /* After a CLGI no interrupts should come */ - if (!kvm_vcpu_apicv_active(&svm->vcpu)) { + if (!kvm_vcpu_apicv_active(&svm->vcpu)) svm_clear_vintr(svm); - svm->vmcb->control.int_ctl &= ~V_IRQ_MASK; - mark_dirty(svm->vmcb, VMCB_INTR); - } return ret; } @@ -5125,19 +5158,6 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu) ++vcpu->stat.nmi_injections; } -static inline void svm_inject_irq(struct vcpu_svm *svm, int irq) -{ - struct vmcb_control_area *control; - - /* The following fields are ignored when AVIC is enabled */ - control = &svm->vmcb->control; - control->int_vector = irq; - control->int_ctl &= ~V_INTR_PRIO_MASK; - control->int_ctl |= V_IRQ_MASK | - ((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT); - mark_dirty(svm->vmcb, VMCB_INTR); -} - static void svm_set_irq(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -5561,7 +5581,6 @@ static void enable_irq_window(struct kvm_vcpu *vcpu) */ svm_toggle_avic_for_irq_window(vcpu, false); svm_set_vintr(svm); - svm_inject_irq(svm, 0x0); } } -- cgit From b518ba9fa691a3066ee935f6f317f827295453f0 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Wed, 4 Mar 2020 16:46:47 -0500 Subject: KVM: nSVM: implement check_nested_events for interrupts The current implementation of physical interrupt delivery to a nested guest is quite broken. It relies on svm_interrupt_allowed returning false if VINTR=1 so that the interrupt can be injected from enable_irq_window, but this does not work for guests that do not intercept HLT or that rely on clearing the host IF to block physical interrupts while L2 runs. This patch can be split in two logical parts, but including only one breaks tests so I am combining both changes together. The first and easiest is simply to return true for svm_interrupt_allowed if HF_VINTR_MASK is set and HIF is set. This way the semantics of svm_interrupt_allowed are respected: svm_interrupt_allowed being false does not mean "call enable_irq_window", it means "interrupts cannot be injected now". After doing this, however, we need another place to inject the interrupt, and fortunately we already have one, check_nested_events, which nested SVM does not implement but which is meant exactly for this purpose. It is called before interrupts are injected, and it can therefore do the L2->L1 switch while leaving inject_pending_event none the wiser. This patch was developed together with Cathy Avery, who wrote the test and did a lot of the initial debugging. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 68 ++++++++++++++++++++++++------------------------------ 1 file changed, 30 insertions(+), 38 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 0bcac23210e2..80f15a2483f4 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3135,43 +3135,36 @@ static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, return vmexit; } -/* This function returns true if it is save to enable the irq window */ -static inline bool nested_svm_intr(struct vcpu_svm *svm) +static void nested_svm_intr(struct vcpu_svm *svm) { - if (!is_guest_mode(&svm->vcpu)) - return true; - - if (!(svm->vcpu.arch.hflags & HF_VINTR_MASK)) - return true; - - if (!(svm->vcpu.arch.hflags & HF_HIF_MASK)) - return false; - - /* - * if vmexit was already requested (by intercepted exception - * for instance) do not overwrite it with "external interrupt" - * vmexit. - */ - if (svm->nested.exit_required) - return false; - svm->vmcb->control.exit_code = SVM_EXIT_INTR; svm->vmcb->control.exit_info_1 = 0; svm->vmcb->control.exit_info_2 = 0; - if (svm->nested.intercept & 1ULL) { - /* - * The #vmexit can't be emulated here directly because this - * code path runs with irqs and preemption disabled. A - * #vmexit emulation might sleep. Only signal request for - * the #vmexit here. - */ - svm->nested.exit_required = true; - trace_kvm_nested_intr_vmexit(svm->vmcb->save.rip); - return false; + /* nested_svm_vmexit this gets called afterwards from handle_exit */ + svm->nested.exit_required = true; + trace_kvm_nested_intr_vmexit(svm->vmcb->save.rip); +} + +static bool nested_exit_on_intr(struct vcpu_svm *svm) +{ + return (svm->nested.intercept & 1ULL); +} + +static int svm_check_nested_events(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + bool block_nested_events = + kvm_event_needs_reinjection(vcpu) || svm->nested.exit_required; + + if (kvm_cpu_has_interrupt(vcpu) && nested_exit_on_intr(svm)) { + if (block_nested_events) + return -EBUSY; + nested_svm_intr(svm); + return 0; } - return true; + return 0; } /* This function returns true if it is save to enable the nmi window */ @@ -5546,18 +5539,15 @@ static int svm_interrupt_allowed(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); struct vmcb *vmcb = svm->vmcb; - int ret; if (!gif_set(svm) || (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK)) return 0; - ret = !!(kvm_get_rflags(vcpu) & X86_EFLAGS_IF); - - if (is_guest_mode(vcpu)) - return ret && !(svm->vcpu.arch.hflags & HF_VINTR_MASK); - - return ret; + if (is_guest_mode(vcpu) && (svm->vcpu.arch.hflags & HF_VINTR_MASK)) + return !!(svm->vcpu.arch.hflags & HF_HIF_MASK); + else + return !!(kvm_get_rflags(vcpu) & X86_EFLAGS_IF); } static void enable_irq_window(struct kvm_vcpu *vcpu) @@ -5572,7 +5562,7 @@ static void enable_irq_window(struct kvm_vcpu *vcpu) * enabled, the STGI interception will not occur. Enable the irq * window under the assumption that the hardware will set the GIF. */ - if ((vgif_enabled(svm) || gif_set(svm)) && nested_svm_intr(svm)) { + if (vgif_enabled(svm) || gif_set(svm)) { /* * IRQ window is not needed when AVIC is enabled, * unless we have pending ExtINT since it cannot be injected @@ -7467,6 +7457,8 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .need_emulation_on_page_fault = svm_need_emulation_on_page_fault, .apic_init_signal_blocked = svm_apic_init_signal_blocked, + + .check_nested_events = svm_check_nested_events, }; static int __init svm_init(void) -- cgit From 78f2145c4d937a4770f365d6f5dc6eb825658d10 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Wed, 4 Mar 2020 17:05:44 -0500 Subject: KVM: nSVM: avoid loss of pending IRQ/NMI before entering L2 This patch reproduces for nSVM the change that was made for nVMX in commit b5861e5cf2fc ("KVM: nVMX: Fix loss of pending IRQ/NMI before entering L2"). While I do not have a test that breaks without it, I cannot see why it would not be necessary since all events are unblocked by VMRUN's setting of GIF back to 1. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 80f15a2483f4..c923ad1d7321 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3576,6 +3576,10 @@ static bool nested_vmcb_checks(struct vmcb *vmcb) static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, struct vmcb *nested_vmcb, struct kvm_host_map *map) { + bool evaluate_pending_interrupts = + is_intercept(svm, INTERCEPT_VINTR) || + is_intercept(svm, INTERCEPT_IRET); + if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF) svm->vcpu.arch.hflags |= HF_HIF_MASK; else @@ -3662,7 +3666,21 @@ static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, svm->nested.vmcb = vmcb_gpa; + /* + * If L1 had a pending IRQ/NMI before executing VMRUN, + * which wasn't delivered because it was disallowed (e.g. + * interrupts disabled), L0 needs to evaluate if this pending + * event should cause an exit from L2 to L1 or be delivered + * directly to L2. + * + * Usually this would be handled by the processor noticing an + * IRQ/NMI window request. However, VMRUN can unblock interrupts + * by implicitly setting GIF, so force L0 to perform pending event + * evaluation by requesting a KVM_REQ_EVENT. + */ enable_gif(svm); + if (unlikely(evaluate_pending_interrupts)) + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); mark_all_dirty(svm->vmcb); } -- cgit From ab56f8e62dafe4c9bec9fc236937c9884bd9966d Mon Sep 17 00:00:00 2001 From: Suravee Suthikulpanit Date: Thu, 12 Mar 2020 05:39:28 -0500 Subject: kvm: svm: Introduce GA Log tracepoint for AVIC GA Log tracepoint is useful when debugging AVIC performance issue as it can be used with perf to count the number of times IOMMU AVIC injects interrupts through the slow-path instead of directly inject interrupts to the target vcpu. Signed-off-by: Suravee Suthikulpanit Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 1 + arch/x86/kvm/trace.h | 18 ++++++++++++++++++ arch/x86/kvm/x86.c | 1 + 3 files changed, 20 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index c923ad1d7321..7c9ddd680f22 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1232,6 +1232,7 @@ static int avic_ga_log_notifier(u32 ga_tag) u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag); pr_debug("SVM: %s: vm_id=%#x, vcpu_id=%#x\n", __func__, vm_id, vcpu_id); + trace_kvm_avic_ga_log(vm_id, vcpu_id); spin_lock_irqsave(&svm_vm_data_hash_lock, flags); hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) { diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index f5b8814d9f83..6c4d9b4caf07 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -1367,6 +1367,24 @@ TRACE_EVENT(kvm_avic_unaccelerated_access, __entry->vec) ); +TRACE_EVENT(kvm_avic_ga_log, + TP_PROTO(u32 vmid, u32 vcpuid), + TP_ARGS(vmid, vcpuid), + + TP_STRUCT__entry( + __field(u32, vmid) + __field(u32, vcpuid) + ), + + TP_fast_assign( + __entry->vmid = vmid; + __entry->vcpuid = vcpuid; + ), + + TP_printk("vmid=%u, vcpuid=%u", + __entry->vmid, __entry->vcpuid) +); + TRACE_EVENT(kvm_hv_timer_state, TP_PROTO(unsigned int vcpu_id, unsigned int hv_timer_in_use), TP_ARGS(vcpu_id, hv_timer_in_use), diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c67324138e17..a7cb85231330 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10562,4 +10562,5 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pml_full); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pi_irte_update); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi); +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_ga_log); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_apicv_update_request); -- cgit From 041bc42ce2d0efac3b85bbb81dea8c74b81f4ef9 Mon Sep 17 00:00:00 2001 From: Wanpeng Li Date: Fri, 13 Mar 2020 11:55:18 +0800 Subject: KVM: VMX: Micro-optimize vmexit time when not exposing PMU PMU is not exposed to guest by most of products from cloud providers since the bad performance of PMU emulation and security concern. However, it calls perf_guest_switch_get_msrs() and clear_atomic_switch_msr() unconditionally even if PMU is not exposed to the guest before each vmentry. ~2% vmexit time reduced can be observed by kvm-unit-tests/vmexit.flat on my SKX server. Before patch: vmcall 1559 After patch: vmcall 1529 Signed-off-by: Wanpeng Li Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e961633182f8..b447d66f44e6 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6549,7 +6549,8 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu) pt_guest_enter(vmx); - atomic_switch_perf_msrs(vmx); + if (vcpu_to_pmu(vcpu)->version) + atomic_switch_perf_msrs(vmx); atomic_switch_umwait_control_msr(vmx); if (enable_preemption_timer) -- cgit From 212617dbb6bac2a21dec6ef7d6012d96bb6dbb5d Mon Sep 17 00:00:00 2001 From: Oliver Upton Date: Mon, 24 Feb 2020 12:27:44 -0800 Subject: KVM: nVMX: Consolidate nested MTF checks to helper function commit 5ef8acbdd687 ("KVM: nVMX: Emulate MTF when performing instruction emulation") introduced a helper to check the MTF VM-execution control in vmcs12. Change pre-existing check in nested_vmx_exit_reflected() to instead use the helper. Signed-off-by: Oliver Upton Reviewed-by: Krish Sadhukhan Reviewed-by: Miaohe Lin Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 06585c1346ca..6c489ef95f68 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -5626,7 +5626,7 @@ bool nested_vmx_exit_reflected(struct kvm_vcpu *vcpu, u32 exit_reason) case EXIT_REASON_MWAIT_INSTRUCTION: return nested_cpu_has(vmcs12, CPU_BASED_MWAIT_EXITING); case EXIT_REASON_MONITOR_TRAP_FLAG: - return nested_cpu_has(vmcs12, CPU_BASED_MONITOR_TRAP_FLAG); + return nested_cpu_has_mtf(vmcs12); case EXIT_REASON_MONITOR_INSTRUCTION: return nested_cpu_has(vmcs12, CPU_BASED_MONITOR_EXITING); case EXIT_REASON_PAUSE_INSTRUCTION: -- cgit From 8e205a6b2a06764a4c2bfc9e1a6a8a8e7920faf8 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Sat, 14 Mar 2020 12:29:23 +0100 Subject: KVM: X86: correct meaningless kvm_apicv_activated() check After test_and_set_bit() for kvm->arch.apicv_inhibit_reasons, we will always get false when calling kvm_apicv_activated() because it's sure apicv_inhibit_reasons do not equal to 0. What the code wants to do, is check whether APICv was *already* active and if so skip the costly request; we can do this using cmpxchg. Reported-by: Miaohe Lin Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a7cb85231330..e54c6ad628a8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8049,19 +8049,26 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv); */ void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit) { + unsigned long old, new, expected; + if (!kvm_x86_ops->check_apicv_inhibit_reasons || !kvm_x86_ops->check_apicv_inhibit_reasons(bit)) return; - if (activate) { - if (!test_and_clear_bit(bit, &kvm->arch.apicv_inhibit_reasons) || - !kvm_apicv_activated(kvm)) - return; - } else { - if (test_and_set_bit(bit, &kvm->arch.apicv_inhibit_reasons) || - kvm_apicv_activated(kvm)) - return; - } + old = READ_ONCE(kvm->arch.apicv_inhibit_reasons); + do { + expected = new = old; + if (activate) + __clear_bit(bit, &new); + else + __set_bit(bit, &new); + if (new == old) + break; + old = cmpxchg(&kvm->arch.apicv_inhibit_reasons, expected, new); + } while (old != expected); + + if (!!old == !!new) + return; trace_kvm_apicv_update_request(activate, bit); if (kvm_x86_ops->pre_update_apicv_exec_ctrl) -- cgit From 0b66465344a7177411adb277e8ec33d9c5616b90 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 25 Feb 2020 11:05:15 +0800 Subject: KVM: nSVM: Remove an obsolete comment. The function does not return bool anymore. Signed-off-by: Miaohe Lin Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 7c9ddd680f22..08568ae9f7a1 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3284,9 +3284,6 @@ static int nested_svm_exit_special(struct vcpu_svm *svm) return NESTED_EXIT_CONTINUE; } -/* - * If this function returns true, this #vmexit was already handled - */ static int nested_svm_intercept(struct vcpu_svm *svm) { u32 exit_code = svm->vmcb->control.exit_code; -- cgit From e942dbf8c58e1bf1ccfe18eb2713e3b360ec2e7f Mon Sep 17 00:00:00 2001 From: Vitaly Kuznetsov Date: Mon, 9 Mar 2020 16:52:12 +0100 Subject: KVM: nVMX: stop abusing need_vmcs12_to_shadow_sync for eVMCS mapping When vmx_set_nested_state() happens, we may not have all the required data to map enlightened VMCS: e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not yet be restored so we need a postponed action. Currently, we (ab)use need_vmcs12_to_shadow_sync/nested_sync_vmcs12_to_shadow() for that but this is not ideal: - We may not need to sync anything if L2 is running - It is hard to propagate errors from nested_sync_vmcs12_to_shadow() as we call it from vmx_prepare_switch_to_guest() which happens just before we do VMLAUNCH, the code is not ready to handle errors there. Move eVMCS mapping to nested_get_vmcs12_pages() and request KVM_REQ_GET_VMCS12_PAGES, it seems to be is less abusive in nature. It would probably be possible to introduce a specialized KVM_REQ_EVMCS_MAP but it is undesirable to propagate eVMCS specifics all the way up to x86.c Note, we don't need to request KVM_REQ_GET_VMCS12_PAGES from vmx_set_nested_state() directly as nested_vmx_enter_non_root_mode() already does that. Requesting KVM_REQ_GET_VMCS12_PAGES is done to document the (non-obvious) side-effect and to be future proof. Signed-off-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 98b82ccdf5f0..424d9a77af86 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -1996,14 +1996,6 @@ void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); - /* - * hv_evmcs may end up being not mapped after migration (when - * L2 was running), map it here to make sure vmcs12 changes are - * properly reflected. - */ - if (vmx->nested.enlightened_vmcs_enabled && !vmx->nested.hv_evmcs) - nested_vmx_handle_enlightened_vmptrld(vcpu, false); - if (vmx->nested.hv_evmcs) { copy_vmcs12_to_enlightened(vmx); /* All fields are clean */ @@ -3062,6 +3054,14 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) struct page *page; u64 hpa; + /* + * hv_evmcs may end up being not mapped after migration (when + * L2 was running), map it here to make sure vmcs12 changes are + * properly reflected. + */ + if (vmx->nested.enlightened_vmcs_enabled && !vmx->nested.hv_evmcs) + nested_vmx_handle_enlightened_vmptrld(vcpu, false); + if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) { /* * Translate L1 physical address to host physical @@ -5904,10 +5904,12 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu, set_current_vmptr(vmx, kvm_state->hdr.vmx.vmcs12_pa); } else if (kvm_state->flags & KVM_STATE_NESTED_EVMCS) { /* - * Sync eVMCS upon entry as we may not have - * HV_X64_MSR_VP_ASSIST_PAGE set up yet. + * nested_vmx_handle_enlightened_vmptrld() cannot be called + * directly from here as HV_X64_MSR_VP_ASSIST_PAGE may not be + * restored yet. EVMCS will be mapped from + * nested_get_vmcs12_pages(). */ - vmx->nested.need_vmcs12_to_shadow_sync = true; + kvm_make_request(KVM_REQ_GET_VMCS12_PAGES, vcpu); } else { return -EINVAL; } -- cgit From b6a0653ae2cd71a58f479b46ff20307dd3540d63 Mon Sep 17 00:00:00 2001 From: Vitaly Kuznetsov Date: Mon, 9 Mar 2020 16:52:13 +0100 Subject: KVM: nVMX: properly handle errors in nested_vmx_handle_enlightened_vmptrld() nested_vmx_handle_enlightened_vmptrld() fails in two cases: - when we fail to kvm_vcpu_map() the supplied GPA - when revision_id is incorrect. Genuine Hyper-V raises #UD in the former case (at least with *some* incorrect GPAs) and does VMfailInvalid() in the later. KVM doesn't do anything so L1 just gets stuck retrying the same faulty VMLAUNCH. nested_vmx_handle_enlightened_vmptrld() has two call sites: nested_vmx_run() and nested_get_vmcs12_pages(). The former needs to queue do much: the failure there happens after migration when L2 was running (and L1 did something weird like wrote to VP assist page from a different vCPU), just kill L1 with KVM_EXIT_INTERNAL_ERROR. Reported-by: Miaohe Lin Signed-off-by: Vitaly Kuznetsov [Squash kbuild autopatch. - Paolo] Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/evmcs.h | 7 +++++++ arch/x86/kvm/vmx/nested.c | 39 +++++++++++++++++++++++++++++---------- 2 files changed, 36 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/evmcs.h b/arch/x86/kvm/vmx/evmcs.h index 6de47f2569c9..e5f7a7ebf27d 100644 --- a/arch/x86/kvm/vmx/evmcs.h +++ b/arch/x86/kvm/vmx/evmcs.h @@ -198,6 +198,13 @@ static inline void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf) {} static inline void evmcs_touch_msr_bitmap(void) {} #endif /* IS_ENABLED(CONFIG_HYPERV) */ +enum nested_evmptrld_status { + EVMPTRLD_DISABLED, + EVMPTRLD_SUCCEEDED, + EVMPTRLD_VMFAIL, + EVMPTRLD_ERROR, +}; + bool nested_enlightened_vmentry(struct kvm_vcpu *vcpu, u64 *evmcs_gpa); uint16_t nested_get_evmcs_version(struct kvm_vcpu *vcpu); int nested_enable_evmcs(struct kvm_vcpu *vcpu, diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 424d9a77af86..8578513907d7 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -1909,18 +1909,18 @@ static int copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx) * This is an equivalent of the nested hypervisor executing the vmptrld * instruction. */ -static int nested_vmx_handle_enlightened_vmptrld(struct kvm_vcpu *vcpu, - bool from_launch) +static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld( + struct kvm_vcpu *vcpu, bool from_launch) { struct vcpu_vmx *vmx = to_vmx(vcpu); bool evmcs_gpa_changed = false; u64 evmcs_gpa; if (likely(!vmx->nested.enlightened_vmcs_enabled)) - return 1; + return EVMPTRLD_DISABLED; if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa)) - return 1; + return EVMPTRLD_DISABLED; if (unlikely(!vmx->nested.hv_evmcs || evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) { @@ -1931,7 +1931,7 @@ static int nested_vmx_handle_enlightened_vmptrld(struct kvm_vcpu *vcpu, if (kvm_vcpu_map(vcpu, gpa_to_gfn(evmcs_gpa), &vmx->nested.hv_evmcs_map)) - return 0; + return EVMPTRLD_ERROR; vmx->nested.hv_evmcs = vmx->nested.hv_evmcs_map.hva; @@ -1960,7 +1960,7 @@ static int nested_vmx_handle_enlightened_vmptrld(struct kvm_vcpu *vcpu, if ((vmx->nested.hv_evmcs->revision_id != KVM_EVMCS_VERSION) && (vmx->nested.hv_evmcs->revision_id != VMCS12_REVISION)) { nested_release_evmcs(vcpu); - return 0; + return EVMPTRLD_VMFAIL; } vmx->nested.dirty_vmcs12 = true; @@ -1989,7 +1989,7 @@ static int nested_vmx_handle_enlightened_vmptrld(struct kvm_vcpu *vcpu, vmx->nested.hv_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL; - return 1; + return EVMPTRLD_SUCCEEDED; } void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu) @@ -3059,8 +3059,21 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) * L2 was running), map it here to make sure vmcs12 changes are * properly reflected. */ - if (vmx->nested.enlightened_vmcs_enabled && !vmx->nested.hv_evmcs) - nested_vmx_handle_enlightened_vmptrld(vcpu, false); + if (vmx->nested.enlightened_vmcs_enabled && !vmx->nested.hv_evmcs) { + enum nested_evmptrld_status evmptrld_status = + nested_vmx_handle_enlightened_vmptrld(vcpu, false); + + if (evmptrld_status == EVMPTRLD_VMFAIL || + evmptrld_status == EVMPTRLD_ERROR) { + pr_debug_ratelimited("%s: enlightened vmptrld failed\n", + __func__); + vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; + vcpu->run->internal.suberror = + KVM_INTERNAL_ERROR_EMULATION; + vcpu->run->internal.ndata = 0; + return false; + } + } if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) { /* @@ -3325,12 +3338,18 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) enum nvmx_vmentry_status status; struct vcpu_vmx *vmx = to_vmx(vcpu); u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu); + enum nested_evmptrld_status evmptrld_status; if (!nested_vmx_check_permission(vcpu)) return 1; - if (!nested_vmx_handle_enlightened_vmptrld(vcpu, launch)) + evmptrld_status = nested_vmx_handle_enlightened_vmptrld(vcpu, launch); + if (evmptrld_status == EVMPTRLD_ERROR) { + kvm_queue_exception(vcpu, UD_VECTOR); return 1; + } else if (evmptrld_status == EVMPTRLD_VMFAIL) { + return nested_vmx_failInvalid(vcpu); + } if (!vmx->nested.hv_evmcs && vmx->nested.current_vmptr == -1ull) return nested_vmx_failInvalid(vcpu); -- cgit From 19d33357ecdf6f4d591b4c17f119bacd6ae834eb Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Mon, 16 Mar 2020 13:23:21 +0100 Subject: x86/amd_nb, char/amd64-agp: Use amd_nb_num() accessor ... to find whether there are northbridges present on the system. Convert the last forgotten user and therefore, unexport amd_nb_misc_ids[] too. Signed-off-by: Borislav Petkov Cc: Michal Kubecek Cc: Yazen Ghannam Link: https://lkml.kernel.org/r/20200316150725.925-1-bp@alien8.de --- arch/x86/include/asm/amd_nb.h | 1 - arch/x86/kernel/amd_nb.c | 4 +--- 2 files changed, 1 insertion(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/amd_nb.h b/arch/x86/include/asm/amd_nb.h index 1ae4e5791afa..c7df20e78b09 100644 --- a/arch/x86/include/asm/amd_nb.h +++ b/arch/x86/include/asm/amd_nb.h @@ -12,7 +12,6 @@ struct amd_nb_bus_dev_range { u8 dev_limit; }; -extern const struct pci_device_id amd_nb_misc_ids[]; extern const struct amd_nb_bus_dev_range amd_nb_bus_dev_ranges[]; extern bool early_is_amd_nb(u32 value); diff --git a/arch/x86/kernel/amd_nb.c b/arch/x86/kernel/amd_nb.c index 69aed0ebbdfc..b6b3297851f3 100644 --- a/arch/x86/kernel/amd_nb.c +++ b/arch/x86/kernel/amd_nb.c @@ -36,10 +36,9 @@ static const struct pci_device_id amd_root_ids[] = { {} }; - #define PCI_DEVICE_ID_AMD_CNB17H_F4 0x1704 -const struct pci_device_id amd_nb_misc_ids[] = { +static const struct pci_device_id amd_nb_misc_ids[] = { { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB_MISC) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_10H_NB_MISC) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F3) }, @@ -56,7 +55,6 @@ const struct pci_device_id amd_nb_misc_ids[] = { { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_DF_F3) }, {} }; -EXPORT_SYMBOL_GPL(amd_nb_misc_ids); static const struct pci_device_id amd_nb_link_ids[] = { { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F4) }, -- cgit From 6db73f17c5f155dbcfd5e48e621c706270b84df0 Mon Sep 17 00:00:00 2001 From: Thomas Hellstrom Date: Wed, 4 Mar 2020 12:45:26 +0100 Subject: x86: Don't let pgprot_modify() change the page encryption bit When SEV or SME is enabled and active, vm_get_page_prot() typically returns with the encryption bit set. This means that users of pgprot_modify(, vm_get_page_prot()) (mprotect_fixup(), do_mmap()) end up with a value of vma->vm_pg_prot that is not consistent with the intended protection of the PTEs. This is also important for fault handlers that rely on the VMA vm_page_prot to set the page protection. Fix this by not allowing pgprot_modify() to change the encryption bit, similar to how it's done for PAT bits. Signed-off-by: Thomas Hellstrom Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Acked-by: Tom Lendacky Link: https://lkml.kernel.org/r/20200304114527.3636-2-thomas_os@shipmail.org --- arch/x86/include/asm/pgtable.h | 7 +++++-- arch/x86/include/asm/pgtable_types.h | 2 +- 2 files changed, 6 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 7e118660bbd9..64a03f226ab7 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -627,12 +627,15 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) return __pmd(val); } -/* mprotect needs to preserve PAT bits when updating vm_page_prot */ +/* + * mprotect needs to preserve PAT and encryption bits when updating + * vm_page_prot + */ #define pgprot_modify pgprot_modify static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) { pgprotval_t preservebits = pgprot_val(oldprot) & _PAGE_CHG_MASK; - pgprotval_t addbits = pgprot_val(newprot); + pgprotval_t addbits = pgprot_val(newprot) & ~_PAGE_CHG_MASK; return __pgprot(preservebits | addbits); } diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 0239998d8cdc..65c2ecd730c5 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -118,7 +118,7 @@ */ #define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \ - _PAGE_SOFT_DIRTY | _PAGE_DEVMAP) + _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC) #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE) /* -- cgit From 4dcc3df82573a946c620dda5fb00e27c7b080105 Mon Sep 17 00:00:00 2001 From: Kim Phillips Date: Fri, 13 Mar 2020 18:10:22 -0500 Subject: perf/amd/uncore: Prepare L3 thread mask code for Family 19h In order to better accommodate the upcoming Family 19h, given the 80-char line limit, move the existing code into a new l3_thread_slice_mask() function. No functional changes. [ bp: Touchups. ] Signed-off-by: Kim Phillips Signed-off-by: Borislav Petkov Acked-by: Peter Zijlstra Link: https://lkml.kernel.org/r/20200313231024.17601-1-kim.phillips@amd.com --- arch/x86/events/amd/uncore.c | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c index a6ea07f2aa84..2abcb1abd07c 100644 --- a/arch/x86/events/amd/uncore.c +++ b/arch/x86/events/amd/uncore.c @@ -180,6 +180,20 @@ static void amd_uncore_del(struct perf_event *event, int flags) hwc->idx = -1; } +/* + * Convert logical CPU number to L3 PMC Config ThreadMask format + */ +static u64 l3_thread_slice_mask(int cpu) +{ + int thread = 2 * (cpu_data(cpu).cpu_core_id % 4); + + if (smp_num_siblings > 1) + thread += cpu_data(cpu).apicid & 1; + + return (1ULL << (AMD64_L3_THREAD_SHIFT + thread) & + AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK; +} + static int amd_uncore_event_init(struct perf_event *event) { struct amd_uncore *uncore; @@ -209,15 +223,8 @@ static int amd_uncore_event_init(struct perf_event *event) * SliceMask and ThreadMask need to be set for certain L3 events in * Family 17h. For other events, the two fields do not affect the count. */ - if (l3_mask && is_llc_event(event)) { - int thread = 2 * (cpu_data(event->cpu).cpu_core_id % 4); - - if (smp_num_siblings > 1) - thread += cpu_data(event->cpu).apicid & 1; - - hwc->config |= (1ULL << (AMD64_L3_THREAD_SHIFT + thread) & - AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK; - } + if (l3_mask && is_llc_event(event)) + hwc->config |= l3_thread_slice_mask(event->cpu); uncore = event_to_amd_uncore(event); if (!uncore) -- cgit From 9689dbbeaea884d19e3085439c6a247ef986b2af Mon Sep 17 00:00:00 2001 From: Kim Phillips Date: Fri, 13 Mar 2020 18:10:23 -0500 Subject: perf/amd/uncore: Make L3 thread mask code more readable Convert the l3_thread_slice_mask() function to use the more readable topology_* helper functions, more intuitive variable names like shift and thread_mask, and BIT_ULL(). No functional changes. Signed-off-by: Kim Phillips Signed-off-by: Borislav Petkov Acked-by: Peter Zijlstra Link: https://lkml.kernel.org/r/20200313231024.17601-2-kim.phillips@amd.com --- arch/x86/events/amd/uncore.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c index 2abcb1abd07c..07af497b517f 100644 --- a/arch/x86/events/amd/uncore.c +++ b/arch/x86/events/amd/uncore.c @@ -185,13 +185,16 @@ static void amd_uncore_del(struct perf_event *event, int flags) */ static u64 l3_thread_slice_mask(int cpu) { - int thread = 2 * (cpu_data(cpu).cpu_core_id % 4); + u64 thread_mask, core = topology_core_id(cpu); + unsigned int shift, thread = 0; - if (smp_num_siblings > 1) - thread += cpu_data(cpu).apicid & 1; + if (topology_smt_supported() && !topology_is_primary_thread(cpu)) + thread = 1; - return (1ULL << (AMD64_L3_THREAD_SHIFT + thread) & - AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK; + shift = AMD64_L3_THREAD_SHIFT + 2 * (core % 4) + thread; + thread_mask = BIT_ULL(shift); + + return AMD64_L3_SLICE_MASK | thread_mask; } static int amd_uncore_event_init(struct perf_event *event) -- cgit From e48667b865480d8bf0f1171a8b474ffc785b9ace Mon Sep 17 00:00:00 2001 From: Kim Phillips Date: Fri, 13 Mar 2020 18:10:24 -0500 Subject: perf/amd/uncore: Add support for Family 19h L3 PMU Family 19h introduces change in slice, core and thread specification in its L3 Performance Event Select (ChL3PmcCfg) h/w register. The change is incompatible with Family 17h's version of the register. Introduce a new path in l3_thread_slice_mask() to do things differently for Family 19h vs. Family 17h, otherwise the new hardware doesn't get programmed correctly. Instead of a linear core--thread bitmask, Family 19h takes an encoded core number, and a separate thread mask. There are new bits that are set for all cores and all slices, of which only the latter is used, since the driver counts events for all slices on behalf of the specified CPU. Also update amd_uncore_init() to base its L2/NB vs. L3/Data Fabric mode decision based on Family 17h or above, not just 17h and 18h: the Family 19h Data Fabric PMC is compatible with the Family 17h DF PMC. [ bp: Touchups. ] Signed-off-by: Kim Phillips Signed-off-by: Borislav Petkov Acked-by: Peter Zijlstra Link: https://lkml.kernel.org/r/20200313231024.17601-3-kim.phillips@amd.com --- arch/x86/events/amd/uncore.c | 20 ++++++++++++++------ arch/x86/include/asm/perf_event.h | 15 +++++++++++++-- 2 files changed, 27 insertions(+), 8 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c index 07af497b517f..46018e515fe2 100644 --- a/arch/x86/events/amd/uncore.c +++ b/arch/x86/events/amd/uncore.c @@ -191,10 +191,18 @@ static u64 l3_thread_slice_mask(int cpu) if (topology_smt_supported() && !topology_is_primary_thread(cpu)) thread = 1; - shift = AMD64_L3_THREAD_SHIFT + 2 * (core % 4) + thread; + if (boot_cpu_data.x86 <= 0x18) { + shift = AMD64_L3_THREAD_SHIFT + 2 * (core % 4) + thread; + thread_mask = BIT_ULL(shift); + + return AMD64_L3_SLICE_MASK | thread_mask; + } + + core = (core << AMD64_L3_COREID_SHIFT) & AMD64_L3_COREID_MASK; + shift = AMD64_L3_THREAD_SHIFT + thread; thread_mask = BIT_ULL(shift); - return AMD64_L3_SLICE_MASK | thread_mask; + return AMD64_L3_EN_ALL_SLICES | core | thread_mask; } static int amd_uncore_event_init(struct perf_event *event) @@ -223,8 +231,8 @@ static int amd_uncore_event_init(struct perf_event *event) return -EINVAL; /* - * SliceMask and ThreadMask need to be set for certain L3 events in - * Family 17h. For other events, the two fields do not affect the count. + * SliceMask and ThreadMask need to be set for certain L3 events. + * For other events, the two fields do not affect the count. */ if (l3_mask && is_llc_event(event)) hwc->config |= l3_thread_slice_mask(event->cpu); @@ -533,9 +541,9 @@ static int __init amd_uncore_init(void) if (!boot_cpu_has(X86_FEATURE_TOPOEXT)) return -ENODEV; - if (boot_cpu_data.x86 == 0x17 || boot_cpu_data.x86 == 0x18) { + if (boot_cpu_data.x86 >= 0x17) { /* - * For F17h or F18h, the Northbridge counters are + * For F17h and above, the Northbridge counters are * repurposed as Data Fabric counters. Also, L3 * counters are supported too. The PMUs are exported * based on family as either L2 or L3 and NB or DF. diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 29964b0e1075..e855e9cf2c37 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -50,11 +50,22 @@ #define AMD64_L3_SLICE_SHIFT 48 #define AMD64_L3_SLICE_MASK \ - ((0xFULL) << AMD64_L3_SLICE_SHIFT) + (0xFULL << AMD64_L3_SLICE_SHIFT) +#define AMD64_L3_SLICEID_MASK \ + (0x7ULL << AMD64_L3_SLICE_SHIFT) #define AMD64_L3_THREAD_SHIFT 56 #define AMD64_L3_THREAD_MASK \ - ((0xFFULL) << AMD64_L3_THREAD_SHIFT) + (0xFFULL << AMD64_L3_THREAD_SHIFT) +#define AMD64_L3_F19H_THREAD_MASK \ + (0x3ULL << AMD64_L3_THREAD_SHIFT) + +#define AMD64_L3_EN_ALL_CORES BIT_ULL(47) +#define AMD64_L3_EN_ALL_SLICES BIT_ULL(46) + +#define AMD64_L3_COREID_SHIFT 42 +#define AMD64_L3_COREID_MASK \ + (0x7ULL << AMD64_L3_COREID_SHIFT) #define X86_RAW_EVENT_MASK \ (ARCH_PERFMON_EVENTSEL_EVENT | \ -- cgit From bb03911f79f610c857cf82b8ccd434101dd05356 Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Tue, 10 Mar 2020 18:10:24 +0100 Subject: KVM: VMX: access regs array in vmenter.S in its natural order Registers in "regs" array are indexed as rax/rcx/rdx/.../rsi/rdi/r8/... Reorder access to "regs" array in vmenter.S to follow its natural order. Signed-off-by: Uros Bizjak Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmenter.S | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S index 81ada2ce99e7..ca2065166d1d 100644 --- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -135,12 +135,12 @@ SYM_FUNC_START(__vmx_vcpu_run) cmpb $0, %bl /* Load guest registers. Don't clobber flags. */ - mov VCPU_RBX(%_ASM_AX), %_ASM_BX mov VCPU_RCX(%_ASM_AX), %_ASM_CX mov VCPU_RDX(%_ASM_AX), %_ASM_DX + mov VCPU_RBX(%_ASM_AX), %_ASM_BX + mov VCPU_RBP(%_ASM_AX), %_ASM_BP mov VCPU_RSI(%_ASM_AX), %_ASM_SI mov VCPU_RDI(%_ASM_AX), %_ASM_DI - mov VCPU_RBP(%_ASM_AX), %_ASM_BP #ifdef CONFIG_X86_64 mov VCPU_R8 (%_ASM_AX), %r8 mov VCPU_R9 (%_ASM_AX), %r9 @@ -168,12 +168,12 @@ SYM_FUNC_START(__vmx_vcpu_run) /* Save all guest registers, including RAX from the stack */ __ASM_SIZE(pop) VCPU_RAX(%_ASM_AX) - mov %_ASM_BX, VCPU_RBX(%_ASM_AX) mov %_ASM_CX, VCPU_RCX(%_ASM_AX) mov %_ASM_DX, VCPU_RDX(%_ASM_AX) + mov %_ASM_BX, VCPU_RBX(%_ASM_AX) + mov %_ASM_BP, VCPU_RBP(%_ASM_AX) mov %_ASM_SI, VCPU_RSI(%_ASM_AX) mov %_ASM_DI, VCPU_RDI(%_ASM_AX) - mov %_ASM_BP, VCPU_RBP(%_ASM_AX) #ifdef CONFIG_X86_64 mov %r8, VCPU_R8 (%_ASM_AX) mov %r9, VCPU_R9 (%_ASM_AX) @@ -197,12 +197,12 @@ SYM_FUNC_START(__vmx_vcpu_run) * free. RSP and RAX are exempt as RSP is restored by hardware during * VM-Exit and RAX is explicitly loaded with 0 or 1 to return VM-Fail. */ -1: xor %ebx, %ebx - xor %ecx, %ecx +1: xor %ecx, %ecx xor %edx, %edx + xor %ebx, %ebx + xor %ebp, %ebp xor %esi, %esi xor %edi, %edi - xor %ebp, %ebp #ifdef CONFIG_X86_64 xor %r8d, %r8d xor %r9d, %r9d -- cgit From aa61ee7b9ee3cb84c0d3a842b0d17937bf024c46 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Wed, 11 Mar 2020 09:18:23 +0800 Subject: x86/mm: Remove the now redundant N_MEMORY check In commit f70029bbaacb ("mm, memory_hotplug: drop CONFIG_MOVABLE_NODE") the dependency on CONFIG_MOVABLE_NODE was removed for N_MEMORY. Before, CONFIG_HIGHMEM && !CONFIG_MOVABLE_NODE could make (N_MEMORY == N_NORMAL_MEMORY) be true. After that commit, N_MEMORY cannot be equal to N_NORMAL_MEMORY. So the conditional check in paging_init() is not needed anymore, remove it. [ bp: Massage. ] Signed-off-by: Baoquan He Signed-off-by: Borislav Petkov Reviewed-by: Wei Yang Acked-by: Michal Hocko Link: https://lkml.kernel.org/r/20200311011823.27740-1-bhe@redhat.com --- arch/x86/mm/init_64.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index abbdecb75fad..0a14711d3a93 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -818,8 +818,7 @@ void __init paging_init(void) * will not set it back. */ node_clear_state(0, N_MEMORY); - if (N_MEMORY != N_NORMAL_MEMORY) - node_clear_state(0, N_NORMAL_MEMORY); + node_clear_state(0, N_NORMAL_MEMORY); zone_sizes_init(); } -- cgit From 023f270b44cd2df0741c2c6a1587ee698d8cc7c8 Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Thu, 24 Oct 2019 17:30:08 +0200 Subject: x86/boot: Fix comment spelling Fix misspelling of "disconnect". Signed-off-by: Geert Uytterhoeven Signed-off-by: Jiri Kosina --- arch/x86/boot/apm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/apm.c b/arch/x86/boot/apm.c index b72fc10fc1be..bda15f9673d5 100644 --- a/arch/x86/boot/apm.c +++ b/arch/x86/boot/apm.c @@ -60,7 +60,7 @@ int query_apm_bios(void) intcall(0x15, &ireg, &oreg); if ((oreg.eflags & X86_EFLAGS_CF) || oreg.bx != 0x504d) { - /* Failure with 32-bit connect, try to disconect and ignore */ + /* Failure with 32-bit connect, try to disconnect and ignore */ ireg.al = 0x04; intcall(0x15, &ireg, NULL); return -1; -- cgit From 96b100cd1464ce41d5d6b083b1fe5ceace8eef8a Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Tue, 17 Mar 2020 18:32:50 +0100 Subject: KVM: nVMX: remove side effects from nested_vmx_exit_reflected The name of nested_vmx_exit_reflected suggests that it's purely a test, but it actually marks VMCS12 pages as dirty. Move this to vmx_handle_exit, observing that the initial nested_run_pending check in nested_vmx_exit_reflected is pointless---nested_run_pending has just been cleared in vmx_vcpu_run and won't be set until handle_vmlaunch or handle_vmresume. Suggested-by: Vitaly Kuznetsov Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 18 ++---------------- arch/x86/kvm/vmx/nested.h | 1 + arch/x86/kvm/vmx/vmx.c | 19 +++++++++++++++++-- 3 files changed, 20 insertions(+), 18 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 8578513907d7..4ff859c99946 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -3527,7 +3527,7 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu, } -static void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu) +void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); gfn_t gfn; @@ -5543,8 +5543,7 @@ bool nested_vmx_exit_reflected(struct kvm_vcpu *vcpu, u32 exit_reason) struct vcpu_vmx *vmx = to_vmx(vcpu); struct vmcs12 *vmcs12 = get_vmcs12(vcpu); - if (vmx->nested.nested_run_pending) - return false; + WARN_ON_ONCE(vmx->nested.nested_run_pending); if (unlikely(vmx->fail)) { trace_kvm_nested_vmenter_failed( @@ -5553,19 +5552,6 @@ bool nested_vmx_exit_reflected(struct kvm_vcpu *vcpu, u32 exit_reason) return true; } - /* - * The host physical addresses of some pages of guest memory - * are loaded into the vmcs02 (e.g. vmcs12's Virtual APIC - * Page). The CPU may write to these pages via their host - * physical address while L2 is running, bypassing any - * address-translation-based dirty tracking (e.g. EPT write - * protection). - * - * Mark them dirty on every exit from L2 to prevent them from - * getting out of sync with dirty tracking. - */ - nested_mark_vmcs12_pages_dirty(vcpu); - trace_kvm_nested_vmexit(kvm_rip_read(vcpu), exit_reason, vmcs_readl(EXIT_QUALIFICATION), vmx->idt_vectoring_info, diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h index 21d36652f213..f70968b76d33 100644 --- a/arch/x86/kvm/vmx/nested.h +++ b/arch/x86/kvm/vmx/nested.h @@ -33,6 +33,7 @@ int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata); int get_vmx_mem_address(struct kvm_vcpu *vcpu, unsigned long exit_qualification, u32 vmx_instruction_info, bool wr, int len, gva_t *ret); void nested_vmx_pmu_entry_exit_ctls_update(struct kvm_vcpu *vcpu); +void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu); bool nested_vmx_check_io_bitmaps(struct kvm_vcpu *vcpu, unsigned int port, int size); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index b447d66f44e6..07299a957d4a 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5851,8 +5851,23 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu, if (vmx->emulation_required) return handle_invalid_guest_state(vcpu); - if (is_guest_mode(vcpu) && nested_vmx_exit_reflected(vcpu, exit_reason)) - return nested_vmx_reflect_vmexit(vcpu, exit_reason); + if (is_guest_mode(vcpu)) { + /* + * The host physical addresses of some pages of guest memory + * are loaded into the vmcs02 (e.g. vmcs12's Virtual APIC + * Page). The CPU may write to these pages via their host + * physical address while L2 is running, bypassing any + * address-translation-based dirty tracking (e.g. EPT write + * protection). + * + * Mark them dirty on every exit from L2 to prevent them from + * getting out of sync with dirty tracking. + */ + nested_mark_vmcs12_pages_dirty(vcpu); + + if (nested_vmx_exit_reflected(vcpu, exit_reason)) + return nested_vmx_reflect_vmexit(vcpu, exit_reason); + } if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) { dump_vmcs(); -- cgit From 9401f2e5b0cee7a122eb05452850e0e799d32d4b Mon Sep 17 00:00:00 2001 From: Zhenyu Wang Date: Wed, 18 Mar 2020 11:27:39 +0800 Subject: KVM: x86: Expose AVX512 VP2INTERSECT in cpuid for TGL On Tigerlake new AVX512 VP2INTERSECT feature is available. This allows to expose it via KVM_GET_SUPPORTED_CPUID. Cc: "Zhong, Yang" Signed-off-by: Zhenyu Wang Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 08280d8a2ac9..435a7da07d5f 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -338,7 +338,7 @@ void kvm_set_cpu_caps(void) kvm_cpu_cap_mask(CPUID_7_EDX, F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) | F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) | - F(MD_CLEAR) + F(MD_CLEAR) | F(AVX512_VP2INTERSECT) ); /* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */ -- cgit From 1651e700664b4597ddf4f8adfe435252a0d11277 Mon Sep 17 00:00:00 2001 From: Jesse Brandeburg Date: Tue, 10 Mar 2020 15:17:46 -0700 Subject: x86: Fix bitops.h warning with a moved cast Fix many sparse warnings when building with C=1. These are useless noise from the bitops.h file and getting rid of them helps developers make more use of the tools and possibly find real bugs. When the kernel is compiled with C=1, there are lots of messages like: arch/x86/include/asm/bitops.h:77:37: warning: cast truncates bits from constant value (ffffff7f becomes 7f) CONST_MASK() is using a signed integer "1" to create the mask which is later cast to (u8), in order to yield an 8-bit value for the assembly instructions to use. Simplify the expressions used to clearly indicate they are working on 8-bit values only, which still keeps sparse happy without an accidental promotion to a 32 bit integer. The warning was occurring because certain bitmasks that end with a bit set next to a natural boundary like 7, 15, 23, 31, end up with a mask like 0x7f, which then results in sign extension due to the integer type promotion rules[1]. It was really only clear_bit() that was having problems, and it was only on some bit checks that resulted in a mask like 0xffffff7f being generated after the inversion. Verify with a test module (see next patch) and assembly inspection that the fix doesn't introduce any change in generated code. [ bp: Massage. ] Signed-off-by: Jesse Brandeburg Signed-off-by: Borislav Petkov Reviewed-by: Andy Shevchenko Acked-by: Luc Van Oostenryck Acked-by: Peter Zijlstra (Intel) Link: https://stackoverflow.com/questions/46073295/implicit-type-promotion-rules [1] Link: https://lkml.kernel.org/r/20200310221747.2848474-1-jesse.brandeburg@intel.com --- arch/x86/include/asm/bitops.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h index 062cdecb2f24..53f246e9df5a 100644 --- a/arch/x86/include/asm/bitops.h +++ b/arch/x86/include/asm/bitops.h @@ -54,7 +54,7 @@ arch_set_bit(long nr, volatile unsigned long *addr) if (__builtin_constant_p(nr)) { asm volatile(LOCK_PREFIX "orb %1,%0" : CONST_MASK_ADDR(nr, addr) - : "iq" ((u8)CONST_MASK(nr)) + : "iq" (CONST_MASK(nr) & 0xff) : "memory"); } else { asm volatile(LOCK_PREFIX __ASM_SIZE(bts) " %1,%0" @@ -74,7 +74,7 @@ arch_clear_bit(long nr, volatile unsigned long *addr) if (__builtin_constant_p(nr)) { asm volatile(LOCK_PREFIX "andb %1,%0" : CONST_MASK_ADDR(nr, addr) - : "iq" ((u8)~CONST_MASK(nr))); + : "iq" (CONST_MASK(nr) ^ 0xff)); } else { asm volatile(LOCK_PREFIX __ASM_SIZE(btr) " %1,%0" : : RLONG_ADDR(addr), "Ir" (nr) : "memory"); -- cgit From d55c9d4009c7622e081cbe599c673076b9854ea1 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Wed, 18 Mar 2020 13:41:32 +0100 Subject: KVM: nSVM: check for EFER.SVME=1 before entering guest EFER is set for L2 using svm_set_efer, which hardcodes EFER_SVME to 1 and hides an incorrect value for EFER.SVME in the L1 VMCB. Perform the check manually to detect invalid guest state. Reported-by: Krish Sadhukhan Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 08568ae9f7a1..2125c6ae5951 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3558,6 +3558,9 @@ static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm) static bool nested_vmcb_checks(struct vmcb *vmcb) { + if ((vmcb->save.efer & EFER_SVME) == 0) + return false; + if ((vmcb->control.intercept & (1ULL << INTERCEPT_VMRUN)) == 0) return false; -- cgit From e7adda281063becab7631c6ac60f8b19684ba042 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 17 Mar 2020 12:53:53 -0700 Subject: KVM: x86: Add requested index to the CPUID tracepoint Output the requested index when tracing CPUID emulation; it's basically mandatory for leafs where the index is meaningful, and is helpful for verifying KVM correctness even when the index isn't meaningful, e.g. the trace for a Linux guest's hypervisor_cpuid_base() probing appears to be broken (returns all zeroes) at first glance, but is correct because the index is non-zero, i.e. the output values correspond to a random index in the maximum basic leaf. Suggested-by: Xiaoyao Li Cc: Jan Kiszka Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 2 +- arch/x86/kvm/trace.h | 13 ++++++++----- 2 files changed, 9 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 435a7da07d5f..2de721f3c420 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -1026,7 +1026,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, } } } - trace_kvm_cpuid(orig_function, *eax, *ebx, *ecx, *edx, exact); + trace_kvm_cpuid(orig_function, index, *eax, *ebx, *ecx, *edx, exact); return exact; } EXPORT_SYMBOL_GPL(kvm_cpuid); diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index 6c4d9b4caf07..27270ba0f05f 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -151,12 +151,14 @@ TRACE_EVENT(kvm_fast_mmio, * Tracepoint for cpuid. */ TRACE_EVENT(kvm_cpuid, - TP_PROTO(unsigned int function, unsigned long rax, unsigned long rbx, - unsigned long rcx, unsigned long rdx, bool found), - TP_ARGS(function, rax, rbx, rcx, rdx, found), + TP_PROTO(unsigned int function, unsigned int index, unsigned long rax, + unsigned long rbx, unsigned long rcx, unsigned long rdx, + bool found), + TP_ARGS(function, index, rax, rbx, rcx, rdx, found), TP_STRUCT__entry( __field( unsigned int, function ) + __field( unsigned int, index ) __field( unsigned long, rax ) __field( unsigned long, rbx ) __field( unsigned long, rcx ) @@ -166,6 +168,7 @@ TRACE_EVENT(kvm_cpuid, TP_fast_assign( __entry->function = function; + __entry->index = index; __entry->rax = rax; __entry->rbx = rbx; __entry->rcx = rcx; @@ -173,8 +176,8 @@ TRACE_EVENT(kvm_cpuid, __entry->found = found; ), - TP_printk("func %x rax %lx rbx %lx rcx %lx rdx %lx, cpuid entry %s", - __entry->function, __entry->rax, + TP_printk("func %x idx %x rax %lx rbx %lx rcx %lx rdx %lx, cpuid entry %s", + __entry->function, __entry->index, __entry->rax, __entry->rbx, __entry->rcx, __entry->rdx, __entry->found ? "found" : "not found") ); -- cgit From 2b110b61644a34e97c92ae20788cbcb42d474fa9 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 17 Mar 2020 12:53:54 -0700 Subject: KVM: x86: Add blurb to CPUID tracepoint when using max basic leaf values Tack on "used max basic" at the end of the CPUID tracepoint when the output values correspond to the max basic leaf, i.e. when emulating Intel's out-of-range CPUID behavior. Observing "cpuid entry not found" in the tracepoint with non-zero output values is confusing for users that aren't familiar with the out-of-range semantics, and qualifying the "not found" case hopefully makes it clear that "found" means "found the exact entry". Suggested-by: Jan Kiszka Signed-off-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 9 ++++++--- arch/x86/kvm/trace.h | 11 +++++++---- 2 files changed, 13 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 2de721f3c420..990c76f34b03 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -990,13 +990,15 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, { u32 orig_function = *eax, function = *eax, index = *ecx; struct kvm_cpuid_entry2 *entry; - bool exact; + bool exact, used_max_basic = false; entry = kvm_find_cpuid_entry(vcpu, function, index); exact = !!entry; - if (!entry && !exact_only) + if (!entry && !exact_only) { entry = get_out_of_range_cpuid_entry(vcpu, &function, index); + used_max_basic = !!entry; + } if (entry) { *eax = entry->eax; @@ -1026,7 +1028,8 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, } } } - trace_kvm_cpuid(orig_function, index, *eax, *ebx, *ecx, *edx, exact); + trace_kvm_cpuid(orig_function, index, *eax, *ebx, *ecx, *edx, exact, + used_max_basic); return exact; } EXPORT_SYMBOL_GPL(kvm_cpuid); diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index 27270ba0f05f..c3d1e9f4a2c0 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -153,8 +153,8 @@ TRACE_EVENT(kvm_fast_mmio, TRACE_EVENT(kvm_cpuid, TP_PROTO(unsigned int function, unsigned int index, unsigned long rax, unsigned long rbx, unsigned long rcx, unsigned long rdx, - bool found), - TP_ARGS(function, index, rax, rbx, rcx, rdx, found), + bool found, bool used_max_basic), + TP_ARGS(function, index, rax, rbx, rcx, rdx, found, used_max_basic), TP_STRUCT__entry( __field( unsigned int, function ) @@ -164,6 +164,7 @@ TRACE_EVENT(kvm_cpuid, __field( unsigned long, rcx ) __field( unsigned long, rdx ) __field( bool, found ) + __field( bool, used_max_basic ) ), TP_fast_assign( @@ -174,12 +175,14 @@ TRACE_EVENT(kvm_cpuid, __entry->rcx = rcx; __entry->rdx = rdx; __entry->found = found; + __entry->used_max_basic = used_max_basic; ), - TP_printk("func %x idx %x rax %lx rbx %lx rcx %lx rdx %lx, cpuid entry %s", + TP_printk("func %x idx %x rax %lx rbx %lx rcx %lx rdx %lx, cpuid entry %s%s", __entry->function, __entry->index, __entry->rax, __entry->rbx, __entry->rcx, __entry->rdx, - __entry->found ? "found" : "not found") + __entry->found ? "found" : "not found", + __entry->used_max_basic ? ", used max basic" : "") ); #define AREG(x) { APIC_##x, "APIC_" #x } -- cgit From cf6c26ec7bf5386706cd6522708766eb6522995e Mon Sep 17 00:00:00 2001 From: Xiaoyao Li Date: Sat, 29 Feb 2020 10:52:12 +0800 Subject: KVM: x86: Code style cleanup in kvm_arch_dev_ioctl() In kvm_arch_dev_ioctl(), the brackets of case KVM_X86_GET_MCE_CAP_SUPPORTED accidently encapsulates case KVM_GET_MSR_FEATURE_INDEX_LIST and case KVM_GET_MSRS. It doesn't affect functionality but it's misleading. Remove unnecessary brackets and opportunistically add a "break" in the default path. Suggested-by: Sean Christopherson Signed-off-by: Xiaoyao Li Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e54c6ad628a8..a4e62d767dd6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3488,7 +3488,7 @@ long kvm_arch_dev_ioctl(struct file *filp, r = 0; break; } - case KVM_X86_GET_MCE_CAP_SUPPORTED: { + case KVM_X86_GET_MCE_CAP_SUPPORTED: r = -EFAULT; if (copy_to_user(argp, &kvm_mce_cap_supported, sizeof(kvm_mce_cap_supported))) @@ -3520,9 +3520,9 @@ long kvm_arch_dev_ioctl(struct file *filp, case KVM_GET_MSRS: r = msr_io(NULL, argp, do_get_msr_feature, 1); break; - } default: r = -EINVAL; + break; } out: return r; -- cgit From 71c3313a38aa09339a2442809e658fd233ab0757 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 11:43:18 -0500 Subject: x86: switch sigframe sigset handling to explict __get_user()/__put_user() ... and consolidate the definition of sigframe_ia32->extramask - it's always a 1-element array of 32bit unsigned. Signed-off-by: Al Viro --- arch/x86/ia32/ia32_signal.c | 16 +++++----------- arch/x86/include/asm/sigframe.h | 6 +----- arch/x86/kernel/signal.c | 20 ++++++++------------ 3 files changed, 14 insertions(+), 28 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index a3aefe9b9401..c72025d615f8 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -126,10 +126,7 @@ COMPAT_SYSCALL_DEFINE0(sigreturn) if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__get_user(set.sig[0], &frame->sc.oldmask) - || (_COMPAT_NSIG_WORDS > 1 - && __copy_from_user((((char *) &set.sig) + 4), - &frame->extramask, - sizeof(frame->extramask)))) + || __get_user(((__u32 *)&set)[1], &frame->extramask[0])) goto badframe; set_current_blocked(&set); @@ -153,7 +150,7 @@ COMPAT_SYSCALL_DEFINE0(rt_sigreturn) if (!access_ok(frame, sizeof(*frame))) goto badframe; - if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) + if (__get_user(set.sig[0], (__u64 __user *)&frame->uc.uc_sigmask)) goto badframe; set_current_blocked(&set); @@ -277,11 +274,8 @@ int ia32_setup_frame(int sig, struct ksignal *ksig, if (ia32_setup_sigcontext(&frame->sc, fpstate, regs, set->sig[0])) return -EFAULT; - if (_COMPAT_NSIG_WORDS > 1) { - if (__copy_to_user(frame->extramask, &set->sig[1], - sizeof(frame->extramask))) - return -EFAULT; - } + if (__put_user(set->sig[1], &frame->extramask[0])) + return -EFAULT; if (ksig->ka.sa.sa_flags & SA_RESTORER) { restorer = ksig->ka.sa.sa_restorer; @@ -381,7 +375,7 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig, err |= __copy_siginfo_to_user32(&frame->info, &ksig->info, false); err |= ia32_setup_sigcontext(&frame->uc.uc_mcontext, fpstate, regs, set->sig[0]); - err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); + err |= __put_user(*(__u64 *)set, (__u64 __user *)&frame->uc.uc_sigmask); if (err) return -EFAULT; diff --git a/arch/x86/include/asm/sigframe.h b/arch/x86/include/asm/sigframe.h index f176114c04d4..84eab2724875 100644 --- a/arch/x86/include/asm/sigframe.h +++ b/arch/x86/include/asm/sigframe.h @@ -33,11 +33,7 @@ struct sigframe_ia32 { * legacy application accessing/modifying it. */ struct _fpstate_32 fpstate_unused; -#ifdef CONFIG_IA32_EMULATION - unsigned int extramask[_COMPAT_NSIG_WORDS-1]; -#else /* !CONFIG_IA32_EMULATION */ - unsigned long extramask[_NSIG_WORDS-1]; -#endif /* CONFIG_IA32_EMULATION */ + unsigned int extramask[1]; char retcode[8]; /* fp state follows here */ }; diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 8a29573851a3..53ac66b3fd9b 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -326,11 +326,8 @@ __setup_frame(int sig, struct ksignal *ksig, sigset_t *set, if (setup_sigcontext(&frame->sc, fpstate, regs, set->sig[0])) return -EFAULT; - if (_NSIG_WORDS > 1) { - if (__copy_to_user(&frame->extramask, &set->sig[1], - sizeof(frame->extramask))) - return -EFAULT; - } + if (__put_user(set->sig[1], &frame->extramask[0])) + return -EFAULT; if (current->mm->context.vdso) restorer = current->mm->context.vdso + @@ -489,7 +486,7 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, } put_user_catch(err); err |= setup_sigcontext(&frame->uc.uc_mcontext, fp, regs, set->sig[0]); - err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); + err |= __put_user(set->sig[0], &frame->uc.uc_sigmask.sig[0]); if (err) return -EFAULT; @@ -575,7 +572,7 @@ static int x32_setup_rt_frame(struct ksignal *ksig, err |= setup_sigcontext(&frame->uc.uc_mcontext, fpstate, regs, set->sig[0]); - err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); + err |= __put_user(*(__u64 *)set, (__u64 __user *)&frame->uc.uc_sigmask); if (err) return -EFAULT; @@ -613,9 +610,8 @@ SYSCALL_DEFINE0(sigreturn) if (!access_ok(frame, sizeof(*frame))) goto badframe; - if (__get_user(set.sig[0], &frame->sc.oldmask) || (_NSIG_WORDS > 1 - && __copy_from_user(&set.sig[1], &frame->extramask, - sizeof(frame->extramask)))) + if (__get_user(set.sig[0], &frame->sc.oldmask) || + __get_user(set.sig[1], &frame->extramask[0])) goto badframe; set_current_blocked(&set); @@ -645,7 +641,7 @@ SYSCALL_DEFINE0(rt_sigreturn) frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long)); if (!access_ok(frame, sizeof(*frame))) goto badframe; - if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) + if (__get_user(*(__u64 *)&set, (__u64 __user *)&frame->uc.uc_sigmask)) goto badframe; if (__get_user(uc_flags, &frame->uc.uc_flags)) goto badframe; @@ -870,7 +866,7 @@ asmlinkage long sys32_x32_rt_sigreturn(void) if (!access_ok(frame, sizeof(*frame))) goto badframe; - if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set))) + if (__get_user(set.sig[0], (__u64 __user *)&frame->uc.uc_sigmask)) goto badframe; if (__get_user(uc_flags, &frame->uc.uc_flags)) goto badframe; -- cgit From 4b842e4e25b12951fa10dedb4bc16bc47e3b850c Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 11:46:30 -0500 Subject: x86: get rid of small constant size cases in raw_copy_{to,from}_user() Very few call sites where that would be triggered remain, and none of those is anywhere near hot enough to bother. Signed-off-by: Al Viro --- arch/x86/include/asm/uaccess.h | 12 ----- arch/x86/include/asm/uaccess_32.h | 27 ---------- arch/x86/include/asm/uaccess_64.h | 108 +------------------------------------- 3 files changed, 2 insertions(+), 145 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index ab8eab43a8a2..1cfa33b94a1a 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -378,18 +378,6 @@ do { \ : "=r" (err), ltype(x) \ : "m" (__m(addr)), "i" (errret), "0" (err)) -#define __get_user_asm_nozero(x, addr, err, itype, rtype, ltype, errret) \ - asm volatile("\n" \ - "1: mov"itype" %2,%"rtype"1\n" \ - "2:\n" \ - ".section .fixup,\"ax\"\n" \ - "3: mov %3,%0\n" \ - " jmp 2b\n" \ - ".previous\n" \ - _ASM_EXTABLE_UA(1b, 3b) \ - : "=r" (err), ltype(x) \ - : "m" (__m(addr)), "i" (errret), "0" (err)) - /* * This doesn't do __uaccess_begin/end - the exception handling * around it must do that. diff --git a/arch/x86/include/asm/uaccess_32.h b/arch/x86/include/asm/uaccess_32.h index ba2dc1930630..388a40660c7b 100644 --- a/arch/x86/include/asm/uaccess_32.h +++ b/arch/x86/include/asm/uaccess_32.h @@ -23,33 +23,6 @@ raw_copy_to_user(void __user *to, const void *from, unsigned long n) static __always_inline unsigned long raw_copy_from_user(void *to, const void __user *from, unsigned long n) { - if (__builtin_constant_p(n)) { - unsigned long ret; - - switch (n) { - case 1: - ret = 0; - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u8 *)to, from, ret, - "b", "b", "=q", 1); - __uaccess_end(); - return ret; - case 2: - ret = 0; - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u16 *)to, from, ret, - "w", "w", "=r", 2); - __uaccess_end(); - return ret; - case 4: - ret = 0; - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u32 *)to, from, ret, - "l", "k", "=r", 4); - __uaccess_end(); - return ret; - } - } return __copy_user_ll(to, (__force const void *)from, n); } diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index 5cd1caa8bc65..bc10e3dc64fe 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -65,117 +65,13 @@ copy_to_user_mcsafe(void *to, const void *from, unsigned len) static __always_inline __must_check unsigned long raw_copy_from_user(void *dst, const void __user *src, unsigned long size) { - int ret = 0; - - if (!__builtin_constant_p(size)) - return copy_user_generic(dst, (__force void *)src, size); - switch (size) { - case 1: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u8 *)dst, (u8 __user *)src, - ret, "b", "b", "=q", 1); - __uaccess_end(); - return ret; - case 2: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u16 *)dst, (u16 __user *)src, - ret, "w", "w", "=r", 2); - __uaccess_end(); - return ret; - case 4: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u32 *)dst, (u32 __user *)src, - ret, "l", "k", "=r", 4); - __uaccess_end(); - return ret; - case 8: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src, - ret, "q", "", "=r", 8); - __uaccess_end(); - return ret; - case 10: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src, - ret, "q", "", "=r", 10); - if (likely(!ret)) - __get_user_asm_nozero(*(u16 *)(8 + (char *)dst), - (u16 __user *)(8 + (char __user *)src), - ret, "w", "w", "=r", 2); - __uaccess_end(); - return ret; - case 16: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src, - ret, "q", "", "=r", 16); - if (likely(!ret)) - __get_user_asm_nozero(*(u64 *)(8 + (char *)dst), - (u64 __user *)(8 + (char __user *)src), - ret, "q", "", "=r", 8); - __uaccess_end(); - return ret; - default: - return copy_user_generic(dst, (__force void *)src, size); - } + return copy_user_generic(dst, (__force void *)src, size); } static __always_inline __must_check unsigned long raw_copy_to_user(void __user *dst, const void *src, unsigned long size) { - int ret = 0; - - if (!__builtin_constant_p(size)) - return copy_user_generic((__force void *)dst, src, size); - switch (size) { - case 1: - __uaccess_begin(); - __put_user_asm(*(u8 *)src, (u8 __user *)dst, - ret, "b", "b", "iq", 1); - __uaccess_end(); - return ret; - case 2: - __uaccess_begin(); - __put_user_asm(*(u16 *)src, (u16 __user *)dst, - ret, "w", "w", "ir", 2); - __uaccess_end(); - return ret; - case 4: - __uaccess_begin(); - __put_user_asm(*(u32 *)src, (u32 __user *)dst, - ret, "l", "k", "ir", 4); - __uaccess_end(); - return ret; - case 8: - __uaccess_begin(); - __put_user_asm(*(u64 *)src, (u64 __user *)dst, - ret, "q", "", "er", 8); - __uaccess_end(); - return ret; - case 10: - __uaccess_begin(); - __put_user_asm(*(u64 *)src, (u64 __user *)dst, - ret, "q", "", "er", 10); - if (likely(!ret)) { - asm("":::"memory"); - __put_user_asm(4[(u16 *)src], 4 + (u16 __user *)dst, - ret, "w", "w", "ir", 2); - } - __uaccess_end(); - return ret; - case 16: - __uaccess_begin(); - __put_user_asm(*(u64 *)src, (u64 __user *)dst, - ret, "q", "", "er", 16); - if (likely(!ret)) { - asm("":::"memory"); - __put_user_asm(1[(u64 *)src], 1 + (u64 __user *)dst, - ret, "q", "", "er", 8); - } - __uaccess_end(); - return ret; - default: - return copy_user_generic((__force void *)dst, src, size); - } + return copy_user_generic((__force void *)dst, src, size); } static __always_inline __must_check -- cgit From c63aad695dceb08290be7845052d7d3f1d457416 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 12:09:14 -0500 Subject: vm86: get rid of get_user_ex() use Just do a copyin of what we want into a local variable and be done with that. We are guaranteed to be on shallow stack here... Note that conditional expression for range passed to access_ok() in mainline had been pointless all along - the only difference between vm86plus_struct and vm86_struct is that the former has one extra field in the end and when we get to copyin of that field (conditional upon 'plus' argument), we use copy_from_user(). Moreover, all fields starting with ->int_revectored are copied that way, so we only need that check (be it done by access_ok() or by user_access_begin()) only on the beginning of the structure - the fields that used to be covered by that get_user_try() block. Signed-off-by: Al Viro --- arch/x86/kernel/vm86_32.c | 54 +++++++++++++++++++++-------------------------- 1 file changed, 24 insertions(+), 30 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c index 91d55454e702..49b37eb01e99 100644 --- a/arch/x86/kernel/vm86_32.c +++ b/arch/x86/kernel/vm86_32.c @@ -243,6 +243,7 @@ static long do_sys_vm86(struct vm86plus_struct __user *user_vm86, bool plus) struct kernel_vm86_regs vm86regs; struct pt_regs *regs = current_pt_regs(); unsigned long err = 0; + struct vm86_struct v; err = security_mmap_addr(0); if (err) { @@ -278,39 +279,32 @@ static long do_sys_vm86(struct vm86plus_struct __user *user_vm86, bool plus) if (vm86->saved_sp0) return -EPERM; - if (!access_ok(user_vm86, plus ? - sizeof(struct vm86_struct) : - sizeof(struct vm86plus_struct))) + if (copy_from_user(&v, user_vm86, + offsetof(struct vm86_struct, int_revectored))) return -EFAULT; memset(&vm86regs, 0, sizeof(vm86regs)); - get_user_try { - unsigned short seg; - get_user_ex(vm86regs.pt.bx, &user_vm86->regs.ebx); - get_user_ex(vm86regs.pt.cx, &user_vm86->regs.ecx); - get_user_ex(vm86regs.pt.dx, &user_vm86->regs.edx); - get_user_ex(vm86regs.pt.si, &user_vm86->regs.esi); - get_user_ex(vm86regs.pt.di, &user_vm86->regs.edi); - get_user_ex(vm86regs.pt.bp, &user_vm86->regs.ebp); - get_user_ex(vm86regs.pt.ax, &user_vm86->regs.eax); - get_user_ex(vm86regs.pt.ip, &user_vm86->regs.eip); - get_user_ex(seg, &user_vm86->regs.cs); - vm86regs.pt.cs = seg; - get_user_ex(vm86regs.pt.flags, &user_vm86->regs.eflags); - get_user_ex(vm86regs.pt.sp, &user_vm86->regs.esp); - get_user_ex(seg, &user_vm86->regs.ss); - vm86regs.pt.ss = seg; - get_user_ex(vm86regs.es, &user_vm86->regs.es); - get_user_ex(vm86regs.ds, &user_vm86->regs.ds); - get_user_ex(vm86regs.fs, &user_vm86->regs.fs); - get_user_ex(vm86regs.gs, &user_vm86->regs.gs); - - get_user_ex(vm86->flags, &user_vm86->flags); - get_user_ex(vm86->screen_bitmap, &user_vm86->screen_bitmap); - get_user_ex(vm86->cpu_type, &user_vm86->cpu_type); - } get_user_catch(err); - if (err) - return err; + + vm86regs.pt.bx = v.regs.ebx; + vm86regs.pt.cx = v.regs.ecx; + vm86regs.pt.dx = v.regs.edx; + vm86regs.pt.si = v.regs.esi; + vm86regs.pt.di = v.regs.edi; + vm86regs.pt.bp = v.regs.ebp; + vm86regs.pt.ax = v.regs.eax; + vm86regs.pt.ip = v.regs.eip; + vm86regs.pt.cs = v.regs.cs; + vm86regs.pt.flags = v.regs.eflags; + vm86regs.pt.sp = v.regs.esp; + vm86regs.pt.ss = v.regs.ss; + vm86regs.es = v.regs.es; + vm86regs.ds = v.regs.ds; + vm86regs.fs = v.regs.fs; + vm86regs.gs = v.regs.gs; + + vm86->flags = v.flags; + vm86->screen_bitmap = v.screen_bitmap; + vm86->cpu_type = v.cpu_type; if (copy_from_user(&vm86->int_revectored, &user_vm86->int_revectored, -- cgit From 978727ca331ebd8b479f6a7afd27bb2e6504b2e3 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 12:23:36 -0500 Subject: x86: get rid of get_user_ex() in ia32_restore_sigcontext() Just do copyin into a local struct and be done with that - we are on a shallow stack here. [reworked by tglx, removing the macro horrors while we are touching that] Signed-off-by: Al Viro --- arch/x86/ia32/ia32_signal.c | 106 ++++++++++++++++++-------------------------- 1 file changed, 44 insertions(+), 62 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index c72025d615f8..23e2c55d8a59 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -36,70 +36,56 @@ #include #include +static inline void reload_segments(struct sigcontext_32 *sc) +{ + unsigned int cur; + + savesegment(gs, cur); + if ((sc->gs | 0x03) != cur) + load_gs_index(sc->gs | 0x03); + savesegment(fs, cur); + if ((sc->fs | 0x03) != cur) + loadsegment(fs, sc->fs | 0x03); + savesegment(ds, cur); + if ((sc->ds | 0x03) != cur) + loadsegment(ds, sc->ds | 0x03); + savesegment(es, cur); + if ((sc->es | 0x03) != cur) + loadsegment(es, sc->es | 0x03); +} + /* * Do a signal return; undo the signal stack. */ -#define loadsegment_gs(v) load_gs_index(v) -#define loadsegment_fs(v) loadsegment(fs, v) -#define loadsegment_ds(v) loadsegment(ds, v) -#define loadsegment_es(v) loadsegment(es, v) - -#define get_user_seg(seg) ({ unsigned int v; savesegment(seg, v); v; }) -#define set_user_seg(seg, v) loadsegment_##seg(v) - -#define COPY(x) { \ - get_user_ex(regs->x, &sc->x); \ -} - -#define GET_SEG(seg) ({ \ - unsigned short tmp; \ - get_user_ex(tmp, &sc->seg); \ - tmp; \ -}) - -#define COPY_SEG_CPL3(seg) do { \ - regs->seg = GET_SEG(seg) | 3; \ -} while (0) - -#define RELOAD_SEG(seg) { \ - unsigned int pre = (seg) | 3; \ - unsigned int cur = get_user_seg(seg); \ - if (pre != cur) \ - set_user_seg(seg, pre); \ -} - static int ia32_restore_sigcontext(struct pt_regs *regs, - struct sigcontext_32 __user *sc) + struct sigcontext_32 __user *usc) { - unsigned int tmpflags, err = 0; - u16 gs, fs, es, ds; - void __user *buf; - u32 tmp; + struct sigcontext_32 sc; /* Always make any pending restarted system calls return -EINTR */ current->restart_block.fn = do_no_restart_syscall; - get_user_try { - gs = GET_SEG(gs); - fs = GET_SEG(fs); - ds = GET_SEG(ds); - es = GET_SEG(es); - - COPY(di); COPY(si); COPY(bp); COPY(sp); COPY(bx); - COPY(dx); COPY(cx); COPY(ip); COPY(ax); - /* Don't touch extended registers */ - - COPY_SEG_CPL3(cs); - COPY_SEG_CPL3(ss); - - get_user_ex(tmpflags, &sc->flags); - regs->flags = (regs->flags & ~FIX_EFLAGS) | (tmpflags & FIX_EFLAGS); - /* disable syscall checks */ - regs->orig_ax = -1; + if (unlikely(copy_from_user(&sc, usc, sizeof(sc)))) + return -EFAULT; - get_user_ex(tmp, &sc->fpstate); - buf = compat_ptr(tmp); - } get_user_catch(err); + /* Get only the ia32 registers. */ + regs->bx = sc.bx; + regs->cx = sc.cx; + regs->dx = sc.dx; + regs->si = sc.si; + regs->di = sc.di; + regs->bp = sc.bp; + regs->ax = sc.ax; + regs->sp = sc.sp; + regs->ip = sc.ip; + + /* Get CS/SS and force CPL3 */ + regs->cs = sc.cs | 0x03; + regs->ss = sc.ss | 0x03; + + regs->flags = (regs->flags & ~FIX_EFLAGS) | (sc.flags & FIX_EFLAGS); + /* disable syscall checks */ + regs->orig_ax = -1; /* * Reload fs and gs if they have changed in the signal @@ -107,14 +93,8 @@ static int ia32_restore_sigcontext(struct pt_regs *regs, * the handler, but does not clobber them at least in the * normal case. */ - RELOAD_SEG(gs); - RELOAD_SEG(fs); - RELOAD_SEG(ds); - RELOAD_SEG(es); - - err |= fpu__restore_sig(buf, 1); - - return err; + reload_segments(&sc); + return fpu__restore_sig(compat_ptr(sc.fpstate), 1); } COMPAT_SYSCALL_DEFINE0(sigreturn) @@ -172,6 +152,8 @@ badframe: * Set up a signal frame. */ +#define get_user_seg(seg) ({ unsigned int v; savesegment(seg, v); v; }) + static int ia32_setup_sigcontext(struct sigcontext_32 __user *sc, void __user *fpstate, struct pt_regs *regs, unsigned int mask) -- cgit From 3add42c29cebb1d5f83c6205c59466a06ccf8da1 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 12:56:57 -0500 Subject: x86: get rid of get_user_ex() in restore_sigcontext() Just do copyin into a local struct and be done with that - we are on a shallow stack here. [reworked by tglx, removing the macro horrors while we are touching that] Signed-off-by: Al Viro --- arch/x86/kernel/signal.c | 86 ++++++++++++++++++++---------------------------- 1 file changed, 36 insertions(+), 50 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 53ac66b3fd9b..83563e98f0be 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -47,24 +47,6 @@ #include #include -#define COPY(x) do { \ - get_user_ex(regs->x, &sc->x); \ -} while (0) - -#define GET_SEG(seg) ({ \ - unsigned short tmp; \ - get_user_ex(tmp, &sc->seg); \ - tmp; \ -}) - -#define COPY_SEG(seg) do { \ - regs->seg = GET_SEG(seg); \ -} while (0) - -#define COPY_SEG_CPL3(seg) do { \ - regs->seg = GET_SEG(seg) | 3; \ -} while (0) - #ifdef CONFIG_X86_64 /* * If regs->ss will cause an IRET fault, change it. Otherwise leave it @@ -92,53 +74,58 @@ static void force_valid_ss(struct pt_regs *regs) ar != (AR_DPL3 | AR_S | AR_P | AR_TYPE_RWDATA_EXPDOWN)) regs->ss = __USER_DS; } +# define CONTEXT_COPY_SIZE offsetof(struct sigcontext, reserved1) +#else +# define CONTEXT_COPY_SIZE sizeof(struct sigcontext) #endif static int restore_sigcontext(struct pt_regs *regs, - struct sigcontext __user *sc, + struct sigcontext __user *usc, unsigned long uc_flags) { - unsigned long buf_val; - void __user *buf; - unsigned int tmpflags; - unsigned int err = 0; + struct sigcontext sc; /* Always make any pending restarted system calls return -EINTR */ current->restart_block.fn = do_no_restart_syscall; - get_user_try { + if (copy_from_user(&sc, usc, CONTEXT_COPY_SIZE)) + return -EFAULT; #ifdef CONFIG_X86_32 - set_user_gs(regs, GET_SEG(gs)); - COPY_SEG(fs); - COPY_SEG(es); - COPY_SEG(ds); + set_user_gs(regs, sc.gs); + regs->fs = sc.fs; + regs->es = sc.es; + regs->ds = sc.ds; #endif /* CONFIG_X86_32 */ - COPY(di); COPY(si); COPY(bp); COPY(sp); COPY(bx); - COPY(dx); COPY(cx); COPY(ip); COPY(ax); + regs->bx = sc.bx; + regs->cx = sc.cx; + regs->dx = sc.dx; + regs->si = sc.si; + regs->di = sc.di; + regs->bp = sc.bp; + regs->ax = sc.ax; + regs->sp = sc.sp; + regs->ip = sc.ip; #ifdef CONFIG_X86_64 - COPY(r8); - COPY(r9); - COPY(r10); - COPY(r11); - COPY(r12); - COPY(r13); - COPY(r14); - COPY(r15); + regs->r8 = sc.r8; + regs->r9 = sc.r9; + regs->r10 = sc.r10; + regs->r11 = sc.r11; + regs->r12 = sc.r12; + regs->r13 = sc.r13; + regs->r14 = sc.r14; + regs->r15 = sc.r15; #endif /* CONFIG_X86_64 */ - COPY_SEG_CPL3(cs); - COPY_SEG_CPL3(ss); - - get_user_ex(tmpflags, &sc->flags); - regs->flags = (regs->flags & ~FIX_EFLAGS) | (tmpflags & FIX_EFLAGS); - regs->orig_ax = -1; /* disable syscall checks */ + /* Get CS/SS and force CPL3 */ + regs->cs = sc.cs | 0x03; + regs->ss = sc.ss | 0x03; - get_user_ex(buf_val, &sc->fpstate); - buf = (void __user *)buf_val; - } get_user_catch(err); + regs->flags = (regs->flags & ~FIX_EFLAGS) | (sc.flags & FIX_EFLAGS); + /* disable syscall checks */ + regs->orig_ax = -1; #ifdef CONFIG_X86_64 /* @@ -149,9 +136,8 @@ static int restore_sigcontext(struct pt_regs *regs, force_valid_ss(regs); #endif - err |= fpu__restore_sig(buf, IS_ENABLED(CONFIG_X86_32)); - - return err; + return fpu__restore_sig((void __user *)sc.fpstate, + IS_ENABLED(CONFIG_X86_32)); } int setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate, -- cgit From 77f3c6166ddc7567455b244074b3ebb63862b56f Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 13:26:51 -0500 Subject: x86: kill get_user_{try,catch,ex} no users left Signed-off-by: Al Viro --- arch/x86/include/asm/uaccess.h | 54 ------------------------------------------ 1 file changed, 54 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 1cfa33b94a1a..4dc5accdd826 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -335,12 +335,9 @@ do { \ "i" (errret), "0" (retval)); \ }) -#define __get_user_asm_ex_u64(x, ptr) (x) = __get_user_bad() #else #define __get_user_asm_u64(x, ptr, retval, errret) \ __get_user_asm(x, ptr, retval, "q", "", "=r", errret) -#define __get_user_asm_ex_u64(x, ptr) \ - __get_user_asm_ex(x, ptr, "q", "", "=r") #endif #define __get_user_size(x, ptr, size, retval, errret) \ @@ -378,41 +375,6 @@ do { \ : "=r" (err), ltype(x) \ : "m" (__m(addr)), "i" (errret), "0" (err)) -/* - * This doesn't do __uaccess_begin/end - the exception handling - * around it must do that. - */ -#define __get_user_size_ex(x, ptr, size) \ -do { \ - __chk_user_ptr(ptr); \ - switch (size) { \ - case 1: \ - __get_user_asm_ex(x, ptr, "b", "b", "=q"); \ - break; \ - case 2: \ - __get_user_asm_ex(x, ptr, "w", "w", "=r"); \ - break; \ - case 4: \ - __get_user_asm_ex(x, ptr, "l", "k", "=r"); \ - break; \ - case 8: \ - __get_user_asm_ex_u64(x, ptr); \ - break; \ - default: \ - (x) = __get_user_bad(); \ - } \ -} while (0) - -#define __get_user_asm_ex(x, addr, itype, rtype, ltype) \ - asm volatile("1: mov"itype" %1,%"rtype"0\n" \ - "2:\n" \ - ".section .fixup,\"ax\"\n" \ - "3:xor"itype" %"rtype"0,%"rtype"0\n" \ - " jmp 2b\n" \ - ".previous\n" \ - _ASM_EXTABLE_EX(1b, 3b) \ - : ltype(x) : "m" (__m(addr))) - #define __put_user_nocheck(x, ptr, size) \ ({ \ __label__ __pu_label; \ @@ -540,22 +502,6 @@ struct __large_struct { unsigned long buf[100]; }; #define __put_user(x, ptr) \ __put_user_nocheck((__typeof__(*(ptr)))(x), (ptr), sizeof(*(ptr))) -/* - * {get|put}_user_try and catch - * - * get_user_try { - * get_user_ex(...); - * } get_user_catch(err) - */ -#define get_user_try uaccess_try_nospec -#define get_user_catch(err) uaccess_catch(err) - -#define get_user_ex(x, ptr) do { \ - unsigned long __gue_val; \ - __get_user_size_ex((__gue_val), (ptr), (sizeof(*(ptr)))); \ - (x) = (__force __typeof__(*(ptr)))__gue_val; \ -} while (0) - #define put_user_try uaccess_try #define put_user_catch(err) uaccess_catch(err) -- cgit From a37d01ead405e3aa14d72d284721fe46422b3b63 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 13:38:18 -0500 Subject: x86: switch save_v86_state() to unsafe_put_user() Signed-off-by: Al Viro --- arch/x86/kernel/vm86_32.c | 61 +++++++++++++++++++++++------------------------ 1 file changed, 30 insertions(+), 31 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c index 49b37eb01e99..47a8676c7395 100644 --- a/arch/x86/kernel/vm86_32.c +++ b/arch/x86/kernel/vm86_32.c @@ -98,7 +98,6 @@ void save_v86_state(struct kernel_vm86_regs *regs, int retval) struct task_struct *tsk = current; struct vm86plus_struct __user *user; struct vm86 *vm86 = current->thread.vm86; - long err = 0; /* * This gets called from entry.S with interrupts disabled, but @@ -114,37 +113,30 @@ void save_v86_state(struct kernel_vm86_regs *regs, int retval) set_flags(regs->pt.flags, VEFLAGS, X86_EFLAGS_VIF | vm86->veflags_mask); user = vm86->user_vm86; - if (!access_ok(user, vm86->vm86plus.is_vm86pus ? + if (!user_access_begin(user, vm86->vm86plus.is_vm86pus ? sizeof(struct vm86plus_struct) : - sizeof(struct vm86_struct))) { - pr_alert("could not access userspace vm86 info\n"); - do_exit(SIGSEGV); - } - - put_user_try { - put_user_ex(regs->pt.bx, &user->regs.ebx); - put_user_ex(regs->pt.cx, &user->regs.ecx); - put_user_ex(regs->pt.dx, &user->regs.edx); - put_user_ex(regs->pt.si, &user->regs.esi); - put_user_ex(regs->pt.di, &user->regs.edi); - put_user_ex(regs->pt.bp, &user->regs.ebp); - put_user_ex(regs->pt.ax, &user->regs.eax); - put_user_ex(regs->pt.ip, &user->regs.eip); - put_user_ex(regs->pt.cs, &user->regs.cs); - put_user_ex(regs->pt.flags, &user->regs.eflags); - put_user_ex(regs->pt.sp, &user->regs.esp); - put_user_ex(regs->pt.ss, &user->regs.ss); - put_user_ex(regs->es, &user->regs.es); - put_user_ex(regs->ds, &user->regs.ds); - put_user_ex(regs->fs, &user->regs.fs); - put_user_ex(regs->gs, &user->regs.gs); - - put_user_ex(vm86->screen_bitmap, &user->screen_bitmap); - } put_user_catch(err); - if (err) { - pr_alert("could not access userspace vm86 info\n"); - do_exit(SIGSEGV); - } + sizeof(struct vm86_struct))) + goto Efault; + + unsafe_put_user(regs->pt.bx, &user->regs.ebx, Efault_end); + unsafe_put_user(regs->pt.cx, &user->regs.ecx, Efault_end); + unsafe_put_user(regs->pt.dx, &user->regs.edx, Efault_end); + unsafe_put_user(regs->pt.si, &user->regs.esi, Efault_end); + unsafe_put_user(regs->pt.di, &user->regs.edi, Efault_end); + unsafe_put_user(regs->pt.bp, &user->regs.ebp, Efault_end); + unsafe_put_user(regs->pt.ax, &user->regs.eax, Efault_end); + unsafe_put_user(regs->pt.ip, &user->regs.eip, Efault_end); + unsafe_put_user(regs->pt.cs, &user->regs.cs, Efault_end); + unsafe_put_user(regs->pt.flags, &user->regs.eflags, Efault_end); + unsafe_put_user(regs->pt.sp, &user->regs.esp, Efault_end); + unsafe_put_user(regs->pt.ss, &user->regs.ss, Efault_end); + unsafe_put_user(regs->es, &user->regs.es, Efault_end); + unsafe_put_user(regs->ds, &user->regs.ds, Efault_end); + unsafe_put_user(regs->fs, &user->regs.fs, Efault_end); + unsafe_put_user(regs->gs, &user->regs.gs, Efault_end); + unsafe_put_user(vm86->screen_bitmap, &user->screen_bitmap, Efault_end); + + user_access_end(); preempt_disable(); tsk->thread.sp0 = vm86->saved_sp0; @@ -159,6 +151,13 @@ void save_v86_state(struct kernel_vm86_regs *regs, int retval) lazy_load_gs(vm86->regs32.gs); regs->pt.ax = retval; + return; + +Efault_end: + user_access_end(); +Efault: + pr_alert("could not access userspace vm86 info\n"); + do_exit(SIGSEGV); } static void mark_screen_rdonly(struct mm_struct *mm) -- cgit From 9f855c085fb150ccc208497b39b5d3542bac5322 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 17:25:27 -0500 Subject: x86: switch setup_sigcontext() to unsafe_put_user() Signed-off-by: Al Viro --- arch/x86/include/asm/sighandling.h | 3 -- arch/x86/kernel/signal.c | 88 +++++++++++++++++++------------------- 2 files changed, 45 insertions(+), 46 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/sighandling.h b/arch/x86/include/asm/sighandling.h index 2fcbd6f33ef7..35e0b579ffcb 100644 --- a/arch/x86/include/asm/sighandling.h +++ b/arch/x86/include/asm/sighandling.h @@ -14,9 +14,6 @@ X86_EFLAGS_CF | X86_EFLAGS_RF) void signal_fault(struct pt_regs *regs, void __user *frame, char *where); -int setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate, - struct pt_regs *regs, unsigned long mask); - #ifdef CONFIG_X86_X32_ABI asmlinkage long sys32_x32_rt_sigreturn(void); diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 83563e98f0be..3b4ca484cfc2 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -140,63 +140,65 @@ static int restore_sigcontext(struct pt_regs *regs, IS_ENABLED(CONFIG_X86_32)); } -int setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate, +static int setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate, struct pt_regs *regs, unsigned long mask) { - int err = 0; - - put_user_try { + if (!user_access_begin(sc, sizeof(struct sigcontext))) + return -EFAULT; #ifdef CONFIG_X86_32 - put_user_ex(get_user_gs(regs), (unsigned int __user *)&sc->gs); - put_user_ex(regs->fs, (unsigned int __user *)&sc->fs); - put_user_ex(regs->es, (unsigned int __user *)&sc->es); - put_user_ex(regs->ds, (unsigned int __user *)&sc->ds); + unsafe_put_user(get_user_gs(regs), + (unsigned int __user *)&sc->gs, Efault); + unsafe_put_user(regs->fs, (unsigned int __user *)&sc->fs, Efault); + unsafe_put_user(regs->es, (unsigned int __user *)&sc->es, Efault); + unsafe_put_user(regs->ds, (unsigned int __user *)&sc->ds, Efault); #endif /* CONFIG_X86_32 */ - put_user_ex(regs->di, &sc->di); - put_user_ex(regs->si, &sc->si); - put_user_ex(regs->bp, &sc->bp); - put_user_ex(regs->sp, &sc->sp); - put_user_ex(regs->bx, &sc->bx); - put_user_ex(regs->dx, &sc->dx); - put_user_ex(regs->cx, &sc->cx); - put_user_ex(regs->ax, &sc->ax); + unsafe_put_user(regs->di, &sc->di, Efault); + unsafe_put_user(regs->si, &sc->si, Efault); + unsafe_put_user(regs->bp, &sc->bp, Efault); + unsafe_put_user(regs->sp, &sc->sp, Efault); + unsafe_put_user(regs->bx, &sc->bx, Efault); + unsafe_put_user(regs->dx, &sc->dx, Efault); + unsafe_put_user(regs->cx, &sc->cx, Efault); + unsafe_put_user(regs->ax, &sc->ax, Efault); #ifdef CONFIG_X86_64 - put_user_ex(regs->r8, &sc->r8); - put_user_ex(regs->r9, &sc->r9); - put_user_ex(regs->r10, &sc->r10); - put_user_ex(regs->r11, &sc->r11); - put_user_ex(regs->r12, &sc->r12); - put_user_ex(regs->r13, &sc->r13); - put_user_ex(regs->r14, &sc->r14); - put_user_ex(regs->r15, &sc->r15); + unsafe_put_user(regs->r8, &sc->r8, Efault); + unsafe_put_user(regs->r9, &sc->r9, Efault); + unsafe_put_user(regs->r10, &sc->r10, Efault); + unsafe_put_user(regs->r11, &sc->r11, Efault); + unsafe_put_user(regs->r12, &sc->r12, Efault); + unsafe_put_user(regs->r13, &sc->r13, Efault); + unsafe_put_user(regs->r14, &sc->r14, Efault); + unsafe_put_user(regs->r15, &sc->r15, Efault); #endif /* CONFIG_X86_64 */ - put_user_ex(current->thread.trap_nr, &sc->trapno); - put_user_ex(current->thread.error_code, &sc->err); - put_user_ex(regs->ip, &sc->ip); + unsafe_put_user(current->thread.trap_nr, &sc->trapno, Efault); + unsafe_put_user(current->thread.error_code, &sc->err, Efault); + unsafe_put_user(regs->ip, &sc->ip, Efault); #ifdef CONFIG_X86_32 - put_user_ex(regs->cs, (unsigned int __user *)&sc->cs); - put_user_ex(regs->flags, &sc->flags); - put_user_ex(regs->sp, &sc->sp_at_signal); - put_user_ex(regs->ss, (unsigned int __user *)&sc->ss); + unsafe_put_user(regs->cs, (unsigned int __user *)&sc->cs, Efault); + unsafe_put_user(regs->flags, &sc->flags, Efault); + unsafe_put_user(regs->sp, &sc->sp_at_signal, Efault); + unsafe_put_user(regs->ss, (unsigned int __user *)&sc->ss, Efault); #else /* !CONFIG_X86_32 */ - put_user_ex(regs->flags, &sc->flags); - put_user_ex(regs->cs, &sc->cs); - put_user_ex(0, &sc->gs); - put_user_ex(0, &sc->fs); - put_user_ex(regs->ss, &sc->ss); + unsafe_put_user(regs->flags, &sc->flags, Efault); + unsafe_put_user(regs->cs, &sc->cs, Efault); + unsafe_put_user(0, &sc->gs, Efault); + unsafe_put_user(0, &sc->fs, Efault); + unsafe_put_user(regs->ss, &sc->ss, Efault); #endif /* CONFIG_X86_32 */ - put_user_ex(fpstate, (unsigned long __user *)&sc->fpstate); + unsafe_put_user(fpstate, (unsigned long __user *)&sc->fpstate, Efault); - /* non-iBCS2 extensions.. */ - put_user_ex(mask, &sc->oldmask); - put_user_ex(current->thread.cr2, &sc->cr2); - } put_user_catch(err); - - return err; + /* non-iBCS2 extensions.. */ + unsafe_put_user(mask, &sc->oldmask, Efault); + unsafe_put_user(current->thread.cr2, &sc->cr2, Efault); + user_access_end(); + return 0; +Efault: + user_access_end(); + return -EFAULT; } /* -- cgit From d2d2728d161cbc52739d823a7fb76f3ba2fb3519 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 17:41:04 -0500 Subject: x86: switch ia32_setup_sigcontext() to unsafe_put_user() Signed-off-by: Al Viro --- arch/x86/ia32/ia32_signal.c | 64 +++++++++++++++++++++++---------------------- 1 file changed, 33 insertions(+), 31 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index 23e2c55d8a59..af673ec23a2d 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -158,38 +158,40 @@ static int ia32_setup_sigcontext(struct sigcontext_32 __user *sc, void __user *fpstate, struct pt_regs *regs, unsigned int mask) { - int err = 0; - - put_user_try { - put_user_ex(get_user_seg(gs), (unsigned int __user *)&sc->gs); - put_user_ex(get_user_seg(fs), (unsigned int __user *)&sc->fs); - put_user_ex(get_user_seg(ds), (unsigned int __user *)&sc->ds); - put_user_ex(get_user_seg(es), (unsigned int __user *)&sc->es); - - put_user_ex(regs->di, &sc->di); - put_user_ex(regs->si, &sc->si); - put_user_ex(regs->bp, &sc->bp); - put_user_ex(regs->sp, &sc->sp); - put_user_ex(regs->bx, &sc->bx); - put_user_ex(regs->dx, &sc->dx); - put_user_ex(regs->cx, &sc->cx); - put_user_ex(regs->ax, &sc->ax); - put_user_ex(current->thread.trap_nr, &sc->trapno); - put_user_ex(current->thread.error_code, &sc->err); - put_user_ex(regs->ip, &sc->ip); - put_user_ex(regs->cs, (unsigned int __user *)&sc->cs); - put_user_ex(regs->flags, &sc->flags); - put_user_ex(regs->sp, &sc->sp_at_signal); - put_user_ex(regs->ss, (unsigned int __user *)&sc->ss); - - put_user_ex(ptr_to_compat(fpstate), &sc->fpstate); - - /* non-iBCS2 extensions.. */ - put_user_ex(mask, &sc->oldmask); - put_user_ex(current->thread.cr2, &sc->cr2); - } put_user_catch(err); + if (!user_access_begin(sc, sizeof(struct sigcontext_32))) + return -EFAULT; - return err; + unsafe_put_user(get_user_seg(gs), (unsigned int __user *)&sc->gs, Efault); + unsafe_put_user(get_user_seg(fs), (unsigned int __user *)&sc->fs, Efault); + unsafe_put_user(get_user_seg(ds), (unsigned int __user *)&sc->ds, Efault); + unsafe_put_user(get_user_seg(es), (unsigned int __user *)&sc->es, Efault); + + unsafe_put_user(regs->di, &sc->di, Efault); + unsafe_put_user(regs->si, &sc->si, Efault); + unsafe_put_user(regs->bp, &sc->bp, Efault); + unsafe_put_user(regs->sp, &sc->sp, Efault); + unsafe_put_user(regs->bx, &sc->bx, Efault); + unsafe_put_user(regs->dx, &sc->dx, Efault); + unsafe_put_user(regs->cx, &sc->cx, Efault); + unsafe_put_user(regs->ax, &sc->ax, Efault); + unsafe_put_user(current->thread.trap_nr, &sc->trapno, Efault); + unsafe_put_user(current->thread.error_code, &sc->err, Efault); + unsafe_put_user(regs->ip, &sc->ip, Efault); + unsafe_put_user(regs->cs, (unsigned int __user *)&sc->cs, Efault); + unsafe_put_user(regs->flags, &sc->flags, Efault); + unsafe_put_user(regs->sp, &sc->sp_at_signal, Efault); + unsafe_put_user(regs->ss, (unsigned int __user *)&sc->ss, Efault); + + unsafe_put_user(ptr_to_compat(fpstate), &sc->fpstate, Efault); + + /* non-iBCS2 extensions.. */ + unsafe_put_user(mask, &sc->oldmask, Efault); + unsafe_put_user(current->thread.cr2, &sc->cr2, Efault); + user_access_end(); + return 0; +Efault: + user_access_end(); + return -EFAULT; } /* -- cgit From 39f16c1c0f14e9794545dbf6a64c909d5e16a2ea Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 18:39:17 -0500 Subject: x86: get rid of put_user_try in {ia32,x32}_setup_rt_frame() Straightforward, except for compat_save_altstack_ex() stuck in those. Replace that thing with an analogue that would use unsafe_put_user() instead of put_user_ex() (called unsafe_compat_save_altstack()) and be done with that... Signed-off-by: Al Viro --- arch/x86/ia32/ia32_signal.c | 50 +++++++++++++++++++++++---------------------- arch/x86/kernel/signal.c | 33 ++++++++++++++++-------------- 2 files changed, 44 insertions(+), 39 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index af673ec23a2d..a96995aa23da 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -326,35 +326,34 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig, frame = get_sigframe(ksig, regs, sizeof(*frame), &fpstate); - if (!access_ok(frame, sizeof(*frame))) + if (!user_access_begin(frame, sizeof(*frame))) return -EFAULT; - put_user_try { - put_user_ex(sig, &frame->sig); - put_user_ex(ptr_to_compat(&frame->info), &frame->pinfo); - put_user_ex(ptr_to_compat(&frame->uc), &frame->puc); + unsafe_put_user(sig, &frame->sig, Efault); + unsafe_put_user(ptr_to_compat(&frame->info), &frame->pinfo, Efault); + unsafe_put_user(ptr_to_compat(&frame->uc), &frame->puc, Efault); - /* Create the ucontext. */ - if (static_cpu_has(X86_FEATURE_XSAVE)) - put_user_ex(UC_FP_XSTATE, &frame->uc.uc_flags); - else - put_user_ex(0, &frame->uc.uc_flags); - put_user_ex(0, &frame->uc.uc_link); - compat_save_altstack_ex(&frame->uc.uc_stack, regs->sp); + /* Create the ucontext. */ + if (static_cpu_has(X86_FEATURE_XSAVE)) + unsafe_put_user(UC_FP_XSTATE, &frame->uc.uc_flags, Efault); + else + unsafe_put_user(0, &frame->uc.uc_flags, Efault); + unsafe_put_user(0, &frame->uc.uc_link, Efault); + unsafe_compat_save_altstack(&frame->uc.uc_stack, regs->sp, Efault); - if (ksig->ka.sa.sa_flags & SA_RESTORER) - restorer = ksig->ka.sa.sa_restorer; - else - restorer = current->mm->context.vdso + - vdso_image_32.sym___kernel_rt_sigreturn; - put_user_ex(ptr_to_compat(restorer), &frame->pretcode); + if (ksig->ka.sa.sa_flags & SA_RESTORER) + restorer = ksig->ka.sa.sa_restorer; + else + restorer = current->mm->context.vdso + + vdso_image_32.sym___kernel_rt_sigreturn; + unsafe_put_user(ptr_to_compat(restorer), &frame->pretcode, Efault); - /* - * Not actually used anymore, but left because some gdb - * versions need it. - */ - put_user_ex(*((u64 *)&code), (u64 __user *)frame->retcode); - } put_user_catch(err); + /* + * Not actually used anymore, but left because some gdb + * versions need it. + */ + unsafe_put_user(*((u64 *)&code), (u64 __user *)frame->retcode, Efault); + user_access_end(); err |= __copy_siginfo_to_user32(&frame->info, &ksig->info, false); err |= ia32_setup_sigcontext(&frame->uc.uc_mcontext, fpstate, @@ -380,4 +379,7 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig, regs->ss = __USER32_DS; return 0; +Efault: + user_access_end(); + return -EFAULT; } diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 3b4ca484cfc2..29abad29aaa1 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -529,6 +529,9 @@ static int x32_setup_rt_frame(struct ksignal *ksig, int err = 0; void __user *fpstate = NULL; + if (!(ksig->ka.sa.sa_flags & SA_RESTORER)) + return -EFAULT; + frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fpstate); if (!access_ok(frame, sizeof(*frame))) @@ -541,22 +544,17 @@ static int x32_setup_rt_frame(struct ksignal *ksig, uc_flags = frame_uc_flags(regs); - put_user_try { - /* Create the ucontext. */ - put_user_ex(uc_flags, &frame->uc.uc_flags); - put_user_ex(0, &frame->uc.uc_link); - compat_save_altstack_ex(&frame->uc.uc_stack, regs->sp); - put_user_ex(0, &frame->uc.uc__pad0); + if (!user_access_begin(frame, sizeof(*frame))) + return -EFAULT; - if (ksig->ka.sa.sa_flags & SA_RESTORER) { - restorer = ksig->ka.sa.sa_restorer; - } else { - /* could use a vstub here */ - restorer = NULL; - err |= -EFAULT; - } - put_user_ex(restorer, (unsigned long __user *)&frame->pretcode); - } put_user_catch(err); + /* Create the ucontext. */ + unsafe_put_user(uc_flags, &frame->uc.uc_flags, Efault); + unsafe_put_user(0, &frame->uc.uc_link, Efault); + unsafe_compat_save_altstack(&frame->uc.uc_stack, regs->sp, Efault); + unsafe_put_user(0, &frame->uc.uc__pad0, Efault); + restorer = ksig->ka.sa.sa_restorer; + unsafe_put_user(restorer, (unsigned long __user *)&frame->pretcode, Efault); + user_access_end(); err |= setup_sigcontext(&frame->uc.uc_mcontext, fpstate, regs, set->sig[0]); @@ -582,6 +580,11 @@ static int x32_setup_rt_frame(struct ksignal *ksig, #endif /* CONFIG_X86_X32_ABI */ return 0; +#ifdef CONFIG_X86_X32_ABI +Efault: + user_access_end(); + return -EFAULT; +#endif } /* -- cgit From 870b4333a62e45b0b2000d14b301b7b8b8cad9da Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Wed, 18 Mar 2020 19:27:48 +0100 Subject: x86/ioremap: Fix CONFIG_EFI=n build In order to use efi_mem_type(), one needs CONFIG_EFI enabled. Otherwise that function is undefined. Use IS_ENABLED() to check and avoid the ifdeffery as the compiler optimizes away the following unreachable code then. Fixes: 985e537a4082 ("x86/ioremap: Map EFI runtime services data as encrypted for SEV") Reported-by: Randy Dunlap Signed-off-by: Borislav Petkov Tested-by: Randy Dunlap Cc: Tom Lendacky Cc: Link: https://lkml.kernel.org/r/7561e981-0d9b-d62c-0ef2-ce6007aff1ab@infradead.org --- arch/x86/mm/ioremap.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 935a91e1fd77..18c637c0dc6f 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -115,6 +115,9 @@ static void __ioremap_check_other(resource_size_t addr, struct ioremap_desc *des if (!sev_active()) return; + if (!IS_ENABLED(CONFIG_EFI)) + return; + if (efi_mem_type(addr) == EFI_RUNTIME_SERVICES_DATA) desc->flags |= IORES_MAP_ENCRYPTED; } -- cgit From bac59d18c7018a2fd5e800a1e72a8271bf404977 Mon Sep 17 00:00:00 2001 From: Guenter Roeck Date: Thu, 30 Jan 2020 18:11:59 -0800 Subject: x86/setup: Fix static memory detection When booting x86 images in qemu, the following warning is seen randomly if DEBUG_LOCKDEP is enabled. WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:1119 lockdep_register_key+0xc0/0x100 static_obj() returns true if an address is between _stext and _end. On x86, this includes the brk memory space. Problem is that this memory block is not static on x86; its unused portions are released after init and can be allocated. This results in the observed warning if a lockdep object is allocated from this memory. Solve the problem by implementing arch_is_kernel_initmem_freed() for x86 and have it return true if an address is within the released memory range. The same problem was solved for s390 with commit 7a5da02de8d6e ("locking/lockdep: check for freed initmem in static_obj()"), which introduced arch_is_kernel_initmem_freed(). Signed-off-by: Guenter Roeck Signed-off-by: Borislav Petkov Acked-by: Peter Zijlstra Link: https://lkml.kernel.org/r/20200131021159.9178-1-linux@roeck-us.net --- arch/x86/include/asm/sections.h | 20 ++++++++++++++++++++ arch/x86/kernel/setup.c | 1 - 2 files changed, 20 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h index 036c360910c5..a6e8373a5170 100644 --- a/arch/x86/include/asm/sections.h +++ b/arch/x86/include/asm/sections.h @@ -2,6 +2,8 @@ #ifndef _ASM_X86_SECTIONS_H #define _ASM_X86_SECTIONS_H +#define arch_is_kernel_initmem_freed arch_is_kernel_initmem_freed + #include #include @@ -14,4 +16,22 @@ extern char __end_rodata_hpage_align[]; extern char __end_of_kernel_reserve[]; +extern unsigned long _brk_start, _brk_end; + +static inline bool arch_is_kernel_initmem_freed(unsigned long addr) +{ + /* + * If _brk_start has not been cleared, brk allocation is incomplete, + * and we can not make assumptions about its use. + */ + if (_brk_start) + return 0; + + /* + * After brk allocation is complete, space between _brk_end and _end + * is available for allocation. + */ + return addr >= _brk_end && addr < (unsigned long)&_end; +} + #endif /* _ASM_X86_SECTIONS_H */ diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index a74262c71484..e6b545047f38 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -64,7 +64,6 @@ RESERVE_BRK(dmi_alloc, 65536); * at link time, with RESERVE_BRK*() facility reserving additional * chunks. */ -static __initdata unsigned long _brk_start = (unsigned long)__brk_base; unsigned long _brk_end = (unsigned long)__brk_base; -- cgit From e2bdafc1070f5db0bc1bc40116955f54188771cb Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 25 Feb 2020 21:14:05 -0800 Subject: x86/configs: Slightly reduce defconfigs Eliminate 2 config symbols from both x86 defconfig files: HAMRADIO and FDDI. The FDDI Kconfig file even says (for the FDDI config symbol): Most people will say N. Signed-off-by: Randy Dunlap Signed-off-by: Borislav Petkov Acked-by: Maciej W. Rozycki # CONFIG_FDDI Link: https://lkml.kernel.org/r/433f203e-4e00-f317-2e6b-81518b72843c@infradead.org --- arch/x86/configs/i386_defconfig | 2 -- arch/x86/configs/x86_64_defconfig | 2 -- 2 files changed, 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig index 59ce9ed58430..5b602beb0b72 100644 --- a/arch/x86/configs/i386_defconfig +++ b/arch/x86/configs/i386_defconfig @@ -125,7 +125,6 @@ CONFIG_IP6_NF_MANGLE=y CONFIG_NET_SCHED=y CONFIG_NET_EMATCH=y CONFIG_NET_CLS_ACT=y -CONFIG_HAMRADIO=y CONFIG_CFG80211=y CONFIG_MAC80211=y CONFIG_MAC80211_LEDS=y @@ -171,7 +170,6 @@ CONFIG_FORCEDETH=y CONFIG_8139TOO=y # CONFIG_8139TOO_PIO is not set CONFIG_R8169=y -CONFIG_FDDI=y CONFIG_INPUT_POLLDEV=y # CONFIG_INPUT_MOUSEDEV_PSAUX is not set CONFIG_INPUT_EVDEV=y diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig index 0b9654c7a05c..f3d1f36103b1 100644 --- a/arch/x86/configs/x86_64_defconfig +++ b/arch/x86/configs/x86_64_defconfig @@ -123,7 +123,6 @@ CONFIG_IP6_NF_MANGLE=y CONFIG_NET_SCHED=y CONFIG_NET_EMATCH=y CONFIG_NET_CLS_ACT=y -CONFIG_HAMRADIO=y CONFIG_CFG80211=y CONFIG_MAC80211=y CONFIG_MAC80211_LEDS=y @@ -164,7 +163,6 @@ CONFIG_SKY2=y CONFIG_FORCEDETH=y CONFIG_8139TOO=y CONFIG_R8169=y -CONFIG_FDDI=y CONFIG_INPUT_POLLDEV=y # CONFIG_INPUT_MOUSEDEV_PSAUX is not set CONFIG_INPUT_EVDEV=y -- cgit From d8a738689794c42c3844284b99ddf165d10a724e Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 5 Mar 2020 10:21:30 +0100 Subject: x86/optprobe: Fix OPTPROBE vs UACCESS While looking at an objtool UACCESS warning, it suddenly occurred to me that it is entirely possible to have an OPTPROBE right in the middle of an UACCESS region. In this case we must of course clear FLAGS.AC while running the KPROBE. Luckily the trampoline already saves/restores [ER]FLAGS, so all we need to do is inject a CLAC. Unfortunately we cannot use ALTERNATIVE() in the trampoline text, so we have to frob that manually. Fixes: ca0bbc70f147 ("sched/x86_64: Don't save flags on context switch") Signed-off-by: Peter Zijlstra (Intel) Acked-by: Masami Hiramatsu Link: https://lkml.kernel.org/r/20200305092130.GU2596@hirez.programming.kicks-ass.net --- arch/x86/include/asm/kprobes.h | 1 + arch/x86/kernel/kprobes/opt.c | 25 +++++++++++++++++++++++++ 2 files changed, 26 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h index 95b1f053bd96..073eb7ad2f56 100644 --- a/arch/x86/include/asm/kprobes.h +++ b/arch/x86/include/asm/kprobes.h @@ -36,6 +36,7 @@ typedef u8 kprobe_opcode_t; /* optinsn template addresses */ extern __visible kprobe_opcode_t optprobe_template_entry[]; +extern __visible kprobe_opcode_t optprobe_template_clac[]; extern __visible kprobe_opcode_t optprobe_template_val[]; extern __visible kprobe_opcode_t optprobe_template_call[]; extern __visible kprobe_opcode_t optprobe_template_end[]; diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c index 3f45b5c43a71..ea13f6888284 100644 --- a/arch/x86/kernel/kprobes/opt.c +++ b/arch/x86/kernel/kprobes/opt.c @@ -71,6 +71,21 @@ found: return (unsigned long)buf; } +static void synthesize_clac(kprobe_opcode_t *addr) +{ + /* + * Can't be static_cpu_has() due to how objtool treats this feature bit. + * This isn't a fast path anyway. + */ + if (!boot_cpu_has(X86_FEATURE_SMAP)) + return; + + /* Replace the NOP3 with CLAC */ + addr[0] = 0x0f; + addr[1] = 0x01; + addr[2] = 0xca; +} + /* Insert a move instruction which sets a pointer to eax/rdi (1st arg). */ static void synthesize_set_arg1(kprobe_opcode_t *addr, unsigned long val) { @@ -92,6 +107,9 @@ asm ( /* We don't bother saving the ss register */ " pushq %rsp\n" " pushfq\n" + ".global optprobe_template_clac\n" + "optprobe_template_clac:\n" + ASM_NOP3 SAVE_REGS_STRING " movq %rsp, %rsi\n" ".global optprobe_template_val\n" @@ -111,6 +129,9 @@ asm ( #else /* CONFIG_X86_32 */ " pushl %esp\n" " pushfl\n" + ".global optprobe_template_clac\n" + "optprobe_template_clac:\n" + ASM_NOP3 SAVE_REGS_STRING " movl %esp, %edx\n" ".global optprobe_template_val\n" @@ -134,6 +155,8 @@ asm ( void optprobe_template_func(void); STACK_FRAME_NON_STANDARD(optprobe_template_func); +#define TMPL_CLAC_IDX \ + ((long)optprobe_template_clac - (long)optprobe_template_entry) #define TMPL_MOVE_IDX \ ((long)optprobe_template_val - (long)optprobe_template_entry) #define TMPL_CALL_IDX \ @@ -389,6 +412,8 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, op->optinsn.size = ret; len = TMPL_END_IDX + op->optinsn.size; + synthesize_clac(buf + TMPL_CLAC_IDX); + /* Set probe information */ synthesize_set_arg1(buf + TMPL_MOVE_IDX, (unsigned long)op); -- cgit From bc88a2fe216a51e8ab46d61f89d0c1b5a400470e Mon Sep 17 00:00:00 2001 From: Kan Liang Date: Tue, 17 Mar 2020 11:38:32 -0700 Subject: perf/x86/intel/uncore: Add box_offsets for free-running counters The offset between uncore boxes of free-running counters varies, e.g. IIO free-running counters on Ice Lake server. Add box_offsets, an array of offsets between adjacent uncore boxes. Signed-off-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/1584470314-46657-1-git-send-email-kan.liang@linux.intel.com --- arch/x86/events/intel/uncore.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h index 1204dcc9fe9b..b30429f8a53a 100644 --- a/arch/x86/events/intel/uncore.h +++ b/arch/x86/events/intel/uncore.h @@ -154,6 +154,7 @@ struct freerunning_counters { unsigned int box_offset; unsigned int num_counters; unsigned int bits; + unsigned *box_offsets; }; struct pci2phy_map { @@ -310,7 +311,9 @@ unsigned int uncore_freerunning_counter(struct intel_uncore_box *box, return pmu->type->freerunning[type].counter_base + pmu->type->freerunning[type].counter_offset * idx + - pmu->type->freerunning[type].box_offset * pmu->pmu_idx; + (pmu->type->freerunning[type].box_offsets ? + pmu->type->freerunning[type].box_offsets[pmu->pmu_idx] : + pmu->type->freerunning[type].box_offset * pmu->pmu_idx); } static inline -- cgit From 3442a9ecb8e72a33c28a2b969b766c659830e410 Mon Sep 17 00:00:00 2001 From: Kan Liang Date: Tue, 17 Mar 2020 11:38:33 -0700 Subject: perf/x86/intel/uncore: Factor out __snr_uncore_mmio_init_box The IMC uncore unit in Ice Lake server can only be accessed by MMIO, which is similar as Snow Ridge. Factor out __snr_uncore_mmio_init_box which can be shared with Ice Lake server in the following patch. No functional changes. Signed-off-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/1584470314-46657-2-git-send-email-kan.liang@linux.intel.com --- arch/x86/events/intel/uncore_snbep.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c index ad20220af303..01023f0d935b 100644 --- a/arch/x86/events/intel/uncore_snbep.c +++ b/arch/x86/events/intel/uncore_snbep.c @@ -4380,10 +4380,10 @@ static struct pci_dev *snr_uncore_get_mc_dev(int id) return mc_dev; } -static void snr_uncore_mmio_init_box(struct intel_uncore_box *box) +static void __snr_uncore_mmio_init_box(struct intel_uncore_box *box, + unsigned int box_ctl, int mem_offset) { struct pci_dev *pdev = snr_uncore_get_mc_dev(box->dieid); - unsigned int box_ctl = uncore_mmio_box_ctl(box); resource_size_t addr; u32 pci_dword; @@ -4393,7 +4393,7 @@ static void snr_uncore_mmio_init_box(struct intel_uncore_box *box) pci_read_config_dword(pdev, SNR_IMC_MMIO_BASE_OFFSET, &pci_dword); addr = (pci_dword & SNR_IMC_MMIO_BASE_MASK) << 23; - pci_read_config_dword(pdev, SNR_IMC_MMIO_MEM0_OFFSET, &pci_dword); + pci_read_config_dword(pdev, mem_offset, &pci_dword); addr |= (pci_dword & SNR_IMC_MMIO_MEM0_MASK) << 12; addr += box_ctl; @@ -4405,6 +4405,12 @@ static void snr_uncore_mmio_init_box(struct intel_uncore_box *box) writel(IVBEP_PMON_BOX_CTL_INT, box->io_addr); } +static void snr_uncore_mmio_init_box(struct intel_uncore_box *box) +{ + __snr_uncore_mmio_init_box(box, uncore_mmio_box_ctl(box), + SNR_IMC_MMIO_MEM0_OFFSET); +} + static void snr_uncore_mmio_disable_box(struct intel_uncore_box *box) { u32 config; -- cgit From d33294541889b023068522270cd4153ddd8e4635 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Thu, 19 Mar 2020 13:41:06 -0400 Subject: KVM: x86: remove bogus user-triggerable WARN_ON The WARN_ON is essentially comparing a user-provided value with 0. It is trivial to trigger it just by passing garbage to KVM_SET_CLOCK. Guests can break if you do so, but the same applies to every KVM_SET_* ioctl. So, if it hurts when you do like this, just do not do it. Reported-by: syzbot+00be5da1d75f1cc95f6b@syzkaller.appspotmail.com Fixes: 9446e6fce0ab ("KVM: x86: fix WARN_ON check of an unsigned less than zero") Cc: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 1 - 1 file changed, 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3156e25b0774..d65ff2008cf1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2444,7 +2444,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) vcpu->hv_clock.tsc_timestamp = tsc_timestamp; vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset; vcpu->last_guest_tsc = tsc_timestamp; - WARN_ON((s64)vcpu->hv_clock.system_time < 0); /* If the host uses TSC clocksource, then it is stable */ pvclock_flags = 0; -- cgit From 2da1ed62d55c6cbebbdee924f6af4e87bb6666e5 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Fri, 20 Mar 2020 13:34:50 -0400 Subject: KVM: SVM: document KVM_MEM_ENCRYPT_OP, let userspace detect if SEV is available Userspace has no way to query if SEV has been disabled with the sev module parameter of kvm-amd.ko. Actually it has one, but it is a hack: do ioctl(KVM_MEM_ENCRYPT_OP, NULL) and check if it returns EFAULT. Make it a little nicer by returning zero for SEV enabled and NULL argument, and while at it document the ioctl arguments. Cc: Brijesh Singh Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 91000501756e..f0aa9ff9666f 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -7158,6 +7158,9 @@ static int svm_mem_enc_op(struct kvm *kvm, void __user *argp) if (!svm_sev_enabled()) return -ENOTTY; + if (!argp) + return 0; + if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd))) return -EFAULT; -- cgit From 4dd2a1b92b91b5f2acf853ee1dc0df135054698f Mon Sep 17 00:00:00 2001 From: afzal mohammed Date: Mon, 24 Feb 2020 06:22:26 +0530 Subject: x86: Replace setup_irq() by request_irq() request_irq() is preferred over setup_irq(). The early boot setup_irq() invocations happen either via 'init_IRQ()' or 'time_init()', while memory allocators are ready by 'mm_init()'. setup_irq() was required in old kernels when allocators were not ready by the time early interrupts were initialized. Hence replace setup_irq() by request_irq(). [ tglx: Use a local variable and get rid of the line break. Tweak the comment a bit ] Signed-off-by: afzal mohammed Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/17f85021f6877650a5b09e0212d88323e6a30fd0.1582471508.git.afzal.mohd.ma@gmail.com --- arch/x86/kernel/irqinit.c | 16 +++++----------- arch/x86/kernel/time.c | 15 ++++++--------- 2 files changed, 11 insertions(+), 20 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index 1e5ad12f1061..5aa523c2d573 100644 --- a/arch/x86/kernel/irqinit.c +++ b/arch/x86/kernel/irqinit.c @@ -44,15 +44,6 @@ * (these are usually mapped into the 0x30-0xff vector range) */ -/* - * IRQ2 is cascade interrupt to second interrupt controller - */ -static struct irqaction irq2 = { - .handler = no_action, - .name = "cascade", - .flags = IRQF_NO_THREAD, -}; - DEFINE_PER_CPU(vector_irq_t, vector_irq) = { [0 ... NR_VECTORS - 1] = VECTOR_UNUSED, }; @@ -104,6 +95,9 @@ void __init native_init_IRQ(void) idt_setup_apic_and_irq_gates(); lapic_assign_system_vectors(); - if (!acpi_ioapic && !of_ioapic && nr_legacy_irqs()) - setup_irq(2, &irq2); + if (!acpi_ioapic && !of_ioapic && nr_legacy_irqs()) { + /* IRQ2 is cascade interrupt to second interrupt controller */ + if (request_irq(2, no_action, IRQF_NO_THREAD, "cascade", NULL)) + pr_err("%s: request_irq() failed\n", "cascade"); + } } diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c index d8673d8a779b..8c5213e6ec20 100644 --- a/arch/x86/kernel/time.c +++ b/arch/x86/kernel/time.c @@ -62,19 +62,16 @@ static irqreturn_t timer_interrupt(int irq, void *dev_id) return IRQ_HANDLED; } -static struct irqaction irq0 = { - .handler = timer_interrupt, - .flags = IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_TIMER, - .name = "timer" -}; - static void __init setup_default_timer_irq(void) { + unsigned long flags = IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_TIMER; + /* - * Unconditionally register the legacy timer; even without legacy - * PIC/PIT we need this for the HPET0 in legacy replacement mode. + * Unconditionally register the legacy timer interrupt; even + * without legacy PIC/PIT we need this for the HPET0 in legacy + * replacement mode. */ - if (setup_irq(0, &irq0)) + if (request_irq(0, timer_interrupt, flags, "timer", NULL)) pr_info("Failed to register legacy timer interrupt\n"); } -- cgit From 659a9faa3f3c89de4c34c30c3da676059864e0fe Mon Sep 17 00:00:00 2001 From: Vincenzo Frascino Date: Fri, 20 Mar 2020 14:53:29 +0000 Subject: x86: Introduce asm/vdso/clocksource.h The vDSO library should only include the necessary headers required for a userspace library (UAPI and a minimal set of kernel headers). To make this possible it is necessary to isolate from the kernel headers the common parts that are strictly necessary to build the library. Introduce asm/vdso/clocksource.h to contain all the arm64 specific functions that are suitable for vDSO inclusion. This header will be required by a future patch that will generalize vdso/clocksource.h. Signed-off-by: Vincenzo Frascino Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200320145351.32292-5-vincenzo.frascino@arm.com --- arch/x86/include/asm/clocksource.h | 5 +---- arch/x86/include/asm/vdso/clocksource.h | 10 ++++++++++ 2 files changed, 11 insertions(+), 4 deletions(-) create mode 100644 arch/x86/include/asm/vdso/clocksource.h (limited to 'arch/x86') diff --git a/arch/x86/include/asm/clocksource.h b/arch/x86/include/asm/clocksource.h index d561db67f96d..dc9dc7b3911a 100644 --- a/arch/x86/include/asm/clocksource.h +++ b/arch/x86/include/asm/clocksource.h @@ -4,10 +4,7 @@ #ifndef _ASM_X86_CLOCKSOURCE_H #define _ASM_X86_CLOCKSOURCE_H -#define VDSO_ARCH_CLOCKMODES \ - VDSO_CLOCKMODE_TSC, \ - VDSO_CLOCKMODE_PVCLOCK, \ - VDSO_CLOCKMODE_HVCLOCK +#include extern unsigned int vclocks_used; diff --git a/arch/x86/include/asm/vdso/clocksource.h b/arch/x86/include/asm/vdso/clocksource.h new file mode 100644 index 000000000000..119ac8612d89 --- /dev/null +++ b/arch/x86/include/asm/vdso/clocksource.h @@ -0,0 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_VDSO_CLOCKSOURCE_H +#define __ASM_VDSO_CLOCKSOURCE_H + +#define VDSO_ARCH_CLOCKMODES \ + VDSO_CLOCKMODE_TSC, \ + VDSO_CLOCKMODE_PVCLOCK, \ + VDSO_CLOCKMODE_HVCLOCK + +#endif /* __ASM_VDSO_CLOCKSOURCE_H */ -- cgit From abc22418db02b986fc5623c035507b6357e191ed Mon Sep 17 00:00:00 2001 From: Vincenzo Frascino Date: Fri, 20 Mar 2020 14:53:48 +0000 Subject: x86/vdso: Enable x86 to use common headers Enable x86 to use only the common headers in the implementation of the vDSO library. Signed-off-by: Vincenzo Frascino Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200320145351.32292-24-vincenzo.frascino@arm.com --- arch/x86/include/asm/processor.h | 12 +----------- arch/x86/include/asm/vdso/processor.h | 23 +++++++++++++++++++++++ 2 files changed, 24 insertions(+), 11 deletions(-) create mode 100644 arch/x86/include/asm/vdso/processor.h (limited to 'arch/x86') diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 09705ccc393c..94789db550df 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -26,6 +26,7 @@ struct vm86; #include #include #include +#include #include #include @@ -677,17 +678,6 @@ static inline unsigned int cpuid_edx(unsigned int op) return edx; } -/* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */ -static __always_inline void rep_nop(void) -{ - asm volatile("rep; nop" ::: "memory"); -} - -static __always_inline void cpu_relax(void) -{ - rep_nop(); -} - /* * This function forces the icache and prefetched instruction stream to * catch up with reality in two very specific cases: diff --git a/arch/x86/include/asm/vdso/processor.h b/arch/x86/include/asm/vdso/processor.h new file mode 100644 index 000000000000..57b1a7034c64 --- /dev/null +++ b/arch/x86/include/asm/vdso/processor.h @@ -0,0 +1,23 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2020 ARM Ltd. + */ +#ifndef __ASM_VDSO_PROCESSOR_H +#define __ASM_VDSO_PROCESSOR_H + +#ifndef __ASSEMBLY__ + +/* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */ +static __always_inline void rep_nop(void) +{ + asm volatile("rep; nop" ::: "memory"); +} + +static __always_inline void cpu_relax(void) +{ + rep_nop(); +} + +#endif /* __ASSEMBLY__ */ + +#endif /* __ASM_VDSO_PROCESSOR_H */ -- cgit From 4399e0cf494f739af7e0648f52fe43311ecd1bea Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:27 -0400 Subject: x86/entry: Refactor SYSCALL_DEFINEx macros Pull the common code out from the SYSCALL_DEFINEx macros into a new __SYS_STUBx macro. Also conditionalize the X64 version in preparation for enabling syscall wrappers on 32-bit native kernels. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200313195144.164260-2-brgerst@gmail.com --- arch/x86/include/asm/syscall_wrapper.h | 49 +++++++++++++++++----------------- 1 file changed, 24 insertions(+), 25 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/syscall_wrapper.h b/arch/x86/include/asm/syscall_wrapper.h index e2389ce9bf58..a1c090bde063 100644 --- a/arch/x86/include/asm/syscall_wrapper.h +++ b/arch/x86/include/asm/syscall_wrapper.h @@ -21,6 +21,22 @@ struct pt_regs; ,,(unsigned int)regs->dx,,(unsigned int)regs->si \ ,,(unsigned int)regs->di,,(unsigned int)regs->bp) +#define __SYS_STUBx(abi, name, ...) \ + asmlinkage long __##abi##_##name(const struct pt_regs *regs); \ + ALLOW_ERROR_INJECTION(__##abi##_##name, ERRNO); \ + asmlinkage long __##abi##_##name(const struct pt_regs *regs) \ + { \ + return __se_##name(__VA_ARGS__); \ + } + +#ifdef CONFIG_X86_64 +#define __X64_SYS_STUBx(x, name, ...) \ + __SYS_STUBx(x64, sys##name, \ + SC_X86_64_REGS_TO_ARGS(x, __VA_ARGS__)) +#else /* CONFIG_X86_64 */ +#define __X64_SYS_STUBx(x, name, ...) +#endif /* CONFIG_X86_64 */ + #ifdef CONFIG_IA32_EMULATION /* * For IA32 emulation, we need to handle "compat" syscalls *and* create @@ -39,20 +55,12 @@ struct pt_regs; } #define __IA32_COMPAT_SYS_STUBx(x, name, ...) \ - asmlinkage long __ia32_compat_sys##name(const struct pt_regs *regs);\ - ALLOW_ERROR_INJECTION(__ia32_compat_sys##name, ERRNO); \ - asmlinkage long __ia32_compat_sys##name(const struct pt_regs *regs)\ - { \ - return __se_compat_sys##name(SC_IA32_REGS_TO_ARGS(x,__VA_ARGS__));\ - } + __SYS_STUBx(ia32, compat_sys##name, \ + SC_IA32_REGS_TO_ARGS(x, __VA_ARGS__)) #define __IA32_SYS_STUBx(x, name, ...) \ - asmlinkage long __ia32_sys##name(const struct pt_regs *regs); \ - ALLOW_ERROR_INJECTION(__ia32_sys##name, ERRNO); \ - asmlinkage long __ia32_sys##name(const struct pt_regs *regs) \ - { \ - return __se_sys##name(SC_IA32_REGS_TO_ARGS(x,__VA_ARGS__));\ - } + __SYS_STUBx(ia32, sys##name, \ + SC_IA32_REGS_TO_ARGS(x, __VA_ARGS__)) /* * To keep the naming coherent, re-define SYSCALL_DEFINE0 to create an alias @@ -82,7 +90,7 @@ struct pt_regs; #else /* CONFIG_IA32_EMULATION */ #define __IA32_COMPAT_SYS_STUBx(x, name, ...) -#define __IA32_SYS_STUBx(x, fullname, name, ...) +#define __IA32_SYS_STUBx(x, name, ...) #endif /* CONFIG_IA32_EMULATION */ @@ -101,12 +109,8 @@ struct pt_regs; } #define __X32_COMPAT_SYS_STUBx(x, name, ...) \ - asmlinkage long __x32_compat_sys##name(const struct pt_regs *regs);\ - ALLOW_ERROR_INJECTION(__x32_compat_sys##name, ERRNO); \ - asmlinkage long __x32_compat_sys##name(const struct pt_regs *regs)\ - { \ - return __se_compat_sys##name(SC_X86_64_REGS_TO_ARGS(x,__VA_ARGS__));\ - } + __SYS_STUBx(x32, compat_sys##name, \ + SC_X86_64_REGS_TO_ARGS(x, __VA_ARGS__)) #else /* CONFIG_X86_X32 */ #define __X32_COMPAT_SYS_STUB0(x, name) @@ -192,14 +196,9 @@ struct pt_regs; * to the i386 calling convention (bx, cx, dx, si, di, bp). */ #define __SYSCALL_DEFINEx(x, name, ...) \ - asmlinkage long __x64_sys##name(const struct pt_regs *regs); \ - ALLOW_ERROR_INJECTION(__x64_sys##name, ERRNO); \ static long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \ static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__));\ - asmlinkage long __x64_sys##name(const struct pt_regs *regs) \ - { \ - return __se_sys##name(SC_X86_64_REGS_TO_ARGS(x,__VA_ARGS__));\ - } \ + __X64_SYS_STUBx(x, name, __VA_ARGS__) \ __IA32_SYS_STUBx(x, name, __VA_ARGS__) \ static long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \ { \ -- cgit From d2b5de495ee9838b3752806c135afd43b76b1e8f Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:28 -0400 Subject: x86/entry: Refactor SYSCALL_DEFINE0 macros Pull the common code out from the SYSCALL_DEFINE0 macros into a new __SYS_STUB0 macro. Also conditionalize the X64 version in preparation for enabling syscall wrappers on 32-bit native kernels. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200313195144.164260-3-brgerst@gmail.com --- arch/x86/include/asm/syscall_wrapper.h | 73 +++++++++++++++------------------- 1 file changed, 32 insertions(+), 41 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/syscall_wrapper.h b/arch/x86/include/asm/syscall_wrapper.h index a1c090bde063..45ad60571166 100644 --- a/arch/x86/include/asm/syscall_wrapper.h +++ b/arch/x86/include/asm/syscall_wrapper.h @@ -21,6 +21,12 @@ struct pt_regs; ,,(unsigned int)regs->dx,,(unsigned int)regs->si \ ,,(unsigned int)regs->di,,(unsigned int)regs->bp) +#define __SYS_STUB0(abi, name) \ + asmlinkage long __##abi##_##name(const struct pt_regs *regs); \ + ALLOW_ERROR_INJECTION(__##abi##_##name, ERRNO); \ + asmlinkage long __##abi##_##name(const struct pt_regs *regs) \ + __alias(__do_##name); + #define __SYS_STUBx(abi, name, ...) \ asmlinkage long __##abi##_##name(const struct pt_regs *regs); \ ALLOW_ERROR_INJECTION(__##abi##_##name, ERRNO); \ @@ -30,10 +36,14 @@ struct pt_regs; } #ifdef CONFIG_X86_64 +#define __X64_SYS_STUB0(name) \ + __SYS_STUB0(x64, sys_##name) + #define __X64_SYS_STUBx(x, name, ...) \ __SYS_STUBx(x64, sys##name, \ SC_X86_64_REGS_TO_ARGS(x, __VA_ARGS__)) #else /* CONFIG_X86_64 */ +#define __X64_SYS_STUB0(name) #define __X64_SYS_STUBx(x, name, ...) #endif /* CONFIG_X86_64 */ @@ -46,34 +56,20 @@ struct pt_regs; * kernel/sys_ni.c and SYS_NI in kernel/time/posix-stubs.c to cover this * case as well. */ -#define __IA32_COMPAT_SYS_STUB0(x, name) \ - asmlinkage long __ia32_compat_sys_##name(const struct pt_regs *regs);\ - ALLOW_ERROR_INJECTION(__ia32_compat_sys_##name, ERRNO); \ - asmlinkage long __ia32_compat_sys_##name(const struct pt_regs *regs)\ - { \ - return __se_compat_sys_##name(); \ - } +#define __IA32_COMPAT_SYS_STUB0(name) \ + __SYS_STUB0(ia32, compat_sys_##name) #define __IA32_COMPAT_SYS_STUBx(x, name, ...) \ __SYS_STUBx(ia32, compat_sys##name, \ SC_IA32_REGS_TO_ARGS(x, __VA_ARGS__)) +#define __IA32_SYS_STUB0(name) \ + __SYS_STUB0(ia32, sys_##name) + #define __IA32_SYS_STUBx(x, name, ...) \ __SYS_STUBx(ia32, sys##name, \ SC_IA32_REGS_TO_ARGS(x, __VA_ARGS__)) -/* - * To keep the naming coherent, re-define SYSCALL_DEFINE0 to create an alias - * named __ia32_sys_*() - */ - -#define SYSCALL_DEFINE0(sname) \ - SYSCALL_METADATA(_##sname, 0); \ - asmlinkage long __x64_sys_##sname(const struct pt_regs *__unused);\ - ALLOW_ERROR_INJECTION(__x64_sys_##sname, ERRNO); \ - SYSCALL_ALIAS(__ia32_sys_##sname, __x64_sys_##sname); \ - asmlinkage long __x64_sys_##sname(const struct pt_regs *__unused) - #define COND_SYSCALL(name) \ asmlinkage __weak long __x64_sys_##name(const struct pt_regs *__unused) \ { \ @@ -89,7 +85,9 @@ struct pt_regs; SYSCALL_ALIAS(__ia32_sys_##name, sys_ni_posix_timers) #else /* CONFIG_IA32_EMULATION */ +#define __IA32_COMPAT_SYS_STUB0(name) #define __IA32_COMPAT_SYS_STUBx(x, name, ...) +#define __IA32_SYS_STUB0(name) #define __IA32_SYS_STUBx(x, name, ...) #endif /* CONFIG_IA32_EMULATION */ @@ -100,20 +98,15 @@ struct pt_regs; * of the x86-64-style parameter ordering of x32 syscalls. The syscalls common * with x86_64 obviously do not need such care. */ -#define __X32_COMPAT_SYS_STUB0(x, name, ...) \ - asmlinkage long __x32_compat_sys_##name(const struct pt_regs *regs);\ - ALLOW_ERROR_INJECTION(__x32_compat_sys_##name, ERRNO); \ - asmlinkage long __x32_compat_sys_##name(const struct pt_regs *regs)\ - { \ - return __se_compat_sys_##name();\ - } +#define __X32_COMPAT_SYS_STUB0(name) \ + __SYS_STUB0(x32, compat_sys_##name) #define __X32_COMPAT_SYS_STUBx(x, name, ...) \ __SYS_STUBx(x32, compat_sys##name, \ SC_X86_64_REGS_TO_ARGS(x, __VA_ARGS__)) #else /* CONFIG_X86_X32 */ -#define __X32_COMPAT_SYS_STUB0(x, name) +#define __X32_COMPAT_SYS_STUB0(name) #define __X32_COMPAT_SYS_STUBx(x, name, ...) #endif /* CONFIG_X86_X32 */ @@ -125,15 +118,12 @@ struct pt_regs; * of them. */ #define COMPAT_SYSCALL_DEFINE0(name) \ - static long __se_compat_sys_##name(void); \ - static inline long __do_compat_sys_##name(void); \ - __IA32_COMPAT_SYS_STUB0(x, name) \ - __X32_COMPAT_SYS_STUB0(x, name) \ - static long __se_compat_sys_##name(void) \ - { \ - return __do_compat_sys_##name(); \ - } \ - static inline long __do_compat_sys_##name(void) + static asmlinkage long \ + __do_compat_sys_##name(const struct pt_regs *__unused); \ + __IA32_COMPAT_SYS_STUB0(name) \ + __X32_COMPAT_SYS_STUB0(name) \ + static asmlinkage long \ + __do_compat_sys_##name(const struct pt_regs *__unused) #define COMPAT_SYSCALL_DEFINEx(x, name, ...) \ static long __se_compat_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \ @@ -216,13 +206,14 @@ struct pt_regs; * SYSCALL_DEFINEx() -- which is essential for the COND_SYSCALL() and SYS_NI() * macros to work correctly. */ -#ifndef SYSCALL_DEFINE0 #define SYSCALL_DEFINE0(sname) \ SYSCALL_METADATA(_##sname, 0); \ - asmlinkage long __x64_sys_##sname(const struct pt_regs *__unused);\ - ALLOW_ERROR_INJECTION(__x64_sys_##sname, ERRNO); \ - asmlinkage long __x64_sys_##sname(const struct pt_regs *__unused) -#endif + static asmlinkage long \ + __do_sys_##sname(const struct pt_regs *__unused); \ + __X64_SYS_STUB0(sname) \ + __IA32_SYS_STUB0(sname) \ + static asmlinkage long \ + __do_sys_##sname(const struct pt_regs *__unused) #ifndef COND_SYSCALL #define COND_SYSCALL(name) \ -- cgit From 6cc8d2b286d9e7168d72e342d1b031317cd7752b Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:29 -0400 Subject: x86/entry: Refactor COND_SYSCALL macros Pull the common code out from the COND_SYSCALL macros into a new __COND_SYSCALL macro. Also conditionalize the X64 version in preparation for enabling syscall wrappers on 32-bit native kernels. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200313195144.164260-4-brgerst@gmail.com --- arch/x86/include/asm/syscall_wrapper.h | 44 ++++++++++++++++++++-------------- 1 file changed, 26 insertions(+), 18 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/syscall_wrapper.h b/arch/x86/include/asm/syscall_wrapper.h index 45ad60571166..0117b25e6753 100644 --- a/arch/x86/include/asm/syscall_wrapper.h +++ b/arch/x86/include/asm/syscall_wrapper.h @@ -35,6 +35,13 @@ struct pt_regs; return __se_##name(__VA_ARGS__); \ } +#define __COND_SYSCALL(abi, name) \ + asmlinkage __weak long \ + __##abi##_##name(const struct pt_regs *__unused) \ + { \ + return sys_ni_syscall(); \ + } + #ifdef CONFIG_X86_64 #define __X64_SYS_STUB0(name) \ __SYS_STUB0(x64, sys_##name) @@ -42,9 +49,13 @@ struct pt_regs; #define __X64_SYS_STUBx(x, name, ...) \ __SYS_STUBx(x64, sys##name, \ SC_X86_64_REGS_TO_ARGS(x, __VA_ARGS__)) + +#define __X64_COND_SYSCALL(name) \ + __COND_SYSCALL(x64, sys_##name) #else /* CONFIG_X86_64 */ #define __X64_SYS_STUB0(name) #define __X64_SYS_STUBx(x, name, ...) +#define __X64_COND_SYSCALL(name) #endif /* CONFIG_X86_64 */ #ifdef CONFIG_IA32_EMULATION @@ -63,6 +74,9 @@ struct pt_regs; __SYS_STUBx(ia32, compat_sys##name, \ SC_IA32_REGS_TO_ARGS(x, __VA_ARGS__)) +#define __IA32_COMPAT_COND_SYSCALL(name) \ + __COND_SYSCALL(ia32, compat_sys_##name) + #define __IA32_SYS_STUB0(name) \ __SYS_STUB0(ia32, sys_##name) @@ -70,15 +84,8 @@ struct pt_regs; __SYS_STUBx(ia32, sys##name, \ SC_IA32_REGS_TO_ARGS(x, __VA_ARGS__)) -#define COND_SYSCALL(name) \ - asmlinkage __weak long __x64_sys_##name(const struct pt_regs *__unused) \ - { \ - return sys_ni_syscall(); \ - } \ - asmlinkage __weak long __ia32_sys_##name(const struct pt_regs *__unused)\ - { \ - return sys_ni_syscall(); \ - } +#define __IA32_COND_SYSCALL(name) \ + __COND_SYSCALL(ia32, sys_##name) #define SYS_NI(name) \ SYSCALL_ALIAS(__x64_sys_##name, sys_ni_posix_timers); \ @@ -87,8 +94,10 @@ struct pt_regs; #else /* CONFIG_IA32_EMULATION */ #define __IA32_COMPAT_SYS_STUB0(name) #define __IA32_COMPAT_SYS_STUBx(x, name, ...) +#define __IA32_COMPAT_COND_SYSCALL(name) #define __IA32_SYS_STUB0(name) #define __IA32_SYS_STUBx(x, name, ...) +#define __IA32_COND_SYSCALL(name) #endif /* CONFIG_IA32_EMULATION */ @@ -105,9 +114,12 @@ struct pt_regs; __SYS_STUBx(x32, compat_sys##name, \ SC_X86_64_REGS_TO_ARGS(x, __VA_ARGS__)) +#define __X32_COMPAT_COND_SYSCALL(name) \ + __COND_SYSCALL(x32, compat_sys_##name) #else /* CONFIG_X86_X32 */ #define __X32_COMPAT_SYS_STUB0(name) #define __X32_COMPAT_SYS_STUBx(x, name, ...) +#define __X32_COMPAT_COND_SYSCALL(name) #endif /* CONFIG_X86_X32 */ @@ -142,8 +154,8 @@ struct pt_regs; * kernel/time/posix-stubs.c to cover this case as well. */ #define COND_SYSCALL_COMPAT(name) \ - cond_syscall(__ia32_compat_sys_##name); \ - cond_syscall(__x32_compat_sys_##name) + __IA32_COMPAT_COND_SYSCALL(name) \ + __X32_COMPAT_COND_SYSCALL(name) #define COMPAT_SYS_NI(name) \ SYSCALL_ALIAS(__ia32_compat_sys_##name, sys_ni_posix_timers); \ @@ -215,13 +227,9 @@ struct pt_regs; static asmlinkage long \ __do_sys_##sname(const struct pt_regs *__unused) -#ifndef COND_SYSCALL -#define COND_SYSCALL(name) \ - asmlinkage __weak long __x64_sys_##name(const struct pt_regs *__unused) \ - { \ - return sys_ni_syscall(); \ - } -#endif +#define COND_SYSCALL(name) \ + __X64_COND_SYSCALL(name) \ + __IA32_COND_SYSCALL(name) #ifndef SYS_NI #define SYS_NI(name) SYSCALL_ALIAS(__x64_sys_##name, sys_ni_posix_timers); -- cgit From a74d187c2df3a382b8ab6227da34cba690e71e4d Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:30 -0400 Subject: x86/entry: Refactor SYS_NI macros Pull the common code out from the SYS_NI macros into a new __SYS_NI macro. Also conditionalize the X64 version in preparation for enabling syscall wrappers on 32-bit native kernels. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200313195144.164260-5-brgerst@gmail.com --- arch/x86/include/asm/syscall_wrapper.h | 32 +++++++++++++++++++++++--------- 1 file changed, 23 insertions(+), 9 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/syscall_wrapper.h b/arch/x86/include/asm/syscall_wrapper.h index 0117b25e6753..1d96ccebc0d2 100644 --- a/arch/x86/include/asm/syscall_wrapper.h +++ b/arch/x86/include/asm/syscall_wrapper.h @@ -42,6 +42,9 @@ struct pt_regs; return sys_ni_syscall(); \ } +#define __SYS_NI(abi, name) \ + SYSCALL_ALIAS(__##abi##_##name, sys_ni_posix_timers) + #ifdef CONFIG_X86_64 #define __X64_SYS_STUB0(name) \ __SYS_STUB0(x64, sys_##name) @@ -52,10 +55,14 @@ struct pt_regs; #define __X64_COND_SYSCALL(name) \ __COND_SYSCALL(x64, sys_##name) + +#define __X64_SYS_NI(name) \ + __SYS_NI(x64, sys_##name) #else /* CONFIG_X86_64 */ #define __X64_SYS_STUB0(name) #define __X64_SYS_STUBx(x, name, ...) #define __X64_COND_SYSCALL(name) +#define __X64_SYS_NI(name) #endif /* CONFIG_X86_64 */ #ifdef CONFIG_IA32_EMULATION @@ -77,6 +84,9 @@ struct pt_regs; #define __IA32_COMPAT_COND_SYSCALL(name) \ __COND_SYSCALL(ia32, compat_sys_##name) +#define __IA32_COMPAT_SYS_NI(name) \ + __SYS_NI(ia32, compat_sys_##name) + #define __IA32_SYS_STUB0(name) \ __SYS_STUB0(ia32, sys_##name) @@ -87,17 +97,17 @@ struct pt_regs; #define __IA32_COND_SYSCALL(name) \ __COND_SYSCALL(ia32, sys_##name) -#define SYS_NI(name) \ - SYSCALL_ALIAS(__x64_sys_##name, sys_ni_posix_timers); \ - SYSCALL_ALIAS(__ia32_sys_##name, sys_ni_posix_timers) - +#define __IA32_SYS_NI(name) \ + __SYS_NI(ia32, sys_##name) #else /* CONFIG_IA32_EMULATION */ #define __IA32_COMPAT_SYS_STUB0(name) #define __IA32_COMPAT_SYS_STUBx(x, name, ...) #define __IA32_COMPAT_COND_SYSCALL(name) +#define __IA32_COMPAT_SYS_NI(name) #define __IA32_SYS_STUB0(name) #define __IA32_SYS_STUBx(x, name, ...) #define __IA32_COND_SYSCALL(name) +#define __IA32_SYS_NI(name) #endif /* CONFIG_IA32_EMULATION */ @@ -116,10 +126,14 @@ struct pt_regs; #define __X32_COMPAT_COND_SYSCALL(name) \ __COND_SYSCALL(x32, compat_sys_##name) + +#define __X32_COMPAT_SYS_NI(name) \ + __SYS_NI(x32, compat_sys_##name) #else /* CONFIG_X86_X32 */ #define __X32_COMPAT_SYS_STUB0(name) #define __X32_COMPAT_SYS_STUBx(x, name, ...) #define __X32_COMPAT_COND_SYSCALL(name) +#define __X32_COMPAT_SYS_NI(name) #endif /* CONFIG_X86_X32 */ @@ -158,8 +172,8 @@ struct pt_regs; __X32_COMPAT_COND_SYSCALL(name) #define COMPAT_SYS_NI(name) \ - SYSCALL_ALIAS(__ia32_compat_sys_##name, sys_ni_posix_timers); \ - SYSCALL_ALIAS(__x32_compat_sys_##name, sys_ni_posix_timers) + __IA32_COMPAT_SYS_NI(name) \ + __X32_COMPAT_SYS_NI(name) #endif /* CONFIG_COMPAT */ @@ -231,9 +245,9 @@ struct pt_regs; __X64_COND_SYSCALL(name) \ __IA32_COND_SYSCALL(name) -#ifndef SYS_NI -#define SYS_NI(name) SYSCALL_ALIAS(__x64_sys_##name, sys_ni_posix_timers); -#endif +#define SYS_NI(name) \ + __X64_SYS_NI(name) \ + __IA32_SYS_NI(name) /* -- cgit From 27dd84fafcd5e3c565164bb303fe8ec8ef59e147 Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:31 -0400 Subject: x86/entry/64: Use syscall wrappers for x32_rt_sigreturn Add missing syscall wrapper for x32_rt_sigreturn(). Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200313195144.164260-6-brgerst@gmail.com --- arch/x86/entry/syscalls/syscall_64.tbl | 2 +- arch/x86/include/asm/sighandling.h | 5 ----- arch/x86/kernel/signal.c | 2 +- 3 files changed, 2 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 44d510bc9b78..0b5a25bc9998 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -367,7 +367,7 @@ # is defined. # 512 x32 rt_sigaction __x32_compat_sys_rt_sigaction -513 x32 rt_sigreturn sys32_x32_rt_sigreturn +513 x32 rt_sigreturn __x32_compat_sys_x32_rt_sigreturn 514 x32 ioctl __x32_compat_sys_ioctl 515 x32 readv __x32_compat_sys_readv 516 x32 writev __x32_compat_sys_writev diff --git a/arch/x86/include/asm/sighandling.h b/arch/x86/include/asm/sighandling.h index 2fcbd6f33ef7..bd26834724e5 100644 --- a/arch/x86/include/asm/sighandling.h +++ b/arch/x86/include/asm/sighandling.h @@ -17,9 +17,4 @@ void signal_fault(struct pt_regs *regs, void __user *frame, char *where); int setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate, struct pt_regs *regs, unsigned long mask); - -#ifdef CONFIG_X86_X32_ABI -asmlinkage long sys32_x32_rt_sigreturn(void); -#endif - #endif /* _ASM_X86_SIGHANDLING_H */ diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 8a29573851a3..860904990b26 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -859,7 +859,7 @@ void signal_fault(struct pt_regs *regs, void __user *frame, char *where) } #ifdef CONFIG_X86_X32_ABI -asmlinkage long sys32_x32_rt_sigreturn(void) +COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn) { struct pt_regs *regs = current_pt_regs(); struct rt_sigframe_x32 __user *frame; -- cgit From cc42c045af1ff4dee875196f8fe7d6ed1f29ea64 Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:32 -0400 Subject: x86/entry/64: Move sys_ni_syscall stub to common.c so it can be available to multiple syscall tables. Also directly return -ENOSYS instead of bouncing to the generic sys_ni_syscall(). Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200313195144.164260-7-brgerst@gmail.com --- arch/x86/entry/common.c | 7 +++++++ arch/x86/entry/syscall_64.c | 7 ------- arch/x86/include/asm/syscall_wrapper.h | 3 +++ 3 files changed, 10 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 9747876980b5..149bf54501ff 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -438,3 +438,10 @@ __visible long do_fast_syscall_32(struct pt_regs *regs) #endif } #endif + +#ifdef CONFIG_X86_64 +SYSCALL_DEFINE0(ni_syscall) +{ + return -ENOSYS; +} +#endif diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c index adf619a856e8..058dc1b73e96 100644 --- a/arch/x86/entry/syscall_64.c +++ b/arch/x86/entry/syscall_64.c @@ -8,13 +8,6 @@ #include #include -extern asmlinkage long sys_ni_syscall(void); - -SYSCALL_DEFINE0(ni_syscall) -{ - return sys_ni_syscall(); -} - #define __SYSCALL_64(nr, sym, qual) extern asmlinkage long sym(const struct pt_regs *); #define __SYSCALL_X32(nr, sym, qual) __SYSCALL_64(nr, sym, qual) #include diff --git a/arch/x86/include/asm/syscall_wrapper.h b/arch/x86/include/asm/syscall_wrapper.h index 1d96ccebc0d2..0f126e40a464 100644 --- a/arch/x86/include/asm/syscall_wrapper.h +++ b/arch/x86/include/asm/syscall_wrapper.h @@ -8,6 +8,9 @@ struct pt_regs; +extern asmlinkage long __x64_sys_ni_syscall(const struct pt_regs *regs); +extern asmlinkage long __ia32_sys_ni_syscall(const struct pt_regs *regs); + /* Mapping of registers to parameters for syscalls on x86-64 and x32 */ #define SC_X86_64_REGS_TO_ARGS(x, ...) \ __MAP(x,__SC_ARGS \ -- cgit From 2e487c357917b98adc6a6dafa612c435cad1af41 Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:33 -0400 Subject: x86/entry/64: Split X32 syscall table into its own file Since X32 has its own syscall table now, move it to a separate file. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Link: https://lkml.kernel.org/r/20200313195144.164260-8-brgerst@gmail.com --- arch/x86/entry/Makefile | 1 + arch/x86/entry/syscall_64.c | 27 ++------------------------- arch/x86/entry/syscall_x32.c | 26 ++++++++++++++++++++++++++ 3 files changed, 29 insertions(+), 25 deletions(-) create mode 100644 arch/x86/entry/syscall_x32.c (limited to 'arch/x86') diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile index 06fc70cf5433..85eb381259c2 100644 --- a/arch/x86/entry/Makefile +++ b/arch/x86/entry/Makefile @@ -14,4 +14,5 @@ obj-y += vdso/ obj-y += vsyscall/ obj-$(CONFIG_IA32_EMULATION) += entry_64_compat.o syscall_32.o +obj-$(CONFIG_X86_X32_ABI) += syscall_x32.o diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c index 058dc1b73e96..efb85c6dce13 100644 --- a/arch/x86/entry/syscall_64.c +++ b/arch/x86/entry/syscall_64.c @@ -8,14 +8,13 @@ #include #include +#define __SYSCALL_X32(nr, sym, qual) + #define __SYSCALL_64(nr, sym, qual) extern asmlinkage long sym(const struct pt_regs *); -#define __SYSCALL_X32(nr, sym, qual) __SYSCALL_64(nr, sym, qual) #include #undef __SYSCALL_64 -#undef __SYSCALL_X32 #define __SYSCALL_64(nr, sym, qual) [nr] = sym, -#define __SYSCALL_X32(nr, sym, qual) asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = { /* @@ -25,25 +24,3 @@ asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = { [0 ... __NR_syscall_max] = &__x64_sys_ni_syscall, #include }; - -#undef __SYSCALL_64 -#undef __SYSCALL_X32 - -#ifdef CONFIG_X86_X32_ABI - -#define __SYSCALL_64(nr, sym, qual) -#define __SYSCALL_X32(nr, sym, qual) [nr] = sym, - -asmlinkage const sys_call_ptr_t x32_sys_call_table[__NR_syscall_x32_max+1] = { - /* - * Smells like a compiler bug -- it doesn't work - * when the & below is removed. - */ - [0 ... __NR_syscall_x32_max] = &__x64_sys_ni_syscall, -#include -}; - -#undef __SYSCALL_64 -#undef __SYSCALL_X32 - -#endif diff --git a/arch/x86/entry/syscall_x32.c b/arch/x86/entry/syscall_x32.c new file mode 100644 index 000000000000..d144ced7f582 --- /dev/null +++ b/arch/x86/entry/syscall_x32.c @@ -0,0 +1,26 @@ +// SPDX-License-Identifier: GPL-2.0 +/* System call table for x32 ABI. */ + +#include +#include +#include +#include +#include +#include + +#define __SYSCALL_64(nr, sym, qual) + +#define __SYSCALL_X32(nr, sym, qual) extern asmlinkage long sym(const struct pt_regs *); +#include +#undef __SYSCALL_X32 + +#define __SYSCALL_X32(nr, sym, qual) [nr] = sym, + +asmlinkage const sys_call_ptr_t x32_sys_call_table[__NR_syscall_x32_max+1] = { + /* + * Smells like a compiler bug -- it doesn't work + * when the & below is removed. + */ + [0 ... __NR_syscall_x32_max] = &__x64_sys_ni_syscall, +#include +}; -- cgit From 0872098804b5f44bab91f80b6df55df32894fee3 Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:34 -0400 Subject: x86/entry: Move max syscall number calculation to syscallhdr.sh Instead of using an array in asm-offsets to calculate the max syscall number, calculate it when writing out the syscall headers. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200313195144.164260-9-brgerst@gmail.com --- arch/x86/entry/syscall_32.c | 6 ++--- arch/x86/entry/syscall_64.c | 2 +- arch/x86/entry/syscall_x32.c | 6 ++--- arch/x86/entry/syscalls/syscallhdr.sh | 7 ++++++ arch/x86/entry/vdso/vdso32/vclock_gettime.c | 1 + arch/x86/include/asm/syscall.h | 3 --- arch/x86/include/asm/unistd.h | 7 ++++++ arch/x86/kernel/asm-offsets_32.c | 9 -------- arch/x86/kernel/asm-offsets_64.c | 36 ----------------------------- arch/x86/um/sys_call_table_32.c | 2 +- arch/x86/um/sys_call_table_64.c | 2 +- arch/x86/um/user-offsets.c | 15 ------------ 12 files changed, 24 insertions(+), 72 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c index 7d17b3addbbb..3207cf685cde 100644 --- a/arch/x86/entry/syscall_32.c +++ b/arch/x86/entry/syscall_32.c @@ -4,7 +4,7 @@ #include #include #include -#include +#include #include #ifdef CONFIG_IA32_EMULATION @@ -22,11 +22,11 @@ extern asmlinkage long sys_ni_syscall(unsigned long, unsigned long, unsigned lon #define __SYSCALL_I386(nr, sym, qual) [nr] = sym, -__visible const sys_call_ptr_t ia32_sys_call_table[__NR_syscall_compat_max+1] = { +__visible const sys_call_ptr_t ia32_sys_call_table[__NR_ia32_syscall_max+1] = { /* * Smells like a compiler bug -- it doesn't work * when the & below is removed. */ - [0 ... __NR_syscall_compat_max] = &__sys_ni_syscall, + [0 ... __NR_ia32_syscall_max] = &__sys_ni_syscall, #include }; diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c index efb85c6dce13..5dc6846979df 100644 --- a/arch/x86/entry/syscall_64.c +++ b/arch/x86/entry/syscall_64.c @@ -5,7 +5,7 @@ #include #include #include -#include +#include #include #define __SYSCALL_X32(nr, sym, qual) diff --git a/arch/x86/entry/syscall_x32.c b/arch/x86/entry/syscall_x32.c index d144ced7f582..95abb6da0ecc 100644 --- a/arch/x86/entry/syscall_x32.c +++ b/arch/x86/entry/syscall_x32.c @@ -5,7 +5,7 @@ #include #include #include -#include +#include #include #define __SYSCALL_64(nr, sym, qual) @@ -16,11 +16,11 @@ #define __SYSCALL_X32(nr, sym, qual) [nr] = sym, -asmlinkage const sys_call_ptr_t x32_sys_call_table[__NR_syscall_x32_max+1] = { +asmlinkage const sys_call_ptr_t x32_sys_call_table[__NR_x32_syscall_max+1] = { /* * Smells like a compiler bug -- it doesn't work * when the & below is removed. */ - [0 ... __NR_syscall_x32_max] = &__x64_sys_ni_syscall, + [0 ... __NR_x32_syscall_max] = &__x64_sys_ni_syscall, #include }; diff --git a/arch/x86/entry/syscalls/syscallhdr.sh b/arch/x86/entry/syscalls/syscallhdr.sh index 12fbbcfe7ef3..cc1e63857427 100644 --- a/arch/x86/entry/syscalls/syscallhdr.sh +++ b/arch/x86/entry/syscalls/syscallhdr.sh @@ -15,14 +15,21 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | ( echo "#define ${fileguard} 1" echo "" + max=0 while read nr abi name entry ; do if [ -z "$offset" ]; then echo "#define __NR_${prefix}${name} $nr" else echo "#define __NR_${prefix}${name} ($offset + $nr)" fi + + max=$nr done + echo "" + echo "#ifdef __KERNEL__" + echo "#define __NR_${prefix}syscall_max $max" + echo "#endif" echo "" echo "#endif /* ${fileguard} */" ) > "$out" diff --git a/arch/x86/entry/vdso/vdso32/vclock_gettime.c b/arch/x86/entry/vdso/vdso32/vclock_gettime.c index 9242b28418d5..1e82bd43286c 100644 --- a/arch/x86/entry/vdso/vdso32/vclock_gettime.c +++ b/arch/x86/entry/vdso/vdso32/vclock_gettime.c @@ -13,6 +13,7 @@ */ #undef CONFIG_64BIT #undef CONFIG_X86_64 +#undef CONFIG_COMPAT #undef CONFIG_PGTABLE_LEVELS #undef CONFIG_ILLEGAL_POINTER_VALUE #undef CONFIG_SPARSEMEM_VMEMMAP diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index 8db3fdb6102e..1ef8d7bffa15 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -13,7 +13,6 @@ #include #include #include -#include /* For NR_syscalls */ #include /* for TS_COMPAT */ #include @@ -28,8 +27,6 @@ extern const sys_call_ptr_t sys_call_table[]; #if defined(CONFIG_X86_32) #define ia32_sys_call_table sys_call_table -#define __NR_syscall_compat_max __NR_syscall_max -#define IA32_NR_syscalls NR_syscalls #endif #if defined(CONFIG_IA32_EMULATION) diff --git a/arch/x86/include/asm/unistd.h b/arch/x86/include/asm/unistd.h index a7dd080749ce..c1c3d31b15c0 100644 --- a/arch/x86/include/asm/unistd.h +++ b/arch/x86/include/asm/unistd.h @@ -13,10 +13,13 @@ # define __ARCH_WANT_SYS_OLD_MMAP # define __ARCH_WANT_SYS_OLD_SELECT +# define __NR_ia32_syscall_max __NR_syscall_max + # else # include # include +# include # define __ARCH_WANT_SYS_TIME # define __ARCH_WANT_SYS_UTIME # define __ARCH_WANT_COMPAT_SYS_PREADV64 @@ -26,6 +29,10 @@ # endif +# define NR_syscalls (__NR_syscall_max + 1) +# define X32_NR_syscalls (__NR_x32_syscall_max + 1) +# define IA32_NR_syscalls (__NR_ia32_syscall_max + 1) + # define __ARCH_WANT_NEW_STAT # define __ARCH_WANT_OLD_READDIR # define __ARCH_WANT_OLD_STAT diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c index 82826f2275cc..0dcac420b341 100644 --- a/arch/x86/kernel/asm-offsets_32.c +++ b/arch/x86/kernel/asm-offsets_32.c @@ -5,11 +5,6 @@ #include -#define __SYSCALL_I386(nr, sym, qual) [nr] = 1, -static char syscalls[] = { -#include -}; - /* workaround for a warning with -Wmissing-prototypes */ void foo(void); @@ -60,8 +55,4 @@ void foo(void) BLANK(); OFFSET(stack_canary_offset, stack_canary, canary); #endif - - BLANK(); - DEFINE(__NR_syscall_max, sizeof(syscalls) - 1); - DEFINE(NR_syscalls, sizeof(syscalls)); } diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c index 24d2fde30d00..c2a47016f243 100644 --- a/arch/x86/kernel/asm-offsets_64.c +++ b/arch/x86/kernel/asm-offsets_64.c @@ -5,30 +5,6 @@ #include -#define __SYSCALL_64(nr, sym, qual) [nr] = 1, -#define __SYSCALL_X32(nr, sym, qual) -static char syscalls_64[] = { -#include -}; -#undef __SYSCALL_64 -#undef __SYSCALL_X32 - -#ifdef CONFIG_X86_X32_ABI -#define __SYSCALL_64(nr, sym, qual) -#define __SYSCALL_X32(nr, sym, qual) [nr] = 1, -static char syscalls_x32[] = { -#include -}; -#undef __SYSCALL_64 -#undef __SYSCALL_X32 -#endif - -#define __SYSCALL_I386(nr, sym, qual) [nr] = 1, -static char syscalls_ia32[] = { -#include -}; -#undef __SYSCALL_I386 - #if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS) #include #endif @@ -90,17 +66,5 @@ int main(void) DEFINE(stack_canary_offset, offsetof(struct fixed_percpu_data, stack_canary)); BLANK(); #endif - - DEFINE(__NR_syscall_max, sizeof(syscalls_64) - 1); - DEFINE(NR_syscalls, sizeof(syscalls_64)); - -#ifdef CONFIG_X86_X32_ABI - DEFINE(__NR_syscall_x32_max, sizeof(syscalls_x32) - 1); - DEFINE(X32_NR_syscalls, sizeof(syscalls_x32)); -#endif - - DEFINE(__NR_syscall_compat_max, sizeof(syscalls_ia32) - 1); - DEFINE(IA32_NR_syscalls, sizeof(syscalls_ia32)); - return 0; } diff --git a/arch/x86/um/sys_call_table_32.c b/arch/x86/um/sys_call_table_32.c index 9649b5ad2ca2..a0d75c560030 100644 --- a/arch/x86/um/sys_call_table_32.c +++ b/arch/x86/um/sys_call_table_32.c @@ -7,7 +7,7 @@ #include #include #include -#include +#include #include #define __NO_STUBS diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c index c8bc7fb8cbd6..fa97740badd5 100644 --- a/arch/x86/um/sys_call_table_64.c +++ b/arch/x86/um/sys_call_table_64.c @@ -7,7 +7,7 @@ #include #include #include -#include +#include #include #define __NO_STUBS diff --git a/arch/x86/um/user-offsets.c b/arch/x86/um/user-offsets.c index 5b37b7f952dd..c51dd8363d25 100644 --- a/arch/x86/um/user-offsets.c +++ b/arch/x86/um/user-offsets.c @@ -9,18 +9,6 @@ #include #include -#ifdef __i386__ -#define __SYSCALL_I386(nr, sym, qual) [nr] = 1, -static char syscalls[] = { -#include -}; -#else -#define __SYSCALL_64(nr, sym, qual) [nr] = 1, -static char syscalls[] = { -#include -}; -#endif - #define DEFINE(sym, val) \ asm volatile("\n->" #sym " %0 " #val : : "i" (val)) @@ -94,7 +82,4 @@ void foo(void) DEFINE(UM_PROT_READ, PROT_READ); DEFINE(UM_PROT_WRITE, PROT_WRITE); DEFINE(UM_PROT_EXEC, PROT_EXEC); - - DEFINE(__NR_syscall_max, sizeof(syscalls) - 1); - DEFINE(NR_syscalls, sizeof(syscalls)); } -- cgit From d3b1b776eefcc1695d15730f37cdaab201fd8707 Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:35 -0400 Subject: x86/entry/64: Remove ptregs qualifier from syscall table Now that the fast syscall path is removed, the ptregs qualifier is unused. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Link: https://lkml.kernel.org/r/20200313195144.164260-10-brgerst@gmail.com --- arch/x86/entry/syscalls/syscall_64.tbl | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 0b5a25bc9998..e8fb7222a3bd 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -23,7 +23,7 @@ 12 common brk __x64_sys_brk 13 64 rt_sigaction __x64_sys_rt_sigaction 14 common rt_sigprocmask __x64_sys_rt_sigprocmask -15 64 rt_sigreturn __x64_sys_rt_sigreturn/ptregs +15 64 rt_sigreturn __x64_sys_rt_sigreturn 16 64 ioctl __x64_sys_ioctl 17 common pread64 __x64_sys_pread64 18 common pwrite64 __x64_sys_pwrite64 @@ -64,10 +64,10 @@ 53 common socketpair __x64_sys_socketpair 54 64 setsockopt __x64_sys_setsockopt 55 64 getsockopt __x64_sys_getsockopt -56 common clone __x64_sys_clone/ptregs -57 common fork __x64_sys_fork/ptregs -58 common vfork __x64_sys_vfork/ptregs -59 64 execve __x64_sys_execve/ptregs +56 common clone __x64_sys_clone +57 common fork __x64_sys_fork +58 common vfork __x64_sys_vfork +59 64 execve __x64_sys_execve 60 common exit __x64_sys_exit 61 common wait4 __x64_sys_wait4 62 common kill __x64_sys_kill @@ -180,7 +180,7 @@ 169 common reboot __x64_sys_reboot 170 common sethostname __x64_sys_sethostname 171 common setdomainname __x64_sys_setdomainname -172 common iopl __x64_sys_iopl/ptregs +172 common iopl __x64_sys_iopl 173 common ioperm __x64_sys_ioperm 174 64 create_module 175 common init_module __x64_sys_init_module @@ -330,7 +330,7 @@ 319 common memfd_create __x64_sys_memfd_create 320 common kexec_file_load __x64_sys_kexec_file_load 321 common bpf __x64_sys_bpf -322 64 execveat __x64_sys_execveat/ptregs +322 64 execveat __x64_sys_execveat 323 common userfaultfd __x64_sys_userfaultfd 324 common membarrier __x64_sys_membarrier 325 common mlock2 __x64_sys_mlock2 @@ -356,7 +356,7 @@ 432 common fsmount __x64_sys_fsmount 433 common fspick __x64_sys_fspick 434 common pidfd_open __x64_sys_pidfd_open -435 common clone3 __x64_sys_clone3/ptregs +435 common clone3 __x64_sys_clone3 437 common openat2 __x64_sys_openat2 438 common pidfd_getfd __x64_sys_pidfd_getfd @@ -374,7 +374,7 @@ 517 x32 recvfrom __x32_compat_sys_recvfrom 518 x32 sendmsg __x32_compat_sys_sendmsg 519 x32 recvmsg __x32_compat_sys_recvmsg -520 x32 execve __x32_compat_sys_execve/ptregs +520 x32 execve __x32_compat_sys_execve 521 x32 ptrace __x32_compat_sys_ptrace 522 x32 rt_sigpending __x32_compat_sys_rt_sigpending 523 x32 rt_sigtimedwait __x32_compat_sys_rt_sigtimedwait_time64 @@ -399,6 +399,6 @@ 542 x32 getsockopt __x32_compat_sys_getsockopt 543 x32 io_setup __x32_compat_sys_io_setup 544 x32 io_submit __x32_compat_sys_io_submit -545 x32 execveat __x32_compat_sys_execveat/ptregs +545 x32 execveat __x32_compat_sys_execveat 546 x32 preadv2 __x32_compat_sys_preadv64v2 547 x32 pwritev2 __x32_compat_sys_pwritev64v2 -- cgit From b5592e5c0d8682e0e82b42c91fd7265c3cce19e1 Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:36 -0400 Subject: x86/entry: Remove syscall qualifier support Syscall qualifier support is no longer needed. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Link: https://lkml.kernel.org/r/20200313195144.164260-11-brgerst@gmail.com --- arch/x86/entry/syscall_32.c | 6 +++--- arch/x86/entry/syscall_64.c | 6 +++--- arch/x86/entry/syscall_x32.c | 6 +++--- arch/x86/entry/syscalls/syscalltbl.sh | 10 +--------- arch/x86/um/sys_call_table_32.c | 4 ++-- arch/x86/um/sys_call_table_64.c | 4 ++-- 6 files changed, 14 insertions(+), 22 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c index 3207cf685cde..9f4cab4a56d8 100644 --- a/arch/x86/entry/syscall_32.c +++ b/arch/x86/entry/syscall_32.c @@ -9,10 +9,10 @@ #ifdef CONFIG_IA32_EMULATION /* On X86_64, we use struct pt_regs * to pass parameters to syscalls */ -#define __SYSCALL_I386(nr, sym, qual) extern asmlinkage long sym(const struct pt_regs *); +#define __SYSCALL_I386(nr, sym) extern asmlinkage long sym(const struct pt_regs *); #define __sys_ni_syscall __ia32_sys_ni_syscall #else /* CONFIG_IA32_EMULATION */ -#define __SYSCALL_I386(nr, sym, qual) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); +#define __SYSCALL_I386(nr, sym) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); extern asmlinkage long sys_ni_syscall(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); #define __sys_ni_syscall sys_ni_syscall #endif /* CONFIG_IA32_EMULATION */ @@ -20,7 +20,7 @@ extern asmlinkage long sys_ni_syscall(unsigned long, unsigned long, unsigned lon #include #undef __SYSCALL_I386 -#define __SYSCALL_I386(nr, sym, qual) [nr] = sym, +#define __SYSCALL_I386(nr, sym) [nr] = sym, __visible const sys_call_ptr_t ia32_sys_call_table[__NR_ia32_syscall_max+1] = { /* diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c index 5dc6846979df..bce4821a7e14 100644 --- a/arch/x86/entry/syscall_64.c +++ b/arch/x86/entry/syscall_64.c @@ -8,13 +8,13 @@ #include #include -#define __SYSCALL_X32(nr, sym, qual) +#define __SYSCALL_X32(nr, sym) -#define __SYSCALL_64(nr, sym, qual) extern asmlinkage long sym(const struct pt_regs *); +#define __SYSCALL_64(nr, sym) extern asmlinkage long sym(const struct pt_regs *); #include #undef __SYSCALL_64 -#define __SYSCALL_64(nr, sym, qual) [nr] = sym, +#define __SYSCALL_64(nr, sym) [nr] = sym, asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = { /* diff --git a/arch/x86/entry/syscall_x32.c b/arch/x86/entry/syscall_x32.c index 95abb6da0ecc..21e306c5a401 100644 --- a/arch/x86/entry/syscall_x32.c +++ b/arch/x86/entry/syscall_x32.c @@ -8,13 +8,13 @@ #include #include -#define __SYSCALL_64(nr, sym, qual) +#define __SYSCALL_64(nr, sym) -#define __SYSCALL_X32(nr, sym, qual) extern asmlinkage long sym(const struct pt_regs *); +#define __SYSCALL_X32(nr, sym) extern asmlinkage long sym(const struct pt_regs *); #include #undef __SYSCALL_X32 -#define __SYSCALL_X32(nr, sym, qual) [nr] = sym, +#define __SYSCALL_X32(nr, sym) [nr] = sym, asmlinkage const sys_call_ptr_t x32_sys_call_table[__NR_x32_syscall_max+1] = { /* diff --git a/arch/x86/entry/syscalls/syscalltbl.sh b/arch/x86/entry/syscalls/syscalltbl.sh index 1af2be39e7d9..b0519ddad8d7 100644 --- a/arch/x86/entry/syscalls/syscalltbl.sh +++ b/arch/x86/entry/syscalls/syscalltbl.sh @@ -9,15 +9,7 @@ syscall_macro() { local nr="$2" local entry="$3" - # Entry can be either just a function name or "function/qualifier" - real_entry="${entry%%/*}" - if [ "$entry" = "$real_entry" ]; then - qualifier= - else - qualifier=${entry#*/} - fi - - echo "__SYSCALL_${abi}($nr, $real_entry, $qualifier)" + echo "__SYSCALL_${abi}($nr, $entry)" } emit() { diff --git a/arch/x86/um/sys_call_table_32.c b/arch/x86/um/sys_call_table_32.c index a0d75c560030..2ed81e581755 100644 --- a/arch/x86/um/sys_call_table_32.c +++ b/arch/x86/um/sys_call_table_32.c @@ -26,11 +26,11 @@ #define old_mmap sys_old_mmap -#define __SYSCALL_I386(nr, sym, qual) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) ; +#define __SYSCALL_I386(nr, sym) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) ; #include #undef __SYSCALL_I386 -#define __SYSCALL_I386(nr, sym, qual) [ nr ] = sym, +#define __SYSCALL_I386(nr, sym) [ nr ] = sym, extern asmlinkage long sys_ni_syscall(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c index fa97740badd5..7d057ea53d54 100644 --- a/arch/x86/um/sys_call_table_64.c +++ b/arch/x86/um/sys_call_table_64.c @@ -36,11 +36,11 @@ #define stub_execveat sys_execveat #define stub_rt_sigreturn sys_rt_sigreturn -#define __SYSCALL_64(nr, sym, qual) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) ; +#define __SYSCALL_64(nr, sym) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) ; #include #undef __SYSCALL_64 -#define __SYSCALL_64(nr, sym, qual) [ nr ] = sym, +#define __SYSCALL_64(nr, sym) [ nr ] = sym, extern asmlinkage long sys_ni_syscall(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); -- cgit From 8210efcb153625d2bf4bb79875ddc78eee2aba3e Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:37 -0400 Subject: x86/entry/64: Add __SYSCALL_COMMON() Add a __SYSCALL_COMMON() macro to the syscall table, which simplifies syscalltbl.sh. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200313195144.164260-12-brgerst@gmail.com --- arch/x86/entry/syscall_64.c | 1 + arch/x86/entry/syscall_x32.c | 3 +++ arch/x86/entry/syscalls/syscalltbl.sh | 22 ++-------------------- arch/x86/um/sys_call_table_64.c | 3 +++ 4 files changed, 9 insertions(+), 20 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c index bce4821a7e14..03645f9014d3 100644 --- a/arch/x86/entry/syscall_64.c +++ b/arch/x86/entry/syscall_64.c @@ -9,6 +9,7 @@ #include #define __SYSCALL_X32(nr, sym) +#define __SYSCALL_COMMON(nr, sym) __SYSCALL_64(nr, sym) #define __SYSCALL_64(nr, sym) extern asmlinkage long sym(const struct pt_regs *); #include diff --git a/arch/x86/entry/syscall_x32.c b/arch/x86/entry/syscall_x32.c index 21e306c5a401..57a151a3a4b4 100644 --- a/arch/x86/entry/syscall_x32.c +++ b/arch/x86/entry/syscall_x32.c @@ -11,10 +11,13 @@ #define __SYSCALL_64(nr, sym) #define __SYSCALL_X32(nr, sym) extern asmlinkage long sym(const struct pt_regs *); +#define __SYSCALL_COMMON(nr, sym) extern asmlinkage long sym(const struct pt_regs *); #include #undef __SYSCALL_X32 +#undef __SYSCALL_COMMON #define __SYSCALL_X32(nr, sym) [nr] = sym, +#define __SYSCALL_COMMON(nr, sym) [nr] = sym, asmlinkage const sys_call_ptr_t x32_sys_call_table[__NR_x32_syscall_max+1] = { /* diff --git a/arch/x86/entry/syscalls/syscalltbl.sh b/arch/x86/entry/syscalls/syscalltbl.sh index b0519ddad8d7..6106ed37b8de 100644 --- a/arch/x86/entry/syscalls/syscalltbl.sh +++ b/arch/x86/entry/syscalls/syscalltbl.sh @@ -25,7 +25,7 @@ emit() { fi # For CONFIG_UML, we need to strip the __x64_sys prefix - if [ "$abi" = "64" -a "${entry}" != "${entry#__x64_sys}" ]; then + if [ "${entry}" != "${entry#__x64_sys}" ]; then umlentry="sys${entry#__x64_sys}" fi @@ -53,24 +53,6 @@ emit() { grep '^[0-9]' "$in" | sort -n | ( while read nr abi name entry compat; do abi=`echo "$abi" | tr '[a-z]' '[A-Z]'` - if [ "$abi" = "COMMON" -o "$abi" = "64" ]; then - emit 64 "$nr" "$entry" "$compat" - if [ "$abi" = "COMMON" ]; then - # COMMON means that this syscall exists in the same form for - # 64-bit and X32. - echo "#ifdef CONFIG_X86_X32_ABI" - emit X32 "$nr" "$entry" "$compat" - echo "#endif" - fi - elif [ "$abi" = "X32" ]; then - echo "#ifdef CONFIG_X86_X32_ABI" - emit X32 "$nr" "$entry" "$compat" - echo "#endif" - elif [ "$abi" = "I386" ]; then - emit "$abi" "$nr" "$entry" "$compat" - else - echo "Unknown abi $abi" >&2 - exit 1 - fi + emit "$abi" "$nr" "$entry" "$compat" done ) > "$out" diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c index 7d057ea53d54..2e8544dafbb0 100644 --- a/arch/x86/um/sys_call_table_64.c +++ b/arch/x86/um/sys_call_table_64.c @@ -36,6 +36,9 @@ #define stub_execveat sys_execveat #define stub_rt_sigreturn sys_rt_sigreturn +#define __SYSCALL_X32(nr, sym) +#define __SYSCALL_COMMON(nr, sym) __SYSCALL_64(nr, sym) + #define __SYSCALL_64(nr, sym) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) ; #include -- cgit From cab56d3484d4bb8b21e4d9500392ac1ce99af026 Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:38 -0400 Subject: x86/entry: Remove ABI prefixes from functions in syscall tables Move the ABI prefixes to the __SYSCALL_[abi]() macros. This allows removal of the need to strip the prefix for UML. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200313195144.164260-13-brgerst@gmail.com --- arch/x86/entry/syscall_32.c | 6 +- arch/x86/entry/syscall_64.c | 4 +- arch/x86/entry/syscall_x32.c | 8 +- arch/x86/entry/syscalls/syscall_32.tbl | 818 ++++++++++++++++----------------- arch/x86/entry/syscalls/syscall_64.tbl | 740 ++++++++++++++--------------- arch/x86/entry/syscalls/syscalltbl.sh | 14 +- 6 files changed, 791 insertions(+), 799 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c index 9f4cab4a56d8..41ec9c66fe15 100644 --- a/arch/x86/entry/syscall_32.c +++ b/arch/x86/entry/syscall_32.c @@ -9,7 +9,7 @@ #ifdef CONFIG_IA32_EMULATION /* On X86_64, we use struct pt_regs * to pass parameters to syscalls */ -#define __SYSCALL_I386(nr, sym) extern asmlinkage long sym(const struct pt_regs *); +#define __SYSCALL_I386(nr, sym) extern asmlinkage long __ia32_##sym(const struct pt_regs *); #define __sys_ni_syscall __ia32_sys_ni_syscall #else /* CONFIG_IA32_EMULATION */ #define __SYSCALL_I386(nr, sym) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); @@ -20,7 +20,11 @@ extern asmlinkage long sys_ni_syscall(unsigned long, unsigned long, unsigned lon #include #undef __SYSCALL_I386 +#ifdef CONFIG_IA32_EMULATION +#define __SYSCALL_I386(nr, sym) [nr] = __ia32_##sym, +#else /* CONFIG_IA32_EMULATION */ #define __SYSCALL_I386(nr, sym) [nr] = sym, +#endif /* CONFIG_IA32_EMULATION */ __visible const sys_call_ptr_t ia32_sys_call_table[__NR_ia32_syscall_max+1] = { /* diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c index 03645f9014d3..66d3e65e3b6b 100644 --- a/arch/x86/entry/syscall_64.c +++ b/arch/x86/entry/syscall_64.c @@ -11,11 +11,11 @@ #define __SYSCALL_X32(nr, sym) #define __SYSCALL_COMMON(nr, sym) __SYSCALL_64(nr, sym) -#define __SYSCALL_64(nr, sym) extern asmlinkage long sym(const struct pt_regs *); +#define __SYSCALL_64(nr, sym) extern asmlinkage long __x64_##sym(const struct pt_regs *); #include #undef __SYSCALL_64 -#define __SYSCALL_64(nr, sym) [nr] = sym, +#define __SYSCALL_64(nr, sym) [nr] = __x64_##sym, asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = { /* diff --git a/arch/x86/entry/syscall_x32.c b/arch/x86/entry/syscall_x32.c index 57a151a3a4b4..2fb09efd7b40 100644 --- a/arch/x86/entry/syscall_x32.c +++ b/arch/x86/entry/syscall_x32.c @@ -10,14 +10,14 @@ #define __SYSCALL_64(nr, sym) -#define __SYSCALL_X32(nr, sym) extern asmlinkage long sym(const struct pt_regs *); -#define __SYSCALL_COMMON(nr, sym) extern asmlinkage long sym(const struct pt_regs *); +#define __SYSCALL_X32(nr, sym) extern asmlinkage long __x32_##sym(const struct pt_regs *); +#define __SYSCALL_COMMON(nr, sym) extern asmlinkage long __x64_##sym(const struct pt_regs *); #include #undef __SYSCALL_X32 #undef __SYSCALL_COMMON -#define __SYSCALL_X32(nr, sym) [nr] = sym, -#define __SYSCALL_COMMON(nr, sym) [nr] = sym, +#define __SYSCALL_X32(nr, sym) [nr] = __x32_##sym, +#define __SYSCALL_COMMON(nr, sym) [nr] = __x64_##sym, asmlinkage const sys_call_ptr_t x32_sys_call_table[__NR_x32_syscall_max+1] = { /* diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index c17cb77eb150..390c21230290 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -11,434 +11,434 @@ # # The abi is always "i386" for this file. # -0 i386 restart_syscall sys_restart_syscall __ia32_sys_restart_syscall -1 i386 exit sys_exit __ia32_sys_exit -2 i386 fork sys_fork __ia32_sys_fork -3 i386 read sys_read __ia32_sys_read -4 i386 write sys_write __ia32_sys_write -5 i386 open sys_open __ia32_compat_sys_open -6 i386 close sys_close __ia32_sys_close -7 i386 waitpid sys_waitpid __ia32_sys_waitpid -8 i386 creat sys_creat __ia32_sys_creat -9 i386 link sys_link __ia32_sys_link -10 i386 unlink sys_unlink __ia32_sys_unlink -11 i386 execve sys_execve __ia32_compat_sys_execve -12 i386 chdir sys_chdir __ia32_sys_chdir -13 i386 time sys_time32 __ia32_sys_time32 -14 i386 mknod sys_mknod __ia32_sys_mknod -15 i386 chmod sys_chmod __ia32_sys_chmod -16 i386 lchown sys_lchown16 __ia32_sys_lchown16 +0 i386 restart_syscall sys_restart_syscall sys_restart_syscall +1 i386 exit sys_exit sys_exit +2 i386 fork sys_fork sys_fork +3 i386 read sys_read sys_read +4 i386 write sys_write sys_write +5 i386 open sys_open compat_sys_open +6 i386 close sys_close sys_close +7 i386 waitpid sys_waitpid sys_waitpid +8 i386 creat sys_creat sys_creat +9 i386 link sys_link sys_link +10 i386 unlink sys_unlink sys_unlink +11 i386 execve sys_execve compat_sys_execve +12 i386 chdir sys_chdir sys_chdir +13 i386 time sys_time32 sys_time32 +14 i386 mknod sys_mknod sys_mknod +15 i386 chmod sys_chmod sys_chmod +16 i386 lchown sys_lchown16 sys_lchown16 17 i386 break -18 i386 oldstat sys_stat __ia32_sys_stat -19 i386 lseek sys_lseek __ia32_compat_sys_lseek -20 i386 getpid sys_getpid __ia32_sys_getpid -21 i386 mount sys_mount __ia32_compat_sys_mount -22 i386 umount sys_oldumount __ia32_sys_oldumount -23 i386 setuid sys_setuid16 __ia32_sys_setuid16 -24 i386 getuid sys_getuid16 __ia32_sys_getuid16 -25 i386 stime sys_stime32 __ia32_sys_stime32 -26 i386 ptrace sys_ptrace __ia32_compat_sys_ptrace -27 i386 alarm sys_alarm __ia32_sys_alarm -28 i386 oldfstat sys_fstat __ia32_sys_fstat -29 i386 pause sys_pause __ia32_sys_pause -30 i386 utime sys_utime32 __ia32_sys_utime32 +18 i386 oldstat sys_stat sys_stat +19 i386 lseek sys_lseek compat_sys_lseek +20 i386 getpid sys_getpid sys_getpid +21 i386 mount sys_mount compat_sys_mount +22 i386 umount sys_oldumount sys_oldumount +23 i386 setuid sys_setuid16 sys_setuid16 +24 i386 getuid sys_getuid16 sys_getuid16 +25 i386 stime sys_stime32 sys_stime32 +26 i386 ptrace sys_ptrace compat_sys_ptrace +27 i386 alarm sys_alarm sys_alarm +28 i386 oldfstat sys_fstat sys_fstat +29 i386 pause sys_pause sys_pause +30 i386 utime sys_utime32 sys_utime32 31 i386 stty 32 i386 gtty -33 i386 access sys_access __ia32_sys_access -34 i386 nice sys_nice __ia32_sys_nice +33 i386 access sys_access sys_access +34 i386 nice sys_nice sys_nice 35 i386 ftime -36 i386 sync sys_sync __ia32_sys_sync -37 i386 kill sys_kill __ia32_sys_kill -38 i386 rename sys_rename __ia32_sys_rename -39 i386 mkdir sys_mkdir __ia32_sys_mkdir -40 i386 rmdir sys_rmdir __ia32_sys_rmdir -41 i386 dup sys_dup __ia32_sys_dup -42 i386 pipe sys_pipe __ia32_sys_pipe -43 i386 times sys_times __ia32_compat_sys_times +36 i386 sync sys_sync sys_sync +37 i386 kill sys_kill sys_kill +38 i386 rename sys_rename sys_rename +39 i386 mkdir sys_mkdir sys_mkdir +40 i386 rmdir sys_rmdir sys_rmdir +41 i386 dup sys_dup sys_dup +42 i386 pipe sys_pipe sys_pipe +43 i386 times sys_times compat_sys_times 44 i386 prof -45 i386 brk sys_brk __ia32_sys_brk -46 i386 setgid sys_setgid16 __ia32_sys_setgid16 -47 i386 getgid sys_getgid16 __ia32_sys_getgid16 -48 i386 signal sys_signal __ia32_sys_signal -49 i386 geteuid sys_geteuid16 __ia32_sys_geteuid16 -50 i386 getegid sys_getegid16 __ia32_sys_getegid16 -51 i386 acct sys_acct __ia32_sys_acct -52 i386 umount2 sys_umount __ia32_sys_umount +45 i386 brk sys_brk sys_brk +46 i386 setgid sys_setgid16 sys_setgid16 +47 i386 getgid sys_getgid16 sys_getgid16 +48 i386 signal sys_signal sys_signal +49 i386 geteuid sys_geteuid16 sys_geteuid16 +50 i386 getegid sys_getegid16 sys_getegid16 +51 i386 acct sys_acct sys_acct +52 i386 umount2 sys_umount sys_umount 53 i386 lock -54 i386 ioctl sys_ioctl __ia32_compat_sys_ioctl -55 i386 fcntl sys_fcntl __ia32_compat_sys_fcntl64 +54 i386 ioctl sys_ioctl compat_sys_ioctl +55 i386 fcntl sys_fcntl compat_sys_fcntl64 56 i386 mpx -57 i386 setpgid sys_setpgid __ia32_sys_setpgid +57 i386 setpgid sys_setpgid sys_setpgid 58 i386 ulimit -59 i386 oldolduname sys_olduname __ia32_sys_olduname -60 i386 umask sys_umask __ia32_sys_umask -61 i386 chroot sys_chroot __ia32_sys_chroot -62 i386 ustat sys_ustat __ia32_compat_sys_ustat -63 i386 dup2 sys_dup2 __ia32_sys_dup2 -64 i386 getppid sys_getppid __ia32_sys_getppid -65 i386 getpgrp sys_getpgrp __ia32_sys_getpgrp -66 i386 setsid sys_setsid __ia32_sys_setsid -67 i386 sigaction sys_sigaction __ia32_compat_sys_sigaction -68 i386 sgetmask sys_sgetmask __ia32_sys_sgetmask -69 i386 ssetmask sys_ssetmask __ia32_sys_ssetmask -70 i386 setreuid sys_setreuid16 __ia32_sys_setreuid16 -71 i386 setregid sys_setregid16 __ia32_sys_setregid16 -72 i386 sigsuspend sys_sigsuspend __ia32_sys_sigsuspend -73 i386 sigpending sys_sigpending __ia32_compat_sys_sigpending -74 i386 sethostname sys_sethostname __ia32_sys_sethostname -75 i386 setrlimit sys_setrlimit __ia32_compat_sys_setrlimit -76 i386 getrlimit sys_old_getrlimit __ia32_compat_sys_old_getrlimit -77 i386 getrusage sys_getrusage __ia32_compat_sys_getrusage -78 i386 gettimeofday sys_gettimeofday __ia32_compat_sys_gettimeofday -79 i386 settimeofday sys_settimeofday __ia32_compat_sys_settimeofday -80 i386 getgroups sys_getgroups16 __ia32_sys_getgroups16 -81 i386 setgroups sys_setgroups16 __ia32_sys_setgroups16 -82 i386 select sys_old_select __ia32_compat_sys_old_select -83 i386 symlink sys_symlink __ia32_sys_symlink -84 i386 oldlstat sys_lstat __ia32_sys_lstat -85 i386 readlink sys_readlink __ia32_sys_readlink -86 i386 uselib sys_uselib __ia32_sys_uselib -87 i386 swapon sys_swapon __ia32_sys_swapon -88 i386 reboot sys_reboot __ia32_sys_reboot -89 i386 readdir sys_old_readdir __ia32_compat_sys_old_readdir -90 i386 mmap sys_old_mmap __ia32_compat_sys_x86_mmap -91 i386 munmap sys_munmap __ia32_sys_munmap -92 i386 truncate sys_truncate __ia32_compat_sys_truncate -93 i386 ftruncate sys_ftruncate __ia32_compat_sys_ftruncate -94 i386 fchmod sys_fchmod __ia32_sys_fchmod -95 i386 fchown sys_fchown16 __ia32_sys_fchown16 -96 i386 getpriority sys_getpriority __ia32_sys_getpriority -97 i386 setpriority sys_setpriority __ia32_sys_setpriority +59 i386 oldolduname sys_olduname sys_olduname +60 i386 umask sys_umask sys_umask +61 i386 chroot sys_chroot sys_chroot +62 i386 ustat sys_ustat compat_sys_ustat +63 i386 dup2 sys_dup2 sys_dup2 +64 i386 getppid sys_getppid sys_getppid +65 i386 getpgrp sys_getpgrp sys_getpgrp +66 i386 setsid sys_setsid sys_setsid +67 i386 sigaction sys_sigaction compat_sys_sigaction +68 i386 sgetmask sys_sgetmask sys_sgetmask +69 i386 ssetmask sys_ssetmask sys_ssetmask +70 i386 setreuid sys_setreuid16 sys_setreuid16 +71 i386 setregid sys_setregid16 sys_setregid16 +72 i386 sigsuspend sys_sigsuspend sys_sigsuspend +73 i386 sigpending sys_sigpending compat_sys_sigpending +74 i386 sethostname sys_sethostname sys_sethostname +75 i386 setrlimit sys_setrlimit compat_sys_setrlimit +76 i386 getrlimit sys_old_getrlimit compat_sys_old_getrlimit +77 i386 getrusage sys_getrusage compat_sys_getrusage +78 i386 gettimeofday sys_gettimeofday compat_sys_gettimeofday +79 i386 settimeofday sys_settimeofday compat_sys_settimeofday +80 i386 getgroups sys_getgroups16 sys_getgroups16 +81 i386 setgroups sys_setgroups16 sys_setgroups16 +82 i386 select sys_old_select compat_sys_old_select +83 i386 symlink sys_symlink sys_symlink +84 i386 oldlstat sys_lstat sys_lstat +85 i386 readlink sys_readlink sys_readlink +86 i386 uselib sys_uselib sys_uselib +87 i386 swapon sys_swapon sys_swapon +88 i386 reboot sys_reboot sys_reboot +89 i386 readdir sys_old_readdir compat_sys_old_readdir +90 i386 mmap sys_old_mmap compat_sys_x86_mmap +91 i386 munmap sys_munmap sys_munmap +92 i386 truncate sys_truncate compat_sys_truncate +93 i386 ftruncate sys_ftruncate compat_sys_ftruncate +94 i386 fchmod sys_fchmod sys_fchmod +95 i386 fchown sys_fchown16 sys_fchown16 +96 i386 getpriority sys_getpriority sys_getpriority +97 i386 setpriority sys_setpriority sys_setpriority 98 i386 profil -99 i386 statfs sys_statfs __ia32_compat_sys_statfs -100 i386 fstatfs sys_fstatfs __ia32_compat_sys_fstatfs -101 i386 ioperm sys_ioperm __ia32_sys_ioperm -102 i386 socketcall sys_socketcall __ia32_compat_sys_socketcall -103 i386 syslog sys_syslog __ia32_sys_syslog -104 i386 setitimer sys_setitimer __ia32_compat_sys_setitimer -105 i386 getitimer sys_getitimer __ia32_compat_sys_getitimer -106 i386 stat sys_newstat __ia32_compat_sys_newstat -107 i386 lstat sys_newlstat __ia32_compat_sys_newlstat -108 i386 fstat sys_newfstat __ia32_compat_sys_newfstat -109 i386 olduname sys_uname __ia32_sys_uname -110 i386 iopl sys_iopl __ia32_sys_iopl -111 i386 vhangup sys_vhangup __ia32_sys_vhangup +99 i386 statfs sys_statfs compat_sys_statfs +100 i386 fstatfs sys_fstatfs compat_sys_fstatfs +101 i386 ioperm sys_ioperm sys_ioperm +102 i386 socketcall sys_socketcall compat_sys_socketcall +103 i386 syslog sys_syslog sys_syslog +104 i386 setitimer sys_setitimer compat_sys_setitimer +105 i386 getitimer sys_getitimer compat_sys_getitimer +106 i386 stat sys_newstat compat_sys_newstat +107 i386 lstat sys_newlstat compat_sys_newlstat +108 i386 fstat sys_newfstat compat_sys_newfstat +109 i386 olduname sys_uname sys_uname +110 i386 iopl sys_iopl sys_iopl +111 i386 vhangup sys_vhangup sys_vhangup 112 i386 idle -113 i386 vm86old sys_vm86old __ia32_sys_ni_syscall -114 i386 wait4 sys_wait4 __ia32_compat_sys_wait4 -115 i386 swapoff sys_swapoff __ia32_sys_swapoff -116 i386 sysinfo sys_sysinfo __ia32_compat_sys_sysinfo -117 i386 ipc sys_ipc __ia32_compat_sys_ipc -118 i386 fsync sys_fsync __ia32_sys_fsync -119 i386 sigreturn sys_sigreturn __ia32_compat_sys_sigreturn -120 i386 clone sys_clone __ia32_compat_sys_x86_clone -121 i386 setdomainname sys_setdomainname __ia32_sys_setdomainname -122 i386 uname sys_newuname __ia32_sys_newuname -123 i386 modify_ldt sys_modify_ldt __ia32_sys_modify_ldt -124 i386 adjtimex sys_adjtimex_time32 __ia32_sys_adjtimex_time32 -125 i386 mprotect sys_mprotect __ia32_sys_mprotect -126 i386 sigprocmask sys_sigprocmask __ia32_compat_sys_sigprocmask +113 i386 vm86old sys_vm86old sys_ni_syscall +114 i386 wait4 sys_wait4 compat_sys_wait4 +115 i386 swapoff sys_swapoff sys_swapoff +116 i386 sysinfo sys_sysinfo compat_sys_sysinfo +117 i386 ipc sys_ipc compat_sys_ipc +118 i386 fsync sys_fsync sys_fsync +119 i386 sigreturn sys_sigreturn compat_sys_sigreturn +120 i386 clone sys_clone compat_sys_x86_clone +121 i386 setdomainname sys_setdomainname sys_setdomainname +122 i386 uname sys_newuname sys_newuname +123 i386 modify_ldt sys_modify_ldt sys_modify_ldt +124 i386 adjtimex sys_adjtimex_time32 sys_adjtimex_time32 +125 i386 mprotect sys_mprotect sys_mprotect +126 i386 sigprocmask sys_sigprocmask compat_sys_sigprocmask 127 i386 create_module -128 i386 init_module sys_init_module __ia32_sys_init_module -129 i386 delete_module sys_delete_module __ia32_sys_delete_module +128 i386 init_module sys_init_module sys_init_module +129 i386 delete_module sys_delete_module sys_delete_module 130 i386 get_kernel_syms -131 i386 quotactl sys_quotactl __ia32_compat_sys_quotactl32 -132 i386 getpgid sys_getpgid __ia32_sys_getpgid -133 i386 fchdir sys_fchdir __ia32_sys_fchdir -134 i386 bdflush sys_bdflush __ia32_sys_bdflush -135 i386 sysfs sys_sysfs __ia32_sys_sysfs -136 i386 personality sys_personality __ia32_sys_personality +131 i386 quotactl sys_quotactl compat_sys_quotactl32 +132 i386 getpgid sys_getpgid sys_getpgid +133 i386 fchdir sys_fchdir sys_fchdir +134 i386 bdflush sys_bdflush sys_bdflush +135 i386 sysfs sys_sysfs sys_sysfs +136 i386 personality sys_personality sys_personality 137 i386 afs_syscall -138 i386 setfsuid sys_setfsuid16 __ia32_sys_setfsuid16 -139 i386 setfsgid sys_setfsgid16 __ia32_sys_setfsgid16 -140 i386 _llseek sys_llseek __ia32_sys_llseek -141 i386 getdents sys_getdents __ia32_compat_sys_getdents -142 i386 _newselect sys_select __ia32_compat_sys_select -143 i386 flock sys_flock __ia32_sys_flock -144 i386 msync sys_msync __ia32_sys_msync -145 i386 readv sys_readv __ia32_compat_sys_readv -146 i386 writev sys_writev __ia32_compat_sys_writev -147 i386 getsid sys_getsid __ia32_sys_getsid -148 i386 fdatasync sys_fdatasync __ia32_sys_fdatasync -149 i386 _sysctl sys_sysctl __ia32_compat_sys_sysctl -150 i386 mlock sys_mlock __ia32_sys_mlock -151 i386 munlock sys_munlock __ia32_sys_munlock -152 i386 mlockall sys_mlockall __ia32_sys_mlockall -153 i386 munlockall sys_munlockall __ia32_sys_munlockall -154 i386 sched_setparam sys_sched_setparam __ia32_sys_sched_setparam -155 i386 sched_getparam sys_sched_getparam __ia32_sys_sched_getparam -156 i386 sched_setscheduler sys_sched_setscheduler __ia32_sys_sched_setscheduler -157 i386 sched_getscheduler sys_sched_getscheduler __ia32_sys_sched_getscheduler -158 i386 sched_yield sys_sched_yield __ia32_sys_sched_yield -159 i386 sched_get_priority_max sys_sched_get_priority_max __ia32_sys_sched_get_priority_max -160 i386 sched_get_priority_min sys_sched_get_priority_min __ia32_sys_sched_get_priority_min -161 i386 sched_rr_get_interval sys_sched_rr_get_interval_time32 __ia32_sys_sched_rr_get_interval_time32 -162 i386 nanosleep sys_nanosleep_time32 __ia32_sys_nanosleep_time32 -163 i386 mremap sys_mremap __ia32_sys_mremap -164 i386 setresuid sys_setresuid16 __ia32_sys_setresuid16 -165 i386 getresuid sys_getresuid16 __ia32_sys_getresuid16 -166 i386 vm86 sys_vm86 __ia32_sys_ni_syscall +138 i386 setfsuid sys_setfsuid16 sys_setfsuid16 +139 i386 setfsgid sys_setfsgid16 sys_setfsgid16 +140 i386 _llseek sys_llseek sys_llseek +141 i386 getdents sys_getdents compat_sys_getdents +142 i386 _newselect sys_select compat_sys_select +143 i386 flock sys_flock sys_flock +144 i386 msync sys_msync sys_msync +145 i386 readv sys_readv compat_sys_readv +146 i386 writev sys_writev compat_sys_writev +147 i386 getsid sys_getsid sys_getsid +148 i386 fdatasync sys_fdatasync sys_fdatasync +149 i386 _sysctl sys_sysctl compat_sys_sysctl +150 i386 mlock sys_mlock sys_mlock +151 i386 munlock sys_munlock sys_munlock +152 i386 mlockall sys_mlockall sys_mlockall +153 i386 munlockall sys_munlockall sys_munlockall +154 i386 sched_setparam sys_sched_setparam sys_sched_setparam +155 i386 sched_getparam sys_sched_getparam sys_sched_getparam +156 i386 sched_setscheduler sys_sched_setscheduler sys_sched_setscheduler +157 i386 sched_getscheduler sys_sched_getscheduler sys_sched_getscheduler +158 i386 sched_yield sys_sched_yield sys_sched_yield +159 i386 sched_get_priority_max sys_sched_get_priority_max sys_sched_get_priority_max +160 i386 sched_get_priority_min sys_sched_get_priority_min sys_sched_get_priority_min +161 i386 sched_rr_get_interval sys_sched_rr_get_interval_time32 sys_sched_rr_get_interval_time32 +162 i386 nanosleep sys_nanosleep_time32 sys_nanosleep_time32 +163 i386 mremap sys_mremap sys_mremap +164 i386 setresuid sys_setresuid16 sys_setresuid16 +165 i386 getresuid sys_getresuid16 sys_getresuid16 +166 i386 vm86 sys_vm86 sys_ni_syscall 167 i386 query_module -168 i386 poll sys_poll __ia32_sys_poll +168 i386 poll sys_poll sys_poll 169 i386 nfsservctl -170 i386 setresgid sys_setresgid16 __ia32_sys_setresgid16 -171 i386 getresgid sys_getresgid16 __ia32_sys_getresgid16 -172 i386 prctl sys_prctl __ia32_sys_prctl -173 i386 rt_sigreturn sys_rt_sigreturn __ia32_compat_sys_rt_sigreturn -174 i386 rt_sigaction sys_rt_sigaction __ia32_compat_sys_rt_sigaction -175 i386 rt_sigprocmask sys_rt_sigprocmask __ia32_compat_sys_rt_sigprocmask -176 i386 rt_sigpending sys_rt_sigpending __ia32_compat_sys_rt_sigpending -177 i386 rt_sigtimedwait sys_rt_sigtimedwait_time32 __ia32_compat_sys_rt_sigtimedwait_time32 -178 i386 rt_sigqueueinfo sys_rt_sigqueueinfo __ia32_compat_sys_rt_sigqueueinfo -179 i386 rt_sigsuspend sys_rt_sigsuspend __ia32_compat_sys_rt_sigsuspend -180 i386 pread64 sys_pread64 __ia32_compat_sys_x86_pread -181 i386 pwrite64 sys_pwrite64 __ia32_compat_sys_x86_pwrite -182 i386 chown sys_chown16 __ia32_sys_chown16 -183 i386 getcwd sys_getcwd __ia32_sys_getcwd -184 i386 capget sys_capget __ia32_sys_capget -185 i386 capset sys_capset __ia32_sys_capset -186 i386 sigaltstack sys_sigaltstack __ia32_compat_sys_sigaltstack -187 i386 sendfile sys_sendfile __ia32_compat_sys_sendfile +170 i386 setresgid sys_setresgid16 sys_setresgid16 +171 i386 getresgid sys_getresgid16 sys_getresgid16 +172 i386 prctl sys_prctl sys_prctl +173 i386 rt_sigreturn sys_rt_sigreturn compat_sys_rt_sigreturn +174 i386 rt_sigaction sys_rt_sigaction compat_sys_rt_sigaction +175 i386 rt_sigprocmask sys_rt_sigprocmask compat_sys_rt_sigprocmask +176 i386 rt_sigpending sys_rt_sigpending compat_sys_rt_sigpending +177 i386 rt_sigtimedwait sys_rt_sigtimedwait_time32 compat_sys_rt_sigtimedwait_time32 +178 i386 rt_sigqueueinfo sys_rt_sigqueueinfo compat_sys_rt_sigqueueinfo +179 i386 rt_sigsuspend sys_rt_sigsuspend compat_sys_rt_sigsuspend +180 i386 pread64 sys_pread64 compat_sys_x86_pread +181 i386 pwrite64 sys_pwrite64 compat_sys_x86_pwrite +182 i386 chown sys_chown16 sys_chown16 +183 i386 getcwd sys_getcwd sys_getcwd +184 i386 capget sys_capget sys_capget +185 i386 capset sys_capset sys_capset +186 i386 sigaltstack sys_sigaltstack compat_sys_sigaltstack +187 i386 sendfile sys_sendfile compat_sys_sendfile 188 i386 getpmsg 189 i386 putpmsg -190 i386 vfork sys_vfork __ia32_sys_vfork -191 i386 ugetrlimit sys_getrlimit __ia32_compat_sys_getrlimit -192 i386 mmap2 sys_mmap_pgoff __ia32_sys_mmap_pgoff -193 i386 truncate64 sys_truncate64 __ia32_compat_sys_x86_truncate64 -194 i386 ftruncate64 sys_ftruncate64 __ia32_compat_sys_x86_ftruncate64 -195 i386 stat64 sys_stat64 __ia32_compat_sys_x86_stat64 -196 i386 lstat64 sys_lstat64 __ia32_compat_sys_x86_lstat64 -197 i386 fstat64 sys_fstat64 __ia32_compat_sys_x86_fstat64 -198 i386 lchown32 sys_lchown __ia32_sys_lchown -199 i386 getuid32 sys_getuid __ia32_sys_getuid -200 i386 getgid32 sys_getgid __ia32_sys_getgid -201 i386 geteuid32 sys_geteuid __ia32_sys_geteuid -202 i386 getegid32 sys_getegid __ia32_sys_getegid -203 i386 setreuid32 sys_setreuid __ia32_sys_setreuid -204 i386 setregid32 sys_setregid __ia32_sys_setregid -205 i386 getgroups32 sys_getgroups __ia32_sys_getgroups -206 i386 setgroups32 sys_setgroups __ia32_sys_setgroups -207 i386 fchown32 sys_fchown __ia32_sys_fchown -208 i386 setresuid32 sys_setresuid __ia32_sys_setresuid -209 i386 getresuid32 sys_getresuid __ia32_sys_getresuid -210 i386 setresgid32 sys_setresgid __ia32_sys_setresgid -211 i386 getresgid32 sys_getresgid __ia32_sys_getresgid -212 i386 chown32 sys_chown __ia32_sys_chown -213 i386 setuid32 sys_setuid __ia32_sys_setuid -214 i386 setgid32 sys_setgid __ia32_sys_setgid -215 i386 setfsuid32 sys_setfsuid __ia32_sys_setfsuid -216 i386 setfsgid32 sys_setfsgid __ia32_sys_setfsgid -217 i386 pivot_root sys_pivot_root __ia32_sys_pivot_root -218 i386 mincore sys_mincore __ia32_sys_mincore -219 i386 madvise sys_madvise __ia32_sys_madvise -220 i386 getdents64 sys_getdents64 __ia32_sys_getdents64 -221 i386 fcntl64 sys_fcntl64 __ia32_compat_sys_fcntl64 +190 i386 vfork sys_vfork sys_vfork +191 i386 ugetrlimit sys_getrlimit compat_sys_getrlimit +192 i386 mmap2 sys_mmap_pgoff sys_mmap_pgoff +193 i386 truncate64 sys_truncate64 compat_sys_x86_truncate64 +194 i386 ftruncate64 sys_ftruncate64 compat_sys_x86_ftruncate64 +195 i386 stat64 sys_stat64 compat_sys_x86_stat64 +196 i386 lstat64 sys_lstat64 compat_sys_x86_lstat64 +197 i386 fstat64 sys_fstat64 compat_sys_x86_fstat64 +198 i386 lchown32 sys_lchown sys_lchown +199 i386 getuid32 sys_getuid sys_getuid +200 i386 getgid32 sys_getgid sys_getgid +201 i386 geteuid32 sys_geteuid sys_geteuid +202 i386 getegid32 sys_getegid sys_getegid +203 i386 setreuid32 sys_setreuid sys_setreuid +204 i386 setregid32 sys_setregid sys_setregid +205 i386 getgroups32 sys_getgroups sys_getgroups +206 i386 setgroups32 sys_setgroups sys_setgroups +207 i386 fchown32 sys_fchown sys_fchown +208 i386 setresuid32 sys_setresuid sys_setresuid +209 i386 getresuid32 sys_getresuid sys_getresuid +210 i386 setresgid32 sys_setresgid sys_setresgid +211 i386 getresgid32 sys_getresgid sys_getresgid +212 i386 chown32 sys_chown sys_chown +213 i386 setuid32 sys_setuid sys_setuid +214 i386 setgid32 sys_setgid sys_setgid +215 i386 setfsuid32 sys_setfsuid sys_setfsuid +216 i386 setfsgid32 sys_setfsgid sys_setfsgid +217 i386 pivot_root sys_pivot_root sys_pivot_root +218 i386 mincore sys_mincore sys_mincore +219 i386 madvise sys_madvise sys_madvise +220 i386 getdents64 sys_getdents64 sys_getdents64 +221 i386 fcntl64 sys_fcntl64 compat_sys_fcntl64 # 222 is unused # 223 is unused -224 i386 gettid sys_gettid __ia32_sys_gettid -225 i386 readahead sys_readahead __ia32_compat_sys_x86_readahead -226 i386 setxattr sys_setxattr __ia32_sys_setxattr -227 i386 lsetxattr sys_lsetxattr __ia32_sys_lsetxattr -228 i386 fsetxattr sys_fsetxattr __ia32_sys_fsetxattr -229 i386 getxattr sys_getxattr __ia32_sys_getxattr -230 i386 lgetxattr sys_lgetxattr __ia32_sys_lgetxattr -231 i386 fgetxattr sys_fgetxattr __ia32_sys_fgetxattr -232 i386 listxattr sys_listxattr __ia32_sys_listxattr -233 i386 llistxattr sys_llistxattr __ia32_sys_llistxattr -234 i386 flistxattr sys_flistxattr __ia32_sys_flistxattr -235 i386 removexattr sys_removexattr __ia32_sys_removexattr -236 i386 lremovexattr sys_lremovexattr __ia32_sys_lremovexattr -237 i386 fremovexattr sys_fremovexattr __ia32_sys_fremovexattr -238 i386 tkill sys_tkill __ia32_sys_tkill -239 i386 sendfile64 sys_sendfile64 __ia32_sys_sendfile64 -240 i386 futex sys_futex_time32 __ia32_sys_futex_time32 -241 i386 sched_setaffinity sys_sched_setaffinity __ia32_compat_sys_sched_setaffinity -242 i386 sched_getaffinity sys_sched_getaffinity __ia32_compat_sys_sched_getaffinity -243 i386 set_thread_area sys_set_thread_area __ia32_sys_set_thread_area -244 i386 get_thread_area sys_get_thread_area __ia32_sys_get_thread_area -245 i386 io_setup sys_io_setup __ia32_compat_sys_io_setup -246 i386 io_destroy sys_io_destroy __ia32_sys_io_destroy -247 i386 io_getevents sys_io_getevents_time32 __ia32_sys_io_getevents_time32 -248 i386 io_submit sys_io_submit __ia32_compat_sys_io_submit -249 i386 io_cancel sys_io_cancel __ia32_sys_io_cancel -250 i386 fadvise64 sys_fadvise64 __ia32_compat_sys_x86_fadvise64 +224 i386 gettid sys_gettid sys_gettid +225 i386 readahead sys_readahead compat_sys_x86_readahead +226 i386 setxattr sys_setxattr sys_setxattr +227 i386 lsetxattr sys_lsetxattr sys_lsetxattr +228 i386 fsetxattr sys_fsetxattr sys_fsetxattr +229 i386 getxattr sys_getxattr sys_getxattr +230 i386 lgetxattr sys_lgetxattr sys_lgetxattr +231 i386 fgetxattr sys_fgetxattr sys_fgetxattr +232 i386 listxattr sys_listxattr sys_listxattr +233 i386 llistxattr sys_llistxattr sys_llistxattr +234 i386 flistxattr sys_flistxattr sys_flistxattr +235 i386 removexattr sys_removexattr sys_removexattr +236 i386 lremovexattr sys_lremovexattr sys_lremovexattr +237 i386 fremovexattr sys_fremovexattr sys_fremovexattr +238 i386 tkill sys_tkill sys_tkill +239 i386 sendfile64 sys_sendfile64 sys_sendfile64 +240 i386 futex sys_futex_time32 sys_futex_time32 +241 i386 sched_setaffinity sys_sched_setaffinity compat_sys_sched_setaffinity +242 i386 sched_getaffinity sys_sched_getaffinity compat_sys_sched_getaffinity +243 i386 set_thread_area sys_set_thread_area sys_set_thread_area +244 i386 get_thread_area sys_get_thread_area sys_get_thread_area +245 i386 io_setup sys_io_setup compat_sys_io_setup +246 i386 io_destroy sys_io_destroy sys_io_destroy +247 i386 io_getevents sys_io_getevents_time32 sys_io_getevents_time32 +248 i386 io_submit sys_io_submit compat_sys_io_submit +249 i386 io_cancel sys_io_cancel sys_io_cancel +250 i386 fadvise64 sys_fadvise64 compat_sys_x86_fadvise64 # 251 is available for reuse (was briefly sys_set_zone_reclaim) -252 i386 exit_group sys_exit_group __ia32_sys_exit_group -253 i386 lookup_dcookie sys_lookup_dcookie __ia32_compat_sys_lookup_dcookie -254 i386 epoll_create sys_epoll_create __ia32_sys_epoll_create -255 i386 epoll_ctl sys_epoll_ctl __ia32_sys_epoll_ctl -256 i386 epoll_wait sys_epoll_wait __ia32_sys_epoll_wait -257 i386 remap_file_pages sys_remap_file_pages __ia32_sys_remap_file_pages -258 i386 set_tid_address sys_set_tid_address __ia32_sys_set_tid_address -259 i386 timer_create sys_timer_create __ia32_compat_sys_timer_create -260 i386 timer_settime sys_timer_settime32 __ia32_sys_timer_settime32 -261 i386 timer_gettime sys_timer_gettime32 __ia32_sys_timer_gettime32 -262 i386 timer_getoverrun sys_timer_getoverrun __ia32_sys_timer_getoverrun -263 i386 timer_delete sys_timer_delete __ia32_sys_timer_delete -264 i386 clock_settime sys_clock_settime32 __ia32_sys_clock_settime32 -265 i386 clock_gettime sys_clock_gettime32 __ia32_sys_clock_gettime32 -266 i386 clock_getres sys_clock_getres_time32 __ia32_sys_clock_getres_time32 -267 i386 clock_nanosleep sys_clock_nanosleep_time32 __ia32_sys_clock_nanosleep_time32 -268 i386 statfs64 sys_statfs64 __ia32_compat_sys_statfs64 -269 i386 fstatfs64 sys_fstatfs64 __ia32_compat_sys_fstatfs64 -270 i386 tgkill sys_tgkill __ia32_sys_tgkill -271 i386 utimes sys_utimes_time32 __ia32_sys_utimes_time32 -272 i386 fadvise64_64 sys_fadvise64_64 __ia32_compat_sys_x86_fadvise64_64 +252 i386 exit_group sys_exit_group sys_exit_group +253 i386 lookup_dcookie sys_lookup_dcookie compat_sys_lookup_dcookie +254 i386 epoll_create sys_epoll_create sys_epoll_create +255 i386 epoll_ctl sys_epoll_ctl sys_epoll_ctl +256 i386 epoll_wait sys_epoll_wait sys_epoll_wait +257 i386 remap_file_pages sys_remap_file_pages sys_remap_file_pages +258 i386 set_tid_address sys_set_tid_address sys_set_tid_address +259 i386 timer_create sys_timer_create compat_sys_timer_create +260 i386 timer_settime sys_timer_settime32 sys_timer_settime32 +261 i386 timer_gettime sys_timer_gettime32 sys_timer_gettime32 +262 i386 timer_getoverrun sys_timer_getoverrun sys_timer_getoverrun +263 i386 timer_delete sys_timer_delete sys_timer_delete +264 i386 clock_settime sys_clock_settime32 sys_clock_settime32 +265 i386 clock_gettime sys_clock_gettime32 sys_clock_gettime32 +266 i386 clock_getres sys_clock_getres_time32 sys_clock_getres_time32 +267 i386 clock_nanosleep sys_clock_nanosleep_time32 sys_clock_nanosleep_time32 +268 i386 statfs64 sys_statfs64 compat_sys_statfs64 +269 i386 fstatfs64 sys_fstatfs64 compat_sys_fstatfs64 +270 i386 tgkill sys_tgkill sys_tgkill +271 i386 utimes sys_utimes_time32 sys_utimes_time32 +272 i386 fadvise64_64 sys_fadvise64_64 compat_sys_x86_fadvise64_64 273 i386 vserver -274 i386 mbind sys_mbind __ia32_sys_mbind -275 i386 get_mempolicy sys_get_mempolicy __ia32_compat_sys_get_mempolicy -276 i386 set_mempolicy sys_set_mempolicy __ia32_sys_set_mempolicy -277 i386 mq_open sys_mq_open __ia32_compat_sys_mq_open -278 i386 mq_unlink sys_mq_unlink __ia32_sys_mq_unlink -279 i386 mq_timedsend sys_mq_timedsend_time32 __ia32_sys_mq_timedsend_time32 -280 i386 mq_timedreceive sys_mq_timedreceive_time32 __ia32_sys_mq_timedreceive_time32 -281 i386 mq_notify sys_mq_notify __ia32_compat_sys_mq_notify -282 i386 mq_getsetattr sys_mq_getsetattr __ia32_compat_sys_mq_getsetattr -283 i386 kexec_load sys_kexec_load __ia32_compat_sys_kexec_load -284 i386 waitid sys_waitid __ia32_compat_sys_waitid +274 i386 mbind sys_mbind sys_mbind +275 i386 get_mempolicy sys_get_mempolicy compat_sys_get_mempolicy +276 i386 set_mempolicy sys_set_mempolicy sys_set_mempolicy +277 i386 mq_open sys_mq_open compat_sys_mq_open +278 i386 mq_unlink sys_mq_unlink sys_mq_unlink +279 i386 mq_timedsend sys_mq_timedsend_time32 sys_mq_timedsend_time32 +280 i386 mq_timedreceive sys_mq_timedreceive_time32 sys_mq_timedreceive_time32 +281 i386 mq_notify sys_mq_notify compat_sys_mq_notify +282 i386 mq_getsetattr sys_mq_getsetattr compat_sys_mq_getsetattr +283 i386 kexec_load sys_kexec_load compat_sys_kexec_load +284 i386 waitid sys_waitid compat_sys_waitid # 285 sys_setaltroot -286 i386 add_key sys_add_key __ia32_sys_add_key -287 i386 request_key sys_request_key __ia32_sys_request_key -288 i386 keyctl sys_keyctl __ia32_compat_sys_keyctl -289 i386 ioprio_set sys_ioprio_set __ia32_sys_ioprio_set -290 i386 ioprio_get sys_ioprio_get __ia32_sys_ioprio_get -291 i386 inotify_init sys_inotify_init __ia32_sys_inotify_init -292 i386 inotify_add_watch sys_inotify_add_watch __ia32_sys_inotify_add_watch -293 i386 inotify_rm_watch sys_inotify_rm_watch __ia32_sys_inotify_rm_watch -294 i386 migrate_pages sys_migrate_pages __ia32_sys_migrate_pages -295 i386 openat sys_openat __ia32_compat_sys_openat -296 i386 mkdirat sys_mkdirat __ia32_sys_mkdirat -297 i386 mknodat sys_mknodat __ia32_sys_mknodat -298 i386 fchownat sys_fchownat __ia32_sys_fchownat -299 i386 futimesat sys_futimesat_time32 __ia32_sys_futimesat_time32 -300 i386 fstatat64 sys_fstatat64 __ia32_compat_sys_x86_fstatat -301 i386 unlinkat sys_unlinkat __ia32_sys_unlinkat -302 i386 renameat sys_renameat __ia32_sys_renameat -303 i386 linkat sys_linkat __ia32_sys_linkat -304 i386 symlinkat sys_symlinkat __ia32_sys_symlinkat -305 i386 readlinkat sys_readlinkat __ia32_sys_readlinkat -306 i386 fchmodat sys_fchmodat __ia32_sys_fchmodat -307 i386 faccessat sys_faccessat __ia32_sys_faccessat -308 i386 pselect6 sys_pselect6_time32 __ia32_compat_sys_pselect6_time32 -309 i386 ppoll sys_ppoll_time32 __ia32_compat_sys_ppoll_time32 -310 i386 unshare sys_unshare __ia32_sys_unshare -311 i386 set_robust_list sys_set_robust_list __ia32_compat_sys_set_robust_list -312 i386 get_robust_list sys_get_robust_list __ia32_compat_sys_get_robust_list -313 i386 splice sys_splice __ia32_sys_splice -314 i386 sync_file_range sys_sync_file_range __ia32_compat_sys_x86_sync_file_range -315 i386 tee sys_tee __ia32_sys_tee -316 i386 vmsplice sys_vmsplice __ia32_compat_sys_vmsplice -317 i386 move_pages sys_move_pages __ia32_compat_sys_move_pages -318 i386 getcpu sys_getcpu __ia32_sys_getcpu -319 i386 epoll_pwait sys_epoll_pwait __ia32_sys_epoll_pwait -320 i386 utimensat sys_utimensat_time32 __ia32_sys_utimensat_time32 -321 i386 signalfd sys_signalfd __ia32_compat_sys_signalfd -322 i386 timerfd_create sys_timerfd_create __ia32_sys_timerfd_create -323 i386 eventfd sys_eventfd __ia32_sys_eventfd -324 i386 fallocate sys_fallocate __ia32_compat_sys_x86_fallocate -325 i386 timerfd_settime sys_timerfd_settime32 __ia32_sys_timerfd_settime32 -326 i386 timerfd_gettime sys_timerfd_gettime32 __ia32_sys_timerfd_gettime32 -327 i386 signalfd4 sys_signalfd4 __ia32_compat_sys_signalfd4 -328 i386 eventfd2 sys_eventfd2 __ia32_sys_eventfd2 -329 i386 epoll_create1 sys_epoll_create1 __ia32_sys_epoll_create1 -330 i386 dup3 sys_dup3 __ia32_sys_dup3 -331 i386 pipe2 sys_pipe2 __ia32_sys_pipe2 -332 i386 inotify_init1 sys_inotify_init1 __ia32_sys_inotify_init1 -333 i386 preadv sys_preadv __ia32_compat_sys_preadv -334 i386 pwritev sys_pwritev __ia32_compat_sys_pwritev -335 i386 rt_tgsigqueueinfo sys_rt_tgsigqueueinfo __ia32_compat_sys_rt_tgsigqueueinfo -336 i386 perf_event_open sys_perf_event_open __ia32_sys_perf_event_open -337 i386 recvmmsg sys_recvmmsg_time32 __ia32_compat_sys_recvmmsg_time32 -338 i386 fanotify_init sys_fanotify_init __ia32_sys_fanotify_init -339 i386 fanotify_mark sys_fanotify_mark __ia32_compat_sys_fanotify_mark -340 i386 prlimit64 sys_prlimit64 __ia32_sys_prlimit64 -341 i386 name_to_handle_at sys_name_to_handle_at __ia32_sys_name_to_handle_at -342 i386 open_by_handle_at sys_open_by_handle_at __ia32_compat_sys_open_by_handle_at -343 i386 clock_adjtime sys_clock_adjtime32 __ia32_sys_clock_adjtime32 -344 i386 syncfs sys_syncfs __ia32_sys_syncfs -345 i386 sendmmsg sys_sendmmsg __ia32_compat_sys_sendmmsg -346 i386 setns sys_setns __ia32_sys_setns -347 i386 process_vm_readv sys_process_vm_readv __ia32_compat_sys_process_vm_readv -348 i386 process_vm_writev sys_process_vm_writev __ia32_compat_sys_process_vm_writev -349 i386 kcmp sys_kcmp __ia32_sys_kcmp -350 i386 finit_module sys_finit_module __ia32_sys_finit_module -351 i386 sched_setattr sys_sched_setattr __ia32_sys_sched_setattr -352 i386 sched_getattr sys_sched_getattr __ia32_sys_sched_getattr -353 i386 renameat2 sys_renameat2 __ia32_sys_renameat2 -354 i386 seccomp sys_seccomp __ia32_sys_seccomp -355 i386 getrandom sys_getrandom __ia32_sys_getrandom -356 i386 memfd_create sys_memfd_create __ia32_sys_memfd_create -357 i386 bpf sys_bpf __ia32_sys_bpf -358 i386 execveat sys_execveat __ia32_compat_sys_execveat -359 i386 socket sys_socket __ia32_sys_socket -360 i386 socketpair sys_socketpair __ia32_sys_socketpair -361 i386 bind sys_bind __ia32_sys_bind -362 i386 connect sys_connect __ia32_sys_connect -363 i386 listen sys_listen __ia32_sys_listen -364 i386 accept4 sys_accept4 __ia32_sys_accept4 -365 i386 getsockopt sys_getsockopt __ia32_compat_sys_getsockopt -366 i386 setsockopt sys_setsockopt __ia32_compat_sys_setsockopt -367 i386 getsockname sys_getsockname __ia32_sys_getsockname -368 i386 getpeername sys_getpeername __ia32_sys_getpeername -369 i386 sendto sys_sendto __ia32_sys_sendto -370 i386 sendmsg sys_sendmsg __ia32_compat_sys_sendmsg -371 i386 recvfrom sys_recvfrom __ia32_compat_sys_recvfrom -372 i386 recvmsg sys_recvmsg __ia32_compat_sys_recvmsg -373 i386 shutdown sys_shutdown __ia32_sys_shutdown -374 i386 userfaultfd sys_userfaultfd __ia32_sys_userfaultfd -375 i386 membarrier sys_membarrier __ia32_sys_membarrier -376 i386 mlock2 sys_mlock2 __ia32_sys_mlock2 -377 i386 copy_file_range sys_copy_file_range __ia32_sys_copy_file_range -378 i386 preadv2 sys_preadv2 __ia32_compat_sys_preadv2 -379 i386 pwritev2 sys_pwritev2 __ia32_compat_sys_pwritev2 -380 i386 pkey_mprotect sys_pkey_mprotect __ia32_sys_pkey_mprotect -381 i386 pkey_alloc sys_pkey_alloc __ia32_sys_pkey_alloc -382 i386 pkey_free sys_pkey_free __ia32_sys_pkey_free -383 i386 statx sys_statx __ia32_sys_statx -384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl -385 i386 io_pgetevents sys_io_pgetevents_time32 __ia32_compat_sys_io_pgetevents -386 i386 rseq sys_rseq __ia32_sys_rseq -393 i386 semget sys_semget __ia32_sys_semget -394 i386 semctl sys_semctl __ia32_compat_sys_semctl -395 i386 shmget sys_shmget __ia32_sys_shmget -396 i386 shmctl sys_shmctl __ia32_compat_sys_shmctl -397 i386 shmat sys_shmat __ia32_compat_sys_shmat -398 i386 shmdt sys_shmdt __ia32_sys_shmdt -399 i386 msgget sys_msgget __ia32_sys_msgget -400 i386 msgsnd sys_msgsnd __ia32_compat_sys_msgsnd -401 i386 msgrcv sys_msgrcv __ia32_compat_sys_msgrcv -402 i386 msgctl sys_msgctl __ia32_compat_sys_msgctl -403 i386 clock_gettime64 sys_clock_gettime __ia32_sys_clock_gettime -404 i386 clock_settime64 sys_clock_settime __ia32_sys_clock_settime -405 i386 clock_adjtime64 sys_clock_adjtime __ia32_sys_clock_adjtime -406 i386 clock_getres_time64 sys_clock_getres __ia32_sys_clock_getres -407 i386 clock_nanosleep_time64 sys_clock_nanosleep __ia32_sys_clock_nanosleep -408 i386 timer_gettime64 sys_timer_gettime __ia32_sys_timer_gettime -409 i386 timer_settime64 sys_timer_settime __ia32_sys_timer_settime -410 i386 timerfd_gettime64 sys_timerfd_gettime __ia32_sys_timerfd_gettime -411 i386 timerfd_settime64 sys_timerfd_settime __ia32_sys_timerfd_settime -412 i386 utimensat_time64 sys_utimensat __ia32_sys_utimensat -413 i386 pselect6_time64 sys_pselect6 __ia32_compat_sys_pselect6_time64 -414 i386 ppoll_time64 sys_ppoll __ia32_compat_sys_ppoll_time64 -416 i386 io_pgetevents_time64 sys_io_pgetevents __ia32_sys_io_pgetevents -417 i386 recvmmsg_time64 sys_recvmmsg __ia32_compat_sys_recvmmsg_time64 -418 i386 mq_timedsend_time64 sys_mq_timedsend __ia32_sys_mq_timedsend -419 i386 mq_timedreceive_time64 sys_mq_timedreceive __ia32_sys_mq_timedreceive -420 i386 semtimedop_time64 sys_semtimedop __ia32_sys_semtimedop -421 i386 rt_sigtimedwait_time64 sys_rt_sigtimedwait __ia32_compat_sys_rt_sigtimedwait_time64 -422 i386 futex_time64 sys_futex __ia32_sys_futex -423 i386 sched_rr_get_interval_time64 sys_sched_rr_get_interval __ia32_sys_sched_rr_get_interval -424 i386 pidfd_send_signal sys_pidfd_send_signal __ia32_sys_pidfd_send_signal -425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup -426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter -427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register -428 i386 open_tree sys_open_tree __ia32_sys_open_tree -429 i386 move_mount sys_move_mount __ia32_sys_move_mount -430 i386 fsopen sys_fsopen __ia32_sys_fsopen -431 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig -432 i386 fsmount sys_fsmount __ia32_sys_fsmount -433 i386 fspick sys_fspick __ia32_sys_fspick -434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open -435 i386 clone3 sys_clone3 __ia32_sys_clone3 -437 i386 openat2 sys_openat2 __ia32_sys_openat2 -438 i386 pidfd_getfd sys_pidfd_getfd __ia32_sys_pidfd_getfd +286 i386 add_key sys_add_key sys_add_key +287 i386 request_key sys_request_key sys_request_key +288 i386 keyctl sys_keyctl compat_sys_keyctl +289 i386 ioprio_set sys_ioprio_set sys_ioprio_set +290 i386 ioprio_get sys_ioprio_get sys_ioprio_get +291 i386 inotify_init sys_inotify_init sys_inotify_init +292 i386 inotify_add_watch sys_inotify_add_watch sys_inotify_add_watch +293 i386 inotify_rm_watch sys_inotify_rm_watch sys_inotify_rm_watch +294 i386 migrate_pages sys_migrate_pages sys_migrate_pages +295 i386 openat sys_openat compat_sys_openat +296 i386 mkdirat sys_mkdirat sys_mkdirat +297 i386 mknodat sys_mknodat sys_mknodat +298 i386 fchownat sys_fchownat sys_fchownat +299 i386 futimesat sys_futimesat_time32 sys_futimesat_time32 +300 i386 fstatat64 sys_fstatat64 compat_sys_x86_fstatat +301 i386 unlinkat sys_unlinkat sys_unlinkat +302 i386 renameat sys_renameat sys_renameat +303 i386 linkat sys_linkat sys_linkat +304 i386 symlinkat sys_symlinkat sys_symlinkat +305 i386 readlinkat sys_readlinkat sys_readlinkat +306 i386 fchmodat sys_fchmodat sys_fchmodat +307 i386 faccessat sys_faccessat sys_faccessat +308 i386 pselect6 sys_pselect6_time32 compat_sys_pselect6_time32 +309 i386 ppoll sys_ppoll_time32 compat_sys_ppoll_time32 +310 i386 unshare sys_unshare sys_unshare +311 i386 set_robust_list sys_set_robust_list compat_sys_set_robust_list +312 i386 get_robust_list sys_get_robust_list compat_sys_get_robust_list +313 i386 splice sys_splice sys_splice +314 i386 sync_file_range sys_sync_file_range compat_sys_x86_sync_file_range +315 i386 tee sys_tee sys_tee +316 i386 vmsplice sys_vmsplice compat_sys_vmsplice +317 i386 move_pages sys_move_pages compat_sys_move_pages +318 i386 getcpu sys_getcpu sys_getcpu +319 i386 epoll_pwait sys_epoll_pwait sys_epoll_pwait +320 i386 utimensat sys_utimensat_time32 sys_utimensat_time32 +321 i386 signalfd sys_signalfd compat_sys_signalfd +322 i386 timerfd_create sys_timerfd_create sys_timerfd_create +323 i386 eventfd sys_eventfd sys_eventfd +324 i386 fallocate sys_fallocate compat_sys_x86_fallocate +325 i386 timerfd_settime sys_timerfd_settime32 sys_timerfd_settime32 +326 i386 timerfd_gettime sys_timerfd_gettime32 sys_timerfd_gettime32 +327 i386 signalfd4 sys_signalfd4 compat_sys_signalfd4 +328 i386 eventfd2 sys_eventfd2 sys_eventfd2 +329 i386 epoll_create1 sys_epoll_create1 sys_epoll_create1 +330 i386 dup3 sys_dup3 sys_dup3 +331 i386 pipe2 sys_pipe2 sys_pipe2 +332 i386 inotify_init1 sys_inotify_init1 sys_inotify_init1 +333 i386 preadv sys_preadv compat_sys_preadv +334 i386 pwritev sys_pwritev compat_sys_pwritev +335 i386 rt_tgsigqueueinfo sys_rt_tgsigqueueinfo compat_sys_rt_tgsigqueueinfo +336 i386 perf_event_open sys_perf_event_open sys_perf_event_open +337 i386 recvmmsg sys_recvmmsg_time32 compat_sys_recvmmsg_time32 +338 i386 fanotify_init sys_fanotify_init sys_fanotify_init +339 i386 fanotify_mark sys_fanotify_mark compat_sys_fanotify_mark +340 i386 prlimit64 sys_prlimit64 sys_prlimit64 +341 i386 name_to_handle_at sys_name_to_handle_at sys_name_to_handle_at +342 i386 open_by_handle_at sys_open_by_handle_at compat_sys_open_by_handle_at +343 i386 clock_adjtime sys_clock_adjtime32 sys_clock_adjtime32 +344 i386 syncfs sys_syncfs sys_syncfs +345 i386 sendmmsg sys_sendmmsg compat_sys_sendmmsg +346 i386 setns sys_setns sys_setns +347 i386 process_vm_readv sys_process_vm_readv compat_sys_process_vm_readv +348 i386 process_vm_writev sys_process_vm_writev compat_sys_process_vm_writev +349 i386 kcmp sys_kcmp sys_kcmp +350 i386 finit_module sys_finit_module sys_finit_module +351 i386 sched_setattr sys_sched_setattr sys_sched_setattr +352 i386 sched_getattr sys_sched_getattr sys_sched_getattr +353 i386 renameat2 sys_renameat2 sys_renameat2 +354 i386 seccomp sys_seccomp sys_seccomp +355 i386 getrandom sys_getrandom sys_getrandom +356 i386 memfd_create sys_memfd_create sys_memfd_create +357 i386 bpf sys_bpf sys_bpf +358 i386 execveat sys_execveat compat_sys_execveat +359 i386 socket sys_socket sys_socket +360 i386 socketpair sys_socketpair sys_socketpair +361 i386 bind sys_bind sys_bind +362 i386 connect sys_connect sys_connect +363 i386 listen sys_listen sys_listen +364 i386 accept4 sys_accept4 sys_accept4 +365 i386 getsockopt sys_getsockopt compat_sys_getsockopt +366 i386 setsockopt sys_setsockopt compat_sys_setsockopt +367 i386 getsockname sys_getsockname sys_getsockname +368 i386 getpeername sys_getpeername sys_getpeername +369 i386 sendto sys_sendto sys_sendto +370 i386 sendmsg sys_sendmsg compat_sys_sendmsg +371 i386 recvfrom sys_recvfrom compat_sys_recvfrom +372 i386 recvmsg sys_recvmsg compat_sys_recvmsg +373 i386 shutdown sys_shutdown sys_shutdown +374 i386 userfaultfd sys_userfaultfd sys_userfaultfd +375 i386 membarrier sys_membarrier sys_membarrier +376 i386 mlock2 sys_mlock2 sys_mlock2 +377 i386 copy_file_range sys_copy_file_range sys_copy_file_range +378 i386 preadv2 sys_preadv2 compat_sys_preadv2 +379 i386 pwritev2 sys_pwritev2 compat_sys_pwritev2 +380 i386 pkey_mprotect sys_pkey_mprotect sys_pkey_mprotect +381 i386 pkey_alloc sys_pkey_alloc sys_pkey_alloc +382 i386 pkey_free sys_pkey_free sys_pkey_free +383 i386 statx sys_statx sys_statx +384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl +385 i386 io_pgetevents sys_io_pgetevents_time32 compat_sys_io_pgetevents +386 i386 rseq sys_rseq sys_rseq +393 i386 semget sys_semget sys_semget +394 i386 semctl sys_semctl compat_sys_semctl +395 i386 shmget sys_shmget sys_shmget +396 i386 shmctl sys_shmctl compat_sys_shmctl +397 i386 shmat sys_shmat compat_sys_shmat +398 i386 shmdt sys_shmdt sys_shmdt +399 i386 msgget sys_msgget sys_msgget +400 i386 msgsnd sys_msgsnd compat_sys_msgsnd +401 i386 msgrcv sys_msgrcv compat_sys_msgrcv +402 i386 msgctl sys_msgctl compat_sys_msgctl +403 i386 clock_gettime64 sys_clock_gettime sys_clock_gettime +404 i386 clock_settime64 sys_clock_settime sys_clock_settime +405 i386 clock_adjtime64 sys_clock_adjtime sys_clock_adjtime +406 i386 clock_getres_time64 sys_clock_getres sys_clock_getres +407 i386 clock_nanosleep_time64 sys_clock_nanosleep sys_clock_nanosleep +408 i386 timer_gettime64 sys_timer_gettime sys_timer_gettime +409 i386 timer_settime64 sys_timer_settime sys_timer_settime +410 i386 timerfd_gettime64 sys_timerfd_gettime sys_timerfd_gettime +411 i386 timerfd_settime64 sys_timerfd_settime sys_timerfd_settime +412 i386 utimensat_time64 sys_utimensat sys_utimensat +413 i386 pselect6_time64 sys_pselect6 compat_sys_pselect6_time64 +414 i386 ppoll_time64 sys_ppoll compat_sys_ppoll_time64 +416 i386 io_pgetevents_time64 sys_io_pgetevents sys_io_pgetevents +417 i386 recvmmsg_time64 sys_recvmmsg compat_sys_recvmmsg_time64 +418 i386 mq_timedsend_time64 sys_mq_timedsend sys_mq_timedsend +419 i386 mq_timedreceive_time64 sys_mq_timedreceive sys_mq_timedreceive +420 i386 semtimedop_time64 sys_semtimedop sys_semtimedop +421 i386 rt_sigtimedwait_time64 sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time64 +422 i386 futex_time64 sys_futex sys_futex +423 i386 sched_rr_get_interval_time64 sys_sched_rr_get_interval sys_sched_rr_get_interval +424 i386 pidfd_send_signal sys_pidfd_send_signal sys_pidfd_send_signal +425 i386 io_uring_setup sys_io_uring_setup sys_io_uring_setup +426 i386 io_uring_enter sys_io_uring_enter sys_io_uring_enter +427 i386 io_uring_register sys_io_uring_register sys_io_uring_register +428 i386 open_tree sys_open_tree sys_open_tree +429 i386 move_mount sys_move_mount sys_move_mount +430 i386 fsopen sys_fsopen sys_fsopen +431 i386 fsconfig sys_fsconfig sys_fsconfig +432 i386 fsmount sys_fsmount sys_fsmount +433 i386 fspick sys_fspick sys_fspick +434 i386 pidfd_open sys_pidfd_open sys_pidfd_open +435 i386 clone3 sys_clone3 sys_clone3 +437 i386 openat2 sys_openat2 sys_openat2 +438 i386 pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index e8fb7222a3bd..37b844f839bc 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -8,357 +8,357 @@ # # The abi is "common", "64" or "x32" for this file. # -0 common read __x64_sys_read -1 common write __x64_sys_write -2 common open __x64_sys_open -3 common close __x64_sys_close -4 common stat __x64_sys_newstat -5 common fstat __x64_sys_newfstat -6 common lstat __x64_sys_newlstat -7 common poll __x64_sys_poll -8 common lseek __x64_sys_lseek -9 common mmap __x64_sys_mmap -10 common mprotect __x64_sys_mprotect -11 common munmap __x64_sys_munmap -12 common brk __x64_sys_brk -13 64 rt_sigaction __x64_sys_rt_sigaction -14 common rt_sigprocmask __x64_sys_rt_sigprocmask -15 64 rt_sigreturn __x64_sys_rt_sigreturn -16 64 ioctl __x64_sys_ioctl -17 common pread64 __x64_sys_pread64 -18 common pwrite64 __x64_sys_pwrite64 -19 64 readv __x64_sys_readv -20 64 writev __x64_sys_writev -21 common access __x64_sys_access -22 common pipe __x64_sys_pipe -23 common select __x64_sys_select -24 common sched_yield __x64_sys_sched_yield -25 common mremap __x64_sys_mremap -26 common msync __x64_sys_msync -27 common mincore __x64_sys_mincore -28 common madvise __x64_sys_madvise -29 common shmget __x64_sys_shmget -30 common shmat __x64_sys_shmat -31 common shmctl __x64_sys_shmctl -32 common dup __x64_sys_dup -33 common dup2 __x64_sys_dup2 -34 common pause __x64_sys_pause -35 common nanosleep __x64_sys_nanosleep -36 common getitimer __x64_sys_getitimer -37 common alarm __x64_sys_alarm -38 common setitimer __x64_sys_setitimer -39 common getpid __x64_sys_getpid -40 common sendfile __x64_sys_sendfile64 -41 common socket __x64_sys_socket -42 common connect __x64_sys_connect -43 common accept __x64_sys_accept -44 common sendto __x64_sys_sendto -45 64 recvfrom __x64_sys_recvfrom -46 64 sendmsg __x64_sys_sendmsg -47 64 recvmsg __x64_sys_recvmsg -48 common shutdown __x64_sys_shutdown -49 common bind __x64_sys_bind -50 common listen __x64_sys_listen -51 common getsockname __x64_sys_getsockname -52 common getpeername __x64_sys_getpeername -53 common socketpair __x64_sys_socketpair -54 64 setsockopt __x64_sys_setsockopt -55 64 getsockopt __x64_sys_getsockopt -56 common clone __x64_sys_clone -57 common fork __x64_sys_fork -58 common vfork __x64_sys_vfork -59 64 execve __x64_sys_execve -60 common exit __x64_sys_exit -61 common wait4 __x64_sys_wait4 -62 common kill __x64_sys_kill -63 common uname __x64_sys_newuname -64 common semget __x64_sys_semget -65 common semop __x64_sys_semop -66 common semctl __x64_sys_semctl -67 common shmdt __x64_sys_shmdt -68 common msgget __x64_sys_msgget -69 common msgsnd __x64_sys_msgsnd -70 common msgrcv __x64_sys_msgrcv -71 common msgctl __x64_sys_msgctl -72 common fcntl __x64_sys_fcntl -73 common flock __x64_sys_flock -74 common fsync __x64_sys_fsync -75 common fdatasync __x64_sys_fdatasync -76 common truncate __x64_sys_truncate -77 common ftruncate __x64_sys_ftruncate -78 common getdents __x64_sys_getdents -79 common getcwd __x64_sys_getcwd -80 common chdir __x64_sys_chdir -81 common fchdir __x64_sys_fchdir -82 common rename __x64_sys_rename -83 common mkdir __x64_sys_mkdir -84 common rmdir __x64_sys_rmdir -85 common creat __x64_sys_creat -86 common link __x64_sys_link -87 common unlink __x64_sys_unlink -88 common symlink __x64_sys_symlink -89 common readlink __x64_sys_readlink -90 common chmod __x64_sys_chmod -91 common fchmod __x64_sys_fchmod -92 common chown __x64_sys_chown -93 common fchown __x64_sys_fchown -94 common lchown __x64_sys_lchown -95 common umask __x64_sys_umask -96 common gettimeofday __x64_sys_gettimeofday -97 common getrlimit __x64_sys_getrlimit -98 common getrusage __x64_sys_getrusage -99 common sysinfo __x64_sys_sysinfo -100 common times __x64_sys_times -101 64 ptrace __x64_sys_ptrace -102 common getuid __x64_sys_getuid -103 common syslog __x64_sys_syslog -104 common getgid __x64_sys_getgid -105 common setuid __x64_sys_setuid -106 common setgid __x64_sys_setgid -107 common geteuid __x64_sys_geteuid -108 common getegid __x64_sys_getegid -109 common setpgid __x64_sys_setpgid -110 common getppid __x64_sys_getppid -111 common getpgrp __x64_sys_getpgrp -112 common setsid __x64_sys_setsid -113 common setreuid __x64_sys_setreuid -114 common setregid __x64_sys_setregid -115 common getgroups __x64_sys_getgroups -116 common setgroups __x64_sys_setgroups -117 common setresuid __x64_sys_setresuid -118 common getresuid __x64_sys_getresuid -119 common setresgid __x64_sys_setresgid -120 common getresgid __x64_sys_getresgid -121 common getpgid __x64_sys_getpgid -122 common setfsuid __x64_sys_setfsuid -123 common setfsgid __x64_sys_setfsgid -124 common getsid __x64_sys_getsid -125 common capget __x64_sys_capget -126 common capset __x64_sys_capset -127 64 rt_sigpending __x64_sys_rt_sigpending -128 64 rt_sigtimedwait __x64_sys_rt_sigtimedwait -129 64 rt_sigqueueinfo __x64_sys_rt_sigqueueinfo -130 common rt_sigsuspend __x64_sys_rt_sigsuspend -131 64 sigaltstack __x64_sys_sigaltstack -132 common utime __x64_sys_utime -133 common mknod __x64_sys_mknod +0 common read sys_read +1 common write sys_write +2 common open sys_open +3 common close sys_close +4 common stat sys_newstat +5 common fstat sys_newfstat +6 common lstat sys_newlstat +7 common poll sys_poll +8 common lseek sys_lseek +9 common mmap sys_mmap +10 common mprotect sys_mprotect +11 common munmap sys_munmap +12 common brk sys_brk +13 64 rt_sigaction sys_rt_sigaction +14 common rt_sigprocmask sys_rt_sigprocmask +15 64 rt_sigreturn sys_rt_sigreturn +16 64 ioctl sys_ioctl +17 common pread64 sys_pread64 +18 common pwrite64 sys_pwrite64 +19 64 readv sys_readv +20 64 writev sys_writev +21 common access sys_access +22 common pipe sys_pipe +23 common select sys_select +24 common sched_yield sys_sched_yield +25 common mremap sys_mremap +26 common msync sys_msync +27 common mincore sys_mincore +28 common madvise sys_madvise +29 common shmget sys_shmget +30 common shmat sys_shmat +31 common shmctl sys_shmctl +32 common dup sys_dup +33 common dup2 sys_dup2 +34 common pause sys_pause +35 common nanosleep sys_nanosleep +36 common getitimer sys_getitimer +37 common alarm sys_alarm +38 common setitimer sys_setitimer +39 common getpid sys_getpid +40 common sendfile sys_sendfile64 +41 common socket sys_socket +42 common connect sys_connect +43 common accept sys_accept +44 common sendto sys_sendto +45 64 recvfrom sys_recvfrom +46 64 sendmsg sys_sendmsg +47 64 recvmsg sys_recvmsg +48 common shutdown sys_shutdown +49 common bind sys_bind +50 common listen sys_listen +51 common getsockname sys_getsockname +52 common getpeername sys_getpeername +53 common socketpair sys_socketpair +54 64 setsockopt sys_setsockopt +55 64 getsockopt sys_getsockopt +56 common clone sys_clone +57 common fork sys_fork +58 common vfork sys_vfork +59 64 execve sys_execve +60 common exit sys_exit +61 common wait4 sys_wait4 +62 common kill sys_kill +63 common uname sys_newuname +64 common semget sys_semget +65 common semop sys_semop +66 common semctl sys_semctl +67 common shmdt sys_shmdt +68 common msgget sys_msgget +69 common msgsnd sys_msgsnd +70 common msgrcv sys_msgrcv +71 common msgctl sys_msgctl +72 common fcntl sys_fcntl +73 common flock sys_flock +74 common fsync sys_fsync +75 common fdatasync sys_fdatasync +76 common truncate sys_truncate +77 common ftruncate sys_ftruncate +78 common getdents sys_getdents +79 common getcwd sys_getcwd +80 common chdir sys_chdir +81 common fchdir sys_fchdir +82 common rename sys_rename +83 common mkdir sys_mkdir +84 common rmdir sys_rmdir +85 common creat sys_creat +86 common link sys_link +87 common unlink sys_unlink +88 common symlink sys_symlink +89 common readlink sys_readlink +90 common chmod sys_chmod +91 common fchmod sys_fchmod +92 common chown sys_chown +93 common fchown sys_fchown +94 common lchown sys_lchown +95 common umask sys_umask +96 common gettimeofday sys_gettimeofday +97 common getrlimit sys_getrlimit +98 common getrusage sys_getrusage +99 common sysinfo sys_sysinfo +100 common times sys_times +101 64 ptrace sys_ptrace +102 common getuid sys_getuid +103 common syslog sys_syslog +104 common getgid sys_getgid +105 common setuid sys_setuid +106 common setgid sys_setgid +107 common geteuid sys_geteuid +108 common getegid sys_getegid +109 common setpgid sys_setpgid +110 common getppid sys_getppid +111 common getpgrp sys_getpgrp +112 common setsid sys_setsid +113 common setreuid sys_setreuid +114 common setregid sys_setregid +115 common getgroups sys_getgroups +116 common setgroups sys_setgroups +117 common setresuid sys_setresuid +118 common getresuid sys_getresuid +119 common setresgid sys_setresgid +120 common getresgid sys_getresgid +121 common getpgid sys_getpgid +122 common setfsuid sys_setfsuid +123 common setfsgid sys_setfsgid +124 common getsid sys_getsid +125 common capget sys_capget +126 common capset sys_capset +127 64 rt_sigpending sys_rt_sigpending +128 64 rt_sigtimedwait sys_rt_sigtimedwait +129 64 rt_sigqueueinfo sys_rt_sigqueueinfo +130 common rt_sigsuspend sys_rt_sigsuspend +131 64 sigaltstack sys_sigaltstack +132 common utime sys_utime +133 common mknod sys_mknod 134 64 uselib -135 common personality __x64_sys_personality -136 common ustat __x64_sys_ustat -137 common statfs __x64_sys_statfs -138 common fstatfs __x64_sys_fstatfs -139 common sysfs __x64_sys_sysfs -140 common getpriority __x64_sys_getpriority -141 common setpriority __x64_sys_setpriority -142 common sched_setparam __x64_sys_sched_setparam -143 common sched_getparam __x64_sys_sched_getparam -144 common sched_setscheduler __x64_sys_sched_setscheduler -145 common sched_getscheduler __x64_sys_sched_getscheduler -146 common sched_get_priority_max __x64_sys_sched_get_priority_max -147 common sched_get_priority_min __x64_sys_sched_get_priority_min -148 common sched_rr_get_interval __x64_sys_sched_rr_get_interval -149 common mlock __x64_sys_mlock -150 common munlock __x64_sys_munlock -151 common mlockall __x64_sys_mlockall -152 common munlockall __x64_sys_munlockall -153 common vhangup __x64_sys_vhangup -154 common modify_ldt __x64_sys_modify_ldt -155 common pivot_root __x64_sys_pivot_root -156 64 _sysctl __x64_sys_sysctl -157 common prctl __x64_sys_prctl -158 common arch_prctl __x64_sys_arch_prctl -159 common adjtimex __x64_sys_adjtimex -160 common setrlimit __x64_sys_setrlimit -161 common chroot __x64_sys_chroot -162 common sync __x64_sys_sync -163 common acct __x64_sys_acct -164 common settimeofday __x64_sys_settimeofday -165 common mount __x64_sys_mount -166 common umount2 __x64_sys_umount -167 common swapon __x64_sys_swapon -168 common swapoff __x64_sys_swapoff -169 common reboot __x64_sys_reboot -170 common sethostname __x64_sys_sethostname -171 common setdomainname __x64_sys_setdomainname -172 common iopl __x64_sys_iopl -173 common ioperm __x64_sys_ioperm +135 common personality sys_personality +136 common ustat sys_ustat +137 common statfs sys_statfs +138 common fstatfs sys_fstatfs +139 common sysfs sys_sysfs +140 common getpriority sys_getpriority +141 common setpriority sys_setpriority +142 common sched_setparam sys_sched_setparam +143 common sched_getparam sys_sched_getparam +144 common sched_setscheduler sys_sched_setscheduler +145 common sched_getscheduler sys_sched_getscheduler +146 common sched_get_priority_max sys_sched_get_priority_max +147 common sched_get_priority_min sys_sched_get_priority_min +148 common sched_rr_get_interval sys_sched_rr_get_interval +149 common mlock sys_mlock +150 common munlock sys_munlock +151 common mlockall sys_mlockall +152 common munlockall sys_munlockall +153 common vhangup sys_vhangup +154 common modify_ldt sys_modify_ldt +155 common pivot_root sys_pivot_root +156 64 _sysctl sys_sysctl +157 common prctl sys_prctl +158 common arch_prctl sys_arch_prctl +159 common adjtimex sys_adjtimex +160 common setrlimit sys_setrlimit +161 common chroot sys_chroot +162 common sync sys_sync +163 common acct sys_acct +164 common settimeofday sys_settimeofday +165 common mount sys_mount +166 common umount2 sys_umount +167 common swapon sys_swapon +168 common swapoff sys_swapoff +169 common reboot sys_reboot +170 common sethostname sys_sethostname +171 common setdomainname sys_setdomainname +172 common iopl sys_iopl +173 common ioperm sys_ioperm 174 64 create_module -175 common init_module __x64_sys_init_module -176 common delete_module __x64_sys_delete_module +175 common init_module sys_init_module +176 common delete_module sys_delete_module 177 64 get_kernel_syms 178 64 query_module -179 common quotactl __x64_sys_quotactl +179 common quotactl sys_quotactl 180 64 nfsservctl 181 common getpmsg 182 common putpmsg 183 common afs_syscall 184 common tuxcall 185 common security -186 common gettid __x64_sys_gettid -187 common readahead __x64_sys_readahead -188 common setxattr __x64_sys_setxattr -189 common lsetxattr __x64_sys_lsetxattr -190 common fsetxattr __x64_sys_fsetxattr -191 common getxattr __x64_sys_getxattr -192 common lgetxattr __x64_sys_lgetxattr -193 common fgetxattr __x64_sys_fgetxattr -194 common listxattr __x64_sys_listxattr -195 common llistxattr __x64_sys_llistxattr -196 common flistxattr __x64_sys_flistxattr -197 common removexattr __x64_sys_removexattr -198 common lremovexattr __x64_sys_lremovexattr -199 common fremovexattr __x64_sys_fremovexattr -200 common tkill __x64_sys_tkill -201 common time __x64_sys_time -202 common futex __x64_sys_futex -203 common sched_setaffinity __x64_sys_sched_setaffinity -204 common sched_getaffinity __x64_sys_sched_getaffinity +186 common gettid sys_gettid +187 common readahead sys_readahead +188 common setxattr sys_setxattr +189 common lsetxattr sys_lsetxattr +190 common fsetxattr sys_fsetxattr +191 common getxattr sys_getxattr +192 common lgetxattr sys_lgetxattr +193 common fgetxattr sys_fgetxattr +194 common listxattr sys_listxattr +195 common llistxattr sys_llistxattr +196 common flistxattr sys_flistxattr +197 common removexattr sys_removexattr +198 common lremovexattr sys_lremovexattr +199 common fremovexattr sys_fremovexattr +200 common tkill sys_tkill +201 common time sys_time +202 common futex sys_futex +203 common sched_setaffinity sys_sched_setaffinity +204 common sched_getaffinity sys_sched_getaffinity 205 64 set_thread_area -206 64 io_setup __x64_sys_io_setup -207 common io_destroy __x64_sys_io_destroy -208 common io_getevents __x64_sys_io_getevents -209 64 io_submit __x64_sys_io_submit -210 common io_cancel __x64_sys_io_cancel +206 64 io_setup sys_io_setup +207 common io_destroy sys_io_destroy +208 common io_getevents sys_io_getevents +209 64 io_submit sys_io_submit +210 common io_cancel sys_io_cancel 211 64 get_thread_area -212 common lookup_dcookie __x64_sys_lookup_dcookie -213 common epoll_create __x64_sys_epoll_create +212 common lookup_dcookie sys_lookup_dcookie +213 common epoll_create sys_epoll_create 214 64 epoll_ctl_old 215 64 epoll_wait_old -216 common remap_file_pages __x64_sys_remap_file_pages -217 common getdents64 __x64_sys_getdents64 -218 common set_tid_address __x64_sys_set_tid_address -219 common restart_syscall __x64_sys_restart_syscall -220 common semtimedop __x64_sys_semtimedop -221 common fadvise64 __x64_sys_fadvise64 -222 64 timer_create __x64_sys_timer_create -223 common timer_settime __x64_sys_timer_settime -224 common timer_gettime __x64_sys_timer_gettime -225 common timer_getoverrun __x64_sys_timer_getoverrun -226 common timer_delete __x64_sys_timer_delete -227 common clock_settime __x64_sys_clock_settime -228 common clock_gettime __x64_sys_clock_gettime -229 common clock_getres __x64_sys_clock_getres -230 common clock_nanosleep __x64_sys_clock_nanosleep -231 common exit_group __x64_sys_exit_group -232 common epoll_wait __x64_sys_epoll_wait -233 common epoll_ctl __x64_sys_epoll_ctl -234 common tgkill __x64_sys_tgkill -235 common utimes __x64_sys_utimes +216 common remap_file_pages sys_remap_file_pages +217 common getdents64 sys_getdents64 +218 common set_tid_address sys_set_tid_address +219 common restart_syscall sys_restart_syscall +220 common semtimedop sys_semtimedop +221 common fadvise64 sys_fadvise64 +222 64 timer_create sys_timer_create +223 common timer_settime sys_timer_settime +224 common timer_gettime sys_timer_gettime +225 common timer_getoverrun sys_timer_getoverrun +226 common timer_delete sys_timer_delete +227 common clock_settime sys_clock_settime +228 common clock_gettime sys_clock_gettime +229 common clock_getres sys_clock_getres +230 common clock_nanosleep sys_clock_nanosleep +231 common exit_group sys_exit_group +232 common epoll_wait sys_epoll_wait +233 common epoll_ctl sys_epoll_ctl +234 common tgkill sys_tgkill +235 common utimes sys_utimes 236 64 vserver -237 common mbind __x64_sys_mbind -238 common set_mempolicy __x64_sys_set_mempolicy -239 common get_mempolicy __x64_sys_get_mempolicy -240 common mq_open __x64_sys_mq_open -241 common mq_unlink __x64_sys_mq_unlink -242 common mq_timedsend __x64_sys_mq_timedsend -243 common mq_timedreceive __x64_sys_mq_timedreceive -244 64 mq_notify __x64_sys_mq_notify -245 common mq_getsetattr __x64_sys_mq_getsetattr -246 64 kexec_load __x64_sys_kexec_load -247 64 waitid __x64_sys_waitid -248 common add_key __x64_sys_add_key -249 common request_key __x64_sys_request_key -250 common keyctl __x64_sys_keyctl -251 common ioprio_set __x64_sys_ioprio_set -252 common ioprio_get __x64_sys_ioprio_get -253 common inotify_init __x64_sys_inotify_init -254 common inotify_add_watch __x64_sys_inotify_add_watch -255 common inotify_rm_watch __x64_sys_inotify_rm_watch -256 common migrate_pages __x64_sys_migrate_pages -257 common openat __x64_sys_openat -258 common mkdirat __x64_sys_mkdirat -259 common mknodat __x64_sys_mknodat -260 common fchownat __x64_sys_fchownat -261 common futimesat __x64_sys_futimesat -262 common newfstatat __x64_sys_newfstatat -263 common unlinkat __x64_sys_unlinkat -264 common renameat __x64_sys_renameat -265 common linkat __x64_sys_linkat -266 common symlinkat __x64_sys_symlinkat -267 common readlinkat __x64_sys_readlinkat -268 common fchmodat __x64_sys_fchmodat -269 common faccessat __x64_sys_faccessat -270 common pselect6 __x64_sys_pselect6 -271 common ppoll __x64_sys_ppoll -272 common unshare __x64_sys_unshare -273 64 set_robust_list __x64_sys_set_robust_list -274 64 get_robust_list __x64_sys_get_robust_list -275 common splice __x64_sys_splice -276 common tee __x64_sys_tee -277 common sync_file_range __x64_sys_sync_file_range -278 64 vmsplice __x64_sys_vmsplice -279 64 move_pages __x64_sys_move_pages -280 common utimensat __x64_sys_utimensat -281 common epoll_pwait __x64_sys_epoll_pwait -282 common signalfd __x64_sys_signalfd -283 common timerfd_create __x64_sys_timerfd_create -284 common eventfd __x64_sys_eventfd -285 common fallocate __x64_sys_fallocate -286 common timerfd_settime __x64_sys_timerfd_settime -287 common timerfd_gettime __x64_sys_timerfd_gettime -288 common accept4 __x64_sys_accept4 -289 common signalfd4 __x64_sys_signalfd4 -290 common eventfd2 __x64_sys_eventfd2 -291 common epoll_create1 __x64_sys_epoll_create1 -292 common dup3 __x64_sys_dup3 -293 common pipe2 __x64_sys_pipe2 -294 common inotify_init1 __x64_sys_inotify_init1 -295 64 preadv __x64_sys_preadv -296 64 pwritev __x64_sys_pwritev -297 64 rt_tgsigqueueinfo __x64_sys_rt_tgsigqueueinfo -298 common perf_event_open __x64_sys_perf_event_open -299 64 recvmmsg __x64_sys_recvmmsg -300 common fanotify_init __x64_sys_fanotify_init -301 common fanotify_mark __x64_sys_fanotify_mark -302 common prlimit64 __x64_sys_prlimit64 -303 common name_to_handle_at __x64_sys_name_to_handle_at -304 common open_by_handle_at __x64_sys_open_by_handle_at -305 common clock_adjtime __x64_sys_clock_adjtime -306 common syncfs __x64_sys_syncfs -307 64 sendmmsg __x64_sys_sendmmsg -308 common setns __x64_sys_setns -309 common getcpu __x64_sys_getcpu -310 64 process_vm_readv __x64_sys_process_vm_readv -311 64 process_vm_writev __x64_sys_process_vm_writev -312 common kcmp __x64_sys_kcmp -313 common finit_module __x64_sys_finit_module -314 common sched_setattr __x64_sys_sched_setattr -315 common sched_getattr __x64_sys_sched_getattr -316 common renameat2 __x64_sys_renameat2 -317 common seccomp __x64_sys_seccomp -318 common getrandom __x64_sys_getrandom -319 common memfd_create __x64_sys_memfd_create -320 common kexec_file_load __x64_sys_kexec_file_load -321 common bpf __x64_sys_bpf -322 64 execveat __x64_sys_execveat -323 common userfaultfd __x64_sys_userfaultfd -324 common membarrier __x64_sys_membarrier -325 common mlock2 __x64_sys_mlock2 -326 common copy_file_range __x64_sys_copy_file_range -327 64 preadv2 __x64_sys_preadv2 -328 64 pwritev2 __x64_sys_pwritev2 -329 common pkey_mprotect __x64_sys_pkey_mprotect -330 common pkey_alloc __x64_sys_pkey_alloc -331 common pkey_free __x64_sys_pkey_free -332 common statx __x64_sys_statx -333 common io_pgetevents __x64_sys_io_pgetevents -334 common rseq __x64_sys_rseq +237 common mbind sys_mbind +238 common set_mempolicy sys_set_mempolicy +239 common get_mempolicy sys_get_mempolicy +240 common mq_open sys_mq_open +241 common mq_unlink sys_mq_unlink +242 common mq_timedsend sys_mq_timedsend +243 common mq_timedreceive sys_mq_timedreceive +244 64 mq_notify sys_mq_notify +245 common mq_getsetattr sys_mq_getsetattr +246 64 kexec_load sys_kexec_load +247 64 waitid sys_waitid +248 common add_key sys_add_key +249 common request_key sys_request_key +250 common keyctl sys_keyctl +251 common ioprio_set sys_ioprio_set +252 common ioprio_get sys_ioprio_get +253 common inotify_init sys_inotify_init +254 common inotify_add_watch sys_inotify_add_watch +255 common inotify_rm_watch sys_inotify_rm_watch +256 common migrate_pages sys_migrate_pages +257 common openat sys_openat +258 common mkdirat sys_mkdirat +259 common mknodat sys_mknodat +260 common fchownat sys_fchownat +261 common futimesat sys_futimesat +262 common newfstatat sys_newfstatat +263 common unlinkat sys_unlinkat +264 common renameat sys_renameat +265 common linkat sys_linkat +266 common symlinkat sys_symlinkat +267 common readlinkat sys_readlinkat +268 common fchmodat sys_fchmodat +269 common faccessat sys_faccessat +270 common pselect6 sys_pselect6 +271 common ppoll sys_ppoll +272 common unshare sys_unshare +273 64 set_robust_list sys_set_robust_list +274 64 get_robust_list sys_get_robust_list +275 common splice sys_splice +276 common tee sys_tee +277 common sync_file_range sys_sync_file_range +278 64 vmsplice sys_vmsplice +279 64 move_pages sys_move_pages +280 common utimensat sys_utimensat +281 common epoll_pwait sys_epoll_pwait +282 common signalfd sys_signalfd +283 common timerfd_create sys_timerfd_create +284 common eventfd sys_eventfd +285 common fallocate sys_fallocate +286 common timerfd_settime sys_timerfd_settime +287 common timerfd_gettime sys_timerfd_gettime +288 common accept4 sys_accept4 +289 common signalfd4 sys_signalfd4 +290 common eventfd2 sys_eventfd2 +291 common epoll_create1 sys_epoll_create1 +292 common dup3 sys_dup3 +293 common pipe2 sys_pipe2 +294 common inotify_init1 sys_inotify_init1 +295 64 preadv sys_preadv +296 64 pwritev sys_pwritev +297 64 rt_tgsigqueueinfo sys_rt_tgsigqueueinfo +298 common perf_event_open sys_perf_event_open +299 64 recvmmsg sys_recvmmsg +300 common fanotify_init sys_fanotify_init +301 common fanotify_mark sys_fanotify_mark +302 common prlimit64 sys_prlimit64 +303 common name_to_handle_at sys_name_to_handle_at +304 common open_by_handle_at sys_open_by_handle_at +305 common clock_adjtime sys_clock_adjtime +306 common syncfs sys_syncfs +307 64 sendmmsg sys_sendmmsg +308 common setns sys_setns +309 common getcpu sys_getcpu +310 64 process_vm_readv sys_process_vm_readv +311 64 process_vm_writev sys_process_vm_writev +312 common kcmp sys_kcmp +313 common finit_module sys_finit_module +314 common sched_setattr sys_sched_setattr +315 common sched_getattr sys_sched_getattr +316 common renameat2 sys_renameat2 +317 common seccomp sys_seccomp +318 common getrandom sys_getrandom +319 common memfd_create sys_memfd_create +320 common kexec_file_load sys_kexec_file_load +321 common bpf sys_bpf +322 64 execveat sys_execveat +323 common userfaultfd sys_userfaultfd +324 common membarrier sys_membarrier +325 common mlock2 sys_mlock2 +326 common copy_file_range sys_copy_file_range +327 64 preadv2 sys_preadv2 +328 64 pwritev2 sys_pwritev2 +329 common pkey_mprotect sys_pkey_mprotect +330 common pkey_alloc sys_pkey_alloc +331 common pkey_free sys_pkey_free +332 common statx sys_statx +333 common io_pgetevents sys_io_pgetevents +334 common rseq sys_rseq # don't use numbers 387 through 423, add new calls after the last # 'common' entry -424 common pidfd_send_signal __x64_sys_pidfd_send_signal -425 common io_uring_setup __x64_sys_io_uring_setup -426 common io_uring_enter __x64_sys_io_uring_enter -427 common io_uring_register __x64_sys_io_uring_register -428 common open_tree __x64_sys_open_tree -429 common move_mount __x64_sys_move_mount -430 common fsopen __x64_sys_fsopen -431 common fsconfig __x64_sys_fsconfig -432 common fsmount __x64_sys_fsmount -433 common fspick __x64_sys_fspick -434 common pidfd_open __x64_sys_pidfd_open -435 common clone3 __x64_sys_clone3 -437 common openat2 __x64_sys_openat2 -438 common pidfd_getfd __x64_sys_pidfd_getfd +424 common pidfd_send_signal sys_pidfd_send_signal +425 common io_uring_setup sys_io_uring_setup +426 common io_uring_enter sys_io_uring_enter +427 common io_uring_register sys_io_uring_register +428 common open_tree sys_open_tree +429 common move_mount sys_move_mount +430 common fsopen sys_fsopen +431 common fsconfig sys_fsconfig +432 common fsmount sys_fsmount +433 common fspick sys_fspick +434 common pidfd_open sys_pidfd_open +435 common clone3 sys_clone3 +437 common openat2 sys_openat2 +438 common pidfd_getfd sys_pidfd_getfd # # x32-specific system call numbers start at 512 to avoid cache impact @@ -366,39 +366,39 @@ # on-the-fly for compat_sys_*() compatibility system calls if X86_X32 # is defined. # -512 x32 rt_sigaction __x32_compat_sys_rt_sigaction -513 x32 rt_sigreturn __x32_compat_sys_x32_rt_sigreturn -514 x32 ioctl __x32_compat_sys_ioctl -515 x32 readv __x32_compat_sys_readv -516 x32 writev __x32_compat_sys_writev -517 x32 recvfrom __x32_compat_sys_recvfrom -518 x32 sendmsg __x32_compat_sys_sendmsg -519 x32 recvmsg __x32_compat_sys_recvmsg -520 x32 execve __x32_compat_sys_execve -521 x32 ptrace __x32_compat_sys_ptrace -522 x32 rt_sigpending __x32_compat_sys_rt_sigpending -523 x32 rt_sigtimedwait __x32_compat_sys_rt_sigtimedwait_time64 -524 x32 rt_sigqueueinfo __x32_compat_sys_rt_sigqueueinfo -525 x32 sigaltstack __x32_compat_sys_sigaltstack -526 x32 timer_create __x32_compat_sys_timer_create -527 x32 mq_notify __x32_compat_sys_mq_notify -528 x32 kexec_load __x32_compat_sys_kexec_load -529 x32 waitid __x32_compat_sys_waitid -530 x32 set_robust_list __x32_compat_sys_set_robust_list -531 x32 get_robust_list __x32_compat_sys_get_robust_list -532 x32 vmsplice __x32_compat_sys_vmsplice -533 x32 move_pages __x32_compat_sys_move_pages -534 x32 preadv __x32_compat_sys_preadv64 -535 x32 pwritev __x32_compat_sys_pwritev64 -536 x32 rt_tgsigqueueinfo __x32_compat_sys_rt_tgsigqueueinfo -537 x32 recvmmsg __x32_compat_sys_recvmmsg_time64 -538 x32 sendmmsg __x32_compat_sys_sendmmsg -539 x32 process_vm_readv __x32_compat_sys_process_vm_readv -540 x32 process_vm_writev __x32_compat_sys_process_vm_writev -541 x32 setsockopt __x32_compat_sys_setsockopt -542 x32 getsockopt __x32_compat_sys_getsockopt -543 x32 io_setup __x32_compat_sys_io_setup -544 x32 io_submit __x32_compat_sys_io_submit -545 x32 execveat __x32_compat_sys_execveat -546 x32 preadv2 __x32_compat_sys_preadv64v2 -547 x32 pwritev2 __x32_compat_sys_pwritev64v2 +512 x32 rt_sigaction compat_sys_rt_sigaction +513 x32 rt_sigreturn compat_sys_x32_rt_sigreturn +514 x32 ioctl compat_sys_ioctl +515 x32 readv compat_sys_readv +516 x32 writev compat_sys_writev +517 x32 recvfrom compat_sys_recvfrom +518 x32 sendmsg compat_sys_sendmsg +519 x32 recvmsg compat_sys_recvmsg +520 x32 execve compat_sys_execve +521 x32 ptrace compat_sys_ptrace +522 x32 rt_sigpending compat_sys_rt_sigpending +523 x32 rt_sigtimedwait compat_sys_rt_sigtimedwait_time64 +524 x32 rt_sigqueueinfo compat_sys_rt_sigqueueinfo +525 x32 sigaltstack compat_sys_sigaltstack +526 x32 timer_create compat_sys_timer_create +527 x32 mq_notify compat_sys_mq_notify +528 x32 kexec_load compat_sys_kexec_load +529 x32 waitid compat_sys_waitid +530 x32 set_robust_list compat_sys_set_robust_list +531 x32 get_robust_list compat_sys_get_robust_list +532 x32 vmsplice compat_sys_vmsplice +533 x32 move_pages compat_sys_move_pages +534 x32 preadv compat_sys_preadv64 +535 x32 pwritev compat_sys_pwritev64 +536 x32 rt_tgsigqueueinfo compat_sys_rt_tgsigqueueinfo +537 x32 recvmmsg compat_sys_recvmmsg_time64 +538 x32 sendmmsg compat_sys_sendmmsg +539 x32 process_vm_readv compat_sys_process_vm_readv +540 x32 process_vm_writev compat_sys_process_vm_writev +541 x32 setsockopt compat_sys_setsockopt +542 x32 getsockopt compat_sys_getsockopt +543 x32 io_setup compat_sys_io_setup +544 x32 io_submit compat_sys_io_submit +545 x32 execveat compat_sys_execveat +546 x32 preadv2 compat_sys_preadv64v2 +547 x32 pwritev2 compat_sys_pwritev64v2 diff --git a/arch/x86/entry/syscalls/syscalltbl.sh b/arch/x86/entry/syscalls/syscalltbl.sh index 6106ed37b8de..929bde120d6b 100644 --- a/arch/x86/entry/syscalls/syscalltbl.sh +++ b/arch/x86/entry/syscalls/syscalltbl.sh @@ -17,27 +17,15 @@ emit() { local nr="$2" local entry="$3" local compat="$4" - local umlentry="" if [ "$abi" != "I386" -a -n "$compat" ]; then echo "a compat entry ($abi: $compat) for a 64-bit syscall makes no sense" >&2 exit 1 fi - # For CONFIG_UML, we need to strip the __x64_sys prefix - if [ "${entry}" != "${entry#__x64_sys}" ]; then - umlentry="sys${entry#__x64_sys}" - fi - if [ -z "$compat" ]; then - if [ -n "$entry" -a -z "$umlentry" ]; then - syscall_macro "$abi" "$nr" "$entry" - elif [ -n "$umlentry" ]; then # implies -n "$entry" - echo "#ifdef CONFIG_X86" + if [ -n "$entry" ]; then syscall_macro "$abi" "$nr" "$entry" - echo "#else /* CONFIG_UML */" - syscall_macro "$abi" "$nr" "$umlentry" - echo "#endif" fi else echo "#ifdef CONFIG_X86_32" -- cgit From a845a6cf1dad16fa86007c86702fb27fec89d38b Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:39 -0400 Subject: x86/entry/32: Clean up syscall_32.tbl After removal of the __ia32_ prefix, remove compat entries that are now identical to the native entry. Converted with this script and fixing up whitespace: while read nr abi name entry compat; do if [ "${nr:0:1}" = "#" ]; then echo $nr $abi $name $entry $compat continue fi if [ "$entry" = "$compat" ]; then compat="" fi echo "$nr $abi $name $entry $compat" done Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200313195144.164260-14-brgerst@gmail.com --- arch/x86/entry/syscalls/syscall_32.tbl | 578 ++++++++++++++++----------------- 1 file changed, 289 insertions(+), 289 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 390c21230290..b0712cdae7d8 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -11,179 +11,179 @@ # # The abi is always "i386" for this file. # -0 i386 restart_syscall sys_restart_syscall sys_restart_syscall -1 i386 exit sys_exit sys_exit -2 i386 fork sys_fork sys_fork -3 i386 read sys_read sys_read -4 i386 write sys_write sys_write +0 i386 restart_syscall sys_restart_syscall +1 i386 exit sys_exit +2 i386 fork sys_fork +3 i386 read sys_read +4 i386 write sys_write 5 i386 open sys_open compat_sys_open -6 i386 close sys_close sys_close -7 i386 waitpid sys_waitpid sys_waitpid -8 i386 creat sys_creat sys_creat -9 i386 link sys_link sys_link -10 i386 unlink sys_unlink sys_unlink +6 i386 close sys_close +7 i386 waitpid sys_waitpid +8 i386 creat sys_creat +9 i386 link sys_link +10 i386 unlink sys_unlink 11 i386 execve sys_execve compat_sys_execve -12 i386 chdir sys_chdir sys_chdir -13 i386 time sys_time32 sys_time32 -14 i386 mknod sys_mknod sys_mknod -15 i386 chmod sys_chmod sys_chmod -16 i386 lchown sys_lchown16 sys_lchown16 +12 i386 chdir sys_chdir +13 i386 time sys_time32 +14 i386 mknod sys_mknod +15 i386 chmod sys_chmod +16 i386 lchown sys_lchown16 17 i386 break -18 i386 oldstat sys_stat sys_stat +18 i386 oldstat sys_stat 19 i386 lseek sys_lseek compat_sys_lseek -20 i386 getpid sys_getpid sys_getpid +20 i386 getpid sys_getpid 21 i386 mount sys_mount compat_sys_mount -22 i386 umount sys_oldumount sys_oldumount -23 i386 setuid sys_setuid16 sys_setuid16 -24 i386 getuid sys_getuid16 sys_getuid16 -25 i386 stime sys_stime32 sys_stime32 +22 i386 umount sys_oldumount +23 i386 setuid sys_setuid16 +24 i386 getuid sys_getuid16 +25 i386 stime sys_stime32 26 i386 ptrace sys_ptrace compat_sys_ptrace -27 i386 alarm sys_alarm sys_alarm -28 i386 oldfstat sys_fstat sys_fstat -29 i386 pause sys_pause sys_pause -30 i386 utime sys_utime32 sys_utime32 +27 i386 alarm sys_alarm +28 i386 oldfstat sys_fstat +29 i386 pause sys_pause +30 i386 utime sys_utime32 31 i386 stty 32 i386 gtty -33 i386 access sys_access sys_access -34 i386 nice sys_nice sys_nice +33 i386 access sys_access +34 i386 nice sys_nice 35 i386 ftime -36 i386 sync sys_sync sys_sync -37 i386 kill sys_kill sys_kill -38 i386 rename sys_rename sys_rename -39 i386 mkdir sys_mkdir sys_mkdir -40 i386 rmdir sys_rmdir sys_rmdir -41 i386 dup sys_dup sys_dup -42 i386 pipe sys_pipe sys_pipe +36 i386 sync sys_sync +37 i386 kill sys_kill +38 i386 rename sys_rename +39 i386 mkdir sys_mkdir +40 i386 rmdir sys_rmdir +41 i386 dup sys_dup +42 i386 pipe sys_pipe 43 i386 times sys_times compat_sys_times 44 i386 prof -45 i386 brk sys_brk sys_brk -46 i386 setgid sys_setgid16 sys_setgid16 -47 i386 getgid sys_getgid16 sys_getgid16 -48 i386 signal sys_signal sys_signal -49 i386 geteuid sys_geteuid16 sys_geteuid16 -50 i386 getegid sys_getegid16 sys_getegid16 -51 i386 acct sys_acct sys_acct -52 i386 umount2 sys_umount sys_umount +45 i386 brk sys_brk +46 i386 setgid sys_setgid16 +47 i386 getgid sys_getgid16 +48 i386 signal sys_signal +49 i386 geteuid sys_geteuid16 +50 i386 getegid sys_getegid16 +51 i386 acct sys_acct +52 i386 umount2 sys_umount 53 i386 lock 54 i386 ioctl sys_ioctl compat_sys_ioctl 55 i386 fcntl sys_fcntl compat_sys_fcntl64 56 i386 mpx -57 i386 setpgid sys_setpgid sys_setpgid +57 i386 setpgid sys_setpgid 58 i386 ulimit -59 i386 oldolduname sys_olduname sys_olduname -60 i386 umask sys_umask sys_umask -61 i386 chroot sys_chroot sys_chroot +59 i386 oldolduname sys_olduname +60 i386 umask sys_umask +61 i386 chroot sys_chroot 62 i386 ustat sys_ustat compat_sys_ustat -63 i386 dup2 sys_dup2 sys_dup2 -64 i386 getppid sys_getppid sys_getppid -65 i386 getpgrp sys_getpgrp sys_getpgrp -66 i386 setsid sys_setsid sys_setsid +63 i386 dup2 sys_dup2 +64 i386 getppid sys_getppid +65 i386 getpgrp sys_getpgrp +66 i386 setsid sys_setsid 67 i386 sigaction sys_sigaction compat_sys_sigaction -68 i386 sgetmask sys_sgetmask sys_sgetmask -69 i386 ssetmask sys_ssetmask sys_ssetmask -70 i386 setreuid sys_setreuid16 sys_setreuid16 -71 i386 setregid sys_setregid16 sys_setregid16 -72 i386 sigsuspend sys_sigsuspend sys_sigsuspend +68 i386 sgetmask sys_sgetmask +69 i386 ssetmask sys_ssetmask +70 i386 setreuid sys_setreuid16 +71 i386 setregid sys_setregid16 +72 i386 sigsuspend sys_sigsuspend 73 i386 sigpending sys_sigpending compat_sys_sigpending -74 i386 sethostname sys_sethostname sys_sethostname +74 i386 sethostname sys_sethostname 75 i386 setrlimit sys_setrlimit compat_sys_setrlimit 76 i386 getrlimit sys_old_getrlimit compat_sys_old_getrlimit 77 i386 getrusage sys_getrusage compat_sys_getrusage 78 i386 gettimeofday sys_gettimeofday compat_sys_gettimeofday 79 i386 settimeofday sys_settimeofday compat_sys_settimeofday -80 i386 getgroups sys_getgroups16 sys_getgroups16 -81 i386 setgroups sys_setgroups16 sys_setgroups16 +80 i386 getgroups sys_getgroups16 +81 i386 setgroups sys_setgroups16 82 i386 select sys_old_select compat_sys_old_select -83 i386 symlink sys_symlink sys_symlink -84 i386 oldlstat sys_lstat sys_lstat -85 i386 readlink sys_readlink sys_readlink -86 i386 uselib sys_uselib sys_uselib -87 i386 swapon sys_swapon sys_swapon -88 i386 reboot sys_reboot sys_reboot +83 i386 symlink sys_symlink +84 i386 oldlstat sys_lstat +85 i386 readlink sys_readlink +86 i386 uselib sys_uselib +87 i386 swapon sys_swapon +88 i386 reboot sys_reboot 89 i386 readdir sys_old_readdir compat_sys_old_readdir 90 i386 mmap sys_old_mmap compat_sys_x86_mmap -91 i386 munmap sys_munmap sys_munmap +91 i386 munmap sys_munmap 92 i386 truncate sys_truncate compat_sys_truncate 93 i386 ftruncate sys_ftruncate compat_sys_ftruncate -94 i386 fchmod sys_fchmod sys_fchmod -95 i386 fchown sys_fchown16 sys_fchown16 -96 i386 getpriority sys_getpriority sys_getpriority -97 i386 setpriority sys_setpriority sys_setpriority +94 i386 fchmod sys_fchmod +95 i386 fchown sys_fchown16 +96 i386 getpriority sys_getpriority +97 i386 setpriority sys_setpriority 98 i386 profil 99 i386 statfs sys_statfs compat_sys_statfs 100 i386 fstatfs sys_fstatfs compat_sys_fstatfs -101 i386 ioperm sys_ioperm sys_ioperm +101 i386 ioperm sys_ioperm 102 i386 socketcall sys_socketcall compat_sys_socketcall -103 i386 syslog sys_syslog sys_syslog +103 i386 syslog sys_syslog 104 i386 setitimer sys_setitimer compat_sys_setitimer 105 i386 getitimer sys_getitimer compat_sys_getitimer 106 i386 stat sys_newstat compat_sys_newstat 107 i386 lstat sys_newlstat compat_sys_newlstat 108 i386 fstat sys_newfstat compat_sys_newfstat -109 i386 olduname sys_uname sys_uname -110 i386 iopl sys_iopl sys_iopl -111 i386 vhangup sys_vhangup sys_vhangup +109 i386 olduname sys_uname +110 i386 iopl sys_iopl +111 i386 vhangup sys_vhangup 112 i386 idle 113 i386 vm86old sys_vm86old sys_ni_syscall 114 i386 wait4 sys_wait4 compat_sys_wait4 -115 i386 swapoff sys_swapoff sys_swapoff +115 i386 swapoff sys_swapoff 116 i386 sysinfo sys_sysinfo compat_sys_sysinfo 117 i386 ipc sys_ipc compat_sys_ipc -118 i386 fsync sys_fsync sys_fsync +118 i386 fsync sys_fsync 119 i386 sigreturn sys_sigreturn compat_sys_sigreturn 120 i386 clone sys_clone compat_sys_x86_clone -121 i386 setdomainname sys_setdomainname sys_setdomainname -122 i386 uname sys_newuname sys_newuname -123 i386 modify_ldt sys_modify_ldt sys_modify_ldt -124 i386 adjtimex sys_adjtimex_time32 sys_adjtimex_time32 -125 i386 mprotect sys_mprotect sys_mprotect +121 i386 setdomainname sys_setdomainname +122 i386 uname sys_newuname +123 i386 modify_ldt sys_modify_ldt +124 i386 adjtimex sys_adjtimex_time32 +125 i386 mprotect sys_mprotect 126 i386 sigprocmask sys_sigprocmask compat_sys_sigprocmask 127 i386 create_module -128 i386 init_module sys_init_module sys_init_module -129 i386 delete_module sys_delete_module sys_delete_module +128 i386 init_module sys_init_module +129 i386 delete_module sys_delete_module 130 i386 get_kernel_syms 131 i386 quotactl sys_quotactl compat_sys_quotactl32 -132 i386 getpgid sys_getpgid sys_getpgid -133 i386 fchdir sys_fchdir sys_fchdir -134 i386 bdflush sys_bdflush sys_bdflush -135 i386 sysfs sys_sysfs sys_sysfs -136 i386 personality sys_personality sys_personality +132 i386 getpgid sys_getpgid +133 i386 fchdir sys_fchdir +134 i386 bdflush sys_bdflush +135 i386 sysfs sys_sysfs +136 i386 personality sys_personality 137 i386 afs_syscall -138 i386 setfsuid sys_setfsuid16 sys_setfsuid16 -139 i386 setfsgid sys_setfsgid16 sys_setfsgid16 -140 i386 _llseek sys_llseek sys_llseek +138 i386 setfsuid sys_setfsuid16 +139 i386 setfsgid sys_setfsgid16 +140 i386 _llseek sys_llseek 141 i386 getdents sys_getdents compat_sys_getdents 142 i386 _newselect sys_select compat_sys_select -143 i386 flock sys_flock sys_flock -144 i386 msync sys_msync sys_msync +143 i386 flock sys_flock +144 i386 msync sys_msync 145 i386 readv sys_readv compat_sys_readv 146 i386 writev sys_writev compat_sys_writev -147 i386 getsid sys_getsid sys_getsid -148 i386 fdatasync sys_fdatasync sys_fdatasync +147 i386 getsid sys_getsid +148 i386 fdatasync sys_fdatasync 149 i386 _sysctl sys_sysctl compat_sys_sysctl -150 i386 mlock sys_mlock sys_mlock -151 i386 munlock sys_munlock sys_munlock -152 i386 mlockall sys_mlockall sys_mlockall -153 i386 munlockall sys_munlockall sys_munlockall -154 i386 sched_setparam sys_sched_setparam sys_sched_setparam -155 i386 sched_getparam sys_sched_getparam sys_sched_getparam -156 i386 sched_setscheduler sys_sched_setscheduler sys_sched_setscheduler -157 i386 sched_getscheduler sys_sched_getscheduler sys_sched_getscheduler -158 i386 sched_yield sys_sched_yield sys_sched_yield -159 i386 sched_get_priority_max sys_sched_get_priority_max sys_sched_get_priority_max -160 i386 sched_get_priority_min sys_sched_get_priority_min sys_sched_get_priority_min -161 i386 sched_rr_get_interval sys_sched_rr_get_interval_time32 sys_sched_rr_get_interval_time32 -162 i386 nanosleep sys_nanosleep_time32 sys_nanosleep_time32 -163 i386 mremap sys_mremap sys_mremap -164 i386 setresuid sys_setresuid16 sys_setresuid16 -165 i386 getresuid sys_getresuid16 sys_getresuid16 +150 i386 mlock sys_mlock +151 i386 munlock sys_munlock +152 i386 mlockall sys_mlockall +153 i386 munlockall sys_munlockall +154 i386 sched_setparam sys_sched_setparam +155 i386 sched_getparam sys_sched_getparam +156 i386 sched_setscheduler sys_sched_setscheduler +157 i386 sched_getscheduler sys_sched_getscheduler +158 i386 sched_yield sys_sched_yield +159 i386 sched_get_priority_max sys_sched_get_priority_max +160 i386 sched_get_priority_min sys_sched_get_priority_min +161 i386 sched_rr_get_interval sys_sched_rr_get_interval_time32 +162 i386 nanosleep sys_nanosleep_time32 +163 i386 mremap sys_mremap +164 i386 setresuid sys_setresuid16 +165 i386 getresuid sys_getresuid16 166 i386 vm86 sys_vm86 sys_ni_syscall 167 i386 query_module -168 i386 poll sys_poll sys_poll +168 i386 poll sys_poll 169 i386 nfsservctl -170 i386 setresgid sys_setresgid16 sys_setresgid16 -171 i386 getresgid sys_getresgid16 sys_getresgid16 -172 i386 prctl sys_prctl sys_prctl +170 i386 setresgid sys_setresgid16 +171 i386 getresgid sys_getresgid16 +172 i386 prctl sys_prctl 173 i386 rt_sigreturn sys_rt_sigreturn compat_sys_rt_sigreturn 174 i386 rt_sigaction sys_rt_sigaction compat_sys_rt_sigaction 175 i386 rt_sigprocmask sys_rt_sigprocmask compat_sys_rt_sigprocmask @@ -193,252 +193,252 @@ 179 i386 rt_sigsuspend sys_rt_sigsuspend compat_sys_rt_sigsuspend 180 i386 pread64 sys_pread64 compat_sys_x86_pread 181 i386 pwrite64 sys_pwrite64 compat_sys_x86_pwrite -182 i386 chown sys_chown16 sys_chown16 -183 i386 getcwd sys_getcwd sys_getcwd -184 i386 capget sys_capget sys_capget -185 i386 capset sys_capset sys_capset +182 i386 chown sys_chown16 +183 i386 getcwd sys_getcwd +184 i386 capget sys_capget +185 i386 capset sys_capset 186 i386 sigaltstack sys_sigaltstack compat_sys_sigaltstack 187 i386 sendfile sys_sendfile compat_sys_sendfile 188 i386 getpmsg 189 i386 putpmsg -190 i386 vfork sys_vfork sys_vfork +190 i386 vfork sys_vfork 191 i386 ugetrlimit sys_getrlimit compat_sys_getrlimit -192 i386 mmap2 sys_mmap_pgoff sys_mmap_pgoff +192 i386 mmap2 sys_mmap_pgoff 193 i386 truncate64 sys_truncate64 compat_sys_x86_truncate64 194 i386 ftruncate64 sys_ftruncate64 compat_sys_x86_ftruncate64 195 i386 stat64 sys_stat64 compat_sys_x86_stat64 196 i386 lstat64 sys_lstat64 compat_sys_x86_lstat64 197 i386 fstat64 sys_fstat64 compat_sys_x86_fstat64 -198 i386 lchown32 sys_lchown sys_lchown -199 i386 getuid32 sys_getuid sys_getuid -200 i386 getgid32 sys_getgid sys_getgid -201 i386 geteuid32 sys_geteuid sys_geteuid -202 i386 getegid32 sys_getegid sys_getegid -203 i386 setreuid32 sys_setreuid sys_setreuid -204 i386 setregid32 sys_setregid sys_setregid -205 i386 getgroups32 sys_getgroups sys_getgroups -206 i386 setgroups32 sys_setgroups sys_setgroups -207 i386 fchown32 sys_fchown sys_fchown -208 i386 setresuid32 sys_setresuid sys_setresuid -209 i386 getresuid32 sys_getresuid sys_getresuid -210 i386 setresgid32 sys_setresgid sys_setresgid -211 i386 getresgid32 sys_getresgid sys_getresgid -212 i386 chown32 sys_chown sys_chown -213 i386 setuid32 sys_setuid sys_setuid -214 i386 setgid32 sys_setgid sys_setgid -215 i386 setfsuid32 sys_setfsuid sys_setfsuid -216 i386 setfsgid32 sys_setfsgid sys_setfsgid -217 i386 pivot_root sys_pivot_root sys_pivot_root -218 i386 mincore sys_mincore sys_mincore -219 i386 madvise sys_madvise sys_madvise -220 i386 getdents64 sys_getdents64 sys_getdents64 +198 i386 lchown32 sys_lchown +199 i386 getuid32 sys_getuid +200 i386 getgid32 sys_getgid +201 i386 geteuid32 sys_geteuid +202 i386 getegid32 sys_getegid +203 i386 setreuid32 sys_setreuid +204 i386 setregid32 sys_setregid +205 i386 getgroups32 sys_getgroups +206 i386 setgroups32 sys_setgroups +207 i386 fchown32 sys_fchown +208 i386 setresuid32 sys_setresuid +209 i386 getresuid32 sys_getresuid +210 i386 setresgid32 sys_setresgid +211 i386 getresgid32 sys_getresgid +212 i386 chown32 sys_chown +213 i386 setuid32 sys_setuid +214 i386 setgid32 sys_setgid +215 i386 setfsuid32 sys_setfsuid +216 i386 setfsgid32 sys_setfsgid +217 i386 pivot_root sys_pivot_root +218 i386 mincore sys_mincore +219 i386 madvise sys_madvise +220 i386 getdents64 sys_getdents64 221 i386 fcntl64 sys_fcntl64 compat_sys_fcntl64 # 222 is unused # 223 is unused -224 i386 gettid sys_gettid sys_gettid +224 i386 gettid sys_gettid 225 i386 readahead sys_readahead compat_sys_x86_readahead -226 i386 setxattr sys_setxattr sys_setxattr -227 i386 lsetxattr sys_lsetxattr sys_lsetxattr -228 i386 fsetxattr sys_fsetxattr sys_fsetxattr -229 i386 getxattr sys_getxattr sys_getxattr -230 i386 lgetxattr sys_lgetxattr sys_lgetxattr -231 i386 fgetxattr sys_fgetxattr sys_fgetxattr -232 i386 listxattr sys_listxattr sys_listxattr -233 i386 llistxattr sys_llistxattr sys_llistxattr -234 i386 flistxattr sys_flistxattr sys_flistxattr -235 i386 removexattr sys_removexattr sys_removexattr -236 i386 lremovexattr sys_lremovexattr sys_lremovexattr -237 i386 fremovexattr sys_fremovexattr sys_fremovexattr -238 i386 tkill sys_tkill sys_tkill -239 i386 sendfile64 sys_sendfile64 sys_sendfile64 -240 i386 futex sys_futex_time32 sys_futex_time32 +226 i386 setxattr sys_setxattr +227 i386 lsetxattr sys_lsetxattr +228 i386 fsetxattr sys_fsetxattr +229 i386 getxattr sys_getxattr +230 i386 lgetxattr sys_lgetxattr +231 i386 fgetxattr sys_fgetxattr +232 i386 listxattr sys_listxattr +233 i386 llistxattr sys_llistxattr +234 i386 flistxattr sys_flistxattr +235 i386 removexattr sys_removexattr +236 i386 lremovexattr sys_lremovexattr +237 i386 fremovexattr sys_fremovexattr +238 i386 tkill sys_tkill +239 i386 sendfile64 sys_sendfile64 +240 i386 futex sys_futex_time32 241 i386 sched_setaffinity sys_sched_setaffinity compat_sys_sched_setaffinity 242 i386 sched_getaffinity sys_sched_getaffinity compat_sys_sched_getaffinity -243 i386 set_thread_area sys_set_thread_area sys_set_thread_area -244 i386 get_thread_area sys_get_thread_area sys_get_thread_area +243 i386 set_thread_area sys_set_thread_area +244 i386 get_thread_area sys_get_thread_area 245 i386 io_setup sys_io_setup compat_sys_io_setup -246 i386 io_destroy sys_io_destroy sys_io_destroy -247 i386 io_getevents sys_io_getevents_time32 sys_io_getevents_time32 +246 i386 io_destroy sys_io_destroy +247 i386 io_getevents sys_io_getevents_time32 248 i386 io_submit sys_io_submit compat_sys_io_submit -249 i386 io_cancel sys_io_cancel sys_io_cancel +249 i386 io_cancel sys_io_cancel 250 i386 fadvise64 sys_fadvise64 compat_sys_x86_fadvise64 # 251 is available for reuse (was briefly sys_set_zone_reclaim) -252 i386 exit_group sys_exit_group sys_exit_group +252 i386 exit_group sys_exit_group 253 i386 lookup_dcookie sys_lookup_dcookie compat_sys_lookup_dcookie -254 i386 epoll_create sys_epoll_create sys_epoll_create -255 i386 epoll_ctl sys_epoll_ctl sys_epoll_ctl -256 i386 epoll_wait sys_epoll_wait sys_epoll_wait -257 i386 remap_file_pages sys_remap_file_pages sys_remap_file_pages -258 i386 set_tid_address sys_set_tid_address sys_set_tid_address +254 i386 epoll_create sys_epoll_create +255 i386 epoll_ctl sys_epoll_ctl +256 i386 epoll_wait sys_epoll_wait +257 i386 remap_file_pages sys_remap_file_pages +258 i386 set_tid_address sys_set_tid_address 259 i386 timer_create sys_timer_create compat_sys_timer_create -260 i386 timer_settime sys_timer_settime32 sys_timer_settime32 -261 i386 timer_gettime sys_timer_gettime32 sys_timer_gettime32 -262 i386 timer_getoverrun sys_timer_getoverrun sys_timer_getoverrun -263 i386 timer_delete sys_timer_delete sys_timer_delete -264 i386 clock_settime sys_clock_settime32 sys_clock_settime32 -265 i386 clock_gettime sys_clock_gettime32 sys_clock_gettime32 -266 i386 clock_getres sys_clock_getres_time32 sys_clock_getres_time32 -267 i386 clock_nanosleep sys_clock_nanosleep_time32 sys_clock_nanosleep_time32 +260 i386 timer_settime sys_timer_settime32 +261 i386 timer_gettime sys_timer_gettime32 +262 i386 timer_getoverrun sys_timer_getoverrun +263 i386 timer_delete sys_timer_delete +264 i386 clock_settime sys_clock_settime32 +265 i386 clock_gettime sys_clock_gettime32 +266 i386 clock_getres sys_clock_getres_time32 +267 i386 clock_nanosleep sys_clock_nanosleep_time32 268 i386 statfs64 sys_statfs64 compat_sys_statfs64 269 i386 fstatfs64 sys_fstatfs64 compat_sys_fstatfs64 -270 i386 tgkill sys_tgkill sys_tgkill -271 i386 utimes sys_utimes_time32 sys_utimes_time32 +270 i386 tgkill sys_tgkill +271 i386 utimes sys_utimes_time32 272 i386 fadvise64_64 sys_fadvise64_64 compat_sys_x86_fadvise64_64 273 i386 vserver -274 i386 mbind sys_mbind sys_mbind +274 i386 mbind sys_mbind 275 i386 get_mempolicy sys_get_mempolicy compat_sys_get_mempolicy -276 i386 set_mempolicy sys_set_mempolicy sys_set_mempolicy +276 i386 set_mempolicy sys_set_mempolicy 277 i386 mq_open sys_mq_open compat_sys_mq_open -278 i386 mq_unlink sys_mq_unlink sys_mq_unlink -279 i386 mq_timedsend sys_mq_timedsend_time32 sys_mq_timedsend_time32 -280 i386 mq_timedreceive sys_mq_timedreceive_time32 sys_mq_timedreceive_time32 +278 i386 mq_unlink sys_mq_unlink +279 i386 mq_timedsend sys_mq_timedsend_time32 +280 i386 mq_timedreceive sys_mq_timedreceive_time32 281 i386 mq_notify sys_mq_notify compat_sys_mq_notify 282 i386 mq_getsetattr sys_mq_getsetattr compat_sys_mq_getsetattr 283 i386 kexec_load sys_kexec_load compat_sys_kexec_load 284 i386 waitid sys_waitid compat_sys_waitid # 285 sys_setaltroot -286 i386 add_key sys_add_key sys_add_key -287 i386 request_key sys_request_key sys_request_key +286 i386 add_key sys_add_key +287 i386 request_key sys_request_key 288 i386 keyctl sys_keyctl compat_sys_keyctl -289 i386 ioprio_set sys_ioprio_set sys_ioprio_set -290 i386 ioprio_get sys_ioprio_get sys_ioprio_get -291 i386 inotify_init sys_inotify_init sys_inotify_init -292 i386 inotify_add_watch sys_inotify_add_watch sys_inotify_add_watch -293 i386 inotify_rm_watch sys_inotify_rm_watch sys_inotify_rm_watch -294 i386 migrate_pages sys_migrate_pages sys_migrate_pages +289 i386 ioprio_set sys_ioprio_set +290 i386 ioprio_get sys_ioprio_get +291 i386 inotify_init sys_inotify_init +292 i386 inotify_add_watch sys_inotify_add_watch +293 i386 inotify_rm_watch sys_inotify_rm_watch +294 i386 migrate_pages sys_migrate_pages 295 i386 openat sys_openat compat_sys_openat -296 i386 mkdirat sys_mkdirat sys_mkdirat -297 i386 mknodat sys_mknodat sys_mknodat -298 i386 fchownat sys_fchownat sys_fchownat -299 i386 futimesat sys_futimesat_time32 sys_futimesat_time32 +296 i386 mkdirat sys_mkdirat +297 i386 mknodat sys_mknodat +298 i386 fchownat sys_fchownat +299 i386 futimesat sys_futimesat_time32 300 i386 fstatat64 sys_fstatat64 compat_sys_x86_fstatat -301 i386 unlinkat sys_unlinkat sys_unlinkat -302 i386 renameat sys_renameat sys_renameat -303 i386 linkat sys_linkat sys_linkat -304 i386 symlinkat sys_symlinkat sys_symlinkat -305 i386 readlinkat sys_readlinkat sys_readlinkat -306 i386 fchmodat sys_fchmodat sys_fchmodat -307 i386 faccessat sys_faccessat sys_faccessat +301 i386 unlinkat sys_unlinkat +302 i386 renameat sys_renameat +303 i386 linkat sys_linkat +304 i386 symlinkat sys_symlinkat +305 i386 readlinkat sys_readlinkat +306 i386 fchmodat sys_fchmodat +307 i386 faccessat sys_faccessat 308 i386 pselect6 sys_pselect6_time32 compat_sys_pselect6_time32 309 i386 ppoll sys_ppoll_time32 compat_sys_ppoll_time32 -310 i386 unshare sys_unshare sys_unshare +310 i386 unshare sys_unshare 311 i386 set_robust_list sys_set_robust_list compat_sys_set_robust_list 312 i386 get_robust_list sys_get_robust_list compat_sys_get_robust_list -313 i386 splice sys_splice sys_splice +313 i386 splice sys_splice 314 i386 sync_file_range sys_sync_file_range compat_sys_x86_sync_file_range -315 i386 tee sys_tee sys_tee +315 i386 tee sys_tee 316 i386 vmsplice sys_vmsplice compat_sys_vmsplice 317 i386 move_pages sys_move_pages compat_sys_move_pages -318 i386 getcpu sys_getcpu sys_getcpu -319 i386 epoll_pwait sys_epoll_pwait sys_epoll_pwait -320 i386 utimensat sys_utimensat_time32 sys_utimensat_time32 +318 i386 getcpu sys_getcpu +319 i386 epoll_pwait sys_epoll_pwait +320 i386 utimensat sys_utimensat_time32 321 i386 signalfd sys_signalfd compat_sys_signalfd -322 i386 timerfd_create sys_timerfd_create sys_timerfd_create -323 i386 eventfd sys_eventfd sys_eventfd +322 i386 timerfd_create sys_timerfd_create +323 i386 eventfd sys_eventfd 324 i386 fallocate sys_fallocate compat_sys_x86_fallocate -325 i386 timerfd_settime sys_timerfd_settime32 sys_timerfd_settime32 -326 i386 timerfd_gettime sys_timerfd_gettime32 sys_timerfd_gettime32 +325 i386 timerfd_settime sys_timerfd_settime32 +326 i386 timerfd_gettime sys_timerfd_gettime32 327 i386 signalfd4 sys_signalfd4 compat_sys_signalfd4 -328 i386 eventfd2 sys_eventfd2 sys_eventfd2 -329 i386 epoll_create1 sys_epoll_create1 sys_epoll_create1 -330 i386 dup3 sys_dup3 sys_dup3 -331 i386 pipe2 sys_pipe2 sys_pipe2 -332 i386 inotify_init1 sys_inotify_init1 sys_inotify_init1 +328 i386 eventfd2 sys_eventfd2 +329 i386 epoll_create1 sys_epoll_create1 +330 i386 dup3 sys_dup3 +331 i386 pipe2 sys_pipe2 +332 i386 inotify_init1 sys_inotify_init1 333 i386 preadv sys_preadv compat_sys_preadv 334 i386 pwritev sys_pwritev compat_sys_pwritev 335 i386 rt_tgsigqueueinfo sys_rt_tgsigqueueinfo compat_sys_rt_tgsigqueueinfo -336 i386 perf_event_open sys_perf_event_open sys_perf_event_open +336 i386 perf_event_open sys_perf_event_open 337 i386 recvmmsg sys_recvmmsg_time32 compat_sys_recvmmsg_time32 -338 i386 fanotify_init sys_fanotify_init sys_fanotify_init +338 i386 fanotify_init sys_fanotify_init 339 i386 fanotify_mark sys_fanotify_mark compat_sys_fanotify_mark -340 i386 prlimit64 sys_prlimit64 sys_prlimit64 -341 i386 name_to_handle_at sys_name_to_handle_at sys_name_to_handle_at +340 i386 prlimit64 sys_prlimit64 +341 i386 name_to_handle_at sys_name_to_handle_at 342 i386 open_by_handle_at sys_open_by_handle_at compat_sys_open_by_handle_at -343 i386 clock_adjtime sys_clock_adjtime32 sys_clock_adjtime32 -344 i386 syncfs sys_syncfs sys_syncfs +343 i386 clock_adjtime sys_clock_adjtime32 +344 i386 syncfs sys_syncfs 345 i386 sendmmsg sys_sendmmsg compat_sys_sendmmsg -346 i386 setns sys_setns sys_setns +346 i386 setns sys_setns 347 i386 process_vm_readv sys_process_vm_readv compat_sys_process_vm_readv 348 i386 process_vm_writev sys_process_vm_writev compat_sys_process_vm_writev -349 i386 kcmp sys_kcmp sys_kcmp -350 i386 finit_module sys_finit_module sys_finit_module -351 i386 sched_setattr sys_sched_setattr sys_sched_setattr -352 i386 sched_getattr sys_sched_getattr sys_sched_getattr -353 i386 renameat2 sys_renameat2 sys_renameat2 -354 i386 seccomp sys_seccomp sys_seccomp -355 i386 getrandom sys_getrandom sys_getrandom -356 i386 memfd_create sys_memfd_create sys_memfd_create -357 i386 bpf sys_bpf sys_bpf +349 i386 kcmp sys_kcmp +350 i386 finit_module sys_finit_module +351 i386 sched_setattr sys_sched_setattr +352 i386 sched_getattr sys_sched_getattr +353 i386 renameat2 sys_renameat2 +354 i386 seccomp sys_seccomp +355 i386 getrandom sys_getrandom +356 i386 memfd_create sys_memfd_create +357 i386 bpf sys_bpf 358 i386 execveat sys_execveat compat_sys_execveat -359 i386 socket sys_socket sys_socket -360 i386 socketpair sys_socketpair sys_socketpair -361 i386 bind sys_bind sys_bind -362 i386 connect sys_connect sys_connect -363 i386 listen sys_listen sys_listen -364 i386 accept4 sys_accept4 sys_accept4 +359 i386 socket sys_socket +360 i386 socketpair sys_socketpair +361 i386 bind sys_bind +362 i386 connect sys_connect +363 i386 listen sys_listen +364 i386 accept4 sys_accept4 365 i386 getsockopt sys_getsockopt compat_sys_getsockopt 366 i386 setsockopt sys_setsockopt compat_sys_setsockopt -367 i386 getsockname sys_getsockname sys_getsockname -368 i386 getpeername sys_getpeername sys_getpeername -369 i386 sendto sys_sendto sys_sendto +367 i386 getsockname sys_getsockname +368 i386 getpeername sys_getpeername +369 i386 sendto sys_sendto 370 i386 sendmsg sys_sendmsg compat_sys_sendmsg 371 i386 recvfrom sys_recvfrom compat_sys_recvfrom 372 i386 recvmsg sys_recvmsg compat_sys_recvmsg -373 i386 shutdown sys_shutdown sys_shutdown -374 i386 userfaultfd sys_userfaultfd sys_userfaultfd -375 i386 membarrier sys_membarrier sys_membarrier -376 i386 mlock2 sys_mlock2 sys_mlock2 -377 i386 copy_file_range sys_copy_file_range sys_copy_file_range +373 i386 shutdown sys_shutdown +374 i386 userfaultfd sys_userfaultfd +375 i386 membarrier sys_membarrier +376 i386 mlock2 sys_mlock2 +377 i386 copy_file_range sys_copy_file_range 378 i386 preadv2 sys_preadv2 compat_sys_preadv2 379 i386 pwritev2 sys_pwritev2 compat_sys_pwritev2 -380 i386 pkey_mprotect sys_pkey_mprotect sys_pkey_mprotect -381 i386 pkey_alloc sys_pkey_alloc sys_pkey_alloc -382 i386 pkey_free sys_pkey_free sys_pkey_free -383 i386 statx sys_statx sys_statx +380 i386 pkey_mprotect sys_pkey_mprotect +381 i386 pkey_alloc sys_pkey_alloc +382 i386 pkey_free sys_pkey_free +383 i386 statx sys_statx 384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl 385 i386 io_pgetevents sys_io_pgetevents_time32 compat_sys_io_pgetevents -386 i386 rseq sys_rseq sys_rseq -393 i386 semget sys_semget sys_semget +386 i386 rseq sys_rseq +393 i386 semget sys_semget 394 i386 semctl sys_semctl compat_sys_semctl -395 i386 shmget sys_shmget sys_shmget +395 i386 shmget sys_shmget 396 i386 shmctl sys_shmctl compat_sys_shmctl 397 i386 shmat sys_shmat compat_sys_shmat -398 i386 shmdt sys_shmdt sys_shmdt -399 i386 msgget sys_msgget sys_msgget +398 i386 shmdt sys_shmdt +399 i386 msgget sys_msgget 400 i386 msgsnd sys_msgsnd compat_sys_msgsnd 401 i386 msgrcv sys_msgrcv compat_sys_msgrcv 402 i386 msgctl sys_msgctl compat_sys_msgctl -403 i386 clock_gettime64 sys_clock_gettime sys_clock_gettime -404 i386 clock_settime64 sys_clock_settime sys_clock_settime -405 i386 clock_adjtime64 sys_clock_adjtime sys_clock_adjtime -406 i386 clock_getres_time64 sys_clock_getres sys_clock_getres -407 i386 clock_nanosleep_time64 sys_clock_nanosleep sys_clock_nanosleep -408 i386 timer_gettime64 sys_timer_gettime sys_timer_gettime -409 i386 timer_settime64 sys_timer_settime sys_timer_settime -410 i386 timerfd_gettime64 sys_timerfd_gettime sys_timerfd_gettime -411 i386 timerfd_settime64 sys_timerfd_settime sys_timerfd_settime -412 i386 utimensat_time64 sys_utimensat sys_utimensat +403 i386 clock_gettime64 sys_clock_gettime +404 i386 clock_settime64 sys_clock_settime +405 i386 clock_adjtime64 sys_clock_adjtime +406 i386 clock_getres_time64 sys_clock_getres +407 i386 clock_nanosleep_time64 sys_clock_nanosleep +408 i386 timer_gettime64 sys_timer_gettime +409 i386 timer_settime64 sys_timer_settime +410 i386 timerfd_gettime64 sys_timerfd_gettime +411 i386 timerfd_settime64 sys_timerfd_settime +412 i386 utimensat_time64 sys_utimensat 413 i386 pselect6_time64 sys_pselect6 compat_sys_pselect6_time64 414 i386 ppoll_time64 sys_ppoll compat_sys_ppoll_time64 -416 i386 io_pgetevents_time64 sys_io_pgetevents sys_io_pgetevents +416 i386 io_pgetevents_time64 sys_io_pgetevents 417 i386 recvmmsg_time64 sys_recvmmsg compat_sys_recvmmsg_time64 -418 i386 mq_timedsend_time64 sys_mq_timedsend sys_mq_timedsend -419 i386 mq_timedreceive_time64 sys_mq_timedreceive sys_mq_timedreceive -420 i386 semtimedop_time64 sys_semtimedop sys_semtimedop +418 i386 mq_timedsend_time64 sys_mq_timedsend +419 i386 mq_timedreceive_time64 sys_mq_timedreceive +420 i386 semtimedop_time64 sys_semtimedop 421 i386 rt_sigtimedwait_time64 sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time64 -422 i386 futex_time64 sys_futex sys_futex -423 i386 sched_rr_get_interval_time64 sys_sched_rr_get_interval sys_sched_rr_get_interval -424 i386 pidfd_send_signal sys_pidfd_send_signal sys_pidfd_send_signal -425 i386 io_uring_setup sys_io_uring_setup sys_io_uring_setup -426 i386 io_uring_enter sys_io_uring_enter sys_io_uring_enter -427 i386 io_uring_register sys_io_uring_register sys_io_uring_register -428 i386 open_tree sys_open_tree sys_open_tree -429 i386 move_mount sys_move_mount sys_move_mount -430 i386 fsopen sys_fsopen sys_fsopen -431 i386 fsconfig sys_fsconfig sys_fsconfig -432 i386 fsmount sys_fsmount sys_fsmount -433 i386 fspick sys_fspick sys_fspick -434 i386 pidfd_open sys_pidfd_open sys_pidfd_open -435 i386 clone3 sys_clone3 sys_clone3 -437 i386 openat2 sys_openat2 sys_openat2 -438 i386 pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd +422 i386 futex_time64 sys_futex +423 i386 sched_rr_get_interval_time64 sys_sched_rr_get_interval +424 i386 pidfd_send_signal sys_pidfd_send_signal +425 i386 io_uring_setup sys_io_uring_setup +426 i386 io_uring_enter sys_io_uring_enter +427 i386 io_uring_register sys_io_uring_register +428 i386 open_tree sys_open_tree +429 i386 move_mount sys_move_mount +430 i386 fsopen sys_fsopen +431 i386 fsconfig sys_fsconfig +432 i386 fsmount sys_fsmount +433 i386 fspick sys_fspick +434 i386 pidfd_open sys_pidfd_open +435 i386 clone3 sys_clone3 +437 i386 openat2 sys_openat2 +438 i386 pidfd_getfd sys_pidfd_getfd -- cgit From 866128a996647097d599c6b91deb340d4da593a8 Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:40 -0400 Subject: x86/entry/32: Rename 32-bit specific syscalls Rename the syscalls that only exist for 32-bit from x86_* to ia32_* to make it clear they are for 32-bit only. Also rename the functions to match the syscall name. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Link: https://lkml.kernel.org/r/20200313195144.164260-15-brgerst@gmail.com --- arch/x86/entry/syscalls/syscall_32.tbl | 30 +++++++++++++++--------------- arch/x86/ia32/sys_ia32.c | 30 +++++++++++++++--------------- 2 files changed, 30 insertions(+), 30 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index b0712cdae7d8..0b8e24cb981d 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -101,7 +101,7 @@ 87 i386 swapon sys_swapon 88 i386 reboot sys_reboot 89 i386 readdir sys_old_readdir compat_sys_old_readdir -90 i386 mmap sys_old_mmap compat_sys_x86_mmap +90 i386 mmap sys_old_mmap compat_sys_ia32_mmap 91 i386 munmap sys_munmap 92 i386 truncate sys_truncate compat_sys_truncate 93 i386 ftruncate sys_ftruncate compat_sys_ftruncate @@ -131,7 +131,7 @@ 117 i386 ipc sys_ipc compat_sys_ipc 118 i386 fsync sys_fsync 119 i386 sigreturn sys_sigreturn compat_sys_sigreturn -120 i386 clone sys_clone compat_sys_x86_clone +120 i386 clone sys_clone compat_sys_ia32_clone 121 i386 setdomainname sys_setdomainname 122 i386 uname sys_newuname 123 i386 modify_ldt sys_modify_ldt @@ -191,8 +191,8 @@ 177 i386 rt_sigtimedwait sys_rt_sigtimedwait_time32 compat_sys_rt_sigtimedwait_time32 178 i386 rt_sigqueueinfo sys_rt_sigqueueinfo compat_sys_rt_sigqueueinfo 179 i386 rt_sigsuspend sys_rt_sigsuspend compat_sys_rt_sigsuspend -180 i386 pread64 sys_pread64 compat_sys_x86_pread -181 i386 pwrite64 sys_pwrite64 compat_sys_x86_pwrite +180 i386 pread64 sys_pread64 compat_sys_ia32_pread64 +181 i386 pwrite64 sys_pwrite64 compat_sys_ia32_pwrite64 182 i386 chown sys_chown16 183 i386 getcwd sys_getcwd 184 i386 capget sys_capget @@ -204,11 +204,11 @@ 190 i386 vfork sys_vfork 191 i386 ugetrlimit sys_getrlimit compat_sys_getrlimit 192 i386 mmap2 sys_mmap_pgoff -193 i386 truncate64 sys_truncate64 compat_sys_x86_truncate64 -194 i386 ftruncate64 sys_ftruncate64 compat_sys_x86_ftruncate64 -195 i386 stat64 sys_stat64 compat_sys_x86_stat64 -196 i386 lstat64 sys_lstat64 compat_sys_x86_lstat64 -197 i386 fstat64 sys_fstat64 compat_sys_x86_fstat64 +193 i386 truncate64 sys_truncate64 compat_sys_ia32_truncate64 +194 i386 ftruncate64 sys_ftruncate64 compat_sys_ia32_ftruncate64 +195 i386 stat64 sys_stat64 compat_sys_ia32_stat64 +196 i386 lstat64 sys_lstat64 compat_sys_ia32_lstat64 +197 i386 fstat64 sys_fstat64 compat_sys_ia32_fstat64 198 i386 lchown32 sys_lchown 199 i386 getuid32 sys_getuid 200 i386 getgid32 sys_getgid @@ -236,7 +236,7 @@ # 222 is unused # 223 is unused 224 i386 gettid sys_gettid -225 i386 readahead sys_readahead compat_sys_x86_readahead +225 i386 readahead sys_readahead compat_sys_ia32_readahead 226 i386 setxattr sys_setxattr 227 i386 lsetxattr sys_lsetxattr 228 i386 fsetxattr sys_fsetxattr @@ -261,7 +261,7 @@ 247 i386 io_getevents sys_io_getevents_time32 248 i386 io_submit sys_io_submit compat_sys_io_submit 249 i386 io_cancel sys_io_cancel -250 i386 fadvise64 sys_fadvise64 compat_sys_x86_fadvise64 +250 i386 fadvise64 sys_fadvise64 compat_sys_ia32_fadvise64 # 251 is available for reuse (was briefly sys_set_zone_reclaim) 252 i386 exit_group sys_exit_group 253 i386 lookup_dcookie sys_lookup_dcookie compat_sys_lookup_dcookie @@ -283,7 +283,7 @@ 269 i386 fstatfs64 sys_fstatfs64 compat_sys_fstatfs64 270 i386 tgkill sys_tgkill 271 i386 utimes sys_utimes_time32 -272 i386 fadvise64_64 sys_fadvise64_64 compat_sys_x86_fadvise64_64 +272 i386 fadvise64_64 sys_fadvise64_64 compat_sys_ia32_fadvise64_64 273 i386 vserver 274 i386 mbind sys_mbind 275 i386 get_mempolicy sys_get_mempolicy compat_sys_get_mempolicy @@ -311,7 +311,7 @@ 297 i386 mknodat sys_mknodat 298 i386 fchownat sys_fchownat 299 i386 futimesat sys_futimesat_time32 -300 i386 fstatat64 sys_fstatat64 compat_sys_x86_fstatat +300 i386 fstatat64 sys_fstatat64 compat_sys_ia32_fstatat64 301 i386 unlinkat sys_unlinkat 302 i386 renameat sys_renameat 303 i386 linkat sys_linkat @@ -325,7 +325,7 @@ 311 i386 set_robust_list sys_set_robust_list compat_sys_set_robust_list 312 i386 get_robust_list sys_get_robust_list compat_sys_get_robust_list 313 i386 splice sys_splice -314 i386 sync_file_range sys_sync_file_range compat_sys_x86_sync_file_range +314 i386 sync_file_range sys_sync_file_range compat_sys_ia32_sync_file_range 315 i386 tee sys_tee 316 i386 vmsplice sys_vmsplice compat_sys_vmsplice 317 i386 move_pages sys_move_pages compat_sys_move_pages @@ -335,7 +335,7 @@ 321 i386 signalfd sys_signalfd compat_sys_signalfd 322 i386 timerfd_create sys_timerfd_create 323 i386 eventfd sys_eventfd -324 i386 fallocate sys_fallocate compat_sys_x86_fallocate +324 i386 fallocate sys_fallocate compat_sys_ia32_fallocate 325 i386 timerfd_settime sys_timerfd_settime32 326 i386 timerfd_gettime sys_timerfd_gettime32 327 i386 signalfd4 sys_signalfd4 compat_sys_signalfd4 diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c index 21790307121e..a189dc6f98dc 100644 --- a/arch/x86/ia32/sys_ia32.c +++ b/arch/x86/ia32/sys_ia32.c @@ -52,14 +52,14 @@ #define AA(__x) ((unsigned long)(__x)) -COMPAT_SYSCALL_DEFINE3(x86_truncate64, const char __user *, filename, +COMPAT_SYSCALL_DEFINE3(ia32_truncate64, const char __user *, filename, unsigned long, offset_low, unsigned long, offset_high) { return ksys_truncate(filename, ((loff_t) offset_high << 32) | offset_low); } -COMPAT_SYSCALL_DEFINE3(x86_ftruncate64, unsigned int, fd, +COMPAT_SYSCALL_DEFINE3(ia32_ftruncate64, unsigned int, fd, unsigned long, offset_low, unsigned long, offset_high) { return ksys_ftruncate(fd, ((loff_t) offset_high << 32) | offset_low); @@ -97,7 +97,7 @@ static int cp_stat64(struct stat64 __user *ubuf, struct kstat *stat) return 0; } -COMPAT_SYSCALL_DEFINE2(x86_stat64, const char __user *, filename, +COMPAT_SYSCALL_DEFINE2(ia32_stat64, const char __user *, filename, struct stat64 __user *, statbuf) { struct kstat stat; @@ -108,7 +108,7 @@ COMPAT_SYSCALL_DEFINE2(x86_stat64, const char __user *, filename, return ret; } -COMPAT_SYSCALL_DEFINE2(x86_lstat64, const char __user *, filename, +COMPAT_SYSCALL_DEFINE2(ia32_lstat64, const char __user *, filename, struct stat64 __user *, statbuf) { struct kstat stat; @@ -118,7 +118,7 @@ COMPAT_SYSCALL_DEFINE2(x86_lstat64, const char __user *, filename, return ret; } -COMPAT_SYSCALL_DEFINE2(x86_fstat64, unsigned int, fd, +COMPAT_SYSCALL_DEFINE2(ia32_fstat64, unsigned int, fd, struct stat64 __user *, statbuf) { struct kstat stat; @@ -128,7 +128,7 @@ COMPAT_SYSCALL_DEFINE2(x86_fstat64, unsigned int, fd, return ret; } -COMPAT_SYSCALL_DEFINE4(x86_fstatat, unsigned int, dfd, +COMPAT_SYSCALL_DEFINE4(ia32_fstatat64, unsigned int, dfd, const char __user *, filename, struct stat64 __user *, statbuf, int, flag) { @@ -156,7 +156,7 @@ struct mmap_arg_struct32 { unsigned int offset; }; -COMPAT_SYSCALL_DEFINE1(x86_mmap, struct mmap_arg_struct32 __user *, arg) +COMPAT_SYSCALL_DEFINE1(ia32_mmap, struct mmap_arg_struct32 __user *, arg) { struct mmap_arg_struct32 a; @@ -171,14 +171,14 @@ COMPAT_SYSCALL_DEFINE1(x86_mmap, struct mmap_arg_struct32 __user *, arg) } /* warning: next two assume little endian */ -COMPAT_SYSCALL_DEFINE5(x86_pread, unsigned int, fd, char __user *, ubuf, +COMPAT_SYSCALL_DEFINE5(ia32_pread64, unsigned int, fd, char __user *, ubuf, u32, count, u32, poslo, u32, poshi) { return ksys_pread64(fd, ubuf, count, ((loff_t)AA(poshi) << 32) | AA(poslo)); } -COMPAT_SYSCALL_DEFINE5(x86_pwrite, unsigned int, fd, const char __user *, ubuf, +COMPAT_SYSCALL_DEFINE5(ia32_pwrite64, unsigned int, fd, const char __user *, ubuf, u32, count, u32, poslo, u32, poshi) { return ksys_pwrite64(fd, ubuf, count, @@ -190,7 +190,7 @@ COMPAT_SYSCALL_DEFINE5(x86_pwrite, unsigned int, fd, const char __user *, ubuf, * Some system calls that need sign extended arguments. This could be * done by a generic wrapper. */ -COMPAT_SYSCALL_DEFINE6(x86_fadvise64_64, int, fd, __u32, offset_low, +COMPAT_SYSCALL_DEFINE6(ia32_fadvise64_64, int, fd, __u32, offset_low, __u32, offset_high, __u32, len_low, __u32, len_high, int, advice) { @@ -200,13 +200,13 @@ COMPAT_SYSCALL_DEFINE6(x86_fadvise64_64, int, fd, __u32, offset_low, advice); } -COMPAT_SYSCALL_DEFINE4(x86_readahead, int, fd, unsigned int, off_lo, +COMPAT_SYSCALL_DEFINE4(ia32_readahead, int, fd, unsigned int, off_lo, unsigned int, off_hi, size_t, count) { return ksys_readahead(fd, ((u64)off_hi << 32) | off_lo, count); } -COMPAT_SYSCALL_DEFINE6(x86_sync_file_range, int, fd, unsigned int, off_low, +COMPAT_SYSCALL_DEFINE6(ia32_sync_file_range, int, fd, unsigned int, off_low, unsigned int, off_hi, unsigned int, n_low, unsigned int, n_hi, int, flags) { @@ -215,14 +215,14 @@ COMPAT_SYSCALL_DEFINE6(x86_sync_file_range, int, fd, unsigned int, off_low, ((u64)n_hi << 32) | n_low, flags); } -COMPAT_SYSCALL_DEFINE5(x86_fadvise64, int, fd, unsigned int, offset_lo, +COMPAT_SYSCALL_DEFINE5(ia32_fadvise64, int, fd, unsigned int, offset_lo, unsigned int, offset_hi, size_t, len, int, advice) { return ksys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo, len, advice); } -COMPAT_SYSCALL_DEFINE6(x86_fallocate, int, fd, int, mode, +COMPAT_SYSCALL_DEFINE6(ia32_fallocate, int, fd, int, mode, unsigned int, offset_lo, unsigned int, offset_hi, unsigned int, len_lo, unsigned int, len_hi) { @@ -233,7 +233,7 @@ COMPAT_SYSCALL_DEFINE6(x86_fallocate, int, fd, int, mode, /* * The 32-bit clone ABI is CONFIG_CLONE_BACKWARDS */ -COMPAT_SYSCALL_DEFINE5(x86_clone, unsigned long, clone_flags, +COMPAT_SYSCALL_DEFINE5(ia32_clone, unsigned long, clone_flags, unsigned long, newsp, int __user *, parent_tidptr, unsigned long, tls_val, int __user *, child_tidptr) { -- cgit From 121b32a58a3af89a780cf194ce3769fc4120e574 Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:41 -0400 Subject: x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments For the 32-bit syscall interface, 64-bit arguments (loff_t) are passed via a pair of 32-bit registers. These register pairs end up in consecutive stack slots, which matches the C ABI for 64-bit arguments. But when accessing the registers directly from pt_regs, the wrapper needs to manually reassemble the 64-bit value. These wrappers already exist for 32-bit compat, so make them available to 32-bit native in preparation for enabling pt_regs-based syscalls. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Link: https://lkml.kernel.org/r/20200313195144.164260-16-brgerst@gmail.com --- arch/x86/entry/syscalls/syscall_32.tbl | 18 +-- arch/x86/ia32/Makefile | 2 +- arch/x86/ia32/sys_ia32.c | 254 -------------------------------- arch/x86/kernel/Makefile | 2 + arch/x86/kernel/sys_ia32.c | 255 +++++++++++++++++++++++++++++++++ arch/x86/um/Makefile | 1 + 6 files changed, 268 insertions(+), 264 deletions(-) delete mode 100644 arch/x86/ia32/sys_ia32.c create mode 100644 arch/x86/kernel/sys_ia32.c (limited to 'arch/x86') diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 0b8e24cb981d..54581ac671b4 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -191,8 +191,8 @@ 177 i386 rt_sigtimedwait sys_rt_sigtimedwait_time32 compat_sys_rt_sigtimedwait_time32 178 i386 rt_sigqueueinfo sys_rt_sigqueueinfo compat_sys_rt_sigqueueinfo 179 i386 rt_sigsuspend sys_rt_sigsuspend compat_sys_rt_sigsuspend -180 i386 pread64 sys_pread64 compat_sys_ia32_pread64 -181 i386 pwrite64 sys_pwrite64 compat_sys_ia32_pwrite64 +180 i386 pread64 sys_ia32_pread64 +181 i386 pwrite64 sys_ia32_pwrite64 182 i386 chown sys_chown16 183 i386 getcwd sys_getcwd 184 i386 capget sys_capget @@ -204,8 +204,8 @@ 190 i386 vfork sys_vfork 191 i386 ugetrlimit sys_getrlimit compat_sys_getrlimit 192 i386 mmap2 sys_mmap_pgoff -193 i386 truncate64 sys_truncate64 compat_sys_ia32_truncate64 -194 i386 ftruncate64 sys_ftruncate64 compat_sys_ia32_ftruncate64 +193 i386 truncate64 sys_ia32_truncate64 +194 i386 ftruncate64 sys_ia32_ftruncate64 195 i386 stat64 sys_stat64 compat_sys_ia32_stat64 196 i386 lstat64 sys_lstat64 compat_sys_ia32_lstat64 197 i386 fstat64 sys_fstat64 compat_sys_ia32_fstat64 @@ -236,7 +236,7 @@ # 222 is unused # 223 is unused 224 i386 gettid sys_gettid -225 i386 readahead sys_readahead compat_sys_ia32_readahead +225 i386 readahead sys_ia32_readahead 226 i386 setxattr sys_setxattr 227 i386 lsetxattr sys_lsetxattr 228 i386 fsetxattr sys_fsetxattr @@ -261,7 +261,7 @@ 247 i386 io_getevents sys_io_getevents_time32 248 i386 io_submit sys_io_submit compat_sys_io_submit 249 i386 io_cancel sys_io_cancel -250 i386 fadvise64 sys_fadvise64 compat_sys_ia32_fadvise64 +250 i386 fadvise64 sys_ia32_fadvise64 # 251 is available for reuse (was briefly sys_set_zone_reclaim) 252 i386 exit_group sys_exit_group 253 i386 lookup_dcookie sys_lookup_dcookie compat_sys_lookup_dcookie @@ -283,7 +283,7 @@ 269 i386 fstatfs64 sys_fstatfs64 compat_sys_fstatfs64 270 i386 tgkill sys_tgkill 271 i386 utimes sys_utimes_time32 -272 i386 fadvise64_64 sys_fadvise64_64 compat_sys_ia32_fadvise64_64 +272 i386 fadvise64_64 sys_ia32_fadvise64_64 273 i386 vserver 274 i386 mbind sys_mbind 275 i386 get_mempolicy sys_get_mempolicy compat_sys_get_mempolicy @@ -325,7 +325,7 @@ 311 i386 set_robust_list sys_set_robust_list compat_sys_set_robust_list 312 i386 get_robust_list sys_get_robust_list compat_sys_get_robust_list 313 i386 splice sys_splice -314 i386 sync_file_range sys_sync_file_range compat_sys_ia32_sync_file_range +314 i386 sync_file_range sys_ia32_sync_file_range 315 i386 tee sys_tee 316 i386 vmsplice sys_vmsplice compat_sys_vmsplice 317 i386 move_pages sys_move_pages compat_sys_move_pages @@ -335,7 +335,7 @@ 321 i386 signalfd sys_signalfd compat_sys_signalfd 322 i386 timerfd_create sys_timerfd_create 323 i386 eventfd sys_eventfd -324 i386 fallocate sys_fallocate compat_sys_ia32_fallocate +324 i386 fallocate sys_ia32_fallocate 325 i386 timerfd_settime sys_timerfd_settime32 326 i386 timerfd_gettime sys_timerfd_gettime32 327 i386 signalfd4 sys_signalfd4 compat_sys_signalfd4 diff --git a/arch/x86/ia32/Makefile b/arch/x86/ia32/Makefile index d13b352b2aa7..8e4d0391ff6c 100644 --- a/arch/x86/ia32/Makefile +++ b/arch/x86/ia32/Makefile @@ -3,7 +3,7 @@ # Makefile for the ia32 kernel emulation subsystem. # -obj-$(CONFIG_IA32_EMULATION) := sys_ia32.o ia32_signal.o +obj-$(CONFIG_IA32_EMULATION) := ia32_signal.o obj-$(CONFIG_IA32_AOUT) += ia32_aout.o diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c deleted file mode 100644 index a189dc6f98dc..000000000000 --- a/arch/x86/ia32/sys_ia32.c +++ /dev/null @@ -1,254 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * sys_ia32.c: Conversion between 32bit and 64bit native syscalls. Based on - * sys_sparc32 - * - * Copyright (C) 2000 VA Linux Co - * Copyright (C) 2000 Don Dugger - * Copyright (C) 1999 Arun Sharma - * Copyright (C) 1997,1998 Jakub Jelinek (jj@sunsite.mff.cuni.cz) - * Copyright (C) 1997 David S. Miller (davem@caip.rutgers.edu) - * Copyright (C) 2000 Hewlett-Packard Co. - * Copyright (C) 2000 David Mosberger-Tang - * Copyright (C) 2000,2001,2002 Andi Kleen, SuSE Labs (x86-64 port) - * - * These routines maintain argument size conversion between 32bit and 64bit - * environment. In 2.5 most of this should be moved to a generic directory. - * - * This file assumes that there is a hole at the end of user address space. - * - * Some of the functions are LE specific currently. These are - * hopefully all marked. This should be fixed. - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#define AA(__x) ((unsigned long)(__x)) - - -COMPAT_SYSCALL_DEFINE3(ia32_truncate64, const char __user *, filename, - unsigned long, offset_low, unsigned long, offset_high) -{ - return ksys_truncate(filename, - ((loff_t) offset_high << 32) | offset_low); -} - -COMPAT_SYSCALL_DEFINE3(ia32_ftruncate64, unsigned int, fd, - unsigned long, offset_low, unsigned long, offset_high) -{ - return ksys_ftruncate(fd, ((loff_t) offset_high << 32) | offset_low); -} - -/* - * Another set for IA32/LFS -- x86_64 struct stat is different due to - * support for 64bit inode numbers. - */ -static int cp_stat64(struct stat64 __user *ubuf, struct kstat *stat) -{ - typeof(ubuf->st_uid) uid = 0; - typeof(ubuf->st_gid) gid = 0; - SET_UID(uid, from_kuid_munged(current_user_ns(), stat->uid)); - SET_GID(gid, from_kgid_munged(current_user_ns(), stat->gid)); - if (!access_ok(ubuf, sizeof(struct stat64)) || - __put_user(huge_encode_dev(stat->dev), &ubuf->st_dev) || - __put_user(stat->ino, &ubuf->__st_ino) || - __put_user(stat->ino, &ubuf->st_ino) || - __put_user(stat->mode, &ubuf->st_mode) || - __put_user(stat->nlink, &ubuf->st_nlink) || - __put_user(uid, &ubuf->st_uid) || - __put_user(gid, &ubuf->st_gid) || - __put_user(huge_encode_dev(stat->rdev), &ubuf->st_rdev) || - __put_user(stat->size, &ubuf->st_size) || - __put_user(stat->atime.tv_sec, &ubuf->st_atime) || - __put_user(stat->atime.tv_nsec, &ubuf->st_atime_nsec) || - __put_user(stat->mtime.tv_sec, &ubuf->st_mtime) || - __put_user(stat->mtime.tv_nsec, &ubuf->st_mtime_nsec) || - __put_user(stat->ctime.tv_sec, &ubuf->st_ctime) || - __put_user(stat->ctime.tv_nsec, &ubuf->st_ctime_nsec) || - __put_user(stat->blksize, &ubuf->st_blksize) || - __put_user(stat->blocks, &ubuf->st_blocks)) - return -EFAULT; - return 0; -} - -COMPAT_SYSCALL_DEFINE2(ia32_stat64, const char __user *, filename, - struct stat64 __user *, statbuf) -{ - struct kstat stat; - int ret = vfs_stat(filename, &stat); - - if (!ret) - ret = cp_stat64(statbuf, &stat); - return ret; -} - -COMPAT_SYSCALL_DEFINE2(ia32_lstat64, const char __user *, filename, - struct stat64 __user *, statbuf) -{ - struct kstat stat; - int ret = vfs_lstat(filename, &stat); - if (!ret) - ret = cp_stat64(statbuf, &stat); - return ret; -} - -COMPAT_SYSCALL_DEFINE2(ia32_fstat64, unsigned int, fd, - struct stat64 __user *, statbuf) -{ - struct kstat stat; - int ret = vfs_fstat(fd, &stat); - if (!ret) - ret = cp_stat64(statbuf, &stat); - return ret; -} - -COMPAT_SYSCALL_DEFINE4(ia32_fstatat64, unsigned int, dfd, - const char __user *, filename, - struct stat64 __user *, statbuf, int, flag) -{ - struct kstat stat; - int error; - - error = vfs_fstatat(dfd, filename, &stat, flag); - if (error) - return error; - return cp_stat64(statbuf, &stat); -} - -/* - * Linux/i386 didn't use to be able to handle more than - * 4 system call parameters, so these system calls used a memory - * block for parameter passing.. - */ - -struct mmap_arg_struct32 { - unsigned int addr; - unsigned int len; - unsigned int prot; - unsigned int flags; - unsigned int fd; - unsigned int offset; -}; - -COMPAT_SYSCALL_DEFINE1(ia32_mmap, struct mmap_arg_struct32 __user *, arg) -{ - struct mmap_arg_struct32 a; - - if (copy_from_user(&a, arg, sizeof(a))) - return -EFAULT; - - if (a.offset & ~PAGE_MASK) - return -EINVAL; - - return ksys_mmap_pgoff(a.addr, a.len, a.prot, a.flags, a.fd, - a.offset>>PAGE_SHIFT); -} - -/* warning: next two assume little endian */ -COMPAT_SYSCALL_DEFINE5(ia32_pread64, unsigned int, fd, char __user *, ubuf, - u32, count, u32, poslo, u32, poshi) -{ - return ksys_pread64(fd, ubuf, count, - ((loff_t)AA(poshi) << 32) | AA(poslo)); -} - -COMPAT_SYSCALL_DEFINE5(ia32_pwrite64, unsigned int, fd, const char __user *, ubuf, - u32, count, u32, poslo, u32, poshi) -{ - return ksys_pwrite64(fd, ubuf, count, - ((loff_t)AA(poshi) << 32) | AA(poslo)); -} - - -/* - * Some system calls that need sign extended arguments. This could be - * done by a generic wrapper. - */ -COMPAT_SYSCALL_DEFINE6(ia32_fadvise64_64, int, fd, __u32, offset_low, - __u32, offset_high, __u32, len_low, __u32, len_high, - int, advice) -{ - return ksys_fadvise64_64(fd, - (((u64)offset_high)<<32) | offset_low, - (((u64)len_high)<<32) | len_low, - advice); -} - -COMPAT_SYSCALL_DEFINE4(ia32_readahead, int, fd, unsigned int, off_lo, - unsigned int, off_hi, size_t, count) -{ - return ksys_readahead(fd, ((u64)off_hi << 32) | off_lo, count); -} - -COMPAT_SYSCALL_DEFINE6(ia32_sync_file_range, int, fd, unsigned int, off_low, - unsigned int, off_hi, unsigned int, n_low, - unsigned int, n_hi, int, flags) -{ - return ksys_sync_file_range(fd, - ((u64)off_hi << 32) | off_low, - ((u64)n_hi << 32) | n_low, flags); -} - -COMPAT_SYSCALL_DEFINE5(ia32_fadvise64, int, fd, unsigned int, offset_lo, - unsigned int, offset_hi, size_t, len, int, advice) -{ - return ksys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo, - len, advice); -} - -COMPAT_SYSCALL_DEFINE6(ia32_fallocate, int, fd, int, mode, - unsigned int, offset_lo, unsigned int, offset_hi, - unsigned int, len_lo, unsigned int, len_hi) -{ - return ksys_fallocate(fd, mode, ((u64)offset_hi << 32) | offset_lo, - ((u64)len_hi << 32) | len_lo); -} - -/* - * The 32-bit clone ABI is CONFIG_CLONE_BACKWARDS - */ -COMPAT_SYSCALL_DEFINE5(ia32_clone, unsigned long, clone_flags, - unsigned long, newsp, int __user *, parent_tidptr, - unsigned long, tls_val, int __user *, child_tidptr) -{ - struct kernel_clone_args args = { - .flags = (clone_flags & ~CSIGNAL), - .pidfd = parent_tidptr, - .child_tid = child_tidptr, - .parent_tid = parent_tidptr, - .exit_signal = (clone_flags & CSIGNAL), - .stack = newsp, - .tls = tls_val, - }; - - if (!legacy_clone_args_valid(&args)) - return -EINVAL; - - return _do_fork(&args); -} diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 9b294c13809a..b8f89f78b8cd 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -53,6 +53,8 @@ obj-y += setup.o x86_init.o i8259.o irqinit.o obj-$(CONFIG_JUMP_LABEL) += jump_label.o obj-$(CONFIG_IRQ_WORK) += irq_work.o obj-y += probe_roms.o +obj-$(CONFIG_X86_32) += sys_ia32.o +obj-$(CONFIG_IA32_EMULATION) += sys_ia32.o obj-$(CONFIG_X86_64) += sys_x86_64.o obj-$(CONFIG_X86_ESPFIX64) += espfix_64.o obj-$(CONFIG_SYSFS) += ksysfs.o diff --git a/arch/x86/kernel/sys_ia32.c b/arch/x86/kernel/sys_ia32.c new file mode 100644 index 000000000000..ab03fede1422 --- /dev/null +++ b/arch/x86/kernel/sys_ia32.c @@ -0,0 +1,255 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * sys_ia32.c: Conversion between 32bit and 64bit native syscalls. Based on + * sys_sparc32 + * + * Copyright (C) 2000 VA Linux Co + * Copyright (C) 2000 Don Dugger + * Copyright (C) 1999 Arun Sharma + * Copyright (C) 1997,1998 Jakub Jelinek (jj@sunsite.mff.cuni.cz) + * Copyright (C) 1997 David S. Miller (davem@caip.rutgers.edu) + * Copyright (C) 2000 Hewlett-Packard Co. + * Copyright (C) 2000 David Mosberger-Tang + * Copyright (C) 2000,2001,2002 Andi Kleen, SuSE Labs (x86-64 port) + * + * These routines maintain argument size conversion between 32bit and 64bit + * environment. In 2.5 most of this should be moved to a generic directory. + * + * This file assumes that there is a hole at the end of user address space. + * + * Some of the functions are LE specific currently. These are + * hopefully all marked. This should be fixed. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define AA(__x) ((unsigned long)(__x)) + +SYSCALL_DEFINE3(ia32_truncate64, const char __user *, filename, + unsigned long, offset_low, unsigned long, offset_high) +{ + return ksys_truncate(filename, + ((loff_t) offset_high << 32) | offset_low); +} + +SYSCALL_DEFINE3(ia32_ftruncate64, unsigned int, fd, + unsigned long, offset_low, unsigned long, offset_high) +{ + return ksys_ftruncate(fd, ((loff_t) offset_high << 32) | offset_low); +} + +/* warning: next two assume little endian */ +SYSCALL_DEFINE5(ia32_pread64, unsigned int, fd, char __user *, ubuf, + u32, count, u32, poslo, u32, poshi) +{ + return ksys_pread64(fd, ubuf, count, + ((loff_t)AA(poshi) << 32) | AA(poslo)); +} + +SYSCALL_DEFINE5(ia32_pwrite64, unsigned int, fd, const char __user *, ubuf, + u32, count, u32, poslo, u32, poshi) +{ + return ksys_pwrite64(fd, ubuf, count, + ((loff_t)AA(poshi) << 32) | AA(poslo)); +} + + +/* + * Some system calls that need sign extended arguments. This could be + * done by a generic wrapper. + */ +SYSCALL_DEFINE6(ia32_fadvise64_64, int, fd, __u32, offset_low, + __u32, offset_high, __u32, len_low, __u32, len_high, + int, advice) +{ + return ksys_fadvise64_64(fd, + (((u64)offset_high)<<32) | offset_low, + (((u64)len_high)<<32) | len_low, + advice); +} + +SYSCALL_DEFINE4(ia32_readahead, int, fd, unsigned int, off_lo, + unsigned int, off_hi, size_t, count) +{ + return ksys_readahead(fd, ((u64)off_hi << 32) | off_lo, count); +} + +SYSCALL_DEFINE6(ia32_sync_file_range, int, fd, unsigned int, off_low, + unsigned int, off_hi, unsigned int, n_low, + unsigned int, n_hi, int, flags) +{ + return ksys_sync_file_range(fd, + ((u64)off_hi << 32) | off_low, + ((u64)n_hi << 32) | n_low, flags); +} + +SYSCALL_DEFINE5(ia32_fadvise64, int, fd, unsigned int, offset_lo, + unsigned int, offset_hi, size_t, len, int, advice) +{ + return ksys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo, + len, advice); +} + +SYSCALL_DEFINE6(ia32_fallocate, int, fd, int, mode, + unsigned int, offset_lo, unsigned int, offset_hi, + unsigned int, len_lo, unsigned int, len_hi) +{ + return ksys_fallocate(fd, mode, ((u64)offset_hi << 32) | offset_lo, + ((u64)len_hi << 32) | len_lo); +} + +#ifdef CONFIG_IA32_EMULATION +/* + * Another set for IA32/LFS -- x86_64 struct stat is different due to + * support for 64bit inode numbers. + */ +static int cp_stat64(struct stat64 __user *ubuf, struct kstat *stat) +{ + typeof(ubuf->st_uid) uid = 0; + typeof(ubuf->st_gid) gid = 0; + SET_UID(uid, from_kuid_munged(current_user_ns(), stat->uid)); + SET_GID(gid, from_kgid_munged(current_user_ns(), stat->gid)); + if (!access_ok(ubuf, sizeof(struct stat64)) || + __put_user(huge_encode_dev(stat->dev), &ubuf->st_dev) || + __put_user(stat->ino, &ubuf->__st_ino) || + __put_user(stat->ino, &ubuf->st_ino) || + __put_user(stat->mode, &ubuf->st_mode) || + __put_user(stat->nlink, &ubuf->st_nlink) || + __put_user(uid, &ubuf->st_uid) || + __put_user(gid, &ubuf->st_gid) || + __put_user(huge_encode_dev(stat->rdev), &ubuf->st_rdev) || + __put_user(stat->size, &ubuf->st_size) || + __put_user(stat->atime.tv_sec, &ubuf->st_atime) || + __put_user(stat->atime.tv_nsec, &ubuf->st_atime_nsec) || + __put_user(stat->mtime.tv_sec, &ubuf->st_mtime) || + __put_user(stat->mtime.tv_nsec, &ubuf->st_mtime_nsec) || + __put_user(stat->ctime.tv_sec, &ubuf->st_ctime) || + __put_user(stat->ctime.tv_nsec, &ubuf->st_ctime_nsec) || + __put_user(stat->blksize, &ubuf->st_blksize) || + __put_user(stat->blocks, &ubuf->st_blocks)) + return -EFAULT; + return 0; +} + +COMPAT_SYSCALL_DEFINE2(ia32_stat64, const char __user *, filename, + struct stat64 __user *, statbuf) +{ + struct kstat stat; + int ret = vfs_stat(filename, &stat); + + if (!ret) + ret = cp_stat64(statbuf, &stat); + return ret; +} + +COMPAT_SYSCALL_DEFINE2(ia32_lstat64, const char __user *, filename, + struct stat64 __user *, statbuf) +{ + struct kstat stat; + int ret = vfs_lstat(filename, &stat); + if (!ret) + ret = cp_stat64(statbuf, &stat); + return ret; +} + +COMPAT_SYSCALL_DEFINE2(ia32_fstat64, unsigned int, fd, + struct stat64 __user *, statbuf) +{ + struct kstat stat; + int ret = vfs_fstat(fd, &stat); + if (!ret) + ret = cp_stat64(statbuf, &stat); + return ret; +} + +COMPAT_SYSCALL_DEFINE4(ia32_fstatat64, unsigned int, dfd, + const char __user *, filename, + struct stat64 __user *, statbuf, int, flag) +{ + struct kstat stat; + int error; + + error = vfs_fstatat(dfd, filename, &stat, flag); + if (error) + return error; + return cp_stat64(statbuf, &stat); +} + +/* + * Linux/i386 didn't use to be able to handle more than + * 4 system call parameters, so these system calls used a memory + * block for parameter passing.. + */ + +struct mmap_arg_struct32 { + unsigned int addr; + unsigned int len; + unsigned int prot; + unsigned int flags; + unsigned int fd; + unsigned int offset; +}; + +COMPAT_SYSCALL_DEFINE1(ia32_mmap, struct mmap_arg_struct32 __user *, arg) +{ + struct mmap_arg_struct32 a; + + if (copy_from_user(&a, arg, sizeof(a))) + return -EFAULT; + + if (a.offset & ~PAGE_MASK) + return -EINVAL; + + return ksys_mmap_pgoff(a.addr, a.len, a.prot, a.flags, a.fd, + a.offset>>PAGE_SHIFT); +} + +/* + * The 32-bit clone ABI is CONFIG_CLONE_BACKWARDS + */ +COMPAT_SYSCALL_DEFINE5(ia32_clone, unsigned long, clone_flags, + unsigned long, newsp, int __user *, parent_tidptr, + unsigned long, tls_val, int __user *, child_tidptr) +{ + struct kernel_clone_args args = { + .flags = (clone_flags & ~CSIGNAL), + .pidfd = parent_tidptr, + .child_tid = child_tidptr, + .parent_tid = parent_tidptr, + .exit_signal = (clone_flags & CSIGNAL), + .stack = newsp, + .tls = tls_val, + }; + + if (!legacy_clone_args_valid(&args)) + return -EINVAL; + + return _do_fork(&args); +} +#endif /* CONFIG_IA32_EMULATION */ diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile index 33c51c064c77..77f70b969d14 100644 --- a/arch/x86/um/Makefile +++ b/arch/x86/um/Makefile @@ -21,6 +21,7 @@ obj-y += checksum_32.o syscalls_32.o obj-$(CONFIG_ELF_CORE) += elfcore.o subarch-y = ../lib/string_32.o ../lib/atomic64_32.o ../lib/atomic64_cx8_32.o +subarch-y += ../kernel/sys_ia32.o else -- cgit From 25c619e59b395a8c970d339f9c714302738e350e Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:42 -0400 Subject: x86/entry/32: Enable pt_regs based syscalls Enable pt_regs based syscalls for 32-bit. This makes the 32-bit native kernel consistent with the 64-bit kernel, and improves the syscall interface by not needing to push all 6 potential arguments onto the stack. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Link: https://lkml.kernel.org/r/20200313195144.164260-17-brgerst@gmail.com --- arch/x86/Kconfig | 2 +- arch/x86/entry/common.c | 15 ----- arch/x86/entry/syscall_32.c | 15 +---- arch/x86/include/asm/syscall.h | 6 -- arch/x86/include/asm/syscall_wrapper.h | 111 ++++++++++++++++++--------------- arch/x86/include/asm/syscalls.h | 29 --------- 6 files changed, 64 insertions(+), 114 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index beea77046f9b..9df07b2c82bd 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -30,7 +30,6 @@ config X86_64 select MODULES_USE_ELF_RELA select NEED_DMA_MAP_STATE select SWIOTLB - select ARCH_HAS_SYSCALL_WRAPPER config FORCE_DYNAMIC_FTRACE def_bool y @@ -80,6 +79,7 @@ config X86 select ARCH_HAS_STRICT_KERNEL_RWX select ARCH_HAS_STRICT_MODULE_RWX select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE + select ARCH_HAS_SYSCALL_WRAPPER select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 149bf54501ff..6062e8ebc315 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -333,20 +333,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs) if (likely(nr < IA32_NR_syscalls)) { nr = array_index_nospec(nr, IA32_NR_syscalls); -#ifdef CONFIG_IA32_EMULATION regs->ax = ia32_sys_call_table[nr](regs); -#else - /* - * It's possible that a 32-bit syscall implementation - * takes a 64-bit parameter but nonetheless assumes that - * the high bits are zero. Make sure we zero-extend all - * of the args. - */ - regs->ax = ia32_sys_call_table[nr]( - (unsigned int)regs->bx, (unsigned int)regs->cx, - (unsigned int)regs->dx, (unsigned int)regs->si, - (unsigned int)regs->di, (unsigned int)regs->bp); -#endif /* CONFIG_IA32_EMULATION */ } syscall_return_slowpath(regs); @@ -439,9 +426,7 @@ __visible long do_fast_syscall_32(struct pt_regs *regs) } #endif -#ifdef CONFIG_X86_64 SYSCALL_DEFINE0(ni_syscall) { return -ENOSYS; } -#endif diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c index 41ec9c66fe15..097413c705ad 100644 --- a/arch/x86/entry/syscall_32.c +++ b/arch/x86/entry/syscall_32.c @@ -4,33 +4,22 @@ #include #include #include +#include #include #include -#ifdef CONFIG_IA32_EMULATION -/* On X86_64, we use struct pt_regs * to pass parameters to syscalls */ #define __SYSCALL_I386(nr, sym) extern asmlinkage long __ia32_##sym(const struct pt_regs *); -#define __sys_ni_syscall __ia32_sys_ni_syscall -#else /* CONFIG_IA32_EMULATION */ -#define __SYSCALL_I386(nr, sym) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); -extern asmlinkage long sys_ni_syscall(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); -#define __sys_ni_syscall sys_ni_syscall -#endif /* CONFIG_IA32_EMULATION */ #include #undef __SYSCALL_I386 -#ifdef CONFIG_IA32_EMULATION #define __SYSCALL_I386(nr, sym) [nr] = __ia32_##sym, -#else /* CONFIG_IA32_EMULATION */ -#define __SYSCALL_I386(nr, sym) [nr] = sym, -#endif /* CONFIG_IA32_EMULATION */ __visible const sys_call_ptr_t ia32_sys_call_table[__NR_ia32_syscall_max+1] = { /* * Smells like a compiler bug -- it doesn't work * when the & below is removed. */ - [0 ... __NR_ia32_syscall_max] = &__sys_ni_syscall, + [0 ... __NR_ia32_syscall_max] = &__ia32_sys_ni_syscall, #include }; diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index 1ef8d7bffa15..e413c836a826 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -16,13 +16,7 @@ #include /* for TS_COMPAT */ #include -#ifdef CONFIG_X86_64 typedef asmlinkage long (*sys_call_ptr_t)(const struct pt_regs *); -#else -typedef asmlinkage long (*sys_call_ptr_t)(unsigned long, unsigned long, - unsigned long, unsigned long, - unsigned long, unsigned long); -#endif /* CONFIG_X86_64 */ extern const sys_call_ptr_t sys_call_table[]; #if defined(CONFIG_X86_32) diff --git a/arch/x86/include/asm/syscall_wrapper.h b/arch/x86/include/asm/syscall_wrapper.h index 0f126e40a464..5e13e2caf2e4 100644 --- a/arch/x86/include/asm/syscall_wrapper.h +++ b/arch/x86/include/asm/syscall_wrapper.h @@ -11,6 +11,47 @@ struct pt_regs; extern asmlinkage long __x64_sys_ni_syscall(const struct pt_regs *regs); extern asmlinkage long __ia32_sys_ni_syscall(const struct pt_regs *regs); +/* + * Instead of the generic __SYSCALL_DEFINEx() definition, the x86 version takes + * struct pt_regs *regs as the only argument of the syscall stub(s) named as: + * __x64_sys_*() - 64-bit native syscall + * __ia32_sys_*() - 32-bit native syscall or common compat syscall + * __ia32_compat_sys_*() - 32-bit compat syscall + * __x32_compat_sys_*() - 64-bit X32 compat syscall + * + * The registers are decoded according to the ABI: + * 64-bit: RDI, RSI, RDX, R10, R8, R9 + * 32-bit: EBX, ECX, EDX, ESI, EDI, EBP + * + * The stub then passes the decoded arguments to the __se_sys_*() wrapper to + * perform sign-extension (omitted for zero-argument syscalls). Finally the + * arguments are passed to the __do_sys_*() function which is the actual + * syscall. These wrappers are marked as inline so the compiler can optimize + * the functions where appropriate. + * + * Example assembly (slightly re-ordered for better readability): + * + * <__x64_sys_recv>: <-- syscall with 4 parameters + * callq <__fentry__> + * + * mov 0x70(%rdi),%rdi <-- decode regs->di + * mov 0x68(%rdi),%rsi <-- decode regs->si + * mov 0x60(%rdi),%rdx <-- decode regs->dx + * mov 0x38(%rdi),%rcx <-- decode regs->r10 + * + * xor %r9d,%r9d <-- clear %r9 + * xor %r8d,%r8d <-- clear %r8 + * + * callq __sys_recvfrom <-- do the actual work in __sys_recvfrom() + * which takes 6 arguments + * + * cltq <-- extend return value to 64-bit + * retq <-- return + * + * This approach avoids leaking random user-provided register content down + * the call chain. + */ + /* Mapping of registers to parameters for syscalls on x86-64 and x32 */ #define SC_X86_64_REGS_TO_ARGS(x, ...) \ __MAP(x,__SC_ARGS \ @@ -68,6 +109,26 @@ extern asmlinkage long __ia32_sys_ni_syscall(const struct pt_regs *regs); #define __X64_SYS_NI(name) #endif /* CONFIG_X86_64 */ +#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION) +#define __IA32_SYS_STUB0(name) \ + __SYS_STUB0(ia32, sys_##name) + +#define __IA32_SYS_STUBx(x, name, ...) \ + __SYS_STUBx(ia32, sys##name, \ + SC_IA32_REGS_TO_ARGS(x, __VA_ARGS__)) + +#define __IA32_COND_SYSCALL(name) \ + __COND_SYSCALL(ia32, sys_##name) + +#define __IA32_SYS_NI(name) \ + __SYS_NI(ia32, sys_##name) +#else /* CONFIG_X86_32 || CONFIG_IA32_EMULATION */ +#define __IA32_SYS_STUB0(name) +#define __IA32_SYS_STUBx(x, name, ...) +#define __IA32_COND_SYSCALL(name) +#define __IA32_SYS_NI(name) +#endif /* CONFIG_X86_32 || CONFIG_IA32_EMULATION */ + #ifdef CONFIG_IA32_EMULATION /* * For IA32 emulation, we need to handle "compat" syscalls *and* create @@ -90,27 +151,11 @@ extern asmlinkage long __ia32_sys_ni_syscall(const struct pt_regs *regs); #define __IA32_COMPAT_SYS_NI(name) \ __SYS_NI(ia32, compat_sys_##name) -#define __IA32_SYS_STUB0(name) \ - __SYS_STUB0(ia32, sys_##name) - -#define __IA32_SYS_STUBx(x, name, ...) \ - __SYS_STUBx(ia32, sys##name, \ - SC_IA32_REGS_TO_ARGS(x, __VA_ARGS__)) - -#define __IA32_COND_SYSCALL(name) \ - __COND_SYSCALL(ia32, sys_##name) - -#define __IA32_SYS_NI(name) \ - __SYS_NI(ia32, sys_##name) #else /* CONFIG_IA32_EMULATION */ #define __IA32_COMPAT_SYS_STUB0(name) #define __IA32_COMPAT_SYS_STUBx(x, name, ...) #define __IA32_COMPAT_COND_SYSCALL(name) #define __IA32_COMPAT_SYS_NI(name) -#define __IA32_SYS_STUB0(name) -#define __IA32_SYS_STUBx(x, name, ...) -#define __IA32_COND_SYSCALL(name) -#define __IA32_SYS_NI(name) #endif /* CONFIG_IA32_EMULATION */ @@ -180,40 +225,6 @@ extern asmlinkage long __ia32_sys_ni_syscall(const struct pt_regs *regs); #endif /* CONFIG_COMPAT */ - -/* - * Instead of the generic __SYSCALL_DEFINEx() definition, this macro takes - * struct pt_regs *regs as the only argument of the syscall stub named - * __x64_sys_*(). It decodes just the registers it needs and passes them on to - * the __se_sys_*() wrapper performing sign extension and then to the - * __do_sys_*() function doing the actual job. These wrappers and functions - * are inlined (at least in very most cases), meaning that the assembly looks - * as follows (slightly re-ordered for better readability): - * - * <__x64_sys_recv>: <-- syscall with 4 parameters - * callq <__fentry__> - * - * mov 0x70(%rdi),%rdi <-- decode regs->di - * mov 0x68(%rdi),%rsi <-- decode regs->si - * mov 0x60(%rdi),%rdx <-- decode regs->dx - * mov 0x38(%rdi),%rcx <-- decode regs->r10 - * - * xor %r9d,%r9d <-- clear %r9 - * xor %r8d,%r8d <-- clear %r8 - * - * callq __sys_recvfrom <-- do the actual work in __sys_recvfrom() - * which takes 6 arguments - * - * cltq <-- extend return value to 64-bit - * retq <-- return - * - * This approach avoids leaking random user-provided register content down - * the call chain. - * - * If IA32_EMULATION is enabled, this macro generates an additional wrapper - * named __ia32_sys_*() which decodes the struct pt_regs *regs according - * to the i386 calling convention (bx, cx, dx, si, di, bp). - */ #define __SYSCALL_DEFINEx(x, name, ...) \ static long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \ static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__));\ diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h index 91b7b6e1a115..06cbdca634d6 100644 --- a/arch/x86/include/asm/syscalls.h +++ b/arch/x86/include/asm/syscalls.h @@ -17,33 +17,4 @@ /* kernel/ioport.c */ long ksys_ioperm(unsigned long from, unsigned long num, int turn_on); -#ifdef CONFIG_X86_32 -/* - * These definitions are only valid on pure 32-bit systems; x86-64 uses a - * different syscall calling convention - */ -asmlinkage long sys_ioperm(unsigned long, unsigned long, int); -asmlinkage long sys_iopl(unsigned int); - -/* kernel/ldt.c */ -asmlinkage long sys_modify_ldt(int, void __user *, unsigned long); - -/* kernel/signal.c */ -asmlinkage long sys_rt_sigreturn(void); - -/* kernel/tls.c */ -asmlinkage long sys_set_thread_area(struct user_desc __user *); -asmlinkage long sys_get_thread_area(struct user_desc __user *); - -/* X86_32 only */ - -/* kernel/signal.c */ -asmlinkage long sys_sigreturn(void); - -/* kernel/vm86_32.c */ -struct vm86_struct; -asmlinkage long sys_vm86old(struct vm86_struct __user *); -asmlinkage long sys_vm86(unsigned long, unsigned long); - -#endif /* CONFIG_X86_32 */ #endif /* _ASM_X86_SYSCALLS_H */ -- cgit From 0f78ff17112d8b3469b805ff4ea9780cc1e5c93b Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:43 -0400 Subject: x86/entry: Drop asmlinkage from syscalls asmlinkage is no longer required since the syscall ABI is now fully under x86 architecture control. This makes the 32-bit native syscalls a bit more effecient by passing in regs via EAX instead of on the stack. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200313195144.164260-18-brgerst@gmail.com --- arch/x86/entry/syscall_32.c | 2 +- arch/x86/entry/syscall_64.c | 2 +- arch/x86/entry/syscall_x32.c | 4 ++-- arch/x86/include/asm/syscall.h | 2 +- arch/x86/include/asm/syscall_wrapper.h | 31 ++++++++++++++----------------- 5 files changed, 19 insertions(+), 22 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c index 097413c705ad..86eb0d89d46f 100644 --- a/arch/x86/entry/syscall_32.c +++ b/arch/x86/entry/syscall_32.c @@ -8,7 +8,7 @@ #include #include -#define __SYSCALL_I386(nr, sym) extern asmlinkage long __ia32_##sym(const struct pt_regs *); +#define __SYSCALL_I386(nr, sym) extern long __ia32_##sym(const struct pt_regs *); #include #undef __SYSCALL_I386 diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c index 66d3e65e3b6b..1594ec72bcbb 100644 --- a/arch/x86/entry/syscall_64.c +++ b/arch/x86/entry/syscall_64.c @@ -11,7 +11,7 @@ #define __SYSCALL_X32(nr, sym) #define __SYSCALL_COMMON(nr, sym) __SYSCALL_64(nr, sym) -#define __SYSCALL_64(nr, sym) extern asmlinkage long __x64_##sym(const struct pt_regs *); +#define __SYSCALL_64(nr, sym) extern long __x64_##sym(const struct pt_regs *); #include #undef __SYSCALL_64 diff --git a/arch/x86/entry/syscall_x32.c b/arch/x86/entry/syscall_x32.c index 2fb09efd7b40..3d8d70d3896c 100644 --- a/arch/x86/entry/syscall_x32.c +++ b/arch/x86/entry/syscall_x32.c @@ -10,8 +10,8 @@ #define __SYSCALL_64(nr, sym) -#define __SYSCALL_X32(nr, sym) extern asmlinkage long __x32_##sym(const struct pt_regs *); -#define __SYSCALL_COMMON(nr, sym) extern asmlinkage long __x64_##sym(const struct pt_regs *); +#define __SYSCALL_X32(nr, sym) extern long __x32_##sym(const struct pt_regs *); +#define __SYSCALL_COMMON(nr, sym) extern long __x64_##sym(const struct pt_regs *); #include #undef __SYSCALL_X32 #undef __SYSCALL_COMMON diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index e413c836a826..643529417f74 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -16,7 +16,7 @@ #include /* for TS_COMPAT */ #include -typedef asmlinkage long (*sys_call_ptr_t)(const struct pt_regs *); +typedef long (*sys_call_ptr_t)(const struct pt_regs *); extern const sys_call_ptr_t sys_call_table[]; #if defined(CONFIG_X86_32) diff --git a/arch/x86/include/asm/syscall_wrapper.h b/arch/x86/include/asm/syscall_wrapper.h index 5e13e2caf2e4..e10efa1454bc 100644 --- a/arch/x86/include/asm/syscall_wrapper.h +++ b/arch/x86/include/asm/syscall_wrapper.h @@ -8,8 +8,8 @@ struct pt_regs; -extern asmlinkage long __x64_sys_ni_syscall(const struct pt_regs *regs); -extern asmlinkage long __ia32_sys_ni_syscall(const struct pt_regs *regs); +extern long __x64_sys_ni_syscall(const struct pt_regs *regs); +extern long __ia32_sys_ni_syscall(const struct pt_regs *regs); /* * Instead of the generic __SYSCALL_DEFINEx() definition, the x86 version takes @@ -66,22 +66,21 @@ extern asmlinkage long __ia32_sys_ni_syscall(const struct pt_regs *regs); ,,(unsigned int)regs->di,,(unsigned int)regs->bp) #define __SYS_STUB0(abi, name) \ - asmlinkage long __##abi##_##name(const struct pt_regs *regs); \ + long __##abi##_##name(const struct pt_regs *regs); \ ALLOW_ERROR_INJECTION(__##abi##_##name, ERRNO); \ - asmlinkage long __##abi##_##name(const struct pt_regs *regs) \ + long __##abi##_##name(const struct pt_regs *regs) \ __alias(__do_##name); #define __SYS_STUBx(abi, name, ...) \ - asmlinkage long __##abi##_##name(const struct pt_regs *regs); \ + long __##abi##_##name(const struct pt_regs *regs); \ ALLOW_ERROR_INJECTION(__##abi##_##name, ERRNO); \ - asmlinkage long __##abi##_##name(const struct pt_regs *regs) \ + long __##abi##_##name(const struct pt_regs *regs) \ { \ return __se_##name(__VA_ARGS__); \ } #define __COND_SYSCALL(abi, name) \ - asmlinkage __weak long \ - __##abi##_##name(const struct pt_regs *__unused) \ + __weak long __##abi##_##name(const struct pt_regs *__unused) \ { \ return sys_ni_syscall(); \ } @@ -192,11 +191,11 @@ extern asmlinkage long __ia32_sys_ni_syscall(const struct pt_regs *regs); * of them. */ #define COMPAT_SYSCALL_DEFINE0(name) \ - static asmlinkage long \ + static long \ __do_compat_sys_##name(const struct pt_regs *__unused); \ __IA32_COMPAT_SYS_STUB0(name) \ __X32_COMPAT_SYS_STUB0(name) \ - static asmlinkage long \ + static long \ __do_compat_sys_##name(const struct pt_regs *__unused) #define COMPAT_SYSCALL_DEFINEx(x, name, ...) \ @@ -248,12 +247,10 @@ extern asmlinkage long __ia32_sys_ni_syscall(const struct pt_regs *regs); */ #define SYSCALL_DEFINE0(sname) \ SYSCALL_METADATA(_##sname, 0); \ - static asmlinkage long \ - __do_sys_##sname(const struct pt_regs *__unused); \ + static long __do_sys_##sname(const struct pt_regs *__unused); \ __X64_SYS_STUB0(sname) \ __IA32_SYS_STUB0(sname) \ - static asmlinkage long \ - __do_sys_##sname(const struct pt_regs *__unused) + static long __do_sys_##sname(const struct pt_regs *__unused) #define COND_SYSCALL(name) \ __X64_COND_SYSCALL(name) \ @@ -268,8 +265,8 @@ extern asmlinkage long __ia32_sys_ni_syscall(const struct pt_regs *regs); * For VSYSCALLS, we need to declare these three syscalls with the new * pt_regs-based calling convention for in-kernel use. */ -asmlinkage long __x64_sys_getcpu(const struct pt_regs *regs); -asmlinkage long __x64_sys_gettimeofday(const struct pt_regs *regs); -asmlinkage long __x64_sys_time(const struct pt_regs *regs); +long __x64_sys_getcpu(const struct pt_regs *regs); +long __x64_sys_gettimeofday(const struct pt_regs *regs); +long __x64_sys_time(const struct pt_regs *regs); #endif /* _ASM_X86_SYSCALL_WRAPPER_H */ -- cgit From ffd75b373f3656cbb593c5221cc36ce232b7bbc1 Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Fri, 13 Mar 2020 15:51:44 -0400 Subject: x86: Remove unneeded includes Clean up includes of and in Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200313195144.164260-19-brgerst@gmail.com --- arch/x86/include/asm/syscalls.h | 5 ----- arch/x86/kernel/ldt.c | 1 - arch/x86/kernel/process.c | 1 - arch/x86/kernel/process_32.c | 1 - arch/x86/kernel/process_64.c | 1 - arch/x86/kernel/signal.c | 2 -- arch/x86/kernel/sys_x86_64.c | 1 - 7 files changed, 12 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h index 06cbdca634d6..6714a358235d 100644 --- a/arch/x86/include/asm/syscalls.h +++ b/arch/x86/include/asm/syscalls.h @@ -8,11 +8,6 @@ #ifndef _ASM_X86_SYSCALLS_H #define _ASM_X86_SYSCALLS_H -#include -#include -#include -#include - /* Common in X86_32 and X86_64 */ /* kernel/ioport.c */ long ksys_ioperm(unsigned long from, unsigned long num, int turn_on); diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index c57e1ca70fd1..84c3ba32f211 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -27,7 +27,6 @@ #include #include #include -#include #include /* This is a multiple of PAGE_SIZE. */ diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 839b5244e3b7..d78e2282df0f 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -28,7 +28,6 @@ #include #include #include -#include #include #include #include diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 5052ced43373..954b013cc585 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -49,7 +49,6 @@ #include #include -#include #include #include #include diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index ffd497804dbc..5ef9d8f25b0e 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -48,7 +48,6 @@ #include #include #include -#include #include #include #include diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 860904990b26..0364f8c3bee3 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -42,8 +42,6 @@ #endif /* CONFIG_X86_64 */ #include -#include - #include #include diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c index ca3c11a17b5a..504fa5425bce 100644 --- a/arch/x86/kernel/sys_x86_64.c +++ b/arch/x86/kernel/sys_x86_64.c @@ -21,7 +21,6 @@ #include #include -#include /* * Align a virtual address to avoid aliasing in the I$ on AMD F15h. -- cgit From 46db36abc32dfd18041c57d571fe4945fc0057fe Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 20 Mar 2020 12:56:39 +0100 Subject: x86/entry: Rename ___preempt_schedule Because moar '_' isn't always moar readable. git grep -l "___preempt_schedule\(_notrace\)*" | while read file; do sed -ie 's/___preempt_schedule\(_notrace\)*/preempt_schedule\1_thunk/g' $file; done Reported-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Reviewed-by: Thomas Gleixner Acked-by: Will Deacon Link: https://lkml.kernel.org/r/20200320115858.995685950@infradead.org --- arch/x86/entry/thunk_32.S | 8 ++++---- arch/x86/entry/thunk_64.S | 8 ++++---- arch/x86/include/asm/preempt.h | 8 ++++---- 3 files changed, 12 insertions(+), 12 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/thunk_32.S b/arch/x86/entry/thunk_32.S index e010d4ae11f1..3a07ce3ec70b 100644 --- a/arch/x86/entry/thunk_32.S +++ b/arch/x86/entry/thunk_32.S @@ -35,9 +35,9 @@ SYM_CODE_END(\name) #endif #ifdef CONFIG_PREEMPTION - THUNK ___preempt_schedule, preempt_schedule - THUNK ___preempt_schedule_notrace, preempt_schedule_notrace - EXPORT_SYMBOL(___preempt_schedule) - EXPORT_SYMBOL(___preempt_schedule_notrace) + THUNK preempt_schedule_thunk, preempt_schedule + THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace + EXPORT_SYMBOL(preempt_schedule_thunk) + EXPORT_SYMBOL(preempt_schedule_notrace_thunk) #endif diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S index c5c3b6e86e62..dbe4493b534e 100644 --- a/arch/x86/entry/thunk_64.S +++ b/arch/x86/entry/thunk_64.S @@ -47,10 +47,10 @@ SYM_FUNC_END(\name) #endif #ifdef CONFIG_PREEMPTION - THUNK ___preempt_schedule, preempt_schedule - THUNK ___preempt_schedule_notrace, preempt_schedule_notrace - EXPORT_SYMBOL(___preempt_schedule) - EXPORT_SYMBOL(___preempt_schedule_notrace) + THUNK preempt_schedule_thunk, preempt_schedule + THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace + EXPORT_SYMBOL(preempt_schedule_thunk) + EXPORT_SYMBOL(preempt_schedule_notrace_thunk) #endif #if defined(CONFIG_TRACE_IRQFLAGS) \ diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h index 3d4cb83a8828..69485ca13665 100644 --- a/arch/x86/include/asm/preempt.h +++ b/arch/x86/include/asm/preempt.h @@ -103,14 +103,14 @@ static __always_inline bool should_resched(int preempt_offset) } #ifdef CONFIG_PREEMPTION - extern asmlinkage void ___preempt_schedule(void); + extern asmlinkage void preempt_schedule_thunk(void); # define __preempt_schedule() \ - asm volatile ("call ___preempt_schedule" : ASM_CALL_CONSTRAINT) + asm volatile ("call preempt_schedule_thunk" : ASM_CALL_CONSTRAINT) extern asmlinkage void preempt_schedule(void); - extern asmlinkage void ___preempt_schedule_notrace(void); + extern asmlinkage void preempt_schedule_notrace_thunk(void); # define __preempt_schedule_notrace() \ - asm volatile ("call ___preempt_schedule_notrace" : ASM_CALL_CONSTRAINT) + asm volatile ("call preempt_schedule_notrace_thunk" : ASM_CALL_CONSTRAINT) extern asmlinkage void preempt_schedule_notrace(void); #endif -- cgit From 763802b53a427ed3cbd419dbba255c414fdd9e7c Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Sat, 21 Mar 2020 18:22:41 -0700 Subject: x86/mm: split vmalloc_sync_all() Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the vunmap() code-path. While this change was necessary to maintain correctness on x86-32-pae kernels, it also adds additional cycles for architectures that don't need it. Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported severe performance regressions in micro-benchmarks because it now also calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But the vmalloc_sync_all() implementation on x86-64 is only needed for newly created mappings. To avoid the unnecessary work on x86-64 and to gain the performance back, split up vmalloc_sync_all() into two functions: * vmalloc_sync_mappings(), and * vmalloc_sync_unmappings() Most call-sites to vmalloc_sync_all() only care about new mappings being synchronized. The only exception is the new call-site added in the above mentioned commit. Shile Zhang directed us to a report of an 80% regression in reaim throughput. Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") Reported-by: kernel test robot Reported-by: Shile Zhang Signed-off-by: Joerg Roedel Signed-off-by: Andrew Morton Tested-by: Borislav Petkov Acked-by: Rafael J. Wysocki [GHES] Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/ Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com Signed-off-by: Linus Torvalds --- arch/x86/mm/fault.c | 26 ++++++++++++++++++++++++-- 1 file changed, 24 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index fa4ea09593ab..629fdf13f846 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -190,7 +190,7 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address) return pmd_k; } -void vmalloc_sync_all(void) +static void vmalloc_sync(void) { unsigned long address; @@ -217,6 +217,16 @@ void vmalloc_sync_all(void) } } +void vmalloc_sync_mappings(void) +{ + vmalloc_sync(); +} + +void vmalloc_sync_unmappings(void) +{ + vmalloc_sync(); +} + /* * 32-bit: * @@ -319,11 +329,23 @@ out: #else /* CONFIG_X86_64: */ -void vmalloc_sync_all(void) +void vmalloc_sync_mappings(void) { + /* + * 64-bit mappings might allocate new p4d/pud pages + * that need to be propagated to all tasks' PGDs. + */ sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END); } +void vmalloc_sync_unmappings(void) +{ + /* + * Unmappings never allocate or free p4d/pud pages. + * No work is required here. + */ +} + /* * 64-bit: * -- cgit From 077168e241ec5a3b273652acb1e85f8bc1dc2d81 Mon Sep 17 00:00:00 2001 From: Wei Huang Date: Sat, 21 Mar 2020 14:38:00 -0500 Subject: x86/mce/amd: Add PPIN support for AMD MCE Newer AMD CPUs support a feature called protected processor identification number (PPIN). This feature can be detected via CPUID_Fn80000008_EBX[23]. However, CPUID alone is not enough to read the processor identification number - MSR_AMD_PPIN_CTL also needs to be configured properly. If, for any reason, MSR_AMD_PPIN_CTL[PPIN_EN] can not be turned on, such as disabled in BIOS, the CPU capability bit X86_FEATURE_AMD_PPIN needs to be cleared. When the X86_FEATURE_AMD_PPIN capability is available, the identification number is issued together with the MCE error info in order to keep track of the source of MCE errors. [ bp: Massage. ] Co-developed-by: Smita Koralahalli Channabasappa Signed-off-by: Smita Koralahalli Channabasappa Signed-off-by: Wei Huang Signed-off-by: Borislav Petkov Acked-by: Tony Luck Link: https://lkml.kernel.org/r/20200321193800.3666964-1-wei.huang2@amd.com --- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/kernel/cpu/amd.c | 30 ++++++++++++++++++++++++++++++ arch/x86/kernel/cpu/mce/core.c | 2 ++ 3 files changed, 33 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index f3327cb56edf..4b263ffb793b 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -299,6 +299,7 @@ #define X86_FEATURE_AMD_IBRS (13*32+14) /* "" Indirect Branch Restricted Speculation */ #define X86_FEATURE_AMD_STIBP (13*32+15) /* "" Single Thread Indirect Branch Predictors */ #define X86_FEATURE_AMD_STIBP_ALWAYS_ON (13*32+17) /* "" Single Thread Indirect Branch Predictors always-on preferred */ +#define X86_FEATURE_AMD_PPIN (13*32+23) /* Protected Processor Inventory Number */ #define X86_FEATURE_AMD_SSBD (13*32+24) /* "" Speculative Store Bypass Disable */ #define X86_FEATURE_VIRT_SSBD (13*32+25) /* Virtualized Speculative Store Bypass Disable */ #define X86_FEATURE_AMD_SSB_NO (13*32+26) /* "" Speculative Store Bypass is fixed in hardware. */ diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index ac83a0fef628..a2cd870af7f5 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -393,6 +393,35 @@ static void amd_detect_cmp(struct cpuinfo_x86 *c) per_cpu(cpu_llc_id, cpu) = c->phys_proc_id; } +static void amd_detect_ppin(struct cpuinfo_x86 *c) +{ + unsigned long long val; + + if (!cpu_has(c, X86_FEATURE_AMD_PPIN)) + return; + + /* When PPIN is defined in CPUID, still need to check PPIN_CTL MSR */ + if (rdmsrl_safe(MSR_AMD_PPIN_CTL, &val)) + goto clear_ppin; + + /* PPIN is locked in disabled mode, clear feature bit */ + if ((val & 3UL) == 1UL) + goto clear_ppin; + + /* If PPIN is disabled, try to enable it */ + if (!(val & 2UL)) { + wrmsrl_safe(MSR_AMD_PPIN_CTL, val | 2UL); + rdmsrl_safe(MSR_AMD_PPIN_CTL, &val); + } + + /* If PPIN_EN bit is 1, return from here; otherwise fall through */ + if (val & 2UL) + return; + +clear_ppin: + clear_cpu_cap(c, X86_FEATURE_AMD_PPIN); +} + u16 amd_get_nb_id(int cpu) { return per_cpu(cpu_llc_id, cpu); @@ -940,6 +969,7 @@ static void init_amd(struct cpuinfo_x86 *c) amd_detect_cmp(c); amd_get_topology(c); srat_detect_node(c); + amd_detect_ppin(c); init_amd_cacheinfo(c); diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index fe3983d551cc..dd06fce537fc 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -142,6 +142,8 @@ void mce_setup(struct mce *m) if (this_cpu_has(X86_FEATURE_INTEL_PPIN)) rdmsrl(MSR_PPIN, m->ppin); + else if (this_cpu_has(X86_FEATURE_AMD_PPIN)) + rdmsrl(MSR_AMD_PPIN, m->ppin); m->microcode = boot_cpu_data.microcode; } -- cgit From 31a9122058bc5f042cb04bcdb8cd9e6c77fdae8d Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Mon, 23 Mar 2020 06:35:42 +0530 Subject: x86/mm: Drop pud_mknotpresent() There is an inconsistency between PMD and PUD-based THP page table helpers like the following, as pud_present() does not test for _PAGE_PSE. pmd_present(pmd_mknotpresent(pmd)) : True pud_present(pud_mknotpresent(pud)) : False Drop pud_mknotpresent() as there are no current users. If/when needed back later, pud_present() will also have to be fixed to accommodate _PAGE_PSE. Signed-off-by: Anshuman Khandual Signed-off-by: Borislav Petkov Reviewed-by: Baoquan He Acked-by: Balbir Singh Acked-by: Kirill A. Shutemov Link: https://lkml.kernel.org/r/1584925542-13034-1-git-send-email-anshuman.khandual@arm.com --- arch/x86/include/asm/pgtable.h | 6 ------ 1 file changed, 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 7e118660bbd9..d74dc560e3ab 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -595,12 +595,6 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd) __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE))); } -static inline pud_t pud_mknotpresent(pud_t pud) -{ - return pfn_pud(pud_pfn(pud), - __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE))); -} - static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask); static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) -- cgit From 0e79ad863df43b01090ae18c97de5c3787f069c6 Mon Sep 17 00:00:00 2001 From: Benjamin Thiel Date: Mon, 23 Mar 2020 11:59:34 +0100 Subject: x86/cpu: Fix a -Wmissing-prototypes warning for init_ia32_feat_ctl() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a missing include in order to fix -Wmissing-prototypes warning: arch/x86/kernel/cpu/feat_ctl.c:95:6: warning: no previous prototype for ‘init_ia32_feat_ctl’ [-Wmissing-prototypes] 95 | void init_ia32_feat_ctl(struct cpuinfo_x86 *c) Signed-off-by: Benjamin Thiel Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200323105934.26597-1-b.thiel@posteo.de --- arch/x86/kernel/cpu/feat_ctl.c | 1 + 1 file changed, 1 insertion(+) (limited to 'arch/x86') diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c index 0268185bef94..29a3bedabd06 100644 --- a/arch/x86/kernel/cpu/feat_ctl.c +++ b/arch/x86/kernel/cpu/feat_ctl.c @@ -5,6 +5,7 @@ #include #include #include +#include "cpu.h" #undef pr_fmt #define pr_fmt(fmt) "x86/cpu: " fmt -- cgit From 2e2409afe5f0c284c7dfe5504058e8d115806a7d Mon Sep 17 00:00:00 2001 From: Tom Lendacky Date: Fri, 20 Mar 2020 11:07:07 -0500 Subject: KVM: SVM: Issue WBINVD after deactivating an SEV guest Currently, CLFLUSH is used to flush SEV guest memory before the guest is terminated (or a memory hotplug region is removed). However, CLFLUSH is not enough to ensure that SEV guest tagged data is flushed from the cache. With 33af3a7ef9e6 ("KVM: SVM: Reduce WBINVD/DF_FLUSH invocations"), the original WBINVD was removed. This then exposed crashes at random times because of a cache flush race with a page that had both a hypervisor and a guest tag in the cache. Restore the WBINVD when destroying an SEV guest and add a WBINVD to the svm_unregister_enc_region() function to ensure hotplug memory is flushed when removed. The DF_FLUSH can still be avoided at this point. Fixes: 33af3a7ef9e6 ("KVM: SVM: Reduce WBINVD/DF_FLUSH invocations") Signed-off-by: Tom Lendacky Message-Id: Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index f0aa9ff9666f..50d1ebafe0b3 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1933,14 +1933,6 @@ static void sev_clflush_pages(struct page *pages[], unsigned long npages) static void __unregister_enc_region_locked(struct kvm *kvm, struct enc_region *region) { - /* - * The guest may change the memory encryption attribute from C=0 -> C=1 - * or vice versa for this memory range. Lets make sure caches are - * flushed to ensure that guest data gets written into memory with - * correct C-bit. - */ - sev_clflush_pages(region->pages, region->npages); - sev_unpin_memory(kvm, region->pages, region->npages); list_del(®ion->list); kfree(region); @@ -1970,6 +1962,13 @@ static void sev_vm_destroy(struct kvm *kvm) mutex_lock(&kvm->lock); + /* + * Ensure that all guest tagged cache entries are flushed before + * releasing the pages back to the system for use. CLFLUSH will + * not do this, so issue a WBINVD. + */ + wbinvd_on_all_cpus(); + /* * if userspace was terminated before unregistering the memory regions * then lets unpin all the registered memory. @@ -7288,6 +7287,13 @@ static int svm_unregister_enc_region(struct kvm *kvm, goto failed; } + /* + * Ensure that all guest tagged cache entries are flushed before + * releasing the pages back to the system for use. CLFLUSH will + * not do this, so issue a WBINVD. + */ + wbinvd_on_all_cpus(); + __unregister_enc_region_locked(kvm, region); mutex_unlock(&kvm->lock); -- cgit From edec6e015a02003c2af0ce82c54ea016b5a9e3f0 Mon Sep 17 00:00:00 2001 From: He Zhe Date: Fri, 20 Mar 2020 15:06:07 +0800 Subject: KVM: LAPIC: Mark hrtimer for period or oneshot mode to expire in hard interrupt context apic->lapic_timer.timer was initialized with HRTIMER_MODE_ABS_HARD but started later with HRTIMER_MODE_ABS, which may cause the following warning in PREEMPT_RT kernel. WARNING: CPU: 1 PID: 2957 at kernel/time/hrtimer.c:1129 hrtimer_start_range_ns+0x348/0x3f0 CPU: 1 PID: 2957 Comm: qemu-system-x86 Not tainted 5.4.23-rt11 #1 Hardware name: Supermicro SYS-E300-9A-8C/A2SDi-8C-HLN4F, BIOS 1.1a 09/18/2018 RIP: 0010:hrtimer_start_range_ns+0x348/0x3f0 Code: 4d b8 0f 94 c1 0f b6 c9 e8 35 f1 ff ff 4c 8b 45 b0 e9 3b fd ff ff e8 d7 3f fa ff 48 98 4c 03 34 c5 a0 26 bf 93 e9 a1 fd ff ff <0f> 0b e9 fd fc ff ff 65 8b 05 fa b7 90 6d 89 c0 48 0f a3 05 60 91 RSP: 0018:ffffbc60026ffaf8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff9d81657d4110 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000006cc7987bcf RDI: ffff9d81657d4110 RBP: ffffbc60026ffb58 R08: 0000000000000001 R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000000 R12: 0000006cc7987bcf R13: 0000000000000000 R14: 0000006cc7987bcf R15: ffffbc60026d6a00 FS: 00007f401daed700(0000) GS:ffff9d81ffa40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000ffffffff CR3: 0000000fa7574000 CR4: 00000000003426e0 Call Trace: ? kvm_release_pfn_clean+0x22/0x60 [kvm] start_sw_timer+0x85/0x230 [kvm] ? vmx_vmexit+0x1b/0x30 [kvm_intel] kvm_lapic_switch_to_sw_timer+0x72/0x80 [kvm] vmx_pre_block+0x1cb/0x260 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_vmexit+0x1b/0x30 [kvm_intel] ? vmx_vmexit+0xf/0x30 [kvm_intel] ? vmx_sync_pir_to_irr+0x9e/0x100 [kvm_intel] ? kvm_apic_has_interrupt+0x46/0x80 [kvm] kvm_arch_vcpu_ioctl_run+0x85b/0x1fa0 [kvm] ? _raw_spin_unlock_irqrestore+0x18/0x50 ? _copy_to_user+0x2c/0x30 kvm_vcpu_ioctl+0x235/0x660 [kvm] ? rt_spin_unlock+0x2c/0x50 do_vfs_ioctl+0x3e4/0x650 ? __fget+0x7a/0xa0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x4d/0x120 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f4027cc54a7 Code: 00 00 90 48 8b 05 e9 59 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b9 59 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007f401dae9858 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00005558bd029690 RCX: 00007f4027cc54a7 RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000d RBP: 00007f4028b72000 R08: 00005558bc829ad0 R09: 00000000ffffffff R10: 00005558bcf90ca0 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 00005558bce1c840 --[ end trace 0000000000000002 ]-- Signed-off-by: He Zhe Message-Id: <1584687967-332859-1-git-send-email-zhe.he@windriver.com> Reviewed-by: Wanpeng Li Signed-off-by: Paolo Bonzini --- arch/x86/kvm/lapic.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index e3099c642fec..929511e01960 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1715,7 +1715,7 @@ static void start_sw_period(struct kvm_lapic *apic) hrtimer_start(&apic->lapic_timer.timer, apic->lapic_timer.target_expiration, - HRTIMER_MODE_ABS); + HRTIMER_MODE_ABS_HARD); } bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu) -- cgit From 1c1a18b00d7e25d1bed3507880de2da07be704a2 Mon Sep 17 00:00:00 2001 From: Vincenzo Frascino Date: Mon, 23 Mar 2020 12:41:09 +0000 Subject: um: Fix header inclusion User Mode Linux is a flavor of x86 that from the vDSO prospective always falls back on system calls. This implies that it does not require any of the unified vDSO definitions and their inclusion causes side effects like this: In file included from include/vdso/processor.h:10:0, from include/vdso/datapage.h:17, from arch/x86/include/asm/vgtod.h:7, from arch/x86/um/../kernel/sys_ia32.c:49: >> arch/x86/include/asm/vdso/processor.h:11:29: error: redefinition of 'rep_nop' static __always_inline void rep_nop(void) ^~~~~~~ In file included from include/linux/rcupdate.h:30:0, from include/linux/rculist.h:11, from include/linux/pid.h:5, from include/linux/sched.h:14, from arch/x86/um/../kernel/sys_ia32.c:25: arch/x86/um/asm/processor.h:24:20: note: previous definition of 'rep_nop' was here static inline void rep_nop(void) Make sure that the unnecessary headers are not included when um is built to address the problem. Fixes: abc22418db02 ("x86/vdso: Enable x86 to use common headers") Reported-by: kbuild test robot Signed-off-by: Vincenzo Frascino Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200323124109.7104-1-vincenzo.frascino@arm.com --- arch/x86/include/asm/vgtod.h | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index fc8e4cd342cc..7aa38b2ad8a9 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -2,6 +2,11 @@ #ifndef _ASM_X86_VGTOD_H #define _ASM_X86_VGTOD_H +/* + * This check is required to prevent ARCH=um to include + * unwanted headers. + */ +#ifdef CONFIG_GENERIC_GETTIMEOFDAY #include #include #include @@ -14,5 +19,6 @@ typedef u64 gtod_long_t; #else typedef unsigned long gtod_long_t; #endif +#endif /* CONFIG_GENERIC_GETTIMEOFDAY */ #endif /* _ASM_X86_VGTOD_H */ -- cgit From 428b8f1d9f92f838b73997adc10046d3c6e05790 Mon Sep 17 00:00:00 2001 From: Nick Desaulniers Date: Mon, 23 Mar 2020 12:12:43 -0700 Subject: KVM: VMX: don't allow memory operands for inline asm that modifies SP THUNK_TARGET defines [thunk_target] as having "rm" input constraints when CONFIG_RETPOLINE is not set, which isn't constrained enough for this specific case. For inline assembly that modifies the stack pointer before using this input, the underspecification of constraints is dangerous, and results in an indirect call to a previously pushed flags register. In this case `entry`'s stack slot is good enough to satisfy the "m" constraint in "rm", but the inline assembly in handle_external_interrupt_irqoff() modifies the stack pointer via push+pushf before using this input, which in this case results in calling what was the previous state of the flags register, rather than `entry`. Be more specific in the constraints by requiring `entry` be in a register, and not a memory operand. Reported-by: Dmitry Vyukov Reported-by: syzbot+3f29ca2efb056a761e38@syzkaller.appspotmail.com Debugged-by: Alexander Potapenko Debugged-by: Paolo Bonzini Debugged-by: Sean Christopherson Signed-off-by: Nick Desaulniers Message-Id: <20200323191243.30002-1-ndesaulniers@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 26f8f31563e9..079d9fbf278e 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6287,7 +6287,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu) #endif ASM_CALL_CONSTRAINT : - THUNK_TARGET(entry), + [thunk_target]"r"(entry), [ss]"i"(__KERNEL_DS), [cs]"i"(__KERNEL_CS) ); -- cgit From e3747407c4d52ee3f82afdad235866e815362197 Mon Sep 17 00:00:00 2001 From: Zhenyu Wang Date: Mon, 23 Mar 2020 17:22:36 +0800 Subject: KVM: x86: Expose fast short REP MOV for supported cpuid For CPU supporting fast short REP MOV (XF86_FEATURE_FSRM) e.g Icelake, Tigerlake, expose it in KVM supported cpuid as well. Signed-off-by: Zhenyu Wang Message-Id: <20200323092236.3703-1-zhenyuw@linux.intel.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 990c76f34b03..60ae93b09e72 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -338,7 +338,7 @@ void kvm_set_cpu_caps(void) kvm_cpu_cap_mask(CPUID_7_EDX, F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) | F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) | - F(MD_CLEAR) | F(AVX512_VP2INTERSECT) + F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) ); /* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */ -- cgit From 31603d4fc2bb4f0815245d496cb970b27b4f636a Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 21 Mar 2020 12:37:49 -0700 Subject: KVM: VMX: Always VMCLEAR in-use VMCSes during crash with kexec support VMCLEAR all in-use VMCSes during a crash, even if kdump's NMI shootdown interrupted a KVM update of the percpu in-use VMCS list. Because NMIs are not blocked by disabling IRQs, it's possible that crash_vmclear_local_loaded_vmcss() could be called while the percpu list of VMCSes is being modified, e.g. in the middle of list_add() in vmx_vcpu_load_vmcs(). This potential corner case was called out in the original commit[*], but the analysis of its impact was wrong. Skipping the VMCLEARs is wrong because it all but guarantees that a loaded, and therefore cached, VMCS will live across kexec and corrupt memory in the new kernel. Corruption will occur because the CPU's VMCS cache is non-coherent, i.e. not snooped, and so the writeback of VMCS memory on its eviction will overwrite random memory in the new kernel. The VMCS will live because the NMI shootdown also disables VMX, i.e. the in-progress VMCLEAR will #UD, and existing Intel CPUs do not flush the VMCS cache on VMXOFF. Furthermore, interrupting list_add() and list_del() is safe due to crash_vmclear_local_loaded_vmcss() using forward iteration. list_add() ensures the new entry is not visible to forward iteration unless the entire add completes, via WRITE_ONCE(prev->next, new). A bad "prev" pointer could be observed if the NMI shootdown interrupted list_del() or list_add(), but list_for_each_entry() does not consume ->prev. In addition to removing the temporary disabling of VMCLEAR, open code loaded_vmcs_init() in __loaded_vmcs_clear() and reorder VMCLEAR so that the VMCS is deleted from the list only after it's been VMCLEAR'd. Deleting the VMCS before VMCLEAR would allow a race where the NMI shootdown could arrive between list_del() and vmcs_clear() and thus neither flow would execute a successful VMCLEAR. Alternatively, more code could be moved into loaded_vmcs_init(), but that gets rather silly as the only other user, alloc_loaded_vmcs(), doesn't need the smp_wmb() and would need to work around the list_del(). Update the smp_*() comments related to the list manipulation, and opportunistically reword them to improve clarity. [*] https://patchwork.kernel.org/patch/1675731/#3720461 Fixes: 8f536b7697a0 ("KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson Message-Id: <20200321193751.24985-2-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 67 ++++++++++++-------------------------------------- 1 file changed, 16 insertions(+), 51 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 07299a957d4a..efaca09455bf 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -663,43 +663,15 @@ void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs) } #ifdef CONFIG_KEXEC_CORE -/* - * This bitmap is used to indicate whether the vmclear - * operation is enabled on all cpus. All disabled by - * default. - */ -static cpumask_t crash_vmclear_enabled_bitmap = CPU_MASK_NONE; - -static inline void crash_enable_local_vmclear(int cpu) -{ - cpumask_set_cpu(cpu, &crash_vmclear_enabled_bitmap); -} - -static inline void crash_disable_local_vmclear(int cpu) -{ - cpumask_clear_cpu(cpu, &crash_vmclear_enabled_bitmap); -} - -static inline int crash_local_vmclear_enabled(int cpu) -{ - return cpumask_test_cpu(cpu, &crash_vmclear_enabled_bitmap); -} - static void crash_vmclear_local_loaded_vmcss(void) { int cpu = raw_smp_processor_id(); struct loaded_vmcs *v; - if (!crash_local_vmclear_enabled(cpu)) - return; - list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu), loaded_vmcss_on_cpu_link) vmcs_clear(v->vmcs); } -#else -static inline void crash_enable_local_vmclear(int cpu) { } -static inline void crash_disable_local_vmclear(int cpu) { } #endif /* CONFIG_KEXEC_CORE */ static void __loaded_vmcs_clear(void *arg) @@ -711,19 +683,24 @@ static void __loaded_vmcs_clear(void *arg) return; /* vcpu migration can race with cpu offline */ if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs) per_cpu(current_vmcs, cpu) = NULL; - crash_disable_local_vmclear(cpu); + + vmcs_clear(loaded_vmcs->vmcs); + if (loaded_vmcs->shadow_vmcs && loaded_vmcs->launched) + vmcs_clear(loaded_vmcs->shadow_vmcs); + list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link); /* - * we should ensure updating loaded_vmcs->loaded_vmcss_on_cpu_link - * is before setting loaded_vmcs->vcpu to -1 which is done in - * loaded_vmcs_init. Otherwise, other cpu can see vcpu = -1 fist - * then adds the vmcs into percpu list before it is deleted. + * Ensure all writes to loaded_vmcs, including deleting it from its + * current percpu list, complete before setting loaded_vmcs->vcpu to + * -1, otherwise a different cpu can see vcpu == -1 first and add + * loaded_vmcs to its percpu list before it's deleted from this cpu's + * list. Pairs with the smp_rmb() in vmx_vcpu_load_vmcs(). */ smp_wmb(); - loaded_vmcs_init(loaded_vmcs); - crash_enable_local_vmclear(cpu); + loaded_vmcs->cpu = -1; + loaded_vmcs->launched = 0; } void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs) @@ -1342,18 +1319,17 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu) if (!already_loaded) { loaded_vmcs_clear(vmx->loaded_vmcs); local_irq_disable(); - crash_disable_local_vmclear(cpu); /* - * Read loaded_vmcs->cpu should be before fetching - * loaded_vmcs->loaded_vmcss_on_cpu_link. - * See the comments in __loaded_vmcs_clear(). + * Ensure loaded_vmcs->cpu is read before adding loaded_vmcs to + * this cpu's percpu list, otherwise it may not yet be deleted + * from its previous cpu's percpu list. Pairs with the + * smb_wmb() in __loaded_vmcs_clear(). */ smp_rmb(); list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, &per_cpu(loaded_vmcss_on_cpu, cpu)); - crash_enable_local_vmclear(cpu); local_irq_enable(); } @@ -2279,17 +2255,6 @@ static int hardware_enable(void) INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu)); spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu)); - /* - * Now we can enable the vmclear operation in kdump - * since the loaded_vmcss_on_cpu list on this cpu - * has been initialized. - * - * Though the cpu is not in VMX operation now, there - * is no problem to enable the vmclear operation - * for the loaded_vmcss_on_cpu list is empty! - */ - crash_enable_local_vmclear(cpu); - kvm_cpu_vmxon(phys_addr); if (enable_ept) ept_sync_global(); -- cgit From d260f9ef50c76c5587353fa71719be8cf5525f06 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 21 Mar 2020 12:37:50 -0700 Subject: KVM: VMX: Fold loaded_vmcs_init() into alloc_loaded_vmcs() Subsume loaded_vmcs_init() into alloc_loaded_vmcs(), its only remaining caller, and drop the VMCLEAR on the shadow VMCS, which is guaranteed to be NULL. loaded_vmcs_init() was previously used by loaded_vmcs_clear(), but loaded_vmcs_clear() also subsumed loaded_vmcs_init() to properly handle smp_wmb() with respect to VMCLEAR. Signed-off-by: Sean Christopherson Message-Id: <20200321193751.24985-3-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 14 ++++---------- arch/x86/kvm/vmx/vmx.h | 1 - 2 files changed, 4 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index efaca09455bf..07634caa560d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -653,15 +653,6 @@ static int vmx_set_guest_msr(struct vcpu_vmx *vmx, struct shared_msr_entry *msr, return ret; } -void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs) -{ - vmcs_clear(loaded_vmcs->vmcs); - if (loaded_vmcs->shadow_vmcs && loaded_vmcs->launched) - vmcs_clear(loaded_vmcs->shadow_vmcs); - loaded_vmcs->cpu = -1; - loaded_vmcs->launched = 0; -} - #ifdef CONFIG_KEXEC_CORE static void crash_vmclear_local_loaded_vmcss(void) { @@ -2555,9 +2546,12 @@ int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs) if (!loaded_vmcs->vmcs) return -ENOMEM; + vmcs_clear(loaded_vmcs->vmcs); + loaded_vmcs->shadow_vmcs = NULL; loaded_vmcs->hv_timer_soft_disabled = false; - loaded_vmcs_init(loaded_vmcs); + loaded_vmcs->cpu = -1; + loaded_vmcs->launched = 0; if (cpu_has_vmx_msr_bitmap()) { loaded_vmcs->msr_bitmap = (unsigned long *) diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index be93d597306c..79d38f41ef7a 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -492,7 +492,6 @@ struct vmcs *alloc_vmcs_cpu(bool shadow, int cpu, gfp_t flags); void free_vmcs(struct vmcs *vmcs); int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs); void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs); -void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs); void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs); static inline struct vmcs *alloc_vmcs(bool shadow) -- cgit From 4f6ea0a87608e1b26ed26123ae7c42aaecdd2c6c Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 21 Mar 2020 12:37:51 -0700 Subject: KVM: VMX: Gracefully handle faults on VMXON Gracefully handle faults on VMXON, e.g. #GP due to VMX being disabled by BIOS, instead of letting the fault crash the system. Now that KVM uses cpufeatures to query support instead of reading MSR_IA32_FEAT_CTL directly, it's possible for a bug in a different subsystem to cause KVM to incorrectly attempt VMXON[*]. Crashing the system is especially annoying if the system is configured such that hardware_enable() will be triggered during boot. Oppurtunistically rename @addr to @vmxon_pointer and use a named param to reference it in the inline assembly. Print 0xdeadbeef in the ultra-"rare" case that reading MSR_IA32_FEAT_CTL also faults. [*] https://lkml.kernel.org/r/20200226231615.13664-1-sean.j.christopherson@intel.com Signed-off-by: Sean Christopherson Message-Id: <20200321193751.24985-4-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 07634caa560d..3aba51d782e2 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2218,18 +2218,33 @@ static __init int vmx_disabled_by_bios(void) !boot_cpu_has(X86_FEATURE_VMX); } -static void kvm_cpu_vmxon(u64 addr) +static int kvm_cpu_vmxon(u64 vmxon_pointer) { + u64 msr; + cr4_set_bits(X86_CR4_VMXE); intel_pt_handle_vmx(1); - asm volatile ("vmxon %0" : : "m"(addr)); + asm_volatile_goto("1: vmxon %[vmxon_pointer]\n\t" + _ASM_EXTABLE(1b, %l[fault]) + : : [vmxon_pointer] "m"(vmxon_pointer) + : : fault); + return 0; + +fault: + WARN_ONCE(1, "VMXON faulted, MSR_IA32_FEAT_CTL (0x3a) = 0x%llx\n", + rdmsrl_safe(MSR_IA32_FEAT_CTL, &msr) ? 0xdeadbeef : msr); + intel_pt_handle_vmx(0); + cr4_clear_bits(X86_CR4_VMXE); + + return -EFAULT; } static int hardware_enable(void) { int cpu = raw_smp_processor_id(); u64 phys_addr = __pa(per_cpu(vmxarea, cpu)); + int r; if (cr4_read_shadow() & X86_CR4_VMXE) return -EBUSY; @@ -2246,7 +2261,10 @@ static int hardware_enable(void) INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu)); spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu)); - kvm_cpu_vmxon(phys_addr); + r = kvm_cpu_vmxon(phys_addr); + if (r) + return r; + if (enable_ept) ept_sync_global(); -- cgit From 14388ae24544c4cc09056773c886839a2c8d256b Mon Sep 17 00:00:00 2001 From: Alexey Makhalov Date: Mon, 23 Mar 2020 19:57:03 +0000 Subject: x86/vmware: Make vmware_select_hypercall() __init vmware_select_hypercall() is used only by the __init functions, and should be annotated with __init as well. Signed-off-by: Alexey Makhalov Signed-off-by: Borislav Petkov Reviewed-by: Thomas Hellstrom Reviewed-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200323195707.31242-2-amakhalov@vmware.com --- arch/x86/kernel/cpu/vmware.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c index 46d732696c1c..d280560fd75e 100644 --- a/arch/x86/kernel/cpu/vmware.c +++ b/arch/x86/kernel/cpu/vmware.c @@ -213,7 +213,7 @@ static void __init vmware_platform_setup(void) vmware_set_capabilities(); } -static u8 vmware_select_hypercall(void) +static u8 __init vmware_select_hypercall(void) { int eax, ebx, ecx, edx; -- cgit From dd735f4707e603ac5b541b5f30de87c3c7bd60dd Mon Sep 17 00:00:00 2001 From: Alexey Makhalov Date: Mon, 23 Mar 2020 19:57:04 +0000 Subject: x86/vmware: Remove vmware_sched_clock_setup() Move cyc2ns setup logic to separate function. This separation will allow to use cyc2ns mult/shift pair not only for the sched_clock but also for other clocks such as steal_clock. Signed-off-by: Alexey Makhalov Signed-off-by: Borislav Petkov Reviewed-by: Thomas Hellstrom Reviewed-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200323195707.31242-3-amakhalov@vmware.com --- arch/x86/kernel/cpu/vmware.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c index d280560fd75e..efb22fa76ba4 100644 --- a/arch/x86/kernel/cpu/vmware.c +++ b/arch/x86/kernel/cpu/vmware.c @@ -122,7 +122,7 @@ static unsigned long long notrace vmware_sched_clock(void) return ns; } -static void __init vmware_sched_clock_setup(void) +static void __init vmware_cyc2ns_setup(void) { struct cyc2ns_data *d = &vmware_cyc2ns; unsigned long long tsc_now = rdtsc(); @@ -132,8 +132,7 @@ static void __init vmware_sched_clock_setup(void) d->cyc2ns_offset = mul_u64_u32_shr(tsc_now, d->cyc2ns_mul, d->cyc2ns_shift); - pv_ops.time.sched_clock = vmware_sched_clock; - pr_info("using sched offset of %llu ns\n", d->cyc2ns_offset); + pr_info("using clock offset of %llu ns\n", d->cyc2ns_offset); } static void __init vmware_paravirt_ops_setup(void) @@ -141,8 +140,14 @@ static void __init vmware_paravirt_ops_setup(void) pv_info.name = "VMware hypervisor"; pv_ops.cpu.io_delay = paravirt_nop; - if (vmware_tsc_khz && vmw_sched_clock) - vmware_sched_clock_setup(); + if (vmware_tsc_khz == 0) + return; + + vmware_cyc2ns_setup(); + + if (vmw_sched_clock) + pv_ops.time.sched_clock = vmware_sched_clock; + } #else #define vmware_paravirt_ops_setup() do {} while (0) -- cgit From ab02bb3f55f58e7608a88188000c3353398ebe3b Mon Sep 17 00:00:00 2001 From: Alexey Makhalov Date: Mon, 23 Mar 2020 19:57:05 +0000 Subject: x86/vmware: Add steal time clock support for VMware guests Steal time is the amount of CPU time needed by a guest virtual machine that is not provided by the host. Steal time occurs when the host allocates this CPU time elsewhere, for example, to another guest. Steal time can be enabled by adding the VM configuration option stealclock.enable = "TRUE". It is supported by VMs that run hardware version 13 or newer. Introduce the VMware steal time infrastructure. The high level code (such as enabling, disabling and hot-plug routines) was derived from KVM. [ Tomer: use READ_ONCE macros and 32bit guests support. ] [ bp: Massage. ] Co-developed-by: Tomer Zeltzer Signed-off-by: Alexey Makhalov Signed-off-by: Tomer Zeltzer Signed-off-by: Borislav Petkov Reviewed-by: Thomas Hellstrom Reviewed-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200323195707.31242-4-amakhalov@vmware.com --- arch/x86/kernel/cpu/vmware.c | 197 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 197 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c index efb22fa76ba4..cc604614f0ae 100644 --- a/arch/x86/kernel/cpu/vmware.c +++ b/arch/x86/kernel/cpu/vmware.c @@ -25,6 +25,8 @@ #include #include #include +#include +#include #include #include #include @@ -47,6 +49,11 @@ #define VMWARE_CMD_GETVCPU_INFO 68 #define VMWARE_CMD_LEGACY_X2APIC 3 #define VMWARE_CMD_VCPU_RESERVED 31 +#define VMWARE_CMD_STEALCLOCK 91 + +#define STEALCLOCK_NOT_AVAILABLE (-1) +#define STEALCLOCK_DISABLED 0 +#define STEALCLOCK_ENABLED 1 #define VMWARE_PORT(cmd, eax, ebx, ecx, edx) \ __asm__("inl (%%dx), %%eax" : \ @@ -86,6 +93,18 @@ } \ } while (0) +struct vmware_steal_time { + union { + uint64_t clock; /* stolen time counter in units of vtsc */ + struct { + /* only for little-endian */ + uint32_t clock_low; + uint32_t clock_high; + }; + }; + uint64_t reserved[7]; +}; + static unsigned long vmware_tsc_khz __ro_after_init; static u8 vmware_hypercall_mode __ro_after_init; @@ -104,6 +123,8 @@ static unsigned long vmware_get_tsc_khz(void) #ifdef CONFIG_PARAVIRT static struct cyc2ns_data vmware_cyc2ns __ro_after_init; static int vmw_sched_clock __initdata = 1; +static DEFINE_PER_CPU_DECRYPTED(struct vmware_steal_time, vmw_steal_time) __aligned(64); +static bool has_steal_clock; static __init int setup_vmw_sched_clock(char *s) { @@ -135,6 +156,163 @@ static void __init vmware_cyc2ns_setup(void) pr_info("using clock offset of %llu ns\n", d->cyc2ns_offset); } +static int vmware_cmd_stealclock(uint32_t arg1, uint32_t arg2) +{ + uint32_t result, info; + + asm volatile (VMWARE_HYPERCALL : + "=a"(result), + "=c"(info) : + "a"(VMWARE_HYPERVISOR_MAGIC), + "b"(0), + "c"(VMWARE_CMD_STEALCLOCK), + "d"(0), + "S"(arg1), + "D"(arg2) : + "memory"); + return result; +} + +static bool stealclock_enable(phys_addr_t pa) +{ + return vmware_cmd_stealclock(upper_32_bits(pa), + lower_32_bits(pa)) == STEALCLOCK_ENABLED; +} + +static int __stealclock_disable(void) +{ + return vmware_cmd_stealclock(0, 1); +} + +static void stealclock_disable(void) +{ + __stealclock_disable(); +} + +static bool vmware_is_stealclock_available(void) +{ + return __stealclock_disable() != STEALCLOCK_NOT_AVAILABLE; +} + +/** + * vmware_steal_clock() - read the per-cpu steal clock + * @cpu: the cpu number whose steal clock we want to read + * + * The function reads the steal clock if we are on a 64-bit system, otherwise + * reads it in parts, checking that the high part didn't change in the + * meantime. + * + * Return: + * The steal clock reading in ns. + */ +static uint64_t vmware_steal_clock(int cpu) +{ + struct vmware_steal_time *steal = &per_cpu(vmw_steal_time, cpu); + uint64_t clock; + + if (IS_ENABLED(CONFIG_64BIT)) + clock = READ_ONCE(steal->clock); + else { + uint32_t initial_high, low, high; + + do { + initial_high = READ_ONCE(steal->clock_high); + /* Do not reorder initial_high and high readings */ + virt_rmb(); + low = READ_ONCE(steal->clock_low); + /* Keep low reading in between */ + virt_rmb(); + high = READ_ONCE(steal->clock_high); + } while (initial_high != high); + + clock = ((uint64_t)high << 32) | low; + } + + return mul_u64_u32_shr(clock, vmware_cyc2ns.cyc2ns_mul, + vmware_cyc2ns.cyc2ns_shift); +} + +static void vmware_register_steal_time(void) +{ + int cpu = smp_processor_id(); + struct vmware_steal_time *st = &per_cpu(vmw_steal_time, cpu); + + if (!has_steal_clock) + return; + + if (!stealclock_enable(slow_virt_to_phys(st))) { + has_steal_clock = false; + return; + } + + pr_info("vmware-stealtime: cpu %d, pa %llx\n", + cpu, (unsigned long long) slow_virt_to_phys(st)); +} + +static void vmware_disable_steal_time(void) +{ + if (!has_steal_clock) + return; + + stealclock_disable(); +} + +static void vmware_guest_cpu_init(void) +{ + if (has_steal_clock) + vmware_register_steal_time(); +} + +static void vmware_pv_guest_cpu_reboot(void *unused) +{ + vmware_disable_steal_time(); +} + +static int vmware_pv_reboot_notify(struct notifier_block *nb, + unsigned long code, void *unused) +{ + if (code == SYS_RESTART) + on_each_cpu(vmware_pv_guest_cpu_reboot, NULL, 1); + return NOTIFY_DONE; +} + +static struct notifier_block vmware_pv_reboot_nb = { + .notifier_call = vmware_pv_reboot_notify, +}; + +#ifdef CONFIG_SMP +static void __init vmware_smp_prepare_boot_cpu(void) +{ + vmware_guest_cpu_init(); + native_smp_prepare_boot_cpu(); +} + +static int vmware_cpu_online(unsigned int cpu) +{ + local_irq_disable(); + vmware_guest_cpu_init(); + local_irq_enable(); + return 0; +} + +static int vmware_cpu_down_prepare(unsigned int cpu) +{ + local_irq_disable(); + vmware_disable_steal_time(); + local_irq_enable(); + return 0; +} +#endif + +static __init int activate_jump_labels(void) +{ + if (has_steal_clock) + static_key_slow_inc(¶virt_steal_enabled); + + return 0; +} +arch_initcall(activate_jump_labels); + static void __init vmware_paravirt_ops_setup(void) { pv_info.name = "VMware hypervisor"; @@ -148,6 +326,25 @@ static void __init vmware_paravirt_ops_setup(void) if (vmw_sched_clock) pv_ops.time.sched_clock = vmware_sched_clock; + if (vmware_is_stealclock_available()) { + has_steal_clock = true; + pv_ops.time.steal_clock = vmware_steal_clock; + + /* We use reboot notifier only to disable steal clock */ + register_reboot_notifier(&vmware_pv_reboot_nb); + +#ifdef CONFIG_SMP + smp_ops.smp_prepare_boot_cpu = + vmware_smp_prepare_boot_cpu; + if (cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, + "x86/vmware:online", + vmware_cpu_online, + vmware_cpu_down_prepare) < 0) + pr_err("vmware_guest: Failed to install cpu hotplug callbacks\n"); +#else + vmware_guest_cpu_init(); +#endif + } } #else #define vmware_paravirt_ops_setup() do {} while (0) -- cgit From e73a8f38f82dd1c41b70a06556bea7dc250cc384 Mon Sep 17 00:00:00 2001 From: Alexey Makhalov Date: Mon, 23 Mar 2020 19:57:06 +0000 Subject: x86/vmware: Enable steal time accounting Set paravirt_steal_rq_enabled if steal clock present. paravirt_steal_rq_enabled is used in sched/core.c to adjust task progress by offsetting stolen time. Use 'no-steal-acc' off switch (share same name with KVM) to disable steal time accounting. Signed-off-by: Alexey Makhalov Signed-off-by: Borislav Petkov Reviewed-by: Thomas Hellstrom Reviewed-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200323195707.31242-5-amakhalov@vmware.com --- arch/x86/kernel/cpu/vmware.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c index cc604614f0ae..e885f73bebd4 100644 --- a/arch/x86/kernel/cpu/vmware.c +++ b/arch/x86/kernel/cpu/vmware.c @@ -125,6 +125,7 @@ static struct cyc2ns_data vmware_cyc2ns __ro_after_init; static int vmw_sched_clock __initdata = 1; static DEFINE_PER_CPU_DECRYPTED(struct vmware_steal_time, vmw_steal_time) __aligned(64); static bool has_steal_clock; +static bool steal_acc __initdata = true; /* steal time accounting */ static __init int setup_vmw_sched_clock(char *s) { @@ -133,6 +134,13 @@ static __init int setup_vmw_sched_clock(char *s) } early_param("no-vmw-sched-clock", setup_vmw_sched_clock); +static __init int parse_no_stealacc(char *arg) +{ + steal_acc = false; + return 0; +} +early_param("no-steal-acc", parse_no_stealacc); + static unsigned long long notrace vmware_sched_clock(void) { unsigned long long ns; @@ -306,8 +314,11 @@ static int vmware_cpu_down_prepare(unsigned int cpu) static __init int activate_jump_labels(void) { - if (has_steal_clock) + if (has_steal_clock) { static_key_slow_inc(¶virt_steal_enabled); + if (steal_acc) + static_key_slow_inc(¶virt_steal_rq_enabled); + } return 0; } -- cgit From 8fefe9dacdb0a1347d3dac871bb1bba3cbc32945 Mon Sep 17 00:00:00 2001 From: Alexey Makhalov Date: Mon, 23 Mar 2020 19:57:07 +0000 Subject: x86/vmware: Use bool type for vmw_sched_clock To be aligned with other bool variables. Signed-off-by: Alexey Makhalov Signed-off-by: Borislav Petkov Reviewed-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200323195707.31242-6-amakhalov@vmware.com --- arch/x86/kernel/cpu/vmware.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c index e885f73bebd4..9b6fafa69be9 100644 --- a/arch/x86/kernel/cpu/vmware.c +++ b/arch/x86/kernel/cpu/vmware.c @@ -122,14 +122,14 @@ static unsigned long vmware_get_tsc_khz(void) #ifdef CONFIG_PARAVIRT static struct cyc2ns_data vmware_cyc2ns __ro_after_init; -static int vmw_sched_clock __initdata = 1; +static bool vmw_sched_clock __initdata = true; static DEFINE_PER_CPU_DECRYPTED(struct vmware_steal_time, vmw_steal_time) __aligned(64); static bool has_steal_clock; static bool steal_acc __initdata = true; /* steal time accounting */ static __init int setup_vmw_sched_clock(char *s) { - vmw_sched_clock = 0; + vmw_sched_clock = false; return 0; } early_param("no-vmw-sched-clock", setup_vmw_sched_clock); -- cgit From 94be4b85d89520329f766c7d69a4d74bcc87f5b5 Mon Sep 17 00:00:00 2001 From: Wanpeng Li Date: Tue, 24 Mar 2020 14:32:10 +0800 Subject: KVM: LAPIC: Also cancel preemption timer when disarm LAPIC timer The timer is disarmed when switching between TSC deadline and other modes, we should set everything to disarmed state, however, LAPIC timer can be emulated by preemption timer, it still works if vmx->hv_deadline_timer is not -1. This patch also cancels preemption timer when disarm LAPIC timer. Signed-off-by: Wanpeng Li Message-Id: <1585031530-19823-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/lapic.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 929511e01960..7356a56e6282 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1445,6 +1445,8 @@ static void limit_periodic_timer_frequency(struct kvm_lapic *apic) } } +static void cancel_hv_timer(struct kvm_lapic *apic); + static void apic_update_lvtt(struct kvm_lapic *apic) { u32 timer_mode = kvm_lapic_get_reg(apic, APIC_LVTT) & @@ -1454,6 +1456,10 @@ static void apic_update_lvtt(struct kvm_lapic *apic) if (apic_lvtt_tscdeadline(apic) != (timer_mode == APIC_LVT_TIMER_TSCDEADLINE)) { hrtimer_cancel(&apic->lapic_timer.timer); + preempt_disable(); + if (apic->lapic_timer.hv_timer_in_use) + cancel_hv_timer(apic); + preempt_enable(); kvm_lapic_set_reg(apic, APIC_TMICT, 0); apic->lapic_timer.period = 0; apic->lapic_timer.tscdeadline = 0; -- cgit From ba5bade4cc0d2013cdf5634dae554693c968a090 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Mar 2020 14:13:46 +0100 Subject: x86/devicetable: Move x86 specific macro out of generic code There is no reason that this gunk is in a generic header file. The wildcard defines need to stay as they are required by file2alias. Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Reviewed-by: Greg Kroah-Hartman Link: https://lkml.kernel.org/r/20200320131508.736205164@linutronix.de --- arch/x86/include/asm/cpu_device_id.h | 13 ++++++++++++- arch/x86/kvm/svm.c | 1 + arch/x86/kvm/vmx/vmx.c | 1 + 3 files changed, 14 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/cpu_device_id.h b/arch/x86/include/asm/cpu_device_id.h index 31c379c1da41..a28dc6ba5be1 100644 --- a/arch/x86/include/asm/cpu_device_id.h +++ b/arch/x86/include/asm/cpu_device_id.h @@ -6,9 +6,20 @@ * Declare drivers belonging to specific x86 CPUs * Similar in spirit to pci_device_id and related PCI functions */ - #include +/* + * The wildcard initializers are in mod_devicetable.h because + * file2alias needs them. Sigh. + */ + +#define X86_FEATURE_MATCH(x) { \ + .vendor = X86_VENDOR_ANY, \ + .family = X86_FAMILY_ANY, \ + .model = X86_MODEL_ANY, \ + .feature = x, \ +} + /* * Match specific microcode revisions. * diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index bef0ba35f121..5753fc3bcfc1 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -48,6 +48,7 @@ #include #include #include +#include #include #include "trace.h" diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 3be25ecae145..31c80fa4653d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include #include -- cgit From 20d437447c0089cda46c683db219d3b4e2cde40e Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Mar 2020 14:13:47 +0100 Subject: x86/cpu: Add consistent CPU match macros Finding all places which build x86_cpu_id match tables is tedious and the logic is hidden in lots of differently named macro wrappers. Most of these initializer macros use plain C89 initializers which rely on the ordering of the struct members. So new members could only be added at the end of the struct, but that's ugly as hell and C99 initializers are really the right thing to use. Provide a set of macros which: - Have a proper naming scheme, starting with X86_MATCH_ - Use C99 initializers The set of provided macros are all subsets of the base macro X86_MATCH_VENDOR_FAM_MODEL_FEATURE() which allows to supply all possible selection criteria: vendor, family, model, feature The other macros shorten this to avoid typing all arguments when they are not needed and would require one of the _ANY constants. They have been created due to the requirements of the existing usage sites. Also add a few model constants for Centaur CPUs and QUARK. Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Reviewed-by: Greg Kroah-Hartman Link: https://lkml.kernel.org/r/20200320131508.826011988@linutronix.de --- arch/x86/include/asm/cpu_device_id.h | 140 ++++++++++++++++++++++++++++++++--- arch/x86/include/asm/intel-family.h | 6 ++ arch/x86/kernel/cpu/match.c | 13 +++- 3 files changed, 146 insertions(+), 13 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/cpu_device_id.h b/arch/x86/include/asm/cpu_device_id.h index a28dc6ba5be1..f11770fac73a 100644 --- a/arch/x86/include/asm/cpu_device_id.h +++ b/arch/x86/include/asm/cpu_device_id.h @@ -5,21 +5,143 @@ /* * Declare drivers belonging to specific x86 CPUs * Similar in spirit to pci_device_id and related PCI functions - */ -#include - -/* + * * The wildcard initializers are in mod_devicetable.h because * file2alias needs them. Sigh. */ +#include +/* Get the INTEL_FAM* model defines */ +#include +/* And the X86_VENDOR_* ones */ +#include -#define X86_FEATURE_MATCH(x) { \ - .vendor = X86_VENDOR_ANY, \ - .family = X86_FAMILY_ANY, \ - .model = X86_MODEL_ANY, \ - .feature = x, \ +/* Centaur FAM6 models */ +#define X86_CENTAUR_FAM6_C7_A 0xa +#define X86_CENTAUR_FAM6_C7_D 0xd +#define X86_CENTAUR_FAM6_NANO 0xf + +/** + * X86_MATCH_VENDOR_FAM_MODEL_FEATURE - Base macro for CPU matching + * @_vendor: The vendor name, e.g. INTEL, AMD, HYGON, ..., ANY + * The name is expanded to X86_VENDOR_@_vendor + * @_family: The family number or X86_FAMILY_ANY + * @_model: The model number, model constant or X86_MODEL_ANY + * @_feature: A X86_FEATURE bit or X86_FEATURE_ANY + * @_data: Driver specific data or NULL. The internal storage + * format is unsigned long. The supplied value, pointer + * etc. is casted to unsigned long internally. + * + * Use only if you need all selectors. Otherwise use one of the shorter + * macros of the X86_MATCH_* family. If there is no matching shorthand + * macro, consider to add one. If you really need to wrap one of the macros + * into another macro at the usage site for good reasons, then please + * start this local macro with X86_MATCH to allow easy grepping. + */ +#define X86_MATCH_VENDOR_FAM_MODEL_FEATURE(_vendor, _family, _model, \ + _feature, _data) { \ + .vendor = X86_VENDOR_##_vendor, \ + .family = _family, \ + .model = _model, \ + .feature = _feature, \ + .driver_data = (unsigned long) _data \ } +/** + * X86_MATCH_VENDOR_FAM_FEATURE - Macro for matching vendor, family and CPU feature + * @vendor: The vendor name, e.g. INTEL, AMD, HYGON, ..., ANY + * The name is expanded to X86_VENDOR_@vendor + * @family: The family number or X86_FAMILY_ANY + * @feature: A X86_FEATURE bit + * @data: Driver specific data or NULL. The internal storage + * format is unsigned long. The supplied value, pointer + * etc. is casted to unsigned long internally. + * + * All other missing arguments of X86_MATCH_VENDOR_FAM_MODEL_FEATURE() are + * set to wildcards. + */ +#define X86_MATCH_VENDOR_FAM_FEATURE(vendor, family, feature, data) \ + X86_MATCH_VENDOR_FAM_MODEL_FEATURE(vendor, family, \ + X86_MODEL_ANY, feature, data) + +/** + * X86_MATCH_VENDOR_FEATURE - Macro for matching vendor and CPU feature + * @vendor: The vendor name, e.g. INTEL, AMD, HYGON, ..., ANY + * The name is expanded to X86_VENDOR_@vendor + * @feature: A X86_FEATURE bit + * @data: Driver specific data or NULL. The internal storage + * format is unsigned long. The supplied value, pointer + * etc. is casted to unsigned long internally. + * + * All other missing arguments of X86_MATCH_VENDOR_FAM_MODEL_FEATURE() are + * set to wildcards. + */ +#define X86_MATCH_VENDOR_FEATURE(vendor, feature, data) \ + X86_MATCH_VENDOR_FAM_FEATURE(vendor, X86_FAMILY_ANY, feature, data) + +/** + * X86_MATCH_FEATURE - Macro for matching a CPU feature + * @feature: A X86_FEATURE bit + * @data: Driver specific data or NULL. The internal storage + * format is unsigned long. The supplied value, pointer + * etc. is casted to unsigned long internally. + * + * All other missing arguments of X86_MATCH_VENDOR_FAM_MODEL_FEATURE() are + * set to wildcards. + */ +#define X86_MATCH_FEATURE(feature, data) \ + X86_MATCH_VENDOR_FEATURE(ANY, feature, data) + +/* Transitional to keep the existing code working */ +#define X86_FEATURE_MATCH(feature) X86_MATCH_FEATURE(feature, NULL) + +/** + * X86_MATCH_VENDOR_FAM_MODEL - Match vendor, family and model + * @vendor: The vendor name, e.g. INTEL, AMD, HYGON, ..., ANY + * The name is expanded to X86_VENDOR_@vendor + * @family: The family number or X86_FAMILY_ANY + * @model: The model number, model constant or X86_MODEL_ANY + * @data: Driver specific data or NULL. The internal storage + * format is unsigned long. The supplied value, pointer + * etc. is casted to unsigned long internally. + * + * All other missing arguments of X86_MATCH_VENDOR_FAM_MODEL_FEATURE() are + * set to wildcards. + */ +#define X86_MATCH_VENDOR_FAM_MODEL(vendor, family, model, data) \ + X86_MATCH_VENDOR_FAM_MODEL_FEATURE(vendor, family, model, \ + X86_FEATURE_ANY, data) + +/** + * X86_MATCH_VENDOR_FAM - Match vendor and family + * @vendor: The vendor name, e.g. INTEL, AMD, HYGON, ..., ANY + * The name is expanded to X86_VENDOR_@vendor + * @family: The family number or X86_FAMILY_ANY + * @data: Driver specific data or NULL. The internal storage + * format is unsigned long. The supplied value, pointer + * etc. is casted to unsigned long internally. + * + * All other missing arguments to X86_MATCH_VENDOR_FAM_MODEL_FEATURE() are + * set of wildcards. + */ +#define X86_MATCH_VENDOR_FAM(vendor, family, data) \ + X86_MATCH_VENDOR_FAM_MODEL(vendor, family, X86_MODEL_ANY, data) + +/** + * X86_MATCH_INTEL_FAM6_MODEL - Match vendor INTEL, family 6 and model + * @model: The model name without the INTEL_FAM6_ prefix or ANY + * The model name is expanded to INTEL_FAM6_@model internally + * @data: Driver specific data or NULL. The internal storage + * format is unsigned long. The supplied value, pointer + * etc. is casted to unsigned long internally. + * + * The vendor is set to INTEL, the family to 6 and all other missing + * arguments of X86_MATCH_VENDOR_FAM_MODEL_FEATURE() are set to wildcards. + * + * See X86_MATCH_VENDOR_FAM_MODEL_FEATURE() for further information. + */ +#define X86_MATCH_INTEL_FAM6_MODEL(model, data) \ + X86_MATCH_VENDOR_FAM_MODEL(INTEL, 6, INTEL_FAM6_##model, data) + /* * Match specific microcode revisions. * diff --git a/arch/x86/include/asm/intel-family.h b/arch/x86/include/asm/intel-family.h index 4981c293f926..236a87aa84d6 100644 --- a/arch/x86/include/asm/intel-family.h +++ b/arch/x86/include/asm/intel-family.h @@ -35,6 +35,9 @@ * The #define line may optionally include a comment including platform names. */ +/* Wildcard match for FAM6 so X86_MATCH_INTEL_FAM6_MODEL(ANY) works */ +#define INTEL_FAM6_ANY X86_MODEL_ANY + #define INTEL_FAM6_CORE_YONAH 0x0E #define INTEL_FAM6_CORE2_MEROM 0x0F @@ -118,6 +121,9 @@ #define INTEL_FAM6_XEON_PHI_KNL 0x57 /* Knights Landing */ #define INTEL_FAM6_XEON_PHI_KNM 0x85 /* Knights Mill */ +/* Family 5 */ +#define INTEL_FAM5_QUARK_X1000 0x09 /* Quark X1000 SoC */ + /* Useful macros */ #define INTEL_CPU_FAM_ANY(_family, _model, _driver_data) \ { \ diff --git a/arch/x86/kernel/cpu/match.c b/arch/x86/kernel/cpu/match.c index 6dd78d8235e4..d3482eb43ff3 100644 --- a/arch/x86/kernel/cpu/match.c +++ b/arch/x86/kernel/cpu/match.c @@ -16,12 +16,17 @@ * respective wildcard entries. * * A typical table entry would be to match a specific CPU - * { X86_VENDOR_INTEL, 6, 0x12 } - * or to match a specific CPU feature - * { X86_FEATURE_MATCH(X86_FEATURE_FOOBAR) } + * + * X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6, INTEL_FAM6_BROADWELL, + * X86_FEATURE_ANY, NULL); * * Fields can be wildcarded with %X86_VENDOR_ANY, %X86_FAMILY_ANY, - * %X86_MODEL_ANY, %X86_FEATURE_ANY or 0 (except for vendor) + * %X86_MODEL_ANY, %X86_FEATURE_ANY (except for vendor) + * + * asm/cpu_device_id.h contains a set of useful macros which are shortcuts + * for various common selections. The above can be shortened to: + * + * X86_MATCH_INTEL_FAM6_MODEL(BROADWELL, NULL); * * Arrays used to match for this should also be declared using * MODULE_DEVICE_TABLE(x86cpu, ...) -- cgit From f6d502fcfc51ae025f18a8de8f1ed2ab34503742 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Mar 2020 14:13:48 +0100 Subject: x86/cpu/bugs: Convert to new matching macros The new macro set has a consistent namespace and uses C99 initializers instead of the grufty C89 ones. The local wrappers have to stay as they are tailored to tame the hardware vulnerability mess. Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Reviewed-by: Greg Kroah-Hartman Link: https://lkml.kernel.org/r/20200320131508.934926587@linutronix.de --- arch/x86/kernel/cpu/common.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 52c9bfbbdb2a..f4c672e5ce9e 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1008,8 +1008,8 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c) #define NO_ITLB_MULTIHIT BIT(7) #define NO_SPECTRE_V2 BIT(8) -#define VULNWL(_vendor, _family, _model, _whitelist) \ - { X86_VENDOR_##_vendor, _family, _model, X86_FEATURE_ANY, _whitelist } +#define VULNWL(vendor, family, model, whitelist) \ + X86_MATCH_VENDOR_FAM_MODEL(vendor, family, model, whitelist) #define VULNWL_INTEL(model, whitelist) \ VULNWL(INTEL, 6, INTEL_FAM6_##model, whitelist) -- cgit From ef37219ab828c9ead544589ed33cd94f9273d7c7 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Mar 2020 14:13:49 +0100 Subject: x86/perf/events: Convert to new CPU match macros The new macro set has a consistent namespace and uses C99 initializers instead of the grufty C89 ones. Get rid the of the local macro wrappers for consistency. Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Reviewed-by: Greg Kroah-Hartman Link: https://lkml.kernel.org/r/20200320131509.029267418@linutronix.de --- arch/x86/events/amd/power.c | 2 +- arch/x86/events/intel/cstate.c | 83 ++++++++++++++++++++---------------------- arch/x86/events/intel/rapl.c | 58 ++++++++++++++--------------- arch/x86/events/intel/uncore.c | 63 +++++++++++++++----------------- 4 files changed, 97 insertions(+), 109 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/events/amd/power.c b/arch/x86/events/amd/power.c index abef51320e3a..43b09e9c93a2 100644 --- a/arch/x86/events/amd/power.c +++ b/arch/x86/events/amd/power.c @@ -259,7 +259,7 @@ static int power_cpu_init(unsigned int cpu) } static const struct x86_cpu_id cpu_match[] = { - { .vendor = X86_VENDOR_AMD, .family = 0x15 }, + X86_MATCH_VENDOR_FAM(AMD, 0x15, NULL), {}, }; diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c index 4814c964692c..e4aa20c0426f 100644 --- a/arch/x86/events/intel/cstate.c +++ b/arch/x86/events/intel/cstate.c @@ -594,63 +594,60 @@ static const struct cstate_model glm_cstates __initconst = { }; -#define X86_CSTATES_MODEL(model, states) \ - { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long) &(states) } - static const struct x86_cpu_id intel_cstates_match[] __initconst = { - X86_CSTATES_MODEL(INTEL_FAM6_NEHALEM, nhm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_NEHALEM_EP, nhm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_NEHALEM_EX, nhm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(NEHALEM, &nhm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(NEHALEM_EP, &nhm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(NEHALEM_EX, &nhm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_WESTMERE, nhm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_WESTMERE_EP, nhm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_WESTMERE_EX, nhm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(WESTMERE, &nhm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(WESTMERE_EP, &nhm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(WESTMERE_EX, &nhm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_SANDYBRIDGE, snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_SANDYBRIDGE_X, snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE, &snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE_X, &snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_IVYBRIDGE, snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_IVYBRIDGE_X, snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(IVYBRIDGE, &snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(IVYBRIDGE_X, &snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_HASWELL, snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_HASWELL_X, snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_HASWELL_G, snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(HASWELL, &snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(HASWELL_X, &snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(HASWELL_G, &snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_HASWELL_L, hswult_cstates), + X86_MATCH_INTEL_FAM6_MODEL(HASWELL_L, &hswult_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_ATOM_SILVERMONT, slm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_ATOM_SILVERMONT_D, slm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_ATOM_AIRMONT, slm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_SILVERMONT, &slm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_SILVERMONT_D, &slm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_AIRMONT, &slm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_BROADWELL, snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_BROADWELL_D, snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_BROADWELL_G, snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_BROADWELL_X, snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(BROADWELL, &snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_D, &snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_G, &snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_X, &snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_SKYLAKE_L, snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_SKYLAKE, snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_SKYLAKE_X, snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_L, &snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE, &snb_cstates), + X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, &snb_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_KABYLAKE_L, hswult_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_KABYLAKE, hswult_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_COMETLAKE_L, hswult_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_COMETLAKE, hswult_cstates), + X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE_L, &hswult_cstates), + X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE, &hswult_cstates), + X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE_L, &hswult_cstates), + X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE, &hswult_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_CANNONLAKE_L, cnl_cstates), + X86_MATCH_INTEL_FAM6_MODEL(CANNONLAKE_L, &cnl_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_XEON_PHI_KNL, knl_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_XEON_PHI_KNM, knl_cstates), + X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNL, &knl_cstates), + X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNM, &knl_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_ATOM_GOLDMONT, glm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_ATOM_GOLDMONT_D, glm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_ATOM_GOLDMONT_PLUS, glm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_ATOM_TREMONT_D, glm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_ATOM_TREMONT, glm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT, &glm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT_D, &glm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT_PLUS, &glm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_TREMONT_D, &glm_cstates), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_TREMONT, &glm_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_ICELAKE_L, icl_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_ICELAKE, icl_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_TIGERLAKE_L, icl_cstates), - X86_CSTATES_MODEL(INTEL_FAM6_TIGERLAKE, icl_cstates), + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, &icl_cstates), + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, &icl_cstates), + X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE_L, &icl_cstates), + X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE, &icl_cstates), { }, }; MODULE_DEVICE_TABLE(x86cpu, intel_cstates_match); diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/intel/rapl.c index 09913121e726..a5dbd25852cb 100644 --- a/arch/x86/events/intel/rapl.c +++ b/arch/x86/events/intel/rapl.c @@ -668,9 +668,6 @@ static int __init init_rapl_pmus(void) return 0; } -#define X86_RAPL_MODEL_MATCH(model, init) \ - { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&init } - static struct rapl_model model_snb = { .events = BIT(PERF_RAPL_PP0) | BIT(PERF_RAPL_PKG) | @@ -716,36 +713,35 @@ static struct rapl_model model_skl = { }; static const struct x86_cpu_id rapl_model_match[] __initconst = { - X86_RAPL_MODEL_MATCH(INTEL_FAM6_SANDYBRIDGE, model_snb), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_SANDYBRIDGE_X, model_snbep), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_IVYBRIDGE, model_snb), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_IVYBRIDGE_X, model_snbep), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_HASWELL, model_hsw), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_HASWELL_X, model_hsx), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_HASWELL_L, model_hsw), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_HASWELL_G, model_hsw), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_BROADWELL, model_hsw), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_BROADWELL_G, model_hsw), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_BROADWELL_X, model_hsx), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_BROADWELL_D, model_hsx), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_XEON_PHI_KNL, model_knl), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_XEON_PHI_KNM, model_knl), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_SKYLAKE_L, model_skl), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_SKYLAKE, model_skl), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_SKYLAKE_X, model_hsx), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_KABYLAKE_L, model_skl), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_KABYLAKE, model_skl), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_CANNONLAKE_L, model_skl), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_ATOM_GOLDMONT, model_hsw), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_ATOM_GOLDMONT_D, model_hsw), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_ATOM_GOLDMONT_PLUS, model_hsw), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_ICELAKE_L, model_skl), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_ICELAKE, model_skl), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_COMETLAKE_L, model_skl), - X86_RAPL_MODEL_MATCH(INTEL_FAM6_COMETLAKE, model_skl), + X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE, &model_snb), + X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE_X, &model_snbep), + X86_MATCH_INTEL_FAM6_MODEL(IVYBRIDGE, &model_snb), + X86_MATCH_INTEL_FAM6_MODEL(IVYBRIDGE_X, &model_snbep), + X86_MATCH_INTEL_FAM6_MODEL(HASWELL, &model_hsw), + X86_MATCH_INTEL_FAM6_MODEL(HASWELL_X, &model_hsx), + X86_MATCH_INTEL_FAM6_MODEL(HASWELL_L, &model_hsw), + X86_MATCH_INTEL_FAM6_MODEL(HASWELL_G, &model_hsw), + X86_MATCH_INTEL_FAM6_MODEL(BROADWELL, &model_hsw), + X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_G, &model_hsw), + X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_X, &model_hsx), + X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_D, &model_hsx), + X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNL, &model_knl), + X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNM, &model_knl), + X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_L, &model_skl), + X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE, &model_skl), + X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, &model_hsx), + X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE_L, &model_skl), + X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE, &model_skl), + X86_MATCH_INTEL_FAM6_MODEL(CANNONLAKE_L, &model_skl), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT, &model_hsw), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT_D, &model_hsw), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT_PLUS, &model_hsw), + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, &model_skl), + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, &model_skl), + X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE_L, &model_skl), + X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE, &model_skl), {}, }; - MODULE_DEVICE_TABLE(x86cpu, rapl_model_match); static int __init rapl_pmu_init(void) diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index 86467f85c383..ac93705a2127 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -1392,10 +1392,6 @@ err: return ret; } - -#define X86_UNCORE_MODEL_MATCH(model, init) \ - { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&init } - struct intel_uncore_init_fun { void (*cpu_init)(void); int (*pci_init)(void); @@ -1477,38 +1473,37 @@ static const struct intel_uncore_init_fun snr_uncore_init __initconst = { }; static const struct x86_cpu_id intel_uncore_match[] __initconst = { - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_NEHALEM_EP, nhm_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_NEHALEM, nhm_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_WESTMERE, nhm_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_WESTMERE_EP, nhm_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_SANDYBRIDGE, snb_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_IVYBRIDGE, ivb_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_HASWELL, hsw_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_HASWELL_L, hsw_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_HASWELL_G, hsw_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_BROADWELL, bdw_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_BROADWELL_G, bdw_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_SANDYBRIDGE_X, snbep_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_NEHALEM_EX, nhmex_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_WESTMERE_EX, nhmex_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_IVYBRIDGE_X, ivbep_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_HASWELL_X, hswep_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_BROADWELL_X, bdx_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_BROADWELL_D, bdx_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_XEON_PHI_KNL, knl_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_XEON_PHI_KNM, knl_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_SKYLAKE, skl_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_SKYLAKE_L, skl_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_SKYLAKE_X, skx_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_KABYLAKE_L, skl_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_KABYLAKE, skl_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ICELAKE_L, icl_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ICELAKE_NNPI, icl_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ICELAKE, icl_uncore_init), - X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ATOM_TREMONT_D, snr_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(NEHALEM_EP, &nhm_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(NEHALEM, &nhm_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(WESTMERE, &nhm_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(WESTMERE_EP, &nhm_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE, &snb_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(IVYBRIDGE, &ivb_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(HASWELL, &hsw_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(HASWELL_L, &hsw_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(HASWELL_G, &hsw_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(BROADWELL, &bdw_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_G, &bdw_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE_X, &snbep_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(NEHALEM_EX, &nhmex_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(WESTMERE_EX, &nhmex_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(IVYBRIDGE_X, &ivbep_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(HASWELL_X, &hswep_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_X, &bdx_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_D, &bdx_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNL, &knl_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNM, &knl_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE, &skl_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_L, &skl_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, &skx_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE_L, &skl_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE, &skl_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, &icl_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_NNPI, &icl_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, &icl_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_TREMONT_D, &snr_uncore_init), {}, }; - MODULE_DEVICE_TABLE(x86cpu, intel_uncore_match); static int __init intel_uncore_init(void) -- cgit From 320debe5ef6d4990a70535b5cf891472e6e14d93 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Mar 2020 14:13:50 +0100 Subject: x86/kvm: Convert to new CPU match macros The new macro set has a consistent namespace and uses C99 initializers instead of the grufty C89 ones. Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Reviewed-by: Greg Kroah-Hartman Link: https://lkml.kernel.org/r/20200320131509.136884777@linutronix.de --- arch/x86/kvm/svm.c | 2 +- arch/x86/kvm/vmx/vmx.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 5753fc3bcfc1..7c959ae20cd4 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -59,7 +59,7 @@ MODULE_AUTHOR("Qumranet"); MODULE_LICENSE("GPL"); static const struct x86_cpu_id svm_cpu_id[] = { - X86_FEATURE_MATCH(X86_FEATURE_SVM), + X86_MATCH_FEATURE(X86_FEATURE_SVM, NULL), {} }; MODULE_DEVICE_TABLE(x86cpu, svm_cpu_id); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 31c80fa4653d..5b017efcc944 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -66,7 +66,7 @@ MODULE_AUTHOR("Qumranet"); MODULE_LICENSE("GPL"); static const struct x86_cpu_id vmx_cpu_id[] = { - X86_FEATURE_MATCH(X86_FEATURE_VMX), + X86_MATCH_FEATURE(X86_FEATURE_VMX, NULL), {} }; MODULE_DEVICE_TABLE(x86cpu, vmx_cpu_id); -- cgit From adefe55e725821e8ae23207992ded5994f1650a9 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Mar 2020 14:13:51 +0100 Subject: x86/kernel: Convert to new CPU match macros The new macro set has a consistent namespace and uses C99 initializers instead of the grufty C89 ones. Get rid the of the local macro wrappers for consistency. Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Reviewed-by: Greg Kroah-Hartman Link: https://lkml.kernel.org/r/20200320131509.250559388@linutronix.de --- arch/x86/kernel/apic/apic.c | 32 +++++++++++++------------------- arch/x86/kernel/smpboot.c | 2 +- arch/x86/kernel/tsc_msr.c | 14 +++++++------- arch/x86/power/cpu.c | 16 ++-------------- 4 files changed, 23 insertions(+), 41 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 5f973fed3c9f..81b9c63dae1b 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -546,12 +546,6 @@ static struct clock_event_device lapic_clockevent = { }; static DEFINE_PER_CPU(struct clock_event_device, lapic_events); -#define DEADLINE_MODEL_MATCH_FUNC(model, func) \ - { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&func } - -#define DEADLINE_MODEL_MATCH_REV(model, rev) \ - { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)rev } - static u32 hsx_deadline_rev(void) { switch (boot_cpu_data.x86_stepping) { @@ -588,23 +582,23 @@ static u32 skx_deadline_rev(void) } static const struct x86_cpu_id deadline_match[] = { - DEADLINE_MODEL_MATCH_FUNC( INTEL_FAM6_HASWELL_X, hsx_deadline_rev), - DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_BROADWELL_X, 0x0b000020), - DEADLINE_MODEL_MATCH_FUNC( INTEL_FAM6_BROADWELL_D, bdx_deadline_rev), - DEADLINE_MODEL_MATCH_FUNC( INTEL_FAM6_SKYLAKE_X, skx_deadline_rev), + X86_MATCH_INTEL_FAM6_MODEL( HASWELL_X, &hsx_deadline_rev), + X86_MATCH_INTEL_FAM6_MODEL( BROADWELL_X, 0x0b000020), + X86_MATCH_INTEL_FAM6_MODEL( BROADWELL_D, &bdx_deadline_rev), + X86_MATCH_INTEL_FAM6_MODEL( SKYLAKE_X, &skx_deadline_rev), - DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_HASWELL, 0x22), - DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_HASWELL_L, 0x20), - DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_HASWELL_G, 0x17), + X86_MATCH_INTEL_FAM6_MODEL( HASWELL, 0x22), + X86_MATCH_INTEL_FAM6_MODEL( HASWELL_L, 0x20), + X86_MATCH_INTEL_FAM6_MODEL( HASWELL_G, 0x17), - DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_BROADWELL, 0x25), - DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_BROADWELL_G, 0x17), + X86_MATCH_INTEL_FAM6_MODEL( BROADWELL, 0x25), + X86_MATCH_INTEL_FAM6_MODEL( BROADWELL_G, 0x17), - DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_L, 0xb2), - DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE, 0xb2), + X86_MATCH_INTEL_FAM6_MODEL( SKYLAKE_L, 0xb2), + X86_MATCH_INTEL_FAM6_MODEL( SKYLAKE, 0xb2), - DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_KABYLAKE_L, 0x52), - DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_KABYLAKE, 0x52), + X86_MATCH_INTEL_FAM6_MODEL( KABYLAKE_L, 0x52), + X86_MATCH_INTEL_FAM6_MODEL( KABYLAKE, 0x52), {}, }; diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 69881b2d446c..3076ef0864dd 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -466,7 +466,7 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) */ static const struct x86_cpu_id snc_cpu[] = { - { X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X }, + X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, NULL), {} }; diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c index e0cbe4f2af49..bf528aae8ece 100644 --- a/arch/x86/kernel/tsc_msr.c +++ b/arch/x86/kernel/tsc_msr.c @@ -63,13 +63,13 @@ static const struct freq_desc freq_desc_lgm = { }; static const struct x86_cpu_id tsc_msr_cpu_ids[] = { - INTEL_CPU_FAM6(ATOM_SALTWELL_MID, freq_desc_pnw), - INTEL_CPU_FAM6(ATOM_SALTWELL_TABLET, freq_desc_clv), - INTEL_CPU_FAM6(ATOM_SILVERMONT, freq_desc_byt), - INTEL_CPU_FAM6(ATOM_SILVERMONT_MID, freq_desc_tng), - INTEL_CPU_FAM6(ATOM_AIRMONT, freq_desc_cht), - INTEL_CPU_FAM6(ATOM_AIRMONT_MID, freq_desc_ann), - INTEL_CPU_FAM6(ATOM_AIRMONT_NP, freq_desc_lgm), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_SALTWELL_MID, &freq_desc_pnw), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_SALTWELL_TABLET,&freq_desc_clv), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_SILVERMONT, &freq_desc_byt), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_SILVERMONT_MID, &freq_desc_tng), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_AIRMONT, &freq_desc_cht), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_AIRMONT_MID, &freq_desc_ann), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_AIRMONT_NP, &freq_desc_lgm), {} }; diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c index 915bb1639763..aaff9ed7ff45 100644 --- a/arch/x86/power/cpu.c +++ b/arch/x86/power/cpu.c @@ -475,20 +475,8 @@ static int msr_save_cpuid_features(const struct x86_cpu_id *c) } static const struct x86_cpu_id msr_save_cpu_table[] = { - { - .vendor = X86_VENDOR_AMD, - .family = 0x15, - .model = X86_MODEL_ANY, - .feature = X86_FEATURE_ANY, - .driver_data = (kernel_ulong_t)msr_save_cpuid_features, - }, - { - .vendor = X86_VENDOR_AMD, - .family = 0x16, - .model = X86_MODEL_ANY, - .feature = X86_FEATURE_ANY, - .driver_data = (kernel_ulong_t)msr_save_cpuid_features, - }, + X86_MATCH_VENDOR_FAM(AMD, 0x15, &msr_save_cpuid_features), + X86_MATCH_VENDOR_FAM(AMD, 0x16, &msr_save_cpuid_features), {} }; -- cgit From 9595198f8dc4111f8cab39de2c0c4432787bc690 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Mar 2020 14:13:52 +0100 Subject: x86/platform: Convert to new CPU match macros The new macro set has a consistent namespace and uses C99 initializers instead of the grufty C89 ones. Get rid the of the local macro wrappers for consistency. Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Reviewed-by: Greg Kroah-Hartman Link: https://lkml.kernel.org/r/20200320131509.359448901@linutronix.de --- arch/x86/platform/atom/punit_atom_debug.c | 13 ++++++------- arch/x86/platform/efi/quirks.c | 7 ++----- arch/x86/platform/intel-mid/device_libs/platform_bt.c | 5 +---- arch/x86/platform/intel-quark/imr.c | 2 +- arch/x86/platform/intel-quark/imr_selftest.c | 2 +- 5 files changed, 11 insertions(+), 18 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/platform/atom/punit_atom_debug.c b/arch/x86/platform/atom/punit_atom_debug.c index ee6b0780bea1..f8ed5f66cd20 100644 --- a/arch/x86/platform/atom/punit_atom_debug.c +++ b/arch/x86/platform/atom/punit_atom_debug.c @@ -117,17 +117,16 @@ static void punit_dbgfs_unregister(void) debugfs_remove_recursive(punit_dbg_file); } -#define ICPU(model, drv_data) \ - { X86_VENDOR_INTEL, 6, model, X86_FEATURE_MWAIT,\ - (kernel_ulong_t)&drv_data } +#define X86_MATCH(model, data) \ + X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6, INTEL_FAM6_##model, \ + X86_FEATURE_MWAIT, data) static const struct x86_cpu_id intel_punit_cpu_ids[] = { - ICPU(INTEL_FAM6_ATOM_SILVERMONT, punit_device_byt), - ICPU(INTEL_FAM6_ATOM_SILVERMONT_MID, punit_device_tng), - ICPU(INTEL_FAM6_ATOM_AIRMONT, punit_device_cht), + X86_MATCH(ATOM_SILVERMONT, &punit_device_byt), + X86_MATCH(ATOM_SILVERMONT_MID, &punit_device_tng), + X86_MATCH(ATOM_AIRMONT, &punit_device_cht), {} }; - MODULE_DEVICE_TABLE(x86cpu, intel_punit_cpu_ids); static int __init punit_atom_debug_init(void) diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c index 88d32c06cffa..c000e03ecfe3 100644 --- a/arch/x86/platform/efi/quirks.c +++ b/arch/x86/platform/efi/quirks.c @@ -659,12 +659,9 @@ static int qrk_capsule_setup_info(struct capsule_info *cap_info, void **pkbuff, return 1; } -#define ICPU(family, model, quirk_handler) \ - { X86_VENDOR_INTEL, family, model, X86_FEATURE_ANY, \ - (unsigned long)&quirk_handler } - static const struct x86_cpu_id efi_capsule_quirk_ids[] = { - ICPU(5, 9, qrk_capsule_setup_info), /* Intel Quark X1000 */ + X86_MATCH_VENDOR_FAM_MODEL(INTEL, 5, INTEL_FAM5_QUARK_X1000, + &qrk_capsule_setup_info), { } }; diff --git a/arch/x86/platform/intel-mid/device_libs/platform_bt.c b/arch/x86/platform/intel-mid/device_libs/platform_bt.c index e3f4bfc08f78..31dda18bb370 100644 --- a/arch/x86/platform/intel-mid/device_libs/platform_bt.c +++ b/arch/x86/platform/intel-mid/device_libs/platform_bt.c @@ -60,11 +60,8 @@ static struct bt_sfi_data tng_bt_sfi_data __initdata = { .setup = tng_bt_sfi_setup, }; -#define ICPU(model, ddata) \ - { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (kernel_ulong_t)&ddata } - static const struct x86_cpu_id bt_sfi_cpu_ids[] = { - ICPU(INTEL_FAM6_ATOM_SILVERMONT_MID, tng_bt_sfi_data), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_SILVERMONT_MID, &tng_bt_sfi_data), {} }; diff --git a/arch/x86/platform/intel-quark/imr.c b/arch/x86/platform/intel-quark/imr.c index e9d97d52475e..0286fe1b14b5 100644 --- a/arch/x86/platform/intel-quark/imr.c +++ b/arch/x86/platform/intel-quark/imr.c @@ -569,7 +569,7 @@ static void __init imr_fixup_memmap(struct imr_device *idev) } static const struct x86_cpu_id imr_ids[] __initconst = { - { X86_VENDOR_INTEL, 5, 9 }, /* Intel Quark SoC X1000. */ + X86_MATCH_VENDOR_FAM_MODEL(INTEL, 5, INTEL_FAM5_QUARK_X1000, NULL), {} }; diff --git a/arch/x86/platform/intel-quark/imr_selftest.c b/arch/x86/platform/intel-quark/imr_selftest.c index 4307830e1b6f..570e3062faac 100644 --- a/arch/x86/platform/intel-quark/imr_selftest.c +++ b/arch/x86/platform/intel-quark/imr_selftest.c @@ -105,7 +105,7 @@ static void __init imr_self_test(void) } static const struct x86_cpu_id imr_ids[] __initconst = { - { X86_VENDOR_INTEL, 5, 9 }, /* Intel Quark SoC X1000. */ + X86_MATCH_VENDOR_FAM_MODEL(INTEL, 5, INTEL_FAM5_QUARK_X1000, NULL), {} }; -- cgit From f30cfacad1ee948c821a82e63b28958b92bf8c6c Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Mar 2020 14:14:05 +0100 Subject: crypto: Convert to new CPU match macros The new macro set has a consistent namespace and uses C99 initializers instead of the grufty C89 ones. Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Reviewed-by: Greg Kroah-Hartman Link: https://lkml.kernel.org/r/20200320131510.700250889@linutronix.de --- arch/x86/crypto/aesni-intel_glue.c | 2 +- arch/x86/crypto/crc32-pclmul_glue.c | 2 +- arch/x86/crypto/crc32c-intel_glue.c | 2 +- arch/x86/crypto/crct10dif-pclmul_glue.c | 2 +- arch/x86/crypto/ghash-clmulni-intel_glue.c | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c index bbbebbd35b5d..75b6ea20491e 100644 --- a/arch/x86/crypto/aesni-intel_glue.c +++ b/arch/x86/crypto/aesni-intel_glue.c @@ -1064,7 +1064,7 @@ static struct aead_alg aesni_aeads[0]; static struct simd_aead_alg *aesni_simd_aeads[ARRAY_SIZE(aesni_aeads)]; static const struct x86_cpu_id aesni_cpu_id[] = { - X86_FEATURE_MATCH(X86_FEATURE_AES), + X86_MATCH_FEATURE(X86_FEATURE_AES, NULL), {} }; MODULE_DEVICE_TABLE(x86cpu, aesni_cpu_id); diff --git a/arch/x86/crypto/crc32-pclmul_glue.c b/arch/x86/crypto/crc32-pclmul_glue.c index 418bd88acac8..7c4c7b2fbf05 100644 --- a/arch/x86/crypto/crc32-pclmul_glue.c +++ b/arch/x86/crypto/crc32-pclmul_glue.c @@ -170,7 +170,7 @@ static struct shash_alg alg = { }; static const struct x86_cpu_id crc32pclmul_cpu_id[] = { - X86_FEATURE_MATCH(X86_FEATURE_PCLMULQDQ), + X86_MATCH_FEATURE(X86_FEATURE_PCLMULQDQ, NULL), {} }; MODULE_DEVICE_TABLE(x86cpu, crc32pclmul_cpu_id); diff --git a/arch/x86/crypto/crc32c-intel_glue.c b/arch/x86/crypto/crc32c-intel_glue.c index c20d1b8a82c3..d2d069bd459b 100644 --- a/arch/x86/crypto/crc32c-intel_glue.c +++ b/arch/x86/crypto/crc32c-intel_glue.c @@ -221,7 +221,7 @@ static struct shash_alg alg = { }; static const struct x86_cpu_id crc32c_cpu_id[] = { - X86_FEATURE_MATCH(X86_FEATURE_XMM4_2), + X86_MATCH_FEATURE(X86_FEATURE_XMM4_2, NULL), {} }; MODULE_DEVICE_TABLE(x86cpu, crc32c_cpu_id); diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c index 3c81e15b0873..71291d5af9f4 100644 --- a/arch/x86/crypto/crct10dif-pclmul_glue.c +++ b/arch/x86/crypto/crct10dif-pclmul_glue.c @@ -114,7 +114,7 @@ static struct shash_alg alg = { }; static const struct x86_cpu_id crct10dif_cpu_id[] = { - X86_FEATURE_MATCH(X86_FEATURE_PCLMULQDQ), + X86_MATCH_FEATURE(X86_FEATURE_PCLMULQDQ, NULL), {} }; MODULE_DEVICE_TABLE(x86cpu, crct10dif_cpu_id); diff --git a/arch/x86/crypto/ghash-clmulni-intel_glue.c b/arch/x86/crypto/ghash-clmulni-intel_glue.c index a4b728518e28..1f1a95f3dd0c 100644 --- a/arch/x86/crypto/ghash-clmulni-intel_glue.c +++ b/arch/x86/crypto/ghash-clmulni-intel_glue.c @@ -313,7 +313,7 @@ static struct ahash_alg ghash_async_alg = { }; static const struct x86_cpu_id pcmul_cpu_id[] = { - X86_FEATURE_MATCH(X86_FEATURE_PCLMULQDQ), /* Pickle-Mickle-Duck */ + X86_MATCH_FEATURE(X86_FEATURE_PCLMULQDQ, NULL), /* Pickle-Mickle-Duck */ {} }; MODULE_DEVICE_TABLE(x86cpu, pcmul_cpu_id); -- cgit From 1826d56bcef9c38287f7c1a8e3b7778863e0b9d7 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Mar 2020 14:14:07 +0100 Subject: x86/cpu: Cleanup the now unused CPU match macros No more users. Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Reviewed-by: Greg Kroah-Hartman Link: https://lkml.kernel.org/r/20200320131510.900226233@linutronix.de --- arch/x86/include/asm/cpu_device_id.h | 3 --- arch/x86/include/asm/intel-family.h | 13 ------------- 2 files changed, 16 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/cpu_device_id.h b/arch/x86/include/asm/cpu_device_id.h index f11770fac73a..cf3d621c6892 100644 --- a/arch/x86/include/asm/cpu_device_id.h +++ b/arch/x86/include/asm/cpu_device_id.h @@ -91,9 +91,6 @@ #define X86_MATCH_FEATURE(feature, data) \ X86_MATCH_VENDOR_FEATURE(ANY, feature, data) -/* Transitional to keep the existing code working */ -#define X86_FEATURE_MATCH(feature) X86_MATCH_FEATURE(feature, NULL) - /** * X86_MATCH_VENDOR_FAM_MODEL - Match vendor, family and model * @vendor: The vendor name, e.g. INTEL, AMD, HYGON, ..., ANY diff --git a/arch/x86/include/asm/intel-family.h b/arch/x86/include/asm/intel-family.h index 236a87aa84d6..8f1e94f29a16 100644 --- a/arch/x86/include/asm/intel-family.h +++ b/arch/x86/include/asm/intel-family.h @@ -124,17 +124,4 @@ /* Family 5 */ #define INTEL_FAM5_QUARK_X1000 0x09 /* Quark X1000 SoC */ -/* Useful macros */ -#define INTEL_CPU_FAM_ANY(_family, _model, _driver_data) \ -{ \ - .vendor = X86_VENDOR_INTEL, \ - .family = _family, \ - .model = _model, \ - .feature = X86_FEATURE_ANY, \ - .driver_data = (kernel_ulong_t)&_driver_data \ -} - -#define INTEL_CPU_FAM6(_model, _driver_data) \ - INTEL_CPU_FAM_ANY(6, INTEL_FAM6_##_model, _driver_data) - #endif /* _ASM_X86_INTEL_FAMILY_H */ -- cgit From 290a4474d019c7e49c186100e157fff5e273ab3b Mon Sep 17 00:00:00 2001 From: Brian Gerst Date: Tue, 24 Mar 2020 10:35:20 -0400 Subject: x86/entry: Fix build error x86 with !CONFIG_POSIX_TIMERS Add missing semicolon. Fixes: a74d187c2df3 ("x86/entry: Refactor SYS_NI macros") Reported-by: Ingo Molnar Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200324143520.898733-1-brgerst@gmail.com --- arch/x86/include/asm/syscall_wrapper.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/syscall_wrapper.h b/arch/x86/include/asm/syscall_wrapper.h index e10efa1454bc..a84333adeef2 100644 --- a/arch/x86/include/asm/syscall_wrapper.h +++ b/arch/x86/include/asm/syscall_wrapper.h @@ -86,7 +86,7 @@ extern long __ia32_sys_ni_syscall(const struct pt_regs *regs); } #define __SYS_NI(abi, name) \ - SYSCALL_ALIAS(__##abi##_##name, sys_ni_posix_timers) + SYSCALL_ALIAS(__##abi##_##name, sys_ni_posix_timers); #ifdef CONFIG_X86_64 #define __X64_SYS_STUB0(name) \ -- cgit From d198b34f3855eee2571dda03eea75a09c7c31480 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Tue, 3 Mar 2020 22:35:59 +0900 Subject: .gitignore: add SPDX License Identifier Add SPDX License Identifier to all .gitignore files. Signed-off-by: Masahiro Yamada Signed-off-by: Greg Kroah-Hartman --- arch/x86/.gitignore | 1 + arch/x86/boot/.gitignore | 1 + arch/x86/boot/compressed/.gitignore | 1 + arch/x86/boot/tools/.gitignore | 1 + arch/x86/crypto/.gitignore | 1 + arch/x86/entry/vdso/.gitignore | 1 + arch/x86/entry/vdso/vdso32/.gitignore | 1 + arch/x86/kernel/.gitignore | 1 + arch/x86/kernel/cpu/.gitignore | 1 + arch/x86/lib/.gitignore | 1 + arch/x86/realmode/rm/.gitignore | 1 + arch/x86/tools/.gitignore | 1 + arch/x86/um/vdso/.gitignore | 1 + 13 files changed, 13 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/.gitignore b/arch/x86/.gitignore index 5a82bac5e0bc..677111acbaa3 100644 --- a/arch/x86/.gitignore +++ b/arch/x86/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only boot/compressed/vmlinux tools/test_get_len tools/insn_sanity diff --git a/arch/x86/boot/.gitignore b/arch/x86/boot/.gitignore index 09d25dd09307..9cc7f1357b9b 100644 --- a/arch/x86/boot/.gitignore +++ b/arch/x86/boot/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only bootsect bzImage cpustr.h diff --git a/arch/x86/boot/compressed/.gitignore b/arch/x86/boot/compressed/.gitignore index 4a46fab7162e..25805199a506 100644 --- a/arch/x86/boot/compressed/.gitignore +++ b/arch/x86/boot/compressed/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only relocs vmlinux.bin.all vmlinux.relocs diff --git a/arch/x86/boot/tools/.gitignore b/arch/x86/boot/tools/.gitignore index 378eac25d311..ae91f4d0d78b 100644 --- a/arch/x86/boot/tools/.gitignore +++ b/arch/x86/boot/tools/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only build diff --git a/arch/x86/crypto/.gitignore b/arch/x86/crypto/.gitignore index 30be0400a439..580c839bb177 100644 --- a/arch/x86/crypto/.gitignore +++ b/arch/x86/crypto/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only poly1305-x86_64-cryptogams.S diff --git a/arch/x86/entry/vdso/.gitignore b/arch/x86/entry/vdso/.gitignore index aae8ffdd5880..37a6129d597b 100644 --- a/arch/x86/entry/vdso/.gitignore +++ b/arch/x86/entry/vdso/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso.lds vdsox32.lds vdso32-syscall-syms.lds diff --git a/arch/x86/entry/vdso/vdso32/.gitignore b/arch/x86/entry/vdso/vdso32/.gitignore index e45fba9d0ced..5167384843b9 100644 --- a/arch/x86/entry/vdso/vdso32/.gitignore +++ b/arch/x86/entry/vdso/vdso32/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso32.lds diff --git a/arch/x86/kernel/.gitignore b/arch/x86/kernel/.gitignore index 08f4fd731469..ef66569e7e22 100644 --- a/arch/x86/kernel/.gitignore +++ b/arch/x86/kernel/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only vsyscall.lds vsyscall_32.lds vmlinux.lds diff --git a/arch/x86/kernel/cpu/.gitignore b/arch/x86/kernel/cpu/.gitignore index 667df55a4399..0bca7ef7426a 100644 --- a/arch/x86/kernel/cpu/.gitignore +++ b/arch/x86/kernel/cpu/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only capflags.c diff --git a/arch/x86/lib/.gitignore b/arch/x86/lib/.gitignore index 8df89f0a3fe6..8ae0f93ecbfd 100644 --- a/arch/x86/lib/.gitignore +++ b/arch/x86/lib/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only inat-tables.c diff --git a/arch/x86/realmode/rm/.gitignore b/arch/x86/realmode/rm/.gitignore index b6ed3a2555cb..6c3464f46166 100644 --- a/arch/x86/realmode/rm/.gitignore +++ b/arch/x86/realmode/rm/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only pasyms.h realmode.lds realmode.relocs diff --git a/arch/x86/tools/.gitignore b/arch/x86/tools/.gitignore index be0ed065249b..d36dc7cf9115 100644 --- a/arch/x86/tools/.gitignore +++ b/arch/x86/tools/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only relocs diff --git a/arch/x86/um/vdso/.gitignore b/arch/x86/um/vdso/.gitignore index f8b69d84238e..652e31d82582 100644 --- a/arch/x86/um/vdso/.gitignore +++ b/arch/x86/um/vdso/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso.lds -- cgit From 244febbee876203d8505dfadcc6edb82a0e061b8 Mon Sep 17 00:00:00 2001 From: Qiujun Huang Date: Wed, 4 Mar 2020 00:42:12 +0800 Subject: x86/alternatives: Mark text_poke_loc_init() static The function is only used in this file so make it static. [ bp: Massage. ] Signed-off-by: Qiujun Huang Signed-off-by: Borislav Petkov Acked-by: Peter Zijlstra Link: https://lkml.kernel.org/r/1583253732-18988-1-git-send-email-hqjagain@gmail.com --- arch/x86/kernel/alternative.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 15ac0d5f4b40..7867dfb3963e 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -1167,8 +1167,8 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries atomic_cond_read_acquire(&desc.refs, !VAL); } -void text_poke_loc_init(struct text_poke_loc *tp, void *addr, - const void *opcode, size_t len, const void *emulate) +static void text_poke_loc_init(struct text_poke_loc *tp, void *addr, + const void *opcode, size_t len, const void *emulate) { struct insn insn; -- cgit From af7aa04683e85ccb9088e31fe67a0397167b7abd Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Mon, 23 Mar 2020 13:51:02 +0000 Subject: x86/smp: Replace cpu_up/down() with add/remove_cpu() The core device API performs extra housekeeping bits that are missing from directly calling cpu_up/down(). See commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization during LPM") for an example description of what might go wrong. This also prepares to make cpu_up/down() a private interface of the CPU subsystem. Signed-off-by: Qais Yousef Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200323135110.30522-10-qais.yousef@arm.com --- arch/x86/kernel/topology.c | 22 ++++++---------------- arch/x86/mm/mmio-mod.c | 4 ++-- arch/x86/xen/smp.c | 2 +- 3 files changed, 9 insertions(+), 19 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/topology.c b/arch/x86/kernel/topology.c index be5bc2e47c71..b8810ebbc8ae 100644 --- a/arch/x86/kernel/topology.c +++ b/arch/x86/kernel/topology.c @@ -59,39 +59,29 @@ __setup("cpu0_hotplug", enable_cpu0_hotplug); */ int _debug_hotplug_cpu(int cpu, int action) { - struct device *dev = get_cpu_device(cpu); int ret; if (!cpu_is_hotpluggable(cpu)) return -EINVAL; - lock_device_hotplug(); - switch (action) { case 0: - ret = cpu_down(cpu); - if (!ret) { + ret = remove_cpu(cpu); + if (!ret) pr_info("DEBUG_HOTPLUG_CPU0: CPU %u is now offline\n", cpu); - dev->offline = true; - kobject_uevent(&dev->kobj, KOBJ_OFFLINE); - } else + else pr_debug("Can't offline CPU%d.\n", cpu); break; case 1: - ret = cpu_up(cpu); - if (!ret) { - dev->offline = false; - kobject_uevent(&dev->kobj, KOBJ_ONLINE); - } else { + ret = add_cpu(cpu); + if (ret) pr_debug("Can't online CPU%d.\n", cpu); - } + break; default: ret = -EINVAL; } - unlock_device_hotplug(); - return ret; } diff --git a/arch/x86/mm/mmio-mod.c b/arch/x86/mm/mmio-mod.c index 673de6063345..109325d77b3e 100644 --- a/arch/x86/mm/mmio-mod.c +++ b/arch/x86/mm/mmio-mod.c @@ -386,7 +386,7 @@ static void enter_uniprocessor(void) put_online_cpus(); for_each_cpu(cpu, downed_cpus) { - err = cpu_down(cpu); + err = remove_cpu(cpu); if (!err) pr_info("CPU%d is down.\n", cpu); else @@ -406,7 +406,7 @@ static void leave_uniprocessor(void) return; pr_notice("Re-enabling CPUs...\n"); for_each_cpu(cpu, downed_cpus) { - err = cpu_up(cpu); + err = add_cpu(cpu); if (!err) pr_info("enabled CPU%d.\n", cpu); else diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index 7a43b2ae19f1..2097fa0ebdb5 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -132,7 +132,7 @@ void __init xen_smp_cpus_done(unsigned int max_cpus) if (xen_vcpu_nr(cpu) < MAX_VIRT_CPUS) continue; - rc = cpu_down(cpu); + rc = remove_cpu(cpu); if (rc == 0) { /* -- cgit From fc8bd77d6476d7733ace9e03093b4acaee6e0605 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 16 Mar 2020 10:13:45 +0100 Subject: x86/kexec: Use RIP relative addressing Normally identity_mapped is not visible to objtool, due to: arch/x86/kernel/Makefile:OBJECT_FILES_NON_STANDARD_relocate_kernel_$(BITS).o := y However, when we want to run objtool on vmlinux.o there is no hiding it: vmlinux.o: warning: objtool: .text+0x4c0f1: unsupported intra-function call Replace the (i386 inspired) pattern: call 1f 1: popq %r8 subq $(1b - relocate_kernel), %r8 With a x86_64 RIP-relative LEA: leaq relocate_kernel(%rip), %r8 Suggested-by: Brian Gerst Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Miroslav Benes Acked-by: Josh Poimboeuf Link: https://lkml.kernel.org/r/20200324160924.143334345@infradead.org --- arch/x86/kernel/relocate_kernel_64.S | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S index ef3ba99068d3..cc5c8b9048b5 100644 --- a/arch/x86/kernel/relocate_kernel_64.S +++ b/arch/x86/kernel/relocate_kernel_64.S @@ -196,10 +196,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped) /* get the re-entry point of the peer system */ movq 0(%rsp), %rbp - call 1f -1: - popq %r8 - subq $(1b - relocate_kernel), %r8 + leaq relocate_kernel(%rip), %r8 movq CP_PA_SWAP_PAGE(%r8), %r10 movq CP_PA_BACKUP_PAGES_MAP(%r8), %rdi movq CP_PA_TABLE_PAGE(%r8), %rax -- cgit From 36cc552055a5f95bab479533b4ebbad6a6cea0e1 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 24 Mar 2020 15:35:42 +0100 Subject: x86/kexec: Make relocate_kernel_64.S objtool clean Having fixed the biggest objtool issue in this file; fix up the rest and remove the exception. Suggested-by: Josh Poimboeuf Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Miroslav Benes Acked-by: Josh Poimboeuf Link: https://lkml.kernel.org/r/20200324160924.202621656@infradead.org --- arch/x86/kernel/Makefile | 1 - arch/x86/kernel/relocate_kernel_64.S | 7 +++++++ 2 files changed, 7 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 9b294c13809a..8be5926cce51 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -28,7 +28,6 @@ KASAN_SANITIZE_dumpstack_$(BITS).o := n KASAN_SANITIZE_stacktrace.o := n KASAN_SANITIZE_paravirt.o := n -OBJECT_FILES_NON_STANDARD_relocate_kernel_$(BITS).o := y OBJECT_FILES_NON_STANDARD_test_nx.o := y OBJECT_FILES_NON_STANDARD_paravirt_patch.o := y diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S index cc5c8b9048b5..a4d9a261425b 100644 --- a/arch/x86/kernel/relocate_kernel_64.S +++ b/arch/x86/kernel/relocate_kernel_64.S @@ -9,6 +9,8 @@ #include #include #include +#include +#include /* * Must be relocatable PIC code callable as a C function @@ -39,6 +41,7 @@ .align PAGE_SIZE .code64 SYM_CODE_START_NOALIGN(relocate_kernel) + UNWIND_HINT_EMPTY /* * %rdi indirection_page * %rsi page_list @@ -105,6 +108,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel) SYM_CODE_END(relocate_kernel) SYM_CODE_START_LOCAL_NOALIGN(identity_mapped) + UNWIND_HINT_EMPTY /* set return address to 0 if not preserving context */ pushq $0 /* store the start address on the stack */ @@ -192,6 +196,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped) 1: popq %rdx leaq PAGE_SIZE(%r10), %rsp + ANNOTATE_RETPOLINE_SAFE call *%rdx /* get the re-entry point of the peer system */ @@ -209,6 +214,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped) SYM_CODE_END(identity_mapped) SYM_CODE_START_LOCAL_NOALIGN(virtual_mapped) + UNWIND_HINT_EMPTY movq RSP(%r8), %rsp movq CR4(%r8), %rax movq %rax, %cr4 @@ -230,6 +236,7 @@ SYM_CODE_END(virtual_mapped) /* Do the copies */ SYM_CODE_START_LOCAL_NOALIGN(swap_pages) + UNWIND_HINT_EMPTY movq %rdi, %rcx /* Put the page_list in %rcx */ xorl %edi, %edi xorl %esi, %esi -- cgit From e1be9ac8e6014a9b0a216aebae7250f9863e9fc3 Mon Sep 17 00:00:00 2001 From: Wanpeng Li Date: Thu, 26 Mar 2020 10:20:01 +0800 Subject: KVM: X86: Narrow down the IPI fastpath to single target IPI The original single target IPI fastpath patch forgot to filter the ICR destination shorthand field. Multicast IPI is not suitable for this feature since wakeup the multiple sleeping vCPUs will extend the interrupt disabled time, it especially worse in the over-subscribe and VM has a little bit more vCPUs scenario. Let's narrow it down to single target IPI. Two VMs, each is 76 vCPUs, one running 'ebizzy -M', the other running cyclictest on all vCPUs, w/ this patch, the avg score of cyclictest can improve more than 5%. (pv tlb, pv ipi, pv sched yield are disabled during testing to avoid the disturb). Signed-off-by: Wanpeng Li Message-Id: <1585189202-1708-3-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d65ff2008cf1..cf95c36cb4f4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1554,7 +1554,10 @@ EXPORT_SYMBOL_GPL(kvm_emulate_wrmsr); */ static int handle_fastpath_set_x2apic_icr_irqoff(struct kvm_vcpu *vcpu, u64 data) { - if (lapic_in_kernel(vcpu) && apic_x2apic_mode(vcpu->arch.apic) && + if (!lapic_in_kernel(vcpu) || !apic_x2apic_mode(vcpu->arch.apic)) + return 1; + + if (((data & APIC_SHORT_MASK) == APIC_DEST_NOSHORT) && ((data & APIC_DEST_MASK) == APIC_DEST_PHYSICAL) && ((data & APIC_MODE_MASK) == APIC_DM_FIXED)) { -- cgit From 8a1038de11a5536c76054061837b11648bec5b46 Mon Sep 17 00:00:00 2001 From: Wanpeng Li Date: Thu, 26 Mar 2020 10:20:00 +0800 Subject: KVM: X86: Delay read msr data iff writes ICR MSR Delay read msr data until we identify guest accesses ICR MSR to avoid to penalize all other MSR writes. Signed-off-by: Wanpeng Li Message-Id: <1585189202-1708-2-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a4e62d767dd6..62d614586402 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1595,11 +1595,12 @@ static int handle_fastpath_set_x2apic_icr_irqoff(struct kvm_vcpu *vcpu, u64 data enum exit_fastpath_completion handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu) { u32 msr = kvm_rcx_read(vcpu); - u64 data = kvm_read_edx_eax(vcpu); + u64 data; int ret = 0; switch (msr) { case APIC_BASE_MSR + (APIC_ICR >> 4): + data = kvm_read_edx_eax(vcpu); ret = handle_fastpath_set_x2apic_icr_irqoff(vcpu, data); break; default: -- cgit From d5361678e63c8a5e72d75cee6d15b840c44306f2 Mon Sep 17 00:00:00 2001 From: Wanpeng Li Date: Thu, 26 Mar 2020 10:20:02 +0800 Subject: KVM: X86: Micro-optimize IPI fastpath delay This patch optimizes the virtual IPI fastpath emulation sequence: write ICR2 send virtual IPI read ICR2 write ICR2 send virtual IPI ==> write ICR write ICR We can observe ~0.67% performance improvement for IPI microbenchmark (https://lore.kernel.org/kvm/20171219085010.4081-1-ynorov@caviumnetworks.com/) on Skylake server. Signed-off-by: Wanpeng Li Message-Id: <1585189202-1708-4-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/lapic.c | 4 ++-- arch/x86/kvm/lapic.h | 1 + arch/x86/kvm/x86.c | 6 +++++- 3 files changed, 8 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 41bd49ebe355..b754e49adbc5 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1241,7 +1241,7 @@ void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector) } EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated); -static void apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high) +void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high) { struct kvm_lapic_irq irq; @@ -1955,7 +1955,7 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) case APIC_ICR: /* No delay here, so we always clear the pending bit */ val &= ~(1 << 12); - apic_send_ipi(apic, val, kvm_lapic_get_reg(apic, APIC_ICR2)); + kvm_apic_send_ipi(apic, val, kvm_lapic_get_reg(apic, APIC_ICR2)); kvm_lapic_set_reg(apic, APIC_ICR, val); break; diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 7581bc2dcd47..40ed6ed22751 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -96,6 +96,7 @@ void kvm_apic_update_apicv(struct kvm_vcpu *vcpu); bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src, struct kvm_lapic_irq *irq, int *r, struct dest_map *dest_map); +void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high); u64 kvm_get_apic_base(struct kvm_vcpu *vcpu); int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 62d614586402..6fa014ccd253 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1585,8 +1585,12 @@ static int handle_fastpath_set_x2apic_icr_irqoff(struct kvm_vcpu *vcpu, u64 data ((data & APIC_DEST_MASK) == APIC_DEST_PHYSICAL) && ((data & APIC_MODE_MASK) == APIC_DM_FIXED)) { + data &= ~(1 << 12); + kvm_apic_send_ipi(vcpu->arch.apic, (u32)data, (u32)(data >> 32)); kvm_lapic_set_reg(vcpu->arch.apic, APIC_ICR2, (u32)(data >> 32)); - return kvm_lapic_reg_write(vcpu->arch.apic, APIC_ICR, (u32)data); + kvm_lapic_set_reg(vcpu->arch.apic, APIC_ICR, (u32)data); + trace_kvm_apic_write(APIC_ICR, (u32)data); + return 0; } return 1; -- cgit From 5790921bc18b1eb5c0c61371e31114fd4c4b0154 Mon Sep 17 00:00:00 2001 From: Yu-cheng Yu Date: Tue, 4 Feb 2020 09:14:24 -0800 Subject: x86/insn: Add Control-flow Enforcement (CET) instructions to the opcode map Add the following CET instructions to the opcode map: INCSSP: Increment Shadow Stack pointer (SSP). RDSSP: Read SSP into a GPR. SAVEPREVSSP: Use "previous ssp" token at top of current Shadow Stack (SHSTK) to create a "restore token" on the previous (outgoing) SHSTK. RSTORSSP: Restore from a "restore token" to SSP. WRSS: Write to kernel-mode SHSTK (kernel-mode instruction). WRUSS: Write to user-mode SHSTK (kernel-mode instruction). SETSSBSY: Verify the "supervisor token" pointed by MSR_IA32_PL0_SSP, set the token busy, and set then Shadow Stack pointer(SSP) to the value of MSR_IA32_PL0_SSP. CLRSSBSY: Verify the "supervisor token" and clear its busy bit. ENDBR64/ENDBR32: Mark a valid 64/32 bit control transfer endpoint. Detailed information of CET instructions can be found in Intel Software Developer's Manual. Signed-off-by: Yu-cheng Yu Signed-off-by: Borislav Petkov Reviewed-by: Adrian Hunter Reviewed-by: Tony Luck Acked-by: Masami Hiramatsu Link: https://lkml.kernel.org/r/20200204171425.28073-2-yu-cheng.yu@intel.com --- arch/x86/lib/x86-opcode-map.txt | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt index 53adc1762ec0..ec31f5b60323 100644 --- a/arch/x86/lib/x86-opcode-map.txt +++ b/arch/x86/lib/x86-opcode-map.txt @@ -366,7 +366,7 @@ AVXcode: 1 1b: BNDCN Gv,Ev (F2) | BNDMOV Ev,Gv (66) | BNDMK Gv,Ev (F3) | BNDSTX Ev,Gv 1c: Grp20 (1A),(1C) 1d: -1e: +1e: Grp21 (1A) 1f: NOP Ev # 0x0f 0x20-0x2f 20: MOV Rd,Cd @@ -803,8 +803,8 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32 Gd,Eb (F2) | CRC32 Gd,Eb (66&F2) f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32 Gd,Ew (66&F2) f2: ANDN Gy,By,Ey (v) f3: Grp17 (1A) -f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v) -f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v) +f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v) | WRUSSD/Q My,Gy (66) +f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v) | WRSSD/Q My,Gy f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By (F3),(v) | SHRX Gy,Ey,By (F2),(v) f8: MOVDIR64B Gv,Mdqq (66) | ENQCMD Gv,Mdqq (F2) | ENQCMDS Gv,Mdqq (F3) f9: MOVDIRI My,Gy @@ -970,7 +970,7 @@ GrpTable: Grp7 2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B) 3: LIDT Ms 4: SMSW Mw/Rv -5: rdpkru (110),(11B) | wrpkru (111),(11B) +5: rdpkru (110),(11B) | wrpkru (111),(11B) | SAVEPREVSSP (F3),(010),(11B) | RSTORSSP Mq (F3) | SETSSBSY (F3),(000),(11B) 6: LMSW Ew 7: INVLPG Mb | SWAPGS (o64),(000),(11B) | RDTSCP (001),(11B) EndTable @@ -1041,8 +1041,8 @@ GrpTable: Grp15 2: vldmxcsr Md (v1) | WRFSBASE Ry (F3),(11B) 3: vstmxcsr Md (v1) | WRGSBASE Ry (F3),(11B) 4: XSAVE | ptwrite Ey (F3),(11B) -5: XRSTOR | lfence (11B) -6: XSAVEOPT | clwb (66) | mfence (11B) | TPAUSE Rd (66),(11B) | UMONITOR Rv (F3),(11B) | UMWAIT Rd (F2),(11B) +5: XRSTOR | lfence (11B) | INCSSPD/Q Ry (F3),(11B) +6: XSAVEOPT | clwb (66) | mfence (11B) | TPAUSE Rd (66),(11B) | UMONITOR Rv (F3),(11B) | UMWAIT Rd (F2),(11B) | CLRSSBSY Mq (F3) 7: clflush | clflushopt (66) | sfence (11B) EndTable @@ -1077,6 +1077,11 @@ GrpTable: Grp20 0: cldemote Mb EndTable +GrpTable: Grp21 +1: RDSSPD/Q Ry (F3),(11B) +7: ENDBR64 (F3),(010),(11B) | ENDBR32 (F3),(011),(11B) +EndTable + # AMD's Prefetch Group GrpTable: GrpP 0: PREFETCH -- cgit From 44a1d996325982025eefcdc50b636ab83e813372 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 18:46:02 -0500 Subject: x86: ia32_setup_sigcontext(): lift user_access_{begin,end}() into the callers What's left is just a sequence of stores to userland addresses, with all error handling, etc. done out of line. Calling that from user_access block is safe, but rather than teaching objtool to recognize it as such we can just make it always_inline - it is small enough and has few enough callers, for the space savings not to be an issue. Rename the sucker to __unsafe_setup_sigcontext32() and provide unsafe_put_sigcontext32() with usual kind of semantics. Signed-off-by: Al Viro --- arch/x86/ia32/ia32_signal.c | 44 ++++++++++++++++++++++++++++---------------- 1 file changed, 28 insertions(+), 16 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index a96995aa23da..799ca5b31b87 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -154,13 +154,11 @@ badframe: #define get_user_seg(seg) ({ unsigned int v; savesegment(seg, v); v; }) -static int ia32_setup_sigcontext(struct sigcontext_32 __user *sc, - void __user *fpstate, - struct pt_regs *regs, unsigned int mask) +static __always_inline int +__unsafe_setup_sigcontext32(struct sigcontext_32 __user *sc, + void __user *fpstate, + struct pt_regs *regs, unsigned int mask) { - if (!user_access_begin(sc, sizeof(struct sigcontext_32))) - return -EFAULT; - unsafe_put_user(get_user_seg(gs), (unsigned int __user *)&sc->gs, Efault); unsafe_put_user(get_user_seg(fs), (unsigned int __user *)&sc->fs, Efault); unsafe_put_user(get_user_seg(ds), (unsigned int __user *)&sc->ds, Efault); @@ -187,13 +185,18 @@ static int ia32_setup_sigcontext(struct sigcontext_32 __user *sc, /* non-iBCS2 extensions.. */ unsafe_put_user(mask, &sc->oldmask, Efault); unsafe_put_user(current->thread.cr2, &sc->cr2, Efault); - user_access_end(); return 0; + Efault: - user_access_end(); return -EFAULT; } +#define unsafe_put_sigcontext32(sc, fp, regs, set, label) \ +do { \ + if (__unsafe_setup_sigcontext32(sc, fp, regs, set->sig[0])) \ + goto label; \ +} while(0) + /* * Determine which stack to use.. */ @@ -234,7 +237,7 @@ int ia32_setup_frame(int sig, struct ksignal *ksig, struct sigframe_ia32 __user *frame; void __user *restorer; int err = 0; - void __user *fpstate = NULL; + void __user *fp = NULL; /* copy_to_user optimizes that into a single 8 byte store */ static const struct { @@ -247,7 +250,7 @@ int ia32_setup_frame(int sig, struct ksignal *ksig, 0x80cd, /* int $0x80 */ }; - frame = get_sigframe(ksig, regs, sizeof(*frame), &fpstate); + frame = get_sigframe(ksig, regs, sizeof(*frame), &fp); if (!access_ok(frame, sizeof(*frame))) return -EFAULT; @@ -255,9 +258,12 @@ int ia32_setup_frame(int sig, struct ksignal *ksig, if (__put_user(sig, &frame->sig)) return -EFAULT; - if (ia32_setup_sigcontext(&frame->sc, fpstate, regs, set->sig[0])) + if (!user_access_begin(&frame->sc, sizeof(struct sigcontext_32))) return -EFAULT; + unsafe_put_sigcontext32(&frame->sc, fp, regs, set, Efault); + user_access_end(); + if (__put_user(set->sig[1], &frame->extramask[0])) return -EFAULT; @@ -301,6 +307,9 @@ int ia32_setup_frame(int sig, struct ksignal *ksig, regs->ss = __USER32_DS; return 0; +Efault: + user_access_end(); + return -EFAULT; } int ia32_setup_rt_frame(int sig, struct ksignal *ksig, @@ -309,7 +318,7 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig, struct rt_sigframe_ia32 __user *frame; void __user *restorer; int err = 0; - void __user *fpstate = NULL; + void __user *fp = NULL; /* __copy_to_user optimizes that into a single 8 byte store */ static const struct { @@ -324,7 +333,7 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig, 0, }; - frame = get_sigframe(ksig, regs, sizeof(*frame), &fpstate); + frame = get_sigframe(ksig, regs, sizeof(*frame), &fp); if (!user_access_begin(frame, sizeof(*frame))) return -EFAULT; @@ -355,9 +364,12 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig, unsafe_put_user(*((u64 *)&code), (u64 __user *)frame->retcode, Efault); user_access_end(); - err |= __copy_siginfo_to_user32(&frame->info, &ksig->info, false); - err |= ia32_setup_sigcontext(&frame->uc.uc_mcontext, fpstate, - regs, set->sig[0]); + if (__copy_siginfo_to_user32(&frame->info, &ksig->info, false)) + return -EFAULT; + if (!user_access_begin(&frame->uc.uc_mcontext, sizeof(struct sigcontext_32))) + return -EFAULT; + unsafe_put_sigcontext32(&frame->uc.uc_mcontext, fp, regs, set, Efault); + user_access_end(); err |= __put_user(*(__u64 *)set, (__u64 __user *)&frame->uc.uc_sigmask); if (err) -- cgit From e2390741053e4931841650b5eadac32697aa88aa Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 19:36:40 -0500 Subject: x86: ia32_setup_frame(): consolidate uaccess areas Currently we have user_access block, followed by __put_user(), deciding what the restorer will be and finally a put_user_try block. Moving the calculation of restorer first allows the rest (actual copyout work) to coalesce into a single user_access block. Signed-off-by: Al Viro --- arch/x86/ia32/ia32_signal.c | 39 ++++++++++++--------------------------- 1 file changed, 12 insertions(+), 27 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index 799ca5b31b87..7018c2c325a1 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -236,7 +236,6 @@ int ia32_setup_frame(int sig, struct ksignal *ksig, { struct sigframe_ia32 __user *frame; void __user *restorer; - int err = 0; void __user *fp = NULL; /* copy_to_user optimizes that into a single 8 byte store */ @@ -252,21 +251,6 @@ int ia32_setup_frame(int sig, struct ksignal *ksig, frame = get_sigframe(ksig, regs, sizeof(*frame), &fp); - if (!access_ok(frame, sizeof(*frame))) - return -EFAULT; - - if (__put_user(sig, &frame->sig)) - return -EFAULT; - - if (!user_access_begin(&frame->sc, sizeof(struct sigcontext_32))) - return -EFAULT; - - unsafe_put_sigcontext32(&frame->sc, fp, regs, set, Efault); - user_access_end(); - - if (__put_user(set->sig[1], &frame->extramask[0])) - return -EFAULT; - if (ksig->ka.sa.sa_flags & SA_RESTORER) { restorer = ksig->ka.sa.sa_restorer; } else { @@ -278,19 +262,20 @@ int ia32_setup_frame(int sig, struct ksignal *ksig, restorer = &frame->retcode; } - put_user_try { - put_user_ex(ptr_to_compat(restorer), &frame->pretcode); - - /* - * These are actually not used anymore, but left because some - * gdb versions depend on them as a marker. - */ - put_user_ex(*((u64 *)&code), (u64 __user *)frame->retcode); - } put_user_catch(err); - - if (err) + if (!user_access_begin(frame, sizeof(*frame))) return -EFAULT; + unsafe_put_user(sig, &frame->sig, Efault); + unsafe_put_sigcontext32(&frame->sc, fp, regs, set, Efault); + unsafe_put_user(set->sig[1], &frame->extramask[0], Efault); + unsafe_put_user(ptr_to_compat(restorer), &frame->pretcode, Efault); + /* + * These are actually not used anymore, but left because some + * gdb versions depend on them as a marker. + */ + unsafe_put_user(*((u64 *)&code), (u64 __user *)frame->retcode, Efault); + user_access_end(); + /* Set up registers for signal handler */ regs->sp = (unsigned long) frame; regs->ip = (unsigned long) ksig->ka.sa.sa_handler; -- cgit From 57d563c8292569f2849569853e846bf740df5032 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 19:42:40 -0500 Subject: x86: ia32_setup_rt_frame(): consolidate uaccess areas __copy_siginfo_to_user32() call reordered a bit. The rest folds nicely. Signed-off-by: Al Viro --- arch/x86/ia32/ia32_signal.c | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index 7018c2c325a1..f9d8804144d0 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -302,10 +302,9 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig, { struct rt_sigframe_ia32 __user *frame; void __user *restorer; - int err = 0; void __user *fp = NULL; - /* __copy_to_user optimizes that into a single 8 byte store */ + /* unsafe_put_user optimizes that into a single 8 byte store */ static const struct { u8 movl; u32 val; @@ -347,17 +346,11 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig, * versions need it. */ unsafe_put_user(*((u64 *)&code), (u64 __user *)frame->retcode, Efault); - user_access_end(); - - if (__copy_siginfo_to_user32(&frame->info, &ksig->info, false)) - return -EFAULT; - if (!user_access_begin(&frame->uc.uc_mcontext, sizeof(struct sigcontext_32))) - return -EFAULT; unsafe_put_sigcontext32(&frame->uc.uc_mcontext, fp, regs, set, Efault); + unsafe_put_user(*(__u64 *)set, (__u64 *)&frame->uc.uc_sigmask, Efault); user_access_end(); - err |= __put_user(*(__u64 *)set, (__u64 __user *)&frame->uc.uc_sigmask); - if (err) + if (__copy_siginfo_to_user32(&frame->info, &ksig->info, false)) return -EFAULT; /* Set up registers for signal handler */ -- cgit From 119cd59fcfbe70fb3fcab4e64cd232bcc3807585 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 19:54:56 -0500 Subject: x86: get rid of put_user_try in __setup_rt_frame() (both 32bit and 64bit) Straightforward, except for save_altstack_ex() stuck in those. Replace that thing with an analogue that would use unsafe_put_user() instead of put_user_ex() (called compat_save_altstack()) and be done with that. Signed-off-by: Al Viro --- arch/x86/kernel/signal.c | 91 +++++++++++++++++++++++++----------------------- 1 file changed, 48 insertions(+), 43 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 29abad29aaa1..8b879fdc214c 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -365,38 +365,37 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fpstate); - if (!access_ok(frame, sizeof(*frame))) + if (!user_access_begin(frame, sizeof(*frame))) return -EFAULT; - put_user_try { - put_user_ex(sig, &frame->sig); - put_user_ex(&frame->info, &frame->pinfo); - put_user_ex(&frame->uc, &frame->puc); + unsafe_put_user(sig, &frame->sig, Efault); + unsafe_put_user(&frame->info, &frame->pinfo, Efault); + unsafe_put_user(&frame->uc, &frame->puc, Efault); - /* Create the ucontext. */ - if (static_cpu_has(X86_FEATURE_XSAVE)) - put_user_ex(UC_FP_XSTATE, &frame->uc.uc_flags); - else - put_user_ex(0, &frame->uc.uc_flags); - put_user_ex(0, &frame->uc.uc_link); - save_altstack_ex(&frame->uc.uc_stack, regs->sp); + /* Create the ucontext. */ + if (static_cpu_has(X86_FEATURE_XSAVE)) + unsafe_put_user(UC_FP_XSTATE, &frame->uc.uc_flags, Efault); + else + unsafe_put_user(0, &frame->uc.uc_flags, Efault); + unsafe_put_user(0, &frame->uc.uc_link, Efault); + unsafe_save_altstack(&frame->uc.uc_stack, regs->sp, Efault); - /* Set up to return from userspace. */ - restorer = current->mm->context.vdso + - vdso_image_32.sym___kernel_rt_sigreturn; - if (ksig->ka.sa.sa_flags & SA_RESTORER) - restorer = ksig->ka.sa.sa_restorer; - put_user_ex(restorer, &frame->pretcode); + /* Set up to return from userspace. */ + restorer = current->mm->context.vdso + + vdso_image_32.sym___kernel_rt_sigreturn; + if (ksig->ka.sa.sa_flags & SA_RESTORER) + restorer = ksig->ka.sa.sa_restorer; + unsafe_put_user(restorer, &frame->pretcode, Efault); - /* - * This is movl $__NR_rt_sigreturn, %ax ; int $0x80 - * - * WE DO NOT USE IT ANY MORE! It's only left here for historical - * reasons and because gdb uses it as a signature to notice - * signal handler stack frames. - */ - put_user_ex(*((u64 *)&rt_retcode), (u64 *)frame->retcode); - } put_user_catch(err); + /* + * This is movl $__NR_rt_sigreturn, %ax ; int $0x80 + * + * WE DO NOT USE IT ANY MORE! It's only left here for historical + * reasons and because gdb uses it as a signature to notice + * signal handler stack frames. + */ + unsafe_put_user(*((u64 *)&rt_retcode), (u64 *)frame->retcode, Efault); + user_access_end(); err |= copy_siginfo_to_user(&frame->info, &ksig->info); err |= setup_sigcontext(&frame->uc.uc_mcontext, fpstate, @@ -419,6 +418,9 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, regs->cs = __USER_CS; return 0; +Efault: + user_access_end(); + return -EFAULT; } #else /* !CONFIG_X86_32 */ static unsigned long frame_uc_flags(struct pt_regs *regs) @@ -444,6 +446,10 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, unsigned long uc_flags; int err = 0; + /* x86-64 should always use SA_RESTORER. */ + if (!(ksig->ka.sa.sa_flags & SA_RESTORER)) + return -EFAULT; + frame = get_sigframe(&ksig->ka, regs, sizeof(struct rt_sigframe), &fp); if (!access_ok(frame, sizeof(*frame))) @@ -455,23 +461,18 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, } uc_flags = frame_uc_flags(regs); + if (!user_access_begin(frame, sizeof(*frame))) + return -EFAULT; - put_user_try { - /* Create the ucontext. */ - put_user_ex(uc_flags, &frame->uc.uc_flags); - put_user_ex(0, &frame->uc.uc_link); - save_altstack_ex(&frame->uc.uc_stack, regs->sp); - - /* Set up to return from userspace. If provided, use a stub - already in userspace. */ - /* x86-64 should always use SA_RESTORER. */ - if (ksig->ka.sa.sa_flags & SA_RESTORER) { - put_user_ex(ksig->ka.sa.sa_restorer, &frame->pretcode); - } else { - /* could use a vstub here */ - err |= -EFAULT; - } - } put_user_catch(err); + /* Create the ucontext. */ + unsafe_put_user(uc_flags, &frame->uc.uc_flags, Efault); + unsafe_put_user(0, &frame->uc.uc_link, Efault); + unsafe_save_altstack(&frame->uc.uc_stack, regs->sp, Efault); + + /* Set up to return from userspace. If provided, use a stub + already in userspace. */ + unsafe_put_user(ksig->ka.sa.sa_restorer, &frame->pretcode, Efault); + user_access_end(); err |= setup_sigcontext(&frame->uc.uc_mcontext, fp, regs, set->sig[0]); err |= __put_user(set->sig[0], &frame->uc.uc_sigmask.sig[0]); @@ -515,6 +516,10 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, force_valid_ss(regs); return 0; + +Efault: + user_access_end(); + return -EFAULT; } #endif /* CONFIG_X86_32 */ -- cgit From b00d8f8f0b2b39223c3fd6713d318aba95420264 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 21:12:26 -0500 Subject: x86: setup_sigcontext(): list user_access_{begin,end}() into callers Similar to ia32_setup_sigcontext() change several commits ago, make it __always_inline. In cases when there is a user_access_{begin,end}() section nearby, just move the call over there. Signed-off-by: Al Viro --- arch/x86/kernel/signal.c | 45 ++++++++++++++++++++++++--------------------- 1 file changed, 24 insertions(+), 21 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 8b879fdc214c..f4fb00bd2378 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -140,12 +140,10 @@ static int restore_sigcontext(struct pt_regs *regs, IS_ENABLED(CONFIG_X86_32)); } -static int setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate, +static __always_inline int +__unsafe_setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate, struct pt_regs *regs, unsigned long mask) { - if (!user_access_begin(sc, sizeof(struct sigcontext))) - return -EFAULT; - #ifdef CONFIG_X86_32 unsafe_put_user(get_user_gs(regs), (unsigned int __user *)&sc->gs, Efault); @@ -194,13 +192,17 @@ static int setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate, /* non-iBCS2 extensions.. */ unsafe_put_user(mask, &sc->oldmask, Efault); unsafe_put_user(current->thread.cr2, &sc->cr2, Efault); - user_access_end(); return 0; Efault: - user_access_end(); return -EFAULT; } +#define unsafe_put_sigcontext(sc, fp, regs, set, label) \ +do { \ + if (__unsafe_setup_sigcontext(sc, fp, regs, set->sig[0])) \ + goto label; \ +} while(0); + /* * Set up a signal frame. */ @@ -301,9 +303,9 @@ __setup_frame(int sig, struct ksignal *ksig, sigset_t *set, struct sigframe __user *frame; void __user *restorer; int err = 0; - void __user *fpstate = NULL; + void __user *fp = NULL; - frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fpstate); + frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fp); if (!access_ok(frame, sizeof(*frame))) return -EFAULT; @@ -311,8 +313,10 @@ __setup_frame(int sig, struct ksignal *ksig, sigset_t *set, if (__put_user(sig, &frame->sig)) return -EFAULT; - if (setup_sigcontext(&frame->sc, fpstate, regs, set->sig[0])) + if (!user_access_begin(&frame->sc, sizeof(struct sigcontext))) return -EFAULT; + unsafe_put_sigcontext(&frame->sc, fp, regs, set, Efault); + user_access_end(); if (__put_user(set->sig[1], &frame->extramask[0])) return -EFAULT; @@ -353,6 +357,10 @@ __setup_frame(int sig, struct ksignal *ksig, sigset_t *set, regs->cs = __USER_CS; return 0; + +Efault: + user_access_end(); + return -EFAULT; } static int __setup_rt_frame(int sig, struct ksignal *ksig, @@ -361,9 +369,9 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, struct rt_sigframe __user *frame; void __user *restorer; int err = 0; - void __user *fpstate = NULL; + void __user *fp = NULL; - frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fpstate); + frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fp); if (!user_access_begin(frame, sizeof(*frame))) return -EFAULT; @@ -395,13 +403,11 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, * signal handler stack frames. */ unsafe_put_user(*((u64 *)&rt_retcode), (u64 *)frame->retcode, Efault); + unsafe_put_sigcontext(&frame->uc.uc_mcontext, fp, regs, set, Efault); user_access_end(); err |= copy_siginfo_to_user(&frame->info, &ksig->info); - err |= setup_sigcontext(&frame->uc.uc_mcontext, fpstate, - regs, set->sig[0]); err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); - if (err) return -EFAULT; @@ -472,9 +478,8 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, /* Set up to return from userspace. If provided, use a stub already in userspace. */ unsafe_put_user(ksig->ka.sa.sa_restorer, &frame->pretcode, Efault); + unsafe_put_sigcontext(&frame->uc.uc_mcontext, fp, regs, set, Efault); user_access_end(); - - err |= setup_sigcontext(&frame->uc.uc_mcontext, fp, regs, set->sig[0]); err |= __put_user(set->sig[0], &frame->uc.uc_sigmask.sig[0]); if (err) @@ -532,12 +537,12 @@ static int x32_setup_rt_frame(struct ksignal *ksig, unsigned long uc_flags; void __user *restorer; int err = 0; - void __user *fpstate = NULL; + void __user *fp = NULL; if (!(ksig->ka.sa.sa_flags & SA_RESTORER)) return -EFAULT; - frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fpstate); + frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fp); if (!access_ok(frame, sizeof(*frame))) return -EFAULT; @@ -559,10 +564,8 @@ static int x32_setup_rt_frame(struct ksignal *ksig, unsafe_put_user(0, &frame->uc.uc__pad0, Efault); restorer = ksig->ka.sa.sa_restorer; unsafe_put_user(restorer, (unsigned long __user *)&frame->pretcode, Efault); + unsafe_put_sigcontext(&frame->uc.uc_mcontext, fp, regs, set, Efault); user_access_end(); - - err |= setup_sigcontext(&frame->uc.uc_mcontext, fpstate, - regs, set->sig[0]); err |= __put_user(*(__u64 *)set, (__u64 __user *)&frame->uc.uc_sigmask); if (err) -- cgit From 5c1f178094631e8b9acc67e4a9b6e03abfbc2529 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 21:18:02 -0500 Subject: x86: __setup_frame(): consolidate uaccess areas Signed-off-by: Al Viro --- arch/x86/kernel/signal.c | 23 ++++++----------------- 1 file changed, 6 insertions(+), 17 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index f4fb00bd2378..38ff834ba0d6 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -302,25 +302,16 @@ __setup_frame(int sig, struct ksignal *ksig, sigset_t *set, { struct sigframe __user *frame; void __user *restorer; - int err = 0; void __user *fp = NULL; frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fp); - if (!access_ok(frame, sizeof(*frame))) - return -EFAULT; - - if (__put_user(sig, &frame->sig)) + if (!user_access_begin(frame, sizeof(*frame))) return -EFAULT; - if (!user_access_begin(&frame->sc, sizeof(struct sigcontext))) - return -EFAULT; + unsafe_put_user(sig, &frame->sig, Efault); unsafe_put_sigcontext(&frame->sc, fp, regs, set, Efault); - user_access_end(); - - if (__put_user(set->sig[1], &frame->extramask[0])) - return -EFAULT; - + unsafe_put_user(set->sig[1], &frame->extramask[0], Efault); if (current->mm->context.vdso) restorer = current->mm->context.vdso + vdso_image_32.sym___kernel_sigreturn; @@ -330,7 +321,7 @@ __setup_frame(int sig, struct ksignal *ksig, sigset_t *set, restorer = ksig->ka.sa.sa_restorer; /* Set up to return from userspace. */ - err |= __put_user(restorer, &frame->pretcode); + unsafe_put_user(restorer, &frame->pretcode, Efault); /* * This is popl %eax ; movl $__NR_sigreturn, %eax ; int $0x80 @@ -339,10 +330,8 @@ __setup_frame(int sig, struct ksignal *ksig, sigset_t *set, * reasons and because gdb uses it as a signature to notice * signal handler stack frames. */ - err |= __put_user(*((u64 *)&retcode), (u64 *)frame->retcode); - - if (err) - return -EFAULT; + unsafe_put_user(*((u64 *)&retcode), (u64 *)frame->retcode, Efault); + user_access_end(); /* Set up registers for signal handler */ regs->sp = (unsigned long)frame; -- cgit From ead8e4e7e2c75ced6fcd9a53d3e9a2ecd7368553 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 21:22:39 -0500 Subject: x86: __setup_rt_frame(): consolidate uaccess areas reorder copy_siginfo_to_user() calls a bit Signed-off-by: Al Viro --- arch/x86/kernel/signal.c | 26 +++++++++----------------- 1 file changed, 9 insertions(+), 17 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 38ff834ba0d6..e37d5a1bb713 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -357,7 +357,6 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, { struct rt_sigframe __user *frame; void __user *restorer; - int err = 0; void __user *fp = NULL; frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fp); @@ -393,11 +392,11 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, */ unsafe_put_user(*((u64 *)&rt_retcode), (u64 *)frame->retcode, Efault); unsafe_put_sigcontext(&frame->uc.uc_mcontext, fp, regs, set, Efault); + unsafe_put_user(*(__u64 *)set, + (__u64 __user *)&frame->uc.uc_sigmask, Efault); user_access_end(); - err |= copy_siginfo_to_user(&frame->info, &ksig->info); - err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); - if (err) + if (copy_siginfo_to_user(&frame->info, &ksig->info)) return -EFAULT; /* Set up registers for signal handler */ @@ -439,23 +438,14 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, struct rt_sigframe __user *frame; void __user *fp = NULL; unsigned long uc_flags; - int err = 0; /* x86-64 should always use SA_RESTORER. */ if (!(ksig->ka.sa.sa_flags & SA_RESTORER)) return -EFAULT; frame = get_sigframe(&ksig->ka, regs, sizeof(struct rt_sigframe), &fp); - - if (!access_ok(frame, sizeof(*frame))) - return -EFAULT; - - if (ksig->ka.sa.sa_flags & SA_SIGINFO) { - if (copy_siginfo_to_user(&frame->info, &ksig->info)) - return -EFAULT; - } - uc_flags = frame_uc_flags(regs); + if (!user_access_begin(frame, sizeof(*frame))) return -EFAULT; @@ -468,11 +458,13 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, already in userspace. */ unsafe_put_user(ksig->ka.sa.sa_restorer, &frame->pretcode, Efault); unsafe_put_sigcontext(&frame->uc.uc_mcontext, fp, regs, set, Efault); + unsafe_put_user(set->sig[0], &frame->uc.uc_sigmask.sig[0], Efault); user_access_end(); - err |= __put_user(set->sig[0], &frame->uc.uc_sigmask.sig[0]); - if (err) - return -EFAULT; + if (ksig->ka.sa.sa_flags & SA_SIGINFO) { + if (copy_siginfo_to_user(&frame->info, &ksig->info)) + return -EFAULT; + } /* Set up registers for signal handler */ regs->di = sig; -- cgit From 791612e9668cecbf5dd24d13400ac74e099f005c Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 21:25:14 -0500 Subject: x86: x32_setup_rt_frame(): consolidate uaccess areas Signed-off-by: Al Viro --- arch/x86/kernel/signal.c | 17 +++++------------ 1 file changed, 5 insertions(+), 12 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index e37d5a1bb713..38b359325291 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -517,7 +517,6 @@ static int x32_setup_rt_frame(struct ksignal *ksig, struct rt_sigframe_x32 __user *frame; unsigned long uc_flags; void __user *restorer; - int err = 0; void __user *fp = NULL; if (!(ksig->ka.sa.sa_flags & SA_RESTORER)) @@ -525,14 +524,6 @@ static int x32_setup_rt_frame(struct ksignal *ksig, frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fp); - if (!access_ok(frame, sizeof(*frame))) - return -EFAULT; - - if (ksig->ka.sa.sa_flags & SA_SIGINFO) { - if (__copy_siginfo_to_user32(&frame->info, &ksig->info, true)) - return -EFAULT; - } - uc_flags = frame_uc_flags(regs); if (!user_access_begin(frame, sizeof(*frame))) @@ -546,11 +537,13 @@ static int x32_setup_rt_frame(struct ksignal *ksig, restorer = ksig->ka.sa.sa_restorer; unsafe_put_user(restorer, (unsigned long __user *)&frame->pretcode, Efault); unsafe_put_sigcontext(&frame->uc.uc_mcontext, fp, regs, set, Efault); + unsafe_put_user(*(__u64 *)set, (__u64 __user *)&frame->uc.uc_sigmask, Efault); user_access_end(); - err |= __put_user(*(__u64 *)set, (__u64 __user *)&frame->uc.uc_sigmask); - if (err) - return -EFAULT; + if (ksig->ka.sa.sa_flags & SA_SIGINFO) { + if (__copy_siginfo_to_user32(&frame->info, &ksig->info, true)) + return -EFAULT; + } /* Set up registers for signal handler */ regs->sp = (unsigned long) frame; -- cgit From b87df6594486626a9ae5944807307f2604cea3e2 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 21:36:52 -0500 Subject: x86: unsafe_put-style macro for sigmask regularizes things a bit Signed-off-by: Al Viro --- arch/x86/kernel/signal.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 38b359325291..1215fc7da0ba 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -203,6 +203,11 @@ do { \ goto label; \ } while(0); +#define unsafe_put_sigmask(set, frame, label) \ + unsafe_put_user(*(__u64 *)(set), \ + (__u64 __user *)&(frame)->uc.uc_sigmask, \ + label) + /* * Set up a signal frame. */ @@ -392,8 +397,7 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, */ unsafe_put_user(*((u64 *)&rt_retcode), (u64 *)frame->retcode, Efault); unsafe_put_sigcontext(&frame->uc.uc_mcontext, fp, regs, set, Efault); - unsafe_put_user(*(__u64 *)set, - (__u64 __user *)&frame->uc.uc_sigmask, Efault); + unsafe_put_sigmask(set, frame, Efault); user_access_end(); if (copy_siginfo_to_user(&frame->info, &ksig->info)) @@ -458,7 +462,7 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig, already in userspace. */ unsafe_put_user(ksig->ka.sa.sa_restorer, &frame->pretcode, Efault); unsafe_put_sigcontext(&frame->uc.uc_mcontext, fp, regs, set, Efault); - unsafe_put_user(set->sig[0], &frame->uc.uc_sigmask.sig[0], Efault); + unsafe_put_sigmask(set, frame, Efault); user_access_end(); if (ksig->ka.sa.sa_flags & SA_SIGINFO) { @@ -537,7 +541,7 @@ static int x32_setup_rt_frame(struct ksignal *ksig, restorer = ksig->ka.sa.sa_restorer; unsafe_put_user(restorer, (unsigned long __user *)&frame->pretcode, Efault); unsafe_put_sigcontext(&frame->uc.uc_mcontext, fp, regs, set, Efault); - unsafe_put_user(*(__u64 *)set, (__u64 __user *)&frame->uc.uc_sigmask, Efault); + unsafe_put_sigmask(set, frame, Efault); user_access_end(); if (ksig->ka.sa.sa_flags & SA_SIGINFO) { -- cgit From cf122cfba5b1d9daf64009d143f51dfec4b1705a Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 15 Feb 2020 21:10:25 -0500 Subject: kill uaccess_try() finally Signed-off-by: Al Viro --- arch/x86/include/asm/asm.h | 6 ---- arch/x86/include/asm/processor.h | 1 - arch/x86/include/asm/uaccess.h | 65 ---------------------------------------- arch/x86/mm/extable.c | 12 -------- 4 files changed, 84 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h index cd339b88d5d4..0f63585edf5f 100644 --- a/arch/x86/include/asm/asm.h +++ b/arch/x86/include/asm/asm.h @@ -138,9 +138,6 @@ # define _ASM_EXTABLE_FAULT(from, to) \ _ASM_EXTABLE_HANDLE(from, to, ex_handler_fault) -# define _ASM_EXTABLE_EX(from, to) \ - _ASM_EXTABLE_HANDLE(from, to, ex_handler_ext) - # define _ASM_NOKPROBE(entry) \ .pushsection "_kprobe_blacklist","aw" ; \ _ASM_ALIGN ; \ @@ -166,9 +163,6 @@ # define _ASM_EXTABLE_FAULT(from, to) \ _ASM_EXTABLE_HANDLE(from, to, ex_handler_fault) -# define _ASM_EXTABLE_EX(from, to) \ - _ASM_EXTABLE_HANDLE(from, to, ex_handler_ext) - /* For C file, we already have NOKPROBE_SYMBOL macro */ #endif diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 09705ccc393c..19718fcfdd6a 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -541,7 +541,6 @@ struct thread_struct { mm_segment_t addr_limit; unsigned int sig_on_uaccess_err:1; - unsigned int uaccess_err:1; /* uaccess failed */ /* Floating point and extended processor state */ struct fpu fpu; diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 4dc5accdd826..d11662f753d2 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -193,23 +193,12 @@ __typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL)) : : "A" (x), "r" (addr) \ : : label) -#define __put_user_asm_ex_u64(x, addr) \ - asm volatile("\n" \ - "1: movl %%eax,0(%1)\n" \ - "2: movl %%edx,4(%1)\n" \ - "3:" \ - _ASM_EXTABLE_EX(1b, 2b) \ - _ASM_EXTABLE_EX(2b, 3b) \ - : : "A" (x), "r" (addr)) - #define __put_user_x8(x, ptr, __ret_pu) \ asm volatile("call __put_user_8" : "=a" (__ret_pu) \ : "A" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx") #else #define __put_user_goto_u64(x, ptr, label) \ __put_user_goto(x, ptr, "q", "", "er", label) -#define __put_user_asm_ex_u64(x, addr) \ - __put_user_asm_ex(x, addr, "q", "", "er") #define __put_user_x8(x, ptr, __ret_pu) __put_user_x(8, x, ptr, __ret_pu) #endif @@ -289,31 +278,6 @@ do { \ } \ } while (0) -/* - * This doesn't do __uaccess_begin/end - the exception handling - * around it must do that. - */ -#define __put_user_size_ex(x, ptr, size) \ -do { \ - __chk_user_ptr(ptr); \ - switch (size) { \ - case 1: \ - __put_user_asm_ex(x, ptr, "b", "b", "iq"); \ - break; \ - case 2: \ - __put_user_asm_ex(x, ptr, "w", "w", "ir"); \ - break; \ - case 4: \ - __put_user_asm_ex(x, ptr, "l", "k", "ir"); \ - break; \ - case 8: \ - __put_user_asm_ex_u64((__typeof__(*ptr))(x), ptr); \ - break; \ - default: \ - __put_user_bad(); \ - } \ -} while (0) - #ifdef CONFIG_X86_32 #define __get_user_asm_u64(x, ptr, retval, errret) \ ({ \ @@ -430,29 +394,6 @@ struct __large_struct { unsigned long buf[100]; }; retval = __put_user_failed(x, addr, itype, rtype, ltype, errret); \ } while (0) -#define __put_user_asm_ex(x, addr, itype, rtype, ltype) \ - asm volatile("1: mov"itype" %"rtype"0,%1\n" \ - "2:\n" \ - _ASM_EXTABLE_EX(1b, 2b) \ - : : ltype(x), "m" (__m(addr))) - -/* - * uaccess_try and catch - */ -#define uaccess_try do { \ - current->thread.uaccess_err = 0; \ - __uaccess_begin(); \ - barrier(); - -#define uaccess_try_nospec do { \ - current->thread.uaccess_err = 0; \ - __uaccess_begin_nospec(); \ - -#define uaccess_catch(err) \ - __uaccess_end(); \ - (err) |= (current->thread.uaccess_err ? -EFAULT : 0); \ -} while (0) - /** * __get_user - Get a simple variable from user space, with less checking. * @x: Variable to store result. @@ -502,12 +443,6 @@ struct __large_struct { unsigned long buf[100]; }; #define __put_user(x, ptr) \ __put_user_nocheck((__typeof__(*(ptr)))(x), (ptr), sizeof(*(ptr))) -#define put_user_try uaccess_try -#define put_user_catch(err) uaccess_catch(err) - -#define put_user_ex(x, ptr) \ - __put_user_size_ex((__typeof__(*(ptr)))(x), (ptr), sizeof(*(ptr))) - extern unsigned long copy_from_user_nmi(void *to, const void __user *from, unsigned long n); extern __must_check long diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c index 30bb0bd3b1b8..b991aa4bdfae 100644 --- a/arch/x86/mm/extable.c +++ b/arch/x86/mm/extable.c @@ -80,18 +80,6 @@ __visible bool ex_handler_uaccess(const struct exception_table_entry *fixup, } EXPORT_SYMBOL(ex_handler_uaccess); -__visible bool ex_handler_ext(const struct exception_table_entry *fixup, - struct pt_regs *regs, int trapnr, - unsigned long error_code, - unsigned long fault_addr) -{ - /* Special hack for uaccess_err */ - current->thread.uaccess_err = 1; - regs->ip = ex_fixup_addr(fixup); - return true; -} -EXPORT_SYMBOL(ex_handler_ext); - __visible bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup, struct pt_regs *regs, int trapnr, unsigned long error_code, -- cgit From 01bd18624d91cfe82e7df4bfe2f22814a20b993a Mon Sep 17 00:00:00 2001 From: Benjamin Thiel Date: Fri, 27 Mar 2020 08:26:21 +0100 Subject: x86/platform/uv: Add a missing prototype for uv_bau_message_interrupt() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ... in order to fix a -Wmissing-prototypes warning: arch/x86/platform/uv/tlb_uv.c:1275:6: warning: no previous prototype for ‘uv_bau_message_interrupt’ [-Wmissing-prototypes] \ void uv_bau_message_interrupt(struct pt_regs *regs) Signed-off-by: Benjamin Thiel Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200327072621.2255-1-b.thiel@posteo.de --- arch/x86/include/asm/uv/uv_bau.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/uv/uv_bau.h b/arch/x86/include/asm/uv/uv_bau.h index 7803114aa140..13687bf0e0a9 100644 --- a/arch/x86/include/asm/uv/uv_bau.h +++ b/arch/x86/include/asm/uv/uv_bau.h @@ -858,4 +858,6 @@ static inline int atomic_inc_unless_ge(spinlock_t *lock, atomic_t *v, int u) return 1; } +void uv_bau_message_interrupt(struct pt_regs *regs); + #endif /* _ASM_X86_UV_UV_BAU_H */ -- cgit From 4de4952c0abcb490039cdc946e49ed7f237285a2 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Thu, 26 Mar 2020 14:16:58 -0700 Subject: x86/jump_label: Move 'inline' keyword placement MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fix gcc warning when -Wextra is used by moving the keyword: arch/x86/kernel/jump_label.c:61:1: warning: ‘inline’ is not at \ beginning of declaration [-Wold-style-declaration] static void inline __jump_label_transform(struct jump_entry *entry, ^~~~~~ Reported-by: Zzy Wysm Signed-off-by: Randy Dunlap Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/796d93d2-e73e-3447-44eb-4f89e1b636d9@infradead.org --- arch/x86/kernel/jump_label.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c index 9c4498ea0b3c..5ba8477c2cb7 100644 --- a/arch/x86/kernel/jump_label.c +++ b/arch/x86/kernel/jump_label.c @@ -58,7 +58,7 @@ __jump_label_set_jump_code(struct jump_entry *entry, enum jump_label_type type, return code; } -static void inline __jump_label_transform(struct jump_entry *entry, +static inline void __jump_label_transform(struct jump_entry *entry, enum jump_label_type type, int init) { -- cgit From be98dc6e50433fdcacaf29ec682e85de0ead6748 Mon Sep 17 00:00:00 2001 From: Benjamin Thiel Date: Thu, 26 Mar 2020 14:58:42 +0100 Subject: x86/mm: Mark setup_emu2phys_nid() static Make function static because it is used only in this file. Signed-off-by: Benjamin Thiel Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200326135842.3875-1-b.thiel@posteo.de --- arch/x86/mm/numa_emulation.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c index 7f1d2034df1e..c5174b4e318b 100644 --- a/arch/x86/mm/numa_emulation.c +++ b/arch/x86/mm/numa_emulation.c @@ -324,7 +324,7 @@ static int __init split_nodes_size_interleave(struct numa_meminfo *ei, 0, NULL, NUMA_NO_NODE); } -int __init setup_emu2phys_nid(int *dfl_phys_nid) +static int __init setup_emu2phys_nid(int *dfl_phys_nid) { int i, max_emu_nid = 0; -- cgit From 5bacdc0982f2b343afa5adbb80517d3392a7e357 Mon Sep 17 00:00:00 2001 From: Benjamin Thiel Date: Fri, 27 Mar 2020 11:26:06 +0100 Subject: x86/mm/set_memory: Fix -Wmissing-prototypes warnings Add missing includes and move prototypes into the header set_memory.h in order to fix -Wmissing-prototypes warnings. [ bp: Add ifdeffery around arch_invalidate_pmem() ] Signed-off-by: Benjamin Thiel Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200320145028.6013-1-b.thiel@posteo.de --- arch/x86/include/asm/set_memory.h | 2 ++ arch/x86/mm/pat/set_memory.c | 3 +++ arch/x86/mm/pti.c | 8 +------- 3 files changed, 6 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h index 64c3dce374e5..950532ccbc4a 100644 --- a/arch/x86/include/asm/set_memory.h +++ b/arch/x86/include/asm/set_memory.h @@ -46,6 +46,8 @@ int set_memory_4k(unsigned long addr, int numpages); int set_memory_encrypted(unsigned long addr, int numpages); int set_memory_decrypted(unsigned long addr, int numpages); int set_memory_np_noalias(unsigned long addr, int numpages); +int set_memory_nonglobal(unsigned long addr, int numpages); +int set_memory_global(unsigned long addr, int numpages); int set_pages_array_uc(struct page **pages, int addrinarray); int set_pages_array_wc(struct page **pages, int addrinarray); diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index c4aedd00c1ba..6d5424069e2b 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -304,11 +305,13 @@ void clflush_cache_range(void *vaddr, unsigned int size) } EXPORT_SYMBOL_GPL(clflush_cache_range); +#ifdef CONFIG_ARCH_HAS_PMEM_API void arch_invalidate_pmem(void *addr, size_t size) { clflush_cache_range(addr, size); } EXPORT_SYMBOL_GPL(arch_invalidate_pmem); +#endif static void __cpa_flush_all(void *arg) { diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 44a9f068eee0..843aa10a4cb6 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -39,6 +39,7 @@ #include #include #include +#include #undef pr_fmt #define pr_fmt(fmt) "Kernel/User page tables isolation: " fmt @@ -554,13 +555,6 @@ static inline bool pti_kernel_image_global_ok(void) return true; } -/* - * This is the only user for these and it is not arch-generic - * like the other set_memory.h functions. Just extern them. - */ -extern int set_memory_nonglobal(unsigned long addr, int numpages); -extern int set_memory_global(unsigned long addr, int numpages); - /* * For some configurations, map all of kernel text into the user page * tables. This reduces TLB misses, especially on non-PCID systems. -- cgit From dbaba47085b0c2aa793ce849750164bd3765e163 Mon Sep 17 00:00:00 2001 From: Xiaoyao Li Date: Wed, 25 Mar 2020 11:09:23 +0800 Subject: x86/split_lock: Rework the initialization flow of split lock detection Current initialization flow of split lock detection has following issues: 1. It assumes the initial value of MSR_TEST_CTRL.SPLIT_LOCK_DETECT to be zero. However, it's possible that BIOS/firmware has set it. 2. X86_FEATURE_SPLIT_LOCK_DETECT flag is unconditionally set even if there is a virtualization flaw that FMS indicates the existence while it's actually not supported. Rework the initialization flow to solve above issues. In detail, explicitly clear and set split_lock_detect bit to verify MSR_TEST_CTRL can be accessed, and rdmsr after wrmsr to ensure bit is cleared/set successfully. X86_FEATURE_SPLIT_LOCK_DETECT flag is set only when the feature does exist and the feature is not disabled with kernel param "split_lock_detect=off" On each processor, explicitly updating the SPLIT_LOCK_DETECT bit based on sld_sate in split_lock_init() since BIOS/firmware may touch it. Originally-by: Thomas Gleixner Signed-off-by: Xiaoyao Li Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200325030924.132881-2-xiaoyao.li@intel.com --- arch/x86/kernel/cpu/intel.c | 75 +++++++++++++++++++++++++-------------------- 1 file changed, 42 insertions(+), 33 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index db3e745e5d47..0c859c91d008 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -44,7 +44,7 @@ enum split_lock_detect_state { * split_lock_setup() will switch this to sld_warn on systems that support * split lock detect, unless there is a command line override. */ -static enum split_lock_detect_state sld_state = sld_off; +static enum split_lock_detect_state sld_state __ro_after_init = sld_off; /* * Processors which have self-snooping capability can handle conflicting @@ -984,78 +984,87 @@ static inline bool match_option(const char *arg, int arglen, const char *opt) return len == arglen && !strncmp(arg, opt, len); } +static bool split_lock_verify_msr(bool on) +{ + u64 ctrl, tmp; + + if (rdmsrl_safe(MSR_TEST_CTRL, &ctrl)) + return false; + if (on) + ctrl |= MSR_TEST_CTRL_SPLIT_LOCK_DETECT; + else + ctrl &= ~MSR_TEST_CTRL_SPLIT_LOCK_DETECT; + if (wrmsrl_safe(MSR_TEST_CTRL, ctrl)) + return false; + rdmsrl(MSR_TEST_CTRL, tmp); + return ctrl == tmp; +} + static void __init split_lock_setup(void) { + enum split_lock_detect_state state = sld_warn; char arg[20]; int i, ret; - setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT); - sld_state = sld_warn; + if (!split_lock_verify_msr(false)) { + pr_info("MSR access failed: Disabled\n"); + return; + } ret = cmdline_find_option(boot_command_line, "split_lock_detect", arg, sizeof(arg)); if (ret >= 0) { for (i = 0; i < ARRAY_SIZE(sld_options); i++) { if (match_option(arg, ret, sld_options[i].option)) { - sld_state = sld_options[i].state; + state = sld_options[i].state; break; } } } - switch (sld_state) { + switch (state) { case sld_off: pr_info("disabled\n"); - break; - + return; case sld_warn: pr_info("warning about user-space split_locks\n"); break; - case sld_fatal: pr_info("sending SIGBUS on user-space split_locks\n"); break; } + + if (!split_lock_verify_msr(true)) { + pr_info("MSR access failed: Disabled\n"); + return; + } + + sld_state = state; + setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT); } /* - * Locking is not required at the moment because only bit 29 of this - * MSR is implemented and locking would not prevent that the operation - * of one thread is immediately undone by the sibling thread. - * Use the "safe" versions of rdmsr/wrmsr here because although code - * checks CPUID and MSR bits to make sure the TEST_CTRL MSR should - * exist, there may be glitches in virtualization that leave a guest - * with an incorrect view of real h/w capabilities. + * MSR_TEST_CTRL is per core, but we treat it like a per CPU MSR. Locking + * is not implemented as one thread could undo the setting of the other + * thread immediately after dropping the lock anyway. */ -static bool __sld_msr_set(bool on) +static void sld_update_msr(bool on) { u64 test_ctrl_val; - if (rdmsrl_safe(MSR_TEST_CTRL, &test_ctrl_val)) - return false; + rdmsrl(MSR_TEST_CTRL, test_ctrl_val); if (on) test_ctrl_val |= MSR_TEST_CTRL_SPLIT_LOCK_DETECT; else test_ctrl_val &= ~MSR_TEST_CTRL_SPLIT_LOCK_DETECT; - return !wrmsrl_safe(MSR_TEST_CTRL, test_ctrl_val); + wrmsrl(MSR_TEST_CTRL, test_ctrl_val); } static void split_lock_init(void) { - if (sld_state == sld_off) - return; - - if (__sld_msr_set(true)) - return; - - /* - * If this is anything other than the boot-cpu, you've done - * funny things and you get to keep whatever pieces. - */ - pr_warn("MSR fail -- disabled\n"); - sld_state = sld_off; + split_lock_verify_msr(sld_state != sld_off); } bool handle_user_split_lock(struct pt_regs *regs, long error_code) @@ -1071,7 +1080,7 @@ bool handle_user_split_lock(struct pt_regs *regs, long error_code) * progress and set TIF_SLD so the detection is re-enabled via * switch_to_sld() when the task is scheduled out. */ - __sld_msr_set(false); + sld_update_msr(false); set_tsk_thread_flag(current, TIF_SLD); return true; } @@ -1085,7 +1094,7 @@ bool handle_user_split_lock(struct pt_regs *regs, long error_code) */ void switch_to_sld(unsigned long tifn) { - __sld_msr_set(!(tifn & _TIF_SLD)); + sld_update_msr(!(tifn & _TIF_SLD)); } #define SPLIT_LOCK_CPU(model) {X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY} -- cgit From a6a60741035bb48ca8d9f92a138958818148064c Mon Sep 17 00:00:00 2001 From: Xiaoyao Li Date: Wed, 25 Mar 2020 11:09:24 +0800 Subject: x86/split_lock: Avoid runtime reads of the TEST_CTRL MSR In a context switch from a task that is detecting split locks to one that is not (or vice versa) we need to update the TEST_CTRL MSR. Currently this is done with the common sequence: read the MSR flip the bit write the MSR in order to avoid changing the value of any reserved bits in the MSR. Cache unused and reserved bits of TEST_CTRL MSR with SPLIT_LOCK_DETECT bit cleared during initialization, so we can avoid an expensive RDMSR instruction during context switch. Suggested-by: Sean Christopherson Originally-by: Tony Luck Signed-off-by: Xiaoyao Li Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200325030924.132881-3-xiaoyao.li@intel.com --- arch/x86/kernel/cpu/intel.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 0c859c91d008..9a26e972cdea 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -45,6 +45,7 @@ enum split_lock_detect_state { * split lock detect, unless there is a command line override. */ static enum split_lock_detect_state sld_state __ro_after_init = sld_off; +static u64 msr_test_ctrl_cache __ro_after_init; /* * Processors which have self-snooping capability can handle conflicting @@ -1034,6 +1035,8 @@ static void __init split_lock_setup(void) break; } + rdmsrl(MSR_TEST_CTRL, msr_test_ctrl_cache); + if (!split_lock_verify_msr(true)) { pr_info("MSR access failed: Disabled\n"); return; @@ -1050,14 +1053,10 @@ static void __init split_lock_setup(void) */ static void sld_update_msr(bool on) { - u64 test_ctrl_val; - - rdmsrl(MSR_TEST_CTRL, test_ctrl_val); + u64 test_ctrl_val = msr_test_ctrl_cache; if (on) test_ctrl_val |= MSR_TEST_CTRL_SPLIT_LOCK_DETECT; - else - test_ctrl_val &= ~MSR_TEST_CTRL_SPLIT_LOCK_DETECT; wrmsrl(MSR_TEST_CTRL, test_ctrl_val); } -- cgit From 84d5f77fc2ee4e010c2c037750e32f06e55224b0 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Thu, 26 Mar 2020 12:30:20 -0700 Subject: x86, vmlinux.lds: Add RUNTIME_DISCARD_EXIT to generic DISCARDS In the x86 kernel, .exit.text and .exit.data sections are discarded at runtime, not by the linker. Add RUNTIME_DISCARD_EXIT to generic DISCARDS and define it in the x86 kernel linker script to keep them. The sections are added before the DISCARD directive so document here only the situation explicitly as this change doesn't have any effect on the generated kernel. Also, other architectures like ARM64 will use it too so generalize the approach with the RUNTIME_DISCARD_EXIT define. [ bp: Massage and extend commit message. ] Signed-off-by: H.J. Lu Signed-off-by: Borislav Petkov Reviewed-by: Kees Cook Link: https://lkml.kernel.org/r/20200326193021.255002-1-hjl.tools@gmail.com --- arch/x86/kernel/vmlinux.lds.S | 1 + 1 file changed, 1 insertion(+) (limited to 'arch/x86') diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S index e3296aa028fe..7206e1ac23dd 100644 --- a/arch/x86/kernel/vmlinux.lds.S +++ b/arch/x86/kernel/vmlinux.lds.S @@ -21,6 +21,7 @@ #define LOAD_OFFSET __START_KERNEL_map #endif +#define RUNTIME_DISCARD_EXIT #define EMITS_PT_NOTE #define RO_EXCEPTION_TABLE_ALIGN 16 -- cgit From 4caffe6a28d3157c11cae923a40456a053c520ea Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Thu, 26 Mar 2020 10:43:14 -0700 Subject: x86/vdso: Discard .note.gnu.property sections in vDSO With the command-line option -mx86-used-note=yes which can also be enabled at binutils build time with: --enable-x86-used-note generate GNU x86 used ISA and feature properties the x86 assembler in binutils 2.32 and above generates a program property note in a note section, .note.gnu.property, to encode used x86 ISAs and features. But kernel linker script only contains a single NOTE segment: PHDRS { text PT_LOAD FLAGS(5) FILEHDR PHDRS; /* PF_R|PF_X */ dynamic PT_DYNAMIC FLAGS(4); /* PF_R */ note PT_NOTE FLAGS(4); /* PF_R */ eh_frame_hdr 0x6474e550; } The NOTE segment generated by the vDSO linker script is aligned to 4 bytes. But the .note.gnu.property section must be aligned to 8 bytes on x86-64: [hjl@gnu-skx-1 vdso]$ readelf -n vdso64.so Displaying notes found in: .note Owner Data size Description Linux 0x00000004 Unknown note type: (0x00000000) description data: 06 00 00 00 readelf: Warning: note with invalid namesz and/or descsz found at offset 0x20 readelf: Warning: type: 0x78, namesize: 0x00000100, descsize: 0x756e694c, alignment: 8 Since the note.gnu.property section in the vDSO is not checked by the dynamic linker, discard the .note.gnu.property sections in the vDSO. [ bp: Massage. ] Signed-off-by: H.J. Lu Signed-off-by: Borislav Petkov Reviewed-by: Kees Cook Link: https://lkml.kernel.org/r/20200326174314.254662-1-hjl.tools@gmail.com --- arch/x86/entry/vdso/vdso-layout.lds.S | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S index ea7e0155c604..4d152933547d 100644 --- a/arch/x86/entry/vdso/vdso-layout.lds.S +++ b/arch/x86/entry/vdso/vdso-layout.lds.S @@ -57,6 +57,13 @@ SECTIONS *(.gnu.linkonce.b.*) } :text + /* + * Discard .note.gnu.property sections which are unused and have + * different alignment requirement from vDSO note sections. + */ + /DISCARD/ : { + *(.note.gnu.property) + } .note : { *(.note.*) } :text :note .eh_frame_hdr : { *(.eh_frame_hdr) } :text :eh_frame_hdr -- cgit From a08971e9488d12a10a46eb433612229767b61fd5 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sun, 16 Feb 2020 10:17:27 -0500 Subject: futex: arch_futex_atomic_op_inuser() calling conventions change Move access_ok() in and pagefault_enable()/pagefault_disable() out. Mechanical conversion only - some instances don't really need a separate access_ok() at all (e.g. the ones only using get_user()/put_user(), or architectures where access_ok() is always true); we'll deal with that in followups. Signed-off-by: Al Viro --- arch/x86/include/asm/futex.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/futex.h b/arch/x86/include/asm/futex.h index 13c83fe97988..6bcd1c1486d9 100644 --- a/arch/x86/include/asm/futex.h +++ b/arch/x86/include/asm/futex.h @@ -47,7 +47,8 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, { int oldval = 0, ret, tem; - pagefault_disable(); + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; switch (op) { case FUTEX_OP_SET: @@ -70,8 +71,6 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; -- cgit From 0ec33c0171a138bedc1a39f4fd70455416dca926 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sun, 16 Feb 2020 13:10:42 -0500 Subject: x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end() Lift stac/clac pairs from __futex_atomic_op{1,2} into arch_futex_atomic_op_inuser(), fold them with access_ok() in there. The switch in arch_futex_atomic_op_inuser() is what has required the previous (objtool) commit... Signed-off-by: Al Viro --- arch/x86/include/asm/futex.h | 62 +++++++++++++++++++++++++------------------- 1 file changed, 36 insertions(+), 26 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/futex.h b/arch/x86/include/asm/futex.h index 6bcd1c1486d9..53c07ab63762 100644 --- a/arch/x86/include/asm/futex.h +++ b/arch/x86/include/asm/futex.h @@ -12,26 +12,33 @@ #include #include -#define __futex_atomic_op1(insn, ret, oldval, uaddr, oparg) \ - asm volatile("\t" ASM_STAC "\n" \ - "1:\t" insn "\n" \ - "2:\t" ASM_CLAC "\n" \ +#define unsafe_atomic_op1(insn, oval, uaddr, oparg, label) \ +do { \ + int oldval = 0, ret; \ + asm volatile("1:\t" insn "\n" \ + "2:\n" \ "\t.section .fixup,\"ax\"\n" \ "3:\tmov\t%3, %1\n" \ "\tjmp\t2b\n" \ "\t.previous\n" \ _ASM_EXTABLE_UA(1b, 3b) \ : "=r" (oldval), "=r" (ret), "+m" (*uaddr) \ - : "i" (-EFAULT), "0" (oparg), "1" (0)) + : "i" (-EFAULT), "0" (oparg), "1" (0)); \ + if (ret) \ + goto label; \ + *oval = oldval; \ +} while(0) -#define __futex_atomic_op2(insn, ret, oldval, uaddr, oparg) \ - asm volatile("\t" ASM_STAC "\n" \ - "1:\tmovl %2, %0\n" \ + +#define unsafe_atomic_op2(insn, oval, uaddr, oparg, label) \ +do { \ + int oldval = 0, ret, tem; \ + asm volatile("1:\tmovl %2, %0\n" \ "\tmovl\t%0, %3\n" \ "\t" insn "\n" \ "2:\t" LOCK_PREFIX "cmpxchgl %3, %2\n" \ "\tjnz\t1b\n" \ - "3:\t" ASM_CLAC "\n" \ + "3:\n" \ "\t.section .fixup,\"ax\"\n" \ "4:\tmov\t%5, %1\n" \ "\tjmp\t3b\n" \ @@ -40,41 +47,44 @@ _ASM_EXTABLE_UA(2b, 4b) \ : "=&a" (oldval), "=&r" (ret), \ "+m" (*uaddr), "=&r" (tem) \ - : "r" (oparg), "i" (-EFAULT), "1" (0)) + : "r" (oparg), "i" (-EFAULT), "1" (0)); \ + if (ret) \ + goto label; \ + *oval = oldval; \ +} while(0) -static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, +static __always_inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) { - int oldval = 0, ret, tem; - - if (!access_ok(uaddr, sizeof(u32))) + if (!user_access_begin(uaddr, sizeof(u32))) return -EFAULT; switch (op) { case FUTEX_OP_SET: - __futex_atomic_op1("xchgl %0, %2", ret, oldval, uaddr, oparg); + unsafe_atomic_op1("xchgl %0, %2", oval, uaddr, oparg, Efault); break; case FUTEX_OP_ADD: - __futex_atomic_op1(LOCK_PREFIX "xaddl %0, %2", ret, oldval, - uaddr, oparg); + unsafe_atomic_op1(LOCK_PREFIX "xaddl %0, %2", oval, + uaddr, oparg, Efault); break; case FUTEX_OP_OR: - __futex_atomic_op2("orl %4, %3", ret, oldval, uaddr, oparg); + unsafe_atomic_op2("orl %4, %3", oval, uaddr, oparg, Efault); break; case FUTEX_OP_ANDN: - __futex_atomic_op2("andl %4, %3", ret, oldval, uaddr, ~oparg); + unsafe_atomic_op2("andl %4, %3", oval, uaddr, ~oparg, Efault); break; case FUTEX_OP_XOR: - __futex_atomic_op2("xorl %4, %3", ret, oldval, uaddr, oparg); + unsafe_atomic_op2("xorl %4, %3", oval, uaddr, oparg, Efault); break; default: - ret = -ENOSYS; + user_access_end(); + return -ENOSYS; } - - if (!ret) - *oval = oldval; - - return ret; + user_access_end(); + return 0; +Efault: + user_access_end(); + return -EFAULT; } static inline int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, -- cgit From 8aef36dacb3a932cb77da8bb7eb21a154babd0b8 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Thu, 26 Mar 2020 17:34:40 -0400 Subject: x86: don't reload after cmpxchg in unsafe_atomic_op2() loop lock cmpxchg leaves the current value in eax; no need to reload it. Signed-off-by: Al Viro --- arch/x86/include/asm/futex.h | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/futex.h b/arch/x86/include/asm/futex.h index 53c07ab63762..5ff7626a333d 100644 --- a/arch/x86/include/asm/futex.h +++ b/arch/x86/include/asm/futex.h @@ -34,17 +34,17 @@ do { \ do { \ int oldval = 0, ret, tem; \ asm volatile("1:\tmovl %2, %0\n" \ - "\tmovl\t%0, %3\n" \ + "2:\tmovl\t%0, %3\n" \ "\t" insn "\n" \ - "2:\t" LOCK_PREFIX "cmpxchgl %3, %2\n" \ - "\tjnz\t1b\n" \ - "3:\n" \ + "3:\t" LOCK_PREFIX "cmpxchgl %3, %2\n" \ + "\tjnz\t2b\n" \ + "4:\n" \ "\t.section .fixup,\"ax\"\n" \ - "4:\tmov\t%5, %1\n" \ - "\tjmp\t3b\n" \ + "5:\tmov\t%5, %1\n" \ + "\tjmp\t4b\n" \ "\t.previous\n" \ - _ASM_EXTABLE_UA(1b, 4b) \ - _ASM_EXTABLE_UA(2b, 4b) \ + _ASM_EXTABLE_UA(1b, 5b) \ + _ASM_EXTABLE_UA(3b, 5b) \ : "=&a" (oldval), "=&r" (ret), \ "+m" (*uaddr), "=&r" (tem) \ : "r" (oparg), "i" (-EFAULT), "1" (0)); \ -- cgit From f5544ba712afd1b01dd856c7eecfb5d30beaf920 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Thu, 19 Mar 2020 22:23:48 -0400 Subject: x86: get rid of user_atomic_cmpxchg_inatomic() Only one user left; the thing had been made polymorphic back in 2013 for the sake of MPX. No point keeping it now that MPX is gone. Convert futex_atomic_cmpxchg_inatomic() to user_access_{begin,end}() while we are at it. Signed-off-by: Al Viro --- arch/x86/include/asm/futex.h | 20 ++++++++- arch/x86/include/asm/uaccess.h | 93 ------------------------------------------ 2 files changed, 19 insertions(+), 94 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/futex.h b/arch/x86/include/asm/futex.h index 5ff7626a333d..f9c00110a69a 100644 --- a/arch/x86/include/asm/futex.h +++ b/arch/x86/include/asm/futex.h @@ -90,7 +90,25 @@ Efault: static inline int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, u32 oldval, u32 newval) { - return user_atomic_cmpxchg_inatomic(uval, uaddr, oldval, newval); + int ret = 0; + + if (!user_access_begin(uaddr, sizeof(u32))) + return -EFAULT; + asm volatile("\n" + "1:\t" LOCK_PREFIX "cmpxchgl %4, %2\n" + "2:\n" + "\t.section .fixup, \"ax\"\n" + "3:\tmov %3, %0\n" + "\tjmp 2b\n" + "\t.previous\n" + _ASM_EXTABLE_UA(1b, 3b) + : "+r" (ret), "=a" (oldval), "+m" (*uaddr) + : "i" (-EFAULT), "r" (newval), "1" (oldval) + : "memory" + ); + user_access_end(); + *uval = oldval; + return ret; } #endif diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 61d93f062a36..ea6fc643ccfe 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -584,99 +584,6 @@ extern __must_check long strnlen_user(const char __user *str, long n); unsigned long __must_check clear_user(void __user *mem, unsigned long len); unsigned long __must_check __clear_user(void __user *mem, unsigned long len); -extern void __cmpxchg_wrong_size(void) - __compiletime_error("Bad argument size for cmpxchg"); - -#define __user_atomic_cmpxchg_inatomic(uval, ptr, old, new, size) \ -({ \ - int __ret = 0; \ - __typeof__(*(ptr)) __old = (old); \ - __typeof__(*(ptr)) __new = (new); \ - __uaccess_begin_nospec(); \ - switch (size) { \ - case 1: \ - { \ - asm volatile("\n" \ - "1:\t" LOCK_PREFIX "cmpxchgb %4, %2\n" \ - "2:\n" \ - "\t.section .fixup, \"ax\"\n" \ - "3:\tmov %3, %0\n" \ - "\tjmp 2b\n" \ - "\t.previous\n" \ - _ASM_EXTABLE_UA(1b, 3b) \ - : "+r" (__ret), "=a" (__old), "+m" (*(ptr)) \ - : "i" (-EFAULT), "q" (__new), "1" (__old) \ - : "memory" \ - ); \ - break; \ - } \ - case 2: \ - { \ - asm volatile("\n" \ - "1:\t" LOCK_PREFIX "cmpxchgw %4, %2\n" \ - "2:\n" \ - "\t.section .fixup, \"ax\"\n" \ - "3:\tmov %3, %0\n" \ - "\tjmp 2b\n" \ - "\t.previous\n" \ - _ASM_EXTABLE_UA(1b, 3b) \ - : "+r" (__ret), "=a" (__old), "+m" (*(ptr)) \ - : "i" (-EFAULT), "r" (__new), "1" (__old) \ - : "memory" \ - ); \ - break; \ - } \ - case 4: \ - { \ - asm volatile("\n" \ - "1:\t" LOCK_PREFIX "cmpxchgl %4, %2\n" \ - "2:\n" \ - "\t.section .fixup, \"ax\"\n" \ - "3:\tmov %3, %0\n" \ - "\tjmp 2b\n" \ - "\t.previous\n" \ - _ASM_EXTABLE_UA(1b, 3b) \ - : "+r" (__ret), "=a" (__old), "+m" (*(ptr)) \ - : "i" (-EFAULT), "r" (__new), "1" (__old) \ - : "memory" \ - ); \ - break; \ - } \ - case 8: \ - { \ - if (!IS_ENABLED(CONFIG_X86_64)) \ - __cmpxchg_wrong_size(); \ - \ - asm volatile("\n" \ - "1:\t" LOCK_PREFIX "cmpxchgq %4, %2\n" \ - "2:\n" \ - "\t.section .fixup, \"ax\"\n" \ - "3:\tmov %3, %0\n" \ - "\tjmp 2b\n" \ - "\t.previous\n" \ - _ASM_EXTABLE_UA(1b, 3b) \ - : "+r" (__ret), "=a" (__old), "+m" (*(ptr)) \ - : "i" (-EFAULT), "r" (__new), "1" (__old) \ - : "memory" \ - ); \ - break; \ - } \ - default: \ - __cmpxchg_wrong_size(); \ - } \ - __uaccess_end(); \ - *(uval) = __old; \ - __ret; \ -}) - -#define user_atomic_cmpxchg_inatomic(uval, ptr, old, new) \ -({ \ - access_ok((ptr), sizeof(*(ptr))) ? \ - __user_atomic_cmpxchg_inatomic((uval), (ptr), \ - (old), (new), sizeof(*(ptr))) : \ - -EFAULT; \ -}) - /* * movsl can be slow when source and dest are not both 8-byte aligned */ -- cgit From c90beea22a2bece4b0bbb39789bf835504421594 Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Thu, 19 Mar 2020 10:13:07 +0100 Subject: x86/boot/compressed: Fix debug_puthex() parameter type In the CONFIG_X86_VERBOSE_BOOTUP=Y case, the debug_puthex() macro just turns into __puthex(), which takes 'unsigned long' as parameter. But in the CONFIG_X86_VERBOSE_BOOTUP=N case, it is a function which takes 'unsigned char *', causing compile warnings when the function is used. Fix the parameter type to get rid of the warnings. Signed-off-by: Joerg Roedel Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20200319091407.1481-11-joro@8bytes.org --- arch/x86/boot/compressed/misc.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h index c8181392f70d..726e264410ff 100644 --- a/arch/x86/boot/compressed/misc.h +++ b/arch/x86/boot/compressed/misc.h @@ -59,7 +59,7 @@ void __puthex(unsigned long value); static inline void debug_putstr(const char *s) { } -static inline void debug_puthex(const char *s) +static inline void debug_puthex(unsigned long value) { } #define debug_putaddr(x) /* */ -- cgit From 5bef0a153bf29150357ff60283315a933f05c994 Mon Sep 17 00:00:00 2001 From: Johannes Berg Date: Thu, 13 Feb 2020 14:26:49 +0100 Subject: um: Implement cpu_relax() as ndelay(1) for time-travel In time-travel mode, cpu_relax() currently does actual CPU relax, but that doesn't affect the simulation. Ideally, we wouldn't run anything that uses it in simulation, but if we actually have virtio devices combined with the same simulation it's possible. Implement cpu_relax() as ndelay(1) in this case, using time_travel_ndelay(1) directly to catch errors if this is used erroneously in builds that don't set CONFIG_UML_TIME_TRAVEL_SUPPORT. While at it, convert it to an __always_inline and also add that to rep_nop() like the original does now. Signed-off-by: Johannes Berg Signed-off-by: Richard Weinberger --- arch/x86/um/asm/processor.h | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/um/asm/processor.h b/arch/x86/um/asm/processor.h index 593d5f3902bd..478710384b34 100644 --- a/arch/x86/um/asm/processor.h +++ b/arch/x86/um/asm/processor.h @@ -1,6 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef __UM_PROCESSOR_H #define __UM_PROCESSOR_H +#include /* include faultinfo structure */ #include @@ -21,12 +22,19 @@ #include /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */ -static inline void rep_nop(void) +static __always_inline void rep_nop(void) { __asm__ __volatile__("rep;nop": : :"memory"); } -#define cpu_relax() rep_nop() +static __always_inline void cpu_relax(void) +{ + if (time_travel_mode == TT_MODE_INFCPU || + time_travel_mode == TT_MODE_EXTERNAL) + time_travel_ndelay(1); + else + rep_nop(); +} #define task_pt_regs(t) (&(t)->thread.regs) -- cgit From 2f62f36e62daec43aa7b9633ef7f18e042a80bed Mon Sep 17 00:00:00 2001 From: Miroslav Benes Date: Thu, 26 Mar 2020 10:26:02 +0100 Subject: x86/xen: Make the boot CPU idle task reliable The unwinder reports the boot CPU idle task's stack on XEN PV as unreliable, which affects at least live patching. There are two reasons for this. First, the task does not follow the x86 convention that its stack starts at the offset right below saved pt_regs. It allows the unwinder to easily detect the end of the stack and verify it. Second, startup_xen() function does not store the return address before jumping to xen_start_kernel() which confuses the unwinder. Amend both issues by moving the starting point of initial stack in startup_xen() and storing the return address before the jump, which is exactly what call instruction does. Signed-off-by: Miroslav Benes Reviewed-by: Juergen Gross Signed-off-by: Juergen Gross --- arch/x86/xen/xen-head.S | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S index 1d0cee3163e4..d63806e1ff7a 100644 --- a/arch/x86/xen/xen-head.S +++ b/arch/x86/xen/xen-head.S @@ -35,7 +35,11 @@ SYM_CODE_START(startup_xen) rep __ASM_SIZE(stos) mov %_ASM_SI, xen_start_info - mov $init_thread_union+THREAD_SIZE, %_ASM_SP +#ifdef CONFIG_X86_64 + mov initial_stack(%rip), %rsp +#else + mov pa(initial_stack), %esp +#endif #ifdef CONFIG_X86_64 /* Set up %gs. @@ -51,7 +55,7 @@ SYM_CODE_START(startup_xen) wrmsr #endif - jmp xen_start_kernel + call xen_start_kernel SYM_CODE_END(startup_xen) __FINIT #endif -- cgit From c3881eb58d56116c79ac4ee4f40fd15ead124c4b Mon Sep 17 00:00:00 2001 From: Miroslav Benes Date: Thu, 26 Mar 2020 10:26:03 +0100 Subject: x86/xen: Make the secondary CPU idle tasks reliable The unwinder reports the secondary CPU idle tasks' stack on XEN PV as unreliable, which affects at least live patching. cpu_initialize_context() sets up the context of the CPU through VCPUOP_initialise hypercall. After it is woken up, the idle task starts in cpu_bringup_and_idle() function and its stack starts at the offset right below pt_regs. The unwinder correctly detects the end of stack there but it is confused by NULL return address in the last frame. Introduce a wrapper in assembly, which just calls cpu_bringup_and_idle(). The return address is thus pushed on the stack and the wrapper contains the annotation hint for the unwinder regarding the stack state. Signed-off-by: Miroslav Benes Reviewed-by: Juergen Gross Signed-off-by: Juergen Gross --- arch/x86/xen/smp_pv.c | 3 ++- arch/x86/xen/xen-head.S | 10 ++++++++++ 2 files changed, 12 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/xen/smp_pv.c b/arch/x86/xen/smp_pv.c index 802ee5bba66c..8fb8a50a28b4 100644 --- a/arch/x86/xen/smp_pv.c +++ b/arch/x86/xen/smp_pv.c @@ -53,6 +53,7 @@ static DEFINE_PER_CPU(struct xen_common_irq, xen_irq_work) = { .irq = -1 }; static DEFINE_PER_CPU(struct xen_common_irq, xen_pmu_irq) = { .irq = -1 }; static irqreturn_t xen_irq_work_interrupt(int irq, void *dev_id); +void asm_cpu_bringup_and_idle(void); static void cpu_bringup(void) { @@ -309,7 +310,7 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle) * pointing just below where pt_regs would be if it were a normal * kernel entry. */ - ctxt->user_regs.eip = (unsigned long)cpu_bringup_and_idle; + ctxt->user_regs.eip = (unsigned long)asm_cpu_bringup_and_idle; ctxt->flags = VGCF_IN_KERNEL; ctxt->user_regs.eflags = 0x1000; /* IOPL_RING1 */ ctxt->user_regs.ds = __USER_DS; diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S index d63806e1ff7a..7d1c4fcbe8f7 100644 --- a/arch/x86/xen/xen-head.S +++ b/arch/x86/xen/xen-head.S @@ -58,6 +58,16 @@ SYM_CODE_START(startup_xen) call xen_start_kernel SYM_CODE_END(startup_xen) __FINIT + +#ifdef CONFIG_XEN_PV_SMP +.pushsection .text +SYM_CODE_START(asm_cpu_bringup_and_idle) + UNWIND_HINT_EMPTY + + call cpu_bringup_and_idle +SYM_CODE_END(asm_cpu_bringup_and_idle) +.popsection +#endif #endif .pushsection .text -- cgit From b990408537388e9174b642ad36cdef6c47c64d3a Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 21 Mar 2020 13:25:55 -0700 Subject: KVM: Pass kvm_init()'s opaque param to additional arch funcs Pass @opaque to kvm_arch_hardware_setup() and kvm_arch_check_processor_compat() to allow architecture specific code to reference @opaque without having to stash it away in a temporary global variable. This will enable x86 to separate its vendor specific callback ops, which are passed via @opaque, into "init" and "runtime" ops without having to stash away the "init" ops. No functional change intended. Reviewed-by: Cornelia Huck Tested-by: Cornelia Huck #s390 Acked-by: Marc Zyngier Signed-off-by: Sean Christopherson Message-Id: <20200321202603.19355-2-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1b6d9ac9533c..a656fba5bd60 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9626,7 +9626,7 @@ void kvm_arch_hardware_disable(void) drop_user_return_notifiers(); } -int kvm_arch_hardware_setup(void) +int kvm_arch_hardware_setup(void *opaque) { int r; @@ -9667,7 +9667,7 @@ void kvm_arch_hardware_unsetup(void) kvm_x86_ops->hardware_unsetup(); } -int kvm_arch_check_processor_compat(void) +int kvm_arch_check_processor_compat(void *opaque) { struct cpuinfo_x86 *c = &cpu_data(smp_processor_id()); -- cgit From d008dfdb0e7012ddff5bd6c0d2abd3b8ec6e77f5 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 21 Mar 2020 13:25:56 -0700 Subject: KVM: x86: Move init-only kvm_x86_ops to separate struct Move the kvm_x86_ops functions that are used only within the scope of kvm_init() into a separate struct, kvm_x86_init_ops. In addition to identifying the init-only functions without restorting to code comments, this also sets the stage for waiting until after ->hardware_setup() to set kvm_x86_ops. Setting kvm_x86_ops after ->hardware_setup() is desirable as many of the hooks are not usable until ->hardware_setup() completes. No functional change intended. Signed-off-by: Sean Christopherson Message-Id: <20200321202603.19355-3-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 13 +++++++++---- arch/x86/kvm/svm.c | 15 ++++++++++----- arch/x86/kvm/vmx/vmx.c | 16 +++++++++++----- arch/x86/kvm/x86.c | 10 ++++++---- 4 files changed, 36 insertions(+), 18 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 9a183e9d4cb1..f4c5b49299ff 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1054,12 +1054,8 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical) } struct kvm_x86_ops { - int (*cpu_has_kvm_support)(void); /* __init */ - int (*disabled_by_bios)(void); /* __init */ int (*hardware_enable)(void); void (*hardware_disable)(void); - int (*check_processor_compatibility)(void);/* __init */ - int (*hardware_setup)(void); /* __init */ void (*hardware_unsetup)(void); /* __exit */ bool (*cpu_has_accelerated_tpr)(void); bool (*has_emulated_msr)(int index); @@ -1260,6 +1256,15 @@ struct kvm_x86_ops { int (*enable_direct_tlbflush)(struct kvm_vcpu *vcpu); }; +struct kvm_x86_init_ops { + int (*cpu_has_kvm_support)(void); + int (*disabled_by_bios)(void); + int (*check_processor_compatibility)(void); + int (*hardware_setup)(void); + + struct kvm_x86_ops *runtime_ops; +}; + struct kvm_arch_async_pf { u32 token; gfn_t gfn; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 05cb45bc0e08..589debab9a3a 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -7354,11 +7354,7 @@ static void svm_pre_update_apicv_exec_ctrl(struct kvm *kvm, bool activate) } static struct kvm_x86_ops svm_x86_ops __ro_after_init = { - .cpu_has_kvm_support = has_svm, - .disabled_by_bios = is_disabled, - .hardware_setup = svm_hardware_setup, .hardware_unsetup = svm_hardware_teardown, - .check_processor_compatibility = svm_check_processor_compat, .hardware_enable = svm_hardware_enable, .hardware_disable = svm_hardware_disable, .cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr, @@ -7483,9 +7479,18 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .check_nested_events = svm_check_nested_events, }; +static struct kvm_x86_init_ops svm_init_ops __initdata = { + .cpu_has_kvm_support = has_svm, + .disabled_by_bios = is_disabled, + .hardware_setup = svm_hardware_setup, + .check_processor_compatibility = svm_check_processor_compat, + + .runtime_ops = &svm_x86_ops, +}; + static int __init svm_init(void) { - return kvm_init(&svm_x86_ops, sizeof(struct vcpu_svm), + return kvm_init(&svm_init_ops, sizeof(struct vcpu_svm), __alignof__(struct vcpu_svm), THIS_MODULE); } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index a7dd67859bd4..d484ec1af971 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7836,11 +7836,8 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit) } static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { - .cpu_has_kvm_support = cpu_has_kvm_support, - .disabled_by_bios = vmx_disabled_by_bios, - .hardware_setup = hardware_setup, .hardware_unsetup = hardware_unsetup, - .check_processor_compatibility = vmx_check_processor_compat, + .hardware_enable = hardware_enable, .hardware_disable = hardware_disable, .cpu_has_accelerated_tpr = report_flexpriority, @@ -7975,6 +7972,15 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .apic_init_signal_blocked = vmx_apic_init_signal_blocked, }; +static struct kvm_x86_init_ops vmx_init_ops __initdata = { + .cpu_has_kvm_support = cpu_has_kvm_support, + .disabled_by_bios = vmx_disabled_by_bios, + .check_processor_compatibility = vmx_check_processor_compat, + .hardware_setup = hardware_setup, + + .runtime_ops = &vmx_x86_ops, +}; + static void vmx_cleanup_l1d_flush(void) { if (vmx_l1d_flush_pages) { @@ -8059,7 +8065,7 @@ static int __init vmx_init(void) } #endif - r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx), + r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx), __alignof__(struct vcpu_vmx), THIS_MODULE); if (r) return r; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a656fba5bd60..283ef4d919f8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7303,8 +7303,8 @@ static struct notifier_block pvclock_gtod_notifier = { int kvm_arch_init(void *opaque) { + struct kvm_x86_init_ops *ops = opaque; int r; - struct kvm_x86_ops *ops = opaque; if (kvm_x86_ops) { printk(KERN_ERR "kvm: already loaded the other module\n"); @@ -7359,7 +7359,7 @@ int kvm_arch_init(void *opaque) if (r) goto out_free_percpu; - kvm_x86_ops = ops; + kvm_x86_ops = ops->runtime_ops; kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK, PT_DIRTY_MASK, PT64_NX_MASK, 0, @@ -9628,6 +9628,7 @@ void kvm_arch_hardware_disable(void) int kvm_arch_hardware_setup(void *opaque) { + struct kvm_x86_init_ops *ops = opaque; int r; rdmsrl_safe(MSR_EFER, &host_efer); @@ -9635,7 +9636,7 @@ int kvm_arch_hardware_setup(void *opaque) if (boot_cpu_has(X86_FEATURE_XSAVES)) rdmsrl(MSR_IA32_XSS, host_xss); - r = kvm_x86_ops->hardware_setup(); + r = ops->hardware_setup(); if (r != 0) return r; @@ -9670,13 +9671,14 @@ void kvm_arch_hardware_unsetup(void) int kvm_arch_check_processor_compat(void *opaque) { struct cpuinfo_x86 *c = &cpu_data(smp_processor_id()); + struct kvm_x86_init_ops *ops = opaque; WARN_ON(!irqs_disabled()); if (kvm_host_cr4_reserved_bits(c) != cr4_reserved_bits) return -EIO; - return kvm_x86_ops->check_processor_compatibility(); + return ops->check_processor_compatibility(); } bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu) -- cgit From 484014faf89eb0f3585d7323e4aef34a3cd92ac6 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 21 Mar 2020 13:25:57 -0700 Subject: KVM: VMX: Move hardware_setup() definition below vmx_x86_ops Move VMX's hardware_setup() below its vmx_x86_ops definition so that a future patch can refactor hardware_setup() to modify vmx_x86_ops directly instead of indirectly modifying the ops via the global kvm_x86_ops. No functional change intended. Signed-off-by: Sean Christopherson Message-Id: <20200321202603.19355-4-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 306 ++++++++++++++++++++++++------------------------- 1 file changed, 153 insertions(+), 153 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index d484ec1af971..2a3958ea9735 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7646,6 +7646,159 @@ static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu) return to_vmx(vcpu)->nested.vmxon; } +static __exit void hardware_unsetup(void) +{ + if (nested) + nested_vmx_hardware_unsetup(); + + free_kvm_area(); +} + +static bool vmx_check_apicv_inhibit_reasons(ulong bit) +{ + ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) | + BIT(APICV_INHIBIT_REASON_HYPERV); + + return supported & BIT(bit); +} + +static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { + .hardware_unsetup = hardware_unsetup, + + .hardware_enable = hardware_enable, + .hardware_disable = hardware_disable, + .cpu_has_accelerated_tpr = report_flexpriority, + .has_emulated_msr = vmx_has_emulated_msr, + + .vm_size = sizeof(struct kvm_vmx), + .vm_init = vmx_vm_init, + + .vcpu_create = vmx_create_vcpu, + .vcpu_free = vmx_free_vcpu, + .vcpu_reset = vmx_vcpu_reset, + + .prepare_guest_switch = vmx_prepare_switch_to_guest, + .vcpu_load = vmx_vcpu_load, + .vcpu_put = vmx_vcpu_put, + + .update_bp_intercept = update_exception_bitmap, + .get_msr_feature = vmx_get_msr_feature, + .get_msr = vmx_get_msr, + .set_msr = vmx_set_msr, + .get_segment_base = vmx_get_segment_base, + .get_segment = vmx_get_segment, + .set_segment = vmx_set_segment, + .get_cpl = vmx_get_cpl, + .get_cs_db_l_bits = vmx_get_cs_db_l_bits, + .decache_cr0_guest_bits = vmx_decache_cr0_guest_bits, + .decache_cr4_guest_bits = vmx_decache_cr4_guest_bits, + .set_cr0 = vmx_set_cr0, + .set_cr4 = vmx_set_cr4, + .set_efer = vmx_set_efer, + .get_idt = vmx_get_idt, + .set_idt = vmx_set_idt, + .get_gdt = vmx_get_gdt, + .set_gdt = vmx_set_gdt, + .get_dr6 = vmx_get_dr6, + .set_dr6 = vmx_set_dr6, + .set_dr7 = vmx_set_dr7, + .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs, + .cache_reg = vmx_cache_reg, + .get_rflags = vmx_get_rflags, + .set_rflags = vmx_set_rflags, + + .tlb_flush = vmx_flush_tlb, + .tlb_flush_gva = vmx_flush_tlb_gva, + + .run = vmx_vcpu_run, + .handle_exit = vmx_handle_exit, + .skip_emulated_instruction = vmx_skip_emulated_instruction, + .update_emulated_instruction = vmx_update_emulated_instruction, + .set_interrupt_shadow = vmx_set_interrupt_shadow, + .get_interrupt_shadow = vmx_get_interrupt_shadow, + .patch_hypercall = vmx_patch_hypercall, + .set_irq = vmx_inject_irq, + .set_nmi = vmx_inject_nmi, + .queue_exception = vmx_queue_exception, + .cancel_injection = vmx_cancel_injection, + .interrupt_allowed = vmx_interrupt_allowed, + .nmi_allowed = vmx_nmi_allowed, + .get_nmi_mask = vmx_get_nmi_mask, + .set_nmi_mask = vmx_set_nmi_mask, + .enable_nmi_window = enable_nmi_window, + .enable_irq_window = enable_irq_window, + .update_cr8_intercept = update_cr8_intercept, + .set_virtual_apic_mode = vmx_set_virtual_apic_mode, + .set_apic_access_page_addr = vmx_set_apic_access_page_addr, + .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl, + .load_eoi_exitmap = vmx_load_eoi_exitmap, + .apicv_post_state_restore = vmx_apicv_post_state_restore, + .check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons, + .hwapic_irr_update = vmx_hwapic_irr_update, + .hwapic_isr_update = vmx_hwapic_isr_update, + .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt, + .sync_pir_to_irr = vmx_sync_pir_to_irr, + .deliver_posted_interrupt = vmx_deliver_posted_interrupt, + .dy_apicv_has_pending_interrupt = vmx_dy_apicv_has_pending_interrupt, + + .set_tss_addr = vmx_set_tss_addr, + .set_identity_map_addr = vmx_set_identity_map_addr, + .get_tdp_level = get_ept_level, + .get_mt_mask = vmx_get_mt_mask, + + .get_exit_info = vmx_get_exit_info, + + .cpuid_update = vmx_cpuid_update, + + .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, + + .read_l1_tsc_offset = vmx_read_l1_tsc_offset, + .write_l1_tsc_offset = vmx_write_l1_tsc_offset, + + .load_mmu_pgd = vmx_load_mmu_pgd, + + .check_intercept = vmx_check_intercept, + .handle_exit_irqoff = vmx_handle_exit_irqoff, + + .request_immediate_exit = vmx_request_immediate_exit, + + .sched_in = vmx_sched_in, + + .slot_enable_log_dirty = vmx_slot_enable_log_dirty, + .slot_disable_log_dirty = vmx_slot_disable_log_dirty, + .flush_log_dirty = vmx_flush_log_dirty, + .enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked, + .write_log_dirty = vmx_write_pml_buffer, + + .pre_block = vmx_pre_block, + .post_block = vmx_post_block, + + .pmu_ops = &intel_pmu_ops, + + .update_pi_irte = vmx_update_pi_irte, + +#ifdef CONFIG_X86_64 + .set_hv_timer = vmx_set_hv_timer, + .cancel_hv_timer = vmx_cancel_hv_timer, +#endif + + .setup_mce = vmx_setup_mce, + + .smi_allowed = vmx_smi_allowed, + .pre_enter_smm = vmx_pre_enter_smm, + .pre_leave_smm = vmx_pre_leave_smm, + .enable_smi_window = enable_smi_window, + + .check_nested_events = NULL, + .get_nested_state = NULL, + .set_nested_state = NULL, + .get_vmcs12_pages = NULL, + .nested_enable_evmcs = NULL, + .nested_get_evmcs_version = NULL, + .need_emulation_on_page_fault = vmx_need_emulation_on_page_fault, + .apic_init_signal_blocked = vmx_apic_init_signal_blocked, +}; + static __init int hardware_setup(void) { unsigned long host_bndcfgs; @@ -7819,159 +7972,6 @@ static __init int hardware_setup(void) return r; } -static __exit void hardware_unsetup(void) -{ - if (nested) - nested_vmx_hardware_unsetup(); - - free_kvm_area(); -} - -static bool vmx_check_apicv_inhibit_reasons(ulong bit) -{ - ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) | - BIT(APICV_INHIBIT_REASON_HYPERV); - - return supported & BIT(bit); -} - -static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { - .hardware_unsetup = hardware_unsetup, - - .hardware_enable = hardware_enable, - .hardware_disable = hardware_disable, - .cpu_has_accelerated_tpr = report_flexpriority, - .has_emulated_msr = vmx_has_emulated_msr, - - .vm_size = sizeof(struct kvm_vmx), - .vm_init = vmx_vm_init, - - .vcpu_create = vmx_create_vcpu, - .vcpu_free = vmx_free_vcpu, - .vcpu_reset = vmx_vcpu_reset, - - .prepare_guest_switch = vmx_prepare_switch_to_guest, - .vcpu_load = vmx_vcpu_load, - .vcpu_put = vmx_vcpu_put, - - .update_bp_intercept = update_exception_bitmap, - .get_msr_feature = vmx_get_msr_feature, - .get_msr = vmx_get_msr, - .set_msr = vmx_set_msr, - .get_segment_base = vmx_get_segment_base, - .get_segment = vmx_get_segment, - .set_segment = vmx_set_segment, - .get_cpl = vmx_get_cpl, - .get_cs_db_l_bits = vmx_get_cs_db_l_bits, - .decache_cr0_guest_bits = vmx_decache_cr0_guest_bits, - .decache_cr4_guest_bits = vmx_decache_cr4_guest_bits, - .set_cr0 = vmx_set_cr0, - .set_cr4 = vmx_set_cr4, - .set_efer = vmx_set_efer, - .get_idt = vmx_get_idt, - .set_idt = vmx_set_idt, - .get_gdt = vmx_get_gdt, - .set_gdt = vmx_set_gdt, - .get_dr6 = vmx_get_dr6, - .set_dr6 = vmx_set_dr6, - .set_dr7 = vmx_set_dr7, - .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs, - .cache_reg = vmx_cache_reg, - .get_rflags = vmx_get_rflags, - .set_rflags = vmx_set_rflags, - - .tlb_flush = vmx_flush_tlb, - .tlb_flush_gva = vmx_flush_tlb_gva, - - .run = vmx_vcpu_run, - .handle_exit = vmx_handle_exit, - .skip_emulated_instruction = vmx_skip_emulated_instruction, - .update_emulated_instruction = vmx_update_emulated_instruction, - .set_interrupt_shadow = vmx_set_interrupt_shadow, - .get_interrupt_shadow = vmx_get_interrupt_shadow, - .patch_hypercall = vmx_patch_hypercall, - .set_irq = vmx_inject_irq, - .set_nmi = vmx_inject_nmi, - .queue_exception = vmx_queue_exception, - .cancel_injection = vmx_cancel_injection, - .interrupt_allowed = vmx_interrupt_allowed, - .nmi_allowed = vmx_nmi_allowed, - .get_nmi_mask = vmx_get_nmi_mask, - .set_nmi_mask = vmx_set_nmi_mask, - .enable_nmi_window = enable_nmi_window, - .enable_irq_window = enable_irq_window, - .update_cr8_intercept = update_cr8_intercept, - .set_virtual_apic_mode = vmx_set_virtual_apic_mode, - .set_apic_access_page_addr = vmx_set_apic_access_page_addr, - .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl, - .load_eoi_exitmap = vmx_load_eoi_exitmap, - .apicv_post_state_restore = vmx_apicv_post_state_restore, - .check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons, - .hwapic_irr_update = vmx_hwapic_irr_update, - .hwapic_isr_update = vmx_hwapic_isr_update, - .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt, - .sync_pir_to_irr = vmx_sync_pir_to_irr, - .deliver_posted_interrupt = vmx_deliver_posted_interrupt, - .dy_apicv_has_pending_interrupt = vmx_dy_apicv_has_pending_interrupt, - - .set_tss_addr = vmx_set_tss_addr, - .set_identity_map_addr = vmx_set_identity_map_addr, - .get_tdp_level = get_ept_level, - .get_mt_mask = vmx_get_mt_mask, - - .get_exit_info = vmx_get_exit_info, - - .cpuid_update = vmx_cpuid_update, - - .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, - - .read_l1_tsc_offset = vmx_read_l1_tsc_offset, - .write_l1_tsc_offset = vmx_write_l1_tsc_offset, - - .load_mmu_pgd = vmx_load_mmu_pgd, - - .check_intercept = vmx_check_intercept, - .handle_exit_irqoff = vmx_handle_exit_irqoff, - - .request_immediate_exit = vmx_request_immediate_exit, - - .sched_in = vmx_sched_in, - - .slot_enable_log_dirty = vmx_slot_enable_log_dirty, - .slot_disable_log_dirty = vmx_slot_disable_log_dirty, - .flush_log_dirty = vmx_flush_log_dirty, - .enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked, - .write_log_dirty = vmx_write_pml_buffer, - - .pre_block = vmx_pre_block, - .post_block = vmx_post_block, - - .pmu_ops = &intel_pmu_ops, - - .update_pi_irte = vmx_update_pi_irte, - -#ifdef CONFIG_X86_64 - .set_hv_timer = vmx_set_hv_timer, - .cancel_hv_timer = vmx_cancel_hv_timer, -#endif - - .setup_mce = vmx_setup_mce, - - .smi_allowed = vmx_smi_allowed, - .pre_enter_smm = vmx_pre_enter_smm, - .pre_leave_smm = vmx_pre_leave_smm, - .enable_smi_window = enable_smi_window, - - .check_nested_events = NULL, - .get_nested_state = NULL, - .set_nested_state = NULL, - .get_vmcs12_pages = NULL, - .nested_enable_evmcs = NULL, - .nested_get_evmcs_version = NULL, - .need_emulation_on_page_fault = vmx_need_emulation_on_page_fault, - .apic_init_signal_blocked = vmx_apic_init_signal_blocked, -}; - static struct kvm_x86_init_ops vmx_init_ops __initdata = { .cpu_has_kvm_support = cpu_has_kvm_support, .disabled_by_bios = vmx_disabled_by_bios, -- cgit From 72b0eaa946076cba3bc315c88199db7704b5538c Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 21 Mar 2020 13:25:58 -0700 Subject: KVM: VMX: Configure runtime hooks using vmx_x86_ops Configure VMX's runtime hooks by modifying vmx_x86_ops directly instead of using the global kvm_x86_ops. This sets the stage for waiting until after ->hardware_setup() to set kvm_x86_ops with the vendor's implementation. Signed-off-by: Sean Christopherson Message-Id: <20200321202603.19355-5-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 15 ++++++++------- arch/x86/kvm/vmx/nested.h | 3 ++- arch/x86/kvm/vmx/vmx.c | 27 ++++++++++++++------------- 3 files changed, 24 insertions(+), 21 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 4ff859c99946..87fea22c3799 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -6241,7 +6241,8 @@ void nested_vmx_hardware_unsetup(void) } } -__init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *)) +__init int nested_vmx_hardware_setup(struct kvm_x86_ops *ops, + int (*exit_handlers[])(struct kvm_vcpu *)) { int i; @@ -6277,12 +6278,12 @@ __init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *)) exit_handlers[EXIT_REASON_INVVPID] = handle_invvpid; exit_handlers[EXIT_REASON_VMFUNC] = handle_vmfunc; - kvm_x86_ops->check_nested_events = vmx_check_nested_events; - kvm_x86_ops->get_nested_state = vmx_get_nested_state; - kvm_x86_ops->set_nested_state = vmx_set_nested_state; - kvm_x86_ops->get_vmcs12_pages = nested_get_vmcs12_pages; - kvm_x86_ops->nested_enable_evmcs = nested_enable_evmcs; - kvm_x86_ops->nested_get_evmcs_version = nested_get_evmcs_version; + ops->check_nested_events = vmx_check_nested_events; + ops->get_nested_state = vmx_get_nested_state; + ops->set_nested_state = vmx_set_nested_state; + ops->get_vmcs12_pages = nested_get_vmcs12_pages; + ops->nested_enable_evmcs = nested_enable_evmcs; + ops->nested_get_evmcs_version = nested_get_evmcs_version; return 0; } diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h index f70968b76d33..ac56aefa49e3 100644 --- a/arch/x86/kvm/vmx/nested.h +++ b/arch/x86/kvm/vmx/nested.h @@ -19,7 +19,8 @@ enum nvmx_vmentry_status { void vmx_leave_nested(struct kvm_vcpu *vcpu); void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps); void nested_vmx_hardware_unsetup(void); -__init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *)); +__init int nested_vmx_hardware_setup(struct kvm_x86_ops *ops, + int (*exit_handlers[])(struct kvm_vcpu *)); void nested_vmx_set_vmcs_shadowing_bitmap(void); void nested_vmx_free_vcpu(struct kvm_vcpu *vcpu); enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 2a3958ea9735..a64c386895e7 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7854,16 +7854,16 @@ static __init int hardware_setup(void) * using the APIC_ACCESS_ADDR VMCS field. */ if (!flexpriority_enabled) - kvm_x86_ops->set_apic_access_page_addr = NULL; + vmx_x86_ops.set_apic_access_page_addr = NULL; if (!cpu_has_vmx_tpr_shadow()) - kvm_x86_ops->update_cr8_intercept = NULL; + vmx_x86_ops.update_cr8_intercept = NULL; #if IS_ENABLED(CONFIG_HYPERV) if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH && enable_ept) { - kvm_x86_ops->tlb_remote_flush = hv_remote_flush_tlb; - kvm_x86_ops->tlb_remote_flush_with_range = + vmx_x86_ops.tlb_remote_flush = hv_remote_flush_tlb; + vmx_x86_ops.tlb_remote_flush_with_range = hv_remote_flush_tlb_with_range; } #endif @@ -7878,7 +7878,7 @@ static __init int hardware_setup(void) if (!cpu_has_vmx_apicv()) { enable_apicv = 0; - kvm_x86_ops->sync_pir_to_irr = NULL; + vmx_x86_ops.sync_pir_to_irr = NULL; } if (cpu_has_vmx_tsc_scaling()) { @@ -7910,10 +7910,10 @@ static __init int hardware_setup(void) enable_pml = 0; if (!enable_pml) { - kvm_x86_ops->slot_enable_log_dirty = NULL; - kvm_x86_ops->slot_disable_log_dirty = NULL; - kvm_x86_ops->flush_log_dirty = NULL; - kvm_x86_ops->enable_log_dirty_pt_masked = NULL; + vmx_x86_ops.slot_enable_log_dirty = NULL; + vmx_x86_ops.slot_disable_log_dirty = NULL; + vmx_x86_ops.flush_log_dirty = NULL; + vmx_x86_ops.enable_log_dirty_pt_masked = NULL; } if (!cpu_has_vmx_preemption_timer()) @@ -7941,9 +7941,9 @@ static __init int hardware_setup(void) } if (!enable_preemption_timer) { - kvm_x86_ops->set_hv_timer = NULL; - kvm_x86_ops->cancel_hv_timer = NULL; - kvm_x86_ops->request_immediate_exit = __kvm_request_immediate_exit; + vmx_x86_ops.set_hv_timer = NULL; + vmx_x86_ops.cancel_hv_timer = NULL; + vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit; } kvm_set_posted_intr_wakeup_handler(wakeup_handler); @@ -7959,7 +7959,8 @@ static __init int hardware_setup(void) nested_vmx_setup_ctls_msrs(&vmcs_config.nested, vmx_capability.ept); - r = nested_vmx_hardware_setup(kvm_vmx_exit_handlers); + r = nested_vmx_hardware_setup(&vmx_x86_ops, + kvm_vmx_exit_handlers); if (r) return r; } -- cgit From 69c6f69aa3064ab6cc8426661f125ea75ffe899c Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 21 Mar 2020 13:25:59 -0700 Subject: KVM: x86: Set kvm_x86_ops only after ->hardware_setup() completes Set kvm_x86_ops with the vendor's ops only after ->hardware_setup() completes to "prevent" using kvm_x86_ops before they are ready, i.e. to generate a null pointer fault instead of silently consuming unconfigured state. An alternative implementation would be to have ->hardware_setup() return the vendor's ops, but that would require non-trivial refactoring, and would arguably result in less readable code, e.g. ->hardware_setup() would need to use ERR_PTR() in multiple locations, and each vendor's declaration of the runtime ops would be less obvious. No functional change intended. Signed-off-by: Sean Christopherson Message-Id: <20200321202603.19355-6-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 283ef4d919f8..23b6c2e38d9e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7359,8 +7359,6 @@ int kvm_arch_init(void *opaque) if (r) goto out_free_percpu; - kvm_x86_ops = ops->runtime_ops; - kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK, PT_DIRTY_MASK, PT64_NX_MASK, 0, PT_PRESENT_MASK, 0, sme_me_mask); @@ -9640,6 +9638,8 @@ int kvm_arch_hardware_setup(void *opaque) if (r != 0) return r; + kvm_x86_ops = ops->runtime_ops; + if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) supported_xss = 0; -- cgit From afaf0b2f9b801c6eb2278b52d49e6a7d7b659cf1 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 21 Mar 2020 13:26:00 -0700 Subject: KVM: x86: Copy kvm_x86_ops by value to eliminate layer of indirection Replace the kvm_x86_ops pointer in common x86 with an instance of the struct to save one pointer dereference when invoking functions. Copy the struct by value to set the ops during kvm_init(). Arbitrarily use kvm_x86_ops.hardware_enable to track whether or not the ops have been initialized, i.e. a vendor KVM module has been loaded. Suggested-by: Paolo Bonzini Signed-off-by: Sean Christopherson Message-Id: <20200321202603.19355-7-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 18 +- arch/x86/kvm/cpuid.c | 4 +- arch/x86/kvm/hyperv.c | 8 +- arch/x86/kvm/kvm_cache_regs.h | 10 +- arch/x86/kvm/lapic.c | 30 ++-- arch/x86/kvm/mmu.h | 8 +- arch/x86/kvm/mmu/mmu.c | 32 ++-- arch/x86/kvm/pmu.c | 30 ++-- arch/x86/kvm/pmu.h | 2 +- arch/x86/kvm/svm.c | 2 +- arch/x86/kvm/trace.h | 4 +- arch/x86/kvm/vmx/nested.c | 2 +- arch/x86/kvm/vmx/vmx.c | 4 +- arch/x86/kvm/x86.c | 356 ++++++++++++++++++++-------------------- arch/x86/kvm/x86.h | 4 +- 15 files changed, 257 insertions(+), 257 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f4c5b49299ff..54f991244fae 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1274,13 +1274,13 @@ struct kvm_arch_async_pf { extern u64 __read_mostly host_efer; -extern struct kvm_x86_ops *kvm_x86_ops; +extern struct kvm_x86_ops kvm_x86_ops; extern struct kmem_cache *x86_fpu_cache; #define __KVM_HAVE_ARCH_VM_ALLOC static inline struct kvm *kvm_arch_alloc_vm(void) { - return __vmalloc(kvm_x86_ops->vm_size, + return __vmalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO, PAGE_KERNEL); } void kvm_arch_free_vm(struct kvm *kvm); @@ -1288,8 +1288,8 @@ void kvm_arch_free_vm(struct kvm *kvm); #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLB static inline int kvm_arch_flush_remote_tlb(struct kvm *kvm) { - if (kvm_x86_ops->tlb_remote_flush && - !kvm_x86_ops->tlb_remote_flush(kvm)) + if (kvm_x86_ops.tlb_remote_flush && + !kvm_x86_ops.tlb_remote_flush(kvm)) return 0; else return -ENOTSUPP; @@ -1375,7 +1375,7 @@ extern u64 kvm_mce_cap_supported; * * EMULTYPE_SKIP - Set when emulating solely to skip an instruction, i.e. to * decode the instruction length. For use *only* by - * kvm_x86_ops->skip_emulated_instruction() implementations. + * kvm_x86_ops.skip_emulated_instruction() implementations. * * EMULTYPE_ALLOW_RETRY_PF - Set when the emulator should resume the guest to * retry native execution under certain conditions, @@ -1669,14 +1669,14 @@ static inline bool kvm_irq_is_postable(struct kvm_lapic_irq *irq) static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) { - if (kvm_x86_ops->vcpu_blocking) - kvm_x86_ops->vcpu_blocking(vcpu); + if (kvm_x86_ops.vcpu_blocking) + kvm_x86_ops.vcpu_blocking(vcpu); } static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) { - if (kvm_x86_ops->vcpu_unblocking) - kvm_x86_ops->vcpu_unblocking(vcpu); + if (kvm_x86_ops.vcpu_unblocking) + kvm_x86_ops.vcpu_unblocking(vcpu); } static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {} diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 60ae93b09e72..b18c31a26cc2 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -209,7 +209,7 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, vcpu->arch.cpuid_nent = cpuid->nent; cpuid_fix_nx_cap(vcpu); kvm_apic_set_version(vcpu); - kvm_x86_ops->cpuid_update(vcpu); + kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); out: @@ -232,7 +232,7 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, goto out; vcpu->arch.cpuid_nent = cpuid->nent; kvm_apic_set_version(vcpu); - kvm_x86_ops->cpuid_update(vcpu); + kvm_x86_ops.cpuid_update(vcpu); r = kvm_update_cpuid(vcpu); out: return r; diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index a86fda7a1d03..bcefa9d4e57e 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -1022,7 +1022,7 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data, addr = gfn_to_hva(kvm, gfn); if (kvm_is_error_hva(addr)) return 1; - kvm_x86_ops->patch_hypercall(vcpu, instructions); + kvm_x86_ops.patch_hypercall(vcpu, instructions); ((unsigned char *)instructions)[3] = 0xc3; /* ret */ if (__copy_to_user((void __user *)addr, instructions, 4)) return 1; @@ -1607,7 +1607,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) * hypercall generates UD from non zero cpl and real mode * per HYPER-V spec */ - if (kvm_x86_ops->get_cpl(vcpu) != 0 || !is_protmode(vcpu)) { + if (kvm_x86_ops.get_cpl(vcpu) != 0 || !is_protmode(vcpu)) { kvm_queue_exception(vcpu, UD_VECTOR); return 1; } @@ -1800,8 +1800,8 @@ int kvm_vcpu_ioctl_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, }; int i, nent = ARRAY_SIZE(cpuid_entries); - if (kvm_x86_ops->nested_get_evmcs_version) - evmcs_ver = kvm_x86_ops->nested_get_evmcs_version(vcpu); + if (kvm_x86_ops.nested_get_evmcs_version) + evmcs_ver = kvm_x86_ops.nested_get_evmcs_version(vcpu); /* Skip NESTED_FEATURES if eVMCS is not supported */ if (!evmcs_ver) diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h index 58767020de41..62558b9bdda7 100644 --- a/arch/x86/kvm/kvm_cache_regs.h +++ b/arch/x86/kvm/kvm_cache_regs.h @@ -68,7 +68,7 @@ static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu, int reg) return 0; if (!kvm_register_is_available(vcpu, reg)) - kvm_x86_ops->cache_reg(vcpu, reg); + kvm_x86_ops.cache_reg(vcpu, reg); return vcpu->arch.regs[reg]; } @@ -108,7 +108,7 @@ static inline u64 kvm_pdptr_read(struct kvm_vcpu *vcpu, int index) might_sleep(); /* on svm */ if (!kvm_register_is_available(vcpu, VCPU_EXREG_PDPTR)) - kvm_x86_ops->cache_reg(vcpu, VCPU_EXREG_PDPTR); + kvm_x86_ops.cache_reg(vcpu, VCPU_EXREG_PDPTR); return vcpu->arch.walk_mmu->pdptrs[index]; } @@ -117,7 +117,7 @@ static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask) { ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS; if (tmask & vcpu->arch.cr0_guest_owned_bits) - kvm_x86_ops->decache_cr0_guest_bits(vcpu); + kvm_x86_ops.decache_cr0_guest_bits(vcpu); return vcpu->arch.cr0 & mask; } @@ -130,14 +130,14 @@ static inline ulong kvm_read_cr4_bits(struct kvm_vcpu *vcpu, ulong mask) { ulong tmask = mask & KVM_POSSIBLE_CR4_GUEST_BITS; if (tmask & vcpu->arch.cr4_guest_owned_bits) - kvm_x86_ops->decache_cr4_guest_bits(vcpu); + kvm_x86_ops.decache_cr4_guest_bits(vcpu); return vcpu->arch.cr4 & mask; } static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu) { if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3)) - kvm_x86_ops->cache_reg(vcpu, VCPU_EXREG_CR3); + kvm_x86_ops.cache_reg(vcpu, VCPU_EXREG_CR3); return vcpu->arch.cr3; } diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index b754e49adbc5..87d960818e74 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -463,7 +463,7 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic) if (unlikely(vcpu->arch.apicv_active)) { /* need to update RVI */ kvm_lapic_clear_vector(vec, apic->regs + APIC_IRR); - kvm_x86_ops->hwapic_irr_update(vcpu, + kvm_x86_ops.hwapic_irr_update(vcpu, apic_find_highest_irr(apic)); } else { apic->irr_pending = false; @@ -488,7 +488,7 @@ static inline void apic_set_isr(int vec, struct kvm_lapic *apic) * just set SVI. */ if (unlikely(vcpu->arch.apicv_active)) - kvm_x86_ops->hwapic_isr_update(vcpu, vec); + kvm_x86_ops.hwapic_isr_update(vcpu, vec); else { ++apic->isr_count; BUG_ON(apic->isr_count > MAX_APIC_VECTOR); @@ -536,7 +536,7 @@ static inline void apic_clear_isr(int vec, struct kvm_lapic *apic) * and must be left alone. */ if (unlikely(vcpu->arch.apicv_active)) - kvm_x86_ops->hwapic_isr_update(vcpu, + kvm_x86_ops.hwapic_isr_update(vcpu, apic_find_highest_isr(apic)); else { --apic->isr_count; @@ -674,7 +674,7 @@ static int apic_has_interrupt_for_ppr(struct kvm_lapic *apic, u32 ppr) { int highest_irr; if (apic->vcpu->arch.apicv_active) - highest_irr = kvm_x86_ops->sync_pir_to_irr(apic->vcpu); + highest_irr = kvm_x86_ops.sync_pir_to_irr(apic->vcpu); else highest_irr = apic_find_highest_irr(apic); if (highest_irr == -1 || (highest_irr & 0xF0) <= ppr) @@ -1063,7 +1063,7 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, apic->regs + APIC_TMR); } - if (kvm_x86_ops->deliver_posted_interrupt(vcpu, vector)) { + if (kvm_x86_ops.deliver_posted_interrupt(vcpu, vector)) { kvm_lapic_set_irr(vector, apic); kvm_make_request(KVM_REQ_EVENT, vcpu); kvm_vcpu_kick(vcpu); @@ -1746,7 +1746,7 @@ static void cancel_hv_timer(struct kvm_lapic *apic) { WARN_ON(preemptible()); WARN_ON(!apic->lapic_timer.hv_timer_in_use); - kvm_x86_ops->cancel_hv_timer(apic->vcpu); + kvm_x86_ops.cancel_hv_timer(apic->vcpu); apic->lapic_timer.hv_timer_in_use = false; } @@ -1757,13 +1757,13 @@ static bool start_hv_timer(struct kvm_lapic *apic) bool expired; WARN_ON(preemptible()); - if (!kvm_x86_ops->set_hv_timer) + if (!kvm_x86_ops.set_hv_timer) return false; if (!ktimer->tscdeadline) return false; - if (kvm_x86_ops->set_hv_timer(vcpu, ktimer->tscdeadline, &expired)) + if (kvm_x86_ops.set_hv_timer(vcpu, ktimer->tscdeadline, &expired)) return false; ktimer->hv_timer_in_use = true; @@ -2190,7 +2190,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) kvm_apic_set_x2apic_id(apic, vcpu->vcpu_id); if ((old_value ^ value) & (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE)) - kvm_x86_ops->set_virtual_apic_mode(vcpu); + kvm_x86_ops.set_virtual_apic_mode(vcpu); apic->base_address = apic->vcpu->arch.apic_base & MSR_IA32_APICBASE_BASE; @@ -2268,9 +2268,9 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) vcpu->arch.pv_eoi.msr_val = 0; apic_update_ppr(apic); if (vcpu->arch.apicv_active) { - kvm_x86_ops->apicv_post_state_restore(vcpu); - kvm_x86_ops->hwapic_irr_update(vcpu, -1); - kvm_x86_ops->hwapic_isr_update(vcpu, -1); + kvm_x86_ops.apicv_post_state_restore(vcpu); + kvm_x86_ops.hwapic_irr_update(vcpu, -1); + kvm_x86_ops.hwapic_isr_update(vcpu, -1); } vcpu->arch.apic_arb_prio = 0; @@ -2521,10 +2521,10 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s) kvm_apic_update_apicv(vcpu); apic->highest_isr_cache = -1; if (vcpu->arch.apicv_active) { - kvm_x86_ops->apicv_post_state_restore(vcpu); - kvm_x86_ops->hwapic_irr_update(vcpu, + kvm_x86_ops.apicv_post_state_restore(vcpu); + kvm_x86_ops.hwapic_irr_update(vcpu, apic_find_highest_irr(apic)); - kvm_x86_ops->hwapic_isr_update(vcpu, + kvm_x86_ops.hwapic_isr_update(vcpu, apic_find_highest_isr(apic)); } kvm_make_request(KVM_REQ_EVENT, vcpu); diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index e6bfe79e94d8..8a3b1bce722a 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -98,8 +98,8 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu) static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu) { if (VALID_PAGE(vcpu->arch.mmu->root_hpa)) - kvm_x86_ops->load_mmu_pgd(vcpu, vcpu->arch.mmu->root_hpa | - kvm_get_active_pcid(vcpu)); + kvm_x86_ops.load_mmu_pgd(vcpu, vcpu->arch.mmu->root_hpa | + kvm_get_active_pcid(vcpu)); } int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, @@ -170,8 +170,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned pte_access, unsigned pte_pkey, unsigned pfec) { - int cpl = kvm_x86_ops->get_cpl(vcpu); - unsigned long rflags = kvm_x86_ops->get_rflags(vcpu); + int cpl = kvm_x86_ops.get_cpl(vcpu); + unsigned long rflags = kvm_x86_ops.get_rflags(vcpu); /* * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1. diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 560e85ebdf22..8071952e9cf2 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -305,7 +305,7 @@ kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu); static inline bool kvm_available_flush_tlb_with_range(void) { - return kvm_x86_ops->tlb_remote_flush_with_range; + return kvm_x86_ops.tlb_remote_flush_with_range; } static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm, @@ -313,8 +313,8 @@ static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm, { int ret = -ENOTSUPP; - if (range && kvm_x86_ops->tlb_remote_flush_with_range) - ret = kvm_x86_ops->tlb_remote_flush_with_range(kvm, range); + if (range && kvm_x86_ops.tlb_remote_flush_with_range) + ret = kvm_x86_ops.tlb_remote_flush_with_range(kvm, range); if (ret) kvm_flush_remote_tlbs(kvm); @@ -1642,7 +1642,7 @@ static bool spte_set_dirty(u64 *sptep) rmap_printk("rmap_set_dirty: spte %p %llx\n", sptep, *sptep); /* - * Similar to the !kvm_x86_ops->slot_disable_log_dirty case, + * Similar to the !kvm_x86_ops.slot_disable_log_dirty case, * do not bother adding back write access to pages marked * SPTE_AD_WRPROT_ONLY_MASK. */ @@ -1731,8 +1731,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn_offset, unsigned long mask) { - if (kvm_x86_ops->enable_log_dirty_pt_masked) - kvm_x86_ops->enable_log_dirty_pt_masked(kvm, slot, gfn_offset, + if (kvm_x86_ops.enable_log_dirty_pt_masked) + kvm_x86_ops.enable_log_dirty_pt_masked(kvm, slot, gfn_offset, mask); else kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask); @@ -1747,8 +1747,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm, */ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu) { - if (kvm_x86_ops->write_log_dirty) - return kvm_x86_ops->write_log_dirty(vcpu); + if (kvm_x86_ops.write_log_dirty) + return kvm_x86_ops.write_log_dirty(vcpu); return 0; } @@ -3036,7 +3036,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, if (level > PT_PAGE_TABLE_LEVEL) spte |= PT_PAGE_SIZE_MASK; if (tdp_enabled) - spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn, + spte |= kvm_x86_ops.get_mt_mask(vcpu, gfn, kvm_is_mmio_pfn(pfn)); if (host_writable) @@ -4909,7 +4909,7 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu, bool base_only) union kvm_mmu_role role = kvm_calc_mmu_role_common(vcpu, base_only); role.base.ad_disabled = (shadow_accessed_mask == 0); - role.base.level = kvm_x86_ops->get_tdp_level(vcpu); + role.base.level = kvm_x86_ops.get_tdp_level(vcpu); role.base.direct = true; role.base.gpte_is_8_bytes = true; @@ -4930,7 +4930,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu) context->sync_page = nonpaging_sync_page; context->invlpg = nonpaging_invlpg; context->update_pte = nonpaging_update_pte; - context->shadow_root_level = kvm_x86_ops->get_tdp_level(vcpu); + context->shadow_root_level = kvm_x86_ops.get_tdp_level(vcpu); context->direct_map = true; context->get_guest_pgd = get_cr3; context->get_pdptr = kvm_pdptr_read; @@ -5183,7 +5183,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) if (r) goto out; kvm_mmu_load_pgd(vcpu); - kvm_x86_ops->tlb_flush(vcpu, true); + kvm_x86_ops.tlb_flush(vcpu, true); out: return r; } @@ -5488,7 +5488,7 @@ emulate: * guest, with the exception of AMD Erratum 1096 which is unrecoverable. */ if (unlikely(insn && !insn_len)) { - if (!kvm_x86_ops->need_emulation_on_page_fault(vcpu)) + if (!kvm_x86_ops.need_emulation_on_page_fault(vcpu)) return 1; } @@ -5523,7 +5523,7 @@ void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva) if (VALID_PAGE(mmu->prev_roots[i].hpa)) mmu->invlpg(vcpu, gva, mmu->prev_roots[i].hpa); - kvm_x86_ops->tlb_flush_gva(vcpu, gva); + kvm_x86_ops.tlb_flush_gva(vcpu, gva); ++vcpu->stat.invlpg; } EXPORT_SYMBOL_GPL(kvm_mmu_invlpg); @@ -5548,7 +5548,7 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid) } if (tlb_flush) - kvm_x86_ops->tlb_flush_gva(vcpu, gva); + kvm_x86_ops.tlb_flush_gva(vcpu, gva); ++vcpu->stat.invlpg; @@ -5672,7 +5672,7 @@ static int alloc_mmu_pages(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu) * SVM's 32-bit NPT support, TDP paging doesn't use PAE paging and can * skip allocating the PDP table. */ - if (tdp_enabled && kvm_x86_ops->get_tdp_level(vcpu) > PT32E_ROOT_LEVEL) + if (tdp_enabled && kvm_x86_ops.get_tdp_level(vcpu) > PT32E_ROOT_LEVEL) return 0; page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_DMA32); diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index d1f8ca57d354..a5078841bdac 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -211,7 +211,7 @@ void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel) ARCH_PERFMON_EVENTSEL_CMASK | HSW_IN_TX | HSW_IN_TX_CHECKPOINTED))) { - config = kvm_x86_ops->pmu_ops->find_arch_event(pmc_to_pmu(pmc), + config = kvm_x86_ops.pmu_ops->find_arch_event(pmc_to_pmu(pmc), event_select, unit_mask); if (config != PERF_COUNT_HW_MAX) @@ -265,7 +265,7 @@ void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int idx) pmc->current_config = (u64)ctrl; pmc_reprogram_counter(pmc, PERF_TYPE_HARDWARE, - kvm_x86_ops->pmu_ops->find_fixed_event(idx), + kvm_x86_ops.pmu_ops->find_fixed_event(idx), !(en_field & 0x2), /* exclude user */ !(en_field & 0x1), /* exclude kernel */ pmi, false, false); @@ -274,7 +274,7 @@ EXPORT_SYMBOL_GPL(reprogram_fixed_counter); void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx) { - struct kvm_pmc *pmc = kvm_x86_ops->pmu_ops->pmc_idx_to_pmc(pmu, pmc_idx); + struct kvm_pmc *pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, pmc_idx); if (!pmc) return; @@ -296,7 +296,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu) int bit; for_each_set_bit(bit, pmu->reprogram_pmi, X86_PMC_IDX_MAX) { - struct kvm_pmc *pmc = kvm_x86_ops->pmu_ops->pmc_idx_to_pmc(pmu, bit); + struct kvm_pmc *pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit); if (unlikely(!pmc || !pmc->perf_event)) { clear_bit(bit, pmu->reprogram_pmi); @@ -318,7 +318,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu) /* check if idx is a valid index to access PMU */ int kvm_pmu_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx) { - return kvm_x86_ops->pmu_ops->is_valid_rdpmc_ecx(vcpu, idx); + return kvm_x86_ops.pmu_ops->is_valid_rdpmc_ecx(vcpu, idx); } bool is_vmware_backdoor_pmc(u32 pmc_idx) @@ -368,7 +368,7 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data) if (is_vmware_backdoor_pmc(idx)) return kvm_pmu_rdpmc_vmware(vcpu, idx, data); - pmc = kvm_x86_ops->pmu_ops->rdpmc_ecx_to_pmc(vcpu, idx, &mask); + pmc = kvm_x86_ops.pmu_ops->rdpmc_ecx_to_pmc(vcpu, idx, &mask); if (!pmc) return 1; @@ -384,14 +384,14 @@ void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu) bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) { - return kvm_x86_ops->pmu_ops->msr_idx_to_pmc(vcpu, msr) || - kvm_x86_ops->pmu_ops->is_valid_msr(vcpu, msr); + return kvm_x86_ops.pmu_ops->msr_idx_to_pmc(vcpu, msr) || + kvm_x86_ops.pmu_ops->is_valid_msr(vcpu, msr); } static void kvm_pmu_mark_pmc_in_use(struct kvm_vcpu *vcpu, u32 msr) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - struct kvm_pmc *pmc = kvm_x86_ops->pmu_ops->msr_idx_to_pmc(vcpu, msr); + struct kvm_pmc *pmc = kvm_x86_ops.pmu_ops->msr_idx_to_pmc(vcpu, msr); if (pmc) __set_bit(pmc->idx, pmu->pmc_in_use); @@ -399,13 +399,13 @@ static void kvm_pmu_mark_pmc_in_use(struct kvm_vcpu *vcpu, u32 msr) int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data) { - return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr, data); + return kvm_x86_ops.pmu_ops->get_msr(vcpu, msr, data); } int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { kvm_pmu_mark_pmc_in_use(vcpu, msr_info->index); - return kvm_x86_ops->pmu_ops->set_msr(vcpu, msr_info); + return kvm_x86_ops.pmu_ops->set_msr(vcpu, msr_info); } /* refresh PMU settings. This function generally is called when underlying @@ -414,7 +414,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) */ void kvm_pmu_refresh(struct kvm_vcpu *vcpu) { - kvm_x86_ops->pmu_ops->refresh(vcpu); + kvm_x86_ops.pmu_ops->refresh(vcpu); } void kvm_pmu_reset(struct kvm_vcpu *vcpu) @@ -422,7 +422,7 @@ void kvm_pmu_reset(struct kvm_vcpu *vcpu) struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); irq_work_sync(&pmu->irq_work); - kvm_x86_ops->pmu_ops->reset(vcpu); + kvm_x86_ops.pmu_ops->reset(vcpu); } void kvm_pmu_init(struct kvm_vcpu *vcpu) @@ -430,7 +430,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu) struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); memset(pmu, 0, sizeof(*pmu)); - kvm_x86_ops->pmu_ops->init(vcpu); + kvm_x86_ops.pmu_ops->init(vcpu); init_irq_work(&pmu->irq_work, kvm_pmi_trigger_fn); pmu->event_count = 0; pmu->need_cleanup = false; @@ -462,7 +462,7 @@ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu) pmu->pmc_in_use, X86_PMC_IDX_MAX); for_each_set_bit(i, bitmask, X86_PMC_IDX_MAX) { - pmc = kvm_x86_ops->pmu_ops->pmc_idx_to_pmc(pmu, i); + pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, i); if (pmc && pmc->perf_event && !pmc_speculative_in_use(pmc)) pmc_stop_counter(pmc); diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index d7da2b9e0755..a6c78a797cb1 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -88,7 +88,7 @@ static inline bool pmc_is_fixed(struct kvm_pmc *pmc) static inline bool pmc_is_enabled(struct kvm_pmc *pmc) { - return kvm_x86_ops->pmu_ops->pmc_is_enabled(pmc); + return kvm_x86_ops.pmu_ops->pmc_is_enabled(pmc); } static inline bool kvm_valid_perf_global_ctrl(struct kvm_pmu *pmu, diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 589debab9a3a..e5b6b0f7d95b 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -7329,7 +7329,7 @@ static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu) * TODO: Last condition latch INIT signals on vCPU when * vCPU is in guest-mode and vmcb12 defines intercept on INIT. * To properly emulate the INIT intercept, SVM should implement - * kvm_x86_ops->check_nested_events() and call nested_svm_vmexit() + * kvm_x86_ops.check_nested_events() and call nested_svm_vmexit() * there if an INIT signal is pending. */ return !gif_set(svm) || diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index c3d1e9f4a2c0..aa59e1697bb3 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -246,7 +246,7 @@ TRACE_EVENT(kvm_exit, __entry->guest_rip = kvm_rip_read(vcpu); __entry->isa = isa; __entry->vcpu_id = vcpu->vcpu_id; - kvm_x86_ops->get_exit_info(vcpu, &__entry->info1, + kvm_x86_ops.get_exit_info(vcpu, &__entry->info1, &__entry->info2); ), @@ -750,7 +750,7 @@ TRACE_EVENT(kvm_emulate_insn, ), TP_fast_assign( - __entry->csbase = kvm_x86_ops->get_segment_base(vcpu, VCPU_SREG_CS); + __entry->csbase = kvm_x86_ops.get_segment_base(vcpu, VCPU_SREG_CS); __entry->len = vcpu->arch.emulate_ctxt->fetch.ptr - vcpu->arch.emulate_ctxt->fetch.data; __entry->rip = vcpu->arch.emulate_ctxt->_eip - __entry->len; diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 87fea22c3799..de232306561a 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -4535,7 +4535,7 @@ void nested_vmx_pmu_entry_exit_ctls_update(struct kvm_vcpu *vcpu) return; vmx = to_vmx(vcpu); - if (kvm_x86_ops->pmu_ops->is_valid_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL)) { + if (kvm_x86_ops.pmu_ops->is_valid_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL)) { vmx->nested.msrs.entry_ctls_high |= VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL; vmx->nested.msrs.exit_ctls_high |= diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index a64c386895e7..a4cd851e812d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2986,7 +2986,7 @@ void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long cr3) eptp = construct_eptp(vcpu, cr3); vmcs_write64(EPT_POINTER, eptp); - if (kvm_x86_ops->tlb_remote_flush) { + if (kvm_x86_ops.tlb_remote_flush) { spin_lock(&to_kvm_vmx(kvm)->ept_pointer_lock); to_vmx(vcpu)->ept_pointer = eptp; to_kvm_vmx(kvm)->ept_pointers_match @@ -7479,7 +7479,7 @@ static void pi_post_block(struct kvm_vcpu *vcpu) static void vmx_post_block(struct kvm_vcpu *vcpu) { - if (kvm_x86_ops->set_hv_timer) + if (kvm_x86_ops.set_hv_timer) kvm_lapic_switch_to_hv_timer(vcpu); pi_post_block(vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 23b6c2e38d9e..f055a79f93b0 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -110,7 +110,7 @@ static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); static void store_regs(struct kvm_vcpu *vcpu); static int sync_regs(struct kvm_vcpu *vcpu); -struct kvm_x86_ops *kvm_x86_ops __read_mostly; +struct kvm_x86_ops kvm_x86_ops __read_mostly; EXPORT_SYMBOL_GPL(kvm_x86_ops); static bool __read_mostly ignore_msrs = 0; @@ -646,7 +646,7 @@ EXPORT_SYMBOL_GPL(kvm_requeue_exception_e); */ bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl) { - if (kvm_x86_ops->get_cpl(vcpu) <= required_cpl) + if (kvm_x86_ops.get_cpl(vcpu) <= required_cpl) return true; kvm_queue_exception_e(vcpu, GP_VECTOR, 0); return false; @@ -787,7 +787,7 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) if (!is_pae(vcpu)) return 1; - kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); + kvm_x86_ops.get_cs_db_l_bits(vcpu, &cs_db, &cs_l); if (cs_l) return 1; } else @@ -800,7 +800,7 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) if (!(cr0 & X86_CR0_PG) && kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE)) return 1; - kvm_x86_ops->set_cr0(vcpu, cr0); + kvm_x86_ops.set_cr0(vcpu, cr0); if ((cr0 ^ old_cr0) & X86_CR0_PG) { kvm_clear_async_pf_completion_queue(vcpu); @@ -896,7 +896,7 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr) int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr) { - if (kvm_x86_ops->get_cpl(vcpu) != 0 || + if (kvm_x86_ops.get_cpl(vcpu) != 0 || __kvm_set_xcr(vcpu, index, xcr)) { kvm_inject_gp(vcpu, 0); return 1; @@ -977,7 +977,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) return 1; } - if (kvm_x86_ops->set_cr4(vcpu, cr4)) + if (kvm_x86_ops.set_cr4(vcpu, cr4)) return 1; if (((cr4 ^ old_cr4) & pdptr_bits) || @@ -1061,7 +1061,7 @@ static void kvm_update_dr0123(struct kvm_vcpu *vcpu) static void kvm_update_dr6(struct kvm_vcpu *vcpu) { if (!(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)) - kvm_x86_ops->set_dr6(vcpu, vcpu->arch.dr6); + kvm_x86_ops.set_dr6(vcpu, vcpu->arch.dr6); } static void kvm_update_dr7(struct kvm_vcpu *vcpu) @@ -1072,7 +1072,7 @@ static void kvm_update_dr7(struct kvm_vcpu *vcpu) dr7 = vcpu->arch.guest_debug_dr7; else dr7 = vcpu->arch.dr7; - kvm_x86_ops->set_dr7(vcpu, dr7); + kvm_x86_ops.set_dr7(vcpu, dr7); vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_BP_ENABLED; if (dr7 & DR7_BP_EN_MASK) vcpu->arch.switch_db_regs |= KVM_DEBUGREG_BP_ENABLED; @@ -1142,7 +1142,7 @@ int kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val) if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) *val = vcpu->arch.dr6; else - *val = kvm_x86_ops->get_dr6(vcpu); + *val = kvm_x86_ops.get_dr6(vcpu); break; case 5: /* fall through */ @@ -1377,7 +1377,7 @@ static int kvm_get_msr_feature(struct kvm_msr_entry *msr) rdmsrl_safe(msr->index, &msr->data); break; default: - if (kvm_x86_ops->get_msr_feature(msr)) + if (kvm_x86_ops.get_msr_feature(msr)) return 1; } return 0; @@ -1445,7 +1445,7 @@ static int set_efer(struct kvm_vcpu *vcpu, struct msr_data *msr_info) efer &= ~EFER_LMA; efer |= vcpu->arch.efer & EFER_LMA; - kvm_x86_ops->set_efer(vcpu, efer); + kvm_x86_ops.set_efer(vcpu, efer); /* Update reserved bits */ if ((efer ^ old_efer) & EFER_NX) @@ -1501,7 +1501,7 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data, msr.index = index; msr.host_initiated = host_initiated; - return kvm_x86_ops->set_msr(vcpu, &msr); + return kvm_x86_ops.set_msr(vcpu, &msr); } /* @@ -1519,7 +1519,7 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, msr.index = index; msr.host_initiated = host_initiated; - ret = kvm_x86_ops->get_msr(vcpu, &msr); + ret = kvm_x86_ops.get_msr(vcpu, &msr); if (!ret) *data = msr.data; return ret; @@ -1905,7 +1905,7 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu) static void update_ia32_tsc_adjust_msr(struct kvm_vcpu *vcpu, s64 offset) { - u64 curr_offset = kvm_x86_ops->read_l1_tsc_offset(vcpu); + u64 curr_offset = kvm_x86_ops.read_l1_tsc_offset(vcpu); vcpu->arch.ia32_tsc_adjust_msr += offset - curr_offset; } @@ -1947,7 +1947,7 @@ static u64 kvm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc) u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc) { - u64 tsc_offset = kvm_x86_ops->read_l1_tsc_offset(vcpu); + u64 tsc_offset = kvm_x86_ops.read_l1_tsc_offset(vcpu); return tsc_offset + kvm_scale_tsc(vcpu, host_tsc); } @@ -1955,7 +1955,7 @@ EXPORT_SYMBOL_GPL(kvm_read_l1_tsc); static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) { - vcpu->arch.tsc_offset = kvm_x86_ops->write_l1_tsc_offset(vcpu, offset); + vcpu->arch.tsc_offset = kvm_x86_ops.write_l1_tsc_offset(vcpu, offset); } static inline bool kvm_check_tsc_unstable(void) @@ -2079,7 +2079,7 @@ EXPORT_SYMBOL_GPL(kvm_write_tsc); static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 adjustment) { - u64 tsc_offset = kvm_x86_ops->read_l1_tsc_offset(vcpu); + u64 tsc_offset = kvm_x86_ops.read_l1_tsc_offset(vcpu); kvm_vcpu_write_tsc_offset(vcpu, tsc_offset + adjustment); } @@ -2677,7 +2677,7 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa) { ++vcpu->stat.tlb_flush; - kvm_x86_ops->tlb_flush(vcpu, invalidate_gpa); + kvm_x86_ops.tlb_flush(vcpu, invalidate_gpa); } static void record_steal_time(struct kvm_vcpu *vcpu) @@ -3394,10 +3394,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) * fringe case that is not enabled except via specific settings * of the module parameters. */ - r = kvm_x86_ops->has_emulated_msr(MSR_IA32_SMBASE); + r = kvm_x86_ops.has_emulated_msr(MSR_IA32_SMBASE); break; case KVM_CAP_VAPIC: - r = !kvm_x86_ops->cpu_has_accelerated_tpr(); + r = !kvm_x86_ops.cpu_has_accelerated_tpr(); break; case KVM_CAP_NR_VCPUS: r = KVM_SOFT_MAX_VCPUS; @@ -3424,14 +3424,14 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) r = KVM_X2APIC_API_VALID_FLAGS; break; case KVM_CAP_NESTED_STATE: - r = kvm_x86_ops->get_nested_state ? - kvm_x86_ops->get_nested_state(NULL, NULL, 0) : 0; + r = kvm_x86_ops.get_nested_state ? + kvm_x86_ops.get_nested_state(NULL, NULL, 0) : 0; break; case KVM_CAP_HYPERV_DIRECT_TLBFLUSH: - r = kvm_x86_ops->enable_direct_tlbflush != NULL; + r = kvm_x86_ops.enable_direct_tlbflush != NULL; break; case KVM_CAP_HYPERV_ENLIGHTENED_VMCS: - r = kvm_x86_ops->nested_enable_evmcs != NULL; + r = kvm_x86_ops.nested_enable_evmcs != NULL; break; default: break; @@ -3547,14 +3547,14 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { /* Address WBINVD may be executed by guest */ if (need_emulate_wbinvd(vcpu)) { - if (kvm_x86_ops->has_wbinvd_exit()) + if (kvm_x86_ops.has_wbinvd_exit()) cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask); else if (vcpu->cpu != -1 && vcpu->cpu != cpu) smp_call_function_single(vcpu->cpu, wbinvd_ipi, NULL, 1); } - kvm_x86_ops->vcpu_load(vcpu, cpu); + kvm_x86_ops.vcpu_load(vcpu, cpu); /* Apply any externally detected TSC adjustments (due to suspend) */ if (unlikely(vcpu->arch.tsc_offset_adjustment)) { @@ -3621,7 +3621,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) int idx; if (vcpu->preempted) - vcpu->arch.preempted_in_kernel = !kvm_x86_ops->get_cpl(vcpu); + vcpu->arch.preempted_in_kernel = !kvm_x86_ops.get_cpl(vcpu); /* * Disable page faults because we're in atomic context here. @@ -3640,7 +3640,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) kvm_steal_time_set_preempted(vcpu); srcu_read_unlock(&vcpu->kvm->srcu, idx); pagefault_enable(); - kvm_x86_ops->vcpu_put(vcpu); + kvm_x86_ops.vcpu_put(vcpu); vcpu->arch.last_host_tsc = rdtsc(); /* * If userspace has set any breakpoints or watchpoints, dr6 is restored @@ -3654,7 +3654,7 @@ static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s) { if (vcpu->arch.apicv_active) - kvm_x86_ops->sync_pir_to_irr(vcpu); + kvm_x86_ops.sync_pir_to_irr(vcpu); return kvm_apic_get_state(vcpu, s); } @@ -3762,7 +3762,7 @@ static int kvm_vcpu_ioctl_x86_setup_mce(struct kvm_vcpu *vcpu, for (bank = 0; bank < bank_num; bank++) vcpu->arch.mce_banks[bank*4] = ~(u64)0; - kvm_x86_ops->setup_mce(vcpu); + kvm_x86_ops.setup_mce(vcpu); out: return r; } @@ -3866,11 +3866,11 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft; events->interrupt.nr = vcpu->arch.interrupt.nr; events->interrupt.soft = 0; - events->interrupt.shadow = kvm_x86_ops->get_interrupt_shadow(vcpu); + events->interrupt.shadow = kvm_x86_ops.get_interrupt_shadow(vcpu); events->nmi.injected = vcpu->arch.nmi_injected; events->nmi.pending = vcpu->arch.nmi_pending != 0; - events->nmi.masked = kvm_x86_ops->get_nmi_mask(vcpu); + events->nmi.masked = kvm_x86_ops.get_nmi_mask(vcpu); events->nmi.pad = 0; events->sipi_vector = 0; /* never valid when reporting to user space */ @@ -3937,13 +3937,13 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu, vcpu->arch.interrupt.nr = events->interrupt.nr; vcpu->arch.interrupt.soft = events->interrupt.soft; if (events->flags & KVM_VCPUEVENT_VALID_SHADOW) - kvm_x86_ops->set_interrupt_shadow(vcpu, + kvm_x86_ops.set_interrupt_shadow(vcpu, events->interrupt.shadow); vcpu->arch.nmi_injected = events->nmi.injected; if (events->flags & KVM_VCPUEVENT_VALID_NMI_PENDING) vcpu->arch.nmi_pending = events->nmi.pending; - kvm_x86_ops->set_nmi_mask(vcpu, events->nmi.masked); + kvm_x86_ops.set_nmi_mask(vcpu, events->nmi.masked); if (events->flags & KVM_VCPUEVENT_VALID_SIPI_VECTOR && lapic_in_kernel(vcpu)) @@ -4217,9 +4217,9 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu, return kvm_hv_activate_synic(vcpu, cap->cap == KVM_CAP_HYPERV_SYNIC2); case KVM_CAP_HYPERV_ENLIGHTENED_VMCS: - if (!kvm_x86_ops->nested_enable_evmcs) + if (!kvm_x86_ops.nested_enable_evmcs) return -ENOTTY; - r = kvm_x86_ops->nested_enable_evmcs(vcpu, &vmcs_version); + r = kvm_x86_ops.nested_enable_evmcs(vcpu, &vmcs_version); if (!r) { user_ptr = (void __user *)(uintptr_t)cap->args[0]; if (copy_to_user(user_ptr, &vmcs_version, @@ -4228,10 +4228,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu, } return r; case KVM_CAP_HYPERV_DIRECT_TLBFLUSH: - if (!kvm_x86_ops->enable_direct_tlbflush) + if (!kvm_x86_ops.enable_direct_tlbflush) return -ENOTTY; - return kvm_x86_ops->enable_direct_tlbflush(vcpu); + return kvm_x86_ops.enable_direct_tlbflush(vcpu); default: return -EINVAL; @@ -4534,7 +4534,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp, u32 user_data_size; r = -EINVAL; - if (!kvm_x86_ops->get_nested_state) + if (!kvm_x86_ops.get_nested_state) break; BUILD_BUG_ON(sizeof(user_data_size) != sizeof(user_kvm_nested_state->size)); @@ -4542,7 +4542,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp, if (get_user(user_data_size, &user_kvm_nested_state->size)) break; - r = kvm_x86_ops->get_nested_state(vcpu, user_kvm_nested_state, + r = kvm_x86_ops.get_nested_state(vcpu, user_kvm_nested_state, user_data_size); if (r < 0) break; @@ -4564,7 +4564,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp, int idx; r = -EINVAL; - if (!kvm_x86_ops->set_nested_state) + if (!kvm_x86_ops.set_nested_state) break; r = -EFAULT; @@ -4586,7 +4586,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp, break; idx = srcu_read_lock(&vcpu->kvm->srcu); - r = kvm_x86_ops->set_nested_state(vcpu, user_kvm_nested_state, &kvm_state); + r = kvm_x86_ops.set_nested_state(vcpu, user_kvm_nested_state, &kvm_state); srcu_read_unlock(&vcpu->kvm->srcu, idx); break; } @@ -4630,14 +4630,14 @@ static int kvm_vm_ioctl_set_tss_addr(struct kvm *kvm, unsigned long addr) if (addr > (unsigned int)(-3 * PAGE_SIZE)) return -EINVAL; - ret = kvm_x86_ops->set_tss_addr(kvm, addr); + ret = kvm_x86_ops.set_tss_addr(kvm, addr); return ret; } static int kvm_vm_ioctl_set_identity_map_addr(struct kvm *kvm, u64 ident_addr) { - return kvm_x86_ops->set_identity_map_addr(kvm, ident_addr); + return kvm_x86_ops.set_identity_map_addr(kvm, ident_addr); } static int kvm_vm_ioctl_set_nr_mmu_pages(struct kvm *kvm, @@ -4794,8 +4794,8 @@ void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot) /* * Flush potentially hardware-cached dirty pages to dirty_bitmap. */ - if (kvm_x86_ops->flush_log_dirty) - kvm_x86_ops->flush_log_dirty(kvm); + if (kvm_x86_ops.flush_log_dirty) + kvm_x86_ops.flush_log_dirty(kvm); } int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event, @@ -5148,8 +5148,8 @@ set_identity_unlock: } case KVM_MEMORY_ENCRYPT_OP: { r = -ENOTTY; - if (kvm_x86_ops->mem_enc_op) - r = kvm_x86_ops->mem_enc_op(kvm, argp); + if (kvm_x86_ops.mem_enc_op) + r = kvm_x86_ops.mem_enc_op(kvm, argp); break; } case KVM_MEMORY_ENCRYPT_REG_REGION: { @@ -5160,8 +5160,8 @@ set_identity_unlock: goto out; r = -ENOTTY; - if (kvm_x86_ops->mem_enc_reg_region) - r = kvm_x86_ops->mem_enc_reg_region(kvm, ®ion); + if (kvm_x86_ops.mem_enc_reg_region) + r = kvm_x86_ops.mem_enc_reg_region(kvm, ®ion); break; } case KVM_MEMORY_ENCRYPT_UNREG_REGION: { @@ -5172,8 +5172,8 @@ set_identity_unlock: goto out; r = -ENOTTY; - if (kvm_x86_ops->mem_enc_unreg_region) - r = kvm_x86_ops->mem_enc_unreg_region(kvm, ®ion); + if (kvm_x86_ops.mem_enc_unreg_region) + r = kvm_x86_ops.mem_enc_unreg_region(kvm, ®ion); break; } case KVM_HYPERV_EVENTFD: { @@ -5268,7 +5268,7 @@ static void kvm_init_msr_list(void) } for (i = 0; i < ARRAY_SIZE(emulated_msrs_all); i++) { - if (!kvm_x86_ops->has_emulated_msr(emulated_msrs_all[i])) + if (!kvm_x86_ops.has_emulated_msr(emulated_msrs_all[i])) continue; emulated_msrs[num_emulated_msrs++] = emulated_msrs_all[i]; @@ -5331,13 +5331,13 @@ static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v) static void kvm_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg) { - kvm_x86_ops->set_segment(vcpu, var, seg); + kvm_x86_ops.set_segment(vcpu, var, seg); } void kvm_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg) { - kvm_x86_ops->get_segment(vcpu, var, seg); + kvm_x86_ops.get_segment(vcpu, var, seg); } gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, @@ -5357,14 +5357,14 @@ gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva, struct x86_exception *exception) { - u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + u32 access = (kvm_x86_ops.get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception); } gpa_t kvm_mmu_gva_to_gpa_fetch(struct kvm_vcpu *vcpu, gva_t gva, struct x86_exception *exception) { - u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + u32 access = (kvm_x86_ops.get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; access |= PFERR_FETCH_MASK; return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception); } @@ -5372,7 +5372,7 @@ gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva, gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, gva_t gva, struct x86_exception *exception) { - u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + u32 access = (kvm_x86_ops.get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; access |= PFERR_WRITE_MASK; return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception); } @@ -5421,7 +5421,7 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt, struct x86_exception *exception) { struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); - u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + u32 access = (kvm_x86_ops.get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; unsigned offset; int ret; @@ -5446,7 +5446,7 @@ int kvm_read_guest_virt(struct kvm_vcpu *vcpu, gva_t addr, void *val, unsigned int bytes, struct x86_exception *exception) { - u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + u32 access = (kvm_x86_ops.get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; /* * FIXME: this should call handle_emulation_failure if X86EMUL_IO_NEEDED @@ -5467,7 +5467,7 @@ static int emulator_read_std(struct x86_emulate_ctxt *ctxt, struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); u32 access = 0; - if (!system && kvm_x86_ops->get_cpl(vcpu) == 3) + if (!system && kvm_x86_ops.get_cpl(vcpu) == 3) access |= PFERR_USER_MASK; return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access, exception); @@ -5520,7 +5520,7 @@ static int emulator_write_std(struct x86_emulate_ctxt *ctxt, gva_t addr, void *v struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); u32 access = PFERR_WRITE_MASK; - if (!system && kvm_x86_ops->get_cpl(vcpu) == 3) + if (!system && kvm_x86_ops.get_cpl(vcpu) == 3) access |= PFERR_USER_MASK; return kvm_write_guest_virt_helper(addr, val, bytes, vcpu, @@ -5583,7 +5583,7 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, gpa_t *gpa, struct x86_exception *exception, bool write) { - u32 access = ((kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0) + u32 access = ((kvm_x86_ops.get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0) | (write ? PFERR_WRITE_MASK : 0); /* @@ -5981,7 +5981,7 @@ static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt, static unsigned long get_segment_base(struct kvm_vcpu *vcpu, int seg) { - return kvm_x86_ops->get_segment_base(vcpu, seg); + return kvm_x86_ops.get_segment_base(vcpu, seg); } static void emulator_invlpg(struct x86_emulate_ctxt *ctxt, ulong address) @@ -5994,7 +5994,7 @@ static int kvm_emulate_wbinvd_noskip(struct kvm_vcpu *vcpu) if (!need_emulate_wbinvd(vcpu)) return X86EMUL_CONTINUE; - if (kvm_x86_ops->has_wbinvd_exit()) { + if (kvm_x86_ops.has_wbinvd_exit()) { int cpu = get_cpu(); cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask); @@ -6099,27 +6099,27 @@ static int emulator_set_cr(struct x86_emulate_ctxt *ctxt, int cr, ulong val) static int emulator_get_cpl(struct x86_emulate_ctxt *ctxt) { - return kvm_x86_ops->get_cpl(emul_to_vcpu(ctxt)); + return kvm_x86_ops.get_cpl(emul_to_vcpu(ctxt)); } static void emulator_get_gdt(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt) { - kvm_x86_ops->get_gdt(emul_to_vcpu(ctxt), dt); + kvm_x86_ops.get_gdt(emul_to_vcpu(ctxt), dt); } static void emulator_get_idt(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt) { - kvm_x86_ops->get_idt(emul_to_vcpu(ctxt), dt); + kvm_x86_ops.get_idt(emul_to_vcpu(ctxt), dt); } static void emulator_set_gdt(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt) { - kvm_x86_ops->set_gdt(emul_to_vcpu(ctxt), dt); + kvm_x86_ops.set_gdt(emul_to_vcpu(ctxt), dt); } static void emulator_set_idt(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt) { - kvm_x86_ops->set_idt(emul_to_vcpu(ctxt), dt); + kvm_x86_ops.set_idt(emul_to_vcpu(ctxt), dt); } static unsigned long emulator_get_cached_segment_base( @@ -6241,7 +6241,7 @@ static int emulator_intercept(struct x86_emulate_ctxt *ctxt, struct x86_instruction_info *info, enum x86_intercept_stage stage) { - return kvm_x86_ops->check_intercept(emul_to_vcpu(ctxt), info, stage, + return kvm_x86_ops.check_intercept(emul_to_vcpu(ctxt), info, stage, &ctxt->exception); } @@ -6279,7 +6279,7 @@ static void emulator_write_gpr(struct x86_emulate_ctxt *ctxt, unsigned reg, ulon static void emulator_set_nmi_mask(struct x86_emulate_ctxt *ctxt, bool masked) { - kvm_x86_ops->set_nmi_mask(emul_to_vcpu(ctxt), masked); + kvm_x86_ops.set_nmi_mask(emul_to_vcpu(ctxt), masked); } static unsigned emulator_get_hflags(struct x86_emulate_ctxt *ctxt) @@ -6295,7 +6295,7 @@ static void emulator_set_hflags(struct x86_emulate_ctxt *ctxt, unsigned emul_fla static int emulator_pre_leave_smm(struct x86_emulate_ctxt *ctxt, const char *smstate) { - return kvm_x86_ops->pre_leave_smm(emul_to_vcpu(ctxt), smstate); + return kvm_x86_ops.pre_leave_smm(emul_to_vcpu(ctxt), smstate); } static void emulator_post_leave_smm(struct x86_emulate_ctxt *ctxt) @@ -6357,7 +6357,7 @@ static const struct x86_emulate_ops emulate_ops = { static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask) { - u32 int_shadow = kvm_x86_ops->get_interrupt_shadow(vcpu); + u32 int_shadow = kvm_x86_ops.get_interrupt_shadow(vcpu); /* * an sti; sti; sequence only disable interrupts for the first * instruction. So, if the last instruction, be it emulated or @@ -6368,7 +6368,7 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask) if (int_shadow & mask) mask = 0; if (unlikely(int_shadow || mask)) { - kvm_x86_ops->set_interrupt_shadow(vcpu, mask); + kvm_x86_ops.set_interrupt_shadow(vcpu, mask); if (!mask) kvm_make_request(KVM_REQ_EVENT, vcpu); } @@ -6410,7 +6410,7 @@ static void init_emulate_ctxt(struct kvm_vcpu *vcpu) struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; int cs_db, cs_l; - kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); + kvm_x86_ops.get_cs_db_l_bits(vcpu, &cs_db, &cs_l); ctxt->gpa_available = false; ctxt->eflags = kvm_get_rflags(vcpu); @@ -6471,7 +6471,7 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu, int emulation_type) kvm_queue_exception(vcpu, UD_VECTOR); - if (!is_guest_mode(vcpu) && kvm_x86_ops->get_cpl(vcpu) == 0) { + if (!is_guest_mode(vcpu) && kvm_x86_ops.get_cpl(vcpu) == 0) { vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION; vcpu->run->internal.ndata = 0; @@ -6652,10 +6652,10 @@ static int kvm_vcpu_do_singlestep(struct kvm_vcpu *vcpu) int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu) { - unsigned long rflags = kvm_x86_ops->get_rflags(vcpu); + unsigned long rflags = kvm_x86_ops.get_rflags(vcpu); int r; - r = kvm_x86_ops->skip_emulated_instruction(vcpu); + r = kvm_x86_ops.skip_emulated_instruction(vcpu); if (unlikely(!r)) return 0; @@ -6890,7 +6890,7 @@ restart: r = 1; if (writeback) { - unsigned long rflags = kvm_x86_ops->get_rflags(vcpu); + unsigned long rflags = kvm_x86_ops.get_rflags(vcpu); toggle_interruptibility(vcpu, ctxt->interruptibility); vcpu->arch.emulate_regs_need_sync_to_vcpu = false; if (!ctxt->have_exception || @@ -6898,8 +6898,8 @@ restart: kvm_rip_write(vcpu, ctxt->eip); if (r && ctxt->tf) r = kvm_vcpu_do_singlestep(vcpu); - if (kvm_x86_ops->update_emulated_instruction) - kvm_x86_ops->update_emulated_instruction(vcpu); + if (kvm_x86_ops.update_emulated_instruction) + kvm_x86_ops.update_emulated_instruction(vcpu); __kvm_set_rflags(vcpu, ctxt->eflags); } @@ -7226,7 +7226,7 @@ static int kvm_is_user_mode(void) int user_mode = 3; if (__this_cpu_read(current_vcpu)) - user_mode = kvm_x86_ops->get_cpl(__this_cpu_read(current_vcpu)); + user_mode = kvm_x86_ops.get_cpl(__this_cpu_read(current_vcpu)); return user_mode != 0; } @@ -7306,7 +7306,7 @@ int kvm_arch_init(void *opaque) struct kvm_x86_init_ops *ops = opaque; int r; - if (kvm_x86_ops) { + if (kvm_x86_ops.hardware_enable) { printk(KERN_ERR "kvm: already loaded the other module\n"); r = -EEXIST; goto out; @@ -7409,7 +7409,7 @@ void kvm_arch_exit(void) #ifdef CONFIG_X86_64 pvclock_gtod_unregister_notifier(&pvclock_gtod_notifier); #endif - kvm_x86_ops = NULL; + kvm_x86_ops.hardware_enable = NULL; kvm_mmu_module_exit(); free_percpu(shared_msrs); kmem_cache_destroy(x86_fpu_cache); @@ -7547,7 +7547,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) a3 &= 0xFFFFFFFF; } - if (kvm_x86_ops->get_cpl(vcpu) != 0) { + if (kvm_x86_ops.get_cpl(vcpu) != 0) { ret = -KVM_EPERM; goto out; } @@ -7593,7 +7593,7 @@ static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt) char instruction[3]; unsigned long rip = kvm_rip_read(vcpu); - kvm_x86_ops->patch_hypercall(vcpu, instruction); + kvm_x86_ops.patch_hypercall(vcpu, instruction); return emulator_write_emulated(ctxt, rip, instruction, 3, &ctxt->exception); @@ -7622,7 +7622,7 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu) { int max_irr, tpr; - if (!kvm_x86_ops->update_cr8_intercept) + if (!kvm_x86_ops.update_cr8_intercept) return; if (!lapic_in_kernel(vcpu)) @@ -7641,7 +7641,7 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu) tpr = kvm_lapic_get_cr8(vcpu); - kvm_x86_ops->update_cr8_intercept(vcpu, tpr, max_irr); + kvm_x86_ops.update_cr8_intercept(vcpu, tpr, max_irr); } static int inject_pending_event(struct kvm_vcpu *vcpu) @@ -7651,7 +7651,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu) /* try to reinject previous events if any */ if (vcpu->arch.exception.injected) - kvm_x86_ops->queue_exception(vcpu); + kvm_x86_ops.queue_exception(vcpu); /* * Do not inject an NMI or interrupt if there is a pending * exception. Exceptions and interrupts are recognized at @@ -7668,9 +7668,9 @@ static int inject_pending_event(struct kvm_vcpu *vcpu) */ else if (!vcpu->arch.exception.pending) { if (vcpu->arch.nmi_injected) - kvm_x86_ops->set_nmi(vcpu); + kvm_x86_ops.set_nmi(vcpu); else if (vcpu->arch.interrupt.injected) - kvm_x86_ops->set_irq(vcpu); + kvm_x86_ops.set_irq(vcpu); } /* @@ -7679,8 +7679,8 @@ static int inject_pending_event(struct kvm_vcpu *vcpu) * from L2 to L1 due to pending L1 events which require exit * from L2 to L1. */ - if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { - r = kvm_x86_ops->check_nested_events(vcpu); + if (is_guest_mode(vcpu) && kvm_x86_ops.check_nested_events) { + r = kvm_x86_ops.check_nested_events(vcpu); if (r != 0) return r; } @@ -7717,7 +7717,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu) } } - kvm_x86_ops->queue_exception(vcpu); + kvm_x86_ops.queue_exception(vcpu); } /* Don't consider new event if we re-injected an event */ @@ -7725,14 +7725,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu) return 0; if (vcpu->arch.smi_pending && !is_smm(vcpu) && - kvm_x86_ops->smi_allowed(vcpu)) { + kvm_x86_ops.smi_allowed(vcpu)) { vcpu->arch.smi_pending = false; ++vcpu->arch.smi_count; enter_smm(vcpu); - } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) { + } else if (vcpu->arch.nmi_pending && kvm_x86_ops.nmi_allowed(vcpu)) { --vcpu->arch.nmi_pending; vcpu->arch.nmi_injected = true; - kvm_x86_ops->set_nmi(vcpu); + kvm_x86_ops.set_nmi(vcpu); } else if (kvm_cpu_has_injectable_intr(vcpu)) { /* * Because interrupts can be injected asynchronously, we are @@ -7741,15 +7741,15 @@ static int inject_pending_event(struct kvm_vcpu *vcpu) * proposal and current concerns. Perhaps we should be setting * KVM_REQ_EVENT only on certain events and not unconditionally? */ - if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { - r = kvm_x86_ops->check_nested_events(vcpu); + if (is_guest_mode(vcpu) && kvm_x86_ops.check_nested_events) { + r = kvm_x86_ops.check_nested_events(vcpu); if (r != 0) return r; } - if (kvm_x86_ops->interrupt_allowed(vcpu)) { + if (kvm_x86_ops.interrupt_allowed(vcpu)) { kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu), false); - kvm_x86_ops->set_irq(vcpu); + kvm_x86_ops.set_irq(vcpu); } } @@ -7765,7 +7765,7 @@ static void process_nmi(struct kvm_vcpu *vcpu) * If an NMI is already in progress, limit further NMIs to just one. * Otherwise, allow two (and we'll inject the first one immediately). */ - if (kvm_x86_ops->get_nmi_mask(vcpu) || vcpu->arch.nmi_injected) + if (kvm_x86_ops.get_nmi_mask(vcpu) || vcpu->arch.nmi_injected) limit = 1; vcpu->arch.nmi_pending += atomic_xchg(&vcpu->arch.nmi_queued, 0); @@ -7855,11 +7855,11 @@ static void enter_smm_save_state_32(struct kvm_vcpu *vcpu, char *buf) put_smstate(u32, buf, 0x7f7c, seg.limit); put_smstate(u32, buf, 0x7f78, enter_smm_get_segment_flags(&seg)); - kvm_x86_ops->get_gdt(vcpu, &dt); + kvm_x86_ops.get_gdt(vcpu, &dt); put_smstate(u32, buf, 0x7f74, dt.address); put_smstate(u32, buf, 0x7f70, dt.size); - kvm_x86_ops->get_idt(vcpu, &dt); + kvm_x86_ops.get_idt(vcpu, &dt); put_smstate(u32, buf, 0x7f58, dt.address); put_smstate(u32, buf, 0x7f54, dt.size); @@ -7909,7 +7909,7 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu, char *buf) put_smstate(u32, buf, 0x7e94, seg.limit); put_smstate(u64, buf, 0x7e98, seg.base); - kvm_x86_ops->get_idt(vcpu, &dt); + kvm_x86_ops.get_idt(vcpu, &dt); put_smstate(u32, buf, 0x7e84, dt.size); put_smstate(u64, buf, 0x7e88, dt.address); @@ -7919,7 +7919,7 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu, char *buf) put_smstate(u32, buf, 0x7e74, seg.limit); put_smstate(u64, buf, 0x7e78, seg.base); - kvm_x86_ops->get_gdt(vcpu, &dt); + kvm_x86_ops.get_gdt(vcpu, &dt); put_smstate(u32, buf, 0x7e64, dt.size); put_smstate(u64, buf, 0x7e68, dt.address); @@ -7949,28 +7949,28 @@ static void enter_smm(struct kvm_vcpu *vcpu) * vCPU state (e.g. leave guest mode) after we've saved the state into * the SMM state-save area. */ - kvm_x86_ops->pre_enter_smm(vcpu, buf); + kvm_x86_ops.pre_enter_smm(vcpu, buf); vcpu->arch.hflags |= HF_SMM_MASK; kvm_vcpu_write_guest(vcpu, vcpu->arch.smbase + 0xfe00, buf, sizeof(buf)); - if (kvm_x86_ops->get_nmi_mask(vcpu)) + if (kvm_x86_ops.get_nmi_mask(vcpu)) vcpu->arch.hflags |= HF_SMM_INSIDE_NMI_MASK; else - kvm_x86_ops->set_nmi_mask(vcpu, true); + kvm_x86_ops.set_nmi_mask(vcpu, true); kvm_set_rflags(vcpu, X86_EFLAGS_FIXED); kvm_rip_write(vcpu, 0x8000); cr0 = vcpu->arch.cr0 & ~(X86_CR0_PE | X86_CR0_EM | X86_CR0_TS | X86_CR0_PG); - kvm_x86_ops->set_cr0(vcpu, cr0); + kvm_x86_ops.set_cr0(vcpu, cr0); vcpu->arch.cr0 = cr0; - kvm_x86_ops->set_cr4(vcpu, 0); + kvm_x86_ops.set_cr4(vcpu, 0); /* Undocumented: IDT limit is set to zero on entry to SMM. */ dt.address = dt.size = 0; - kvm_x86_ops->set_idt(vcpu, &dt); + kvm_x86_ops.set_idt(vcpu, &dt); __kvm_set_dr(vcpu, 7, DR7_FIXED_1); @@ -8001,7 +8001,7 @@ static void enter_smm(struct kvm_vcpu *vcpu) #ifdef CONFIG_X86_64 if (guest_cpuid_has(vcpu, X86_FEATURE_LM)) - kvm_x86_ops->set_efer(vcpu, 0); + kvm_x86_ops.set_efer(vcpu, 0); #endif kvm_update_cpuid(vcpu); @@ -8039,7 +8039,7 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu) vcpu->arch.apicv_active = kvm_apicv_activated(vcpu->kvm); kvm_apic_update_apicv(vcpu); - kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu); + kvm_x86_ops.refresh_apicv_exec_ctrl(vcpu); } EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv); @@ -8054,8 +8054,8 @@ void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit) { unsigned long old, new, expected; - if (!kvm_x86_ops->check_apicv_inhibit_reasons || - !kvm_x86_ops->check_apicv_inhibit_reasons(bit)) + if (!kvm_x86_ops.check_apicv_inhibit_reasons || + !kvm_x86_ops.check_apicv_inhibit_reasons(bit)) return; old = READ_ONCE(kvm->arch.apicv_inhibit_reasons); @@ -8074,8 +8074,8 @@ void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit) return; trace_kvm_apicv_update_request(activate, bit); - if (kvm_x86_ops->pre_update_apicv_exec_ctrl) - kvm_x86_ops->pre_update_apicv_exec_ctrl(kvm, activate); + if (kvm_x86_ops.pre_update_apicv_exec_ctrl) + kvm_x86_ops.pre_update_apicv_exec_ctrl(kvm, activate); kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE); } EXPORT_SYMBOL_GPL(kvm_request_apicv_update); @@ -8091,7 +8091,7 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu) kvm_scan_ioapic_routes(vcpu, vcpu->arch.ioapic_handled_vectors); else { if (vcpu->arch.apicv_active) - kvm_x86_ops->sync_pir_to_irr(vcpu); + kvm_x86_ops.sync_pir_to_irr(vcpu); if (ioapic_in_kernel(vcpu->kvm)) kvm_ioapic_scan_entry(vcpu, vcpu->arch.ioapic_handled_vectors); } @@ -8111,7 +8111,7 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu) bitmap_or((ulong *)eoi_exit_bitmap, vcpu->arch.ioapic_handled_vectors, vcpu_to_synic(vcpu)->vec_bitmap, 256); - kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap); + kvm_x86_ops.load_eoi_exitmap(vcpu, eoi_exit_bitmap); } int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, @@ -8138,13 +8138,13 @@ void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) if (!lapic_in_kernel(vcpu)) return; - if (!kvm_x86_ops->set_apic_access_page_addr) + if (!kvm_x86_ops.set_apic_access_page_addr) return; page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); if (is_error_page(page)) return; - kvm_x86_ops->set_apic_access_page_addr(vcpu, page_to_phys(page)); + kvm_x86_ops.set_apic_access_page_addr(vcpu, page_to_phys(page)); /* * Do not pin apic access page in memory, the MMU notifier @@ -8176,7 +8176,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (kvm_request_pending(vcpu)) { if (kvm_check_request(KVM_REQ_GET_VMCS12_PAGES, vcpu)) { - if (unlikely(!kvm_x86_ops->get_vmcs12_pages(vcpu))) { + if (unlikely(!kvm_x86_ops.get_vmcs12_pages(vcpu))) { r = 0; goto out; } @@ -8300,12 +8300,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) * SMI. */ if (vcpu->arch.smi_pending && !is_smm(vcpu)) - if (!kvm_x86_ops->enable_smi_window(vcpu)) + if (!kvm_x86_ops.enable_smi_window(vcpu)) req_immediate_exit = true; if (vcpu->arch.nmi_pending) - kvm_x86_ops->enable_nmi_window(vcpu); + kvm_x86_ops.enable_nmi_window(vcpu); if (kvm_cpu_has_injectable_intr(vcpu) || req_int_win) - kvm_x86_ops->enable_irq_window(vcpu); + kvm_x86_ops.enable_irq_window(vcpu); WARN_ON(vcpu->arch.exception.pending); } @@ -8322,7 +8322,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) preempt_disable(); - kvm_x86_ops->prepare_guest_switch(vcpu); + kvm_x86_ops.prepare_guest_switch(vcpu); /* * Disable IRQs before setting IN_GUEST_MODE. Posted interrupt @@ -8353,7 +8353,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) * notified with kvm_vcpu_kick. */ if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) - kvm_x86_ops->sync_pir_to_irr(vcpu); + kvm_x86_ops.sync_pir_to_irr(vcpu); if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) || need_resched() || signal_pending(current)) { @@ -8368,7 +8368,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (req_immediate_exit) { kvm_make_request(KVM_REQ_EVENT, vcpu); - kvm_x86_ops->request_immediate_exit(vcpu); + kvm_x86_ops.request_immediate_exit(vcpu); } trace_kvm_entry(vcpu->vcpu_id); @@ -8388,7 +8388,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD; } - kvm_x86_ops->run(vcpu); + kvm_x86_ops.run(vcpu); /* * Do this here before restoring debug registers on the host. And @@ -8398,7 +8398,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) */ if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) { WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP); - kvm_x86_ops->sync_dirty_debug_regs(vcpu); + kvm_x86_ops.sync_dirty_debug_regs(vcpu); kvm_update_dr0123(vcpu); kvm_update_dr6(vcpu); kvm_update_dr7(vcpu); @@ -8420,7 +8420,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) vcpu->mode = OUTSIDE_GUEST_MODE; smp_wmb(); - kvm_x86_ops->handle_exit_irqoff(vcpu, &exit_fastpath); + kvm_x86_ops.handle_exit_irqoff(vcpu, &exit_fastpath); /* * Consume any pending interrupts, including the possible source of @@ -8463,11 +8463,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (vcpu->arch.apic_attention) kvm_lapic_sync_from_vapic(vcpu); - r = kvm_x86_ops->handle_exit(vcpu, exit_fastpath); + r = kvm_x86_ops.handle_exit(vcpu, exit_fastpath); return r; cancel_injection: - kvm_x86_ops->cancel_injection(vcpu); + kvm_x86_ops.cancel_injection(vcpu); if (unlikely(vcpu->arch.apic_attention)) kvm_lapic_sync_from_vapic(vcpu); out: @@ -8477,13 +8477,13 @@ out: static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu) { if (!kvm_arch_vcpu_runnable(vcpu) && - (!kvm_x86_ops->pre_block || kvm_x86_ops->pre_block(vcpu) == 0)) { + (!kvm_x86_ops.pre_block || kvm_x86_ops.pre_block(vcpu) == 0)) { srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); kvm_vcpu_block(vcpu); vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); - if (kvm_x86_ops->post_block) - kvm_x86_ops->post_block(vcpu); + if (kvm_x86_ops.post_block) + kvm_x86_ops.post_block(vcpu); if (!kvm_check_request(KVM_REQ_UNHALT, vcpu)) return 1; @@ -8509,8 +8509,8 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu) static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu) { - if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) - kvm_x86_ops->check_nested_events(vcpu); + if (is_guest_mode(vcpu) && kvm_x86_ops.check_nested_events) + kvm_x86_ops.check_nested_events(vcpu); return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE && !vcpu->arch.apf.halted); @@ -8666,7 +8666,7 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) kvm_save_current_fpu(vcpu->arch.user_fpu); - /* PKRU is separately restored in kvm_x86_ops->run. */ + /* PKRU is separately restored in kvm_x86_ops.run. */ __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu->state, ~XFEATURE_MASK_PKRU); @@ -8869,10 +8869,10 @@ static void __get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) kvm_get_segment(vcpu, &sregs->tr, VCPU_SREG_TR); kvm_get_segment(vcpu, &sregs->ldt, VCPU_SREG_LDTR); - kvm_x86_ops->get_idt(vcpu, &dt); + kvm_x86_ops.get_idt(vcpu, &dt); sregs->idt.limit = dt.size; sregs->idt.base = dt.address; - kvm_x86_ops->get_gdt(vcpu, &dt); + kvm_x86_ops.get_gdt(vcpu, &dt); sregs->gdt.limit = dt.size; sregs->gdt.base = dt.address; @@ -9019,10 +9019,10 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) dt.size = sregs->idt.limit; dt.address = sregs->idt.base; - kvm_x86_ops->set_idt(vcpu, &dt); + kvm_x86_ops.set_idt(vcpu, &dt); dt.size = sregs->gdt.limit; dt.address = sregs->gdt.base; - kvm_x86_ops->set_gdt(vcpu, &dt); + kvm_x86_ops.set_gdt(vcpu, &dt); vcpu->arch.cr2 = sregs->cr2; mmu_reset_needed |= kvm_read_cr3(vcpu) != sregs->cr3; @@ -9032,16 +9032,16 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) kvm_set_cr8(vcpu, sregs->cr8); mmu_reset_needed |= vcpu->arch.efer != sregs->efer; - kvm_x86_ops->set_efer(vcpu, sregs->efer); + kvm_x86_ops.set_efer(vcpu, sregs->efer); mmu_reset_needed |= kvm_read_cr0(vcpu) != sregs->cr0; - kvm_x86_ops->set_cr0(vcpu, sregs->cr0); + kvm_x86_ops.set_cr0(vcpu, sregs->cr0); vcpu->arch.cr0 = sregs->cr0; mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4; cpuid_update_needed |= ((kvm_read_cr4(vcpu) ^ sregs->cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE)); - kvm_x86_ops->set_cr4(vcpu, sregs->cr4); + kvm_x86_ops.set_cr4(vcpu, sregs->cr4); if (cpuid_update_needed) kvm_update_cpuid(vcpu); @@ -9147,7 +9147,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, */ kvm_set_rflags(vcpu, rflags); - kvm_x86_ops->update_bp_intercept(vcpu); + kvm_x86_ops.update_bp_intercept(vcpu); r = 0; @@ -9358,7 +9358,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) kvm_hv_vcpu_init(vcpu); - r = kvm_x86_ops->vcpu_create(vcpu); + r = kvm_x86_ops.vcpu_create(vcpu); if (r) goto free_guest_fpu; @@ -9425,7 +9425,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu) kvmclock_reset(vcpu); - kvm_x86_ops->vcpu_free(vcpu); + kvm_x86_ops.vcpu_free(vcpu); kmem_cache_free(x86_emulator_cache, vcpu->arch.emulate_ctxt); free_cpumask_var(vcpu->arch.wbinvd_dirty_mask); @@ -9513,7 +9513,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) vcpu->arch.ia32_xss = 0; - kvm_x86_ops->vcpu_reset(vcpu, init_event); + kvm_x86_ops.vcpu_reset(vcpu, init_event); } void kvm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector) @@ -9538,7 +9538,7 @@ int kvm_arch_hardware_enable(void) bool stable, backwards_tsc = false; kvm_shared_msr_cpu_online(); - ret = kvm_x86_ops->hardware_enable(); + ret = kvm_x86_ops.hardware_enable(); if (ret != 0) return ret; @@ -9620,7 +9620,7 @@ int kvm_arch_hardware_enable(void) void kvm_arch_hardware_disable(void) { - kvm_x86_ops->hardware_disable(); + kvm_x86_ops.hardware_disable(); drop_user_return_notifiers(); } @@ -9638,7 +9638,7 @@ int kvm_arch_hardware_setup(void *opaque) if (r != 0) return r; - kvm_x86_ops = ops->runtime_ops; + memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops)); if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) supported_xss = 0; @@ -9665,7 +9665,7 @@ int kvm_arch_hardware_setup(void *opaque) void kvm_arch_hardware_unsetup(void) { - kvm_x86_ops->hardware_unsetup(); + kvm_x86_ops.hardware_unsetup(); } int kvm_arch_check_processor_compat(void *opaque) @@ -9704,7 +9704,7 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) pmu->need_cleanup = true; kvm_make_request(KVM_REQ_PMU, vcpu); } - kvm_x86_ops->sched_in(vcpu, cpu); + kvm_x86_ops.sched_in(vcpu, cpu); } void kvm_arch_free_vm(struct kvm *kvm) @@ -9748,7 +9748,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) kvm_page_track_init(kvm); kvm_mmu_init_vm(kvm); - return kvm_x86_ops->vm_init(kvm); + return kvm_x86_ops.vm_init(kvm); } int kvm_arch_post_init_vm(struct kvm *kvm) @@ -9871,8 +9871,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm) __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0); mutex_unlock(&kvm->slots_lock); } - if (kvm_x86_ops->vm_destroy) - kvm_x86_ops->vm_destroy(kvm); + if (kvm_x86_ops.vm_destroy) + kvm_x86_ops.vm_destroy(kvm); kvm_pic_destroy(kvm); kvm_ioapic_destroy(kvm); kvm_free_vcpus(kvm); @@ -10010,7 +10010,7 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm, /* * Call kvm_x86_ops dirty logging hooks when they are valid. * - * kvm_x86_ops->slot_disable_log_dirty is called when: + * kvm_x86_ops.slot_disable_log_dirty is called when: * * - KVM_MR_CREATE with dirty logging is disabled * - KVM_MR_FLAGS_ONLY with dirty logging is disabled in new flag @@ -10022,7 +10022,7 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm, * any additional overhead from PML when guest is running with dirty * logging disabled for memory slots. * - * kvm_x86_ops->slot_enable_log_dirty is called when switching new slot + * kvm_x86_ops.slot_enable_log_dirty is called when switching new slot * to dirty logging mode. * * If kvm_x86_ops dirty logging hooks are invalid, use write protect. @@ -10038,8 +10038,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm, * See the comments in fast_page_fault(). */ if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) { - if (kvm_x86_ops->slot_enable_log_dirty) { - kvm_x86_ops->slot_enable_log_dirty(kvm, new); + if (kvm_x86_ops.slot_enable_log_dirty) { + kvm_x86_ops.slot_enable_log_dirty(kvm, new); } else { int level = kvm_dirty_log_manual_protect_and_init_set(kvm) ? @@ -10056,8 +10056,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm, kvm_mmu_slot_remove_write_access(kvm, new, level); } } else { - if (kvm_x86_ops->slot_disable_log_dirty) - kvm_x86_ops->slot_disable_log_dirty(kvm, new); + if (kvm_x86_ops.slot_disable_log_dirty) + kvm_x86_ops.slot_disable_log_dirty(kvm, new); } } @@ -10125,8 +10125,8 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm, static inline bool kvm_guest_apic_has_interrupt(struct kvm_vcpu *vcpu) { return (is_guest_mode(vcpu) && - kvm_x86_ops->guest_apic_has_interrupt && - kvm_x86_ops->guest_apic_has_interrupt(vcpu)); + kvm_x86_ops.guest_apic_has_interrupt && + kvm_x86_ops.guest_apic_has_interrupt(vcpu)); } static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) @@ -10145,7 +10145,7 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) if (kvm_test_request(KVM_REQ_NMI, vcpu) || (vcpu->arch.nmi_pending && - kvm_x86_ops->nmi_allowed(vcpu))) + kvm_x86_ops.nmi_allowed(vcpu))) return true; if (kvm_test_request(KVM_REQ_SMI, vcpu) || @@ -10178,7 +10178,7 @@ bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu) kvm_test_request(KVM_REQ_EVENT, vcpu)) return true; - if (vcpu->arch.apicv_active && kvm_x86_ops->dy_apicv_has_pending_interrupt(vcpu)) + if (vcpu->arch.apicv_active && kvm_x86_ops.dy_apicv_has_pending_interrupt(vcpu)) return true; return false; @@ -10196,7 +10196,7 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu) int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu) { - return kvm_x86_ops->interrupt_allowed(vcpu); + return kvm_x86_ops.interrupt_allowed(vcpu); } unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu) @@ -10218,7 +10218,7 @@ unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu) { unsigned long rflags; - rflags = kvm_x86_ops->get_rflags(vcpu); + rflags = kvm_x86_ops.get_rflags(vcpu); if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) rflags &= ~X86_EFLAGS_TF; return rflags; @@ -10230,7 +10230,7 @@ static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP && kvm_is_linear_rip(vcpu, vcpu->arch.singlestep_rip)) rflags |= X86_EFLAGS_TF; - kvm_x86_ops->set_rflags(vcpu, rflags); + kvm_x86_ops.set_rflags(vcpu, rflags); } void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) @@ -10341,7 +10341,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu) if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) || (vcpu->arch.apf.send_user_only && - kvm_x86_ops->get_cpl(vcpu) == 0)) + kvm_x86_ops.get_cpl(vcpu) == 0)) return false; return true; @@ -10361,7 +10361,7 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu) * If interrupts are off we cannot even use an artificial * halt state. */ - return kvm_x86_ops->interrupt_allowed(vcpu); + return kvm_x86_ops.interrupt_allowed(vcpu); } void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu, @@ -10490,7 +10490,7 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons, irqfd->producer = prod; - return kvm_x86_ops->update_pi_irte(irqfd->kvm, + return kvm_x86_ops.update_pi_irte(irqfd->kvm, prod->irq, irqfd->gsi, 1); } @@ -10510,7 +10510,7 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons, * when the irq is masked/disabled or the consumer side (KVM * int this case doesn't want to receive the interrupts. */ - ret = kvm_x86_ops->update_pi_irte(irqfd->kvm, prod->irq, irqfd->gsi, 0); + ret = kvm_x86_ops.update_pi_irte(irqfd->kvm, prod->irq, irqfd->gsi, 0); if (ret) printk(KERN_INFO "irq bypass consumer (token %p) unregistration" " fails: %d\n", irqfd->consumer.token, ret); @@ -10519,7 +10519,7 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons, int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq, bool set) { - return kvm_x86_ops->update_pi_irte(kvm, host_irq, guest_irq, set); + return kvm_x86_ops.update_pi_irte(kvm, host_irq, guest_irq, set); } bool kvm_vector_hashing_enabled(void) diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index c1954e216b41..b968acc0516f 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -97,7 +97,7 @@ static inline bool is_64_bit_mode(struct kvm_vcpu *vcpu) if (!is_long_mode(vcpu)) return false; - kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); + kvm_x86_ops.get_cs_db_l_bits(vcpu, &cs_db, &cs_l); return cs_l; } @@ -237,7 +237,7 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk) static inline bool kvm_vcpu_latch_init(struct kvm_vcpu *vcpu) { - return is_smm(vcpu) || kvm_x86_ops->apic_init_signal_blocked(vcpu); + return is_smm(vcpu) || kvm_x86_ops.apic_init_signal_blocked(vcpu); } void kvm_set_pending_timer(struct kvm_vcpu *vcpu); -- cgit From 6e4fd06f3ee1b12ca42fc70522f371bf10977745 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 21 Mar 2020 13:26:01 -0700 Subject: KVM: x86: Drop __exit from kvm_x86_ops' hardware_unsetup() Remove the __exit annotation from VMX hardware_unsetup(), the hook can be reached during kvm_init() by way of kvm_arch_hardware_unsetup() if failure occurs at various points during initialization. Removing the annotation also lets us annotate vmx_x86_ops and svm_x86_ops with __initdata; otherwise, objtool complains because it doesn't understand that the vendor specific __initdata is being copied by value to a non-__initdata instance. Signed-off-by: Sean Christopherson Message-Id: <20200321202603.19355-8-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/vmx/vmx.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 54f991244fae..42a2d0d3984a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1056,7 +1056,7 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical) struct kvm_x86_ops { int (*hardware_enable)(void); void (*hardware_disable)(void); - void (*hardware_unsetup)(void); /* __exit */ + void (*hardware_unsetup)(void); bool (*cpu_has_accelerated_tpr)(void); bool (*has_emulated_msr)(int index); void (*cpuid_update)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index a4cd851e812d..f23c2cfb99a6 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7646,7 +7646,7 @@ static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu) return to_vmx(vcpu)->nested.vmxon; } -static __exit void hardware_unsetup(void) +static void hardware_unsetup(void) { if (nested) nested_vmx_hardware_unsetup(); -- cgit From e286ac0e38cbe187fbbc72d7e9da503f5ed26cc8 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 21 Mar 2020 13:26:02 -0700 Subject: KVM: VMX: Annotate vmx_x86_ops as __initdata Tag vmx_x86_ops with __initdata now the the struct is copied by value to a common x86 instance of kvm_x86_ops as part of kvm_init(). Signed-off-by: Sean Christopherson Message-Id: <20200321202603.19355-9-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index f23c2cfb99a6..4f844257a72d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7662,7 +7662,7 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit) return supported & BIT(bit); } -static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { +static struct kvm_x86_ops vmx_x86_ops __initdata = { .hardware_unsetup = hardware_unsetup, .hardware_enable = hardware_enable, -- cgit From 9c14ee21fcf74ac1f31e11180bf0dfd928c912cc Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 21 Mar 2020 13:26:03 -0700 Subject: KVM: SVM: Annotate svm_x86_ops as __initdata Tag svm_x86_ops with __initdata now the the struct is copied by value to a common x86 instance of kvm_x86_ops as part of kvm_init(). Signed-off-by: Sean Christopherson Message-Id: <20200321202603.19355-10-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index e5b6b0f7d95b..26f709cfff1d 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -7353,7 +7353,7 @@ static void svm_pre_update_apicv_exec_ctrl(struct kvm *kvm, bool activate) avic_update_access_page(kvm, activate); } -static struct kvm_x86_ops svm_x86_ops __ro_after_init = { +static struct kvm_x86_ops svm_x86_ops __initdata = { .hardware_unsetup = svm_hardware_teardown, .hardware_enable = svm_hardware_enable, .hardware_disable = svm_hardware_disable, -- cgit From 842f4be95899df22b5843ba1a7c8cf37e831a6e8 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 26 Mar 2020 09:07:12 -0700 Subject: KVM: VMX: Add a trampoline to fix VMREAD error handling Add a hand coded assembly trampoline to preserve volatile registers across vmread_error(), and to handle the calling convention differences between 64-bit and 32-bit due to asmlinkage on vmread_error(). Pass @field and @fault on the stack when invoking the trampoline to avoid clobbering volatile registers in the context of the inline assembly. Calling vmread_error() directly from inline assembly is partially broken on 64-bit, and completely broken on 32-bit. On 64-bit, it will clobber %rdi and %rsi (used to pass @field and @fault) and any volatile regs written by vmread_error(). On 32-bit, asmlinkage means vmread_error() expects the parameters to be passed on the stack, not via regs. Opportunistically zero out the result in the trampoline to save a few bytes of code for every VMREAD. A happy side effect of the trampoline is that the inline code footprint is reduced by three bytes on 64-bit due to PUSH/POP being more efficent (in terms of opcode bytes) than MOV. Fixes: 6e2020977e3e6 ("KVM: VMX: Add error handling to VMREAD helper") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson Message-Id: <20200326160712.28803-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/ops.h | 28 ++++++++++++++++------ arch/x86/kvm/vmx/vmenter.S | 58 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 79 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/ops.h b/arch/x86/kvm/vmx/ops.h index 45eaedee2ac0..09b0937d56b1 100644 --- a/arch/x86/kvm/vmx/ops.h +++ b/arch/x86/kvm/vmx/ops.h @@ -12,7 +12,8 @@ #define __ex(x) __kvm_handle_fault_on_reboot(x) -asmlinkage void vmread_error(unsigned long field, bool fault); +__attribute__((regparm(0))) void vmread_error_trampoline(unsigned long field, + bool fault); void vmwrite_error(unsigned long field, unsigned long value); void vmclear_error(struct vmcs *vmcs, u64 phys_addr); void vmptrld_error(struct vmcs *vmcs, u64 phys_addr); @@ -70,15 +71,28 @@ static __always_inline unsigned long __vmcs_readl(unsigned long field) asm volatile("1: vmread %2, %1\n\t" ".byte 0x3e\n\t" /* branch taken hint */ "ja 3f\n\t" - "mov %2, %%" _ASM_ARG1 "\n\t" - "xor %%" _ASM_ARG2 ", %%" _ASM_ARG2 "\n\t" - "2: call vmread_error\n\t" - "xor %k1, %k1\n\t" + + /* + * VMREAD failed. Push '0' for @fault, push the failing + * @field, and bounce through the trampoline to preserve + * volatile registers. + */ + "push $0\n\t" + "push %2\n\t" + "2:call vmread_error_trampoline\n\t" + + /* + * Unwind the stack. Note, the trampoline zeros out the + * memory for @fault so that the result is '0' on error. + */ + "pop %2\n\t" + "pop %1\n\t" "3:\n\t" + /* VMREAD faulted. As above, except push '1' for @fault. */ ".pushsection .fixup, \"ax\"\n\t" - "4: mov %2, %%" _ASM_ARG1 "\n\t" - "mov $1, %%" _ASM_ARG2 "\n\t" + "4: push $1\n\t" + "push %2\n\t" "jmp 2b\n\t" ".popsection\n\t" _ASM_EXTABLE(1b, 4b) diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S index ca2065166d1d..9651ba388ba9 100644 --- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -234,3 +234,61 @@ SYM_FUNC_START(__vmx_vcpu_run) 2: mov $1, %eax jmp 1b SYM_FUNC_END(__vmx_vcpu_run) + +/** + * vmread_error_trampoline - Trampoline from inline asm to vmread_error() + * @field: VMCS field encoding that failed + * @fault: %true if the VMREAD faulted, %false if it failed + + * Save and restore volatile registers across a call to vmread_error(). Note, + * all parameters are passed on the stack. + */ +SYM_FUNC_START(vmread_error_trampoline) + push %_ASM_BP + mov %_ASM_SP, %_ASM_BP + + push %_ASM_AX + push %_ASM_CX + push %_ASM_DX +#ifdef CONFIG_X86_64 + push %rdi + push %rsi + push %r8 + push %r9 + push %r10 + push %r11 +#endif +#ifdef CONFIG_X86_64 + /* Load @field and @fault to arg1 and arg2 respectively. */ + mov 3*WORD_SIZE(%rbp), %_ASM_ARG2 + mov 2*WORD_SIZE(%rbp), %_ASM_ARG1 +#else + /* Parameters are passed on the stack for 32-bit (see asmlinkage). */ + push 3*WORD_SIZE(%ebp) + push 2*WORD_SIZE(%ebp) +#endif + + call vmread_error + +#ifndef CONFIG_X86_64 + add $8, %esp +#endif + + /* Zero out @fault, which will be popped into the result register. */ + _ASM_MOV $0, 3*WORD_SIZE(%_ASM_BP) + +#ifdef CONFIG_X86_64 + pop %r11 + pop %r10 + pop %r9 + pop %r8 + pop %rsi + pop %rdi +#endif + pop %_ASM_DX + pop %_ASM_CX + pop %_ASM_AX + pop %_ASM_BP + + ret +SYM_FUNC_END(vmread_error_trampoline) -- cgit From 855c7e9b9c2c39fb8d108ac70d0eed530f80a2aa Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 25 Mar 2020 12:12:59 -0700 Subject: KVM: x86: Fix BUILD_BUG() in __cpuid_entry_get_reg() w/ CONFIG_UBSAN=y Take the target reg in __cpuid_entry_get_reg() instead of a pointer to a struct cpuid_reg. When building with -fsanitize=alignment (enabled by CONFIG_UBSAN=y), some versions of gcc get tripped up on the pointer and trigger the BUILD_BUG(). Reported-by: Randy Dunlap Fixes: d8577a4c238f8 ("KVM: x86: Do host CPUID at load time to mask KVM cpu caps") Fixes: 4c61534aaae2a ("KVM: x86: Introduce cpuid_entry_{get,has}() accessors") Signed-off-by: Sean Christopherson Message-Id: <20200325191259.23559-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 2 +- arch/x86/kvm/cpuid.h | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index b18c31a26cc2..901cd1fdecd9 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -269,7 +269,7 @@ static __always_inline void kvm_cpu_cap_mask(enum cpuid_leafs leaf, u32 mask) cpuid_count(cpuid.function, cpuid.index, &entry.eax, &entry.ebx, &entry.ecx, &entry.edx); - kvm_cpu_caps[leaf] &= *__cpuid_entry_get_reg(&entry, &cpuid); + kvm_cpu_caps[leaf] &= *__cpuid_entry_get_reg(&entry, cpuid.reg); } void kvm_set_cpu_caps(void) diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index 23b4cd1ad986..63a70f6a3df3 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -99,9 +99,9 @@ static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned int x86_featu } static __always_inline u32 *__cpuid_entry_get_reg(struct kvm_cpuid_entry2 *entry, - const struct cpuid_reg *cpuid) + u32 reg) { - switch (cpuid->reg) { + switch (reg) { case CPUID_EAX: return &entry->eax; case CPUID_EBX: @@ -121,7 +121,7 @@ static __always_inline u32 *cpuid_entry_get_reg(struct kvm_cpuid_entry2 *entry, { const struct cpuid_reg cpuid = x86_feature_cpuid(x86_feature); - return __cpuid_entry_get_reg(entry, &cpuid); + return __cpuid_entry_get_reg(entry, cpuid.reg); } static __always_inline u32 cpuid_entry_get(struct kvm_cpuid_entry2 *entry, @@ -189,7 +189,7 @@ static __always_inline u32 *guest_cpuid_get_register(struct kvm_vcpu *vcpu, if (!entry) return NULL; - return __cpuid_entry_get_reg(entry, &cpuid); + return __cpuid_entry_get_reg(entry, cpuid.reg); } static __always_inline bool guest_cpuid_has(struct kvm_vcpu *vcpu, -- cgit From ab33eb494c603bb200fbe3c3c018fc9c544487b8 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Tue, 31 Mar 2020 18:11:18 -0700 Subject: x86: remove __put_user_asm() infrastructure The last user was removed by commit 4b842e4e25b1 ("x86: get rid of small constant size cases in raw_copy_{to,from}_user()"). Get rid of the left-overs before somebody tries to use it again. Signed-off-by: Linus Torvalds --- arch/x86/include/asm/uaccess.h | 11 ----------- 1 file changed, 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index c8247a84244b..eea2b70d8300 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -383,17 +383,6 @@ struct __large_struct { unsigned long buf[100]; }; : : ltype(x), "m" (__m(addr)) \ : : label) -#define __put_user_failed(x, addr, itype, rtype, ltype, errret) \ - ({ __label__ __puflab; \ - int __pufret = errret; \ - __put_user_goto(x,addr,itype,rtype,ltype,__puflab); \ - __pufret = 0; \ - __puflab: __pufret; }) - -#define __put_user_asm(x, addr, retval, itype, rtype, ltype, errret) do { \ - retval = __put_user_failed(x, addr, itype, rtype, ltype, errret); \ -} while (0) - /** * __get_user - Get a simple variable from user space, with less checking. * @x: Variable to store result. -- cgit From 1a323ea5356edbb3073dc59d51b9e6b86908857d Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Tue, 31 Mar 2020 18:23:47 -0700 Subject: x86: get rid of 'errret' argument to __get_user_xyz() macross Every remaining user just has the error case returning -EFAULT. In fact, the exception was __get_user_asm_nozero(), which was removed in commit 4b842e4e25b1 ("x86: get rid of small constant size cases in raw_copy_{to,from}_user()"), and the other __get_user_xyz() macros just followed suit for consistency. Fix up some macro whitespace while at it. Signed-off-by: Linus Torvalds --- arch/x86/include/asm/uaccess.h | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index eea2b70d8300..bd924d6b4e68 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -279,13 +279,13 @@ do { \ } while (0) #ifdef CONFIG_X86_32 -#define __get_user_asm_u64(x, ptr, retval, errret) \ +#define __get_user_asm_u64(x, ptr, retval) \ ({ \ __typeof__(ptr) __ptr = (ptr); \ - asm volatile("\n" \ + asm volatile("\n" \ "1: movl %2,%%eax\n" \ "2: movl %3,%%edx\n" \ - "3:\n" \ + "3:\n" \ ".section .fixup,\"ax\"\n" \ "4: mov %4,%0\n" \ " xorl %%eax,%%eax\n" \ @@ -296,37 +296,37 @@ do { \ _ASM_EXTABLE_UA(2b, 4b) \ : "=r" (retval), "=&A"(x) \ : "m" (__m(__ptr)), "m" __m(((u32 __user *)(__ptr)) + 1), \ - "i" (errret), "0" (retval)); \ + "i" (-EFAULT), "0" (retval)); \ }) #else -#define __get_user_asm_u64(x, ptr, retval, errret) \ - __get_user_asm(x, ptr, retval, "q", "", "=r", errret) +#define __get_user_asm_u64(x, ptr, retval) \ + __get_user_asm(x, ptr, retval, "q", "", "=r") #endif -#define __get_user_size(x, ptr, size, retval, errret) \ +#define __get_user_size(x, ptr, size, retval) \ do { \ retval = 0; \ __chk_user_ptr(ptr); \ switch (size) { \ case 1: \ - __get_user_asm(x, ptr, retval, "b", "b", "=q", errret); \ + __get_user_asm(x, ptr, retval, "b", "b", "=q"); \ break; \ case 2: \ - __get_user_asm(x, ptr, retval, "w", "w", "=r", errret); \ + __get_user_asm(x, ptr, retval, "w", "w", "=r"); \ break; \ case 4: \ - __get_user_asm(x, ptr, retval, "l", "k", "=r", errret); \ + __get_user_asm(x, ptr, retval, "l", "k", "=r"); \ break; \ case 8: \ - __get_user_asm_u64(x, ptr, retval, errret); \ + __get_user_asm_u64(x, ptr, retval); \ break; \ default: \ (x) = __get_user_bad(); \ } \ } while (0) -#define __get_user_asm(x, addr, err, itype, rtype, ltype, errret) \ +#define __get_user_asm(x, addr, err, itype, rtype, ltype) \ asm volatile("\n" \ "1: mov"itype" %2,%"rtype"1\n" \ "2:\n" \ @@ -337,7 +337,7 @@ do { \ ".previous\n" \ _ASM_EXTABLE_UA(1b, 3b) \ : "=r" (err), ltype(x) \ - : "m" (__m(addr)), "i" (errret), "0" (err)) + : "m" (__m(addr)), "i" (-EFAULT), "0" (err)) #define __put_user_nocheck(x, ptr, size) \ ({ \ @@ -361,7 +361,7 @@ __pu_label: \ __typeof__(ptr) __gu_ptr = (ptr); \ __typeof__(size) __gu_size = (size); \ __uaccess_begin_nospec(); \ - __get_user_size(__gu_val, __gu_ptr, __gu_size, __gu_err, -EFAULT); \ + __get_user_size(__gu_val, __gu_ptr, __gu_size, __gu_err); \ __uaccess_end(); \ (x) = (__force __typeof__(*(ptr)))__gu_val; \ __builtin_expect(__gu_err, 0); \ @@ -485,7 +485,7 @@ static __must_check __always_inline bool user_access_begin(const void __user *pt do { \ int __gu_err; \ __inttype(*(ptr)) __gu_val; \ - __get_user_size(__gu_val, (ptr), sizeof(*(ptr)), __gu_err, -EFAULT); \ + __get_user_size(__gu_val, (ptr), sizeof(*(ptr)), __gu_err); \ (x) = (__force __typeof__(*(ptr)))__gu_val; \ if (unlikely(__gu_err)) goto err_label; \ } while (0) -- cgit From 20c384f1ea1a0bc7320bc445c72dd02d2970d594 Mon Sep 17 00:00:00 2001 From: Jason Wang Date: Thu, 26 Mar 2020 22:01:17 +0800 Subject: vhost: refine vhost and vringh kconfig Currently, CONFIG_VHOST depends on CONFIG_VIRTUALIZATION. But vhost is not necessarily for VM since it's a generic userspace and kernel communication protocol. Such dependency may prevent archs without virtualization support from using vhost. To solve this, a dedicated vhost menu is created under drivers so CONIFG_VHOST can be decoupled out of CONFIG_VIRTUALIZATION. While at it, also squash Kconfig.vringh into vhost Kconfig file. This avoids the trick of conditional inclusion from VOP or CAIF. Then it will be easier to introduce new vringh users and common dependency for both vringh and vhost. Signed-off-by: Jason Wang Link: https://lore.kernel.org/r/20200326140125.19794-2-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin --- arch/x86/kvm/Kconfig | 4 ---- 1 file changed, 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 9fea0757db92..d8154e0684b6 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -107,8 +107,4 @@ config KVM_MMU_AUDIT This option adds a R/W kVM module parameter 'mmu_audit', which allows auditing of KVM MMU events at runtime. -# OK, it's a little counter-intuitive to do this, but it puts it neatly under -# the virtualization menu. -source "drivers/vhost/Kconfig" - endif # VIRTUALIZATION -- cgit From 3680785692fbc57453a7bf973f962a92d1b9ec33 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Wed, 1 Apr 2020 10:52:01 -0700 Subject: x86: get rid of 'rtype' argument to __put_user_goto() macro The 'rtype' argument goes back to pre-git (and pre-BK) times, and comes from the fact that we used to not necessarily have the same type sizes for the arguments of the inline asm as we did for the actual accesses we did. So 'rtype' is the 'register type' - the override of the register size in the inline asm when it doesn't match the actual size of the variable we use as the output argument (for when you used "put_user()" on an "int" value that was assigned to a byte-sized user space access etc). That mismatch doesn't actually exist any more, and should probably never have existed in the first place. It's a horrid bug just waiting to happen (using more - or less - of the variable that the compiler expected us to use). I think we had some odd casting going on to hide the effects of that oddity after-the-fact, but those are long gone, and these days we should always have the right size value in the first place, using things like __typeof__(*(ptr)) __pu_val = (x); and gcc should thus have the right register size without any manual 'rtype' games. Signed-off-by: Linus Torvalds --- arch/x86/include/asm/uaccess.h | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index bd924d6b4e68..6957fdc4855b 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -198,7 +198,7 @@ __typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL)) : "A" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx") #else #define __put_user_goto_u64(x, ptr, label) \ - __put_user_goto(x, ptr, "q", "", "er", label) + __put_user_goto(x, ptr, "q", "er", label) #define __put_user_x8(x, ptr, __ret_pu) __put_user_x(8, x, ptr, __ret_pu) #endif @@ -262,13 +262,13 @@ do { \ __chk_user_ptr(ptr); \ switch (size) { \ case 1: \ - __put_user_goto(x, ptr, "b", "b", "iq", label); \ + __put_user_goto(x, ptr, "b", "iq", label); \ break; \ case 2: \ - __put_user_goto(x, ptr, "w", "w", "ir", label); \ + __put_user_goto(x, ptr, "w", "ir", label); \ break; \ case 4: \ - __put_user_goto(x, ptr, "l", "k", "ir", label); \ + __put_user_goto(x, ptr, "l", "ir", label); \ break; \ case 8: \ __put_user_goto_u64(x, ptr, label); \ @@ -376,10 +376,10 @@ struct __large_struct { unsigned long buf[100]; }; * we do not write to any memory gcc knows about, so there are no * aliasing issues. */ -#define __put_user_goto(x, addr, itype, rtype, ltype, label) \ +#define __put_user_goto(x, addr, itype, ltype, label) \ asm_volatile_goto("\n" \ - "1: mov"itype" %"rtype"0,%1\n" \ - _ASM_EXTABLE_UA(1b, %l2) \ + "1: mov"itype" %0,%1\n" \ + _ASM_EXTABLE_UA(1b, %l2) \ : : ltype(x), "m" (__m(addr)) \ : : label) -- cgit From 7da63b3d54aa7f1ad4b33a50d88edd623f29fded Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Wed, 1 Apr 2020 12:41:50 -0700 Subject: x86: get rid of 'rtype' argument to __get_user_asm() macro This is the exact same thing as 3680785692fb ("x86: get rid of 'rtype' argument to __put_user_goto() macro") except it's about __get_user_asm() rather than __put_user_goto(). The reasons are the same: having the low-level asm access the argument with a different size than the compiler thinks it does is fundamentally wrong. But unlike the __put_user_goto() case, we actually did tell the compiler that we used a bigger variable (either long or long long), and then only filled in the low bits, and ended up "fixing" this by casting the result to the proper pointer type. That's because we needed to use a non-qualified type (the user pointer might be a const pointer!), and that makes this a bit more painful. Our '__inttype()' macro used to be lazy and only differentiate between "fits in a register" or "needs two registers". So this fix had to also make that '__inttype()' macro more precise. Signed-off-by: Linus Torvalds --- arch/x86/include/asm/uaccess.h | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 6957fdc4855b..f26c84316ba5 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -126,11 +126,17 @@ extern int __get_user_bad(void); }) /* - * This is a type: either unsigned long, if the argument fits into - * that type, or otherwise unsigned long long. + * This is the smallest unsigned integer type that can fit a value + * (up to 'long long') */ -#define __inttype(x) \ -__typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL)) +#define __inttype(x) __typeof__( \ + __typefits(x,char, \ + __typefits(x,short, \ + __typefits(x,int, \ + __typefits(x,long,0ULL))))) + +#define __typefits(x,type,not) \ + __builtin_choose_expr(sizeof(x)<=sizeof(type),(unsigned type)0,not) /** * get_user - Get a simple variable from user space. @@ -301,7 +307,7 @@ do { \ #else #define __get_user_asm_u64(x, ptr, retval) \ - __get_user_asm(x, ptr, retval, "q", "", "=r") + __get_user_asm(x, ptr, retval, "q", "=r") #endif #define __get_user_size(x, ptr, size, retval) \ @@ -310,13 +316,13 @@ do { \ __chk_user_ptr(ptr); \ switch (size) { \ case 1: \ - __get_user_asm(x, ptr, retval, "b", "b", "=q"); \ + __get_user_asm(x, ptr, retval, "b", "=q"); \ break; \ case 2: \ - __get_user_asm(x, ptr, retval, "w", "w", "=r"); \ + __get_user_asm(x, ptr, retval, "w", "=r"); \ break; \ case 4: \ - __get_user_asm(x, ptr, retval, "l", "k", "=r"); \ + __get_user_asm(x, ptr, retval, "l", "=r"); \ break; \ case 8: \ __get_user_asm_u64(x, ptr, retval); \ @@ -326,13 +332,13 @@ do { \ } \ } while (0) -#define __get_user_asm(x, addr, err, itype, rtype, ltype) \ +#define __get_user_asm(x, addr, err, itype, ltype) \ asm volatile("\n" \ - "1: mov"itype" %2,%"rtype"1\n" \ + "1: mov"itype" %2,%1\n" \ "2:\n" \ ".section .fixup,\"ax\"\n" \ "3: mov %3,%0\n" \ - " xor"itype" %"rtype"1,%"rtype"1\n" \ + " xor"itype" %1,%1\n" \ " jmp 2b\n" \ ".previous\n" \ _ASM_EXTABLE_UA(1b, 3b) \ -- cgit From 890f0b0d27dc400679b9a91d04ca44f5ee4c19c0 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Wed, 1 Apr 2020 13:23:14 -0700 Subject: x86: start using named parameters for low-level uaccess asms This is partly for readability - using named arguments instead of numbered ones makes it muchmore obvious just what is going on. Using "%[efault]" instead of "%4" for the special -EFAULT constant just means that you don't have to count the arguments to see what's up. But the motivation for all this cleanup is that when we'll start to conditionally use "asm goto" even for the __get_user_asm() case, the argument numbers will depend on whether we have an error output, or an error label we can just directly jump to. So this moves us towards named arguments for the same reason that we have to use named arguments for the asms that use SET_CC(): numbering will eventually become similarly unreliable and depends on whether we can use particular compiler features or not. Signed-off-by: Linus Torvalds --- arch/x86/include/asm/uaccess.h | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index f26c84316ba5..d8f283b9a569 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -289,20 +289,22 @@ do { \ ({ \ __typeof__(ptr) __ptr = (ptr); \ asm volatile("\n" \ - "1: movl %2,%%eax\n" \ - "2: movl %3,%%edx\n" \ + "1: movl %[lowbits],%%eax\n" \ + "2: movl %[highbits],%%edx\n" \ "3:\n" \ ".section .fixup,\"ax\"\n" \ - "4: mov %4,%0\n" \ + "4: mov %[efault],%[errout]\n" \ " xorl %%eax,%%eax\n" \ " xorl %%edx,%%edx\n" \ " jmp 3b\n" \ ".previous\n" \ _ASM_EXTABLE_UA(1b, 4b) \ _ASM_EXTABLE_UA(2b, 4b) \ - : "=r" (retval), "=&A"(x) \ - : "m" (__m(__ptr)), "m" __m(((u32 __user *)(__ptr)) + 1), \ - "i" (-EFAULT), "0" (retval)); \ + : [errout] "=r" (retval), \ + [output] "=&A"(x) \ + : [lowbits] "m" (__m(__ptr)), \ + [highbits] "m" __m(((u32 __user *)(__ptr)) + 1), \ + [efault] "i" (-EFAULT), "0" (retval)); \ }) #else @@ -334,16 +336,18 @@ do { \ #define __get_user_asm(x, addr, err, itype, ltype) \ asm volatile("\n" \ - "1: mov"itype" %2,%1\n" \ + "1: mov"itype" %[umem],%[output]\n" \ "2:\n" \ ".section .fixup,\"ax\"\n" \ - "3: mov %3,%0\n" \ - " xor"itype" %1,%1\n" \ + "3: mov %[efault],%[errout]\n" \ + " xor"itype" %[output],%[output]\n" \ " jmp 2b\n" \ ".previous\n" \ _ASM_EXTABLE_UA(1b, 3b) \ - : "=r" (err), ltype(x) \ - : "m" (__m(addr)), "i" (-EFAULT), "0" (err)) + : [errout] "=r" (err), \ + [output] ltype(x) \ + : [umem] "m" (__m(addr)), \ + [efault] "i" (-EFAULT), "0" (err)) #define __put_user_nocheck(x, ptr, size) \ ({ \ -- cgit From 630f289b7114c0e68519cbd634e2b7ec804ca8c5 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Wed, 1 Apr 2020 21:03:12 -0700 Subject: asm-generic: make more kernel-space headers mandatory Change a header to mandatory-y if both of the following are met: [1] At least one architecture (except um) specifies it as generic-y in arch/*/include/asm/Kbuild [2] Every architecture (except um) either has its own implementation (arch/*/include/asm/*.h) or specifies it as generic-y in arch/*/include/asm/Kbuild This commit was generated by the following shell script. ----------------------------------->8----------------------------------- arches=$(cd arch; ls -1 | sed -e '/Kconfig/d' -e '/um/d') tmpfile=$(mktemp) grep "^mandatory-y +=" include/asm-generic/Kbuild > $tmpfile find arch -path 'arch/*/include/asm/Kbuild' | xargs sed -n 's/^generic-y += \(.*\)/\1/p' | sort -u | while read header do mandatory=yes for arch in $arches do if ! grep -q "generic-y += $header" arch/$arch/include/asm/Kbuild && ! [ -f arch/$arch/include/asm/$header ]; then mandatory=no break fi done if [ "$mandatory" = yes ]; then echo "mandatory-y += $header" >> $tmpfile for arch in $arches do sed -i "/generic-y += $header/d" arch/$arch/include/asm/Kbuild done fi done sed -i '/^mandatory-y +=/d' include/asm-generic/Kbuild LANG=C sort $tmpfile >> include/asm-generic/Kbuild ----------------------------------->8----------------------------------- One obvious benefit is the diff stat: 25 files changed, 52 insertions(+), 557 deletions(-) It is tedious to list generic-y for each arch that needs it. So, mandatory-y works like a fallback default (by just wrapping asm-generic one) when arch does not have a specific header implementation. See the following commits: def3f7cefe4e81c296090e1722a76551142c227c a1b39bae16a62ce4aae02d958224f19316d98b24 It is tedious to convert headers one by one, so I processed by a shell script. Signed-off-by: Masahiro Yamada Signed-off-by: Andrew Morton Cc: Michal Simek Cc: Christoph Hellwig Cc: Arnd Bergmann Link: http://lkml.kernel.org/r/20200210175452.5030-1-masahiroy@kernel.org Signed-off-by: Linus Torvalds --- arch/x86/include/asm/Kbuild | 2 -- 1 file changed, 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild index ea34464d6221..b19ec8282d50 100644 --- a/arch/x86/include/asm/Kbuild +++ b/arch/x86/include/asm/Kbuild @@ -10,5 +10,3 @@ generated-y += xen-hypercalls.h generic-y += early_ioremap.h generic-y += export.h generic-y += mcs_spinlock.h -generic-y += mm-arch-hooks.h -generic-y += mmiowb.h -- cgit From 7969f2264f92871b9879796f08b455f737373718 Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Wed, 1 Apr 2020 21:07:49 -0700 Subject: mm/vma: make vma_is_foreign() available for general use Idea of a foreign VMA with respect to the present context is very generic. But currently there are two identical definitions for this in powerpc and x86 platforms. Lets consolidate those redundant definitions while making vma_is_foreign() available for general use later. This should not cause any functional change. Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton Acked-by: Vlastimil Babka Cc: Paul Mackerras Cc: Michael Ellerman Cc: Thomas Gleixner Cc: Ingo Molnar Link: http://lkml.kernel.org/r/1582782965-3274-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- arch/x86/include/asm/mmu_context.h | 15 --------------- 1 file changed, 15 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index b538d9ddee9c..4e55370e48e8 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -213,21 +213,6 @@ static inline void arch_unmap(struct mm_struct *mm, unsigned long start, * So do not enforce things if the VMA is not from the current * mm, or if we are in a kernel thread. */ -static inline bool vma_is_foreign(struct vm_area_struct *vma) -{ - if (!current->mm) - return true; - /* - * Should PKRU be enforced on the access to this VMA? If - * the VMA is from another process, then PKRU has no - * relevance and should not be enforced. - */ - if (current->mm != vma->vm_mm) - return true; - - return false; -} - static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write, bool execute, bool foreign) { -- cgit From 39678191cd8988c811813baf4c97b43bf46094e4 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 1 Apr 2020 21:08:10 -0700 Subject: x86/mm: use helper fault_signal_pending() Let's move the fatal signal check even earlier so that we can directly use the new fault_signal_pending() in x86 mm code. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Tested-by: Brian Geffon Cc: Andrea Arcangeli Cc: Bobby Powers Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Matthew Wilcox Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Mike Rapoport Cc: Pavel Emelyanov Link: http://lkml.kernel.org/r/20200220155353.8676-5-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/x86/mm/fault.c | 28 +++++++++++++--------------- 1 file changed, 13 insertions(+), 15 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 629fdf13f846..552770b6af9a 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1464,27 +1464,25 @@ good_area: fault = handle_mm_fault(vma, address, flags); major |= fault & VM_FAULT_MAJOR; + /* Quick path to respond to signals */ + if (fault_signal_pending(fault, regs)) { + if (!user_mode(regs)) + no_context(regs, hw_error_code, address, SIGBUS, + BUS_ADRERR); + return; + } + /* * If we need to retry the mmap_sem has already been released, * and if there is a fatal signal pending there is no guarantee * that we made any progress. Handle this case first. */ - if (unlikely(fault & VM_FAULT_RETRY)) { + if (unlikely((fault & VM_FAULT_RETRY) && + (flags & FAULT_FLAG_ALLOW_RETRY))) { /* Retry at most once */ - if (flags & FAULT_FLAG_ALLOW_RETRY) { - flags &= ~FAULT_FLAG_ALLOW_RETRY; - flags |= FAULT_FLAG_TRIED; - if (!fatal_signal_pending(tsk)) - goto retry; - } - - /* User mode? Just return to handle the fatal exception */ - if (flags & FAULT_FLAG_USER) - return; - - /* Not returning to user mode? Handle exceptions or die: */ - no_context(regs, hw_error_code, address, SIGBUS, BUS_ADRERR); - return; + flags &= ~FAULT_FLAG_ALLOW_RETRY; + flags |= FAULT_FLAG_TRIED; + goto retry; } up_read(&mm->mmap_sem); -- cgit From dde1607248328cdb7570e3a252e8fb76b3411d66 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 1 Apr 2020 21:08:37 -0700 Subject: mm: introduce FAULT_FLAG_DEFAULT Although there're tons of arch-specific page fault handlers, most of them are still sharing the same initial value of the page fault flags. Say, merely all of the page fault handlers would allow the fault to be retried, and they also allow the fault to respond to SIGKILL. Let's define a default value for the fault flags to replace those initial page fault flags that were copied over. With this, it'll be far easier to introduce new fault flag that can be used by all the architectures instead of touching all the archs. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Tested-by: Brian Geffon Reviewed-by: David Hildenbrand Cc: Andrea Arcangeli Cc: Bobby Powers Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Matthew Wilcox Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Mike Rapoport Cc: Pavel Emelyanov Link: http://lkml.kernel.org/r/20200220160238.9694-1-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/x86/mm/fault.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 552770b6af9a..f70a08e5271f 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1310,7 +1310,7 @@ void do_user_addr_fault(struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; vm_fault_t fault, major = 0; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; + unsigned int flags = FAULT_FLAG_DEFAULT; tsk = current; mm = tsk->mm; -- cgit From 4064b982706375025628094e51d11cf1a958a5d3 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 1 Apr 2020 21:08:45 -0700 Subject: mm: allow VM_FAULT_RETRY for multiple times The idea comes from a discussion between Linus and Andrea [1]. Before this patch we only allow a page fault to retry once. We achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing handle_mm_fault() the second time. This was majorly used to avoid unexpected starvation of the system by looping over forever to handle the page fault on a single page. However that should hardly happen, and after all for each code path to return a VM_FAULT_RETRY we'll first wait for a condition (during which time we should possibly yield the cpu) to happen before VM_FAULT_RETRY is really returned. This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY. It means that the page fault handler now can retry the page fault for multiple times if necessary without the need to generate another page fault event. Meanwhile we still keep the FAULT_FLAG_TRIED flag so page fault handler can still identify whether a page fault is the first attempt or not. Then we'll have these combinations of fault flags (only considering ALLOW_RETRY flag and TRIED flag): - ALLOW_RETRY and !TRIED: this means the page fault allows to retry, and this is the first try - ALLOW_RETRY and TRIED: this means the page fault allows to retry, and this is not the first try - !ALLOW_RETRY and !TRIED: this means the page fault does not allow to retry at all - !ALLOW_RETRY and TRIED: this is forbidden and should never be used In existing code we have multiple places that has taken special care of the first condition above by checking against (fault_flags & FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect the first retry of a page fault by checking against both (fault_flags & FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now even the 2nd try will have the ALLOW_RETRY set, then use that helper in all existing special paths. One example is in __lock_page_or_retry(), now we'll drop the mmap_sem only in the first attempt of page fault and we'll keep it in follow up retries, so old locking behavior will be retained. This will be a nice enhancement for current code [2] at the same time a supporting material for the future userfaultfd-writeprotect work, since in that work there will always be an explicit userfault writeprotect retry for protected pages, and if that cannot resolve the page fault (e.g., when userfaultfd-writeprotect is used in conjunction with swapped pages) then we'll possibly need a 3rd retry of the page fault. It might also benefit other potential users who will have similar requirement like userfault write-protection. GUP code is not touched yet and will be covered in follow up patch. Please read the thread below for more information. [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/ [2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/ Suggested-by: Linus Torvalds Suggested-by: Andrea Arcangeli Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Tested-by: Brian Geffon Cc: Bobby Powers Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Matthew Wilcox Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Mike Rapoport Cc: Pavel Emelyanov Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/x86/mm/fault.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index f70a08e5271f..859519f5b342 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1479,8 +1479,6 @@ good_area: */ if (unlikely((fault & VM_FAULT_RETRY) && (flags & FAULT_FLAG_ALLOW_RETRY))) { - /* Retry at most once */ - flags &= ~FAULT_FLAG_ALLOW_RETRY; flags |= FAULT_FLAG_TRIED; goto retry; } -- cgit From 514ccc194971d0649e4e7ec8a9b3a6e33561d7bf Mon Sep 17 00:00:00 2001 From: Qian Cai Date: Thu, 2 Apr 2020 11:39:55 -0400 Subject: x86/kvm: fix a missing-prototypes "vmread_error" The commit 842f4be95899 ("KVM: VMX: Add a trampoline to fix VMREAD error handling") removed the declaration of vmread_error() causes a W=1 build failure with KVM_WERROR=y. Fix it by adding it back. arch/x86/kvm/vmx/vmx.c:359:17: error: no previous prototype for 'vmread_error' [-Werror=missing-prototypes] asmlinkage void vmread_error(unsigned long field, bool fault) ^~~~~~~~~~~~ Signed-off-by: Qian Cai Message-Id: <20200402153955.1695-1-cai@lca.pw> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/ops.h | 1 + 1 file changed, 1 insertion(+) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/ops.h b/arch/x86/kvm/vmx/ops.h index 09b0937d56b1..19717d0a1100 100644 --- a/arch/x86/kvm/vmx/ops.h +++ b/arch/x86/kvm/vmx/ops.h @@ -12,6 +12,7 @@ #define __ex(x) __kvm_handle_fault_on_reboot(x) +asmlinkage void vmread_error(unsigned long field, bool fault); __attribute__((regparm(0))) void vmread_error_trampoline(unsigned long field, bool fault); void vmwrite_error(unsigned long field, unsigned long value); -- cgit From 46a010dd6896045484c7749b3d30b685c649eb18 Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Tue, 24 Mar 2020 10:41:51 +0100 Subject: kVM SVM: Move SVM related files to own sub-directory Move svm.c and pmu_amd.c into their own arch/x86/kvm/svm/ subdirectory. Signed-off-by: Joerg Roedel Message-Id: <20200324094154.32352-2-joro@8bytes.org> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/Makefile | 2 +- arch/x86/kvm/pmu_amd.c | 327 --- arch/x86/kvm/svm.c | 7514 ------------------------------------------------ arch/x86/kvm/svm/pmu.c | 327 +++ arch/x86/kvm/svm/svm.c | 7514 ++++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 7842 insertions(+), 7842 deletions(-) delete mode 100644 arch/x86/kvm/pmu_amd.c delete mode 100644 arch/x86/kvm/svm.c create mode 100644 arch/x86/kvm/svm/pmu.c create mode 100644 arch/x86/kvm/svm/svm.c (limited to 'arch/x86') diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index e553f0fdd87d..0ef050982c37 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -14,7 +14,7 @@ kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \ hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o -kvm-amd-y += svm.o pmu_amd.o +kvm-amd-y += svm/svm.o svm/pmu.o obj-$(CONFIG_KVM) += kvm.o obj-$(CONFIG_KVM_INTEL) += kvm-intel.o diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c deleted file mode 100644 index ce0b10fe5e2b..000000000000 --- a/arch/x86/kvm/pmu_amd.c +++ /dev/null @@ -1,327 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-only -/* - * KVM PMU support for AMD - * - * Copyright 2015, Red Hat, Inc. and/or its affiliates. - * - * Author: - * Wei Huang - * - * Implementation is based on pmu_intel.c file - */ -#include -#include -#include -#include "x86.h" -#include "cpuid.h" -#include "lapic.h" -#include "pmu.h" - -enum pmu_type { - PMU_TYPE_COUNTER = 0, - PMU_TYPE_EVNTSEL, -}; - -enum index { - INDEX_ZERO = 0, - INDEX_ONE, - INDEX_TWO, - INDEX_THREE, - INDEX_FOUR, - INDEX_FIVE, - INDEX_ERROR, -}; - -/* duplicated from amd_perfmon_event_map, K7 and above should work. */ -static struct kvm_event_hw_type_mapping amd_event_mapping[] = { - [0] = { 0x76, 0x00, PERF_COUNT_HW_CPU_CYCLES }, - [1] = { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS }, - [2] = { 0x7d, 0x07, PERF_COUNT_HW_CACHE_REFERENCES }, - [3] = { 0x7e, 0x07, PERF_COUNT_HW_CACHE_MISSES }, - [4] = { 0xc2, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS }, - [5] = { 0xc3, 0x00, PERF_COUNT_HW_BRANCH_MISSES }, - [6] = { 0xd0, 0x00, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND }, - [7] = { 0xd1, 0x00, PERF_COUNT_HW_STALLED_CYCLES_BACKEND }, -}; - -static unsigned int get_msr_base(struct kvm_pmu *pmu, enum pmu_type type) -{ - struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu); - - if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) { - if (type == PMU_TYPE_COUNTER) - return MSR_F15H_PERF_CTR; - else - return MSR_F15H_PERF_CTL; - } else { - if (type == PMU_TYPE_COUNTER) - return MSR_K7_PERFCTR0; - else - return MSR_K7_EVNTSEL0; - } -} - -static enum index msr_to_index(u32 msr) -{ - switch (msr) { - case MSR_F15H_PERF_CTL0: - case MSR_F15H_PERF_CTR0: - case MSR_K7_EVNTSEL0: - case MSR_K7_PERFCTR0: - return INDEX_ZERO; - case MSR_F15H_PERF_CTL1: - case MSR_F15H_PERF_CTR1: - case MSR_K7_EVNTSEL1: - case MSR_K7_PERFCTR1: - return INDEX_ONE; - case MSR_F15H_PERF_CTL2: - case MSR_F15H_PERF_CTR2: - case MSR_K7_EVNTSEL2: - case MSR_K7_PERFCTR2: - return INDEX_TWO; - case MSR_F15H_PERF_CTL3: - case MSR_F15H_PERF_CTR3: - case MSR_K7_EVNTSEL3: - case MSR_K7_PERFCTR3: - return INDEX_THREE; - case MSR_F15H_PERF_CTL4: - case MSR_F15H_PERF_CTR4: - return INDEX_FOUR; - case MSR_F15H_PERF_CTL5: - case MSR_F15H_PERF_CTR5: - return INDEX_FIVE; - default: - return INDEX_ERROR; - } -} - -static inline struct kvm_pmc *get_gp_pmc_amd(struct kvm_pmu *pmu, u32 msr, - enum pmu_type type) -{ - switch (msr) { - case MSR_F15H_PERF_CTL0: - case MSR_F15H_PERF_CTL1: - case MSR_F15H_PERF_CTL2: - case MSR_F15H_PERF_CTL3: - case MSR_F15H_PERF_CTL4: - case MSR_F15H_PERF_CTL5: - case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3: - if (type != PMU_TYPE_EVNTSEL) - return NULL; - break; - case MSR_F15H_PERF_CTR0: - case MSR_F15H_PERF_CTR1: - case MSR_F15H_PERF_CTR2: - case MSR_F15H_PERF_CTR3: - case MSR_F15H_PERF_CTR4: - case MSR_F15H_PERF_CTR5: - case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3: - if (type != PMU_TYPE_COUNTER) - return NULL; - break; - default: - return NULL; - } - - return &pmu->gp_counters[msr_to_index(msr)]; -} - -static unsigned amd_find_arch_event(struct kvm_pmu *pmu, - u8 event_select, - u8 unit_mask) -{ - int i; - - for (i = 0; i < ARRAY_SIZE(amd_event_mapping); i++) - if (amd_event_mapping[i].eventsel == event_select - && amd_event_mapping[i].unit_mask == unit_mask) - break; - - if (i == ARRAY_SIZE(amd_event_mapping)) - return PERF_COUNT_HW_MAX; - - return amd_event_mapping[i].event_type; -} - -/* return PERF_COUNT_HW_MAX as AMD doesn't have fixed events */ -static unsigned amd_find_fixed_event(int idx) -{ - return PERF_COUNT_HW_MAX; -} - -/* check if a PMC is enabled by comparing it against global_ctrl bits. Because - * AMD CPU doesn't have global_ctrl MSR, all PMCs are enabled (return TRUE). - */ -static bool amd_pmc_is_enabled(struct kvm_pmc *pmc) -{ - return true; -} - -static struct kvm_pmc *amd_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx) -{ - unsigned int base = get_msr_base(pmu, PMU_TYPE_COUNTER); - struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu); - - if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) { - /* - * The idx is contiguous. The MSRs are not. The counter MSRs - * are interleaved with the event select MSRs. - */ - pmc_idx *= 2; - } - - return get_gp_pmc_amd(pmu, base + pmc_idx, PMU_TYPE_COUNTER); -} - -/* returns 0 if idx's corresponding MSR exists; otherwise returns 1. */ -static int amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx) -{ - struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - - idx &= ~(3u << 30); - - return (idx >= pmu->nr_arch_gp_counters); -} - -/* idx is the ECX register of RDPMC instruction */ -static struct kvm_pmc *amd_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu, - unsigned int idx, u64 *mask) -{ - struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - struct kvm_pmc *counters; - - idx &= ~(3u << 30); - if (idx >= pmu->nr_arch_gp_counters) - return NULL; - counters = pmu->gp_counters; - - return &counters[idx]; -} - -static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) -{ - /* All MSRs refer to exactly one PMC, so msr_idx_to_pmc is enough. */ - return false; -} - -static struct kvm_pmc *amd_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr) -{ - struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - struct kvm_pmc *pmc; - - pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER); - pmc = pmc ? pmc : get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL); - - return pmc; -} - -static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data) -{ - struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - struct kvm_pmc *pmc; - - /* MSR_PERFCTRn */ - pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER); - if (pmc) { - *data = pmc_read_counter(pmc); - return 0; - } - /* MSR_EVNTSELn */ - pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL); - if (pmc) { - *data = pmc->eventsel; - return 0; - } - - return 1; -} - -static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) -{ - struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - struct kvm_pmc *pmc; - u32 msr = msr_info->index; - u64 data = msr_info->data; - - /* MSR_PERFCTRn */ - pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER); - if (pmc) { - pmc->counter += data - pmc_read_counter(pmc); - return 0; - } - /* MSR_EVNTSELn */ - pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL); - if (pmc) { - if (data == pmc->eventsel) - return 0; - if (!(data & pmu->reserved_bits)) { - reprogram_gp_counter(pmc, data); - return 0; - } - } - - return 1; -} - -static void amd_pmu_refresh(struct kvm_vcpu *vcpu) -{ - struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - - if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) - pmu->nr_arch_gp_counters = AMD64_NUM_COUNTERS_CORE; - else - pmu->nr_arch_gp_counters = AMD64_NUM_COUNTERS; - - pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1; - pmu->reserved_bits = 0xffffffff00200000ull; - pmu->version = 1; - /* not applicable to AMD; but clean them to prevent any fall out */ - pmu->counter_bitmask[KVM_PMC_FIXED] = 0; - pmu->nr_arch_fixed_counters = 0; - pmu->global_status = 0; - bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters); -} - -static void amd_pmu_init(struct kvm_vcpu *vcpu) -{ - struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - int i; - - BUILD_BUG_ON(AMD64_NUM_COUNTERS_CORE > INTEL_PMC_MAX_GENERIC); - - for (i = 0; i < AMD64_NUM_COUNTERS_CORE ; i++) { - pmu->gp_counters[i].type = KVM_PMC_GP; - pmu->gp_counters[i].vcpu = vcpu; - pmu->gp_counters[i].idx = i; - pmu->gp_counters[i].current_config = 0; - } -} - -static void amd_pmu_reset(struct kvm_vcpu *vcpu) -{ - struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); - int i; - - for (i = 0; i < AMD64_NUM_COUNTERS_CORE; i++) { - struct kvm_pmc *pmc = &pmu->gp_counters[i]; - - pmc_stop_counter(pmc); - pmc->counter = pmc->eventsel = 0; - } -} - -struct kvm_pmu_ops amd_pmu_ops = { - .find_arch_event = amd_find_arch_event, - .find_fixed_event = amd_find_fixed_event, - .pmc_is_enabled = amd_pmc_is_enabled, - .pmc_idx_to_pmc = amd_pmc_idx_to_pmc, - .rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc, - .msr_idx_to_pmc = amd_msr_idx_to_pmc, - .is_valid_rdpmc_ecx = amd_is_valid_rdpmc_ecx, - .is_valid_msr = amd_is_valid_msr, - .get_msr = amd_pmu_get_msr, - .set_msr = amd_pmu_set_msr, - .refresh = amd_pmu_refresh, - .init = amd_pmu_init, - .reset = amd_pmu_reset, -}; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c deleted file mode 100644 index 851e9cc79930..000000000000 --- a/arch/x86/kvm/svm.c +++ /dev/null @@ -1,7514 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-only -/* - * Kernel-based Virtual Machine driver for Linux - * - * AMD SVM support - * - * Copyright (C) 2006 Qumranet, Inc. - * Copyright 2010 Red Hat, Inc. and/or its affiliates. - * - * Authors: - * Yaniv Kamay - * Avi Kivity - */ - -#define pr_fmt(fmt) "SVM: " fmt - -#include - -#include "irq.h" -#include "mmu.h" -#include "kvm_cache_regs.h" -#include "x86.h" -#include "cpuid.h" -#include "pmu.h" - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include "trace.h" - -#define __ex(x) __kvm_handle_fault_on_reboot(x) - -MODULE_AUTHOR("Qumranet"); -MODULE_LICENSE("GPL"); - -#ifdef MODULE -static const struct x86_cpu_id svm_cpu_id[] = { - X86_MATCH_FEATURE(X86_FEATURE_SVM, NULL), - {} -}; -MODULE_DEVICE_TABLE(x86cpu, svm_cpu_id); -#endif - -#define IOPM_ALLOC_ORDER 2 -#define MSRPM_ALLOC_ORDER 1 - -#define SEG_TYPE_LDT 2 -#define SEG_TYPE_BUSY_TSS16 3 - -#define SVM_FEATURE_LBRV (1 << 1) -#define SVM_FEATURE_SVML (1 << 2) -#define SVM_FEATURE_TSC_RATE (1 << 4) -#define SVM_FEATURE_VMCB_CLEAN (1 << 5) -#define SVM_FEATURE_FLUSH_ASID (1 << 6) -#define SVM_FEATURE_DECODE_ASSIST (1 << 7) -#define SVM_FEATURE_PAUSE_FILTER (1 << 10) - -#define SVM_AVIC_DOORBELL 0xc001011b - -#define NESTED_EXIT_HOST 0 /* Exit handled on host level */ -#define NESTED_EXIT_DONE 1 /* Exit caused nested vmexit */ -#define NESTED_EXIT_CONTINUE 2 /* Further checks needed */ - -#define DEBUGCTL_RESERVED_BITS (~(0x3fULL)) - -#define TSC_RATIO_RSVD 0xffffff0000000000ULL -#define TSC_RATIO_MIN 0x0000000000000001ULL -#define TSC_RATIO_MAX 0x000000ffffffffffULL - -#define AVIC_HPA_MASK ~((0xFFFULL << 52) | 0xFFF) - -/* - * 0xff is broadcast, so the max index allowed for physical APIC ID - * table is 0xfe. APIC IDs above 0xff are reserved. - */ -#define AVIC_MAX_PHYSICAL_ID_COUNT 255 - -#define AVIC_UNACCEL_ACCESS_WRITE_MASK 1 -#define AVIC_UNACCEL_ACCESS_OFFSET_MASK 0xFF0 -#define AVIC_UNACCEL_ACCESS_VECTOR_MASK 0xFFFFFFFF - -/* AVIC GATAG is encoded using VM and VCPU IDs */ -#define AVIC_VCPU_ID_BITS 8 -#define AVIC_VCPU_ID_MASK ((1 << AVIC_VCPU_ID_BITS) - 1) - -#define AVIC_VM_ID_BITS 24 -#define AVIC_VM_ID_NR (1 << AVIC_VM_ID_BITS) -#define AVIC_VM_ID_MASK ((1 << AVIC_VM_ID_BITS) - 1) - -#define AVIC_GATAG(x, y) (((x & AVIC_VM_ID_MASK) << AVIC_VCPU_ID_BITS) | \ - (y & AVIC_VCPU_ID_MASK)) -#define AVIC_GATAG_TO_VMID(x) ((x >> AVIC_VCPU_ID_BITS) & AVIC_VM_ID_MASK) -#define AVIC_GATAG_TO_VCPUID(x) (x & AVIC_VCPU_ID_MASK) - -static bool erratum_383_found __read_mostly; - -static const u32 host_save_user_msrs[] = { -#ifdef CONFIG_X86_64 - MSR_STAR, MSR_LSTAR, MSR_CSTAR, MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE, - MSR_FS_BASE, -#endif - MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, - MSR_TSC_AUX, -}; - -#define NR_HOST_SAVE_USER_MSRS ARRAY_SIZE(host_save_user_msrs) - -struct kvm_sev_info { - bool active; /* SEV enabled guest */ - unsigned int asid; /* ASID used for this guest */ - unsigned int handle; /* SEV firmware handle */ - int fd; /* SEV device fd */ - unsigned long pages_locked; /* Number of pages locked */ - struct list_head regions_list; /* List of registered regions */ -}; - -struct kvm_svm { - struct kvm kvm; - - /* Struct members for AVIC */ - u32 avic_vm_id; - struct page *avic_logical_id_table_page; - struct page *avic_physical_id_table_page; - struct hlist_node hnode; - - struct kvm_sev_info sev_info; -}; - -struct kvm_vcpu; - -struct nested_state { - struct vmcb *hsave; - u64 hsave_msr; - u64 vm_cr_msr; - u64 vmcb; - - /* These are the merged vectors */ - u32 *msrpm; - - /* gpa pointers to the real vectors */ - u64 vmcb_msrpm; - u64 vmcb_iopm; - - /* A VMEXIT is required but not yet emulated */ - bool exit_required; - - /* cache for intercepts of the guest */ - u32 intercept_cr; - u32 intercept_dr; - u32 intercept_exceptions; - u64 intercept; - - /* Nested Paging related state */ - u64 nested_cr3; -}; - -#define MSRPM_OFFSETS 16 -static u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly; - -/* - * Set osvw_len to higher value when updated Revision Guides - * are published and we know what the new status bits are - */ -static uint64_t osvw_len = 4, osvw_status; - -struct vcpu_svm { - struct kvm_vcpu vcpu; - struct vmcb *vmcb; - unsigned long vmcb_pa; - struct svm_cpu_data *svm_data; - uint64_t asid_generation; - uint64_t sysenter_esp; - uint64_t sysenter_eip; - uint64_t tsc_aux; - - u64 msr_decfg; - - u64 next_rip; - - u64 host_user_msrs[NR_HOST_SAVE_USER_MSRS]; - struct { - u16 fs; - u16 gs; - u16 ldt; - u64 gs_base; - } host; - - u64 spec_ctrl; - /* - * Contains guest-controlled bits of VIRT_SPEC_CTRL, which will be - * translated into the appropriate L2_CFG bits on the host to - * perform speculative control. - */ - u64 virt_spec_ctrl; - - u32 *msrpm; - - ulong nmi_iret_rip; - - struct nested_state nested; - - bool nmi_singlestep; - u64 nmi_singlestep_guest_rflags; - - unsigned int3_injected; - unsigned long int3_rip; - - /* cached guest cpuid flags for faster access */ - bool nrips_enabled : 1; - - u32 ldr_reg; - u32 dfr_reg; - struct page *avic_backing_page; - u64 *avic_physical_id_cache; - bool avic_is_running; - - /* - * Per-vcpu list of struct amd_svm_iommu_ir: - * This is used mainly to store interrupt remapping information used - * when update the vcpu affinity. This avoids the need to scan for - * IRTE and try to match ga_tag in the IOMMU driver. - */ - struct list_head ir_list; - spinlock_t ir_list_lock; - - /* which host CPU was used for running this vcpu */ - unsigned int last_cpu; -}; - -/* - * This is a wrapper of struct amd_iommu_ir_data. - */ -struct amd_svm_iommu_ir { - struct list_head node; /* Used by SVM for per-vcpu ir_list */ - void *data; /* Storing pointer to struct amd_ir_data */ -}; - -#define AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK (0xFF) -#define AVIC_LOGICAL_ID_ENTRY_VALID_BIT 31 -#define AVIC_LOGICAL_ID_ENTRY_VALID_MASK (1 << 31) - -#define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK (0xFFULL) -#define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK (0xFFFFFFFFFFULL << 12) -#define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK (1ULL << 62) -#define AVIC_PHYSICAL_ID_ENTRY_VALID_MASK (1ULL << 63) - -static DEFINE_PER_CPU(u64, current_tsc_ratio); -#define TSC_RATIO_DEFAULT 0x0100000000ULL - -#define MSR_INVALID 0xffffffffU - -static const struct svm_direct_access_msrs { - u32 index; /* Index of the MSR */ - bool always; /* True if intercept is always on */ -} direct_access_msrs[] = { - { .index = MSR_STAR, .always = true }, - { .index = MSR_IA32_SYSENTER_CS, .always = true }, -#ifdef CONFIG_X86_64 - { .index = MSR_GS_BASE, .always = true }, - { .index = MSR_FS_BASE, .always = true }, - { .index = MSR_KERNEL_GS_BASE, .always = true }, - { .index = MSR_LSTAR, .always = true }, - { .index = MSR_CSTAR, .always = true }, - { .index = MSR_SYSCALL_MASK, .always = true }, -#endif - { .index = MSR_IA32_SPEC_CTRL, .always = false }, - { .index = MSR_IA32_PRED_CMD, .always = false }, - { .index = MSR_IA32_LASTBRANCHFROMIP, .always = false }, - { .index = MSR_IA32_LASTBRANCHTOIP, .always = false }, - { .index = MSR_IA32_LASTINTFROMIP, .always = false }, - { .index = MSR_IA32_LASTINTTOIP, .always = false }, - { .index = MSR_INVALID, .always = false }, -}; - -/* enable NPT for AMD64 and X86 with PAE */ -#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) -static bool npt_enabled = true; -#else -static bool npt_enabled; -#endif - -/* - * These 2 parameters are used to config the controls for Pause-Loop Exiting: - * pause_filter_count: On processors that support Pause filtering(indicated - * by CPUID Fn8000_000A_EDX), the VMCB provides a 16 bit pause filter - * count value. On VMRUN this value is loaded into an internal counter. - * Each time a pause instruction is executed, this counter is decremented - * until it reaches zero at which time a #VMEXIT is generated if pause - * intercept is enabled. Refer to AMD APM Vol 2 Section 15.14.4 Pause - * Intercept Filtering for more details. - * This also indicate if ple logic enabled. - * - * pause_filter_thresh: In addition, some processor families support advanced - * pause filtering (indicated by CPUID Fn8000_000A_EDX) upper bound on - * the amount of time a guest is allowed to execute in a pause loop. - * In this mode, a 16-bit pause filter threshold field is added in the - * VMCB. The threshold value is a cycle count that is used to reset the - * pause counter. As with simple pause filtering, VMRUN loads the pause - * count value from VMCB into an internal counter. Then, on each pause - * instruction the hardware checks the elapsed number of cycles since - * the most recent pause instruction against the pause filter threshold. - * If the elapsed cycle count is greater than the pause filter threshold, - * then the internal pause count is reloaded from the VMCB and execution - * continues. If the elapsed cycle count is less than the pause filter - * threshold, then the internal pause count is decremented. If the count - * value is less than zero and PAUSE intercept is enabled, a #VMEXIT is - * triggered. If advanced pause filtering is supported and pause filter - * threshold field is set to zero, the filter will operate in the simpler, - * count only mode. - */ - -static unsigned short pause_filter_thresh = KVM_DEFAULT_PLE_GAP; -module_param(pause_filter_thresh, ushort, 0444); - -static unsigned short pause_filter_count = KVM_SVM_DEFAULT_PLE_WINDOW; -module_param(pause_filter_count, ushort, 0444); - -/* Default doubles per-vcpu window every exit. */ -static unsigned short pause_filter_count_grow = KVM_DEFAULT_PLE_WINDOW_GROW; -module_param(pause_filter_count_grow, ushort, 0444); - -/* Default resets per-vcpu window every exit to pause_filter_count. */ -static unsigned short pause_filter_count_shrink = KVM_DEFAULT_PLE_WINDOW_SHRINK; -module_param(pause_filter_count_shrink, ushort, 0444); - -/* Default is to compute the maximum so we can never overflow. */ -static unsigned short pause_filter_count_max = KVM_SVM_DEFAULT_PLE_WINDOW_MAX; -module_param(pause_filter_count_max, ushort, 0444); - -/* allow nested paging (virtualized MMU) for all guests */ -static int npt = true; -module_param(npt, int, S_IRUGO); - -/* allow nested virtualization in KVM/SVM */ -static int nested = true; -module_param(nested, int, S_IRUGO); - -/* enable / disable AVIC */ -static int avic; -#ifdef CONFIG_X86_LOCAL_APIC -module_param(avic, int, S_IRUGO); -#endif - -/* enable/disable Next RIP Save */ -static int nrips = true; -module_param(nrips, int, 0444); - -/* enable/disable Virtual VMLOAD VMSAVE */ -static int vls = true; -module_param(vls, int, 0444); - -/* enable/disable Virtual GIF */ -static int vgif = true; -module_param(vgif, int, 0444); - -/* enable/disable SEV support */ -static int sev = IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT); -module_param(sev, int, 0444); - -static bool __read_mostly dump_invalid_vmcb = 0; -module_param(dump_invalid_vmcb, bool, 0644); - -static u8 rsm_ins_bytes[] = "\x0f\xaa"; - -static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0); -static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa); -static void svm_complete_interrupts(struct vcpu_svm *svm); -static void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate); -static inline void avic_post_state_restore(struct kvm_vcpu *vcpu); - -static int nested_svm_exit_handled(struct vcpu_svm *svm); -static int nested_svm_intercept(struct vcpu_svm *svm); -static int nested_svm_vmexit(struct vcpu_svm *svm); -static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, - bool has_error_code, u32 error_code); - -enum { - VMCB_INTERCEPTS, /* Intercept vectors, TSC offset, - pause filter count */ - VMCB_PERM_MAP, /* IOPM Base and MSRPM Base */ - VMCB_ASID, /* ASID */ - VMCB_INTR, /* int_ctl, int_vector */ - VMCB_NPT, /* npt_en, nCR3, gPAT */ - VMCB_CR, /* CR0, CR3, CR4, EFER */ - VMCB_DR, /* DR6, DR7 */ - VMCB_DT, /* GDT, IDT */ - VMCB_SEG, /* CS, DS, SS, ES, CPL */ - VMCB_CR2, /* CR2 only */ - VMCB_LBR, /* DBGCTL, BR_FROM, BR_TO, LAST_EX_FROM, LAST_EX_TO */ - VMCB_AVIC, /* AVIC APIC_BAR, AVIC APIC_BACKING_PAGE, - * AVIC PHYSICAL_TABLE pointer, - * AVIC LOGICAL_TABLE pointer - */ - VMCB_DIRTY_MAX, -}; - -/* TPR and CR2 are always written before VMRUN */ -#define VMCB_ALWAYS_DIRTY_MASK ((1U << VMCB_INTR) | (1U << VMCB_CR2)) - -#define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL - -static int sev_flush_asids(void); -static DECLARE_RWSEM(sev_deactivate_lock); -static DEFINE_MUTEX(sev_bitmap_lock); -static unsigned int max_sev_asid; -static unsigned int min_sev_asid; -static unsigned long *sev_asid_bitmap; -static unsigned long *sev_reclaim_asid_bitmap; -#define __sme_page_pa(x) __sme_set(page_to_pfn(x) << PAGE_SHIFT) - -struct enc_region { - struct list_head list; - unsigned long npages; - struct page **pages; - unsigned long uaddr; - unsigned long size; -}; - - -static inline struct kvm_svm *to_kvm_svm(struct kvm *kvm) -{ - return container_of(kvm, struct kvm_svm, kvm); -} - -static inline bool svm_sev_enabled(void) -{ - return IS_ENABLED(CONFIG_KVM_AMD_SEV) ? max_sev_asid : 0; -} - -static inline bool sev_guest(struct kvm *kvm) -{ -#ifdef CONFIG_KVM_AMD_SEV - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - - return sev->active; -#else - return false; -#endif -} - -static inline int sev_get_asid(struct kvm *kvm) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - - return sev->asid; -} - -static inline void mark_all_dirty(struct vmcb *vmcb) -{ - vmcb->control.clean = 0; -} - -static inline void mark_all_clean(struct vmcb *vmcb) -{ - vmcb->control.clean = ((1 << VMCB_DIRTY_MAX) - 1) - & ~VMCB_ALWAYS_DIRTY_MASK; -} - -static inline void mark_dirty(struct vmcb *vmcb, int bit) -{ - vmcb->control.clean &= ~(1 << bit); -} - -static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu) -{ - return container_of(vcpu, struct vcpu_svm, vcpu); -} - -static inline void avic_update_vapic_bar(struct vcpu_svm *svm, u64 data) -{ - svm->vmcb->control.avic_vapic_bar = data & VMCB_AVIC_APIC_BAR_MASK; - mark_dirty(svm->vmcb, VMCB_AVIC); -} - -static inline bool avic_vcpu_is_running(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - u64 *entry = svm->avic_physical_id_cache; - - if (!entry) - return false; - - return (READ_ONCE(*entry) & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK); -} - -static void recalc_intercepts(struct vcpu_svm *svm) -{ - struct vmcb_control_area *c, *h; - struct nested_state *g; - - mark_dirty(svm->vmcb, VMCB_INTERCEPTS); - - if (!is_guest_mode(&svm->vcpu)) - return; - - c = &svm->vmcb->control; - h = &svm->nested.hsave->control; - g = &svm->nested; - - c->intercept_cr = h->intercept_cr; - c->intercept_dr = h->intercept_dr; - c->intercept_exceptions = h->intercept_exceptions; - c->intercept = h->intercept; - - if (svm->vcpu.arch.hflags & HF_VINTR_MASK) { - /* We only want the cr8 intercept bits of L1 */ - c->intercept_cr &= ~(1U << INTERCEPT_CR8_READ); - c->intercept_cr &= ~(1U << INTERCEPT_CR8_WRITE); - - /* - * Once running L2 with HF_VINTR_MASK, EFLAGS.IF does not - * affect any interrupt we may want to inject; therefore, - * interrupt window vmexits are irrelevant to L0. - */ - c->intercept &= ~(1ULL << INTERCEPT_VINTR); - } - - /* We don't want to see VMMCALLs from a nested guest */ - c->intercept &= ~(1ULL << INTERCEPT_VMMCALL); - - c->intercept_cr |= g->intercept_cr; - c->intercept_dr |= g->intercept_dr; - c->intercept_exceptions |= g->intercept_exceptions; - c->intercept |= g->intercept; -} - -static inline struct vmcb *get_host_vmcb(struct vcpu_svm *svm) -{ - if (is_guest_mode(&svm->vcpu)) - return svm->nested.hsave; - else - return svm->vmcb; -} - -static inline void set_cr_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept_cr |= (1U << bit); - - recalc_intercepts(svm); -} - -static inline void clr_cr_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept_cr &= ~(1U << bit); - - recalc_intercepts(svm); -} - -static inline bool is_cr_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - return vmcb->control.intercept_cr & (1U << bit); -} - -static inline void set_dr_intercepts(struct vcpu_svm *svm) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept_dr = (1 << INTERCEPT_DR0_READ) - | (1 << INTERCEPT_DR1_READ) - | (1 << INTERCEPT_DR2_READ) - | (1 << INTERCEPT_DR3_READ) - | (1 << INTERCEPT_DR4_READ) - | (1 << INTERCEPT_DR5_READ) - | (1 << INTERCEPT_DR6_READ) - | (1 << INTERCEPT_DR7_READ) - | (1 << INTERCEPT_DR0_WRITE) - | (1 << INTERCEPT_DR1_WRITE) - | (1 << INTERCEPT_DR2_WRITE) - | (1 << INTERCEPT_DR3_WRITE) - | (1 << INTERCEPT_DR4_WRITE) - | (1 << INTERCEPT_DR5_WRITE) - | (1 << INTERCEPT_DR6_WRITE) - | (1 << INTERCEPT_DR7_WRITE); - - recalc_intercepts(svm); -} - -static inline void clr_dr_intercepts(struct vcpu_svm *svm) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept_dr = 0; - - recalc_intercepts(svm); -} - -static inline void set_exception_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept_exceptions |= (1U << bit); - - recalc_intercepts(svm); -} - -static inline void clr_exception_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept_exceptions &= ~(1U << bit); - - recalc_intercepts(svm); -} - -static inline void set_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept |= (1ULL << bit); - - recalc_intercepts(svm); -} - -static inline void clr_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept &= ~(1ULL << bit); - - recalc_intercepts(svm); -} - -static inline bool is_intercept(struct vcpu_svm *svm, int bit) -{ - return (svm->vmcb->control.intercept & (1ULL << bit)) != 0; -} - -static inline bool vgif_enabled(struct vcpu_svm *svm) -{ - return !!(svm->vmcb->control.int_ctl & V_GIF_ENABLE_MASK); -} - -static inline void enable_gif(struct vcpu_svm *svm) -{ - if (vgif_enabled(svm)) - svm->vmcb->control.int_ctl |= V_GIF_MASK; - else - svm->vcpu.arch.hflags |= HF_GIF_MASK; -} - -static inline void disable_gif(struct vcpu_svm *svm) -{ - if (vgif_enabled(svm)) - svm->vmcb->control.int_ctl &= ~V_GIF_MASK; - else - svm->vcpu.arch.hflags &= ~HF_GIF_MASK; -} - -static inline bool gif_set(struct vcpu_svm *svm) -{ - if (vgif_enabled(svm)) - return !!(svm->vmcb->control.int_ctl & V_GIF_MASK); - else - return !!(svm->vcpu.arch.hflags & HF_GIF_MASK); -} - -static unsigned long iopm_base; - -struct kvm_ldttss_desc { - u16 limit0; - u16 base0; - unsigned base1:8, type:5, dpl:2, p:1; - unsigned limit1:4, zero0:3, g:1, base2:8; - u32 base3; - u32 zero1; -} __attribute__((packed)); - -struct svm_cpu_data { - int cpu; - - u64 asid_generation; - u32 max_asid; - u32 next_asid; - u32 min_asid; - struct kvm_ldttss_desc *tss_desc; - - struct page *save_area; - struct vmcb *current_vmcb; - - /* index = sev_asid, value = vmcb pointer */ - struct vmcb **sev_vmcbs; -}; - -static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data); - -static const u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000}; - -#define NUM_MSR_MAPS ARRAY_SIZE(msrpm_ranges) -#define MSRS_RANGE_SIZE 2048 -#define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2) - -static u32 svm_msrpm_offset(u32 msr) -{ - u32 offset; - int i; - - for (i = 0; i < NUM_MSR_MAPS; i++) { - if (msr < msrpm_ranges[i] || - msr >= msrpm_ranges[i] + MSRS_IN_RANGE) - continue; - - offset = (msr - msrpm_ranges[i]) / 4; /* 4 msrs per u8 */ - offset += (i * MSRS_RANGE_SIZE); /* add range offset */ - - /* Now we have the u8 offset - but need the u32 offset */ - return offset / 4; - } - - /* MSR not in any range */ - return MSR_INVALID; -} - -#define MAX_INST_SIZE 15 - -static inline void clgi(void) -{ - asm volatile (__ex("clgi")); -} - -static inline void stgi(void) -{ - asm volatile (__ex("stgi")); -} - -static inline void invlpga(unsigned long addr, u32 asid) -{ - asm volatile (__ex("invlpga %1, %0") : : "c"(asid), "a"(addr)); -} - -static int get_npt_level(struct kvm_vcpu *vcpu) -{ -#ifdef CONFIG_X86_64 - return PT64_ROOT_4LEVEL; -#else - return PT32E_ROOT_LEVEL; -#endif -} - -static void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer) -{ - vcpu->arch.efer = efer; - - if (!npt_enabled) { - /* Shadow paging assumes NX to be available. */ - efer |= EFER_NX; - - if (!(efer & EFER_LMA)) - efer &= ~EFER_LME; - } - - to_svm(vcpu)->vmcb->save.efer = efer | EFER_SVME; - mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR); -} - -static int is_external_interrupt(u32 info) -{ - info &= SVM_EVTINJ_TYPE_MASK | SVM_EVTINJ_VALID; - return info == (SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR); -} - -static u32 svm_get_interrupt_shadow(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - u32 ret = 0; - - if (svm->vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) - ret = KVM_X86_SHADOW_INT_STI | KVM_X86_SHADOW_INT_MOV_SS; - return ret; -} - -static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - if (mask == 0) - svm->vmcb->control.int_state &= ~SVM_INTERRUPT_SHADOW_MASK; - else - svm->vmcb->control.int_state |= SVM_INTERRUPT_SHADOW_MASK; - -} - -static int skip_emulated_instruction(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - if (nrips && svm->vmcb->control.next_rip != 0) { - WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS)); - svm->next_rip = svm->vmcb->control.next_rip; - } - - if (!svm->next_rip) { - if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP)) - return 0; - } else { - if (svm->next_rip - kvm_rip_read(vcpu) > MAX_INST_SIZE) - pr_err("%s: ip 0x%lx next 0x%llx\n", - __func__, kvm_rip_read(vcpu), svm->next_rip); - kvm_rip_write(vcpu, svm->next_rip); - } - svm_set_interrupt_shadow(vcpu, 0); - - return 1; -} - -static void svm_queue_exception(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - unsigned nr = vcpu->arch.exception.nr; - bool has_error_code = vcpu->arch.exception.has_error_code; - bool reinject = vcpu->arch.exception.injected; - u32 error_code = vcpu->arch.exception.error_code; - - /* - * If we are within a nested VM we'd better #VMEXIT and let the guest - * handle the exception - */ - if (!reinject && - nested_svm_check_exception(svm, nr, has_error_code, error_code)) - return; - - kvm_deliver_exception_payload(&svm->vcpu); - - if (nr == BP_VECTOR && !nrips) { - unsigned long rip, old_rip = kvm_rip_read(&svm->vcpu); - - /* - * For guest debugging where we have to reinject #BP if some - * INT3 is guest-owned: - * Emulate nRIP by moving RIP forward. Will fail if injection - * raises a fault that is not intercepted. Still better than - * failing in all cases. - */ - (void)skip_emulated_instruction(&svm->vcpu); - rip = kvm_rip_read(&svm->vcpu); - svm->int3_rip = rip + svm->vmcb->save.cs.base; - svm->int3_injected = rip - old_rip; - } - - svm->vmcb->control.event_inj = nr - | SVM_EVTINJ_VALID - | (has_error_code ? SVM_EVTINJ_VALID_ERR : 0) - | SVM_EVTINJ_TYPE_EXEPT; - svm->vmcb->control.event_inj_err = error_code; -} - -static void svm_init_erratum_383(void) -{ - u32 low, high; - int err; - u64 val; - - if (!static_cpu_has_bug(X86_BUG_AMD_TLB_MMATCH)) - return; - - /* Use _safe variants to not break nested virtualization */ - val = native_read_msr_safe(MSR_AMD64_DC_CFG, &err); - if (err) - return; - - val |= (1ULL << 47); - - low = lower_32_bits(val); - high = upper_32_bits(val); - - native_write_msr_safe(MSR_AMD64_DC_CFG, low, high); - - erratum_383_found = true; -} - -static void svm_init_osvw(struct kvm_vcpu *vcpu) -{ - /* - * Guests should see errata 400 and 415 as fixed (assuming that - * HLT and IO instructions are intercepted). - */ - vcpu->arch.osvw.length = (osvw_len >= 3) ? (osvw_len) : 3; - vcpu->arch.osvw.status = osvw_status & ~(6ULL); - - /* - * By increasing VCPU's osvw.length to 3 we are telling the guest that - * all osvw.status bits inside that length, including bit 0 (which is - * reserved for erratum 298), are valid. However, if host processor's - * osvw_len is 0 then osvw_status[0] carries no information. We need to - * be conservative here and therefore we tell the guest that erratum 298 - * is present (because we really don't know). - */ - if (osvw_len == 0 && boot_cpu_data.x86 == 0x10) - vcpu->arch.osvw.status |= 1; -} - -static int has_svm(void) -{ - const char *msg; - - if (!cpu_has_svm(&msg)) { - printk(KERN_INFO "has_svm: %s\n", msg); - return 0; - } - - return 1; -} - -static void svm_hardware_disable(void) -{ - /* Make sure we clean up behind us */ - if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) - wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT); - - cpu_svm_disable(); - - amd_pmu_disable_virt(); -} - -static int svm_hardware_enable(void) -{ - - struct svm_cpu_data *sd; - uint64_t efer; - struct desc_struct *gdt; - int me = raw_smp_processor_id(); - - rdmsrl(MSR_EFER, efer); - if (efer & EFER_SVME) - return -EBUSY; - - if (!has_svm()) { - pr_err("%s: err EOPNOTSUPP on %d\n", __func__, me); - return -EINVAL; - } - sd = per_cpu(svm_data, me); - if (!sd) { - pr_err("%s: svm_data is NULL on %d\n", __func__, me); - return -EINVAL; - } - - sd->asid_generation = 1; - sd->max_asid = cpuid_ebx(SVM_CPUID_FUNC) - 1; - sd->next_asid = sd->max_asid + 1; - sd->min_asid = max_sev_asid + 1; - - gdt = get_current_gdt_rw(); - sd->tss_desc = (struct kvm_ldttss_desc *)(gdt + GDT_ENTRY_TSS); - - wrmsrl(MSR_EFER, efer | EFER_SVME); - - wrmsrl(MSR_VM_HSAVE_PA, page_to_pfn(sd->save_area) << PAGE_SHIFT); - - if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) { - wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT); - __this_cpu_write(current_tsc_ratio, TSC_RATIO_DEFAULT); - } - - - /* - * Get OSVW bits. - * - * Note that it is possible to have a system with mixed processor - * revisions and therefore different OSVW bits. If bits are not the same - * on different processors then choose the worst case (i.e. if erratum - * is present on one processor and not on another then assume that the - * erratum is present everywhere). - */ - if (cpu_has(&boot_cpu_data, X86_FEATURE_OSVW)) { - uint64_t len, status = 0; - int err; - - len = native_read_msr_safe(MSR_AMD64_OSVW_ID_LENGTH, &err); - if (!err) - status = native_read_msr_safe(MSR_AMD64_OSVW_STATUS, - &err); - - if (err) - osvw_status = osvw_len = 0; - else { - if (len < osvw_len) - osvw_len = len; - osvw_status |= status; - osvw_status &= (1ULL << osvw_len) - 1; - } - } else - osvw_status = osvw_len = 0; - - svm_init_erratum_383(); - - amd_pmu_enable_virt(); - - return 0; -} - -static void svm_cpu_uninit(int cpu) -{ - struct svm_cpu_data *sd = per_cpu(svm_data, raw_smp_processor_id()); - - if (!sd) - return; - - per_cpu(svm_data, raw_smp_processor_id()) = NULL; - kfree(sd->sev_vmcbs); - __free_page(sd->save_area); - kfree(sd); -} - -static int svm_cpu_init(int cpu) -{ - struct svm_cpu_data *sd; - - sd = kzalloc(sizeof(struct svm_cpu_data), GFP_KERNEL); - if (!sd) - return -ENOMEM; - sd->cpu = cpu; - sd->save_area = alloc_page(GFP_KERNEL); - if (!sd->save_area) - goto free_cpu_data; - - if (svm_sev_enabled()) { - sd->sev_vmcbs = kmalloc_array(max_sev_asid + 1, - sizeof(void *), - GFP_KERNEL); - if (!sd->sev_vmcbs) - goto free_save_area; - } - - per_cpu(svm_data, cpu) = sd; - - return 0; - -free_save_area: - __free_page(sd->save_area); -free_cpu_data: - kfree(sd); - return -ENOMEM; - -} - -static bool valid_msr_intercept(u32 index) -{ - int i; - - for (i = 0; direct_access_msrs[i].index != MSR_INVALID; i++) - if (direct_access_msrs[i].index == index) - return true; - - return false; -} - -static bool msr_write_intercepted(struct kvm_vcpu *vcpu, unsigned msr) -{ - u8 bit_write; - unsigned long tmp; - u32 offset; - u32 *msrpm; - - msrpm = is_guest_mode(vcpu) ? to_svm(vcpu)->nested.msrpm: - to_svm(vcpu)->msrpm; - - offset = svm_msrpm_offset(msr); - bit_write = 2 * (msr & 0x0f) + 1; - tmp = msrpm[offset]; - - BUG_ON(offset == MSR_INVALID); - - return !!test_bit(bit_write, &tmp); -} - -static void set_msr_interception(u32 *msrpm, unsigned msr, - int read, int write) -{ - u8 bit_read, bit_write; - unsigned long tmp; - u32 offset; - - /* - * If this warning triggers extend the direct_access_msrs list at the - * beginning of the file - */ - WARN_ON(!valid_msr_intercept(msr)); - - offset = svm_msrpm_offset(msr); - bit_read = 2 * (msr & 0x0f); - bit_write = 2 * (msr & 0x0f) + 1; - tmp = msrpm[offset]; - - BUG_ON(offset == MSR_INVALID); - - read ? clear_bit(bit_read, &tmp) : set_bit(bit_read, &tmp); - write ? clear_bit(bit_write, &tmp) : set_bit(bit_write, &tmp); - - msrpm[offset] = tmp; -} - -static void svm_vcpu_init_msrpm(u32 *msrpm) -{ - int i; - - memset(msrpm, 0xff, PAGE_SIZE * (1 << MSRPM_ALLOC_ORDER)); - - for (i = 0; direct_access_msrs[i].index != MSR_INVALID; i++) { - if (!direct_access_msrs[i].always) - continue; - - set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1); - } -} - -static void add_msr_offset(u32 offset) -{ - int i; - - for (i = 0; i < MSRPM_OFFSETS; ++i) { - - /* Offset already in list? */ - if (msrpm_offsets[i] == offset) - return; - - /* Slot used by another offset? */ - if (msrpm_offsets[i] != MSR_INVALID) - continue; - - /* Add offset to list */ - msrpm_offsets[i] = offset; - - return; - } - - /* - * If this BUG triggers the msrpm_offsets table has an overflow. Just - * increase MSRPM_OFFSETS in this case. - */ - BUG(); -} - -static void init_msrpm_offsets(void) -{ - int i; - - memset(msrpm_offsets, 0xff, sizeof(msrpm_offsets)); - - for (i = 0; direct_access_msrs[i].index != MSR_INVALID; i++) { - u32 offset; - - offset = svm_msrpm_offset(direct_access_msrs[i].index); - BUG_ON(offset == MSR_INVALID); - - add_msr_offset(offset); - } -} - -static void svm_enable_lbrv(struct vcpu_svm *svm) -{ - u32 *msrpm = svm->msrpm; - - svm->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK; - set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 1, 1); - set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1); - set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 1, 1); - set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 1, 1); -} - -static void svm_disable_lbrv(struct vcpu_svm *svm) -{ - u32 *msrpm = svm->msrpm; - - svm->vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK; - set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 0, 0); - set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0); - set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 0, 0); - set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 0, 0); -} - -static void disable_nmi_singlestep(struct vcpu_svm *svm) -{ - svm->nmi_singlestep = false; - - if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP)) { - /* Clear our flags if they were not set by the guest */ - if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF)) - svm->vmcb->save.rflags &= ~X86_EFLAGS_TF; - if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_RF)) - svm->vmcb->save.rflags &= ~X86_EFLAGS_RF; - } -} - -/* Note: - * This hash table is used to map VM_ID to a struct kvm_svm, - * when handling AMD IOMMU GALOG notification to schedule in - * a particular vCPU. - */ -#define SVM_VM_DATA_HASH_BITS 8 -static DEFINE_HASHTABLE(svm_vm_data_hash, SVM_VM_DATA_HASH_BITS); -static u32 next_vm_id = 0; -static bool next_vm_id_wrapped = 0; -static DEFINE_SPINLOCK(svm_vm_data_hash_lock); - -/* Note: - * This function is called from IOMMU driver to notify - * SVM to schedule in a particular vCPU of a particular VM. - */ -static int avic_ga_log_notifier(u32 ga_tag) -{ - unsigned long flags; - struct kvm_svm *kvm_svm; - struct kvm_vcpu *vcpu = NULL; - u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag); - u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag); - - pr_debug("SVM: %s: vm_id=%#x, vcpu_id=%#x\n", __func__, vm_id, vcpu_id); - trace_kvm_avic_ga_log(vm_id, vcpu_id); - - spin_lock_irqsave(&svm_vm_data_hash_lock, flags); - hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) { - if (kvm_svm->avic_vm_id != vm_id) - continue; - vcpu = kvm_get_vcpu_by_id(&kvm_svm->kvm, vcpu_id); - break; - } - spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags); - - /* Note: - * At this point, the IOMMU should have already set the pending - * bit in the vAPIC backing page. So, we just need to schedule - * in the vcpu. - */ - if (vcpu) - kvm_vcpu_wake_up(vcpu); - - return 0; -} - -static __init int sev_hardware_setup(void) -{ - struct sev_user_data_status *status; - int rc; - - /* Maximum number of encrypted guests supported simultaneously */ - max_sev_asid = cpuid_ecx(0x8000001F); - - if (!max_sev_asid) - return 1; - - /* Minimum ASID value that should be used for SEV guest */ - min_sev_asid = cpuid_edx(0x8000001F); - - /* Initialize SEV ASID bitmaps */ - sev_asid_bitmap = bitmap_zalloc(max_sev_asid, GFP_KERNEL); - if (!sev_asid_bitmap) - return 1; - - sev_reclaim_asid_bitmap = bitmap_zalloc(max_sev_asid, GFP_KERNEL); - if (!sev_reclaim_asid_bitmap) - return 1; - - status = kmalloc(sizeof(*status), GFP_KERNEL); - if (!status) - return 1; - - /* - * Check SEV platform status. - * - * PLATFORM_STATUS can be called in any state, if we failed to query - * the PLATFORM status then either PSP firmware does not support SEV - * feature or SEV firmware is dead. - */ - rc = sev_platform_status(status, NULL); - if (rc) - goto err; - - pr_info("SEV supported\n"); - -err: - kfree(status); - return rc; -} - -static void grow_ple_window(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb_control_area *control = &svm->vmcb->control; - int old = control->pause_filter_count; - - control->pause_filter_count = __grow_ple_window(old, - pause_filter_count, - pause_filter_count_grow, - pause_filter_count_max); - - if (control->pause_filter_count != old) { - mark_dirty(svm->vmcb, VMCB_INTERCEPTS); - trace_kvm_ple_window_update(vcpu->vcpu_id, - control->pause_filter_count, old); - } -} - -static void shrink_ple_window(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb_control_area *control = &svm->vmcb->control; - int old = control->pause_filter_count; - - control->pause_filter_count = - __shrink_ple_window(old, - pause_filter_count, - pause_filter_count_shrink, - pause_filter_count); - if (control->pause_filter_count != old) { - mark_dirty(svm->vmcb, VMCB_INTERCEPTS); - trace_kvm_ple_window_update(vcpu->vcpu_id, - control->pause_filter_count, old); - } -} - -/* - * The default MMIO mask is a single bit (excluding the present bit), - * which could conflict with the memory encryption bit. Check for - * memory encryption support and override the default MMIO mask if - * memory encryption is enabled. - */ -static __init void svm_adjust_mmio_mask(void) -{ - unsigned int enc_bit, mask_bit; - u64 msr, mask; - - /* If there is no memory encryption support, use existing mask */ - if (cpuid_eax(0x80000000) < 0x8000001f) - return; - - /* If memory encryption is not enabled, use existing mask */ - rdmsrl(MSR_K8_SYSCFG, msr); - if (!(msr & MSR_K8_SYSCFG_MEM_ENCRYPT)) - return; - - enc_bit = cpuid_ebx(0x8000001f) & 0x3f; - mask_bit = boot_cpu_data.x86_phys_bits; - - /* Increment the mask bit if it is the same as the encryption bit */ - if (enc_bit == mask_bit) - mask_bit++; - - /* - * If the mask bit location is below 52, then some bits above the - * physical addressing limit will always be reserved, so use the - * rsvd_bits() function to generate the mask. This mask, along with - * the present bit, will be used to generate a page fault with - * PFER.RSV = 1. - * - * If the mask bit location is 52 (or above), then clear the mask. - */ - mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0; - - kvm_mmu_set_mmio_spte_mask(mask, mask, PT_WRITABLE_MASK | PT_USER_MASK); -} - -static void svm_hardware_teardown(void) -{ - int cpu; - - if (svm_sev_enabled()) { - bitmap_free(sev_asid_bitmap); - bitmap_free(sev_reclaim_asid_bitmap); - - sev_flush_asids(); - } - - for_each_possible_cpu(cpu) - svm_cpu_uninit(cpu); - - __free_pages(pfn_to_page(iopm_base >> PAGE_SHIFT), IOPM_ALLOC_ORDER); - iopm_base = 0; -} - -static __init void svm_set_cpu_caps(void) -{ - kvm_set_cpu_caps(); - - supported_xss = 0; - - /* CPUID 0x80000001 and 0x8000000A (SVM features) */ - if (nested) { - kvm_cpu_cap_set(X86_FEATURE_SVM); - - if (nrips) - kvm_cpu_cap_set(X86_FEATURE_NRIPS); - - if (npt_enabled) - kvm_cpu_cap_set(X86_FEATURE_NPT); - } - - /* CPUID 0x80000008 */ - if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) || - boot_cpu_has(X86_FEATURE_AMD_SSBD)) - kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD); -} - -static __init int svm_hardware_setup(void) -{ - int cpu; - struct page *iopm_pages; - void *iopm_va; - int r; - - iopm_pages = alloc_pages(GFP_KERNEL, IOPM_ALLOC_ORDER); - - if (!iopm_pages) - return -ENOMEM; - - iopm_va = page_address(iopm_pages); - memset(iopm_va, 0xff, PAGE_SIZE * (1 << IOPM_ALLOC_ORDER)); - iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT; - - init_msrpm_offsets(); - - supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR); - - if (boot_cpu_has(X86_FEATURE_NX)) - kvm_enable_efer_bits(EFER_NX); - - if (boot_cpu_has(X86_FEATURE_FXSR_OPT)) - kvm_enable_efer_bits(EFER_FFXSR); - - if (boot_cpu_has(X86_FEATURE_TSCRATEMSR)) { - kvm_has_tsc_control = true; - kvm_max_tsc_scaling_ratio = TSC_RATIO_MAX; - kvm_tsc_scaling_ratio_frac_bits = 32; - } - - /* Check for pause filtering support */ - if (!boot_cpu_has(X86_FEATURE_PAUSEFILTER)) { - pause_filter_count = 0; - pause_filter_thresh = 0; - } else if (!boot_cpu_has(X86_FEATURE_PFTHRESHOLD)) { - pause_filter_thresh = 0; - } - - if (nested) { - printk(KERN_INFO "kvm: Nested Virtualization enabled\n"); - kvm_enable_efer_bits(EFER_SVME | EFER_LMSLE); - } - - if (sev) { - if (boot_cpu_has(X86_FEATURE_SEV) && - IS_ENABLED(CONFIG_KVM_AMD_SEV)) { - r = sev_hardware_setup(); - if (r) - sev = false; - } else { - sev = false; - } - } - - svm_adjust_mmio_mask(); - - for_each_possible_cpu(cpu) { - r = svm_cpu_init(cpu); - if (r) - goto err; - } - - if (!boot_cpu_has(X86_FEATURE_NPT)) - npt_enabled = false; - - if (npt_enabled && !npt) - npt_enabled = false; - - kvm_configure_mmu(npt_enabled, PT_PDPE_LEVEL); - pr_info("kvm: Nested Paging %sabled\n", npt_enabled ? "en" : "dis"); - - if (nrips) { - if (!boot_cpu_has(X86_FEATURE_NRIPS)) - nrips = false; - } - - if (avic) { - if (!npt_enabled || - !boot_cpu_has(X86_FEATURE_AVIC) || - !IS_ENABLED(CONFIG_X86_LOCAL_APIC)) { - avic = false; - } else { - pr_info("AVIC enabled\n"); - - amd_iommu_register_ga_log_notifier(&avic_ga_log_notifier); - } - } - - if (vls) { - if (!npt_enabled || - !boot_cpu_has(X86_FEATURE_V_VMSAVE_VMLOAD) || - !IS_ENABLED(CONFIG_X86_64)) { - vls = false; - } else { - pr_info("Virtual VMLOAD VMSAVE supported\n"); - } - } - - if (vgif) { - if (!boot_cpu_has(X86_FEATURE_VGIF)) - vgif = false; - else - pr_info("Virtual GIF supported\n"); - } - - svm_set_cpu_caps(); - - return 0; - -err: - svm_hardware_teardown(); - return r; -} - -static void init_seg(struct vmcb_seg *seg) -{ - seg->selector = 0; - seg->attrib = SVM_SELECTOR_P_MASK | SVM_SELECTOR_S_MASK | - SVM_SELECTOR_WRITE_MASK; /* Read/Write Data Segment */ - seg->limit = 0xffff; - seg->base = 0; -} - -static void init_sys_seg(struct vmcb_seg *seg, uint32_t type) -{ - seg->selector = 0; - seg->attrib = SVM_SELECTOR_P_MASK | type; - seg->limit = 0xffff; - seg->base = 0; -} - -static u64 svm_read_l1_tsc_offset(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - if (is_guest_mode(vcpu)) - return svm->nested.hsave->control.tsc_offset; - - return vcpu->arch.tsc_offset; -} - -static u64 svm_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) -{ - struct vcpu_svm *svm = to_svm(vcpu); - u64 g_tsc_offset = 0; - - if (is_guest_mode(vcpu)) { - /* Write L1's TSC offset. */ - g_tsc_offset = svm->vmcb->control.tsc_offset - - svm->nested.hsave->control.tsc_offset; - svm->nested.hsave->control.tsc_offset = offset; - } - - trace_kvm_write_tsc_offset(vcpu->vcpu_id, - svm->vmcb->control.tsc_offset - g_tsc_offset, - offset); - - svm->vmcb->control.tsc_offset = offset + g_tsc_offset; - - mark_dirty(svm->vmcb, VMCB_INTERCEPTS); - return svm->vmcb->control.tsc_offset; -} - -static void avic_init_vmcb(struct vcpu_svm *svm) -{ - struct vmcb *vmcb = svm->vmcb; - struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm); - phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page)); - phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page)); - phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page)); - - vmcb->control.avic_backing_page = bpa & AVIC_HPA_MASK; - vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK; - vmcb->control.avic_physical_id = ppa & AVIC_HPA_MASK; - vmcb->control.avic_physical_id |= AVIC_MAX_PHYSICAL_ID_COUNT; - if (kvm_apicv_activated(svm->vcpu.kvm)) - vmcb->control.int_ctl |= AVIC_ENABLE_MASK; - else - vmcb->control.int_ctl &= ~AVIC_ENABLE_MASK; -} - -static void init_vmcb(struct vcpu_svm *svm) -{ - struct vmcb_control_area *control = &svm->vmcb->control; - struct vmcb_save_area *save = &svm->vmcb->save; - - svm->vcpu.arch.hflags = 0; - - set_cr_intercept(svm, INTERCEPT_CR0_READ); - set_cr_intercept(svm, INTERCEPT_CR3_READ); - set_cr_intercept(svm, INTERCEPT_CR4_READ); - set_cr_intercept(svm, INTERCEPT_CR0_WRITE); - set_cr_intercept(svm, INTERCEPT_CR3_WRITE); - set_cr_intercept(svm, INTERCEPT_CR4_WRITE); - if (!kvm_vcpu_apicv_active(&svm->vcpu)) - set_cr_intercept(svm, INTERCEPT_CR8_WRITE); - - set_dr_intercepts(svm); - - set_exception_intercept(svm, PF_VECTOR); - set_exception_intercept(svm, UD_VECTOR); - set_exception_intercept(svm, MC_VECTOR); - set_exception_intercept(svm, AC_VECTOR); - set_exception_intercept(svm, DB_VECTOR); - /* - * Guest access to VMware backdoor ports could legitimately - * trigger #GP because of TSS I/O permission bitmap. - * We intercept those #GP and allow access to them anyway - * as VMware does. - */ - if (enable_vmware_backdoor) - set_exception_intercept(svm, GP_VECTOR); - - set_intercept(svm, INTERCEPT_INTR); - set_intercept(svm, INTERCEPT_NMI); - set_intercept(svm, INTERCEPT_SMI); - set_intercept(svm, INTERCEPT_SELECTIVE_CR0); - set_intercept(svm, INTERCEPT_RDPMC); - set_intercept(svm, INTERCEPT_CPUID); - set_intercept(svm, INTERCEPT_INVD); - set_intercept(svm, INTERCEPT_INVLPG); - set_intercept(svm, INTERCEPT_INVLPGA); - set_intercept(svm, INTERCEPT_IOIO_PROT); - set_intercept(svm, INTERCEPT_MSR_PROT); - set_intercept(svm, INTERCEPT_TASK_SWITCH); - set_intercept(svm, INTERCEPT_SHUTDOWN); - set_intercept(svm, INTERCEPT_VMRUN); - set_intercept(svm, INTERCEPT_VMMCALL); - set_intercept(svm, INTERCEPT_VMLOAD); - set_intercept(svm, INTERCEPT_VMSAVE); - set_intercept(svm, INTERCEPT_STGI); - set_intercept(svm, INTERCEPT_CLGI); - set_intercept(svm, INTERCEPT_SKINIT); - set_intercept(svm, INTERCEPT_WBINVD); - set_intercept(svm, INTERCEPT_XSETBV); - set_intercept(svm, INTERCEPT_RDPRU); - set_intercept(svm, INTERCEPT_RSM); - - if (!kvm_mwait_in_guest(svm->vcpu.kvm)) { - set_intercept(svm, INTERCEPT_MONITOR); - set_intercept(svm, INTERCEPT_MWAIT); - } - - if (!kvm_hlt_in_guest(svm->vcpu.kvm)) - set_intercept(svm, INTERCEPT_HLT); - - control->iopm_base_pa = __sme_set(iopm_base); - control->msrpm_base_pa = __sme_set(__pa(svm->msrpm)); - control->int_ctl = V_INTR_MASKING_MASK; - - init_seg(&save->es); - init_seg(&save->ss); - init_seg(&save->ds); - init_seg(&save->fs); - init_seg(&save->gs); - - save->cs.selector = 0xf000; - save->cs.base = 0xffff0000; - /* Executable/Readable Code Segment */ - save->cs.attrib = SVM_SELECTOR_READ_MASK | SVM_SELECTOR_P_MASK | - SVM_SELECTOR_S_MASK | SVM_SELECTOR_CODE_MASK; - save->cs.limit = 0xffff; - - save->gdtr.limit = 0xffff; - save->idtr.limit = 0xffff; - - init_sys_seg(&save->ldtr, SEG_TYPE_LDT); - init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16); - - svm_set_efer(&svm->vcpu, 0); - save->dr6 = 0xffff0ff0; - kvm_set_rflags(&svm->vcpu, 2); - save->rip = 0x0000fff0; - svm->vcpu.arch.regs[VCPU_REGS_RIP] = save->rip; - - /* - * svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0. - * It also updates the guest-visible cr0 value. - */ - svm_set_cr0(&svm->vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET); - kvm_mmu_reset_context(&svm->vcpu); - - save->cr4 = X86_CR4_PAE; - /* rdx = ?? */ - - if (npt_enabled) { - /* Setup VMCB for Nested Paging */ - control->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE; - clr_intercept(svm, INTERCEPT_INVLPG); - clr_exception_intercept(svm, PF_VECTOR); - clr_cr_intercept(svm, INTERCEPT_CR3_READ); - clr_cr_intercept(svm, INTERCEPT_CR3_WRITE); - save->g_pat = svm->vcpu.arch.pat; - save->cr3 = 0; - save->cr4 = 0; - } - svm->asid_generation = 0; - - svm->nested.vmcb = 0; - svm->vcpu.arch.hflags = 0; - - if (pause_filter_count) { - control->pause_filter_count = pause_filter_count; - if (pause_filter_thresh) - control->pause_filter_thresh = pause_filter_thresh; - set_intercept(svm, INTERCEPT_PAUSE); - } else { - clr_intercept(svm, INTERCEPT_PAUSE); - } - - if (kvm_vcpu_apicv_active(&svm->vcpu)) - avic_init_vmcb(svm); - - /* - * If hardware supports Virtual VMLOAD VMSAVE then enable it - * in VMCB and clear intercepts to avoid #VMEXIT. - */ - if (vls) { - clr_intercept(svm, INTERCEPT_VMLOAD); - clr_intercept(svm, INTERCEPT_VMSAVE); - svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; - } - - if (vgif) { - clr_intercept(svm, INTERCEPT_STGI); - clr_intercept(svm, INTERCEPT_CLGI); - svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK; - } - - if (sev_guest(svm->vcpu.kvm)) { - svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ENABLE; - clr_exception_intercept(svm, UD_VECTOR); - } - - mark_all_dirty(svm->vmcb); - - enable_gif(svm); - -} - -static u64 *avic_get_physical_id_entry(struct kvm_vcpu *vcpu, - unsigned int index) -{ - u64 *avic_physical_id_table; - struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm); - - if (index >= AVIC_MAX_PHYSICAL_ID_COUNT) - return NULL; - - avic_physical_id_table = page_address(kvm_svm->avic_physical_id_table_page); - - return &avic_physical_id_table[index]; -} - -/** - * Note: - * AVIC hardware walks the nested page table to check permissions, - * but does not use the SPA address specified in the leaf page - * table entry since it uses address in the AVIC_BACKING_PAGE pointer - * field of the VMCB. Therefore, we set up the - * APIC_ACCESS_PAGE_PRIVATE_MEMSLOT (4KB) here. - */ -static int avic_update_access_page(struct kvm *kvm, bool activate) -{ - int ret = 0; - - mutex_lock(&kvm->slots_lock); - /* - * During kvm_destroy_vm(), kvm_pit_set_reinject() could trigger - * APICv mode change, which update APIC_ACCESS_PAGE_PRIVATE_MEMSLOT - * memory region. So, we need to ensure that kvm->mm == current->mm. - */ - if ((kvm->arch.apic_access_page_done == activate) || - (kvm->mm != current->mm)) - goto out; - - ret = __x86_set_memory_region(kvm, - APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, - APIC_DEFAULT_PHYS_BASE, - activate ? PAGE_SIZE : 0); - if (ret) - goto out; - - kvm->arch.apic_access_page_done = activate; -out: - mutex_unlock(&kvm->slots_lock); - return ret; -} - -static int avic_init_backing_page(struct kvm_vcpu *vcpu) -{ - u64 *entry, new_entry; - int id = vcpu->vcpu_id; - struct vcpu_svm *svm = to_svm(vcpu); - - if (id >= AVIC_MAX_PHYSICAL_ID_COUNT) - return -EINVAL; - - if (!svm->vcpu.arch.apic->regs) - return -EINVAL; - - if (kvm_apicv_activated(vcpu->kvm)) { - int ret; - - ret = avic_update_access_page(vcpu->kvm, true); - if (ret) - return ret; - } - - svm->avic_backing_page = virt_to_page(svm->vcpu.arch.apic->regs); - - /* Setting AVIC backing page address in the phy APIC ID table */ - entry = avic_get_physical_id_entry(vcpu, id); - if (!entry) - return -EINVAL; - - new_entry = __sme_set((page_to_phys(svm->avic_backing_page) & - AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK) | - AVIC_PHYSICAL_ID_ENTRY_VALID_MASK); - WRITE_ONCE(*entry, new_entry); - - svm->avic_physical_id_cache = entry; - - return 0; -} - -static void sev_asid_free(int asid) -{ - struct svm_cpu_data *sd; - int cpu, pos; - - mutex_lock(&sev_bitmap_lock); - - pos = asid - 1; - __set_bit(pos, sev_reclaim_asid_bitmap); - - for_each_possible_cpu(cpu) { - sd = per_cpu(svm_data, cpu); - sd->sev_vmcbs[pos] = NULL; - } - - mutex_unlock(&sev_bitmap_lock); -} - -static void sev_unbind_asid(struct kvm *kvm, unsigned int handle) -{ - struct sev_data_decommission *decommission; - struct sev_data_deactivate *data; - - if (!handle) - return; - - data = kzalloc(sizeof(*data), GFP_KERNEL); - if (!data) - return; - - /* deactivate handle */ - data->handle = handle; - - /* Guard DEACTIVATE against WBINVD/DF_FLUSH used in ASID recycling */ - down_read(&sev_deactivate_lock); - sev_guest_deactivate(data, NULL); - up_read(&sev_deactivate_lock); - - kfree(data); - - decommission = kzalloc(sizeof(*decommission), GFP_KERNEL); - if (!decommission) - return; - - /* decommission handle */ - decommission->handle = handle; - sev_guest_decommission(decommission, NULL); - - kfree(decommission); -} - -static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr, - unsigned long ulen, unsigned long *n, - int write) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - unsigned long npages, npinned, size; - unsigned long locked, lock_limit; - struct page **pages; - unsigned long first, last; - - if (ulen == 0 || uaddr + ulen < uaddr) - return NULL; - - /* Calculate number of pages. */ - first = (uaddr & PAGE_MASK) >> PAGE_SHIFT; - last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT; - npages = (last - first + 1); - - locked = sev->pages_locked + npages; - lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - if (locked > lock_limit && !capable(CAP_IPC_LOCK)) { - pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit); - return NULL; - } - - /* Avoid using vmalloc for smaller buffers. */ - size = npages * sizeof(struct page *); - if (size > PAGE_SIZE) - pages = __vmalloc(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO, - PAGE_KERNEL); - else - pages = kmalloc(size, GFP_KERNEL_ACCOUNT); - - if (!pages) - return NULL; - - /* Pin the user virtual address. */ - npinned = get_user_pages_fast(uaddr, npages, FOLL_WRITE, pages); - if (npinned != npages) { - pr_err("SEV: Failure locking %lu pages.\n", npages); - goto err; - } - - *n = npages; - sev->pages_locked = locked; - - return pages; - -err: - if (npinned > 0) - release_pages(pages, npinned); - - kvfree(pages); - return NULL; -} - -static void sev_unpin_memory(struct kvm *kvm, struct page **pages, - unsigned long npages) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - - release_pages(pages, npages); - kvfree(pages); - sev->pages_locked -= npages; -} - -static void sev_clflush_pages(struct page *pages[], unsigned long npages) -{ - uint8_t *page_virtual; - unsigned long i; - - if (npages == 0 || pages == NULL) - return; - - for (i = 0; i < npages; i++) { - page_virtual = kmap_atomic(pages[i]); - clflush_cache_range(page_virtual, PAGE_SIZE); - kunmap_atomic(page_virtual); - } -} - -static void __unregister_enc_region_locked(struct kvm *kvm, - struct enc_region *region) -{ - sev_unpin_memory(kvm, region->pages, region->npages); - list_del(®ion->list); - kfree(region); -} - -static void sev_vm_destroy(struct kvm *kvm) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct list_head *head = &sev->regions_list; - struct list_head *pos, *q; - - if (!sev_guest(kvm)) - return; - - mutex_lock(&kvm->lock); - - /* - * Ensure that all guest tagged cache entries are flushed before - * releasing the pages back to the system for use. CLFLUSH will - * not do this, so issue a WBINVD. - */ - wbinvd_on_all_cpus(); - - /* - * if userspace was terminated before unregistering the memory regions - * then lets unpin all the registered memory. - */ - if (!list_empty(head)) { - list_for_each_safe(pos, q, head) { - __unregister_enc_region_locked(kvm, - list_entry(pos, struct enc_region, list)); - } - } - - mutex_unlock(&kvm->lock); - - sev_unbind_asid(kvm, sev->handle); - sev_asid_free(sev->asid); -} - -static void avic_vm_destroy(struct kvm *kvm) -{ - unsigned long flags; - struct kvm_svm *kvm_svm = to_kvm_svm(kvm); - - if (!avic) - return; - - if (kvm_svm->avic_logical_id_table_page) - __free_page(kvm_svm->avic_logical_id_table_page); - if (kvm_svm->avic_physical_id_table_page) - __free_page(kvm_svm->avic_physical_id_table_page); - - spin_lock_irqsave(&svm_vm_data_hash_lock, flags); - hash_del(&kvm_svm->hnode); - spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags); -} - -static void svm_vm_destroy(struct kvm *kvm) -{ - avic_vm_destroy(kvm); - sev_vm_destroy(kvm); -} - -static int avic_vm_init(struct kvm *kvm) -{ - unsigned long flags; - int err = -ENOMEM; - struct kvm_svm *kvm_svm = to_kvm_svm(kvm); - struct kvm_svm *k2; - struct page *p_page; - struct page *l_page; - u32 vm_id; - - if (!avic) - return 0; - - /* Allocating physical APIC ID table (4KB) */ - p_page = alloc_page(GFP_KERNEL_ACCOUNT); - if (!p_page) - goto free_avic; - - kvm_svm->avic_physical_id_table_page = p_page; - clear_page(page_address(p_page)); - - /* Allocating logical APIC ID table (4KB) */ - l_page = alloc_page(GFP_KERNEL_ACCOUNT); - if (!l_page) - goto free_avic; - - kvm_svm->avic_logical_id_table_page = l_page; - clear_page(page_address(l_page)); - - spin_lock_irqsave(&svm_vm_data_hash_lock, flags); - again: - vm_id = next_vm_id = (next_vm_id + 1) & AVIC_VM_ID_MASK; - if (vm_id == 0) { /* id is 1-based, zero is not okay */ - next_vm_id_wrapped = 1; - goto again; - } - /* Is it still in use? Only possible if wrapped at least once */ - if (next_vm_id_wrapped) { - hash_for_each_possible(svm_vm_data_hash, k2, hnode, vm_id) { - if (k2->avic_vm_id == vm_id) - goto again; - } - } - kvm_svm->avic_vm_id = vm_id; - hash_add(svm_vm_data_hash, &kvm_svm->hnode, kvm_svm->avic_vm_id); - spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags); - - return 0; - -free_avic: - avic_vm_destroy(kvm); - return err; -} - -static int svm_vm_init(struct kvm *kvm) -{ - if (avic) { - int ret = avic_vm_init(kvm); - if (ret) - return ret; - } - - kvm_apicv_init(kvm, avic); - return 0; -} - -static inline int -avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r) -{ - int ret = 0; - unsigned long flags; - struct amd_svm_iommu_ir *ir; - struct vcpu_svm *svm = to_svm(vcpu); - - if (!kvm_arch_has_assigned_device(vcpu->kvm)) - return 0; - - /* - * Here, we go through the per-vcpu ir_list to update all existing - * interrupt remapping table entry targeting this vcpu. - */ - spin_lock_irqsave(&svm->ir_list_lock, flags); - - if (list_empty(&svm->ir_list)) - goto out; - - list_for_each_entry(ir, &svm->ir_list, node) { - ret = amd_iommu_update_ga(cpu, r, ir->data); - if (ret) - break; - } -out: - spin_unlock_irqrestore(&svm->ir_list_lock, flags); - return ret; -} - -static void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu) -{ - u64 entry; - /* ID = 0xff (broadcast), ID > 0xff (reserved) */ - int h_physical_id = kvm_cpu_get_apicid(cpu); - struct vcpu_svm *svm = to_svm(vcpu); - - if (!kvm_vcpu_apicv_active(vcpu)) - return; - - /* - * Since the host physical APIC id is 8 bits, - * we can support host APIC ID upto 255. - */ - if (WARN_ON(h_physical_id > AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK)) - return; - - entry = READ_ONCE(*(svm->avic_physical_id_cache)); - WARN_ON(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK); - - entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK; - entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK); - - entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK; - if (svm->avic_is_running) - entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK; - - WRITE_ONCE(*(svm->avic_physical_id_cache), entry); - avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, - svm->avic_is_running); -} - -static void avic_vcpu_put(struct kvm_vcpu *vcpu) -{ - u64 entry; - struct vcpu_svm *svm = to_svm(vcpu); - - if (!kvm_vcpu_apicv_active(vcpu)) - return; - - entry = READ_ONCE(*(svm->avic_physical_id_cache)); - if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK) - avic_update_iommu_vcpu_affinity(vcpu, -1, 0); - - entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK; - WRITE_ONCE(*(svm->avic_physical_id_cache), entry); -} - -/** - * This function is called during VCPU halt/unhalt. - */ -static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - svm->avic_is_running = is_run; - if (is_run) - avic_vcpu_load(vcpu, vcpu->cpu); - else - avic_vcpu_put(vcpu); -} - -static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) -{ - struct vcpu_svm *svm = to_svm(vcpu); - u32 dummy; - u32 eax = 1; - - svm->spec_ctrl = 0; - svm->virt_spec_ctrl = 0; - - if (!init_event) { - svm->vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE | - MSR_IA32_APICBASE_ENABLE; - if (kvm_vcpu_is_reset_bsp(&svm->vcpu)) - svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP; - } - init_vmcb(svm); - - kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, false); - kvm_rdx_write(vcpu, eax); - - if (kvm_vcpu_apicv_active(vcpu) && !init_event) - avic_update_vapic_bar(svm, APIC_DEFAULT_PHYS_BASE); -} - -static int avic_init_vcpu(struct vcpu_svm *svm) -{ - int ret; - struct kvm_vcpu *vcpu = &svm->vcpu; - - if (!avic || !irqchip_in_kernel(vcpu->kvm)) - return 0; - - ret = avic_init_backing_page(&svm->vcpu); - if (ret) - return ret; - - INIT_LIST_HEAD(&svm->ir_list); - spin_lock_init(&svm->ir_list_lock); - svm->dfr_reg = APIC_DFR_FLAT; - - return ret; -} - -static int svm_create_vcpu(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm; - struct page *page; - struct page *msrpm_pages; - struct page *hsave_page; - struct page *nested_msrpm_pages; - int err; - - BUILD_BUG_ON(offsetof(struct vcpu_svm, vcpu) != 0); - svm = to_svm(vcpu); - - err = -ENOMEM; - page = alloc_page(GFP_KERNEL_ACCOUNT); - if (!page) - goto out; - - msrpm_pages = alloc_pages(GFP_KERNEL_ACCOUNT, MSRPM_ALLOC_ORDER); - if (!msrpm_pages) - goto free_page1; - - nested_msrpm_pages = alloc_pages(GFP_KERNEL_ACCOUNT, MSRPM_ALLOC_ORDER); - if (!nested_msrpm_pages) - goto free_page2; - - hsave_page = alloc_page(GFP_KERNEL_ACCOUNT); - if (!hsave_page) - goto free_page3; - - err = avic_init_vcpu(svm); - if (err) - goto free_page4; - - /* We initialize this flag to true to make sure that the is_running - * bit would be set the first time the vcpu is loaded. - */ - if (irqchip_in_kernel(vcpu->kvm) && kvm_apicv_activated(vcpu->kvm)) - svm->avic_is_running = true; - - svm->nested.hsave = page_address(hsave_page); - - svm->msrpm = page_address(msrpm_pages); - svm_vcpu_init_msrpm(svm->msrpm); - - svm->nested.msrpm = page_address(nested_msrpm_pages); - svm_vcpu_init_msrpm(svm->nested.msrpm); - - svm->vmcb = page_address(page); - clear_page(svm->vmcb); - svm->vmcb_pa = __sme_set(page_to_pfn(page) << PAGE_SHIFT); - svm->asid_generation = 0; - init_vmcb(svm); - - svm_init_osvw(vcpu); - vcpu->arch.microcode_version = 0x01000065; - - return 0; - -free_page4: - __free_page(hsave_page); -free_page3: - __free_pages(nested_msrpm_pages, MSRPM_ALLOC_ORDER); -free_page2: - __free_pages(msrpm_pages, MSRPM_ALLOC_ORDER); -free_page1: - __free_page(page); -out: - return err; -} - -static void svm_clear_current_vmcb(struct vmcb *vmcb) -{ - int i; - - for_each_online_cpu(i) - cmpxchg(&per_cpu(svm_data, i)->current_vmcb, vmcb, NULL); -} - -static void svm_free_vcpu(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - /* - * The vmcb page can be recycled, causing a false negative in - * svm_vcpu_load(). So, ensure that no logical CPU has this - * vmcb page recorded as its current vmcb. - */ - svm_clear_current_vmcb(svm->vmcb); - - __free_page(pfn_to_page(__sme_clr(svm->vmcb_pa) >> PAGE_SHIFT)); - __free_pages(virt_to_page(svm->msrpm), MSRPM_ALLOC_ORDER); - __free_page(virt_to_page(svm->nested.hsave)); - __free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER); -} - -static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct svm_cpu_data *sd = per_cpu(svm_data, cpu); - int i; - - if (unlikely(cpu != vcpu->cpu)) { - svm->asid_generation = 0; - mark_all_dirty(svm->vmcb); - } - -#ifdef CONFIG_X86_64 - rdmsrl(MSR_GS_BASE, to_svm(vcpu)->host.gs_base); -#endif - savesegment(fs, svm->host.fs); - savesegment(gs, svm->host.gs); - svm->host.ldt = kvm_read_ldt(); - - for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++) - rdmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]); - - if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) { - u64 tsc_ratio = vcpu->arch.tsc_scaling_ratio; - if (tsc_ratio != __this_cpu_read(current_tsc_ratio)) { - __this_cpu_write(current_tsc_ratio, tsc_ratio); - wrmsrl(MSR_AMD64_TSC_RATIO, tsc_ratio); - } - } - /* This assumes that the kernel never uses MSR_TSC_AUX */ - if (static_cpu_has(X86_FEATURE_RDTSCP)) - wrmsrl(MSR_TSC_AUX, svm->tsc_aux); - - if (sd->current_vmcb != svm->vmcb) { - sd->current_vmcb = svm->vmcb; - indirect_branch_prediction_barrier(); - } - avic_vcpu_load(vcpu, cpu); -} - -static void svm_vcpu_put(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - int i; - - avic_vcpu_put(vcpu); - - ++vcpu->stat.host_state_reload; - kvm_load_ldt(svm->host.ldt); -#ifdef CONFIG_X86_64 - loadsegment(fs, svm->host.fs); - wrmsrl(MSR_KERNEL_GS_BASE, current->thread.gsbase); - load_gs_index(svm->host.gs); -#else -#ifdef CONFIG_X86_32_LAZY_GS - loadsegment(gs, svm->host.gs); -#endif -#endif - for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++) - wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]); -} - -static void svm_vcpu_blocking(struct kvm_vcpu *vcpu) -{ - avic_set_running(vcpu, false); -} - -static void svm_vcpu_unblocking(struct kvm_vcpu *vcpu) -{ - if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu)) - kvm_vcpu_update_apicv(vcpu); - avic_set_running(vcpu, true); -} - -static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - unsigned long rflags = svm->vmcb->save.rflags; - - if (svm->nmi_singlestep) { - /* Hide our flags if they were not set by the guest */ - if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF)) - rflags &= ~X86_EFLAGS_TF; - if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_RF)) - rflags &= ~X86_EFLAGS_RF; - } - return rflags; -} - -static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) -{ - if (to_svm(vcpu)->nmi_singlestep) - rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF); - - /* - * Any change of EFLAGS.VM is accompanied by a reload of SS - * (caused by either a task switch or an inter-privilege IRET), - * so we do not need to update the CPL here. - */ - to_svm(vcpu)->vmcb->save.rflags = rflags; -} - -static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) -{ - switch (reg) { - case VCPU_EXREG_PDPTR: - BUG_ON(!npt_enabled); - load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu)); - break; - default: - WARN_ON_ONCE(1); - } -} - -static inline void svm_enable_vintr(struct vcpu_svm *svm) -{ - struct vmcb_control_area *control; - - /* The following fields are ignored when AVIC is enabled */ - WARN_ON(kvm_vcpu_apicv_active(&svm->vcpu)); - - /* - * This is just a dummy VINTR to actually cause a vmexit to happen. - * Actual injection of virtual interrupts happens through EVENTINJ. - */ - control = &svm->vmcb->control; - control->int_vector = 0x0; - control->int_ctl &= ~V_INTR_PRIO_MASK; - control->int_ctl |= V_IRQ_MASK | - ((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT); - mark_dirty(svm->vmcb, VMCB_INTR); -} - -static void svm_set_vintr(struct vcpu_svm *svm) -{ - set_intercept(svm, INTERCEPT_VINTR); - if (is_intercept(svm, INTERCEPT_VINTR)) - svm_enable_vintr(svm); -} - -static void svm_clear_vintr(struct vcpu_svm *svm) -{ - clr_intercept(svm, INTERCEPT_VINTR); - - svm->vmcb->control.int_ctl &= ~V_IRQ_MASK; - mark_dirty(svm->vmcb, VMCB_INTR); -} - -static struct vmcb_seg *svm_seg(struct kvm_vcpu *vcpu, int seg) -{ - struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save; - - switch (seg) { - case VCPU_SREG_CS: return &save->cs; - case VCPU_SREG_DS: return &save->ds; - case VCPU_SREG_ES: return &save->es; - case VCPU_SREG_FS: return &save->fs; - case VCPU_SREG_GS: return &save->gs; - case VCPU_SREG_SS: return &save->ss; - case VCPU_SREG_TR: return &save->tr; - case VCPU_SREG_LDTR: return &save->ldtr; - } - BUG(); - return NULL; -} - -static u64 svm_get_segment_base(struct kvm_vcpu *vcpu, int seg) -{ - struct vmcb_seg *s = svm_seg(vcpu, seg); - - return s->base; -} - -static void svm_get_segment(struct kvm_vcpu *vcpu, - struct kvm_segment *var, int seg) -{ - struct vmcb_seg *s = svm_seg(vcpu, seg); - - var->base = s->base; - var->limit = s->limit; - var->selector = s->selector; - var->type = s->attrib & SVM_SELECTOR_TYPE_MASK; - var->s = (s->attrib >> SVM_SELECTOR_S_SHIFT) & 1; - var->dpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3; - var->present = (s->attrib >> SVM_SELECTOR_P_SHIFT) & 1; - var->avl = (s->attrib >> SVM_SELECTOR_AVL_SHIFT) & 1; - var->l = (s->attrib >> SVM_SELECTOR_L_SHIFT) & 1; - var->db = (s->attrib >> SVM_SELECTOR_DB_SHIFT) & 1; - - /* - * AMD CPUs circa 2014 track the G bit for all segments except CS. - * However, the SVM spec states that the G bit is not observed by the - * CPU, and some VMware virtual CPUs drop the G bit for all segments. - * So let's synthesize a legal G bit for all segments, this helps - * running KVM nested. It also helps cross-vendor migration, because - * Intel's vmentry has a check on the 'G' bit. - */ - var->g = s->limit > 0xfffff; - - /* - * AMD's VMCB does not have an explicit unusable field, so emulate it - * for cross vendor migration purposes by "not present" - */ - var->unusable = !var->present; - - switch (seg) { - case VCPU_SREG_TR: - /* - * Work around a bug where the busy flag in the tr selector - * isn't exposed - */ - var->type |= 0x2; - break; - case VCPU_SREG_DS: - case VCPU_SREG_ES: - case VCPU_SREG_FS: - case VCPU_SREG_GS: - /* - * The accessed bit must always be set in the segment - * descriptor cache, although it can be cleared in the - * descriptor, the cached bit always remains at 1. Since - * Intel has a check on this, set it here to support - * cross-vendor migration. - */ - if (!var->unusable) - var->type |= 0x1; - break; - case VCPU_SREG_SS: - /* - * On AMD CPUs sometimes the DB bit in the segment - * descriptor is left as 1, although the whole segment has - * been made unusable. Clear it here to pass an Intel VMX - * entry check when cross vendor migrating. - */ - if (var->unusable) - var->db = 0; - /* This is symmetric with svm_set_segment() */ - var->dpl = to_svm(vcpu)->vmcb->save.cpl; - break; - } -} - -static int svm_get_cpl(struct kvm_vcpu *vcpu) -{ - struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save; - - return save->cpl; -} - -static void svm_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - dt->size = svm->vmcb->save.idtr.limit; - dt->address = svm->vmcb->save.idtr.base; -} - -static void svm_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - svm->vmcb->save.idtr.limit = dt->size; - svm->vmcb->save.idtr.base = dt->address ; - mark_dirty(svm->vmcb, VMCB_DT); -} - -static void svm_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - dt->size = svm->vmcb->save.gdtr.limit; - dt->address = svm->vmcb->save.gdtr.base; -} - -static void svm_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - svm->vmcb->save.gdtr.limit = dt->size; - svm->vmcb->save.gdtr.base = dt->address ; - mark_dirty(svm->vmcb, VMCB_DT); -} - -static void svm_decache_cr0_guest_bits(struct kvm_vcpu *vcpu) -{ -} - -static void svm_decache_cr4_guest_bits(struct kvm_vcpu *vcpu) -{ -} - -static void update_cr0_intercept(struct vcpu_svm *svm) -{ - ulong gcr0 = svm->vcpu.arch.cr0; - u64 *hcr0 = &svm->vmcb->save.cr0; - - *hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK) - | (gcr0 & SVM_CR0_SELECTIVE_MASK); - - mark_dirty(svm->vmcb, VMCB_CR); - - if (gcr0 == *hcr0) { - clr_cr_intercept(svm, INTERCEPT_CR0_READ); - clr_cr_intercept(svm, INTERCEPT_CR0_WRITE); - } else { - set_cr_intercept(svm, INTERCEPT_CR0_READ); - set_cr_intercept(svm, INTERCEPT_CR0_WRITE); - } -} - -static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) -{ - struct vcpu_svm *svm = to_svm(vcpu); - -#ifdef CONFIG_X86_64 - if (vcpu->arch.efer & EFER_LME) { - if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) { - vcpu->arch.efer |= EFER_LMA; - svm->vmcb->save.efer |= EFER_LMA | EFER_LME; - } - - if (is_paging(vcpu) && !(cr0 & X86_CR0_PG)) { - vcpu->arch.efer &= ~EFER_LMA; - svm->vmcb->save.efer &= ~(EFER_LMA | EFER_LME); - } - } -#endif - vcpu->arch.cr0 = cr0; - - if (!npt_enabled) - cr0 |= X86_CR0_PG | X86_CR0_WP; - - /* - * re-enable caching here because the QEMU bios - * does not do it - this results in some delay at - * reboot - */ - if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) - cr0 &= ~(X86_CR0_CD | X86_CR0_NW); - svm->vmcb->save.cr0 = cr0; - mark_dirty(svm->vmcb, VMCB_CR); - update_cr0_intercept(svm); -} - -static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) -{ - unsigned long host_cr4_mce = cr4_read_shadow() & X86_CR4_MCE; - unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4; - - if (cr4 & X86_CR4_VMXE) - return 1; - - if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE)) - svm_flush_tlb(vcpu, true); - - vcpu->arch.cr4 = cr4; - if (!npt_enabled) - cr4 |= X86_CR4_PAE; - cr4 |= host_cr4_mce; - to_svm(vcpu)->vmcb->save.cr4 = cr4; - mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR); - return 0; -} - -static void svm_set_segment(struct kvm_vcpu *vcpu, - struct kvm_segment *var, int seg) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb_seg *s = svm_seg(vcpu, seg); - - s->base = var->base; - s->limit = var->limit; - s->selector = var->selector; - s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK); - s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT; - s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT; - s->attrib |= ((var->present & 1) && !var->unusable) << SVM_SELECTOR_P_SHIFT; - s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT; - s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT; - s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT; - s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT; - - /* - * This is always accurate, except if SYSRET returned to a segment - * with SS.DPL != 3. Intel does not have this quirk, and always - * forces SS.DPL to 3 on sysret, so we ignore that case; fixing it - * would entail passing the CPL to userspace and back. - */ - if (seg == VCPU_SREG_SS) - /* This is symmetric with svm_get_segment() */ - svm->vmcb->save.cpl = (var->dpl & 3); - - mark_dirty(svm->vmcb, VMCB_SEG); -} - -static void update_bp_intercept(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - clr_exception_intercept(svm, BP_VECTOR); - - if (vcpu->guest_debug & KVM_GUESTDBG_ENABLE) { - if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP) - set_exception_intercept(svm, BP_VECTOR); - } else - vcpu->guest_debug = 0; -} - -static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd) -{ - if (sd->next_asid > sd->max_asid) { - ++sd->asid_generation; - sd->next_asid = sd->min_asid; - svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID; - } - - svm->asid_generation = sd->asid_generation; - svm->vmcb->control.asid = sd->next_asid++; - - mark_dirty(svm->vmcb, VMCB_ASID); -} - -static u64 svm_get_dr6(struct kvm_vcpu *vcpu) -{ - return to_svm(vcpu)->vmcb->save.dr6; -} - -static void svm_set_dr6(struct kvm_vcpu *vcpu, unsigned long value) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - svm->vmcb->save.dr6 = value; - mark_dirty(svm->vmcb, VMCB_DR); -} - -static void svm_sync_dirty_debug_regs(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - get_debugreg(vcpu->arch.db[0], 0); - get_debugreg(vcpu->arch.db[1], 1); - get_debugreg(vcpu->arch.db[2], 2); - get_debugreg(vcpu->arch.db[3], 3); - vcpu->arch.dr6 = svm_get_dr6(vcpu); - vcpu->arch.dr7 = svm->vmcb->save.dr7; - - vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_WONT_EXIT; - set_dr_intercepts(svm); -} - -static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - svm->vmcb->save.dr7 = value; - mark_dirty(svm->vmcb, VMCB_DR); -} - -static int pf_interception(struct vcpu_svm *svm) -{ - u64 fault_address = __sme_clr(svm->vmcb->control.exit_info_2); - u64 error_code = svm->vmcb->control.exit_info_1; - - return kvm_handle_page_fault(&svm->vcpu, error_code, fault_address, - static_cpu_has(X86_FEATURE_DECODEASSISTS) ? - svm->vmcb->control.insn_bytes : NULL, - svm->vmcb->control.insn_len); -} - -static int npf_interception(struct vcpu_svm *svm) -{ - u64 fault_address = __sme_clr(svm->vmcb->control.exit_info_2); - u64 error_code = svm->vmcb->control.exit_info_1; - - trace_kvm_page_fault(fault_address, error_code); - return kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code, - static_cpu_has(X86_FEATURE_DECODEASSISTS) ? - svm->vmcb->control.insn_bytes : NULL, - svm->vmcb->control.insn_len); -} - -static int db_interception(struct vcpu_svm *svm) -{ - struct kvm_run *kvm_run = svm->vcpu.run; - struct kvm_vcpu *vcpu = &svm->vcpu; - - if (!(svm->vcpu.guest_debug & - (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP)) && - !svm->nmi_singlestep) { - kvm_queue_exception(&svm->vcpu, DB_VECTOR); - return 1; - } - - if (svm->nmi_singlestep) { - disable_nmi_singlestep(svm); - /* Make sure we check for pending NMIs upon entry */ - kvm_make_request(KVM_REQ_EVENT, vcpu); - } - - if (svm->vcpu.guest_debug & - (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP)) { - kvm_run->exit_reason = KVM_EXIT_DEBUG; - kvm_run->debug.arch.pc = - svm->vmcb->save.cs.base + svm->vmcb->save.rip; - kvm_run->debug.arch.exception = DB_VECTOR; - return 0; - } - - return 1; -} - -static int bp_interception(struct vcpu_svm *svm) -{ - struct kvm_run *kvm_run = svm->vcpu.run; - - kvm_run->exit_reason = KVM_EXIT_DEBUG; - kvm_run->debug.arch.pc = svm->vmcb->save.cs.base + svm->vmcb->save.rip; - kvm_run->debug.arch.exception = BP_VECTOR; - return 0; -} - -static int ud_interception(struct vcpu_svm *svm) -{ - return handle_ud(&svm->vcpu); -} - -static int ac_interception(struct vcpu_svm *svm) -{ - kvm_queue_exception_e(&svm->vcpu, AC_VECTOR, 0); - return 1; -} - -static int gp_interception(struct vcpu_svm *svm) -{ - struct kvm_vcpu *vcpu = &svm->vcpu; - u32 error_code = svm->vmcb->control.exit_info_1; - - WARN_ON_ONCE(!enable_vmware_backdoor); - - /* - * VMware backdoor emulation on #GP interception only handles IN{S}, - * OUT{S}, and RDPMC, none of which generate a non-zero error code. - */ - if (error_code) { - kvm_queue_exception_e(vcpu, GP_VECTOR, error_code); - return 1; - } - return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP); -} - -static bool is_erratum_383(void) -{ - int err, i; - u64 value; - - if (!erratum_383_found) - return false; - - value = native_read_msr_safe(MSR_IA32_MC0_STATUS, &err); - if (err) - return false; - - /* Bit 62 may or may not be set for this mce */ - value &= ~(1ULL << 62); - - if (value != 0xb600000000010015ULL) - return false; - - /* Clear MCi_STATUS registers */ - for (i = 0; i < 6; ++i) - native_write_msr_safe(MSR_IA32_MCx_STATUS(i), 0, 0); - - value = native_read_msr_safe(MSR_IA32_MCG_STATUS, &err); - if (!err) { - u32 low, high; - - value &= ~(1ULL << 2); - low = lower_32_bits(value); - high = upper_32_bits(value); - - native_write_msr_safe(MSR_IA32_MCG_STATUS, low, high); - } - - /* Flush tlb to evict multi-match entries */ - __flush_tlb_all(); - - return true; -} - -static void svm_handle_mce(struct vcpu_svm *svm) -{ - if (is_erratum_383()) { - /* - * Erratum 383 triggered. Guest state is corrupt so kill the - * guest. - */ - pr_err("KVM: Guest triggered AMD Erratum 383\n"); - - kvm_make_request(KVM_REQ_TRIPLE_FAULT, &svm->vcpu); - - return; - } - - /* - * On an #MC intercept the MCE handler is not called automatically in - * the host. So do it by hand here. - */ - asm volatile ( - "int $0x12\n"); - /* not sure if we ever come back to this point */ - - return; -} - -static int mc_interception(struct vcpu_svm *svm) -{ - return 1; -} - -static int shutdown_interception(struct vcpu_svm *svm) -{ - struct kvm_run *kvm_run = svm->vcpu.run; - - /* - * VMCB is undefined after a SHUTDOWN intercept - * so reinitialize it. - */ - clear_page(svm->vmcb); - init_vmcb(svm); - - kvm_run->exit_reason = KVM_EXIT_SHUTDOWN; - return 0; -} - -static int io_interception(struct vcpu_svm *svm) -{ - struct kvm_vcpu *vcpu = &svm->vcpu; - u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */ - int size, in, string; - unsigned port; - - ++svm->vcpu.stat.io_exits; - string = (io_info & SVM_IOIO_STR_MASK) != 0; - in = (io_info & SVM_IOIO_TYPE_MASK) != 0; - if (string) - return kvm_emulate_instruction(vcpu, 0); - - port = io_info >> 16; - size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT; - svm->next_rip = svm->vmcb->control.exit_info_2; - - return kvm_fast_pio(&svm->vcpu, size, port, in); -} - -static int nmi_interception(struct vcpu_svm *svm) -{ - return 1; -} - -static int intr_interception(struct vcpu_svm *svm) -{ - ++svm->vcpu.stat.irq_exits; - return 1; -} - -static int nop_on_interception(struct vcpu_svm *svm) -{ - return 1; -} - -static int halt_interception(struct vcpu_svm *svm) -{ - return kvm_emulate_halt(&svm->vcpu); -} - -static int vmmcall_interception(struct vcpu_svm *svm) -{ - return kvm_emulate_hypercall(&svm->vcpu); -} - -static unsigned long nested_svm_get_tdp_cr3(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - return svm->nested.nested_cr3; -} - -static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index) -{ - struct vcpu_svm *svm = to_svm(vcpu); - u64 cr3 = svm->nested.nested_cr3; - u64 pdpte; - int ret; - - ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(__sme_clr(cr3)), &pdpte, - offset_in_page(cr3) + index * 8, 8); - if (ret) - return 0; - return pdpte; -} - -static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu, - struct x86_exception *fault) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - if (svm->vmcb->control.exit_code != SVM_EXIT_NPF) { - /* - * TODO: track the cause of the nested page fault, and - * correctly fill in the high bits of exit_info_1. - */ - svm->vmcb->control.exit_code = SVM_EXIT_NPF; - svm->vmcb->control.exit_code_hi = 0; - svm->vmcb->control.exit_info_1 = (1ULL << 32); - svm->vmcb->control.exit_info_2 = fault->address; - } - - svm->vmcb->control.exit_info_1 &= ~0xffffffffULL; - svm->vmcb->control.exit_info_1 |= fault->error_code; - - /* - * The present bit is always zero for page structure faults on real - * hardware. - */ - if (svm->vmcb->control.exit_info_1 & (2ULL << 32)) - svm->vmcb->control.exit_info_1 &= ~1; - - nested_svm_vmexit(svm); -} - -static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu) -{ - WARN_ON(mmu_is_nested(vcpu)); - - vcpu->arch.mmu = &vcpu->arch.guest_mmu; - kvm_init_shadow_mmu(vcpu); - vcpu->arch.mmu->get_guest_pgd = nested_svm_get_tdp_cr3; - vcpu->arch.mmu->get_pdptr = nested_svm_get_tdp_pdptr; - vcpu->arch.mmu->inject_page_fault = nested_svm_inject_npf_exit; - vcpu->arch.mmu->shadow_root_level = get_npt_level(vcpu); - reset_shadow_zero_bits_mask(vcpu, vcpu->arch.mmu); - vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu; -} - -static void nested_svm_uninit_mmu_context(struct kvm_vcpu *vcpu) -{ - vcpu->arch.mmu = &vcpu->arch.root_mmu; - vcpu->arch.walk_mmu = &vcpu->arch.root_mmu; -} - -static int nested_svm_check_permissions(struct vcpu_svm *svm) -{ - if (!(svm->vcpu.arch.efer & EFER_SVME) || - !is_paging(&svm->vcpu)) { - kvm_queue_exception(&svm->vcpu, UD_VECTOR); - return 1; - } - - if (svm->vmcb->save.cpl) { - kvm_inject_gp(&svm->vcpu, 0); - return 1; - } - - return 0; -} - -static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, - bool has_error_code, u32 error_code) -{ - int vmexit; - - if (!is_guest_mode(&svm->vcpu)) - return 0; - - vmexit = nested_svm_intercept(svm); - if (vmexit != NESTED_EXIT_DONE) - return 0; - - svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr; - svm->vmcb->control.exit_code_hi = 0; - svm->vmcb->control.exit_info_1 = error_code; - - /* - * EXITINFO2 is undefined for all exception intercepts other - * than #PF. - */ - if (svm->vcpu.arch.exception.nested_apf) - svm->vmcb->control.exit_info_2 = svm->vcpu.arch.apf.nested_apf_token; - else if (svm->vcpu.arch.exception.has_payload) - svm->vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload; - else - svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2; - - svm->nested.exit_required = true; - return vmexit; -} - -static void nested_svm_intr(struct vcpu_svm *svm) -{ - svm->vmcb->control.exit_code = SVM_EXIT_INTR; - svm->vmcb->control.exit_info_1 = 0; - svm->vmcb->control.exit_info_2 = 0; - - /* nested_svm_vmexit this gets called afterwards from handle_exit */ - svm->nested.exit_required = true; - trace_kvm_nested_intr_vmexit(svm->vmcb->save.rip); -} - -static bool nested_exit_on_intr(struct vcpu_svm *svm) -{ - return (svm->nested.intercept & 1ULL); -} - -static int svm_check_nested_events(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - bool block_nested_events = - kvm_event_needs_reinjection(vcpu) || svm->nested.exit_required; - - if (kvm_cpu_has_interrupt(vcpu) && nested_exit_on_intr(svm)) { - if (block_nested_events) - return -EBUSY; - nested_svm_intr(svm); - return 0; - } - - return 0; -} - -/* This function returns true if it is save to enable the nmi window */ -static inline bool nested_svm_nmi(struct vcpu_svm *svm) -{ - if (!is_guest_mode(&svm->vcpu)) - return true; - - if (!(svm->nested.intercept & (1ULL << INTERCEPT_NMI))) - return true; - - svm->vmcb->control.exit_code = SVM_EXIT_NMI; - svm->nested.exit_required = true; - - return false; -} - -static int nested_svm_intercept_ioio(struct vcpu_svm *svm) -{ - unsigned port, size, iopm_len; - u16 val, mask; - u8 start_bit; - u64 gpa; - - if (!(svm->nested.intercept & (1ULL << INTERCEPT_IOIO_PROT))) - return NESTED_EXIT_HOST; - - port = svm->vmcb->control.exit_info_1 >> 16; - size = (svm->vmcb->control.exit_info_1 & SVM_IOIO_SIZE_MASK) >> - SVM_IOIO_SIZE_SHIFT; - gpa = svm->nested.vmcb_iopm + (port / 8); - start_bit = port % 8; - iopm_len = (start_bit + size > 8) ? 2 : 1; - mask = (0xf >> (4 - size)) << start_bit; - val = 0; - - if (kvm_vcpu_read_guest(&svm->vcpu, gpa, &val, iopm_len)) - return NESTED_EXIT_DONE; - - return (val & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST; -} - -static int nested_svm_exit_handled_msr(struct vcpu_svm *svm) -{ - u32 offset, msr, value; - int write, mask; - - if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT))) - return NESTED_EXIT_HOST; - - msr = svm->vcpu.arch.regs[VCPU_REGS_RCX]; - offset = svm_msrpm_offset(msr); - write = svm->vmcb->control.exit_info_1 & 1; - mask = 1 << ((2 * (msr & 0xf)) + write); - - if (offset == MSR_INVALID) - return NESTED_EXIT_DONE; - - /* Offset is in 32 bit units but need in 8 bit units */ - offset *= 4; - - if (kvm_vcpu_read_guest(&svm->vcpu, svm->nested.vmcb_msrpm + offset, &value, 4)) - return NESTED_EXIT_DONE; - - return (value & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST; -} - -/* DB exceptions for our internal use must not cause vmexit */ -static int nested_svm_intercept_db(struct vcpu_svm *svm) -{ - unsigned long dr6; - - /* if we're not singlestepping, it's not ours */ - if (!svm->nmi_singlestep) - return NESTED_EXIT_DONE; - - /* if it's not a singlestep exception, it's not ours */ - if (kvm_get_dr(&svm->vcpu, 6, &dr6)) - return NESTED_EXIT_DONE; - if (!(dr6 & DR6_BS)) - return NESTED_EXIT_DONE; - - /* if the guest is singlestepping, it should get the vmexit */ - if (svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF) { - disable_nmi_singlestep(svm); - return NESTED_EXIT_DONE; - } - - /* it's ours, the nested hypervisor must not see this one */ - return NESTED_EXIT_HOST; -} - -static int nested_svm_exit_special(struct vcpu_svm *svm) -{ - u32 exit_code = svm->vmcb->control.exit_code; - - switch (exit_code) { - case SVM_EXIT_INTR: - case SVM_EXIT_NMI: - case SVM_EXIT_EXCP_BASE + MC_VECTOR: - return NESTED_EXIT_HOST; - case SVM_EXIT_NPF: - /* For now we are always handling NPFs when using them */ - if (npt_enabled) - return NESTED_EXIT_HOST; - break; - case SVM_EXIT_EXCP_BASE + PF_VECTOR: - /* When we're shadowing, trap PFs, but not async PF */ - if (!npt_enabled && svm->vcpu.arch.apf.host_apf_reason == 0) - return NESTED_EXIT_HOST; - break; - default: - break; - } - - return NESTED_EXIT_CONTINUE; -} - -static int nested_svm_intercept(struct vcpu_svm *svm) -{ - u32 exit_code = svm->vmcb->control.exit_code; - int vmexit = NESTED_EXIT_HOST; - - switch (exit_code) { - case SVM_EXIT_MSR: - vmexit = nested_svm_exit_handled_msr(svm); - break; - case SVM_EXIT_IOIO: - vmexit = nested_svm_intercept_ioio(svm); - break; - case SVM_EXIT_READ_CR0 ... SVM_EXIT_WRITE_CR8: { - u32 bit = 1U << (exit_code - SVM_EXIT_READ_CR0); - if (svm->nested.intercept_cr & bit) - vmexit = NESTED_EXIT_DONE; - break; - } - case SVM_EXIT_READ_DR0 ... SVM_EXIT_WRITE_DR7: { - u32 bit = 1U << (exit_code - SVM_EXIT_READ_DR0); - if (svm->nested.intercept_dr & bit) - vmexit = NESTED_EXIT_DONE; - break; - } - case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: { - u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE); - if (svm->nested.intercept_exceptions & excp_bits) { - if (exit_code == SVM_EXIT_EXCP_BASE + DB_VECTOR) - vmexit = nested_svm_intercept_db(svm); - else - vmexit = NESTED_EXIT_DONE; - } - /* async page fault always cause vmexit */ - else if ((exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) && - svm->vcpu.arch.exception.nested_apf != 0) - vmexit = NESTED_EXIT_DONE; - break; - } - case SVM_EXIT_ERR: { - vmexit = NESTED_EXIT_DONE; - break; - } - default: { - u64 exit_bits = 1ULL << (exit_code - SVM_EXIT_INTR); - if (svm->nested.intercept & exit_bits) - vmexit = NESTED_EXIT_DONE; - } - } - - return vmexit; -} - -static int nested_svm_exit_handled(struct vcpu_svm *svm) -{ - int vmexit; - - vmexit = nested_svm_intercept(svm); - - if (vmexit == NESTED_EXIT_DONE) - nested_svm_vmexit(svm); - - return vmexit; -} - -static inline void copy_vmcb_control_area(struct vmcb *dst_vmcb, struct vmcb *from_vmcb) -{ - struct vmcb_control_area *dst = &dst_vmcb->control; - struct vmcb_control_area *from = &from_vmcb->control; - - dst->intercept_cr = from->intercept_cr; - dst->intercept_dr = from->intercept_dr; - dst->intercept_exceptions = from->intercept_exceptions; - dst->intercept = from->intercept; - dst->iopm_base_pa = from->iopm_base_pa; - dst->msrpm_base_pa = from->msrpm_base_pa; - dst->tsc_offset = from->tsc_offset; - dst->asid = from->asid; - dst->tlb_ctl = from->tlb_ctl; - dst->int_ctl = from->int_ctl; - dst->int_vector = from->int_vector; - dst->int_state = from->int_state; - dst->exit_code = from->exit_code; - dst->exit_code_hi = from->exit_code_hi; - dst->exit_info_1 = from->exit_info_1; - dst->exit_info_2 = from->exit_info_2; - dst->exit_int_info = from->exit_int_info; - dst->exit_int_info_err = from->exit_int_info_err; - dst->nested_ctl = from->nested_ctl; - dst->event_inj = from->event_inj; - dst->event_inj_err = from->event_inj_err; - dst->nested_cr3 = from->nested_cr3; - dst->virt_ext = from->virt_ext; - dst->pause_filter_count = from->pause_filter_count; - dst->pause_filter_thresh = from->pause_filter_thresh; -} - -static int nested_svm_vmexit(struct vcpu_svm *svm) -{ - int rc; - struct vmcb *nested_vmcb; - struct vmcb *hsave = svm->nested.hsave; - struct vmcb *vmcb = svm->vmcb; - struct kvm_host_map map; - - trace_kvm_nested_vmexit_inject(vmcb->control.exit_code, - vmcb->control.exit_info_1, - vmcb->control.exit_info_2, - vmcb->control.exit_int_info, - vmcb->control.exit_int_info_err, - KVM_ISA_SVM); - - rc = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->nested.vmcb), &map); - if (rc) { - if (rc == -EINVAL) - kvm_inject_gp(&svm->vcpu, 0); - return 1; - } - - nested_vmcb = map.hva; - - /* Exit Guest-Mode */ - leave_guest_mode(&svm->vcpu); - svm->nested.vmcb = 0; - - /* Give the current vmcb to the guest */ - disable_gif(svm); - - nested_vmcb->save.es = vmcb->save.es; - nested_vmcb->save.cs = vmcb->save.cs; - nested_vmcb->save.ss = vmcb->save.ss; - nested_vmcb->save.ds = vmcb->save.ds; - nested_vmcb->save.gdtr = vmcb->save.gdtr; - nested_vmcb->save.idtr = vmcb->save.idtr; - nested_vmcb->save.efer = svm->vcpu.arch.efer; - nested_vmcb->save.cr0 = kvm_read_cr0(&svm->vcpu); - nested_vmcb->save.cr3 = kvm_read_cr3(&svm->vcpu); - nested_vmcb->save.cr2 = vmcb->save.cr2; - nested_vmcb->save.cr4 = svm->vcpu.arch.cr4; - nested_vmcb->save.rflags = kvm_get_rflags(&svm->vcpu); - nested_vmcb->save.rip = vmcb->save.rip; - nested_vmcb->save.rsp = vmcb->save.rsp; - nested_vmcb->save.rax = vmcb->save.rax; - nested_vmcb->save.dr7 = vmcb->save.dr7; - nested_vmcb->save.dr6 = vmcb->save.dr6; - nested_vmcb->save.cpl = vmcb->save.cpl; - - nested_vmcb->control.int_ctl = vmcb->control.int_ctl; - nested_vmcb->control.int_vector = vmcb->control.int_vector; - nested_vmcb->control.int_state = vmcb->control.int_state; - nested_vmcb->control.exit_code = vmcb->control.exit_code; - nested_vmcb->control.exit_code_hi = vmcb->control.exit_code_hi; - nested_vmcb->control.exit_info_1 = vmcb->control.exit_info_1; - nested_vmcb->control.exit_info_2 = vmcb->control.exit_info_2; - nested_vmcb->control.exit_int_info = vmcb->control.exit_int_info; - nested_vmcb->control.exit_int_info_err = vmcb->control.exit_int_info_err; - - if (svm->nrips_enabled) - nested_vmcb->control.next_rip = vmcb->control.next_rip; - - /* - * If we emulate a VMRUN/#VMEXIT in the same host #vmexit cycle we have - * to make sure that we do not lose injected events. So check event_inj - * here and copy it to exit_int_info if it is valid. - * Exit_int_info and event_inj can't be both valid because the case - * below only happens on a VMRUN instruction intercept which has - * no valid exit_int_info set. - */ - if (vmcb->control.event_inj & SVM_EVTINJ_VALID) { - struct vmcb_control_area *nc = &nested_vmcb->control; - - nc->exit_int_info = vmcb->control.event_inj; - nc->exit_int_info_err = vmcb->control.event_inj_err; - } - - nested_vmcb->control.tlb_ctl = 0; - nested_vmcb->control.event_inj = 0; - nested_vmcb->control.event_inj_err = 0; - - nested_vmcb->control.pause_filter_count = - svm->vmcb->control.pause_filter_count; - nested_vmcb->control.pause_filter_thresh = - svm->vmcb->control.pause_filter_thresh; - - /* We always set V_INTR_MASKING and remember the old value in hflags */ - if (!(svm->vcpu.arch.hflags & HF_VINTR_MASK)) - nested_vmcb->control.int_ctl &= ~V_INTR_MASKING_MASK; - - /* Restore the original control entries */ - copy_vmcb_control_area(vmcb, hsave); - - svm->vcpu.arch.tsc_offset = svm->vmcb->control.tsc_offset; - kvm_clear_exception_queue(&svm->vcpu); - kvm_clear_interrupt_queue(&svm->vcpu); - - svm->nested.nested_cr3 = 0; - - /* Restore selected save entries */ - svm->vmcb->save.es = hsave->save.es; - svm->vmcb->save.cs = hsave->save.cs; - svm->vmcb->save.ss = hsave->save.ss; - svm->vmcb->save.ds = hsave->save.ds; - svm->vmcb->save.gdtr = hsave->save.gdtr; - svm->vmcb->save.idtr = hsave->save.idtr; - kvm_set_rflags(&svm->vcpu, hsave->save.rflags); - svm_set_efer(&svm->vcpu, hsave->save.efer); - svm_set_cr0(&svm->vcpu, hsave->save.cr0 | X86_CR0_PE); - svm_set_cr4(&svm->vcpu, hsave->save.cr4); - if (npt_enabled) { - svm->vmcb->save.cr3 = hsave->save.cr3; - svm->vcpu.arch.cr3 = hsave->save.cr3; - } else { - (void)kvm_set_cr3(&svm->vcpu, hsave->save.cr3); - } - kvm_rax_write(&svm->vcpu, hsave->save.rax); - kvm_rsp_write(&svm->vcpu, hsave->save.rsp); - kvm_rip_write(&svm->vcpu, hsave->save.rip); - svm->vmcb->save.dr7 = 0; - svm->vmcb->save.cpl = 0; - svm->vmcb->control.exit_int_info = 0; - - mark_all_dirty(svm->vmcb); - - kvm_vcpu_unmap(&svm->vcpu, &map, true); - - nested_svm_uninit_mmu_context(&svm->vcpu); - kvm_mmu_reset_context(&svm->vcpu); - kvm_mmu_load(&svm->vcpu); - - /* - * Drop what we picked up for L2 via svm_complete_interrupts() so it - * doesn't end up in L1. - */ - svm->vcpu.arch.nmi_injected = false; - kvm_clear_exception_queue(&svm->vcpu); - kvm_clear_interrupt_queue(&svm->vcpu); - - return 0; -} - -static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm) -{ - /* - * This function merges the msr permission bitmaps of kvm and the - * nested vmcb. It is optimized in that it only merges the parts where - * the kvm msr permission bitmap may contain zero bits - */ - int i; - - if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT))) - return true; - - for (i = 0; i < MSRPM_OFFSETS; i++) { - u32 value, p; - u64 offset; - - if (msrpm_offsets[i] == 0xffffffff) - break; - - p = msrpm_offsets[i]; - offset = svm->nested.vmcb_msrpm + (p * 4); - - if (kvm_vcpu_read_guest(&svm->vcpu, offset, &value, 4)) - return false; - - svm->nested.msrpm[p] = svm->msrpm[p] | value; - } - - svm->vmcb->control.msrpm_base_pa = __sme_set(__pa(svm->nested.msrpm)); - - return true; -} - -static bool nested_vmcb_checks(struct vmcb *vmcb) -{ - if ((vmcb->save.efer & EFER_SVME) == 0) - return false; - - if ((vmcb->control.intercept & (1ULL << INTERCEPT_VMRUN)) == 0) - return false; - - if (vmcb->control.asid == 0) - return false; - - if ((vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && - !npt_enabled) - return false; - - return true; -} - -static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, - struct vmcb *nested_vmcb, struct kvm_host_map *map) -{ - bool evaluate_pending_interrupts = - is_intercept(svm, INTERCEPT_VINTR) || - is_intercept(svm, INTERCEPT_IRET); - - if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF) - svm->vcpu.arch.hflags |= HF_HIF_MASK; - else - svm->vcpu.arch.hflags &= ~HF_HIF_MASK; - - if (nested_vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE) { - svm->nested.nested_cr3 = nested_vmcb->control.nested_cr3; - nested_svm_init_mmu_context(&svm->vcpu); - } - - /* Load the nested guest state */ - svm->vmcb->save.es = nested_vmcb->save.es; - svm->vmcb->save.cs = nested_vmcb->save.cs; - svm->vmcb->save.ss = nested_vmcb->save.ss; - svm->vmcb->save.ds = nested_vmcb->save.ds; - svm->vmcb->save.gdtr = nested_vmcb->save.gdtr; - svm->vmcb->save.idtr = nested_vmcb->save.idtr; - kvm_set_rflags(&svm->vcpu, nested_vmcb->save.rflags); - svm_set_efer(&svm->vcpu, nested_vmcb->save.efer); - svm_set_cr0(&svm->vcpu, nested_vmcb->save.cr0); - svm_set_cr4(&svm->vcpu, nested_vmcb->save.cr4); - if (npt_enabled) { - svm->vmcb->save.cr3 = nested_vmcb->save.cr3; - svm->vcpu.arch.cr3 = nested_vmcb->save.cr3; - } else - (void)kvm_set_cr3(&svm->vcpu, nested_vmcb->save.cr3); - - /* Guest paging mode is active - reset mmu */ - kvm_mmu_reset_context(&svm->vcpu); - - svm->vmcb->save.cr2 = svm->vcpu.arch.cr2 = nested_vmcb->save.cr2; - kvm_rax_write(&svm->vcpu, nested_vmcb->save.rax); - kvm_rsp_write(&svm->vcpu, nested_vmcb->save.rsp); - kvm_rip_write(&svm->vcpu, nested_vmcb->save.rip); - - /* In case we don't even reach vcpu_run, the fields are not updated */ - svm->vmcb->save.rax = nested_vmcb->save.rax; - svm->vmcb->save.rsp = nested_vmcb->save.rsp; - svm->vmcb->save.rip = nested_vmcb->save.rip; - svm->vmcb->save.dr7 = nested_vmcb->save.dr7; - svm->vmcb->save.dr6 = nested_vmcb->save.dr6; - svm->vmcb->save.cpl = nested_vmcb->save.cpl; - - svm->nested.vmcb_msrpm = nested_vmcb->control.msrpm_base_pa & ~0x0fffULL; - svm->nested.vmcb_iopm = nested_vmcb->control.iopm_base_pa & ~0x0fffULL; - - /* cache intercepts */ - svm->nested.intercept_cr = nested_vmcb->control.intercept_cr; - svm->nested.intercept_dr = nested_vmcb->control.intercept_dr; - svm->nested.intercept_exceptions = nested_vmcb->control.intercept_exceptions; - svm->nested.intercept = nested_vmcb->control.intercept; - - svm_flush_tlb(&svm->vcpu, true); - svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK; - if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK) - svm->vcpu.arch.hflags |= HF_VINTR_MASK; - else - svm->vcpu.arch.hflags &= ~HF_VINTR_MASK; - - svm->vcpu.arch.tsc_offset += nested_vmcb->control.tsc_offset; - svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset; - - svm->vmcb->control.virt_ext = nested_vmcb->control.virt_ext; - svm->vmcb->control.int_vector = nested_vmcb->control.int_vector; - svm->vmcb->control.int_state = nested_vmcb->control.int_state; - svm->vmcb->control.event_inj = nested_vmcb->control.event_inj; - svm->vmcb->control.event_inj_err = nested_vmcb->control.event_inj_err; - - svm->vmcb->control.pause_filter_count = - nested_vmcb->control.pause_filter_count; - svm->vmcb->control.pause_filter_thresh = - nested_vmcb->control.pause_filter_thresh; - - kvm_vcpu_unmap(&svm->vcpu, map, true); - - /* Enter Guest-Mode */ - enter_guest_mode(&svm->vcpu); - - /* - * Merge guest and host intercepts - must be called with vcpu in - * guest-mode to take affect here - */ - recalc_intercepts(svm); - - svm->nested.vmcb = vmcb_gpa; - - /* - * If L1 had a pending IRQ/NMI before executing VMRUN, - * which wasn't delivered because it was disallowed (e.g. - * interrupts disabled), L0 needs to evaluate if this pending - * event should cause an exit from L2 to L1 or be delivered - * directly to L2. - * - * Usually this would be handled by the processor noticing an - * IRQ/NMI window request. However, VMRUN can unblock interrupts - * by implicitly setting GIF, so force L0 to perform pending event - * evaluation by requesting a KVM_REQ_EVENT. - */ - enable_gif(svm); - if (unlikely(evaluate_pending_interrupts)) - kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); - - mark_all_dirty(svm->vmcb); -} - -static int nested_svm_vmrun(struct vcpu_svm *svm) -{ - int ret; - struct vmcb *nested_vmcb; - struct vmcb *hsave = svm->nested.hsave; - struct vmcb *vmcb = svm->vmcb; - struct kvm_host_map map; - u64 vmcb_gpa; - - vmcb_gpa = svm->vmcb->save.rax; - - ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb_gpa), &map); - if (ret == -EINVAL) { - kvm_inject_gp(&svm->vcpu, 0); - return 1; - } else if (ret) { - return kvm_skip_emulated_instruction(&svm->vcpu); - } - - ret = kvm_skip_emulated_instruction(&svm->vcpu); - - nested_vmcb = map.hva; - - if (!nested_vmcb_checks(nested_vmcb)) { - nested_vmcb->control.exit_code = SVM_EXIT_ERR; - nested_vmcb->control.exit_code_hi = 0; - nested_vmcb->control.exit_info_1 = 0; - nested_vmcb->control.exit_info_2 = 0; - - kvm_vcpu_unmap(&svm->vcpu, &map, true); - - return ret; - } - - trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb_gpa, - nested_vmcb->save.rip, - nested_vmcb->control.int_ctl, - nested_vmcb->control.event_inj, - nested_vmcb->control.nested_ctl); - - trace_kvm_nested_intercepts(nested_vmcb->control.intercept_cr & 0xffff, - nested_vmcb->control.intercept_cr >> 16, - nested_vmcb->control.intercept_exceptions, - nested_vmcb->control.intercept); - - /* Clear internal status */ - kvm_clear_exception_queue(&svm->vcpu); - kvm_clear_interrupt_queue(&svm->vcpu); - - /* - * Save the old vmcb, so we don't need to pick what we save, but can - * restore everything when a VMEXIT occurs - */ - hsave->save.es = vmcb->save.es; - hsave->save.cs = vmcb->save.cs; - hsave->save.ss = vmcb->save.ss; - hsave->save.ds = vmcb->save.ds; - hsave->save.gdtr = vmcb->save.gdtr; - hsave->save.idtr = vmcb->save.idtr; - hsave->save.efer = svm->vcpu.arch.efer; - hsave->save.cr0 = kvm_read_cr0(&svm->vcpu); - hsave->save.cr4 = svm->vcpu.arch.cr4; - hsave->save.rflags = kvm_get_rflags(&svm->vcpu); - hsave->save.rip = kvm_rip_read(&svm->vcpu); - hsave->save.rsp = vmcb->save.rsp; - hsave->save.rax = vmcb->save.rax; - if (npt_enabled) - hsave->save.cr3 = vmcb->save.cr3; - else - hsave->save.cr3 = kvm_read_cr3(&svm->vcpu); - - copy_vmcb_control_area(hsave, vmcb); - - enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb, &map); - - if (!nested_svm_vmrun_msrpm(svm)) { - svm->vmcb->control.exit_code = SVM_EXIT_ERR; - svm->vmcb->control.exit_code_hi = 0; - svm->vmcb->control.exit_info_1 = 0; - svm->vmcb->control.exit_info_2 = 0; - - nested_svm_vmexit(svm); - } - - return ret; -} - -static void nested_svm_vmloadsave(struct vmcb *from_vmcb, struct vmcb *to_vmcb) -{ - to_vmcb->save.fs = from_vmcb->save.fs; - to_vmcb->save.gs = from_vmcb->save.gs; - to_vmcb->save.tr = from_vmcb->save.tr; - to_vmcb->save.ldtr = from_vmcb->save.ldtr; - to_vmcb->save.kernel_gs_base = from_vmcb->save.kernel_gs_base; - to_vmcb->save.star = from_vmcb->save.star; - to_vmcb->save.lstar = from_vmcb->save.lstar; - to_vmcb->save.cstar = from_vmcb->save.cstar; - to_vmcb->save.sfmask = from_vmcb->save.sfmask; - to_vmcb->save.sysenter_cs = from_vmcb->save.sysenter_cs; - to_vmcb->save.sysenter_esp = from_vmcb->save.sysenter_esp; - to_vmcb->save.sysenter_eip = from_vmcb->save.sysenter_eip; -} - -static int vmload_interception(struct vcpu_svm *svm) -{ - struct vmcb *nested_vmcb; - struct kvm_host_map map; - int ret; - - if (nested_svm_check_permissions(svm)) - return 1; - - ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->vmcb->save.rax), &map); - if (ret) { - if (ret == -EINVAL) - kvm_inject_gp(&svm->vcpu, 0); - return 1; - } - - nested_vmcb = map.hva; - - ret = kvm_skip_emulated_instruction(&svm->vcpu); - - nested_svm_vmloadsave(nested_vmcb, svm->vmcb); - kvm_vcpu_unmap(&svm->vcpu, &map, true); - - return ret; -} - -static int vmsave_interception(struct vcpu_svm *svm) -{ - struct vmcb *nested_vmcb; - struct kvm_host_map map; - int ret; - - if (nested_svm_check_permissions(svm)) - return 1; - - ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->vmcb->save.rax), &map); - if (ret) { - if (ret == -EINVAL) - kvm_inject_gp(&svm->vcpu, 0); - return 1; - } - - nested_vmcb = map.hva; - - ret = kvm_skip_emulated_instruction(&svm->vcpu); - - nested_svm_vmloadsave(svm->vmcb, nested_vmcb); - kvm_vcpu_unmap(&svm->vcpu, &map, true); - - return ret; -} - -static int vmrun_interception(struct vcpu_svm *svm) -{ - if (nested_svm_check_permissions(svm)) - return 1; - - return nested_svm_vmrun(svm); -} - -static int stgi_interception(struct vcpu_svm *svm) -{ - int ret; - - if (nested_svm_check_permissions(svm)) - return 1; - - /* - * If VGIF is enabled, the STGI intercept is only added to - * detect the opening of the SMI/NMI window; remove it now. - */ - if (vgif_enabled(svm)) - clr_intercept(svm, INTERCEPT_STGI); - - ret = kvm_skip_emulated_instruction(&svm->vcpu); - kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); - - enable_gif(svm); - - return ret; -} - -static int clgi_interception(struct vcpu_svm *svm) -{ - int ret; - - if (nested_svm_check_permissions(svm)) - return 1; - - ret = kvm_skip_emulated_instruction(&svm->vcpu); - - disable_gif(svm); - - /* After a CLGI no interrupts should come */ - if (!kvm_vcpu_apicv_active(&svm->vcpu)) - svm_clear_vintr(svm); - - return ret; -} - -static int invlpga_interception(struct vcpu_svm *svm) -{ - struct kvm_vcpu *vcpu = &svm->vcpu; - - trace_kvm_invlpga(svm->vmcb->save.rip, kvm_rcx_read(&svm->vcpu), - kvm_rax_read(&svm->vcpu)); - - /* Let's treat INVLPGA the same as INVLPG (can be optimized!) */ - kvm_mmu_invlpg(vcpu, kvm_rax_read(&svm->vcpu)); - - return kvm_skip_emulated_instruction(&svm->vcpu); -} - -static int skinit_interception(struct vcpu_svm *svm) -{ - trace_kvm_skinit(svm->vmcb->save.rip, kvm_rax_read(&svm->vcpu)); - - kvm_queue_exception(&svm->vcpu, UD_VECTOR); - return 1; -} - -static int wbinvd_interception(struct vcpu_svm *svm) -{ - return kvm_emulate_wbinvd(&svm->vcpu); -} - -static int xsetbv_interception(struct vcpu_svm *svm) -{ - u64 new_bv = kvm_read_edx_eax(&svm->vcpu); - u32 index = kvm_rcx_read(&svm->vcpu); - - if (kvm_set_xcr(&svm->vcpu, index, new_bv) == 0) { - return kvm_skip_emulated_instruction(&svm->vcpu); - } - - return 1; -} - -static int rdpru_interception(struct vcpu_svm *svm) -{ - kvm_queue_exception(&svm->vcpu, UD_VECTOR); - return 1; -} - -static int task_switch_interception(struct vcpu_svm *svm) -{ - u16 tss_selector; - int reason; - int int_type = svm->vmcb->control.exit_int_info & - SVM_EXITINTINFO_TYPE_MASK; - int int_vec = svm->vmcb->control.exit_int_info & SVM_EVTINJ_VEC_MASK; - uint32_t type = - svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_TYPE_MASK; - uint32_t idt_v = - svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_VALID; - bool has_error_code = false; - u32 error_code = 0; - - tss_selector = (u16)svm->vmcb->control.exit_info_1; - - if (svm->vmcb->control.exit_info_2 & - (1ULL << SVM_EXITINFOSHIFT_TS_REASON_IRET)) - reason = TASK_SWITCH_IRET; - else if (svm->vmcb->control.exit_info_2 & - (1ULL << SVM_EXITINFOSHIFT_TS_REASON_JMP)) - reason = TASK_SWITCH_JMP; - else if (idt_v) - reason = TASK_SWITCH_GATE; - else - reason = TASK_SWITCH_CALL; - - if (reason == TASK_SWITCH_GATE) { - switch (type) { - case SVM_EXITINTINFO_TYPE_NMI: - svm->vcpu.arch.nmi_injected = false; - break; - case SVM_EXITINTINFO_TYPE_EXEPT: - if (svm->vmcb->control.exit_info_2 & - (1ULL << SVM_EXITINFOSHIFT_TS_HAS_ERROR_CODE)) { - has_error_code = true; - error_code = - (u32)svm->vmcb->control.exit_info_2; - } - kvm_clear_exception_queue(&svm->vcpu); - break; - case SVM_EXITINTINFO_TYPE_INTR: - kvm_clear_interrupt_queue(&svm->vcpu); - break; - default: - break; - } - } - - if (reason != TASK_SWITCH_GATE || - int_type == SVM_EXITINTINFO_TYPE_SOFT || - (int_type == SVM_EXITINTINFO_TYPE_EXEPT && - (int_vec == OF_VECTOR || int_vec == BP_VECTOR))) { - if (!skip_emulated_instruction(&svm->vcpu)) - return 0; - } - - if (int_type != SVM_EXITINTINFO_TYPE_SOFT) - int_vec = -1; - - return kvm_task_switch(&svm->vcpu, tss_selector, int_vec, reason, - has_error_code, error_code); -} - -static int cpuid_interception(struct vcpu_svm *svm) -{ - return kvm_emulate_cpuid(&svm->vcpu); -} - -static int iret_interception(struct vcpu_svm *svm) -{ - ++svm->vcpu.stat.nmi_window_exits; - clr_intercept(svm, INTERCEPT_IRET); - svm->vcpu.arch.hflags |= HF_IRET_MASK; - svm->nmi_iret_rip = kvm_rip_read(&svm->vcpu); - kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); - return 1; -} - -static int invlpg_interception(struct vcpu_svm *svm) -{ - if (!static_cpu_has(X86_FEATURE_DECODEASSISTS)) - return kvm_emulate_instruction(&svm->vcpu, 0); - - kvm_mmu_invlpg(&svm->vcpu, svm->vmcb->control.exit_info_1); - return kvm_skip_emulated_instruction(&svm->vcpu); -} - -static int emulate_on_interception(struct vcpu_svm *svm) -{ - return kvm_emulate_instruction(&svm->vcpu, 0); -} - -static int rsm_interception(struct vcpu_svm *svm) -{ - return kvm_emulate_instruction_from_buffer(&svm->vcpu, rsm_ins_bytes, 2); -} - -static int rdpmc_interception(struct vcpu_svm *svm) -{ - int err; - - if (!nrips) - return emulate_on_interception(svm); - - err = kvm_rdpmc(&svm->vcpu); - return kvm_complete_insn_gp(&svm->vcpu, err); -} - -static bool check_selective_cr0_intercepted(struct vcpu_svm *svm, - unsigned long val) -{ - unsigned long cr0 = svm->vcpu.arch.cr0; - bool ret = false; - u64 intercept; - - intercept = svm->nested.intercept; - - if (!is_guest_mode(&svm->vcpu) || - (!(intercept & (1ULL << INTERCEPT_SELECTIVE_CR0)))) - return false; - - cr0 &= ~SVM_CR0_SELECTIVE_MASK; - val &= ~SVM_CR0_SELECTIVE_MASK; - - if (cr0 ^ val) { - svm->vmcb->control.exit_code = SVM_EXIT_CR0_SEL_WRITE; - ret = (nested_svm_exit_handled(svm) == NESTED_EXIT_DONE); - } - - return ret; -} - -#define CR_VALID (1ULL << 63) - -static int cr_interception(struct vcpu_svm *svm) -{ - int reg, cr; - unsigned long val; - int err; - - if (!static_cpu_has(X86_FEATURE_DECODEASSISTS)) - return emulate_on_interception(svm); - - if (unlikely((svm->vmcb->control.exit_info_1 & CR_VALID) == 0)) - return emulate_on_interception(svm); - - reg = svm->vmcb->control.exit_info_1 & SVM_EXITINFO_REG_MASK; - if (svm->vmcb->control.exit_code == SVM_EXIT_CR0_SEL_WRITE) - cr = SVM_EXIT_WRITE_CR0 - SVM_EXIT_READ_CR0; - else - cr = svm->vmcb->control.exit_code - SVM_EXIT_READ_CR0; - - err = 0; - if (cr >= 16) { /* mov to cr */ - cr -= 16; - val = kvm_register_read(&svm->vcpu, reg); - switch (cr) { - case 0: - if (!check_selective_cr0_intercepted(svm, val)) - err = kvm_set_cr0(&svm->vcpu, val); - else - return 1; - - break; - case 3: - err = kvm_set_cr3(&svm->vcpu, val); - break; - case 4: - err = kvm_set_cr4(&svm->vcpu, val); - break; - case 8: - err = kvm_set_cr8(&svm->vcpu, val); - break; - default: - WARN(1, "unhandled write to CR%d", cr); - kvm_queue_exception(&svm->vcpu, UD_VECTOR); - return 1; - } - } else { /* mov from cr */ - switch (cr) { - case 0: - val = kvm_read_cr0(&svm->vcpu); - break; - case 2: - val = svm->vcpu.arch.cr2; - break; - case 3: - val = kvm_read_cr3(&svm->vcpu); - break; - case 4: - val = kvm_read_cr4(&svm->vcpu); - break; - case 8: - val = kvm_get_cr8(&svm->vcpu); - break; - default: - WARN(1, "unhandled read from CR%d", cr); - kvm_queue_exception(&svm->vcpu, UD_VECTOR); - return 1; - } - kvm_register_write(&svm->vcpu, reg, val); - } - return kvm_complete_insn_gp(&svm->vcpu, err); -} - -static int dr_interception(struct vcpu_svm *svm) -{ - int reg, dr; - unsigned long val; - - if (svm->vcpu.guest_debug == 0) { - /* - * No more DR vmexits; force a reload of the debug registers - * and reenter on this instruction. The next vmexit will - * retrieve the full state of the debug registers. - */ - clr_dr_intercepts(svm); - svm->vcpu.arch.switch_db_regs |= KVM_DEBUGREG_WONT_EXIT; - return 1; - } - - if (!boot_cpu_has(X86_FEATURE_DECODEASSISTS)) - return emulate_on_interception(svm); - - reg = svm->vmcb->control.exit_info_1 & SVM_EXITINFO_REG_MASK; - dr = svm->vmcb->control.exit_code - SVM_EXIT_READ_DR0; - - if (dr >= 16) { /* mov to DRn */ - if (!kvm_require_dr(&svm->vcpu, dr - 16)) - return 1; - val = kvm_register_read(&svm->vcpu, reg); - kvm_set_dr(&svm->vcpu, dr - 16, val); - } else { - if (!kvm_require_dr(&svm->vcpu, dr)) - return 1; - kvm_get_dr(&svm->vcpu, dr, &val); - kvm_register_write(&svm->vcpu, reg, val); - } - - return kvm_skip_emulated_instruction(&svm->vcpu); -} - -static int cr8_write_interception(struct vcpu_svm *svm) -{ - struct kvm_run *kvm_run = svm->vcpu.run; - int r; - - u8 cr8_prev = kvm_get_cr8(&svm->vcpu); - /* instruction emulation calls kvm_set_cr8() */ - r = cr_interception(svm); - if (lapic_in_kernel(&svm->vcpu)) - return r; - if (cr8_prev <= kvm_get_cr8(&svm->vcpu)) - return r; - kvm_run->exit_reason = KVM_EXIT_SET_TPR; - return 0; -} - -static int svm_get_msr_feature(struct kvm_msr_entry *msr) -{ - msr->data = 0; - - switch (msr->index) { - case MSR_F10H_DECFG: - if (boot_cpu_has(X86_FEATURE_LFENCE_RDTSC)) - msr->data |= MSR_F10H_DECFG_LFENCE_SERIALIZE; - break; - default: - return 1; - } - - return 0; -} - -static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - switch (msr_info->index) { - case MSR_STAR: - msr_info->data = svm->vmcb->save.star; - break; -#ifdef CONFIG_X86_64 - case MSR_LSTAR: - msr_info->data = svm->vmcb->save.lstar; - break; - case MSR_CSTAR: - msr_info->data = svm->vmcb->save.cstar; - break; - case MSR_KERNEL_GS_BASE: - msr_info->data = svm->vmcb->save.kernel_gs_base; - break; - case MSR_SYSCALL_MASK: - msr_info->data = svm->vmcb->save.sfmask; - break; -#endif - case MSR_IA32_SYSENTER_CS: - msr_info->data = svm->vmcb->save.sysenter_cs; - break; - case MSR_IA32_SYSENTER_EIP: - msr_info->data = svm->sysenter_eip; - break; - case MSR_IA32_SYSENTER_ESP: - msr_info->data = svm->sysenter_esp; - break; - case MSR_TSC_AUX: - if (!boot_cpu_has(X86_FEATURE_RDTSCP)) - return 1; - msr_info->data = svm->tsc_aux; - break; - /* - * Nobody will change the following 5 values in the VMCB so we can - * safely return them on rdmsr. They will always be 0 until LBRV is - * implemented. - */ - case MSR_IA32_DEBUGCTLMSR: - msr_info->data = svm->vmcb->save.dbgctl; - break; - case MSR_IA32_LASTBRANCHFROMIP: - msr_info->data = svm->vmcb->save.br_from; - break; - case MSR_IA32_LASTBRANCHTOIP: - msr_info->data = svm->vmcb->save.br_to; - break; - case MSR_IA32_LASTINTFROMIP: - msr_info->data = svm->vmcb->save.last_excp_from; - break; - case MSR_IA32_LASTINTTOIP: - msr_info->data = svm->vmcb->save.last_excp_to; - break; - case MSR_VM_HSAVE_PA: - msr_info->data = svm->nested.hsave_msr; - break; - case MSR_VM_CR: - msr_info->data = svm->nested.vm_cr_msr; - break; - case MSR_IA32_SPEC_CTRL: - if (!msr_info->host_initiated && - !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) && - !guest_cpuid_has(vcpu, X86_FEATURE_AMD_STIBP) && - !guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBRS) && - !guest_cpuid_has(vcpu, X86_FEATURE_AMD_SSBD)) - return 1; - - msr_info->data = svm->spec_ctrl; - break; - case MSR_AMD64_VIRT_SPEC_CTRL: - if (!msr_info->host_initiated && - !guest_cpuid_has(vcpu, X86_FEATURE_VIRT_SSBD)) - return 1; - - msr_info->data = svm->virt_spec_ctrl; - break; - case MSR_F15H_IC_CFG: { - - int family, model; - - family = guest_cpuid_family(vcpu); - model = guest_cpuid_model(vcpu); - - if (family < 0 || model < 0) - return kvm_get_msr_common(vcpu, msr_info); - - msr_info->data = 0; - - if (family == 0x15 && - (model >= 0x2 && model < 0x20)) - msr_info->data = 0x1E; - } - break; - case MSR_F10H_DECFG: - msr_info->data = svm->msr_decfg; - break; - default: - return kvm_get_msr_common(vcpu, msr_info); - } - return 0; -} - -static int rdmsr_interception(struct vcpu_svm *svm) -{ - return kvm_emulate_rdmsr(&svm->vcpu); -} - -static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 data) -{ - struct vcpu_svm *svm = to_svm(vcpu); - int svm_dis, chg_mask; - - if (data & ~SVM_VM_CR_VALID_MASK) - return 1; - - chg_mask = SVM_VM_CR_VALID_MASK; - - if (svm->nested.vm_cr_msr & SVM_VM_CR_SVM_DIS_MASK) - chg_mask &= ~(SVM_VM_CR_SVM_LOCK_MASK | SVM_VM_CR_SVM_DIS_MASK); - - svm->nested.vm_cr_msr &= ~chg_mask; - svm->nested.vm_cr_msr |= (data & chg_mask); - - svm_dis = svm->nested.vm_cr_msr & SVM_VM_CR_SVM_DIS_MASK; - - /* check for svm_disable while efer.svme is set */ - if (svm_dis && (vcpu->arch.efer & EFER_SVME)) - return 1; - - return 0; -} - -static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - u32 ecx = msr->index; - u64 data = msr->data; - switch (ecx) { - case MSR_IA32_CR_PAT: - if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data)) - return 1; - vcpu->arch.pat = data; - svm->vmcb->save.g_pat = data; - mark_dirty(svm->vmcb, VMCB_NPT); - break; - case MSR_IA32_SPEC_CTRL: - if (!msr->host_initiated && - !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) && - !guest_cpuid_has(vcpu, X86_FEATURE_AMD_STIBP) && - !guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBRS) && - !guest_cpuid_has(vcpu, X86_FEATURE_AMD_SSBD)) - return 1; - - if (data & ~kvm_spec_ctrl_valid_bits(vcpu)) - return 1; - - svm->spec_ctrl = data; - if (!data) - break; - - /* - * For non-nested: - * When it's written (to non-zero) for the first time, pass - * it through. - * - * For nested: - * The handling of the MSR bitmap for L2 guests is done in - * nested_svm_vmrun_msrpm. - * We update the L1 MSR bit as well since it will end up - * touching the MSR anyway now. - */ - set_msr_interception(svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1); - break; - case MSR_IA32_PRED_CMD: - if (!msr->host_initiated && - !guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBPB)) - return 1; - - if (data & ~PRED_CMD_IBPB) - return 1; - if (!boot_cpu_has(X86_FEATURE_AMD_IBPB)) - return 1; - if (!data) - break; - - wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB); - set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1); - break; - case MSR_AMD64_VIRT_SPEC_CTRL: - if (!msr->host_initiated && - !guest_cpuid_has(vcpu, X86_FEATURE_VIRT_SSBD)) - return 1; - - if (data & ~SPEC_CTRL_SSBD) - return 1; - - svm->virt_spec_ctrl = data; - break; - case MSR_STAR: - svm->vmcb->save.star = data; - break; -#ifdef CONFIG_X86_64 - case MSR_LSTAR: - svm->vmcb->save.lstar = data; - break; - case MSR_CSTAR: - svm->vmcb->save.cstar = data; - break; - case MSR_KERNEL_GS_BASE: - svm->vmcb->save.kernel_gs_base = data; - break; - case MSR_SYSCALL_MASK: - svm->vmcb->save.sfmask = data; - break; -#endif - case MSR_IA32_SYSENTER_CS: - svm->vmcb->save.sysenter_cs = data; - break; - case MSR_IA32_SYSENTER_EIP: - svm->sysenter_eip = data; - svm->vmcb->save.sysenter_eip = data; - break; - case MSR_IA32_SYSENTER_ESP: - svm->sysenter_esp = data; - svm->vmcb->save.sysenter_esp = data; - break; - case MSR_TSC_AUX: - if (!boot_cpu_has(X86_FEATURE_RDTSCP)) - return 1; - - /* - * This is rare, so we update the MSR here instead of using - * direct_access_msrs. Doing that would require a rdmsr in - * svm_vcpu_put. - */ - svm->tsc_aux = data; - wrmsrl(MSR_TSC_AUX, svm->tsc_aux); - break; - case MSR_IA32_DEBUGCTLMSR: - if (!boot_cpu_has(X86_FEATURE_LBRV)) { - vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTL 0x%llx, nop\n", - __func__, data); - break; - } - if (data & DEBUGCTL_RESERVED_BITS) - return 1; - - svm->vmcb->save.dbgctl = data; - mark_dirty(svm->vmcb, VMCB_LBR); - if (data & (1ULL<<0)) - svm_enable_lbrv(svm); - else - svm_disable_lbrv(svm); - break; - case MSR_VM_HSAVE_PA: - svm->nested.hsave_msr = data; - break; - case MSR_VM_CR: - return svm_set_vm_cr(vcpu, data); - case MSR_VM_IGNNE: - vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", ecx, data); - break; - case MSR_F10H_DECFG: { - struct kvm_msr_entry msr_entry; - - msr_entry.index = msr->index; - if (svm_get_msr_feature(&msr_entry)) - return 1; - - /* Check the supported bits */ - if (data & ~msr_entry.data) - return 1; - - /* Don't allow the guest to change a bit, #GP */ - if (!msr->host_initiated && (data ^ msr_entry.data)) - return 1; - - svm->msr_decfg = data; - break; - } - case MSR_IA32_APICBASE: - if (kvm_vcpu_apicv_active(vcpu)) - avic_update_vapic_bar(to_svm(vcpu), data); - /* Fall through */ - default: - return kvm_set_msr_common(vcpu, msr); - } - return 0; -} - -static int wrmsr_interception(struct vcpu_svm *svm) -{ - return kvm_emulate_wrmsr(&svm->vcpu); -} - -static int msr_interception(struct vcpu_svm *svm) -{ - if (svm->vmcb->control.exit_info_1) - return wrmsr_interception(svm); - else - return rdmsr_interception(svm); -} - -static int interrupt_window_interception(struct vcpu_svm *svm) -{ - kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); - svm_clear_vintr(svm); - - /* - * For AVIC, the only reason to end up here is ExtINTs. - * In this case AVIC was temporarily disabled for - * requesting the IRQ window and we have to re-enable it. - */ - svm_toggle_avic_for_irq_window(&svm->vcpu, true); - - svm->vmcb->control.int_ctl &= ~V_IRQ_MASK; - mark_dirty(svm->vmcb, VMCB_INTR); - ++svm->vcpu.stat.irq_window_exits; - return 1; -} - -static int pause_interception(struct vcpu_svm *svm) -{ - struct kvm_vcpu *vcpu = &svm->vcpu; - bool in_kernel = (svm_get_cpl(vcpu) == 0); - - if (pause_filter_thresh) - grow_ple_window(vcpu); - - kvm_vcpu_on_spin(vcpu, in_kernel); - return 1; -} - -static int nop_interception(struct vcpu_svm *svm) -{ - return kvm_skip_emulated_instruction(&(svm->vcpu)); -} - -static int monitor_interception(struct vcpu_svm *svm) -{ - printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); - return nop_interception(svm); -} - -static int mwait_interception(struct vcpu_svm *svm) -{ - printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); - return nop_interception(svm); -} - -enum avic_ipi_failure_cause { - AVIC_IPI_FAILURE_INVALID_INT_TYPE, - AVIC_IPI_FAILURE_TARGET_NOT_RUNNING, - AVIC_IPI_FAILURE_INVALID_TARGET, - AVIC_IPI_FAILURE_INVALID_BACKING_PAGE, -}; - -static int avic_incomplete_ipi_interception(struct vcpu_svm *svm) -{ - u32 icrh = svm->vmcb->control.exit_info_1 >> 32; - u32 icrl = svm->vmcb->control.exit_info_1; - u32 id = svm->vmcb->control.exit_info_2 >> 32; - u32 index = svm->vmcb->control.exit_info_2 & 0xFF; - struct kvm_lapic *apic = svm->vcpu.arch.apic; - - trace_kvm_avic_incomplete_ipi(svm->vcpu.vcpu_id, icrh, icrl, id, index); - - switch (id) { - case AVIC_IPI_FAILURE_INVALID_INT_TYPE: - /* - * AVIC hardware handles the generation of - * IPIs when the specified Message Type is Fixed - * (also known as fixed delivery mode) and - * the Trigger Mode is edge-triggered. The hardware - * also supports self and broadcast delivery modes - * specified via the Destination Shorthand(DSH) - * field of the ICRL. Logical and physical APIC ID - * formats are supported. All other IPI types cause - * a #VMEXIT, which needs to emulated. - */ - kvm_lapic_reg_write(apic, APIC_ICR2, icrh); - kvm_lapic_reg_write(apic, APIC_ICR, icrl); - break; - case AVIC_IPI_FAILURE_TARGET_NOT_RUNNING: { - int i; - struct kvm_vcpu *vcpu; - struct kvm *kvm = svm->vcpu.kvm; - struct kvm_lapic *apic = svm->vcpu.arch.apic; - - /* - * At this point, we expect that the AVIC HW has already - * set the appropriate IRR bits on the valid target - * vcpus. So, we just need to kick the appropriate vcpu. - */ - kvm_for_each_vcpu(i, vcpu, kvm) { - bool m = kvm_apic_match_dest(vcpu, apic, - icrl & APIC_SHORT_MASK, - GET_APIC_DEST_FIELD(icrh), - icrl & APIC_DEST_MASK); - - if (m && !avic_vcpu_is_running(vcpu)) - kvm_vcpu_wake_up(vcpu); - } - break; - } - case AVIC_IPI_FAILURE_INVALID_TARGET: - WARN_ONCE(1, "Invalid IPI target: index=%u, vcpu=%d, icr=%#0x:%#0x\n", - index, svm->vcpu.vcpu_id, icrh, icrl); - break; - case AVIC_IPI_FAILURE_INVALID_BACKING_PAGE: - WARN_ONCE(1, "Invalid backing page\n"); - break; - default: - pr_err("Unknown IPI interception\n"); - } - - return 1; -} - -static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool flat) -{ - struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm); - int index; - u32 *logical_apic_id_table; - int dlid = GET_APIC_LOGICAL_ID(ldr); - - if (!dlid) - return NULL; - - if (flat) { /* flat */ - index = ffs(dlid) - 1; - if (index > 7) - return NULL; - } else { /* cluster */ - int cluster = (dlid & 0xf0) >> 4; - int apic = ffs(dlid & 0x0f) - 1; - - if ((apic < 0) || (apic > 7) || - (cluster >= 0xf)) - return NULL; - index = (cluster << 2) + apic; - } - - logical_apic_id_table = (u32 *) page_address(kvm_svm->avic_logical_id_table_page); - - return &logical_apic_id_table[index]; -} - -static int avic_ldr_write(struct kvm_vcpu *vcpu, u8 g_physical_id, u32 ldr) -{ - bool flat; - u32 *entry, new_entry; - - flat = kvm_lapic_get_reg(vcpu->arch.apic, APIC_DFR) == APIC_DFR_FLAT; - entry = avic_get_logical_id_entry(vcpu, ldr, flat); - if (!entry) - return -EINVAL; - - new_entry = READ_ONCE(*entry); - new_entry &= ~AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK; - new_entry |= (g_physical_id & AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK); - new_entry |= AVIC_LOGICAL_ID_ENTRY_VALID_MASK; - WRITE_ONCE(*entry, new_entry); - - return 0; -} - -static void avic_invalidate_logical_id_entry(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - bool flat = svm->dfr_reg == APIC_DFR_FLAT; - u32 *entry = avic_get_logical_id_entry(vcpu, svm->ldr_reg, flat); - - if (entry) - clear_bit(AVIC_LOGICAL_ID_ENTRY_VALID_BIT, (unsigned long *)entry); -} - -static int avic_handle_ldr_update(struct kvm_vcpu *vcpu) -{ - int ret = 0; - struct vcpu_svm *svm = to_svm(vcpu); - u32 ldr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_LDR); - u32 id = kvm_xapic_id(vcpu->arch.apic); - - if (ldr == svm->ldr_reg) - return 0; - - avic_invalidate_logical_id_entry(vcpu); - - if (ldr) - ret = avic_ldr_write(vcpu, id, ldr); - - if (!ret) - svm->ldr_reg = ldr; - - return ret; -} - -static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu) -{ - u64 *old, *new; - struct vcpu_svm *svm = to_svm(vcpu); - u32 id = kvm_xapic_id(vcpu->arch.apic); - - if (vcpu->vcpu_id == id) - return 0; - - old = avic_get_physical_id_entry(vcpu, vcpu->vcpu_id); - new = avic_get_physical_id_entry(vcpu, id); - if (!new || !old) - return 1; - - /* We need to move physical_id_entry to new offset */ - *new = *old; - *old = 0ULL; - to_svm(vcpu)->avic_physical_id_cache = new; - - /* - * Also update the guest physical APIC ID in the logical - * APIC ID table entry if already setup the LDR. - */ - if (svm->ldr_reg) - avic_handle_ldr_update(vcpu); - - return 0; -} - -static void avic_handle_dfr_update(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - u32 dfr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_DFR); - - if (svm->dfr_reg == dfr) - return; - - avic_invalidate_logical_id_entry(vcpu); - svm->dfr_reg = dfr; -} - -static int avic_unaccel_trap_write(struct vcpu_svm *svm) -{ - struct kvm_lapic *apic = svm->vcpu.arch.apic; - u32 offset = svm->vmcb->control.exit_info_1 & - AVIC_UNACCEL_ACCESS_OFFSET_MASK; - - switch (offset) { - case APIC_ID: - if (avic_handle_apic_id_update(&svm->vcpu)) - return 0; - break; - case APIC_LDR: - if (avic_handle_ldr_update(&svm->vcpu)) - return 0; - break; - case APIC_DFR: - avic_handle_dfr_update(&svm->vcpu); - break; - default: - break; - } - - kvm_lapic_reg_write(apic, offset, kvm_lapic_get_reg(apic, offset)); - - return 1; -} - -static bool is_avic_unaccelerated_access_trap(u32 offset) -{ - bool ret = false; - - switch (offset) { - case APIC_ID: - case APIC_EOI: - case APIC_RRR: - case APIC_LDR: - case APIC_DFR: - case APIC_SPIV: - case APIC_ESR: - case APIC_ICR: - case APIC_LVTT: - case APIC_LVTTHMR: - case APIC_LVTPC: - case APIC_LVT0: - case APIC_LVT1: - case APIC_LVTERR: - case APIC_TMICT: - case APIC_TDCR: - ret = true; - break; - default: - break; - } - return ret; -} - -static int avic_unaccelerated_access_interception(struct vcpu_svm *svm) -{ - int ret = 0; - u32 offset = svm->vmcb->control.exit_info_1 & - AVIC_UNACCEL_ACCESS_OFFSET_MASK; - u32 vector = svm->vmcb->control.exit_info_2 & - AVIC_UNACCEL_ACCESS_VECTOR_MASK; - bool write = (svm->vmcb->control.exit_info_1 >> 32) & - AVIC_UNACCEL_ACCESS_WRITE_MASK; - bool trap = is_avic_unaccelerated_access_trap(offset); - - trace_kvm_avic_unaccelerated_access(svm->vcpu.vcpu_id, offset, - trap, write, vector); - if (trap) { - /* Handling Trap */ - WARN_ONCE(!write, "svm: Handling trap read.\n"); - ret = avic_unaccel_trap_write(svm); - } else { - /* Handling Fault */ - ret = kvm_emulate_instruction(&svm->vcpu, 0); - } - - return ret; -} - -static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { - [SVM_EXIT_READ_CR0] = cr_interception, - [SVM_EXIT_READ_CR3] = cr_interception, - [SVM_EXIT_READ_CR4] = cr_interception, - [SVM_EXIT_READ_CR8] = cr_interception, - [SVM_EXIT_CR0_SEL_WRITE] = cr_interception, - [SVM_EXIT_WRITE_CR0] = cr_interception, - [SVM_EXIT_WRITE_CR3] = cr_interception, - [SVM_EXIT_WRITE_CR4] = cr_interception, - [SVM_EXIT_WRITE_CR8] = cr8_write_interception, - [SVM_EXIT_READ_DR0] = dr_interception, - [SVM_EXIT_READ_DR1] = dr_interception, - [SVM_EXIT_READ_DR2] = dr_interception, - [SVM_EXIT_READ_DR3] = dr_interception, - [SVM_EXIT_READ_DR4] = dr_interception, - [SVM_EXIT_READ_DR5] = dr_interception, - [SVM_EXIT_READ_DR6] = dr_interception, - [SVM_EXIT_READ_DR7] = dr_interception, - [SVM_EXIT_WRITE_DR0] = dr_interception, - [SVM_EXIT_WRITE_DR1] = dr_interception, - [SVM_EXIT_WRITE_DR2] = dr_interception, - [SVM_EXIT_WRITE_DR3] = dr_interception, - [SVM_EXIT_WRITE_DR4] = dr_interception, - [SVM_EXIT_WRITE_DR5] = dr_interception, - [SVM_EXIT_WRITE_DR6] = dr_interception, - [SVM_EXIT_WRITE_DR7] = dr_interception, - [SVM_EXIT_EXCP_BASE + DB_VECTOR] = db_interception, - [SVM_EXIT_EXCP_BASE + BP_VECTOR] = bp_interception, - [SVM_EXIT_EXCP_BASE + UD_VECTOR] = ud_interception, - [SVM_EXIT_EXCP_BASE + PF_VECTOR] = pf_interception, - [SVM_EXIT_EXCP_BASE + MC_VECTOR] = mc_interception, - [SVM_EXIT_EXCP_BASE + AC_VECTOR] = ac_interception, - [SVM_EXIT_EXCP_BASE + GP_VECTOR] = gp_interception, - [SVM_EXIT_INTR] = intr_interception, - [SVM_EXIT_NMI] = nmi_interception, - [SVM_EXIT_SMI] = nop_on_interception, - [SVM_EXIT_INIT] = nop_on_interception, - [SVM_EXIT_VINTR] = interrupt_window_interception, - [SVM_EXIT_RDPMC] = rdpmc_interception, - [SVM_EXIT_CPUID] = cpuid_interception, - [SVM_EXIT_IRET] = iret_interception, - [SVM_EXIT_INVD] = emulate_on_interception, - [SVM_EXIT_PAUSE] = pause_interception, - [SVM_EXIT_HLT] = halt_interception, - [SVM_EXIT_INVLPG] = invlpg_interception, - [SVM_EXIT_INVLPGA] = invlpga_interception, - [SVM_EXIT_IOIO] = io_interception, - [SVM_EXIT_MSR] = msr_interception, - [SVM_EXIT_TASK_SWITCH] = task_switch_interception, - [SVM_EXIT_SHUTDOWN] = shutdown_interception, - [SVM_EXIT_VMRUN] = vmrun_interception, - [SVM_EXIT_VMMCALL] = vmmcall_interception, - [SVM_EXIT_VMLOAD] = vmload_interception, - [SVM_EXIT_VMSAVE] = vmsave_interception, - [SVM_EXIT_STGI] = stgi_interception, - [SVM_EXIT_CLGI] = clgi_interception, - [SVM_EXIT_SKINIT] = skinit_interception, - [SVM_EXIT_WBINVD] = wbinvd_interception, - [SVM_EXIT_MONITOR] = monitor_interception, - [SVM_EXIT_MWAIT] = mwait_interception, - [SVM_EXIT_XSETBV] = xsetbv_interception, - [SVM_EXIT_RDPRU] = rdpru_interception, - [SVM_EXIT_NPF] = npf_interception, - [SVM_EXIT_RSM] = rsm_interception, - [SVM_EXIT_AVIC_INCOMPLETE_IPI] = avic_incomplete_ipi_interception, - [SVM_EXIT_AVIC_UNACCELERATED_ACCESS] = avic_unaccelerated_access_interception, -}; - -static void dump_vmcb(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb_control_area *control = &svm->vmcb->control; - struct vmcb_save_area *save = &svm->vmcb->save; - - if (!dump_invalid_vmcb) { - pr_warn_ratelimited("set kvm_amd.dump_invalid_vmcb=1 to dump internal KVM state.\n"); - return; - } - - pr_err("VMCB Control Area:\n"); - pr_err("%-20s%04x\n", "cr_read:", control->intercept_cr & 0xffff); - pr_err("%-20s%04x\n", "cr_write:", control->intercept_cr >> 16); - pr_err("%-20s%04x\n", "dr_read:", control->intercept_dr & 0xffff); - pr_err("%-20s%04x\n", "dr_write:", control->intercept_dr >> 16); - pr_err("%-20s%08x\n", "exceptions:", control->intercept_exceptions); - pr_err("%-20s%016llx\n", "intercepts:", control->intercept); - pr_err("%-20s%d\n", "pause filter count:", control->pause_filter_count); - pr_err("%-20s%d\n", "pause filter threshold:", - control->pause_filter_thresh); - pr_err("%-20s%016llx\n", "iopm_base_pa:", control->iopm_base_pa); - pr_err("%-20s%016llx\n", "msrpm_base_pa:", control->msrpm_base_pa); - pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset); - pr_err("%-20s%d\n", "asid:", control->asid); - pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl); - pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl); - pr_err("%-20s%08x\n", "int_vector:", control->int_vector); - pr_err("%-20s%08x\n", "int_state:", control->int_state); - pr_err("%-20s%08x\n", "exit_code:", control->exit_code); - pr_err("%-20s%016llx\n", "exit_info1:", control->exit_info_1); - pr_err("%-20s%016llx\n", "exit_info2:", control->exit_info_2); - pr_err("%-20s%08x\n", "exit_int_info:", control->exit_int_info); - pr_err("%-20s%08x\n", "exit_int_info_err:", control->exit_int_info_err); - pr_err("%-20s%lld\n", "nested_ctl:", control->nested_ctl); - pr_err("%-20s%016llx\n", "nested_cr3:", control->nested_cr3); - pr_err("%-20s%016llx\n", "avic_vapic_bar:", control->avic_vapic_bar); - pr_err("%-20s%08x\n", "event_inj:", control->event_inj); - pr_err("%-20s%08x\n", "event_inj_err:", control->event_inj_err); - pr_err("%-20s%lld\n", "virt_ext:", control->virt_ext); - pr_err("%-20s%016llx\n", "next_rip:", control->next_rip); - pr_err("%-20s%016llx\n", "avic_backing_page:", control->avic_backing_page); - pr_err("%-20s%016llx\n", "avic_logical_id:", control->avic_logical_id); - pr_err("%-20s%016llx\n", "avic_physical_id:", control->avic_physical_id); - pr_err("VMCB State Save Area:\n"); - pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", - "es:", - save->es.selector, save->es.attrib, - save->es.limit, save->es.base); - pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", - "cs:", - save->cs.selector, save->cs.attrib, - save->cs.limit, save->cs.base); - pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", - "ss:", - save->ss.selector, save->ss.attrib, - save->ss.limit, save->ss.base); - pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", - "ds:", - save->ds.selector, save->ds.attrib, - save->ds.limit, save->ds.base); - pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", - "fs:", - save->fs.selector, save->fs.attrib, - save->fs.limit, save->fs.base); - pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", - "gs:", - save->gs.selector, save->gs.attrib, - save->gs.limit, save->gs.base); - pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", - "gdtr:", - save->gdtr.selector, save->gdtr.attrib, - save->gdtr.limit, save->gdtr.base); - pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", - "ldtr:", - save->ldtr.selector, save->ldtr.attrib, - save->ldtr.limit, save->ldtr.base); - pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", - "idtr:", - save->idtr.selector, save->idtr.attrib, - save->idtr.limit, save->idtr.base); - pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", - "tr:", - save->tr.selector, save->tr.attrib, - save->tr.limit, save->tr.base); - pr_err("cpl: %d efer: %016llx\n", - save->cpl, save->efer); - pr_err("%-15s %016llx %-13s %016llx\n", - "cr0:", save->cr0, "cr2:", save->cr2); - pr_err("%-15s %016llx %-13s %016llx\n", - "cr3:", save->cr3, "cr4:", save->cr4); - pr_err("%-15s %016llx %-13s %016llx\n", - "dr6:", save->dr6, "dr7:", save->dr7); - pr_err("%-15s %016llx %-13s %016llx\n", - "rip:", save->rip, "rflags:", save->rflags); - pr_err("%-15s %016llx %-13s %016llx\n", - "rsp:", save->rsp, "rax:", save->rax); - pr_err("%-15s %016llx %-13s %016llx\n", - "star:", save->star, "lstar:", save->lstar); - pr_err("%-15s %016llx %-13s %016llx\n", - "cstar:", save->cstar, "sfmask:", save->sfmask); - pr_err("%-15s %016llx %-13s %016llx\n", - "kernel_gs_base:", save->kernel_gs_base, - "sysenter_cs:", save->sysenter_cs); - pr_err("%-15s %016llx %-13s %016llx\n", - "sysenter_esp:", save->sysenter_esp, - "sysenter_eip:", save->sysenter_eip); - pr_err("%-15s %016llx %-13s %016llx\n", - "gpat:", save->g_pat, "dbgctl:", save->dbgctl); - pr_err("%-15s %016llx %-13s %016llx\n", - "br_from:", save->br_from, "br_to:", save->br_to); - pr_err("%-15s %016llx %-13s %016llx\n", - "excp_from:", save->last_excp_from, - "excp_to:", save->last_excp_to); -} - -static void svm_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2) -{ - struct vmcb_control_area *control = &to_svm(vcpu)->vmcb->control; - - *info1 = control->exit_info_1; - *info2 = control->exit_info_2; -} - -static int handle_exit(struct kvm_vcpu *vcpu, - enum exit_fastpath_completion exit_fastpath) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct kvm_run *kvm_run = vcpu->run; - u32 exit_code = svm->vmcb->control.exit_code; - - trace_kvm_exit(exit_code, vcpu, KVM_ISA_SVM); - - if (!is_cr_intercept(svm, INTERCEPT_CR0_WRITE)) - vcpu->arch.cr0 = svm->vmcb->save.cr0; - if (npt_enabled) - vcpu->arch.cr3 = svm->vmcb->save.cr3; - - if (unlikely(svm->nested.exit_required)) { - nested_svm_vmexit(svm); - svm->nested.exit_required = false; - - return 1; - } - - if (is_guest_mode(vcpu)) { - int vmexit; - - trace_kvm_nested_vmexit(svm->vmcb->save.rip, exit_code, - svm->vmcb->control.exit_info_1, - svm->vmcb->control.exit_info_2, - svm->vmcb->control.exit_int_info, - svm->vmcb->control.exit_int_info_err, - KVM_ISA_SVM); - - vmexit = nested_svm_exit_special(svm); - - if (vmexit == NESTED_EXIT_CONTINUE) - vmexit = nested_svm_exit_handled(svm); - - if (vmexit == NESTED_EXIT_DONE) - return 1; - } - - svm_complete_interrupts(svm); - - if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) { - kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY; - kvm_run->fail_entry.hardware_entry_failure_reason - = svm->vmcb->control.exit_code; - dump_vmcb(vcpu); - return 0; - } - - if (is_external_interrupt(svm->vmcb->control.exit_int_info) && - exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR && - exit_code != SVM_EXIT_NPF && exit_code != SVM_EXIT_TASK_SWITCH && - exit_code != SVM_EXIT_INTR && exit_code != SVM_EXIT_NMI) - printk(KERN_ERR "%s: unexpected exit_int_info 0x%x " - "exit_code 0x%x\n", - __func__, svm->vmcb->control.exit_int_info, - exit_code); - - if (exit_fastpath == EXIT_FASTPATH_SKIP_EMUL_INS) { - kvm_skip_emulated_instruction(vcpu); - return 1; - } else if (exit_code >= ARRAY_SIZE(svm_exit_handlers) - || !svm_exit_handlers[exit_code]) { - vcpu_unimpl(vcpu, "svm: unexpected exit reason 0x%x\n", exit_code); - dump_vmcb(vcpu); - vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; - vcpu->run->internal.suberror = - KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON; - vcpu->run->internal.ndata = 1; - vcpu->run->internal.data[0] = exit_code; - return 0; - } - -#ifdef CONFIG_RETPOLINE - if (exit_code == SVM_EXIT_MSR) - return msr_interception(svm); - else if (exit_code == SVM_EXIT_VINTR) - return interrupt_window_interception(svm); - else if (exit_code == SVM_EXIT_INTR) - return intr_interception(svm); - else if (exit_code == SVM_EXIT_HLT) - return halt_interception(svm); - else if (exit_code == SVM_EXIT_NPF) - return npf_interception(svm); -#endif - return svm_exit_handlers[exit_code](svm); -} - -static void reload_tss(struct kvm_vcpu *vcpu) -{ - int cpu = raw_smp_processor_id(); - - struct svm_cpu_data *sd = per_cpu(svm_data, cpu); - sd->tss_desc->type = 9; /* available 32/64-bit TSS */ - load_TR_desc(); -} - -static void pre_sev_run(struct vcpu_svm *svm, int cpu) -{ - struct svm_cpu_data *sd = per_cpu(svm_data, cpu); - int asid = sev_get_asid(svm->vcpu.kvm); - - /* Assign the asid allocated with this SEV guest */ - svm->vmcb->control.asid = asid; - - /* - * Flush guest TLB: - * - * 1) when different VMCB for the same ASID is to be run on the same host CPU. - * 2) or this VMCB was executed on different host CPU in previous VMRUNs. - */ - if (sd->sev_vmcbs[asid] == svm->vmcb && - svm->last_cpu == cpu) - return; - - svm->last_cpu = cpu; - sd->sev_vmcbs[asid] = svm->vmcb; - svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID; - mark_dirty(svm->vmcb, VMCB_ASID); -} - -static void pre_svm_run(struct vcpu_svm *svm) -{ - int cpu = raw_smp_processor_id(); - - struct svm_cpu_data *sd = per_cpu(svm_data, cpu); - - if (sev_guest(svm->vcpu.kvm)) - return pre_sev_run(svm, cpu); - - /* FIXME: handle wraparound of asid_generation */ - if (svm->asid_generation != sd->asid_generation) - new_asid(svm, sd); -} - -static void svm_inject_nmi(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI; - vcpu->arch.hflags |= HF_NMI_MASK; - set_intercept(svm, INTERCEPT_IRET); - ++vcpu->stat.nmi_injections; -} - -static void svm_set_irq(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - BUG_ON(!(gif_set(svm))); - - trace_kvm_inj_virq(vcpu->arch.interrupt.nr); - ++vcpu->stat.irq_injections; - - svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr | - SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR; -} - -static inline bool svm_nested_virtualize_tpr(struct kvm_vcpu *vcpu) -{ - return is_guest_mode(vcpu) && (vcpu->arch.hflags & HF_VINTR_MASK); -} - -static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - if (svm_nested_virtualize_tpr(vcpu)) - return; - - clr_cr_intercept(svm, INTERCEPT_CR8_WRITE); - - if (irr == -1) - return; - - if (tpr >= irr) - set_cr_intercept(svm, INTERCEPT_CR8_WRITE); -} - -static void svm_set_virtual_apic_mode(struct kvm_vcpu *vcpu) -{ - return; -} - -static void svm_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr) -{ -} - -static void svm_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr) -{ -} - -static void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate) -{ - if (!avic || !lapic_in_kernel(vcpu)) - return; - - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx); - kvm_request_apicv_update(vcpu->kvm, activate, - APICV_INHIBIT_REASON_IRQWIN); - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu); -} - -static int svm_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate) -{ - int ret = 0; - unsigned long flags; - struct amd_svm_iommu_ir *ir; - struct vcpu_svm *svm = to_svm(vcpu); - - if (!kvm_arch_has_assigned_device(vcpu->kvm)) - return 0; - - /* - * Here, we go through the per-vcpu ir_list to update all existing - * interrupt remapping table entry targeting this vcpu. - */ - spin_lock_irqsave(&svm->ir_list_lock, flags); - - if (list_empty(&svm->ir_list)) - goto out; - - list_for_each_entry(ir, &svm->ir_list, node) { - if (activate) - ret = amd_iommu_activate_guest_mode(ir->data); - else - ret = amd_iommu_deactivate_guest_mode(ir->data); - if (ret) - break; - } -out: - spin_unlock_irqrestore(&svm->ir_list_lock, flags); - return ret; -} - -static void svm_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb *vmcb = svm->vmcb; - bool activated = kvm_vcpu_apicv_active(vcpu); - - if (!avic) - return; - - if (activated) { - /** - * During AVIC temporary deactivation, guest could update - * APIC ID, DFR and LDR registers, which would not be trapped - * by avic_unaccelerated_access_interception(). In this case, - * we need to check and update the AVIC logical APIC ID table - * accordingly before re-activating. - */ - avic_post_state_restore(vcpu); - vmcb->control.int_ctl |= AVIC_ENABLE_MASK; - } else { - vmcb->control.int_ctl &= ~AVIC_ENABLE_MASK; - } - mark_dirty(vmcb, VMCB_AVIC); - - svm_set_pi_irte_mode(vcpu, activated); -} - -static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) -{ - return; -} - -static int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec) -{ - if (!vcpu->arch.apicv_active) - return -1; - - kvm_lapic_set_irr(vec, vcpu->arch.apic); - smp_mb__after_atomic(); - - if (avic_vcpu_is_running(vcpu)) { - int cpuid = vcpu->cpu; - - if (cpuid != get_cpu()) - wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpuid)); - put_cpu(); - } else - kvm_vcpu_wake_up(vcpu); - - return 0; -} - -static bool svm_dy_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu) -{ - return false; -} - -static void svm_ir_list_del(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi) -{ - unsigned long flags; - struct amd_svm_iommu_ir *cur; - - spin_lock_irqsave(&svm->ir_list_lock, flags); - list_for_each_entry(cur, &svm->ir_list, node) { - if (cur->data != pi->ir_data) - continue; - list_del(&cur->node); - kfree(cur); - break; - } - spin_unlock_irqrestore(&svm->ir_list_lock, flags); -} - -static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi) -{ - int ret = 0; - unsigned long flags; - struct amd_svm_iommu_ir *ir; - - /** - * In some cases, the existing irte is updaed and re-set, - * so we need to check here if it's already been * added - * to the ir_list. - */ - if (pi->ir_data && (pi->prev_ga_tag != 0)) { - struct kvm *kvm = svm->vcpu.kvm; - u32 vcpu_id = AVIC_GATAG_TO_VCPUID(pi->prev_ga_tag); - struct kvm_vcpu *prev_vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id); - struct vcpu_svm *prev_svm; - - if (!prev_vcpu) { - ret = -EINVAL; - goto out; - } - - prev_svm = to_svm(prev_vcpu); - svm_ir_list_del(prev_svm, pi); - } - - /** - * Allocating new amd_iommu_pi_data, which will get - * add to the per-vcpu ir_list. - */ - ir = kzalloc(sizeof(struct amd_svm_iommu_ir), GFP_KERNEL_ACCOUNT); - if (!ir) { - ret = -ENOMEM; - goto out; - } - ir->data = pi->ir_data; - - spin_lock_irqsave(&svm->ir_list_lock, flags); - list_add(&ir->node, &svm->ir_list); - spin_unlock_irqrestore(&svm->ir_list_lock, flags); -out: - return ret; -} - -/** - * Note: - * The HW cannot support posting multicast/broadcast - * interrupts to a vCPU. So, we still use legacy interrupt - * remapping for these kind of interrupts. - * - * For lowest-priority interrupts, we only support - * those with single CPU as the destination, e.g. user - * configures the interrupts via /proc/irq or uses - * irqbalance to make the interrupts single-CPU. - */ -static int -get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e, - struct vcpu_data *vcpu_info, struct vcpu_svm **svm) -{ - struct kvm_lapic_irq irq; - struct kvm_vcpu *vcpu = NULL; - - kvm_set_msi_irq(kvm, e, &irq); - - if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) || - !kvm_irq_is_postable(&irq)) { - pr_debug("SVM: %s: use legacy intr remap mode for irq %u\n", - __func__, irq.vector); - return -1; - } - - pr_debug("SVM: %s: use GA mode for irq %u\n", __func__, - irq.vector); - *svm = to_svm(vcpu); - vcpu_info->pi_desc_addr = __sme_set(page_to_phys((*svm)->avic_backing_page)); - vcpu_info->vector = irq.vector; - - return 0; -} - -/* - * svm_update_pi_irte - set IRTE for Posted-Interrupts - * - * @kvm: kvm - * @host_irq: host irq of the interrupt - * @guest_irq: gsi of the interrupt - * @set: set or unset PI - * returns 0 on success, < 0 on failure - */ -static int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq, - uint32_t guest_irq, bool set) -{ - struct kvm_kernel_irq_routing_entry *e; - struct kvm_irq_routing_table *irq_rt; - int idx, ret = -EINVAL; - - if (!kvm_arch_has_assigned_device(kvm) || - !irq_remapping_cap(IRQ_POSTING_CAP)) - return 0; - - pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n", - __func__, host_irq, guest_irq, set); - - idx = srcu_read_lock(&kvm->irq_srcu); - irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu); - WARN_ON(guest_irq >= irq_rt->nr_rt_entries); - - hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) { - struct vcpu_data vcpu_info; - struct vcpu_svm *svm = NULL; - - if (e->type != KVM_IRQ_ROUTING_MSI) - continue; - - /** - * Here, we setup with legacy mode in the following cases: - * 1. When cannot target interrupt to a specific vcpu. - * 2. Unsetting posted interrupt. - * 3. APIC virtialization is disabled for the vcpu. - * 4. IRQ has incompatible delivery mode (SMI, INIT, etc) - */ - if (!get_pi_vcpu_info(kvm, e, &vcpu_info, &svm) && set && - kvm_vcpu_apicv_active(&svm->vcpu)) { - struct amd_iommu_pi_data pi; - - /* Try to enable guest_mode in IRTE */ - pi.base = __sme_set(page_to_phys(svm->avic_backing_page) & - AVIC_HPA_MASK); - pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, - svm->vcpu.vcpu_id); - pi.is_guest_mode = true; - pi.vcpu_data = &vcpu_info; - ret = irq_set_vcpu_affinity(host_irq, &pi); - - /** - * Here, we successfully setting up vcpu affinity in - * IOMMU guest mode. Now, we need to store the posted - * interrupt information in a per-vcpu ir_list so that - * we can reference to them directly when we update vcpu - * scheduling information in IOMMU irte. - */ - if (!ret && pi.is_guest_mode) - svm_ir_list_add(svm, &pi); - } else { - /* Use legacy mode in IRTE */ - struct amd_iommu_pi_data pi; - - /** - * Here, pi is used to: - * - Tell IOMMU to use legacy mode for this interrupt. - * - Retrieve ga_tag of prior interrupt remapping data. - */ - pi.is_guest_mode = false; - ret = irq_set_vcpu_affinity(host_irq, &pi); - - /** - * Check if the posted interrupt was previously - * setup with the guest_mode by checking if the ga_tag - * was cached. If so, we need to clean up the per-vcpu - * ir_list. - */ - if (!ret && pi.prev_ga_tag) { - int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag); - struct kvm_vcpu *vcpu; - - vcpu = kvm_get_vcpu_by_id(kvm, id); - if (vcpu) - svm_ir_list_del(to_svm(vcpu), &pi); - } - } - - if (!ret && svm) { - trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id, - e->gsi, vcpu_info.vector, - vcpu_info.pi_desc_addr, set); - } - - if (ret < 0) { - pr_err("%s: failed to update PI IRTE\n", __func__); - goto out; - } - } - - ret = 0; -out: - srcu_read_unlock(&kvm->irq_srcu, idx); - return ret; -} - -static int svm_nmi_allowed(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb *vmcb = svm->vmcb; - int ret; - ret = !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) && - !(svm->vcpu.arch.hflags & HF_NMI_MASK); - ret = ret && gif_set(svm) && nested_svm_nmi(svm); - - return ret; -} - -static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - return !!(svm->vcpu.arch.hflags & HF_NMI_MASK); -} - -static void svm_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - if (masked) { - svm->vcpu.arch.hflags |= HF_NMI_MASK; - set_intercept(svm, INTERCEPT_IRET); - } else { - svm->vcpu.arch.hflags &= ~HF_NMI_MASK; - clr_intercept(svm, INTERCEPT_IRET); - } -} - -static int svm_interrupt_allowed(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb *vmcb = svm->vmcb; - - if (!gif_set(svm) || - (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK)) - return 0; - - if (is_guest_mode(vcpu) && (svm->vcpu.arch.hflags & HF_VINTR_MASK)) - return !!(svm->vcpu.arch.hflags & HF_HIF_MASK); - else - return !!(kvm_get_rflags(vcpu) & X86_EFLAGS_IF); -} - -static void enable_irq_window(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - /* - * In case GIF=0 we can't rely on the CPU to tell us when GIF becomes - * 1, because that's a separate STGI/VMRUN intercept. The next time we - * get that intercept, this function will be called again though and - * we'll get the vintr intercept. However, if the vGIF feature is - * enabled, the STGI interception will not occur. Enable the irq - * window under the assumption that the hardware will set the GIF. - */ - if (vgif_enabled(svm) || gif_set(svm)) { - /* - * IRQ window is not needed when AVIC is enabled, - * unless we have pending ExtINT since it cannot be injected - * via AVIC. In such case, we need to temporarily disable AVIC, - * and fallback to injecting IRQ via V_IRQ. - */ - svm_toggle_avic_for_irq_window(vcpu, false); - svm_set_vintr(svm); - } -} - -static void enable_nmi_window(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - if ((svm->vcpu.arch.hflags & (HF_NMI_MASK | HF_IRET_MASK)) - == HF_NMI_MASK) - return; /* IRET will cause a vm exit */ - - if (!gif_set(svm)) { - if (vgif_enabled(svm)) - set_intercept(svm, INTERCEPT_STGI); - return; /* STGI will cause a vm exit */ - } - - if (svm->nested.exit_required) - return; /* we're not going to run the guest yet */ - - /* - * Something prevents NMI from been injected. Single step over possible - * problem (IRET or exception injection or interrupt shadow) - */ - svm->nmi_singlestep_guest_rflags = svm_get_rflags(vcpu); - svm->nmi_singlestep = true; - svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF); -} - -static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr) -{ - return 0; -} - -static int svm_set_identity_map_addr(struct kvm *kvm, u64 ident_addr) -{ - return 0; -} - -static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - if (static_cpu_has(X86_FEATURE_FLUSHBYASID)) - svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID; - else - svm->asid_generation--; -} - -static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - invlpga(gva, svm->vmcb->control.asid); -} - -static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu) -{ -} - -static inline void sync_cr8_to_lapic(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - if (svm_nested_virtualize_tpr(vcpu)) - return; - - if (!is_cr_intercept(svm, INTERCEPT_CR8_WRITE)) { - int cr8 = svm->vmcb->control.int_ctl & V_TPR_MASK; - kvm_set_cr8(vcpu, cr8); - } -} - -static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - u64 cr8; - - if (svm_nested_virtualize_tpr(vcpu) || - kvm_vcpu_apicv_active(vcpu)) - return; - - cr8 = kvm_get_cr8(vcpu); - svm->vmcb->control.int_ctl &= ~V_TPR_MASK; - svm->vmcb->control.int_ctl |= cr8 & V_TPR_MASK; -} - -static void svm_complete_interrupts(struct vcpu_svm *svm) -{ - u8 vector; - int type; - u32 exitintinfo = svm->vmcb->control.exit_int_info; - unsigned int3_injected = svm->int3_injected; - - svm->int3_injected = 0; - - /* - * If we've made progress since setting HF_IRET_MASK, we've - * executed an IRET and can allow NMI injection. - */ - if ((svm->vcpu.arch.hflags & HF_IRET_MASK) - && kvm_rip_read(&svm->vcpu) != svm->nmi_iret_rip) { - svm->vcpu.arch.hflags &= ~(HF_NMI_MASK | HF_IRET_MASK); - kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); - } - - svm->vcpu.arch.nmi_injected = false; - kvm_clear_exception_queue(&svm->vcpu); - kvm_clear_interrupt_queue(&svm->vcpu); - - if (!(exitintinfo & SVM_EXITINTINFO_VALID)) - return; - - kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); - - vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK; - type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK; - - switch (type) { - case SVM_EXITINTINFO_TYPE_NMI: - svm->vcpu.arch.nmi_injected = true; - break; - case SVM_EXITINTINFO_TYPE_EXEPT: - /* - * In case of software exceptions, do not reinject the vector, - * but re-execute the instruction instead. Rewind RIP first - * if we emulated INT3 before. - */ - if (kvm_exception_is_soft(vector)) { - if (vector == BP_VECTOR && int3_injected && - kvm_is_linear_rip(&svm->vcpu, svm->int3_rip)) - kvm_rip_write(&svm->vcpu, - kvm_rip_read(&svm->vcpu) - - int3_injected); - break; - } - if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) { - u32 err = svm->vmcb->control.exit_int_info_err; - kvm_requeue_exception_e(&svm->vcpu, vector, err); - - } else - kvm_requeue_exception(&svm->vcpu, vector); - break; - case SVM_EXITINTINFO_TYPE_INTR: - kvm_queue_interrupt(&svm->vcpu, vector, false); - break; - default: - break; - } -} - -static void svm_cancel_injection(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb_control_area *control = &svm->vmcb->control; - - control->exit_int_info = control->event_inj; - control->exit_int_info_err = control->event_inj_err; - control->event_inj = 0; - svm_complete_interrupts(svm); -} - -static void svm_vcpu_run(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX]; - svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP]; - svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP]; - - /* - * A vmexit emulation is required before the vcpu can be executed - * again. - */ - if (unlikely(svm->nested.exit_required)) - return; - - /* - * Disable singlestep if we're injecting an interrupt/exception. - * We don't want our modified rflags to be pushed on the stack where - * we might not be able to easily reset them if we disabled NMI - * singlestep later. - */ - if (svm->nmi_singlestep && svm->vmcb->control.event_inj) { - /* - * Event injection happens before external interrupts cause a - * vmexit and interrupts are disabled here, so smp_send_reschedule - * is enough to force an immediate vmexit. - */ - disable_nmi_singlestep(svm); - smp_send_reschedule(vcpu->cpu); - } - - pre_svm_run(svm); - - sync_lapic_to_cr8(vcpu); - - svm->vmcb->save.cr2 = vcpu->arch.cr2; - - clgi(); - kvm_load_guest_xsave_state(vcpu); - - if (lapic_in_kernel(vcpu) && - vcpu->arch.apic->lapic_timer.timer_advance_ns) - kvm_wait_lapic_expire(vcpu); - - /* - * If this vCPU has touched SPEC_CTRL, restore the guest's value if - * it's non-zero. Since vmentry is serialising on affected CPUs, there - * is no need to worry about the conditional branch over the wrmsr - * being speculatively taken. - */ - x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl); - - local_irq_enable(); - - asm volatile ( - "push %%" _ASM_BP "; \n\t" - "mov %c[rbx](%[svm]), %%" _ASM_BX " \n\t" - "mov %c[rcx](%[svm]), %%" _ASM_CX " \n\t" - "mov %c[rdx](%[svm]), %%" _ASM_DX " \n\t" - "mov %c[rsi](%[svm]), %%" _ASM_SI " \n\t" - "mov %c[rdi](%[svm]), %%" _ASM_DI " \n\t" - "mov %c[rbp](%[svm]), %%" _ASM_BP " \n\t" -#ifdef CONFIG_X86_64 - "mov %c[r8](%[svm]), %%r8 \n\t" - "mov %c[r9](%[svm]), %%r9 \n\t" - "mov %c[r10](%[svm]), %%r10 \n\t" - "mov %c[r11](%[svm]), %%r11 \n\t" - "mov %c[r12](%[svm]), %%r12 \n\t" - "mov %c[r13](%[svm]), %%r13 \n\t" - "mov %c[r14](%[svm]), %%r14 \n\t" - "mov %c[r15](%[svm]), %%r15 \n\t" -#endif - - /* Enter guest mode */ - "push %%" _ASM_AX " \n\t" - "mov %c[vmcb](%[svm]), %%" _ASM_AX " \n\t" - __ex("vmload %%" _ASM_AX) "\n\t" - __ex("vmrun %%" _ASM_AX) "\n\t" - __ex("vmsave %%" _ASM_AX) "\n\t" - "pop %%" _ASM_AX " \n\t" - - /* Save guest registers, load host registers */ - "mov %%" _ASM_BX ", %c[rbx](%[svm]) \n\t" - "mov %%" _ASM_CX ", %c[rcx](%[svm]) \n\t" - "mov %%" _ASM_DX ", %c[rdx](%[svm]) \n\t" - "mov %%" _ASM_SI ", %c[rsi](%[svm]) \n\t" - "mov %%" _ASM_DI ", %c[rdi](%[svm]) \n\t" - "mov %%" _ASM_BP ", %c[rbp](%[svm]) \n\t" -#ifdef CONFIG_X86_64 - "mov %%r8, %c[r8](%[svm]) \n\t" - "mov %%r9, %c[r9](%[svm]) \n\t" - "mov %%r10, %c[r10](%[svm]) \n\t" - "mov %%r11, %c[r11](%[svm]) \n\t" - "mov %%r12, %c[r12](%[svm]) \n\t" - "mov %%r13, %c[r13](%[svm]) \n\t" - "mov %%r14, %c[r14](%[svm]) \n\t" - "mov %%r15, %c[r15](%[svm]) \n\t" - /* - * Clear host registers marked as clobbered to prevent - * speculative use. - */ - "xor %%r8d, %%r8d \n\t" - "xor %%r9d, %%r9d \n\t" - "xor %%r10d, %%r10d \n\t" - "xor %%r11d, %%r11d \n\t" - "xor %%r12d, %%r12d \n\t" - "xor %%r13d, %%r13d \n\t" - "xor %%r14d, %%r14d \n\t" - "xor %%r15d, %%r15d \n\t" -#endif - "xor %%ebx, %%ebx \n\t" - "xor %%ecx, %%ecx \n\t" - "xor %%edx, %%edx \n\t" - "xor %%esi, %%esi \n\t" - "xor %%edi, %%edi \n\t" - "pop %%" _ASM_BP - : - : [svm]"a"(svm), - [vmcb]"i"(offsetof(struct vcpu_svm, vmcb_pa)), - [rbx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RBX])), - [rcx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RCX])), - [rdx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RDX])), - [rsi]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RSI])), - [rdi]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RDI])), - [rbp]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RBP])) -#ifdef CONFIG_X86_64 - , [r8]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R8])), - [r9]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R9])), - [r10]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R10])), - [r11]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R11])), - [r12]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R12])), - [r13]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R13])), - [r14]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R14])), - [r15]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R15])) -#endif - : "cc", "memory" -#ifdef CONFIG_X86_64 - , "rbx", "rcx", "rdx", "rsi", "rdi" - , "r8", "r9", "r10", "r11" , "r12", "r13", "r14", "r15" -#else - , "ebx", "ecx", "edx", "esi", "edi" -#endif - ); - - /* Eliminate branch target predictions from guest mode */ - vmexit_fill_RSB(); - -#ifdef CONFIG_X86_64 - wrmsrl(MSR_GS_BASE, svm->host.gs_base); -#else - loadsegment(fs, svm->host.fs); -#ifndef CONFIG_X86_32_LAZY_GS - loadsegment(gs, svm->host.gs); -#endif -#endif - - /* - * We do not use IBRS in the kernel. If this vCPU has used the - * SPEC_CTRL MSR it may have left it on; save the value and - * turn it off. This is much more efficient than blindly adding - * it to the atomic save/restore list. Especially as the former - * (Saving guest MSRs on vmexit) doesn't even exist in KVM. - * - * For non-nested case: - * If the L01 MSR bitmap does not intercept the MSR, then we need to - * save it. - * - * For nested case: - * If the L02 MSR bitmap does not intercept the MSR, then we need to - * save it. - */ - if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))) - svm->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL); - - reload_tss(vcpu); - - local_irq_disable(); - - x86_spec_ctrl_restore_host(svm->spec_ctrl, svm->virt_spec_ctrl); - - vcpu->arch.cr2 = svm->vmcb->save.cr2; - vcpu->arch.regs[VCPU_REGS_RAX] = svm->vmcb->save.rax; - vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp; - vcpu->arch.regs[VCPU_REGS_RIP] = svm->vmcb->save.rip; - - if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI)) - kvm_before_interrupt(&svm->vcpu); - - kvm_load_host_xsave_state(vcpu); - stgi(); - - /* Any pending NMI will happen here */ - - if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI)) - kvm_after_interrupt(&svm->vcpu); - - sync_cr8_to_lapic(vcpu); - - svm->next_rip = 0; - - svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING; - - /* if exit due to PF check for async PF */ - if (svm->vmcb->control.exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) - svm->vcpu.arch.apf.host_apf_reason = kvm_read_and_reset_pf_reason(); - - if (npt_enabled) { - vcpu->arch.regs_avail &= ~(1 << VCPU_EXREG_PDPTR); - vcpu->arch.regs_dirty &= ~(1 << VCPU_EXREG_PDPTR); - } - - /* - * We need to handle MC intercepts here before the vcpu has a chance to - * change the physical cpu - */ - if (unlikely(svm->vmcb->control.exit_code == - SVM_EXIT_EXCP_BASE + MC_VECTOR)) - svm_handle_mce(svm); - - mark_all_clean(svm->vmcb); -} -STACK_FRAME_NON_STANDARD(svm_vcpu_run); - -static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long root) -{ - struct vcpu_svm *svm = to_svm(vcpu); - bool update_guest_cr3 = true; - unsigned long cr3; - - cr3 = __sme_set(root); - if (npt_enabled) { - svm->vmcb->control.nested_cr3 = cr3; - mark_dirty(svm->vmcb, VMCB_NPT); - - /* Loading L2's CR3 is handled by enter_svm_guest_mode. */ - if (is_guest_mode(vcpu)) - update_guest_cr3 = false; - else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail)) - cr3 = vcpu->arch.cr3; - else /* CR3 is already up-to-date. */ - update_guest_cr3 = false; - } - - if (update_guest_cr3) { - svm->vmcb->save.cr3 = cr3; - mark_dirty(svm->vmcb, VMCB_CR); - } -} - -static int is_disabled(void) -{ - u64 vm_cr; - - rdmsrl(MSR_VM_CR, vm_cr); - if (vm_cr & (1 << SVM_VM_CR_SVM_DISABLE)) - return 1; - - return 0; -} - -static void -svm_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) -{ - /* - * Patch in the VMMCALL instruction: - */ - hypercall[0] = 0x0f; - hypercall[1] = 0x01; - hypercall[2] = 0xd9; -} - -static int __init svm_check_processor_compat(void) -{ - return 0; -} - -static bool svm_cpu_has_accelerated_tpr(void) -{ - return false; -} - -static bool svm_has_emulated_msr(int index) -{ - switch (index) { - case MSR_IA32_MCG_EXT_CTL: - case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC: - return false; - default: - break; - } - - return true; -} - -static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) -{ - return 0; -} - -static void svm_cpuid_update(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - vcpu->arch.xsaves_enabled = guest_cpuid_has(vcpu, X86_FEATURE_XSAVE) && - boot_cpu_has(X86_FEATURE_XSAVE) && - boot_cpu_has(X86_FEATURE_XSAVES); - - /* Update nrips enabled cache */ - svm->nrips_enabled = kvm_cpu_cap_has(X86_FEATURE_NRIPS) && - guest_cpuid_has(&svm->vcpu, X86_FEATURE_NRIPS); - - if (!kvm_vcpu_apicv_active(vcpu)) - return; - - /* - * AVIC does not work with an x2APIC mode guest. If the X2APIC feature - * is exposed to the guest, disable AVIC. - */ - if (guest_cpuid_has(vcpu, X86_FEATURE_X2APIC)) - kvm_request_apicv_update(vcpu->kvm, false, - APICV_INHIBIT_REASON_X2APIC); - - /* - * Currently, AVIC does not work with nested virtualization. - * So, we disable AVIC when cpuid for SVM is set in the L1 guest. - */ - if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM)) - kvm_request_apicv_update(vcpu->kvm, false, - APICV_INHIBIT_REASON_NESTED); -} - -static bool svm_has_wbinvd_exit(void) -{ - return true; -} - -#define PRE_EX(exit) { .exit_code = (exit), \ - .stage = X86_ICPT_PRE_EXCEPT, } -#define POST_EX(exit) { .exit_code = (exit), \ - .stage = X86_ICPT_POST_EXCEPT, } -#define POST_MEM(exit) { .exit_code = (exit), \ - .stage = X86_ICPT_POST_MEMACCESS, } - -static const struct __x86_intercept { - u32 exit_code; - enum x86_intercept_stage stage; -} x86_intercept_map[] = { - [x86_intercept_cr_read] = POST_EX(SVM_EXIT_READ_CR0), - [x86_intercept_cr_write] = POST_EX(SVM_EXIT_WRITE_CR0), - [x86_intercept_clts] = POST_EX(SVM_EXIT_WRITE_CR0), - [x86_intercept_lmsw] = POST_EX(SVM_EXIT_WRITE_CR0), - [x86_intercept_smsw] = POST_EX(SVM_EXIT_READ_CR0), - [x86_intercept_dr_read] = POST_EX(SVM_EXIT_READ_DR0), - [x86_intercept_dr_write] = POST_EX(SVM_EXIT_WRITE_DR0), - [x86_intercept_sldt] = POST_EX(SVM_EXIT_LDTR_READ), - [x86_intercept_str] = POST_EX(SVM_EXIT_TR_READ), - [x86_intercept_lldt] = POST_EX(SVM_EXIT_LDTR_WRITE), - [x86_intercept_ltr] = POST_EX(SVM_EXIT_TR_WRITE), - [x86_intercept_sgdt] = POST_EX(SVM_EXIT_GDTR_READ), - [x86_intercept_sidt] = POST_EX(SVM_EXIT_IDTR_READ), - [x86_intercept_lgdt] = POST_EX(SVM_EXIT_GDTR_WRITE), - [x86_intercept_lidt] = POST_EX(SVM_EXIT_IDTR_WRITE), - [x86_intercept_vmrun] = POST_EX(SVM_EXIT_VMRUN), - [x86_intercept_vmmcall] = POST_EX(SVM_EXIT_VMMCALL), - [x86_intercept_vmload] = POST_EX(SVM_EXIT_VMLOAD), - [x86_intercept_vmsave] = POST_EX(SVM_EXIT_VMSAVE), - [x86_intercept_stgi] = POST_EX(SVM_EXIT_STGI), - [x86_intercept_clgi] = POST_EX(SVM_EXIT_CLGI), - [x86_intercept_skinit] = POST_EX(SVM_EXIT_SKINIT), - [x86_intercept_invlpga] = POST_EX(SVM_EXIT_INVLPGA), - [x86_intercept_rdtscp] = POST_EX(SVM_EXIT_RDTSCP), - [x86_intercept_monitor] = POST_MEM(SVM_EXIT_MONITOR), - [x86_intercept_mwait] = POST_EX(SVM_EXIT_MWAIT), - [x86_intercept_invlpg] = POST_EX(SVM_EXIT_INVLPG), - [x86_intercept_invd] = POST_EX(SVM_EXIT_INVD), - [x86_intercept_wbinvd] = POST_EX(SVM_EXIT_WBINVD), - [x86_intercept_wrmsr] = POST_EX(SVM_EXIT_MSR), - [x86_intercept_rdtsc] = POST_EX(SVM_EXIT_RDTSC), - [x86_intercept_rdmsr] = POST_EX(SVM_EXIT_MSR), - [x86_intercept_rdpmc] = POST_EX(SVM_EXIT_RDPMC), - [x86_intercept_cpuid] = PRE_EX(SVM_EXIT_CPUID), - [x86_intercept_rsm] = PRE_EX(SVM_EXIT_RSM), - [x86_intercept_pause] = PRE_EX(SVM_EXIT_PAUSE), - [x86_intercept_pushf] = PRE_EX(SVM_EXIT_PUSHF), - [x86_intercept_popf] = PRE_EX(SVM_EXIT_POPF), - [x86_intercept_intn] = PRE_EX(SVM_EXIT_SWINT), - [x86_intercept_iret] = PRE_EX(SVM_EXIT_IRET), - [x86_intercept_icebp] = PRE_EX(SVM_EXIT_ICEBP), - [x86_intercept_hlt] = POST_EX(SVM_EXIT_HLT), - [x86_intercept_in] = POST_EX(SVM_EXIT_IOIO), - [x86_intercept_ins] = POST_EX(SVM_EXIT_IOIO), - [x86_intercept_out] = POST_EX(SVM_EXIT_IOIO), - [x86_intercept_outs] = POST_EX(SVM_EXIT_IOIO), - [x86_intercept_xsetbv] = PRE_EX(SVM_EXIT_XSETBV), -}; - -#undef PRE_EX -#undef POST_EX -#undef POST_MEM - -static int svm_check_intercept(struct kvm_vcpu *vcpu, - struct x86_instruction_info *info, - enum x86_intercept_stage stage, - struct x86_exception *exception) -{ - struct vcpu_svm *svm = to_svm(vcpu); - int vmexit, ret = X86EMUL_CONTINUE; - struct __x86_intercept icpt_info; - struct vmcb *vmcb = svm->vmcb; - - if (info->intercept >= ARRAY_SIZE(x86_intercept_map)) - goto out; - - icpt_info = x86_intercept_map[info->intercept]; - - if (stage != icpt_info.stage) - goto out; - - switch (icpt_info.exit_code) { - case SVM_EXIT_READ_CR0: - if (info->intercept == x86_intercept_cr_read) - icpt_info.exit_code += info->modrm_reg; - break; - case SVM_EXIT_WRITE_CR0: { - unsigned long cr0, val; - u64 intercept; - - if (info->intercept == x86_intercept_cr_write) - icpt_info.exit_code += info->modrm_reg; - - if (icpt_info.exit_code != SVM_EXIT_WRITE_CR0 || - info->intercept == x86_intercept_clts) - break; - - intercept = svm->nested.intercept; - - if (!(intercept & (1ULL << INTERCEPT_SELECTIVE_CR0))) - break; - - cr0 = vcpu->arch.cr0 & ~SVM_CR0_SELECTIVE_MASK; - val = info->src_val & ~SVM_CR0_SELECTIVE_MASK; - - if (info->intercept == x86_intercept_lmsw) { - cr0 &= 0xfUL; - val &= 0xfUL; - /* lmsw can't clear PE - catch this here */ - if (cr0 & X86_CR0_PE) - val |= X86_CR0_PE; - } - - if (cr0 ^ val) - icpt_info.exit_code = SVM_EXIT_CR0_SEL_WRITE; - - break; - } - case SVM_EXIT_READ_DR0: - case SVM_EXIT_WRITE_DR0: - icpt_info.exit_code += info->modrm_reg; - break; - case SVM_EXIT_MSR: - if (info->intercept == x86_intercept_wrmsr) - vmcb->control.exit_info_1 = 1; - else - vmcb->control.exit_info_1 = 0; - break; - case SVM_EXIT_PAUSE: - /* - * We get this for NOP only, but pause - * is rep not, check this here - */ - if (info->rep_prefix != REPE_PREFIX) - goto out; - break; - case SVM_EXIT_IOIO: { - u64 exit_info; - u32 bytes; - - if (info->intercept == x86_intercept_in || - info->intercept == x86_intercept_ins) { - exit_info = ((info->src_val & 0xffff) << 16) | - SVM_IOIO_TYPE_MASK; - bytes = info->dst_bytes; - } else { - exit_info = (info->dst_val & 0xffff) << 16; - bytes = info->src_bytes; - } - - if (info->intercept == x86_intercept_outs || - info->intercept == x86_intercept_ins) - exit_info |= SVM_IOIO_STR_MASK; - - if (info->rep_prefix) - exit_info |= SVM_IOIO_REP_MASK; - - bytes = min(bytes, 4u); - - exit_info |= bytes << SVM_IOIO_SIZE_SHIFT; - - exit_info |= (u32)info->ad_bytes << (SVM_IOIO_ASIZE_SHIFT - 1); - - vmcb->control.exit_info_1 = exit_info; - vmcb->control.exit_info_2 = info->next_rip; - - break; - } - default: - break; - } - - /* TODO: Advertise NRIPS to guest hypervisor unconditionally */ - if (static_cpu_has(X86_FEATURE_NRIPS)) - vmcb->control.next_rip = info->next_rip; - vmcb->control.exit_code = icpt_info.exit_code; - vmexit = nested_svm_exit_handled(svm); - - ret = (vmexit == NESTED_EXIT_DONE) ? X86EMUL_INTERCEPTED - : X86EMUL_CONTINUE; - -out: - return ret; -} - -static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu, - enum exit_fastpath_completion *exit_fastpath) -{ - if (!is_guest_mode(vcpu) && - to_svm(vcpu)->vmcb->control.exit_code == SVM_EXIT_MSR && - to_svm(vcpu)->vmcb->control.exit_info_1) - *exit_fastpath = handle_fastpath_set_msr_irqoff(vcpu); -} - -static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu) -{ - if (pause_filter_thresh) - shrink_ple_window(vcpu); -} - -static inline void avic_post_state_restore(struct kvm_vcpu *vcpu) -{ - if (avic_handle_apic_id_update(vcpu) != 0) - return; - avic_handle_dfr_update(vcpu); - avic_handle_ldr_update(vcpu); -} - -static void svm_setup_mce(struct kvm_vcpu *vcpu) -{ - /* [63:9] are reserved. */ - vcpu->arch.mcg_cap &= 0x1ff; -} - -static int svm_smi_allowed(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - /* Per APM Vol.2 15.22.2 "Response to SMI" */ - if (!gif_set(svm)) - return 0; - - if (is_guest_mode(&svm->vcpu) && - svm->nested.intercept & (1ULL << INTERCEPT_SMI)) { - /* TODO: Might need to set exit_info_1 and exit_info_2 here */ - svm->vmcb->control.exit_code = SVM_EXIT_SMI; - svm->nested.exit_required = true; - return 0; - } - - return 1; -} - -static int svm_pre_enter_smm(struct kvm_vcpu *vcpu, char *smstate) -{ - struct vcpu_svm *svm = to_svm(vcpu); - int ret; - - if (is_guest_mode(vcpu)) { - /* FED8h - SVM Guest */ - put_smstate(u64, smstate, 0x7ed8, 1); - /* FEE0h - SVM Guest VMCB Physical Address */ - put_smstate(u64, smstate, 0x7ee0, svm->nested.vmcb); - - svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX]; - svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP]; - svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP]; - - ret = nested_svm_vmexit(svm); - if (ret) - return ret; - } - return 0; -} - -static int svm_pre_leave_smm(struct kvm_vcpu *vcpu, const char *smstate) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb *nested_vmcb; - struct kvm_host_map map; - u64 guest; - u64 vmcb; - - guest = GET_SMSTATE(u64, smstate, 0x7ed8); - vmcb = GET_SMSTATE(u64, smstate, 0x7ee0); - - if (guest) { - if (kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb), &map) == -EINVAL) - return 1; - nested_vmcb = map.hva; - enter_svm_guest_mode(svm, vmcb, nested_vmcb, &map); - } - return 0; -} - -static int enable_smi_window(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - if (!gif_set(svm)) { - if (vgif_enabled(svm)) - set_intercept(svm, INTERCEPT_STGI); - /* STGI will cause a vm exit */ - return 1; - } - return 0; -} - -static int sev_flush_asids(void) -{ - int ret, error; - - /* - * DEACTIVATE will clear the WBINVD indicator causing DF_FLUSH to fail, - * so it must be guarded. - */ - down_write(&sev_deactivate_lock); - - wbinvd_on_all_cpus(); - ret = sev_guest_df_flush(&error); - - up_write(&sev_deactivate_lock); - - if (ret) - pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error); - - return ret; -} - -/* Must be called with the sev_bitmap_lock held */ -static bool __sev_recycle_asids(void) -{ - int pos; - - /* Check if there are any ASIDs to reclaim before performing a flush */ - pos = find_next_bit(sev_reclaim_asid_bitmap, - max_sev_asid, min_sev_asid - 1); - if (pos >= max_sev_asid) - return false; - - if (sev_flush_asids()) - return false; - - bitmap_xor(sev_asid_bitmap, sev_asid_bitmap, sev_reclaim_asid_bitmap, - max_sev_asid); - bitmap_zero(sev_reclaim_asid_bitmap, max_sev_asid); - - return true; -} - -static int sev_asid_new(void) -{ - bool retry = true; - int pos; - - mutex_lock(&sev_bitmap_lock); - - /* - * SEV-enabled guest must use asid from min_sev_asid to max_sev_asid. - */ -again: - pos = find_next_zero_bit(sev_asid_bitmap, max_sev_asid, min_sev_asid - 1); - if (pos >= max_sev_asid) { - if (retry && __sev_recycle_asids()) { - retry = false; - goto again; - } - mutex_unlock(&sev_bitmap_lock); - return -EBUSY; - } - - __set_bit(pos, sev_asid_bitmap); - - mutex_unlock(&sev_bitmap_lock); - - return pos + 1; -} - -static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - int asid, ret; - - ret = -EBUSY; - if (unlikely(sev->active)) - return ret; - - asid = sev_asid_new(); - if (asid < 0) - return ret; - - ret = sev_platform_init(&argp->error); - if (ret) - goto e_free; - - sev->active = true; - sev->asid = asid; - INIT_LIST_HEAD(&sev->regions_list); - - return 0; - -e_free: - sev_asid_free(asid); - return ret; -} - -static int sev_bind_asid(struct kvm *kvm, unsigned int handle, int *error) -{ - struct sev_data_activate *data; - int asid = sev_get_asid(kvm); - int ret; - - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - return -ENOMEM; - - /* activate ASID on the given handle */ - data->handle = handle; - data->asid = asid; - ret = sev_guest_activate(data, error); - kfree(data); - - return ret; -} - -static int __sev_issue_cmd(int fd, int id, void *data, int *error) -{ - struct fd f; - int ret; - - f = fdget(fd); - if (!f.file) - return -EBADF; - - ret = sev_issue_cmd_external_user(f.file, id, data, error); - - fdput(f); - return ret; -} - -static int sev_issue_cmd(struct kvm *kvm, int id, void *data, int *error) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - - return __sev_issue_cmd(sev->fd, id, data, error); -} - -static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct sev_data_launch_start *start; - struct kvm_sev_launch_start params; - void *dh_blob, *session_blob; - int *error = &argp->error; - int ret; - - if (!sev_guest(kvm)) - return -ENOTTY; - - if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) - return -EFAULT; - - start = kzalloc(sizeof(*start), GFP_KERNEL_ACCOUNT); - if (!start) - return -ENOMEM; - - dh_blob = NULL; - if (params.dh_uaddr) { - dh_blob = psp_copy_user_blob(params.dh_uaddr, params.dh_len); - if (IS_ERR(dh_blob)) { - ret = PTR_ERR(dh_blob); - goto e_free; - } - - start->dh_cert_address = __sme_set(__pa(dh_blob)); - start->dh_cert_len = params.dh_len; - } - - session_blob = NULL; - if (params.session_uaddr) { - session_blob = psp_copy_user_blob(params.session_uaddr, params.session_len); - if (IS_ERR(session_blob)) { - ret = PTR_ERR(session_blob); - goto e_free_dh; - } - - start->session_address = __sme_set(__pa(session_blob)); - start->session_len = params.session_len; - } - - start->handle = params.handle; - start->policy = params.policy; - - /* create memory encryption context */ - ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_LAUNCH_START, start, error); - if (ret) - goto e_free_session; - - /* Bind ASID to this guest */ - ret = sev_bind_asid(kvm, start->handle, error); - if (ret) - goto e_free_session; - - /* return handle to userspace */ - params.handle = start->handle; - if (copy_to_user((void __user *)(uintptr_t)argp->data, ¶ms, sizeof(params))) { - sev_unbind_asid(kvm, start->handle); - ret = -EFAULT; - goto e_free_session; - } - - sev->handle = start->handle; - sev->fd = argp->sev_fd; - -e_free_session: - kfree(session_blob); -e_free_dh: - kfree(dh_blob); -e_free: - kfree(start); - return ret; -} - -static unsigned long get_num_contig_pages(unsigned long idx, - struct page **inpages, unsigned long npages) -{ - unsigned long paddr, next_paddr; - unsigned long i = idx + 1, pages = 1; - - /* find the number of contiguous pages starting from idx */ - paddr = __sme_page_pa(inpages[idx]); - while (i < npages) { - next_paddr = __sme_page_pa(inpages[i++]); - if ((paddr + PAGE_SIZE) == next_paddr) { - pages++; - paddr = next_paddr; - continue; - } - break; - } - - return pages; -} - -static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i; - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct kvm_sev_launch_update_data params; - struct sev_data_launch_update_data *data; - struct page **inpages; - int ret; - - if (!sev_guest(kvm)) - return -ENOTTY; - - if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) - return -EFAULT; - - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - return -ENOMEM; - - vaddr = params.uaddr; - size = params.len; - vaddr_end = vaddr + size; - - /* Lock the user memory. */ - inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1); - if (!inpages) { - ret = -ENOMEM; - goto e_free; - } - - /* - * The LAUNCH_UPDATE command will perform in-place encryption of the - * memory content (i.e it will write the same memory region with C=1). - * It's possible that the cache may contain the data with C=0, i.e., - * unencrypted so invalidate it first. - */ - sev_clflush_pages(inpages, npages); - - for (i = 0; vaddr < vaddr_end; vaddr = next_vaddr, i += pages) { - int offset, len; - - /* - * If the user buffer is not page-aligned, calculate the offset - * within the page. - */ - offset = vaddr & (PAGE_SIZE - 1); - - /* Calculate the number of pages that can be encrypted in one go. */ - pages = get_num_contig_pages(i, inpages, npages); - - len = min_t(size_t, ((pages * PAGE_SIZE) - offset), size); - - data->handle = sev->handle; - data->len = len; - data->address = __sme_page_pa(inpages[i]) + offset; - ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_DATA, data, &argp->error); - if (ret) - goto e_unpin; - - size -= len; - next_vaddr = vaddr + len; - } - -e_unpin: - /* content of memory is updated, mark pages dirty */ - for (i = 0; i < npages; i++) { - set_page_dirty_lock(inpages[i]); - mark_page_accessed(inpages[i]); - } - /* unlock the user pages */ - sev_unpin_memory(kvm, inpages, npages); -e_free: - kfree(data); - return ret; -} - -static int sev_launch_measure(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - void __user *measure = (void __user *)(uintptr_t)argp->data; - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct sev_data_launch_measure *data; - struct kvm_sev_launch_measure params; - void __user *p = NULL; - void *blob = NULL; - int ret; - - if (!sev_guest(kvm)) - return -ENOTTY; - - if (copy_from_user(¶ms, measure, sizeof(params))) - return -EFAULT; - - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - return -ENOMEM; - - /* User wants to query the blob length */ - if (!params.len) - goto cmd; - - p = (void __user *)(uintptr_t)params.uaddr; - if (p) { - if (params.len > SEV_FW_BLOB_MAX_SIZE) { - ret = -EINVAL; - goto e_free; - } - - ret = -ENOMEM; - blob = kmalloc(params.len, GFP_KERNEL); - if (!blob) - goto e_free; - - data->address = __psp_pa(blob); - data->len = params.len; - } - -cmd: - data->handle = sev->handle; - ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_MEASURE, data, &argp->error); - - /* - * If we query the session length, FW responded with expected data. - */ - if (!params.len) - goto done; - - if (ret) - goto e_free_blob; - - if (blob) { - if (copy_to_user(p, blob, params.len)) - ret = -EFAULT; - } - -done: - params.len = data->len; - if (copy_to_user(measure, ¶ms, sizeof(params))) - ret = -EFAULT; -e_free_blob: - kfree(blob); -e_free: - kfree(data); - return ret; -} - -static int sev_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct sev_data_launch_finish *data; - int ret; - - if (!sev_guest(kvm)) - return -ENOTTY; - - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - return -ENOMEM; - - data->handle = sev->handle; - ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_FINISH, data, &argp->error); - - kfree(data); - return ret; -} - -static int sev_guest_status(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct kvm_sev_guest_status params; - struct sev_data_guest_status *data; - int ret; - - if (!sev_guest(kvm)) - return -ENOTTY; - - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - return -ENOMEM; - - data->handle = sev->handle; - ret = sev_issue_cmd(kvm, SEV_CMD_GUEST_STATUS, data, &argp->error); - if (ret) - goto e_free; - - params.policy = data->policy; - params.state = data->state; - params.handle = data->handle; - - if (copy_to_user((void __user *)(uintptr_t)argp->data, ¶ms, sizeof(params))) - ret = -EFAULT; -e_free: - kfree(data); - return ret; -} - -static int __sev_issue_dbg_cmd(struct kvm *kvm, unsigned long src, - unsigned long dst, int size, - int *error, bool enc) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct sev_data_dbg *data; - int ret; - - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - return -ENOMEM; - - data->handle = sev->handle; - data->dst_addr = dst; - data->src_addr = src; - data->len = size; - - ret = sev_issue_cmd(kvm, - enc ? SEV_CMD_DBG_ENCRYPT : SEV_CMD_DBG_DECRYPT, - data, error); - kfree(data); - return ret; -} - -static int __sev_dbg_decrypt(struct kvm *kvm, unsigned long src_paddr, - unsigned long dst_paddr, int sz, int *err) -{ - int offset; - - /* - * Its safe to read more than we are asked, caller should ensure that - * destination has enough space. - */ - src_paddr = round_down(src_paddr, 16); - offset = src_paddr & 15; - sz = round_up(sz + offset, 16); - - return __sev_issue_dbg_cmd(kvm, src_paddr, dst_paddr, sz, err, false); -} - -static int __sev_dbg_decrypt_user(struct kvm *kvm, unsigned long paddr, - unsigned long __user dst_uaddr, - unsigned long dst_paddr, - int size, int *err) -{ - struct page *tpage = NULL; - int ret, offset; - - /* if inputs are not 16-byte then use intermediate buffer */ - if (!IS_ALIGNED(dst_paddr, 16) || - !IS_ALIGNED(paddr, 16) || - !IS_ALIGNED(size, 16)) { - tpage = (void *)alloc_page(GFP_KERNEL); - if (!tpage) - return -ENOMEM; - - dst_paddr = __sme_page_pa(tpage); - } - - ret = __sev_dbg_decrypt(kvm, paddr, dst_paddr, size, err); - if (ret) - goto e_free; - - if (tpage) { - offset = paddr & 15; - if (copy_to_user((void __user *)(uintptr_t)dst_uaddr, - page_address(tpage) + offset, size)) - ret = -EFAULT; - } - -e_free: - if (tpage) - __free_page(tpage); - - return ret; -} - -static int __sev_dbg_encrypt_user(struct kvm *kvm, unsigned long paddr, - unsigned long __user vaddr, - unsigned long dst_paddr, - unsigned long __user dst_vaddr, - int size, int *error) -{ - struct page *src_tpage = NULL; - struct page *dst_tpage = NULL; - int ret, len = size; - - /* If source buffer is not aligned then use an intermediate buffer */ - if (!IS_ALIGNED(vaddr, 16)) { - src_tpage = alloc_page(GFP_KERNEL); - if (!src_tpage) - return -ENOMEM; - - if (copy_from_user(page_address(src_tpage), - (void __user *)(uintptr_t)vaddr, size)) { - __free_page(src_tpage); - return -EFAULT; - } - - paddr = __sme_page_pa(src_tpage); - } - - /* - * If destination buffer or length is not aligned then do read-modify-write: - * - decrypt destination in an intermediate buffer - * - copy the source buffer in an intermediate buffer - * - use the intermediate buffer as source buffer - */ - if (!IS_ALIGNED(dst_vaddr, 16) || !IS_ALIGNED(size, 16)) { - int dst_offset; - - dst_tpage = alloc_page(GFP_KERNEL); - if (!dst_tpage) { - ret = -ENOMEM; - goto e_free; - } - - ret = __sev_dbg_decrypt(kvm, dst_paddr, - __sme_page_pa(dst_tpage), size, error); - if (ret) - goto e_free; - - /* - * If source is kernel buffer then use memcpy() otherwise - * copy_from_user(). - */ - dst_offset = dst_paddr & 15; - - if (src_tpage) - memcpy(page_address(dst_tpage) + dst_offset, - page_address(src_tpage), size); - else { - if (copy_from_user(page_address(dst_tpage) + dst_offset, - (void __user *)(uintptr_t)vaddr, size)) { - ret = -EFAULT; - goto e_free; - } - } - - paddr = __sme_page_pa(dst_tpage); - dst_paddr = round_down(dst_paddr, 16); - len = round_up(size, 16); - } - - ret = __sev_issue_dbg_cmd(kvm, paddr, dst_paddr, len, error, true); - -e_free: - if (src_tpage) - __free_page(src_tpage); - if (dst_tpage) - __free_page(dst_tpage); - return ret; -} - -static int sev_dbg_crypt(struct kvm *kvm, struct kvm_sev_cmd *argp, bool dec) -{ - unsigned long vaddr, vaddr_end, next_vaddr; - unsigned long dst_vaddr; - struct page **src_p, **dst_p; - struct kvm_sev_dbg debug; - unsigned long n; - unsigned int size; - int ret; - - if (!sev_guest(kvm)) - return -ENOTTY; - - if (copy_from_user(&debug, (void __user *)(uintptr_t)argp->data, sizeof(debug))) - return -EFAULT; - - if (!debug.len || debug.src_uaddr + debug.len < debug.src_uaddr) - return -EINVAL; - if (!debug.dst_uaddr) - return -EINVAL; - - vaddr = debug.src_uaddr; - size = debug.len; - vaddr_end = vaddr + size; - dst_vaddr = debug.dst_uaddr; - - for (; vaddr < vaddr_end; vaddr = next_vaddr) { - int len, s_off, d_off; - - /* lock userspace source and destination page */ - src_p = sev_pin_memory(kvm, vaddr & PAGE_MASK, PAGE_SIZE, &n, 0); - if (!src_p) - return -EFAULT; - - dst_p = sev_pin_memory(kvm, dst_vaddr & PAGE_MASK, PAGE_SIZE, &n, 1); - if (!dst_p) { - sev_unpin_memory(kvm, src_p, n); - return -EFAULT; - } - - /* - * The DBG_{DE,EN}CRYPT commands will perform {dec,en}cryption of the - * memory content (i.e it will write the same memory region with C=1). - * It's possible that the cache may contain the data with C=0, i.e., - * unencrypted so invalidate it first. - */ - sev_clflush_pages(src_p, 1); - sev_clflush_pages(dst_p, 1); - - /* - * Since user buffer may not be page aligned, calculate the - * offset within the page. - */ - s_off = vaddr & ~PAGE_MASK; - d_off = dst_vaddr & ~PAGE_MASK; - len = min_t(size_t, (PAGE_SIZE - s_off), size); - - if (dec) - ret = __sev_dbg_decrypt_user(kvm, - __sme_page_pa(src_p[0]) + s_off, - dst_vaddr, - __sme_page_pa(dst_p[0]) + d_off, - len, &argp->error); - else - ret = __sev_dbg_encrypt_user(kvm, - __sme_page_pa(src_p[0]) + s_off, - vaddr, - __sme_page_pa(dst_p[0]) + d_off, - dst_vaddr, - len, &argp->error); - - sev_unpin_memory(kvm, src_p, n); - sev_unpin_memory(kvm, dst_p, n); - - if (ret) - goto err; - - next_vaddr = vaddr + len; - dst_vaddr = dst_vaddr + len; - size -= len; - } -err: - return ret; -} - -static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct sev_data_launch_secret *data; - struct kvm_sev_launch_secret params; - struct page **pages; - void *blob, *hdr; - unsigned long n; - int ret, offset; - - if (!sev_guest(kvm)) - return -ENOTTY; - - if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) - return -EFAULT; - - pages = sev_pin_memory(kvm, params.guest_uaddr, params.guest_len, &n, 1); - if (!pages) - return -ENOMEM; - - /* - * The secret must be copied into contiguous memory region, lets verify - * that userspace memory pages are contiguous before we issue command. - */ - if (get_num_contig_pages(0, pages, n) != n) { - ret = -EINVAL; - goto e_unpin_memory; - } - - ret = -ENOMEM; - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - goto e_unpin_memory; - - offset = params.guest_uaddr & (PAGE_SIZE - 1); - data->guest_address = __sme_page_pa(pages[0]) + offset; - data->guest_len = params.guest_len; - - blob = psp_copy_user_blob(params.trans_uaddr, params.trans_len); - if (IS_ERR(blob)) { - ret = PTR_ERR(blob); - goto e_free; - } - - data->trans_address = __psp_pa(blob); - data->trans_len = params.trans_len; - - hdr = psp_copy_user_blob(params.hdr_uaddr, params.hdr_len); - if (IS_ERR(hdr)) { - ret = PTR_ERR(hdr); - goto e_free_blob; - } - data->hdr_address = __psp_pa(hdr); - data->hdr_len = params.hdr_len; - - data->handle = sev->handle; - ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_SECRET, data, &argp->error); - - kfree(hdr); - -e_free_blob: - kfree(blob); -e_free: - kfree(data); -e_unpin_memory: - sev_unpin_memory(kvm, pages, n); - return ret; -} - -static int svm_mem_enc_op(struct kvm *kvm, void __user *argp) -{ - struct kvm_sev_cmd sev_cmd; - int r; - - if (!svm_sev_enabled()) - return -ENOTTY; - - if (!argp) - return 0; - - if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd))) - return -EFAULT; - - mutex_lock(&kvm->lock); - - switch (sev_cmd.id) { - case KVM_SEV_INIT: - r = sev_guest_init(kvm, &sev_cmd); - break; - case KVM_SEV_LAUNCH_START: - r = sev_launch_start(kvm, &sev_cmd); - break; - case KVM_SEV_LAUNCH_UPDATE_DATA: - r = sev_launch_update_data(kvm, &sev_cmd); - break; - case KVM_SEV_LAUNCH_MEASURE: - r = sev_launch_measure(kvm, &sev_cmd); - break; - case KVM_SEV_LAUNCH_FINISH: - r = sev_launch_finish(kvm, &sev_cmd); - break; - case KVM_SEV_GUEST_STATUS: - r = sev_guest_status(kvm, &sev_cmd); - break; - case KVM_SEV_DBG_DECRYPT: - r = sev_dbg_crypt(kvm, &sev_cmd, true); - break; - case KVM_SEV_DBG_ENCRYPT: - r = sev_dbg_crypt(kvm, &sev_cmd, false); - break; - case KVM_SEV_LAUNCH_SECRET: - r = sev_launch_secret(kvm, &sev_cmd); - break; - default: - r = -EINVAL; - goto out; - } - - if (copy_to_user(argp, &sev_cmd, sizeof(struct kvm_sev_cmd))) - r = -EFAULT; - -out: - mutex_unlock(&kvm->lock); - return r; -} - -static int svm_register_enc_region(struct kvm *kvm, - struct kvm_enc_region *range) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct enc_region *region; - int ret = 0; - - if (!sev_guest(kvm)) - return -ENOTTY; - - if (range->addr > ULONG_MAX || range->size > ULONG_MAX) - return -EINVAL; - - region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT); - if (!region) - return -ENOMEM; - - region->pages = sev_pin_memory(kvm, range->addr, range->size, ®ion->npages, 1); - if (!region->pages) { - ret = -ENOMEM; - goto e_free; - } - - /* - * The guest may change the memory encryption attribute from C=0 -> C=1 - * or vice versa for this memory range. Lets make sure caches are - * flushed to ensure that guest data gets written into memory with - * correct C-bit. - */ - sev_clflush_pages(region->pages, region->npages); - - region->uaddr = range->addr; - region->size = range->size; - - mutex_lock(&kvm->lock); - list_add_tail(®ion->list, &sev->regions_list); - mutex_unlock(&kvm->lock); - - return ret; - -e_free: - kfree(region); - return ret; -} - -static struct enc_region * -find_enc_region(struct kvm *kvm, struct kvm_enc_region *range) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct list_head *head = &sev->regions_list; - struct enc_region *i; - - list_for_each_entry(i, head, list) { - if (i->uaddr == range->addr && - i->size == range->size) - return i; - } - - return NULL; -} - - -static int svm_unregister_enc_region(struct kvm *kvm, - struct kvm_enc_region *range) -{ - struct enc_region *region; - int ret; - - mutex_lock(&kvm->lock); - - if (!sev_guest(kvm)) { - ret = -ENOTTY; - goto failed; - } - - region = find_enc_region(kvm, range); - if (!region) { - ret = -EINVAL; - goto failed; - } - - /* - * Ensure that all guest tagged cache entries are flushed before - * releasing the pages back to the system for use. CLFLUSH will - * not do this, so issue a WBINVD. - */ - wbinvd_on_all_cpus(); - - __unregister_enc_region_locked(kvm, region); - - mutex_unlock(&kvm->lock); - return 0; - -failed: - mutex_unlock(&kvm->lock); - return ret; -} - -static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu) -{ - unsigned long cr4 = kvm_read_cr4(vcpu); - bool smep = cr4 & X86_CR4_SMEP; - bool smap = cr4 & X86_CR4_SMAP; - bool is_user = svm_get_cpl(vcpu) == 3; - - /* - * Detect and workaround Errata 1096 Fam_17h_00_0Fh. - * - * Errata: - * When CPU raise #NPF on guest data access and vCPU CR4.SMAP=1, it is - * possible that CPU microcode implementing DecodeAssist will fail - * to read bytes of instruction which caused #NPF. In this case, - * GuestIntrBytes field of the VMCB on a VMEXIT will incorrectly - * return 0 instead of the correct guest instruction bytes. - * - * This happens because CPU microcode reading instruction bytes - * uses a special opcode which attempts to read data using CPL=0 - * priviledges. The microcode reads CS:RIP and if it hits a SMAP - * fault, it gives up and returns no instruction bytes. - * - * Detection: - * We reach here in case CPU supports DecodeAssist, raised #NPF and - * returned 0 in GuestIntrBytes field of the VMCB. - * First, errata can only be triggered in case vCPU CR4.SMAP=1. - * Second, if vCPU CR4.SMEP=1, errata could only be triggered - * in case vCPU CPL==3 (Because otherwise guest would have triggered - * a SMEP fault instead of #NPF). - * Otherwise, vCPU CR4.SMEP=0, errata could be triggered by any vCPU CPL. - * As most guests enable SMAP if they have also enabled SMEP, use above - * logic in order to attempt minimize false-positive of detecting errata - * while still preserving all cases semantic correctness. - * - * Workaround: - * To determine what instruction the guest was executing, the hypervisor - * will have to decode the instruction at the instruction pointer. - * - * In non SEV guest, hypervisor will be able to read the guest - * memory to decode the instruction pointer when insn_len is zero - * so we return true to indicate that decoding is possible. - * - * But in the SEV guest, the guest memory is encrypted with the - * guest specific key and hypervisor will not be able to decode the - * instruction pointer so we will not able to workaround it. Lets - * print the error and request to kill the guest. - */ - if (smap && (!smep || is_user)) { - if (!sev_guest(vcpu->kvm)) - return true; - - pr_err_ratelimited("KVM: SEV Guest triggered AMD Erratum 1096\n"); - kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); - } - - return false; -} - -static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - /* - * TODO: Last condition latch INIT signals on vCPU when - * vCPU is in guest-mode and vmcb12 defines intercept on INIT. - * To properly emulate the INIT intercept, SVM should implement - * kvm_x86_ops.check_nested_events() and call nested_svm_vmexit() - * there if an INIT signal is pending. - */ - return !gif_set(svm) || - (svm->vmcb->control.intercept & (1ULL << INTERCEPT_INIT)); -} - -static bool svm_check_apicv_inhibit_reasons(ulong bit) -{ - ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) | - BIT(APICV_INHIBIT_REASON_HYPERV) | - BIT(APICV_INHIBIT_REASON_NESTED) | - BIT(APICV_INHIBIT_REASON_IRQWIN) | - BIT(APICV_INHIBIT_REASON_PIT_REINJ) | - BIT(APICV_INHIBIT_REASON_X2APIC); - - return supported & BIT(bit); -} - -static void svm_pre_update_apicv_exec_ctrl(struct kvm *kvm, bool activate) -{ - avic_update_access_page(kvm, activate); -} - -static struct kvm_x86_ops svm_x86_ops __initdata = { - .hardware_unsetup = svm_hardware_teardown, - .hardware_enable = svm_hardware_enable, - .hardware_disable = svm_hardware_disable, - .cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr, - .has_emulated_msr = svm_has_emulated_msr, - - .vcpu_create = svm_create_vcpu, - .vcpu_free = svm_free_vcpu, - .vcpu_reset = svm_vcpu_reset, - - .vm_size = sizeof(struct kvm_svm), - .vm_init = svm_vm_init, - .vm_destroy = svm_vm_destroy, - - .prepare_guest_switch = svm_prepare_guest_switch, - .vcpu_load = svm_vcpu_load, - .vcpu_put = svm_vcpu_put, - .vcpu_blocking = svm_vcpu_blocking, - .vcpu_unblocking = svm_vcpu_unblocking, - - .update_bp_intercept = update_bp_intercept, - .get_msr_feature = svm_get_msr_feature, - .get_msr = svm_get_msr, - .set_msr = svm_set_msr, - .get_segment_base = svm_get_segment_base, - .get_segment = svm_get_segment, - .set_segment = svm_set_segment, - .get_cpl = svm_get_cpl, - .get_cs_db_l_bits = kvm_get_cs_db_l_bits, - .decache_cr0_guest_bits = svm_decache_cr0_guest_bits, - .decache_cr4_guest_bits = svm_decache_cr4_guest_bits, - .set_cr0 = svm_set_cr0, - .set_cr4 = svm_set_cr4, - .set_efer = svm_set_efer, - .get_idt = svm_get_idt, - .set_idt = svm_set_idt, - .get_gdt = svm_get_gdt, - .set_gdt = svm_set_gdt, - .get_dr6 = svm_get_dr6, - .set_dr6 = svm_set_dr6, - .set_dr7 = svm_set_dr7, - .sync_dirty_debug_regs = svm_sync_dirty_debug_regs, - .cache_reg = svm_cache_reg, - .get_rflags = svm_get_rflags, - .set_rflags = svm_set_rflags, - - .tlb_flush = svm_flush_tlb, - .tlb_flush_gva = svm_flush_tlb_gva, - - .run = svm_vcpu_run, - .handle_exit = handle_exit, - .skip_emulated_instruction = skip_emulated_instruction, - .update_emulated_instruction = NULL, - .set_interrupt_shadow = svm_set_interrupt_shadow, - .get_interrupt_shadow = svm_get_interrupt_shadow, - .patch_hypercall = svm_patch_hypercall, - .set_irq = svm_set_irq, - .set_nmi = svm_inject_nmi, - .queue_exception = svm_queue_exception, - .cancel_injection = svm_cancel_injection, - .interrupt_allowed = svm_interrupt_allowed, - .nmi_allowed = svm_nmi_allowed, - .get_nmi_mask = svm_get_nmi_mask, - .set_nmi_mask = svm_set_nmi_mask, - .enable_nmi_window = enable_nmi_window, - .enable_irq_window = enable_irq_window, - .update_cr8_intercept = update_cr8_intercept, - .set_virtual_apic_mode = svm_set_virtual_apic_mode, - .refresh_apicv_exec_ctrl = svm_refresh_apicv_exec_ctrl, - .check_apicv_inhibit_reasons = svm_check_apicv_inhibit_reasons, - .pre_update_apicv_exec_ctrl = svm_pre_update_apicv_exec_ctrl, - .load_eoi_exitmap = svm_load_eoi_exitmap, - .hwapic_irr_update = svm_hwapic_irr_update, - .hwapic_isr_update = svm_hwapic_isr_update, - .sync_pir_to_irr = kvm_lapic_find_highest_irr, - .apicv_post_state_restore = avic_post_state_restore, - - .set_tss_addr = svm_set_tss_addr, - .set_identity_map_addr = svm_set_identity_map_addr, - .get_tdp_level = get_npt_level, - .get_mt_mask = svm_get_mt_mask, - - .get_exit_info = svm_get_exit_info, - - .cpuid_update = svm_cpuid_update, - - .has_wbinvd_exit = svm_has_wbinvd_exit, - - .read_l1_tsc_offset = svm_read_l1_tsc_offset, - .write_l1_tsc_offset = svm_write_l1_tsc_offset, - - .load_mmu_pgd = svm_load_mmu_pgd, - - .check_intercept = svm_check_intercept, - .handle_exit_irqoff = svm_handle_exit_irqoff, - - .request_immediate_exit = __kvm_request_immediate_exit, - - .sched_in = svm_sched_in, - - .pmu_ops = &amd_pmu_ops, - .deliver_posted_interrupt = svm_deliver_avic_intr, - .dy_apicv_has_pending_interrupt = svm_dy_apicv_has_pending_interrupt, - .update_pi_irte = svm_update_pi_irte, - .setup_mce = svm_setup_mce, - - .smi_allowed = svm_smi_allowed, - .pre_enter_smm = svm_pre_enter_smm, - .pre_leave_smm = svm_pre_leave_smm, - .enable_smi_window = enable_smi_window, - - .mem_enc_op = svm_mem_enc_op, - .mem_enc_reg_region = svm_register_enc_region, - .mem_enc_unreg_region = svm_unregister_enc_region, - - .nested_enable_evmcs = NULL, - .nested_get_evmcs_version = NULL, - - .need_emulation_on_page_fault = svm_need_emulation_on_page_fault, - - .apic_init_signal_blocked = svm_apic_init_signal_blocked, - - .check_nested_events = svm_check_nested_events, -}; - -static struct kvm_x86_init_ops svm_init_ops __initdata = { - .cpu_has_kvm_support = has_svm, - .disabled_by_bios = is_disabled, - .hardware_setup = svm_hardware_setup, - .check_processor_compatibility = svm_check_processor_compat, - - .runtime_ops = &svm_x86_ops, -}; - -static int __init svm_init(void) -{ - return kvm_init(&svm_init_ops, sizeof(struct vcpu_svm), - __alignof__(struct vcpu_svm), THIS_MODULE); -} - -static void __exit svm_exit(void) -{ - kvm_exit(); -} - -module_init(svm_init) -module_exit(svm_exit) diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c new file mode 100644 index 000000000000..ce0b10fe5e2b --- /dev/null +++ b/arch/x86/kvm/svm/pmu.c @@ -0,0 +1,327 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * KVM PMU support for AMD + * + * Copyright 2015, Red Hat, Inc. and/or its affiliates. + * + * Author: + * Wei Huang + * + * Implementation is based on pmu_intel.c file + */ +#include +#include +#include +#include "x86.h" +#include "cpuid.h" +#include "lapic.h" +#include "pmu.h" + +enum pmu_type { + PMU_TYPE_COUNTER = 0, + PMU_TYPE_EVNTSEL, +}; + +enum index { + INDEX_ZERO = 0, + INDEX_ONE, + INDEX_TWO, + INDEX_THREE, + INDEX_FOUR, + INDEX_FIVE, + INDEX_ERROR, +}; + +/* duplicated from amd_perfmon_event_map, K7 and above should work. */ +static struct kvm_event_hw_type_mapping amd_event_mapping[] = { + [0] = { 0x76, 0x00, PERF_COUNT_HW_CPU_CYCLES }, + [1] = { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS }, + [2] = { 0x7d, 0x07, PERF_COUNT_HW_CACHE_REFERENCES }, + [3] = { 0x7e, 0x07, PERF_COUNT_HW_CACHE_MISSES }, + [4] = { 0xc2, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS }, + [5] = { 0xc3, 0x00, PERF_COUNT_HW_BRANCH_MISSES }, + [6] = { 0xd0, 0x00, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND }, + [7] = { 0xd1, 0x00, PERF_COUNT_HW_STALLED_CYCLES_BACKEND }, +}; + +static unsigned int get_msr_base(struct kvm_pmu *pmu, enum pmu_type type) +{ + struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu); + + if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) { + if (type == PMU_TYPE_COUNTER) + return MSR_F15H_PERF_CTR; + else + return MSR_F15H_PERF_CTL; + } else { + if (type == PMU_TYPE_COUNTER) + return MSR_K7_PERFCTR0; + else + return MSR_K7_EVNTSEL0; + } +} + +static enum index msr_to_index(u32 msr) +{ + switch (msr) { + case MSR_F15H_PERF_CTL0: + case MSR_F15H_PERF_CTR0: + case MSR_K7_EVNTSEL0: + case MSR_K7_PERFCTR0: + return INDEX_ZERO; + case MSR_F15H_PERF_CTL1: + case MSR_F15H_PERF_CTR1: + case MSR_K7_EVNTSEL1: + case MSR_K7_PERFCTR1: + return INDEX_ONE; + case MSR_F15H_PERF_CTL2: + case MSR_F15H_PERF_CTR2: + case MSR_K7_EVNTSEL2: + case MSR_K7_PERFCTR2: + return INDEX_TWO; + case MSR_F15H_PERF_CTL3: + case MSR_F15H_PERF_CTR3: + case MSR_K7_EVNTSEL3: + case MSR_K7_PERFCTR3: + return INDEX_THREE; + case MSR_F15H_PERF_CTL4: + case MSR_F15H_PERF_CTR4: + return INDEX_FOUR; + case MSR_F15H_PERF_CTL5: + case MSR_F15H_PERF_CTR5: + return INDEX_FIVE; + default: + return INDEX_ERROR; + } +} + +static inline struct kvm_pmc *get_gp_pmc_amd(struct kvm_pmu *pmu, u32 msr, + enum pmu_type type) +{ + switch (msr) { + case MSR_F15H_PERF_CTL0: + case MSR_F15H_PERF_CTL1: + case MSR_F15H_PERF_CTL2: + case MSR_F15H_PERF_CTL3: + case MSR_F15H_PERF_CTL4: + case MSR_F15H_PERF_CTL5: + case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3: + if (type != PMU_TYPE_EVNTSEL) + return NULL; + break; + case MSR_F15H_PERF_CTR0: + case MSR_F15H_PERF_CTR1: + case MSR_F15H_PERF_CTR2: + case MSR_F15H_PERF_CTR3: + case MSR_F15H_PERF_CTR4: + case MSR_F15H_PERF_CTR5: + case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3: + if (type != PMU_TYPE_COUNTER) + return NULL; + break; + default: + return NULL; + } + + return &pmu->gp_counters[msr_to_index(msr)]; +} + +static unsigned amd_find_arch_event(struct kvm_pmu *pmu, + u8 event_select, + u8 unit_mask) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(amd_event_mapping); i++) + if (amd_event_mapping[i].eventsel == event_select + && amd_event_mapping[i].unit_mask == unit_mask) + break; + + if (i == ARRAY_SIZE(amd_event_mapping)) + return PERF_COUNT_HW_MAX; + + return amd_event_mapping[i].event_type; +} + +/* return PERF_COUNT_HW_MAX as AMD doesn't have fixed events */ +static unsigned amd_find_fixed_event(int idx) +{ + return PERF_COUNT_HW_MAX; +} + +/* check if a PMC is enabled by comparing it against global_ctrl bits. Because + * AMD CPU doesn't have global_ctrl MSR, all PMCs are enabled (return TRUE). + */ +static bool amd_pmc_is_enabled(struct kvm_pmc *pmc) +{ + return true; +} + +static struct kvm_pmc *amd_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx) +{ + unsigned int base = get_msr_base(pmu, PMU_TYPE_COUNTER); + struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu); + + if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) { + /* + * The idx is contiguous. The MSRs are not. The counter MSRs + * are interleaved with the event select MSRs. + */ + pmc_idx *= 2; + } + + return get_gp_pmc_amd(pmu, base + pmc_idx, PMU_TYPE_COUNTER); +} + +/* returns 0 if idx's corresponding MSR exists; otherwise returns 1. */ +static int amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx) +{ + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); + + idx &= ~(3u << 30); + + return (idx >= pmu->nr_arch_gp_counters); +} + +/* idx is the ECX register of RDPMC instruction */ +static struct kvm_pmc *amd_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu, + unsigned int idx, u64 *mask) +{ + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); + struct kvm_pmc *counters; + + idx &= ~(3u << 30); + if (idx >= pmu->nr_arch_gp_counters) + return NULL; + counters = pmu->gp_counters; + + return &counters[idx]; +} + +static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) +{ + /* All MSRs refer to exactly one PMC, so msr_idx_to_pmc is enough. */ + return false; +} + +static struct kvm_pmc *amd_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr) +{ + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); + struct kvm_pmc *pmc; + + pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER); + pmc = pmc ? pmc : get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL); + + return pmc; +} + +static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data) +{ + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); + struct kvm_pmc *pmc; + + /* MSR_PERFCTRn */ + pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER); + if (pmc) { + *data = pmc_read_counter(pmc); + return 0; + } + /* MSR_EVNTSELn */ + pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL); + if (pmc) { + *data = pmc->eventsel; + return 0; + } + + return 1; +} + +static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) +{ + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); + struct kvm_pmc *pmc; + u32 msr = msr_info->index; + u64 data = msr_info->data; + + /* MSR_PERFCTRn */ + pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER); + if (pmc) { + pmc->counter += data - pmc_read_counter(pmc); + return 0; + } + /* MSR_EVNTSELn */ + pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL); + if (pmc) { + if (data == pmc->eventsel) + return 0; + if (!(data & pmu->reserved_bits)) { + reprogram_gp_counter(pmc, data); + return 0; + } + } + + return 1; +} + +static void amd_pmu_refresh(struct kvm_vcpu *vcpu) +{ + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); + + if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) + pmu->nr_arch_gp_counters = AMD64_NUM_COUNTERS_CORE; + else + pmu->nr_arch_gp_counters = AMD64_NUM_COUNTERS; + + pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1; + pmu->reserved_bits = 0xffffffff00200000ull; + pmu->version = 1; + /* not applicable to AMD; but clean them to prevent any fall out */ + pmu->counter_bitmask[KVM_PMC_FIXED] = 0; + pmu->nr_arch_fixed_counters = 0; + pmu->global_status = 0; + bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters); +} + +static void amd_pmu_init(struct kvm_vcpu *vcpu) +{ + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); + int i; + + BUILD_BUG_ON(AMD64_NUM_COUNTERS_CORE > INTEL_PMC_MAX_GENERIC); + + for (i = 0; i < AMD64_NUM_COUNTERS_CORE ; i++) { + pmu->gp_counters[i].type = KVM_PMC_GP; + pmu->gp_counters[i].vcpu = vcpu; + pmu->gp_counters[i].idx = i; + pmu->gp_counters[i].current_config = 0; + } +} + +static void amd_pmu_reset(struct kvm_vcpu *vcpu) +{ + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); + int i; + + for (i = 0; i < AMD64_NUM_COUNTERS_CORE; i++) { + struct kvm_pmc *pmc = &pmu->gp_counters[i]; + + pmc_stop_counter(pmc); + pmc->counter = pmc->eventsel = 0; + } +} + +struct kvm_pmu_ops amd_pmu_ops = { + .find_arch_event = amd_find_arch_event, + .find_fixed_event = amd_find_fixed_event, + .pmc_is_enabled = amd_pmc_is_enabled, + .pmc_idx_to_pmc = amd_pmc_idx_to_pmc, + .rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc, + .msr_idx_to_pmc = amd_msr_idx_to_pmc, + .is_valid_rdpmc_ecx = amd_is_valid_rdpmc_ecx, + .is_valid_msr = amd_is_valid_msr, + .get_msr = amd_pmu_get_msr, + .set_msr = amd_pmu_set_msr, + .refresh = amd_pmu_refresh, + .init = amd_pmu_init, + .reset = amd_pmu_reset, +}; diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c new file mode 100644 index 000000000000..851e9cc79930 --- /dev/null +++ b/arch/x86/kvm/svm/svm.c @@ -0,0 +1,7514 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Kernel-based Virtual Machine driver for Linux + * + * AMD SVM support + * + * Copyright (C) 2006 Qumranet, Inc. + * Copyright 2010 Red Hat, Inc. and/or its affiliates. + * + * Authors: + * Yaniv Kamay + * Avi Kivity + */ + +#define pr_fmt(fmt) "SVM: " fmt + +#include + +#include "irq.h" +#include "mmu.h" +#include "kvm_cache_regs.h" +#include "x86.h" +#include "cpuid.h" +#include "pmu.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include "trace.h" + +#define __ex(x) __kvm_handle_fault_on_reboot(x) + +MODULE_AUTHOR("Qumranet"); +MODULE_LICENSE("GPL"); + +#ifdef MODULE +static const struct x86_cpu_id svm_cpu_id[] = { + X86_MATCH_FEATURE(X86_FEATURE_SVM, NULL), + {} +}; +MODULE_DEVICE_TABLE(x86cpu, svm_cpu_id); +#endif + +#define IOPM_ALLOC_ORDER 2 +#define MSRPM_ALLOC_ORDER 1 + +#define SEG_TYPE_LDT 2 +#define SEG_TYPE_BUSY_TSS16 3 + +#define SVM_FEATURE_LBRV (1 << 1) +#define SVM_FEATURE_SVML (1 << 2) +#define SVM_FEATURE_TSC_RATE (1 << 4) +#define SVM_FEATURE_VMCB_CLEAN (1 << 5) +#define SVM_FEATURE_FLUSH_ASID (1 << 6) +#define SVM_FEATURE_DECODE_ASSIST (1 << 7) +#define SVM_FEATURE_PAUSE_FILTER (1 << 10) + +#define SVM_AVIC_DOORBELL 0xc001011b + +#define NESTED_EXIT_HOST 0 /* Exit handled on host level */ +#define NESTED_EXIT_DONE 1 /* Exit caused nested vmexit */ +#define NESTED_EXIT_CONTINUE 2 /* Further checks needed */ + +#define DEBUGCTL_RESERVED_BITS (~(0x3fULL)) + +#define TSC_RATIO_RSVD 0xffffff0000000000ULL +#define TSC_RATIO_MIN 0x0000000000000001ULL +#define TSC_RATIO_MAX 0x000000ffffffffffULL + +#define AVIC_HPA_MASK ~((0xFFFULL << 52) | 0xFFF) + +/* + * 0xff is broadcast, so the max index allowed for physical APIC ID + * table is 0xfe. APIC IDs above 0xff are reserved. + */ +#define AVIC_MAX_PHYSICAL_ID_COUNT 255 + +#define AVIC_UNACCEL_ACCESS_WRITE_MASK 1 +#define AVIC_UNACCEL_ACCESS_OFFSET_MASK 0xFF0 +#define AVIC_UNACCEL_ACCESS_VECTOR_MASK 0xFFFFFFFF + +/* AVIC GATAG is encoded using VM and VCPU IDs */ +#define AVIC_VCPU_ID_BITS 8 +#define AVIC_VCPU_ID_MASK ((1 << AVIC_VCPU_ID_BITS) - 1) + +#define AVIC_VM_ID_BITS 24 +#define AVIC_VM_ID_NR (1 << AVIC_VM_ID_BITS) +#define AVIC_VM_ID_MASK ((1 << AVIC_VM_ID_BITS) - 1) + +#define AVIC_GATAG(x, y) (((x & AVIC_VM_ID_MASK) << AVIC_VCPU_ID_BITS) | \ + (y & AVIC_VCPU_ID_MASK)) +#define AVIC_GATAG_TO_VMID(x) ((x >> AVIC_VCPU_ID_BITS) & AVIC_VM_ID_MASK) +#define AVIC_GATAG_TO_VCPUID(x) (x & AVIC_VCPU_ID_MASK) + +static bool erratum_383_found __read_mostly; + +static const u32 host_save_user_msrs[] = { +#ifdef CONFIG_X86_64 + MSR_STAR, MSR_LSTAR, MSR_CSTAR, MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE, + MSR_FS_BASE, +#endif + MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, + MSR_TSC_AUX, +}; + +#define NR_HOST_SAVE_USER_MSRS ARRAY_SIZE(host_save_user_msrs) + +struct kvm_sev_info { + bool active; /* SEV enabled guest */ + unsigned int asid; /* ASID used for this guest */ + unsigned int handle; /* SEV firmware handle */ + int fd; /* SEV device fd */ + unsigned long pages_locked; /* Number of pages locked */ + struct list_head regions_list; /* List of registered regions */ +}; + +struct kvm_svm { + struct kvm kvm; + + /* Struct members for AVIC */ + u32 avic_vm_id; + struct page *avic_logical_id_table_page; + struct page *avic_physical_id_table_page; + struct hlist_node hnode; + + struct kvm_sev_info sev_info; +}; + +struct kvm_vcpu; + +struct nested_state { + struct vmcb *hsave; + u64 hsave_msr; + u64 vm_cr_msr; + u64 vmcb; + + /* These are the merged vectors */ + u32 *msrpm; + + /* gpa pointers to the real vectors */ + u64 vmcb_msrpm; + u64 vmcb_iopm; + + /* A VMEXIT is required but not yet emulated */ + bool exit_required; + + /* cache for intercepts of the guest */ + u32 intercept_cr; + u32 intercept_dr; + u32 intercept_exceptions; + u64 intercept; + + /* Nested Paging related state */ + u64 nested_cr3; +}; + +#define MSRPM_OFFSETS 16 +static u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly; + +/* + * Set osvw_len to higher value when updated Revision Guides + * are published and we know what the new status bits are + */ +static uint64_t osvw_len = 4, osvw_status; + +struct vcpu_svm { + struct kvm_vcpu vcpu; + struct vmcb *vmcb; + unsigned long vmcb_pa; + struct svm_cpu_data *svm_data; + uint64_t asid_generation; + uint64_t sysenter_esp; + uint64_t sysenter_eip; + uint64_t tsc_aux; + + u64 msr_decfg; + + u64 next_rip; + + u64 host_user_msrs[NR_HOST_SAVE_USER_MSRS]; + struct { + u16 fs; + u16 gs; + u16 ldt; + u64 gs_base; + } host; + + u64 spec_ctrl; + /* + * Contains guest-controlled bits of VIRT_SPEC_CTRL, which will be + * translated into the appropriate L2_CFG bits on the host to + * perform speculative control. + */ + u64 virt_spec_ctrl; + + u32 *msrpm; + + ulong nmi_iret_rip; + + struct nested_state nested; + + bool nmi_singlestep; + u64 nmi_singlestep_guest_rflags; + + unsigned int3_injected; + unsigned long int3_rip; + + /* cached guest cpuid flags for faster access */ + bool nrips_enabled : 1; + + u32 ldr_reg; + u32 dfr_reg; + struct page *avic_backing_page; + u64 *avic_physical_id_cache; + bool avic_is_running; + + /* + * Per-vcpu list of struct amd_svm_iommu_ir: + * This is used mainly to store interrupt remapping information used + * when update the vcpu affinity. This avoids the need to scan for + * IRTE and try to match ga_tag in the IOMMU driver. + */ + struct list_head ir_list; + spinlock_t ir_list_lock; + + /* which host CPU was used for running this vcpu */ + unsigned int last_cpu; +}; + +/* + * This is a wrapper of struct amd_iommu_ir_data. + */ +struct amd_svm_iommu_ir { + struct list_head node; /* Used by SVM for per-vcpu ir_list */ + void *data; /* Storing pointer to struct amd_ir_data */ +}; + +#define AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK (0xFF) +#define AVIC_LOGICAL_ID_ENTRY_VALID_BIT 31 +#define AVIC_LOGICAL_ID_ENTRY_VALID_MASK (1 << 31) + +#define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK (0xFFULL) +#define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK (0xFFFFFFFFFFULL << 12) +#define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK (1ULL << 62) +#define AVIC_PHYSICAL_ID_ENTRY_VALID_MASK (1ULL << 63) + +static DEFINE_PER_CPU(u64, current_tsc_ratio); +#define TSC_RATIO_DEFAULT 0x0100000000ULL + +#define MSR_INVALID 0xffffffffU + +static const struct svm_direct_access_msrs { + u32 index; /* Index of the MSR */ + bool always; /* True if intercept is always on */ +} direct_access_msrs[] = { + { .index = MSR_STAR, .always = true }, + { .index = MSR_IA32_SYSENTER_CS, .always = true }, +#ifdef CONFIG_X86_64 + { .index = MSR_GS_BASE, .always = true }, + { .index = MSR_FS_BASE, .always = true }, + { .index = MSR_KERNEL_GS_BASE, .always = true }, + { .index = MSR_LSTAR, .always = true }, + { .index = MSR_CSTAR, .always = true }, + { .index = MSR_SYSCALL_MASK, .always = true }, +#endif + { .index = MSR_IA32_SPEC_CTRL, .always = false }, + { .index = MSR_IA32_PRED_CMD, .always = false }, + { .index = MSR_IA32_LASTBRANCHFROMIP, .always = false }, + { .index = MSR_IA32_LASTBRANCHTOIP, .always = false }, + { .index = MSR_IA32_LASTINTFROMIP, .always = false }, + { .index = MSR_IA32_LASTINTTOIP, .always = false }, + { .index = MSR_INVALID, .always = false }, +}; + +/* enable NPT for AMD64 and X86 with PAE */ +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) +static bool npt_enabled = true; +#else +static bool npt_enabled; +#endif + +/* + * These 2 parameters are used to config the controls for Pause-Loop Exiting: + * pause_filter_count: On processors that support Pause filtering(indicated + * by CPUID Fn8000_000A_EDX), the VMCB provides a 16 bit pause filter + * count value. On VMRUN this value is loaded into an internal counter. + * Each time a pause instruction is executed, this counter is decremented + * until it reaches zero at which time a #VMEXIT is generated if pause + * intercept is enabled. Refer to AMD APM Vol 2 Section 15.14.4 Pause + * Intercept Filtering for more details. + * This also indicate if ple logic enabled. + * + * pause_filter_thresh: In addition, some processor families support advanced + * pause filtering (indicated by CPUID Fn8000_000A_EDX) upper bound on + * the amount of time a guest is allowed to execute in a pause loop. + * In this mode, a 16-bit pause filter threshold field is added in the + * VMCB. The threshold value is a cycle count that is used to reset the + * pause counter. As with simple pause filtering, VMRUN loads the pause + * count value from VMCB into an internal counter. Then, on each pause + * instruction the hardware checks the elapsed number of cycles since + * the most recent pause instruction against the pause filter threshold. + * If the elapsed cycle count is greater than the pause filter threshold, + * then the internal pause count is reloaded from the VMCB and execution + * continues. If the elapsed cycle count is less than the pause filter + * threshold, then the internal pause count is decremented. If the count + * value is less than zero and PAUSE intercept is enabled, a #VMEXIT is + * triggered. If advanced pause filtering is supported and pause filter + * threshold field is set to zero, the filter will operate in the simpler, + * count only mode. + */ + +static unsigned short pause_filter_thresh = KVM_DEFAULT_PLE_GAP; +module_param(pause_filter_thresh, ushort, 0444); + +static unsigned short pause_filter_count = KVM_SVM_DEFAULT_PLE_WINDOW; +module_param(pause_filter_count, ushort, 0444); + +/* Default doubles per-vcpu window every exit. */ +static unsigned short pause_filter_count_grow = KVM_DEFAULT_PLE_WINDOW_GROW; +module_param(pause_filter_count_grow, ushort, 0444); + +/* Default resets per-vcpu window every exit to pause_filter_count. */ +static unsigned short pause_filter_count_shrink = KVM_DEFAULT_PLE_WINDOW_SHRINK; +module_param(pause_filter_count_shrink, ushort, 0444); + +/* Default is to compute the maximum so we can never overflow. */ +static unsigned short pause_filter_count_max = KVM_SVM_DEFAULT_PLE_WINDOW_MAX; +module_param(pause_filter_count_max, ushort, 0444); + +/* allow nested paging (virtualized MMU) for all guests */ +static int npt = true; +module_param(npt, int, S_IRUGO); + +/* allow nested virtualization in KVM/SVM */ +static int nested = true; +module_param(nested, int, S_IRUGO); + +/* enable / disable AVIC */ +static int avic; +#ifdef CONFIG_X86_LOCAL_APIC +module_param(avic, int, S_IRUGO); +#endif + +/* enable/disable Next RIP Save */ +static int nrips = true; +module_param(nrips, int, 0444); + +/* enable/disable Virtual VMLOAD VMSAVE */ +static int vls = true; +module_param(vls, int, 0444); + +/* enable/disable Virtual GIF */ +static int vgif = true; +module_param(vgif, int, 0444); + +/* enable/disable SEV support */ +static int sev = IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT); +module_param(sev, int, 0444); + +static bool __read_mostly dump_invalid_vmcb = 0; +module_param(dump_invalid_vmcb, bool, 0644); + +static u8 rsm_ins_bytes[] = "\x0f\xaa"; + +static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0); +static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa); +static void svm_complete_interrupts(struct vcpu_svm *svm); +static void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate); +static inline void avic_post_state_restore(struct kvm_vcpu *vcpu); + +static int nested_svm_exit_handled(struct vcpu_svm *svm); +static int nested_svm_intercept(struct vcpu_svm *svm); +static int nested_svm_vmexit(struct vcpu_svm *svm); +static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, + bool has_error_code, u32 error_code); + +enum { + VMCB_INTERCEPTS, /* Intercept vectors, TSC offset, + pause filter count */ + VMCB_PERM_MAP, /* IOPM Base and MSRPM Base */ + VMCB_ASID, /* ASID */ + VMCB_INTR, /* int_ctl, int_vector */ + VMCB_NPT, /* npt_en, nCR3, gPAT */ + VMCB_CR, /* CR0, CR3, CR4, EFER */ + VMCB_DR, /* DR6, DR7 */ + VMCB_DT, /* GDT, IDT */ + VMCB_SEG, /* CS, DS, SS, ES, CPL */ + VMCB_CR2, /* CR2 only */ + VMCB_LBR, /* DBGCTL, BR_FROM, BR_TO, LAST_EX_FROM, LAST_EX_TO */ + VMCB_AVIC, /* AVIC APIC_BAR, AVIC APIC_BACKING_PAGE, + * AVIC PHYSICAL_TABLE pointer, + * AVIC LOGICAL_TABLE pointer + */ + VMCB_DIRTY_MAX, +}; + +/* TPR and CR2 are always written before VMRUN */ +#define VMCB_ALWAYS_DIRTY_MASK ((1U << VMCB_INTR) | (1U << VMCB_CR2)) + +#define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL + +static int sev_flush_asids(void); +static DECLARE_RWSEM(sev_deactivate_lock); +static DEFINE_MUTEX(sev_bitmap_lock); +static unsigned int max_sev_asid; +static unsigned int min_sev_asid; +static unsigned long *sev_asid_bitmap; +static unsigned long *sev_reclaim_asid_bitmap; +#define __sme_page_pa(x) __sme_set(page_to_pfn(x) << PAGE_SHIFT) + +struct enc_region { + struct list_head list; + unsigned long npages; + struct page **pages; + unsigned long uaddr; + unsigned long size; +}; + + +static inline struct kvm_svm *to_kvm_svm(struct kvm *kvm) +{ + return container_of(kvm, struct kvm_svm, kvm); +} + +static inline bool svm_sev_enabled(void) +{ + return IS_ENABLED(CONFIG_KVM_AMD_SEV) ? max_sev_asid : 0; +} + +static inline bool sev_guest(struct kvm *kvm) +{ +#ifdef CONFIG_KVM_AMD_SEV + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + + return sev->active; +#else + return false; +#endif +} + +static inline int sev_get_asid(struct kvm *kvm) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + + return sev->asid; +} + +static inline void mark_all_dirty(struct vmcb *vmcb) +{ + vmcb->control.clean = 0; +} + +static inline void mark_all_clean(struct vmcb *vmcb) +{ + vmcb->control.clean = ((1 << VMCB_DIRTY_MAX) - 1) + & ~VMCB_ALWAYS_DIRTY_MASK; +} + +static inline void mark_dirty(struct vmcb *vmcb, int bit) +{ + vmcb->control.clean &= ~(1 << bit); +} + +static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu) +{ + return container_of(vcpu, struct vcpu_svm, vcpu); +} + +static inline void avic_update_vapic_bar(struct vcpu_svm *svm, u64 data) +{ + svm->vmcb->control.avic_vapic_bar = data & VMCB_AVIC_APIC_BAR_MASK; + mark_dirty(svm->vmcb, VMCB_AVIC); +} + +static inline bool avic_vcpu_is_running(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + u64 *entry = svm->avic_physical_id_cache; + + if (!entry) + return false; + + return (READ_ONCE(*entry) & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK); +} + +static void recalc_intercepts(struct vcpu_svm *svm) +{ + struct vmcb_control_area *c, *h; + struct nested_state *g; + + mark_dirty(svm->vmcb, VMCB_INTERCEPTS); + + if (!is_guest_mode(&svm->vcpu)) + return; + + c = &svm->vmcb->control; + h = &svm->nested.hsave->control; + g = &svm->nested; + + c->intercept_cr = h->intercept_cr; + c->intercept_dr = h->intercept_dr; + c->intercept_exceptions = h->intercept_exceptions; + c->intercept = h->intercept; + + if (svm->vcpu.arch.hflags & HF_VINTR_MASK) { + /* We only want the cr8 intercept bits of L1 */ + c->intercept_cr &= ~(1U << INTERCEPT_CR8_READ); + c->intercept_cr &= ~(1U << INTERCEPT_CR8_WRITE); + + /* + * Once running L2 with HF_VINTR_MASK, EFLAGS.IF does not + * affect any interrupt we may want to inject; therefore, + * interrupt window vmexits are irrelevant to L0. + */ + c->intercept &= ~(1ULL << INTERCEPT_VINTR); + } + + /* We don't want to see VMMCALLs from a nested guest */ + c->intercept &= ~(1ULL << INTERCEPT_VMMCALL); + + c->intercept_cr |= g->intercept_cr; + c->intercept_dr |= g->intercept_dr; + c->intercept_exceptions |= g->intercept_exceptions; + c->intercept |= g->intercept; +} + +static inline struct vmcb *get_host_vmcb(struct vcpu_svm *svm) +{ + if (is_guest_mode(&svm->vcpu)) + return svm->nested.hsave; + else + return svm->vmcb; +} + +static inline void set_cr_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept_cr |= (1U << bit); + + recalc_intercepts(svm); +} + +static inline void clr_cr_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept_cr &= ~(1U << bit); + + recalc_intercepts(svm); +} + +static inline bool is_cr_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + return vmcb->control.intercept_cr & (1U << bit); +} + +static inline void set_dr_intercepts(struct vcpu_svm *svm) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept_dr = (1 << INTERCEPT_DR0_READ) + | (1 << INTERCEPT_DR1_READ) + | (1 << INTERCEPT_DR2_READ) + | (1 << INTERCEPT_DR3_READ) + | (1 << INTERCEPT_DR4_READ) + | (1 << INTERCEPT_DR5_READ) + | (1 << INTERCEPT_DR6_READ) + | (1 << INTERCEPT_DR7_READ) + | (1 << INTERCEPT_DR0_WRITE) + | (1 << INTERCEPT_DR1_WRITE) + | (1 << INTERCEPT_DR2_WRITE) + | (1 << INTERCEPT_DR3_WRITE) + | (1 << INTERCEPT_DR4_WRITE) + | (1 << INTERCEPT_DR5_WRITE) + | (1 << INTERCEPT_DR6_WRITE) + | (1 << INTERCEPT_DR7_WRITE); + + recalc_intercepts(svm); +} + +static inline void clr_dr_intercepts(struct vcpu_svm *svm) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept_dr = 0; + + recalc_intercepts(svm); +} + +static inline void set_exception_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept_exceptions |= (1U << bit); + + recalc_intercepts(svm); +} + +static inline void clr_exception_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept_exceptions &= ~(1U << bit); + + recalc_intercepts(svm); +} + +static inline void set_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept |= (1ULL << bit); + + recalc_intercepts(svm); +} + +static inline void clr_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept &= ~(1ULL << bit); + + recalc_intercepts(svm); +} + +static inline bool is_intercept(struct vcpu_svm *svm, int bit) +{ + return (svm->vmcb->control.intercept & (1ULL << bit)) != 0; +} + +static inline bool vgif_enabled(struct vcpu_svm *svm) +{ + return !!(svm->vmcb->control.int_ctl & V_GIF_ENABLE_MASK); +} + +static inline void enable_gif(struct vcpu_svm *svm) +{ + if (vgif_enabled(svm)) + svm->vmcb->control.int_ctl |= V_GIF_MASK; + else + svm->vcpu.arch.hflags |= HF_GIF_MASK; +} + +static inline void disable_gif(struct vcpu_svm *svm) +{ + if (vgif_enabled(svm)) + svm->vmcb->control.int_ctl &= ~V_GIF_MASK; + else + svm->vcpu.arch.hflags &= ~HF_GIF_MASK; +} + +static inline bool gif_set(struct vcpu_svm *svm) +{ + if (vgif_enabled(svm)) + return !!(svm->vmcb->control.int_ctl & V_GIF_MASK); + else + return !!(svm->vcpu.arch.hflags & HF_GIF_MASK); +} + +static unsigned long iopm_base; + +struct kvm_ldttss_desc { + u16 limit0; + u16 base0; + unsigned base1:8, type:5, dpl:2, p:1; + unsigned limit1:4, zero0:3, g:1, base2:8; + u32 base3; + u32 zero1; +} __attribute__((packed)); + +struct svm_cpu_data { + int cpu; + + u64 asid_generation; + u32 max_asid; + u32 next_asid; + u32 min_asid; + struct kvm_ldttss_desc *tss_desc; + + struct page *save_area; + struct vmcb *current_vmcb; + + /* index = sev_asid, value = vmcb pointer */ + struct vmcb **sev_vmcbs; +}; + +static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data); + +static const u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000}; + +#define NUM_MSR_MAPS ARRAY_SIZE(msrpm_ranges) +#define MSRS_RANGE_SIZE 2048 +#define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2) + +static u32 svm_msrpm_offset(u32 msr) +{ + u32 offset; + int i; + + for (i = 0; i < NUM_MSR_MAPS; i++) { + if (msr < msrpm_ranges[i] || + msr >= msrpm_ranges[i] + MSRS_IN_RANGE) + continue; + + offset = (msr - msrpm_ranges[i]) / 4; /* 4 msrs per u8 */ + offset += (i * MSRS_RANGE_SIZE); /* add range offset */ + + /* Now we have the u8 offset - but need the u32 offset */ + return offset / 4; + } + + /* MSR not in any range */ + return MSR_INVALID; +} + +#define MAX_INST_SIZE 15 + +static inline void clgi(void) +{ + asm volatile (__ex("clgi")); +} + +static inline void stgi(void) +{ + asm volatile (__ex("stgi")); +} + +static inline void invlpga(unsigned long addr, u32 asid) +{ + asm volatile (__ex("invlpga %1, %0") : : "c"(asid), "a"(addr)); +} + +static int get_npt_level(struct kvm_vcpu *vcpu) +{ +#ifdef CONFIG_X86_64 + return PT64_ROOT_4LEVEL; +#else + return PT32E_ROOT_LEVEL; +#endif +} + +static void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer) +{ + vcpu->arch.efer = efer; + + if (!npt_enabled) { + /* Shadow paging assumes NX to be available. */ + efer |= EFER_NX; + + if (!(efer & EFER_LMA)) + efer &= ~EFER_LME; + } + + to_svm(vcpu)->vmcb->save.efer = efer | EFER_SVME; + mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR); +} + +static int is_external_interrupt(u32 info) +{ + info &= SVM_EVTINJ_TYPE_MASK | SVM_EVTINJ_VALID; + return info == (SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR); +} + +static u32 svm_get_interrupt_shadow(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + u32 ret = 0; + + if (svm->vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) + ret = KVM_X86_SHADOW_INT_STI | KVM_X86_SHADOW_INT_MOV_SS; + return ret; +} + +static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (mask == 0) + svm->vmcb->control.int_state &= ~SVM_INTERRUPT_SHADOW_MASK; + else + svm->vmcb->control.int_state |= SVM_INTERRUPT_SHADOW_MASK; + +} + +static int skip_emulated_instruction(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (nrips && svm->vmcb->control.next_rip != 0) { + WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS)); + svm->next_rip = svm->vmcb->control.next_rip; + } + + if (!svm->next_rip) { + if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP)) + return 0; + } else { + if (svm->next_rip - kvm_rip_read(vcpu) > MAX_INST_SIZE) + pr_err("%s: ip 0x%lx next 0x%llx\n", + __func__, kvm_rip_read(vcpu), svm->next_rip); + kvm_rip_write(vcpu, svm->next_rip); + } + svm_set_interrupt_shadow(vcpu, 0); + + return 1; +} + +static void svm_queue_exception(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + unsigned nr = vcpu->arch.exception.nr; + bool has_error_code = vcpu->arch.exception.has_error_code; + bool reinject = vcpu->arch.exception.injected; + u32 error_code = vcpu->arch.exception.error_code; + + /* + * If we are within a nested VM we'd better #VMEXIT and let the guest + * handle the exception + */ + if (!reinject && + nested_svm_check_exception(svm, nr, has_error_code, error_code)) + return; + + kvm_deliver_exception_payload(&svm->vcpu); + + if (nr == BP_VECTOR && !nrips) { + unsigned long rip, old_rip = kvm_rip_read(&svm->vcpu); + + /* + * For guest debugging where we have to reinject #BP if some + * INT3 is guest-owned: + * Emulate nRIP by moving RIP forward. Will fail if injection + * raises a fault that is not intercepted. Still better than + * failing in all cases. + */ + (void)skip_emulated_instruction(&svm->vcpu); + rip = kvm_rip_read(&svm->vcpu); + svm->int3_rip = rip + svm->vmcb->save.cs.base; + svm->int3_injected = rip - old_rip; + } + + svm->vmcb->control.event_inj = nr + | SVM_EVTINJ_VALID + | (has_error_code ? SVM_EVTINJ_VALID_ERR : 0) + | SVM_EVTINJ_TYPE_EXEPT; + svm->vmcb->control.event_inj_err = error_code; +} + +static void svm_init_erratum_383(void) +{ + u32 low, high; + int err; + u64 val; + + if (!static_cpu_has_bug(X86_BUG_AMD_TLB_MMATCH)) + return; + + /* Use _safe variants to not break nested virtualization */ + val = native_read_msr_safe(MSR_AMD64_DC_CFG, &err); + if (err) + return; + + val |= (1ULL << 47); + + low = lower_32_bits(val); + high = upper_32_bits(val); + + native_write_msr_safe(MSR_AMD64_DC_CFG, low, high); + + erratum_383_found = true; +} + +static void svm_init_osvw(struct kvm_vcpu *vcpu) +{ + /* + * Guests should see errata 400 and 415 as fixed (assuming that + * HLT and IO instructions are intercepted). + */ + vcpu->arch.osvw.length = (osvw_len >= 3) ? (osvw_len) : 3; + vcpu->arch.osvw.status = osvw_status & ~(6ULL); + + /* + * By increasing VCPU's osvw.length to 3 we are telling the guest that + * all osvw.status bits inside that length, including bit 0 (which is + * reserved for erratum 298), are valid. However, if host processor's + * osvw_len is 0 then osvw_status[0] carries no information. We need to + * be conservative here and therefore we tell the guest that erratum 298 + * is present (because we really don't know). + */ + if (osvw_len == 0 && boot_cpu_data.x86 == 0x10) + vcpu->arch.osvw.status |= 1; +} + +static int has_svm(void) +{ + const char *msg; + + if (!cpu_has_svm(&msg)) { + printk(KERN_INFO "has_svm: %s\n", msg); + return 0; + } + + return 1; +} + +static void svm_hardware_disable(void) +{ + /* Make sure we clean up behind us */ + if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) + wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT); + + cpu_svm_disable(); + + amd_pmu_disable_virt(); +} + +static int svm_hardware_enable(void) +{ + + struct svm_cpu_data *sd; + uint64_t efer; + struct desc_struct *gdt; + int me = raw_smp_processor_id(); + + rdmsrl(MSR_EFER, efer); + if (efer & EFER_SVME) + return -EBUSY; + + if (!has_svm()) { + pr_err("%s: err EOPNOTSUPP on %d\n", __func__, me); + return -EINVAL; + } + sd = per_cpu(svm_data, me); + if (!sd) { + pr_err("%s: svm_data is NULL on %d\n", __func__, me); + return -EINVAL; + } + + sd->asid_generation = 1; + sd->max_asid = cpuid_ebx(SVM_CPUID_FUNC) - 1; + sd->next_asid = sd->max_asid + 1; + sd->min_asid = max_sev_asid + 1; + + gdt = get_current_gdt_rw(); + sd->tss_desc = (struct kvm_ldttss_desc *)(gdt + GDT_ENTRY_TSS); + + wrmsrl(MSR_EFER, efer | EFER_SVME); + + wrmsrl(MSR_VM_HSAVE_PA, page_to_pfn(sd->save_area) << PAGE_SHIFT); + + if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) { + wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT); + __this_cpu_write(current_tsc_ratio, TSC_RATIO_DEFAULT); + } + + + /* + * Get OSVW bits. + * + * Note that it is possible to have a system with mixed processor + * revisions and therefore different OSVW bits. If bits are not the same + * on different processors then choose the worst case (i.e. if erratum + * is present on one processor and not on another then assume that the + * erratum is present everywhere). + */ + if (cpu_has(&boot_cpu_data, X86_FEATURE_OSVW)) { + uint64_t len, status = 0; + int err; + + len = native_read_msr_safe(MSR_AMD64_OSVW_ID_LENGTH, &err); + if (!err) + status = native_read_msr_safe(MSR_AMD64_OSVW_STATUS, + &err); + + if (err) + osvw_status = osvw_len = 0; + else { + if (len < osvw_len) + osvw_len = len; + osvw_status |= status; + osvw_status &= (1ULL << osvw_len) - 1; + } + } else + osvw_status = osvw_len = 0; + + svm_init_erratum_383(); + + amd_pmu_enable_virt(); + + return 0; +} + +static void svm_cpu_uninit(int cpu) +{ + struct svm_cpu_data *sd = per_cpu(svm_data, raw_smp_processor_id()); + + if (!sd) + return; + + per_cpu(svm_data, raw_smp_processor_id()) = NULL; + kfree(sd->sev_vmcbs); + __free_page(sd->save_area); + kfree(sd); +} + +static int svm_cpu_init(int cpu) +{ + struct svm_cpu_data *sd; + + sd = kzalloc(sizeof(struct svm_cpu_data), GFP_KERNEL); + if (!sd) + return -ENOMEM; + sd->cpu = cpu; + sd->save_area = alloc_page(GFP_KERNEL); + if (!sd->save_area) + goto free_cpu_data; + + if (svm_sev_enabled()) { + sd->sev_vmcbs = kmalloc_array(max_sev_asid + 1, + sizeof(void *), + GFP_KERNEL); + if (!sd->sev_vmcbs) + goto free_save_area; + } + + per_cpu(svm_data, cpu) = sd; + + return 0; + +free_save_area: + __free_page(sd->save_area); +free_cpu_data: + kfree(sd); + return -ENOMEM; + +} + +static bool valid_msr_intercept(u32 index) +{ + int i; + + for (i = 0; direct_access_msrs[i].index != MSR_INVALID; i++) + if (direct_access_msrs[i].index == index) + return true; + + return false; +} + +static bool msr_write_intercepted(struct kvm_vcpu *vcpu, unsigned msr) +{ + u8 bit_write; + unsigned long tmp; + u32 offset; + u32 *msrpm; + + msrpm = is_guest_mode(vcpu) ? to_svm(vcpu)->nested.msrpm: + to_svm(vcpu)->msrpm; + + offset = svm_msrpm_offset(msr); + bit_write = 2 * (msr & 0x0f) + 1; + tmp = msrpm[offset]; + + BUG_ON(offset == MSR_INVALID); + + return !!test_bit(bit_write, &tmp); +} + +static void set_msr_interception(u32 *msrpm, unsigned msr, + int read, int write) +{ + u8 bit_read, bit_write; + unsigned long tmp; + u32 offset; + + /* + * If this warning triggers extend the direct_access_msrs list at the + * beginning of the file + */ + WARN_ON(!valid_msr_intercept(msr)); + + offset = svm_msrpm_offset(msr); + bit_read = 2 * (msr & 0x0f); + bit_write = 2 * (msr & 0x0f) + 1; + tmp = msrpm[offset]; + + BUG_ON(offset == MSR_INVALID); + + read ? clear_bit(bit_read, &tmp) : set_bit(bit_read, &tmp); + write ? clear_bit(bit_write, &tmp) : set_bit(bit_write, &tmp); + + msrpm[offset] = tmp; +} + +static void svm_vcpu_init_msrpm(u32 *msrpm) +{ + int i; + + memset(msrpm, 0xff, PAGE_SIZE * (1 << MSRPM_ALLOC_ORDER)); + + for (i = 0; direct_access_msrs[i].index != MSR_INVALID; i++) { + if (!direct_access_msrs[i].always) + continue; + + set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1); + } +} + +static void add_msr_offset(u32 offset) +{ + int i; + + for (i = 0; i < MSRPM_OFFSETS; ++i) { + + /* Offset already in list? */ + if (msrpm_offsets[i] == offset) + return; + + /* Slot used by another offset? */ + if (msrpm_offsets[i] != MSR_INVALID) + continue; + + /* Add offset to list */ + msrpm_offsets[i] = offset; + + return; + } + + /* + * If this BUG triggers the msrpm_offsets table has an overflow. Just + * increase MSRPM_OFFSETS in this case. + */ + BUG(); +} + +static void init_msrpm_offsets(void) +{ + int i; + + memset(msrpm_offsets, 0xff, sizeof(msrpm_offsets)); + + for (i = 0; direct_access_msrs[i].index != MSR_INVALID; i++) { + u32 offset; + + offset = svm_msrpm_offset(direct_access_msrs[i].index); + BUG_ON(offset == MSR_INVALID); + + add_msr_offset(offset); + } +} + +static void svm_enable_lbrv(struct vcpu_svm *svm) +{ + u32 *msrpm = svm->msrpm; + + svm->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK; + set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 1, 1); + set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1); + set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 1, 1); + set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 1, 1); +} + +static void svm_disable_lbrv(struct vcpu_svm *svm) +{ + u32 *msrpm = svm->msrpm; + + svm->vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK; + set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 0, 0); + set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0); + set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 0, 0); + set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 0, 0); +} + +static void disable_nmi_singlestep(struct vcpu_svm *svm) +{ + svm->nmi_singlestep = false; + + if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP)) { + /* Clear our flags if they were not set by the guest */ + if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF)) + svm->vmcb->save.rflags &= ~X86_EFLAGS_TF; + if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_RF)) + svm->vmcb->save.rflags &= ~X86_EFLAGS_RF; + } +} + +/* Note: + * This hash table is used to map VM_ID to a struct kvm_svm, + * when handling AMD IOMMU GALOG notification to schedule in + * a particular vCPU. + */ +#define SVM_VM_DATA_HASH_BITS 8 +static DEFINE_HASHTABLE(svm_vm_data_hash, SVM_VM_DATA_HASH_BITS); +static u32 next_vm_id = 0; +static bool next_vm_id_wrapped = 0; +static DEFINE_SPINLOCK(svm_vm_data_hash_lock); + +/* Note: + * This function is called from IOMMU driver to notify + * SVM to schedule in a particular vCPU of a particular VM. + */ +static int avic_ga_log_notifier(u32 ga_tag) +{ + unsigned long flags; + struct kvm_svm *kvm_svm; + struct kvm_vcpu *vcpu = NULL; + u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag); + u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag); + + pr_debug("SVM: %s: vm_id=%#x, vcpu_id=%#x\n", __func__, vm_id, vcpu_id); + trace_kvm_avic_ga_log(vm_id, vcpu_id); + + spin_lock_irqsave(&svm_vm_data_hash_lock, flags); + hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) { + if (kvm_svm->avic_vm_id != vm_id) + continue; + vcpu = kvm_get_vcpu_by_id(&kvm_svm->kvm, vcpu_id); + break; + } + spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags); + + /* Note: + * At this point, the IOMMU should have already set the pending + * bit in the vAPIC backing page. So, we just need to schedule + * in the vcpu. + */ + if (vcpu) + kvm_vcpu_wake_up(vcpu); + + return 0; +} + +static __init int sev_hardware_setup(void) +{ + struct sev_user_data_status *status; + int rc; + + /* Maximum number of encrypted guests supported simultaneously */ + max_sev_asid = cpuid_ecx(0x8000001F); + + if (!max_sev_asid) + return 1; + + /* Minimum ASID value that should be used for SEV guest */ + min_sev_asid = cpuid_edx(0x8000001F); + + /* Initialize SEV ASID bitmaps */ + sev_asid_bitmap = bitmap_zalloc(max_sev_asid, GFP_KERNEL); + if (!sev_asid_bitmap) + return 1; + + sev_reclaim_asid_bitmap = bitmap_zalloc(max_sev_asid, GFP_KERNEL); + if (!sev_reclaim_asid_bitmap) + return 1; + + status = kmalloc(sizeof(*status), GFP_KERNEL); + if (!status) + return 1; + + /* + * Check SEV platform status. + * + * PLATFORM_STATUS can be called in any state, if we failed to query + * the PLATFORM status then either PSP firmware does not support SEV + * feature or SEV firmware is dead. + */ + rc = sev_platform_status(status, NULL); + if (rc) + goto err; + + pr_info("SEV supported\n"); + +err: + kfree(status); + return rc; +} + +static void grow_ple_window(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb_control_area *control = &svm->vmcb->control; + int old = control->pause_filter_count; + + control->pause_filter_count = __grow_ple_window(old, + pause_filter_count, + pause_filter_count_grow, + pause_filter_count_max); + + if (control->pause_filter_count != old) { + mark_dirty(svm->vmcb, VMCB_INTERCEPTS); + trace_kvm_ple_window_update(vcpu->vcpu_id, + control->pause_filter_count, old); + } +} + +static void shrink_ple_window(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb_control_area *control = &svm->vmcb->control; + int old = control->pause_filter_count; + + control->pause_filter_count = + __shrink_ple_window(old, + pause_filter_count, + pause_filter_count_shrink, + pause_filter_count); + if (control->pause_filter_count != old) { + mark_dirty(svm->vmcb, VMCB_INTERCEPTS); + trace_kvm_ple_window_update(vcpu->vcpu_id, + control->pause_filter_count, old); + } +} + +/* + * The default MMIO mask is a single bit (excluding the present bit), + * which could conflict with the memory encryption bit. Check for + * memory encryption support and override the default MMIO mask if + * memory encryption is enabled. + */ +static __init void svm_adjust_mmio_mask(void) +{ + unsigned int enc_bit, mask_bit; + u64 msr, mask; + + /* If there is no memory encryption support, use existing mask */ + if (cpuid_eax(0x80000000) < 0x8000001f) + return; + + /* If memory encryption is not enabled, use existing mask */ + rdmsrl(MSR_K8_SYSCFG, msr); + if (!(msr & MSR_K8_SYSCFG_MEM_ENCRYPT)) + return; + + enc_bit = cpuid_ebx(0x8000001f) & 0x3f; + mask_bit = boot_cpu_data.x86_phys_bits; + + /* Increment the mask bit if it is the same as the encryption bit */ + if (enc_bit == mask_bit) + mask_bit++; + + /* + * If the mask bit location is below 52, then some bits above the + * physical addressing limit will always be reserved, so use the + * rsvd_bits() function to generate the mask. This mask, along with + * the present bit, will be used to generate a page fault with + * PFER.RSV = 1. + * + * If the mask bit location is 52 (or above), then clear the mask. + */ + mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0; + + kvm_mmu_set_mmio_spte_mask(mask, mask, PT_WRITABLE_MASK | PT_USER_MASK); +} + +static void svm_hardware_teardown(void) +{ + int cpu; + + if (svm_sev_enabled()) { + bitmap_free(sev_asid_bitmap); + bitmap_free(sev_reclaim_asid_bitmap); + + sev_flush_asids(); + } + + for_each_possible_cpu(cpu) + svm_cpu_uninit(cpu); + + __free_pages(pfn_to_page(iopm_base >> PAGE_SHIFT), IOPM_ALLOC_ORDER); + iopm_base = 0; +} + +static __init void svm_set_cpu_caps(void) +{ + kvm_set_cpu_caps(); + + supported_xss = 0; + + /* CPUID 0x80000001 and 0x8000000A (SVM features) */ + if (nested) { + kvm_cpu_cap_set(X86_FEATURE_SVM); + + if (nrips) + kvm_cpu_cap_set(X86_FEATURE_NRIPS); + + if (npt_enabled) + kvm_cpu_cap_set(X86_FEATURE_NPT); + } + + /* CPUID 0x80000008 */ + if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) || + boot_cpu_has(X86_FEATURE_AMD_SSBD)) + kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD); +} + +static __init int svm_hardware_setup(void) +{ + int cpu; + struct page *iopm_pages; + void *iopm_va; + int r; + + iopm_pages = alloc_pages(GFP_KERNEL, IOPM_ALLOC_ORDER); + + if (!iopm_pages) + return -ENOMEM; + + iopm_va = page_address(iopm_pages); + memset(iopm_va, 0xff, PAGE_SIZE * (1 << IOPM_ALLOC_ORDER)); + iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT; + + init_msrpm_offsets(); + + supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR); + + if (boot_cpu_has(X86_FEATURE_NX)) + kvm_enable_efer_bits(EFER_NX); + + if (boot_cpu_has(X86_FEATURE_FXSR_OPT)) + kvm_enable_efer_bits(EFER_FFXSR); + + if (boot_cpu_has(X86_FEATURE_TSCRATEMSR)) { + kvm_has_tsc_control = true; + kvm_max_tsc_scaling_ratio = TSC_RATIO_MAX; + kvm_tsc_scaling_ratio_frac_bits = 32; + } + + /* Check for pause filtering support */ + if (!boot_cpu_has(X86_FEATURE_PAUSEFILTER)) { + pause_filter_count = 0; + pause_filter_thresh = 0; + } else if (!boot_cpu_has(X86_FEATURE_PFTHRESHOLD)) { + pause_filter_thresh = 0; + } + + if (nested) { + printk(KERN_INFO "kvm: Nested Virtualization enabled\n"); + kvm_enable_efer_bits(EFER_SVME | EFER_LMSLE); + } + + if (sev) { + if (boot_cpu_has(X86_FEATURE_SEV) && + IS_ENABLED(CONFIG_KVM_AMD_SEV)) { + r = sev_hardware_setup(); + if (r) + sev = false; + } else { + sev = false; + } + } + + svm_adjust_mmio_mask(); + + for_each_possible_cpu(cpu) { + r = svm_cpu_init(cpu); + if (r) + goto err; + } + + if (!boot_cpu_has(X86_FEATURE_NPT)) + npt_enabled = false; + + if (npt_enabled && !npt) + npt_enabled = false; + + kvm_configure_mmu(npt_enabled, PT_PDPE_LEVEL); + pr_info("kvm: Nested Paging %sabled\n", npt_enabled ? "en" : "dis"); + + if (nrips) { + if (!boot_cpu_has(X86_FEATURE_NRIPS)) + nrips = false; + } + + if (avic) { + if (!npt_enabled || + !boot_cpu_has(X86_FEATURE_AVIC) || + !IS_ENABLED(CONFIG_X86_LOCAL_APIC)) { + avic = false; + } else { + pr_info("AVIC enabled\n"); + + amd_iommu_register_ga_log_notifier(&avic_ga_log_notifier); + } + } + + if (vls) { + if (!npt_enabled || + !boot_cpu_has(X86_FEATURE_V_VMSAVE_VMLOAD) || + !IS_ENABLED(CONFIG_X86_64)) { + vls = false; + } else { + pr_info("Virtual VMLOAD VMSAVE supported\n"); + } + } + + if (vgif) { + if (!boot_cpu_has(X86_FEATURE_VGIF)) + vgif = false; + else + pr_info("Virtual GIF supported\n"); + } + + svm_set_cpu_caps(); + + return 0; + +err: + svm_hardware_teardown(); + return r; +} + +static void init_seg(struct vmcb_seg *seg) +{ + seg->selector = 0; + seg->attrib = SVM_SELECTOR_P_MASK | SVM_SELECTOR_S_MASK | + SVM_SELECTOR_WRITE_MASK; /* Read/Write Data Segment */ + seg->limit = 0xffff; + seg->base = 0; +} + +static void init_sys_seg(struct vmcb_seg *seg, uint32_t type) +{ + seg->selector = 0; + seg->attrib = SVM_SELECTOR_P_MASK | type; + seg->limit = 0xffff; + seg->base = 0; +} + +static u64 svm_read_l1_tsc_offset(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (is_guest_mode(vcpu)) + return svm->nested.hsave->control.tsc_offset; + + return vcpu->arch.tsc_offset; +} + +static u64 svm_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) +{ + struct vcpu_svm *svm = to_svm(vcpu); + u64 g_tsc_offset = 0; + + if (is_guest_mode(vcpu)) { + /* Write L1's TSC offset. */ + g_tsc_offset = svm->vmcb->control.tsc_offset - + svm->nested.hsave->control.tsc_offset; + svm->nested.hsave->control.tsc_offset = offset; + } + + trace_kvm_write_tsc_offset(vcpu->vcpu_id, + svm->vmcb->control.tsc_offset - g_tsc_offset, + offset); + + svm->vmcb->control.tsc_offset = offset + g_tsc_offset; + + mark_dirty(svm->vmcb, VMCB_INTERCEPTS); + return svm->vmcb->control.tsc_offset; +} + +static void avic_init_vmcb(struct vcpu_svm *svm) +{ + struct vmcb *vmcb = svm->vmcb; + struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm); + phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page)); + phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page)); + phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page)); + + vmcb->control.avic_backing_page = bpa & AVIC_HPA_MASK; + vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK; + vmcb->control.avic_physical_id = ppa & AVIC_HPA_MASK; + vmcb->control.avic_physical_id |= AVIC_MAX_PHYSICAL_ID_COUNT; + if (kvm_apicv_activated(svm->vcpu.kvm)) + vmcb->control.int_ctl |= AVIC_ENABLE_MASK; + else + vmcb->control.int_ctl &= ~AVIC_ENABLE_MASK; +} + +static void init_vmcb(struct vcpu_svm *svm) +{ + struct vmcb_control_area *control = &svm->vmcb->control; + struct vmcb_save_area *save = &svm->vmcb->save; + + svm->vcpu.arch.hflags = 0; + + set_cr_intercept(svm, INTERCEPT_CR0_READ); + set_cr_intercept(svm, INTERCEPT_CR3_READ); + set_cr_intercept(svm, INTERCEPT_CR4_READ); + set_cr_intercept(svm, INTERCEPT_CR0_WRITE); + set_cr_intercept(svm, INTERCEPT_CR3_WRITE); + set_cr_intercept(svm, INTERCEPT_CR4_WRITE); + if (!kvm_vcpu_apicv_active(&svm->vcpu)) + set_cr_intercept(svm, INTERCEPT_CR8_WRITE); + + set_dr_intercepts(svm); + + set_exception_intercept(svm, PF_VECTOR); + set_exception_intercept(svm, UD_VECTOR); + set_exception_intercept(svm, MC_VECTOR); + set_exception_intercept(svm, AC_VECTOR); + set_exception_intercept(svm, DB_VECTOR); + /* + * Guest access to VMware backdoor ports could legitimately + * trigger #GP because of TSS I/O permission bitmap. + * We intercept those #GP and allow access to them anyway + * as VMware does. + */ + if (enable_vmware_backdoor) + set_exception_intercept(svm, GP_VECTOR); + + set_intercept(svm, INTERCEPT_INTR); + set_intercept(svm, INTERCEPT_NMI); + set_intercept(svm, INTERCEPT_SMI); + set_intercept(svm, INTERCEPT_SELECTIVE_CR0); + set_intercept(svm, INTERCEPT_RDPMC); + set_intercept(svm, INTERCEPT_CPUID); + set_intercept(svm, INTERCEPT_INVD); + set_intercept(svm, INTERCEPT_INVLPG); + set_intercept(svm, INTERCEPT_INVLPGA); + set_intercept(svm, INTERCEPT_IOIO_PROT); + set_intercept(svm, INTERCEPT_MSR_PROT); + set_intercept(svm, INTERCEPT_TASK_SWITCH); + set_intercept(svm, INTERCEPT_SHUTDOWN); + set_intercept(svm, INTERCEPT_VMRUN); + set_intercept(svm, INTERCEPT_VMMCALL); + set_intercept(svm, INTERCEPT_VMLOAD); + set_intercept(svm, INTERCEPT_VMSAVE); + set_intercept(svm, INTERCEPT_STGI); + set_intercept(svm, INTERCEPT_CLGI); + set_intercept(svm, INTERCEPT_SKINIT); + set_intercept(svm, INTERCEPT_WBINVD); + set_intercept(svm, INTERCEPT_XSETBV); + set_intercept(svm, INTERCEPT_RDPRU); + set_intercept(svm, INTERCEPT_RSM); + + if (!kvm_mwait_in_guest(svm->vcpu.kvm)) { + set_intercept(svm, INTERCEPT_MONITOR); + set_intercept(svm, INTERCEPT_MWAIT); + } + + if (!kvm_hlt_in_guest(svm->vcpu.kvm)) + set_intercept(svm, INTERCEPT_HLT); + + control->iopm_base_pa = __sme_set(iopm_base); + control->msrpm_base_pa = __sme_set(__pa(svm->msrpm)); + control->int_ctl = V_INTR_MASKING_MASK; + + init_seg(&save->es); + init_seg(&save->ss); + init_seg(&save->ds); + init_seg(&save->fs); + init_seg(&save->gs); + + save->cs.selector = 0xf000; + save->cs.base = 0xffff0000; + /* Executable/Readable Code Segment */ + save->cs.attrib = SVM_SELECTOR_READ_MASK | SVM_SELECTOR_P_MASK | + SVM_SELECTOR_S_MASK | SVM_SELECTOR_CODE_MASK; + save->cs.limit = 0xffff; + + save->gdtr.limit = 0xffff; + save->idtr.limit = 0xffff; + + init_sys_seg(&save->ldtr, SEG_TYPE_LDT); + init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16); + + svm_set_efer(&svm->vcpu, 0); + save->dr6 = 0xffff0ff0; + kvm_set_rflags(&svm->vcpu, 2); + save->rip = 0x0000fff0; + svm->vcpu.arch.regs[VCPU_REGS_RIP] = save->rip; + + /* + * svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0. + * It also updates the guest-visible cr0 value. + */ + svm_set_cr0(&svm->vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET); + kvm_mmu_reset_context(&svm->vcpu); + + save->cr4 = X86_CR4_PAE; + /* rdx = ?? */ + + if (npt_enabled) { + /* Setup VMCB for Nested Paging */ + control->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE; + clr_intercept(svm, INTERCEPT_INVLPG); + clr_exception_intercept(svm, PF_VECTOR); + clr_cr_intercept(svm, INTERCEPT_CR3_READ); + clr_cr_intercept(svm, INTERCEPT_CR3_WRITE); + save->g_pat = svm->vcpu.arch.pat; + save->cr3 = 0; + save->cr4 = 0; + } + svm->asid_generation = 0; + + svm->nested.vmcb = 0; + svm->vcpu.arch.hflags = 0; + + if (pause_filter_count) { + control->pause_filter_count = pause_filter_count; + if (pause_filter_thresh) + control->pause_filter_thresh = pause_filter_thresh; + set_intercept(svm, INTERCEPT_PAUSE); + } else { + clr_intercept(svm, INTERCEPT_PAUSE); + } + + if (kvm_vcpu_apicv_active(&svm->vcpu)) + avic_init_vmcb(svm); + + /* + * If hardware supports Virtual VMLOAD VMSAVE then enable it + * in VMCB and clear intercepts to avoid #VMEXIT. + */ + if (vls) { + clr_intercept(svm, INTERCEPT_VMLOAD); + clr_intercept(svm, INTERCEPT_VMSAVE); + svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + } + + if (vgif) { + clr_intercept(svm, INTERCEPT_STGI); + clr_intercept(svm, INTERCEPT_CLGI); + svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK; + } + + if (sev_guest(svm->vcpu.kvm)) { + svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ENABLE; + clr_exception_intercept(svm, UD_VECTOR); + } + + mark_all_dirty(svm->vmcb); + + enable_gif(svm); + +} + +static u64 *avic_get_physical_id_entry(struct kvm_vcpu *vcpu, + unsigned int index) +{ + u64 *avic_physical_id_table; + struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm); + + if (index >= AVIC_MAX_PHYSICAL_ID_COUNT) + return NULL; + + avic_physical_id_table = page_address(kvm_svm->avic_physical_id_table_page); + + return &avic_physical_id_table[index]; +} + +/** + * Note: + * AVIC hardware walks the nested page table to check permissions, + * but does not use the SPA address specified in the leaf page + * table entry since it uses address in the AVIC_BACKING_PAGE pointer + * field of the VMCB. Therefore, we set up the + * APIC_ACCESS_PAGE_PRIVATE_MEMSLOT (4KB) here. + */ +static int avic_update_access_page(struct kvm *kvm, bool activate) +{ + int ret = 0; + + mutex_lock(&kvm->slots_lock); + /* + * During kvm_destroy_vm(), kvm_pit_set_reinject() could trigger + * APICv mode change, which update APIC_ACCESS_PAGE_PRIVATE_MEMSLOT + * memory region. So, we need to ensure that kvm->mm == current->mm. + */ + if ((kvm->arch.apic_access_page_done == activate) || + (kvm->mm != current->mm)) + goto out; + + ret = __x86_set_memory_region(kvm, + APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, + APIC_DEFAULT_PHYS_BASE, + activate ? PAGE_SIZE : 0); + if (ret) + goto out; + + kvm->arch.apic_access_page_done = activate; +out: + mutex_unlock(&kvm->slots_lock); + return ret; +} + +static int avic_init_backing_page(struct kvm_vcpu *vcpu) +{ + u64 *entry, new_entry; + int id = vcpu->vcpu_id; + struct vcpu_svm *svm = to_svm(vcpu); + + if (id >= AVIC_MAX_PHYSICAL_ID_COUNT) + return -EINVAL; + + if (!svm->vcpu.arch.apic->regs) + return -EINVAL; + + if (kvm_apicv_activated(vcpu->kvm)) { + int ret; + + ret = avic_update_access_page(vcpu->kvm, true); + if (ret) + return ret; + } + + svm->avic_backing_page = virt_to_page(svm->vcpu.arch.apic->regs); + + /* Setting AVIC backing page address in the phy APIC ID table */ + entry = avic_get_physical_id_entry(vcpu, id); + if (!entry) + return -EINVAL; + + new_entry = __sme_set((page_to_phys(svm->avic_backing_page) & + AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK) | + AVIC_PHYSICAL_ID_ENTRY_VALID_MASK); + WRITE_ONCE(*entry, new_entry); + + svm->avic_physical_id_cache = entry; + + return 0; +} + +static void sev_asid_free(int asid) +{ + struct svm_cpu_data *sd; + int cpu, pos; + + mutex_lock(&sev_bitmap_lock); + + pos = asid - 1; + __set_bit(pos, sev_reclaim_asid_bitmap); + + for_each_possible_cpu(cpu) { + sd = per_cpu(svm_data, cpu); + sd->sev_vmcbs[pos] = NULL; + } + + mutex_unlock(&sev_bitmap_lock); +} + +static void sev_unbind_asid(struct kvm *kvm, unsigned int handle) +{ + struct sev_data_decommission *decommission; + struct sev_data_deactivate *data; + + if (!handle) + return; + + data = kzalloc(sizeof(*data), GFP_KERNEL); + if (!data) + return; + + /* deactivate handle */ + data->handle = handle; + + /* Guard DEACTIVATE against WBINVD/DF_FLUSH used in ASID recycling */ + down_read(&sev_deactivate_lock); + sev_guest_deactivate(data, NULL); + up_read(&sev_deactivate_lock); + + kfree(data); + + decommission = kzalloc(sizeof(*decommission), GFP_KERNEL); + if (!decommission) + return; + + /* decommission handle */ + decommission->handle = handle; + sev_guest_decommission(decommission, NULL); + + kfree(decommission); +} + +static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr, + unsigned long ulen, unsigned long *n, + int write) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + unsigned long npages, npinned, size; + unsigned long locked, lock_limit; + struct page **pages; + unsigned long first, last; + + if (ulen == 0 || uaddr + ulen < uaddr) + return NULL; + + /* Calculate number of pages. */ + first = (uaddr & PAGE_MASK) >> PAGE_SHIFT; + last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT; + npages = (last - first + 1); + + locked = sev->pages_locked + npages; + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) { + pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit); + return NULL; + } + + /* Avoid using vmalloc for smaller buffers. */ + size = npages * sizeof(struct page *); + if (size > PAGE_SIZE) + pages = __vmalloc(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO, + PAGE_KERNEL); + else + pages = kmalloc(size, GFP_KERNEL_ACCOUNT); + + if (!pages) + return NULL; + + /* Pin the user virtual address. */ + npinned = get_user_pages_fast(uaddr, npages, FOLL_WRITE, pages); + if (npinned != npages) { + pr_err("SEV: Failure locking %lu pages.\n", npages); + goto err; + } + + *n = npages; + sev->pages_locked = locked; + + return pages; + +err: + if (npinned > 0) + release_pages(pages, npinned); + + kvfree(pages); + return NULL; +} + +static void sev_unpin_memory(struct kvm *kvm, struct page **pages, + unsigned long npages) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + + release_pages(pages, npages); + kvfree(pages); + sev->pages_locked -= npages; +} + +static void sev_clflush_pages(struct page *pages[], unsigned long npages) +{ + uint8_t *page_virtual; + unsigned long i; + + if (npages == 0 || pages == NULL) + return; + + for (i = 0; i < npages; i++) { + page_virtual = kmap_atomic(pages[i]); + clflush_cache_range(page_virtual, PAGE_SIZE); + kunmap_atomic(page_virtual); + } +} + +static void __unregister_enc_region_locked(struct kvm *kvm, + struct enc_region *region) +{ + sev_unpin_memory(kvm, region->pages, region->npages); + list_del(®ion->list); + kfree(region); +} + +static void sev_vm_destroy(struct kvm *kvm) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct list_head *head = &sev->regions_list; + struct list_head *pos, *q; + + if (!sev_guest(kvm)) + return; + + mutex_lock(&kvm->lock); + + /* + * Ensure that all guest tagged cache entries are flushed before + * releasing the pages back to the system for use. CLFLUSH will + * not do this, so issue a WBINVD. + */ + wbinvd_on_all_cpus(); + + /* + * if userspace was terminated before unregistering the memory regions + * then lets unpin all the registered memory. + */ + if (!list_empty(head)) { + list_for_each_safe(pos, q, head) { + __unregister_enc_region_locked(kvm, + list_entry(pos, struct enc_region, list)); + } + } + + mutex_unlock(&kvm->lock); + + sev_unbind_asid(kvm, sev->handle); + sev_asid_free(sev->asid); +} + +static void avic_vm_destroy(struct kvm *kvm) +{ + unsigned long flags; + struct kvm_svm *kvm_svm = to_kvm_svm(kvm); + + if (!avic) + return; + + if (kvm_svm->avic_logical_id_table_page) + __free_page(kvm_svm->avic_logical_id_table_page); + if (kvm_svm->avic_physical_id_table_page) + __free_page(kvm_svm->avic_physical_id_table_page); + + spin_lock_irqsave(&svm_vm_data_hash_lock, flags); + hash_del(&kvm_svm->hnode); + spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags); +} + +static void svm_vm_destroy(struct kvm *kvm) +{ + avic_vm_destroy(kvm); + sev_vm_destroy(kvm); +} + +static int avic_vm_init(struct kvm *kvm) +{ + unsigned long flags; + int err = -ENOMEM; + struct kvm_svm *kvm_svm = to_kvm_svm(kvm); + struct kvm_svm *k2; + struct page *p_page; + struct page *l_page; + u32 vm_id; + + if (!avic) + return 0; + + /* Allocating physical APIC ID table (4KB) */ + p_page = alloc_page(GFP_KERNEL_ACCOUNT); + if (!p_page) + goto free_avic; + + kvm_svm->avic_physical_id_table_page = p_page; + clear_page(page_address(p_page)); + + /* Allocating logical APIC ID table (4KB) */ + l_page = alloc_page(GFP_KERNEL_ACCOUNT); + if (!l_page) + goto free_avic; + + kvm_svm->avic_logical_id_table_page = l_page; + clear_page(page_address(l_page)); + + spin_lock_irqsave(&svm_vm_data_hash_lock, flags); + again: + vm_id = next_vm_id = (next_vm_id + 1) & AVIC_VM_ID_MASK; + if (vm_id == 0) { /* id is 1-based, zero is not okay */ + next_vm_id_wrapped = 1; + goto again; + } + /* Is it still in use? Only possible if wrapped at least once */ + if (next_vm_id_wrapped) { + hash_for_each_possible(svm_vm_data_hash, k2, hnode, vm_id) { + if (k2->avic_vm_id == vm_id) + goto again; + } + } + kvm_svm->avic_vm_id = vm_id; + hash_add(svm_vm_data_hash, &kvm_svm->hnode, kvm_svm->avic_vm_id); + spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags); + + return 0; + +free_avic: + avic_vm_destroy(kvm); + return err; +} + +static int svm_vm_init(struct kvm *kvm) +{ + if (avic) { + int ret = avic_vm_init(kvm); + if (ret) + return ret; + } + + kvm_apicv_init(kvm, avic); + return 0; +} + +static inline int +avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r) +{ + int ret = 0; + unsigned long flags; + struct amd_svm_iommu_ir *ir; + struct vcpu_svm *svm = to_svm(vcpu); + + if (!kvm_arch_has_assigned_device(vcpu->kvm)) + return 0; + + /* + * Here, we go through the per-vcpu ir_list to update all existing + * interrupt remapping table entry targeting this vcpu. + */ + spin_lock_irqsave(&svm->ir_list_lock, flags); + + if (list_empty(&svm->ir_list)) + goto out; + + list_for_each_entry(ir, &svm->ir_list, node) { + ret = amd_iommu_update_ga(cpu, r, ir->data); + if (ret) + break; + } +out: + spin_unlock_irqrestore(&svm->ir_list_lock, flags); + return ret; +} + +static void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu) +{ + u64 entry; + /* ID = 0xff (broadcast), ID > 0xff (reserved) */ + int h_physical_id = kvm_cpu_get_apicid(cpu); + struct vcpu_svm *svm = to_svm(vcpu); + + if (!kvm_vcpu_apicv_active(vcpu)) + return; + + /* + * Since the host physical APIC id is 8 bits, + * we can support host APIC ID upto 255. + */ + if (WARN_ON(h_physical_id > AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK)) + return; + + entry = READ_ONCE(*(svm->avic_physical_id_cache)); + WARN_ON(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK); + + entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK; + entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK); + + entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK; + if (svm->avic_is_running) + entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK; + + WRITE_ONCE(*(svm->avic_physical_id_cache), entry); + avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, + svm->avic_is_running); +} + +static void avic_vcpu_put(struct kvm_vcpu *vcpu) +{ + u64 entry; + struct vcpu_svm *svm = to_svm(vcpu); + + if (!kvm_vcpu_apicv_active(vcpu)) + return; + + entry = READ_ONCE(*(svm->avic_physical_id_cache)); + if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK) + avic_update_iommu_vcpu_affinity(vcpu, -1, 0); + + entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK; + WRITE_ONCE(*(svm->avic_physical_id_cache), entry); +} + +/** + * This function is called during VCPU halt/unhalt. + */ +static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + svm->avic_is_running = is_run; + if (is_run) + avic_vcpu_load(vcpu, vcpu->cpu); + else + avic_vcpu_put(vcpu); +} + +static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) +{ + struct vcpu_svm *svm = to_svm(vcpu); + u32 dummy; + u32 eax = 1; + + svm->spec_ctrl = 0; + svm->virt_spec_ctrl = 0; + + if (!init_event) { + svm->vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE | + MSR_IA32_APICBASE_ENABLE; + if (kvm_vcpu_is_reset_bsp(&svm->vcpu)) + svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP; + } + init_vmcb(svm); + + kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, false); + kvm_rdx_write(vcpu, eax); + + if (kvm_vcpu_apicv_active(vcpu) && !init_event) + avic_update_vapic_bar(svm, APIC_DEFAULT_PHYS_BASE); +} + +static int avic_init_vcpu(struct vcpu_svm *svm) +{ + int ret; + struct kvm_vcpu *vcpu = &svm->vcpu; + + if (!avic || !irqchip_in_kernel(vcpu->kvm)) + return 0; + + ret = avic_init_backing_page(&svm->vcpu); + if (ret) + return ret; + + INIT_LIST_HEAD(&svm->ir_list); + spin_lock_init(&svm->ir_list_lock); + svm->dfr_reg = APIC_DFR_FLAT; + + return ret; +} + +static int svm_create_vcpu(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm; + struct page *page; + struct page *msrpm_pages; + struct page *hsave_page; + struct page *nested_msrpm_pages; + int err; + + BUILD_BUG_ON(offsetof(struct vcpu_svm, vcpu) != 0); + svm = to_svm(vcpu); + + err = -ENOMEM; + page = alloc_page(GFP_KERNEL_ACCOUNT); + if (!page) + goto out; + + msrpm_pages = alloc_pages(GFP_KERNEL_ACCOUNT, MSRPM_ALLOC_ORDER); + if (!msrpm_pages) + goto free_page1; + + nested_msrpm_pages = alloc_pages(GFP_KERNEL_ACCOUNT, MSRPM_ALLOC_ORDER); + if (!nested_msrpm_pages) + goto free_page2; + + hsave_page = alloc_page(GFP_KERNEL_ACCOUNT); + if (!hsave_page) + goto free_page3; + + err = avic_init_vcpu(svm); + if (err) + goto free_page4; + + /* We initialize this flag to true to make sure that the is_running + * bit would be set the first time the vcpu is loaded. + */ + if (irqchip_in_kernel(vcpu->kvm) && kvm_apicv_activated(vcpu->kvm)) + svm->avic_is_running = true; + + svm->nested.hsave = page_address(hsave_page); + + svm->msrpm = page_address(msrpm_pages); + svm_vcpu_init_msrpm(svm->msrpm); + + svm->nested.msrpm = page_address(nested_msrpm_pages); + svm_vcpu_init_msrpm(svm->nested.msrpm); + + svm->vmcb = page_address(page); + clear_page(svm->vmcb); + svm->vmcb_pa = __sme_set(page_to_pfn(page) << PAGE_SHIFT); + svm->asid_generation = 0; + init_vmcb(svm); + + svm_init_osvw(vcpu); + vcpu->arch.microcode_version = 0x01000065; + + return 0; + +free_page4: + __free_page(hsave_page); +free_page3: + __free_pages(nested_msrpm_pages, MSRPM_ALLOC_ORDER); +free_page2: + __free_pages(msrpm_pages, MSRPM_ALLOC_ORDER); +free_page1: + __free_page(page); +out: + return err; +} + +static void svm_clear_current_vmcb(struct vmcb *vmcb) +{ + int i; + + for_each_online_cpu(i) + cmpxchg(&per_cpu(svm_data, i)->current_vmcb, vmcb, NULL); +} + +static void svm_free_vcpu(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + /* + * The vmcb page can be recycled, causing a false negative in + * svm_vcpu_load(). So, ensure that no logical CPU has this + * vmcb page recorded as its current vmcb. + */ + svm_clear_current_vmcb(svm->vmcb); + + __free_page(pfn_to_page(__sme_clr(svm->vmcb_pa) >> PAGE_SHIFT)); + __free_pages(virt_to_page(svm->msrpm), MSRPM_ALLOC_ORDER); + __free_page(virt_to_page(svm->nested.hsave)); + __free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER); +} + +static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct svm_cpu_data *sd = per_cpu(svm_data, cpu); + int i; + + if (unlikely(cpu != vcpu->cpu)) { + svm->asid_generation = 0; + mark_all_dirty(svm->vmcb); + } + +#ifdef CONFIG_X86_64 + rdmsrl(MSR_GS_BASE, to_svm(vcpu)->host.gs_base); +#endif + savesegment(fs, svm->host.fs); + savesegment(gs, svm->host.gs); + svm->host.ldt = kvm_read_ldt(); + + for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++) + rdmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]); + + if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) { + u64 tsc_ratio = vcpu->arch.tsc_scaling_ratio; + if (tsc_ratio != __this_cpu_read(current_tsc_ratio)) { + __this_cpu_write(current_tsc_ratio, tsc_ratio); + wrmsrl(MSR_AMD64_TSC_RATIO, tsc_ratio); + } + } + /* This assumes that the kernel never uses MSR_TSC_AUX */ + if (static_cpu_has(X86_FEATURE_RDTSCP)) + wrmsrl(MSR_TSC_AUX, svm->tsc_aux); + + if (sd->current_vmcb != svm->vmcb) { + sd->current_vmcb = svm->vmcb; + indirect_branch_prediction_barrier(); + } + avic_vcpu_load(vcpu, cpu); +} + +static void svm_vcpu_put(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + int i; + + avic_vcpu_put(vcpu); + + ++vcpu->stat.host_state_reload; + kvm_load_ldt(svm->host.ldt); +#ifdef CONFIG_X86_64 + loadsegment(fs, svm->host.fs); + wrmsrl(MSR_KERNEL_GS_BASE, current->thread.gsbase); + load_gs_index(svm->host.gs); +#else +#ifdef CONFIG_X86_32_LAZY_GS + loadsegment(gs, svm->host.gs); +#endif +#endif + for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++) + wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]); +} + +static void svm_vcpu_blocking(struct kvm_vcpu *vcpu) +{ + avic_set_running(vcpu, false); +} + +static void svm_vcpu_unblocking(struct kvm_vcpu *vcpu) +{ + if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu)) + kvm_vcpu_update_apicv(vcpu); + avic_set_running(vcpu, true); +} + +static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + unsigned long rflags = svm->vmcb->save.rflags; + + if (svm->nmi_singlestep) { + /* Hide our flags if they were not set by the guest */ + if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF)) + rflags &= ~X86_EFLAGS_TF; + if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_RF)) + rflags &= ~X86_EFLAGS_RF; + } + return rflags; +} + +static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) +{ + if (to_svm(vcpu)->nmi_singlestep) + rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF); + + /* + * Any change of EFLAGS.VM is accompanied by a reload of SS + * (caused by either a task switch or an inter-privilege IRET), + * so we do not need to update the CPL here. + */ + to_svm(vcpu)->vmcb->save.rflags = rflags; +} + +static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) +{ + switch (reg) { + case VCPU_EXREG_PDPTR: + BUG_ON(!npt_enabled); + load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu)); + break; + default: + WARN_ON_ONCE(1); + } +} + +static inline void svm_enable_vintr(struct vcpu_svm *svm) +{ + struct vmcb_control_area *control; + + /* The following fields are ignored when AVIC is enabled */ + WARN_ON(kvm_vcpu_apicv_active(&svm->vcpu)); + + /* + * This is just a dummy VINTR to actually cause a vmexit to happen. + * Actual injection of virtual interrupts happens through EVENTINJ. + */ + control = &svm->vmcb->control; + control->int_vector = 0x0; + control->int_ctl &= ~V_INTR_PRIO_MASK; + control->int_ctl |= V_IRQ_MASK | + ((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT); + mark_dirty(svm->vmcb, VMCB_INTR); +} + +static void svm_set_vintr(struct vcpu_svm *svm) +{ + set_intercept(svm, INTERCEPT_VINTR); + if (is_intercept(svm, INTERCEPT_VINTR)) + svm_enable_vintr(svm); +} + +static void svm_clear_vintr(struct vcpu_svm *svm) +{ + clr_intercept(svm, INTERCEPT_VINTR); + + svm->vmcb->control.int_ctl &= ~V_IRQ_MASK; + mark_dirty(svm->vmcb, VMCB_INTR); +} + +static struct vmcb_seg *svm_seg(struct kvm_vcpu *vcpu, int seg) +{ + struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save; + + switch (seg) { + case VCPU_SREG_CS: return &save->cs; + case VCPU_SREG_DS: return &save->ds; + case VCPU_SREG_ES: return &save->es; + case VCPU_SREG_FS: return &save->fs; + case VCPU_SREG_GS: return &save->gs; + case VCPU_SREG_SS: return &save->ss; + case VCPU_SREG_TR: return &save->tr; + case VCPU_SREG_LDTR: return &save->ldtr; + } + BUG(); + return NULL; +} + +static u64 svm_get_segment_base(struct kvm_vcpu *vcpu, int seg) +{ + struct vmcb_seg *s = svm_seg(vcpu, seg); + + return s->base; +} + +static void svm_get_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg) +{ + struct vmcb_seg *s = svm_seg(vcpu, seg); + + var->base = s->base; + var->limit = s->limit; + var->selector = s->selector; + var->type = s->attrib & SVM_SELECTOR_TYPE_MASK; + var->s = (s->attrib >> SVM_SELECTOR_S_SHIFT) & 1; + var->dpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3; + var->present = (s->attrib >> SVM_SELECTOR_P_SHIFT) & 1; + var->avl = (s->attrib >> SVM_SELECTOR_AVL_SHIFT) & 1; + var->l = (s->attrib >> SVM_SELECTOR_L_SHIFT) & 1; + var->db = (s->attrib >> SVM_SELECTOR_DB_SHIFT) & 1; + + /* + * AMD CPUs circa 2014 track the G bit for all segments except CS. + * However, the SVM spec states that the G bit is not observed by the + * CPU, and some VMware virtual CPUs drop the G bit for all segments. + * So let's synthesize a legal G bit for all segments, this helps + * running KVM nested. It also helps cross-vendor migration, because + * Intel's vmentry has a check on the 'G' bit. + */ + var->g = s->limit > 0xfffff; + + /* + * AMD's VMCB does not have an explicit unusable field, so emulate it + * for cross vendor migration purposes by "not present" + */ + var->unusable = !var->present; + + switch (seg) { + case VCPU_SREG_TR: + /* + * Work around a bug where the busy flag in the tr selector + * isn't exposed + */ + var->type |= 0x2; + break; + case VCPU_SREG_DS: + case VCPU_SREG_ES: + case VCPU_SREG_FS: + case VCPU_SREG_GS: + /* + * The accessed bit must always be set in the segment + * descriptor cache, although it can be cleared in the + * descriptor, the cached bit always remains at 1. Since + * Intel has a check on this, set it here to support + * cross-vendor migration. + */ + if (!var->unusable) + var->type |= 0x1; + break; + case VCPU_SREG_SS: + /* + * On AMD CPUs sometimes the DB bit in the segment + * descriptor is left as 1, although the whole segment has + * been made unusable. Clear it here to pass an Intel VMX + * entry check when cross vendor migrating. + */ + if (var->unusable) + var->db = 0; + /* This is symmetric with svm_set_segment() */ + var->dpl = to_svm(vcpu)->vmcb->save.cpl; + break; + } +} + +static int svm_get_cpl(struct kvm_vcpu *vcpu) +{ + struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save; + + return save->cpl; +} + +static void svm_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + dt->size = svm->vmcb->save.idtr.limit; + dt->address = svm->vmcb->save.idtr.base; +} + +static void svm_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + svm->vmcb->save.idtr.limit = dt->size; + svm->vmcb->save.idtr.base = dt->address ; + mark_dirty(svm->vmcb, VMCB_DT); +} + +static void svm_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + dt->size = svm->vmcb->save.gdtr.limit; + dt->address = svm->vmcb->save.gdtr.base; +} + +static void svm_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + svm->vmcb->save.gdtr.limit = dt->size; + svm->vmcb->save.gdtr.base = dt->address ; + mark_dirty(svm->vmcb, VMCB_DT); +} + +static void svm_decache_cr0_guest_bits(struct kvm_vcpu *vcpu) +{ +} + +static void svm_decache_cr4_guest_bits(struct kvm_vcpu *vcpu) +{ +} + +static void update_cr0_intercept(struct vcpu_svm *svm) +{ + ulong gcr0 = svm->vcpu.arch.cr0; + u64 *hcr0 = &svm->vmcb->save.cr0; + + *hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK) + | (gcr0 & SVM_CR0_SELECTIVE_MASK); + + mark_dirty(svm->vmcb, VMCB_CR); + + if (gcr0 == *hcr0) { + clr_cr_intercept(svm, INTERCEPT_CR0_READ); + clr_cr_intercept(svm, INTERCEPT_CR0_WRITE); + } else { + set_cr_intercept(svm, INTERCEPT_CR0_READ); + set_cr_intercept(svm, INTERCEPT_CR0_WRITE); + } +} + +static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) +{ + struct vcpu_svm *svm = to_svm(vcpu); + +#ifdef CONFIG_X86_64 + if (vcpu->arch.efer & EFER_LME) { + if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) { + vcpu->arch.efer |= EFER_LMA; + svm->vmcb->save.efer |= EFER_LMA | EFER_LME; + } + + if (is_paging(vcpu) && !(cr0 & X86_CR0_PG)) { + vcpu->arch.efer &= ~EFER_LMA; + svm->vmcb->save.efer &= ~(EFER_LMA | EFER_LME); + } + } +#endif + vcpu->arch.cr0 = cr0; + + if (!npt_enabled) + cr0 |= X86_CR0_PG | X86_CR0_WP; + + /* + * re-enable caching here because the QEMU bios + * does not do it - this results in some delay at + * reboot + */ + if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) + cr0 &= ~(X86_CR0_CD | X86_CR0_NW); + svm->vmcb->save.cr0 = cr0; + mark_dirty(svm->vmcb, VMCB_CR); + update_cr0_intercept(svm); +} + +static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) +{ + unsigned long host_cr4_mce = cr4_read_shadow() & X86_CR4_MCE; + unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4; + + if (cr4 & X86_CR4_VMXE) + return 1; + + if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE)) + svm_flush_tlb(vcpu, true); + + vcpu->arch.cr4 = cr4; + if (!npt_enabled) + cr4 |= X86_CR4_PAE; + cr4 |= host_cr4_mce; + to_svm(vcpu)->vmcb->save.cr4 = cr4; + mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR); + return 0; +} + +static void svm_set_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb_seg *s = svm_seg(vcpu, seg); + + s->base = var->base; + s->limit = var->limit; + s->selector = var->selector; + s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK); + s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT; + s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT; + s->attrib |= ((var->present & 1) && !var->unusable) << SVM_SELECTOR_P_SHIFT; + s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT; + s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT; + s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT; + s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT; + + /* + * This is always accurate, except if SYSRET returned to a segment + * with SS.DPL != 3. Intel does not have this quirk, and always + * forces SS.DPL to 3 on sysret, so we ignore that case; fixing it + * would entail passing the CPL to userspace and back. + */ + if (seg == VCPU_SREG_SS) + /* This is symmetric with svm_get_segment() */ + svm->vmcb->save.cpl = (var->dpl & 3); + + mark_dirty(svm->vmcb, VMCB_SEG); +} + +static void update_bp_intercept(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + clr_exception_intercept(svm, BP_VECTOR); + + if (vcpu->guest_debug & KVM_GUESTDBG_ENABLE) { + if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP) + set_exception_intercept(svm, BP_VECTOR); + } else + vcpu->guest_debug = 0; +} + +static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd) +{ + if (sd->next_asid > sd->max_asid) { + ++sd->asid_generation; + sd->next_asid = sd->min_asid; + svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID; + } + + svm->asid_generation = sd->asid_generation; + svm->vmcb->control.asid = sd->next_asid++; + + mark_dirty(svm->vmcb, VMCB_ASID); +} + +static u64 svm_get_dr6(struct kvm_vcpu *vcpu) +{ + return to_svm(vcpu)->vmcb->save.dr6; +} + +static void svm_set_dr6(struct kvm_vcpu *vcpu, unsigned long value) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + svm->vmcb->save.dr6 = value; + mark_dirty(svm->vmcb, VMCB_DR); +} + +static void svm_sync_dirty_debug_regs(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + get_debugreg(vcpu->arch.db[0], 0); + get_debugreg(vcpu->arch.db[1], 1); + get_debugreg(vcpu->arch.db[2], 2); + get_debugreg(vcpu->arch.db[3], 3); + vcpu->arch.dr6 = svm_get_dr6(vcpu); + vcpu->arch.dr7 = svm->vmcb->save.dr7; + + vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_WONT_EXIT; + set_dr_intercepts(svm); +} + +static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + svm->vmcb->save.dr7 = value; + mark_dirty(svm->vmcb, VMCB_DR); +} + +static int pf_interception(struct vcpu_svm *svm) +{ + u64 fault_address = __sme_clr(svm->vmcb->control.exit_info_2); + u64 error_code = svm->vmcb->control.exit_info_1; + + return kvm_handle_page_fault(&svm->vcpu, error_code, fault_address, + static_cpu_has(X86_FEATURE_DECODEASSISTS) ? + svm->vmcb->control.insn_bytes : NULL, + svm->vmcb->control.insn_len); +} + +static int npf_interception(struct vcpu_svm *svm) +{ + u64 fault_address = __sme_clr(svm->vmcb->control.exit_info_2); + u64 error_code = svm->vmcb->control.exit_info_1; + + trace_kvm_page_fault(fault_address, error_code); + return kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code, + static_cpu_has(X86_FEATURE_DECODEASSISTS) ? + svm->vmcb->control.insn_bytes : NULL, + svm->vmcb->control.insn_len); +} + +static int db_interception(struct vcpu_svm *svm) +{ + struct kvm_run *kvm_run = svm->vcpu.run; + struct kvm_vcpu *vcpu = &svm->vcpu; + + if (!(svm->vcpu.guest_debug & + (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP)) && + !svm->nmi_singlestep) { + kvm_queue_exception(&svm->vcpu, DB_VECTOR); + return 1; + } + + if (svm->nmi_singlestep) { + disable_nmi_singlestep(svm); + /* Make sure we check for pending NMIs upon entry */ + kvm_make_request(KVM_REQ_EVENT, vcpu); + } + + if (svm->vcpu.guest_debug & + (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP)) { + kvm_run->exit_reason = KVM_EXIT_DEBUG; + kvm_run->debug.arch.pc = + svm->vmcb->save.cs.base + svm->vmcb->save.rip; + kvm_run->debug.arch.exception = DB_VECTOR; + return 0; + } + + return 1; +} + +static int bp_interception(struct vcpu_svm *svm) +{ + struct kvm_run *kvm_run = svm->vcpu.run; + + kvm_run->exit_reason = KVM_EXIT_DEBUG; + kvm_run->debug.arch.pc = svm->vmcb->save.cs.base + svm->vmcb->save.rip; + kvm_run->debug.arch.exception = BP_VECTOR; + return 0; +} + +static int ud_interception(struct vcpu_svm *svm) +{ + return handle_ud(&svm->vcpu); +} + +static int ac_interception(struct vcpu_svm *svm) +{ + kvm_queue_exception_e(&svm->vcpu, AC_VECTOR, 0); + return 1; +} + +static int gp_interception(struct vcpu_svm *svm) +{ + struct kvm_vcpu *vcpu = &svm->vcpu; + u32 error_code = svm->vmcb->control.exit_info_1; + + WARN_ON_ONCE(!enable_vmware_backdoor); + + /* + * VMware backdoor emulation on #GP interception only handles IN{S}, + * OUT{S}, and RDPMC, none of which generate a non-zero error code. + */ + if (error_code) { + kvm_queue_exception_e(vcpu, GP_VECTOR, error_code); + return 1; + } + return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP); +} + +static bool is_erratum_383(void) +{ + int err, i; + u64 value; + + if (!erratum_383_found) + return false; + + value = native_read_msr_safe(MSR_IA32_MC0_STATUS, &err); + if (err) + return false; + + /* Bit 62 may or may not be set for this mce */ + value &= ~(1ULL << 62); + + if (value != 0xb600000000010015ULL) + return false; + + /* Clear MCi_STATUS registers */ + for (i = 0; i < 6; ++i) + native_write_msr_safe(MSR_IA32_MCx_STATUS(i), 0, 0); + + value = native_read_msr_safe(MSR_IA32_MCG_STATUS, &err); + if (!err) { + u32 low, high; + + value &= ~(1ULL << 2); + low = lower_32_bits(value); + high = upper_32_bits(value); + + native_write_msr_safe(MSR_IA32_MCG_STATUS, low, high); + } + + /* Flush tlb to evict multi-match entries */ + __flush_tlb_all(); + + return true; +} + +static void svm_handle_mce(struct vcpu_svm *svm) +{ + if (is_erratum_383()) { + /* + * Erratum 383 triggered. Guest state is corrupt so kill the + * guest. + */ + pr_err("KVM: Guest triggered AMD Erratum 383\n"); + + kvm_make_request(KVM_REQ_TRIPLE_FAULT, &svm->vcpu); + + return; + } + + /* + * On an #MC intercept the MCE handler is not called automatically in + * the host. So do it by hand here. + */ + asm volatile ( + "int $0x12\n"); + /* not sure if we ever come back to this point */ + + return; +} + +static int mc_interception(struct vcpu_svm *svm) +{ + return 1; +} + +static int shutdown_interception(struct vcpu_svm *svm) +{ + struct kvm_run *kvm_run = svm->vcpu.run; + + /* + * VMCB is undefined after a SHUTDOWN intercept + * so reinitialize it. + */ + clear_page(svm->vmcb); + init_vmcb(svm); + + kvm_run->exit_reason = KVM_EXIT_SHUTDOWN; + return 0; +} + +static int io_interception(struct vcpu_svm *svm) +{ + struct kvm_vcpu *vcpu = &svm->vcpu; + u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */ + int size, in, string; + unsigned port; + + ++svm->vcpu.stat.io_exits; + string = (io_info & SVM_IOIO_STR_MASK) != 0; + in = (io_info & SVM_IOIO_TYPE_MASK) != 0; + if (string) + return kvm_emulate_instruction(vcpu, 0); + + port = io_info >> 16; + size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT; + svm->next_rip = svm->vmcb->control.exit_info_2; + + return kvm_fast_pio(&svm->vcpu, size, port, in); +} + +static int nmi_interception(struct vcpu_svm *svm) +{ + return 1; +} + +static int intr_interception(struct vcpu_svm *svm) +{ + ++svm->vcpu.stat.irq_exits; + return 1; +} + +static int nop_on_interception(struct vcpu_svm *svm) +{ + return 1; +} + +static int halt_interception(struct vcpu_svm *svm) +{ + return kvm_emulate_halt(&svm->vcpu); +} + +static int vmmcall_interception(struct vcpu_svm *svm) +{ + return kvm_emulate_hypercall(&svm->vcpu); +} + +static unsigned long nested_svm_get_tdp_cr3(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + return svm->nested.nested_cr3; +} + +static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index) +{ + struct vcpu_svm *svm = to_svm(vcpu); + u64 cr3 = svm->nested.nested_cr3; + u64 pdpte; + int ret; + + ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(__sme_clr(cr3)), &pdpte, + offset_in_page(cr3) + index * 8, 8); + if (ret) + return 0; + return pdpte; +} + +static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu, + struct x86_exception *fault) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (svm->vmcb->control.exit_code != SVM_EXIT_NPF) { + /* + * TODO: track the cause of the nested page fault, and + * correctly fill in the high bits of exit_info_1. + */ + svm->vmcb->control.exit_code = SVM_EXIT_NPF; + svm->vmcb->control.exit_code_hi = 0; + svm->vmcb->control.exit_info_1 = (1ULL << 32); + svm->vmcb->control.exit_info_2 = fault->address; + } + + svm->vmcb->control.exit_info_1 &= ~0xffffffffULL; + svm->vmcb->control.exit_info_1 |= fault->error_code; + + /* + * The present bit is always zero for page structure faults on real + * hardware. + */ + if (svm->vmcb->control.exit_info_1 & (2ULL << 32)) + svm->vmcb->control.exit_info_1 &= ~1; + + nested_svm_vmexit(svm); +} + +static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu) +{ + WARN_ON(mmu_is_nested(vcpu)); + + vcpu->arch.mmu = &vcpu->arch.guest_mmu; + kvm_init_shadow_mmu(vcpu); + vcpu->arch.mmu->get_guest_pgd = nested_svm_get_tdp_cr3; + vcpu->arch.mmu->get_pdptr = nested_svm_get_tdp_pdptr; + vcpu->arch.mmu->inject_page_fault = nested_svm_inject_npf_exit; + vcpu->arch.mmu->shadow_root_level = get_npt_level(vcpu); + reset_shadow_zero_bits_mask(vcpu, vcpu->arch.mmu); + vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu; +} + +static void nested_svm_uninit_mmu_context(struct kvm_vcpu *vcpu) +{ + vcpu->arch.mmu = &vcpu->arch.root_mmu; + vcpu->arch.walk_mmu = &vcpu->arch.root_mmu; +} + +static int nested_svm_check_permissions(struct vcpu_svm *svm) +{ + if (!(svm->vcpu.arch.efer & EFER_SVME) || + !is_paging(&svm->vcpu)) { + kvm_queue_exception(&svm->vcpu, UD_VECTOR); + return 1; + } + + if (svm->vmcb->save.cpl) { + kvm_inject_gp(&svm->vcpu, 0); + return 1; + } + + return 0; +} + +static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, + bool has_error_code, u32 error_code) +{ + int vmexit; + + if (!is_guest_mode(&svm->vcpu)) + return 0; + + vmexit = nested_svm_intercept(svm); + if (vmexit != NESTED_EXIT_DONE) + return 0; + + svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr; + svm->vmcb->control.exit_code_hi = 0; + svm->vmcb->control.exit_info_1 = error_code; + + /* + * EXITINFO2 is undefined for all exception intercepts other + * than #PF. + */ + if (svm->vcpu.arch.exception.nested_apf) + svm->vmcb->control.exit_info_2 = svm->vcpu.arch.apf.nested_apf_token; + else if (svm->vcpu.arch.exception.has_payload) + svm->vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload; + else + svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2; + + svm->nested.exit_required = true; + return vmexit; +} + +static void nested_svm_intr(struct vcpu_svm *svm) +{ + svm->vmcb->control.exit_code = SVM_EXIT_INTR; + svm->vmcb->control.exit_info_1 = 0; + svm->vmcb->control.exit_info_2 = 0; + + /* nested_svm_vmexit this gets called afterwards from handle_exit */ + svm->nested.exit_required = true; + trace_kvm_nested_intr_vmexit(svm->vmcb->save.rip); +} + +static bool nested_exit_on_intr(struct vcpu_svm *svm) +{ + return (svm->nested.intercept & 1ULL); +} + +static int svm_check_nested_events(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + bool block_nested_events = + kvm_event_needs_reinjection(vcpu) || svm->nested.exit_required; + + if (kvm_cpu_has_interrupt(vcpu) && nested_exit_on_intr(svm)) { + if (block_nested_events) + return -EBUSY; + nested_svm_intr(svm); + return 0; + } + + return 0; +} + +/* This function returns true if it is save to enable the nmi window */ +static inline bool nested_svm_nmi(struct vcpu_svm *svm) +{ + if (!is_guest_mode(&svm->vcpu)) + return true; + + if (!(svm->nested.intercept & (1ULL << INTERCEPT_NMI))) + return true; + + svm->vmcb->control.exit_code = SVM_EXIT_NMI; + svm->nested.exit_required = true; + + return false; +} + +static int nested_svm_intercept_ioio(struct vcpu_svm *svm) +{ + unsigned port, size, iopm_len; + u16 val, mask; + u8 start_bit; + u64 gpa; + + if (!(svm->nested.intercept & (1ULL << INTERCEPT_IOIO_PROT))) + return NESTED_EXIT_HOST; + + port = svm->vmcb->control.exit_info_1 >> 16; + size = (svm->vmcb->control.exit_info_1 & SVM_IOIO_SIZE_MASK) >> + SVM_IOIO_SIZE_SHIFT; + gpa = svm->nested.vmcb_iopm + (port / 8); + start_bit = port % 8; + iopm_len = (start_bit + size > 8) ? 2 : 1; + mask = (0xf >> (4 - size)) << start_bit; + val = 0; + + if (kvm_vcpu_read_guest(&svm->vcpu, gpa, &val, iopm_len)) + return NESTED_EXIT_DONE; + + return (val & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST; +} + +static int nested_svm_exit_handled_msr(struct vcpu_svm *svm) +{ + u32 offset, msr, value; + int write, mask; + + if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT))) + return NESTED_EXIT_HOST; + + msr = svm->vcpu.arch.regs[VCPU_REGS_RCX]; + offset = svm_msrpm_offset(msr); + write = svm->vmcb->control.exit_info_1 & 1; + mask = 1 << ((2 * (msr & 0xf)) + write); + + if (offset == MSR_INVALID) + return NESTED_EXIT_DONE; + + /* Offset is in 32 bit units but need in 8 bit units */ + offset *= 4; + + if (kvm_vcpu_read_guest(&svm->vcpu, svm->nested.vmcb_msrpm + offset, &value, 4)) + return NESTED_EXIT_DONE; + + return (value & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST; +} + +/* DB exceptions for our internal use must not cause vmexit */ +static int nested_svm_intercept_db(struct vcpu_svm *svm) +{ + unsigned long dr6; + + /* if we're not singlestepping, it's not ours */ + if (!svm->nmi_singlestep) + return NESTED_EXIT_DONE; + + /* if it's not a singlestep exception, it's not ours */ + if (kvm_get_dr(&svm->vcpu, 6, &dr6)) + return NESTED_EXIT_DONE; + if (!(dr6 & DR6_BS)) + return NESTED_EXIT_DONE; + + /* if the guest is singlestepping, it should get the vmexit */ + if (svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF) { + disable_nmi_singlestep(svm); + return NESTED_EXIT_DONE; + } + + /* it's ours, the nested hypervisor must not see this one */ + return NESTED_EXIT_HOST; +} + +static int nested_svm_exit_special(struct vcpu_svm *svm) +{ + u32 exit_code = svm->vmcb->control.exit_code; + + switch (exit_code) { + case SVM_EXIT_INTR: + case SVM_EXIT_NMI: + case SVM_EXIT_EXCP_BASE + MC_VECTOR: + return NESTED_EXIT_HOST; + case SVM_EXIT_NPF: + /* For now we are always handling NPFs when using them */ + if (npt_enabled) + return NESTED_EXIT_HOST; + break; + case SVM_EXIT_EXCP_BASE + PF_VECTOR: + /* When we're shadowing, trap PFs, but not async PF */ + if (!npt_enabled && svm->vcpu.arch.apf.host_apf_reason == 0) + return NESTED_EXIT_HOST; + break; + default: + break; + } + + return NESTED_EXIT_CONTINUE; +} + +static int nested_svm_intercept(struct vcpu_svm *svm) +{ + u32 exit_code = svm->vmcb->control.exit_code; + int vmexit = NESTED_EXIT_HOST; + + switch (exit_code) { + case SVM_EXIT_MSR: + vmexit = nested_svm_exit_handled_msr(svm); + break; + case SVM_EXIT_IOIO: + vmexit = nested_svm_intercept_ioio(svm); + break; + case SVM_EXIT_READ_CR0 ... SVM_EXIT_WRITE_CR8: { + u32 bit = 1U << (exit_code - SVM_EXIT_READ_CR0); + if (svm->nested.intercept_cr & bit) + vmexit = NESTED_EXIT_DONE; + break; + } + case SVM_EXIT_READ_DR0 ... SVM_EXIT_WRITE_DR7: { + u32 bit = 1U << (exit_code - SVM_EXIT_READ_DR0); + if (svm->nested.intercept_dr & bit) + vmexit = NESTED_EXIT_DONE; + break; + } + case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: { + u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE); + if (svm->nested.intercept_exceptions & excp_bits) { + if (exit_code == SVM_EXIT_EXCP_BASE + DB_VECTOR) + vmexit = nested_svm_intercept_db(svm); + else + vmexit = NESTED_EXIT_DONE; + } + /* async page fault always cause vmexit */ + else if ((exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) && + svm->vcpu.arch.exception.nested_apf != 0) + vmexit = NESTED_EXIT_DONE; + break; + } + case SVM_EXIT_ERR: { + vmexit = NESTED_EXIT_DONE; + break; + } + default: { + u64 exit_bits = 1ULL << (exit_code - SVM_EXIT_INTR); + if (svm->nested.intercept & exit_bits) + vmexit = NESTED_EXIT_DONE; + } + } + + return vmexit; +} + +static int nested_svm_exit_handled(struct vcpu_svm *svm) +{ + int vmexit; + + vmexit = nested_svm_intercept(svm); + + if (vmexit == NESTED_EXIT_DONE) + nested_svm_vmexit(svm); + + return vmexit; +} + +static inline void copy_vmcb_control_area(struct vmcb *dst_vmcb, struct vmcb *from_vmcb) +{ + struct vmcb_control_area *dst = &dst_vmcb->control; + struct vmcb_control_area *from = &from_vmcb->control; + + dst->intercept_cr = from->intercept_cr; + dst->intercept_dr = from->intercept_dr; + dst->intercept_exceptions = from->intercept_exceptions; + dst->intercept = from->intercept; + dst->iopm_base_pa = from->iopm_base_pa; + dst->msrpm_base_pa = from->msrpm_base_pa; + dst->tsc_offset = from->tsc_offset; + dst->asid = from->asid; + dst->tlb_ctl = from->tlb_ctl; + dst->int_ctl = from->int_ctl; + dst->int_vector = from->int_vector; + dst->int_state = from->int_state; + dst->exit_code = from->exit_code; + dst->exit_code_hi = from->exit_code_hi; + dst->exit_info_1 = from->exit_info_1; + dst->exit_info_2 = from->exit_info_2; + dst->exit_int_info = from->exit_int_info; + dst->exit_int_info_err = from->exit_int_info_err; + dst->nested_ctl = from->nested_ctl; + dst->event_inj = from->event_inj; + dst->event_inj_err = from->event_inj_err; + dst->nested_cr3 = from->nested_cr3; + dst->virt_ext = from->virt_ext; + dst->pause_filter_count = from->pause_filter_count; + dst->pause_filter_thresh = from->pause_filter_thresh; +} + +static int nested_svm_vmexit(struct vcpu_svm *svm) +{ + int rc; + struct vmcb *nested_vmcb; + struct vmcb *hsave = svm->nested.hsave; + struct vmcb *vmcb = svm->vmcb; + struct kvm_host_map map; + + trace_kvm_nested_vmexit_inject(vmcb->control.exit_code, + vmcb->control.exit_info_1, + vmcb->control.exit_info_2, + vmcb->control.exit_int_info, + vmcb->control.exit_int_info_err, + KVM_ISA_SVM); + + rc = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->nested.vmcb), &map); + if (rc) { + if (rc == -EINVAL) + kvm_inject_gp(&svm->vcpu, 0); + return 1; + } + + nested_vmcb = map.hva; + + /* Exit Guest-Mode */ + leave_guest_mode(&svm->vcpu); + svm->nested.vmcb = 0; + + /* Give the current vmcb to the guest */ + disable_gif(svm); + + nested_vmcb->save.es = vmcb->save.es; + nested_vmcb->save.cs = vmcb->save.cs; + nested_vmcb->save.ss = vmcb->save.ss; + nested_vmcb->save.ds = vmcb->save.ds; + nested_vmcb->save.gdtr = vmcb->save.gdtr; + nested_vmcb->save.idtr = vmcb->save.idtr; + nested_vmcb->save.efer = svm->vcpu.arch.efer; + nested_vmcb->save.cr0 = kvm_read_cr0(&svm->vcpu); + nested_vmcb->save.cr3 = kvm_read_cr3(&svm->vcpu); + nested_vmcb->save.cr2 = vmcb->save.cr2; + nested_vmcb->save.cr4 = svm->vcpu.arch.cr4; + nested_vmcb->save.rflags = kvm_get_rflags(&svm->vcpu); + nested_vmcb->save.rip = vmcb->save.rip; + nested_vmcb->save.rsp = vmcb->save.rsp; + nested_vmcb->save.rax = vmcb->save.rax; + nested_vmcb->save.dr7 = vmcb->save.dr7; + nested_vmcb->save.dr6 = vmcb->save.dr6; + nested_vmcb->save.cpl = vmcb->save.cpl; + + nested_vmcb->control.int_ctl = vmcb->control.int_ctl; + nested_vmcb->control.int_vector = vmcb->control.int_vector; + nested_vmcb->control.int_state = vmcb->control.int_state; + nested_vmcb->control.exit_code = vmcb->control.exit_code; + nested_vmcb->control.exit_code_hi = vmcb->control.exit_code_hi; + nested_vmcb->control.exit_info_1 = vmcb->control.exit_info_1; + nested_vmcb->control.exit_info_2 = vmcb->control.exit_info_2; + nested_vmcb->control.exit_int_info = vmcb->control.exit_int_info; + nested_vmcb->control.exit_int_info_err = vmcb->control.exit_int_info_err; + + if (svm->nrips_enabled) + nested_vmcb->control.next_rip = vmcb->control.next_rip; + + /* + * If we emulate a VMRUN/#VMEXIT in the same host #vmexit cycle we have + * to make sure that we do not lose injected events. So check event_inj + * here and copy it to exit_int_info if it is valid. + * Exit_int_info and event_inj can't be both valid because the case + * below only happens on a VMRUN instruction intercept which has + * no valid exit_int_info set. + */ + if (vmcb->control.event_inj & SVM_EVTINJ_VALID) { + struct vmcb_control_area *nc = &nested_vmcb->control; + + nc->exit_int_info = vmcb->control.event_inj; + nc->exit_int_info_err = vmcb->control.event_inj_err; + } + + nested_vmcb->control.tlb_ctl = 0; + nested_vmcb->control.event_inj = 0; + nested_vmcb->control.event_inj_err = 0; + + nested_vmcb->control.pause_filter_count = + svm->vmcb->control.pause_filter_count; + nested_vmcb->control.pause_filter_thresh = + svm->vmcb->control.pause_filter_thresh; + + /* We always set V_INTR_MASKING and remember the old value in hflags */ + if (!(svm->vcpu.arch.hflags & HF_VINTR_MASK)) + nested_vmcb->control.int_ctl &= ~V_INTR_MASKING_MASK; + + /* Restore the original control entries */ + copy_vmcb_control_area(vmcb, hsave); + + svm->vcpu.arch.tsc_offset = svm->vmcb->control.tsc_offset; + kvm_clear_exception_queue(&svm->vcpu); + kvm_clear_interrupt_queue(&svm->vcpu); + + svm->nested.nested_cr3 = 0; + + /* Restore selected save entries */ + svm->vmcb->save.es = hsave->save.es; + svm->vmcb->save.cs = hsave->save.cs; + svm->vmcb->save.ss = hsave->save.ss; + svm->vmcb->save.ds = hsave->save.ds; + svm->vmcb->save.gdtr = hsave->save.gdtr; + svm->vmcb->save.idtr = hsave->save.idtr; + kvm_set_rflags(&svm->vcpu, hsave->save.rflags); + svm_set_efer(&svm->vcpu, hsave->save.efer); + svm_set_cr0(&svm->vcpu, hsave->save.cr0 | X86_CR0_PE); + svm_set_cr4(&svm->vcpu, hsave->save.cr4); + if (npt_enabled) { + svm->vmcb->save.cr3 = hsave->save.cr3; + svm->vcpu.arch.cr3 = hsave->save.cr3; + } else { + (void)kvm_set_cr3(&svm->vcpu, hsave->save.cr3); + } + kvm_rax_write(&svm->vcpu, hsave->save.rax); + kvm_rsp_write(&svm->vcpu, hsave->save.rsp); + kvm_rip_write(&svm->vcpu, hsave->save.rip); + svm->vmcb->save.dr7 = 0; + svm->vmcb->save.cpl = 0; + svm->vmcb->control.exit_int_info = 0; + + mark_all_dirty(svm->vmcb); + + kvm_vcpu_unmap(&svm->vcpu, &map, true); + + nested_svm_uninit_mmu_context(&svm->vcpu); + kvm_mmu_reset_context(&svm->vcpu); + kvm_mmu_load(&svm->vcpu); + + /* + * Drop what we picked up for L2 via svm_complete_interrupts() so it + * doesn't end up in L1. + */ + svm->vcpu.arch.nmi_injected = false; + kvm_clear_exception_queue(&svm->vcpu); + kvm_clear_interrupt_queue(&svm->vcpu); + + return 0; +} + +static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm) +{ + /* + * This function merges the msr permission bitmaps of kvm and the + * nested vmcb. It is optimized in that it only merges the parts where + * the kvm msr permission bitmap may contain zero bits + */ + int i; + + if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT))) + return true; + + for (i = 0; i < MSRPM_OFFSETS; i++) { + u32 value, p; + u64 offset; + + if (msrpm_offsets[i] == 0xffffffff) + break; + + p = msrpm_offsets[i]; + offset = svm->nested.vmcb_msrpm + (p * 4); + + if (kvm_vcpu_read_guest(&svm->vcpu, offset, &value, 4)) + return false; + + svm->nested.msrpm[p] = svm->msrpm[p] | value; + } + + svm->vmcb->control.msrpm_base_pa = __sme_set(__pa(svm->nested.msrpm)); + + return true; +} + +static bool nested_vmcb_checks(struct vmcb *vmcb) +{ + if ((vmcb->save.efer & EFER_SVME) == 0) + return false; + + if ((vmcb->control.intercept & (1ULL << INTERCEPT_VMRUN)) == 0) + return false; + + if (vmcb->control.asid == 0) + return false; + + if ((vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && + !npt_enabled) + return false; + + return true; +} + +static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, + struct vmcb *nested_vmcb, struct kvm_host_map *map) +{ + bool evaluate_pending_interrupts = + is_intercept(svm, INTERCEPT_VINTR) || + is_intercept(svm, INTERCEPT_IRET); + + if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF) + svm->vcpu.arch.hflags |= HF_HIF_MASK; + else + svm->vcpu.arch.hflags &= ~HF_HIF_MASK; + + if (nested_vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE) { + svm->nested.nested_cr3 = nested_vmcb->control.nested_cr3; + nested_svm_init_mmu_context(&svm->vcpu); + } + + /* Load the nested guest state */ + svm->vmcb->save.es = nested_vmcb->save.es; + svm->vmcb->save.cs = nested_vmcb->save.cs; + svm->vmcb->save.ss = nested_vmcb->save.ss; + svm->vmcb->save.ds = nested_vmcb->save.ds; + svm->vmcb->save.gdtr = nested_vmcb->save.gdtr; + svm->vmcb->save.idtr = nested_vmcb->save.idtr; + kvm_set_rflags(&svm->vcpu, nested_vmcb->save.rflags); + svm_set_efer(&svm->vcpu, nested_vmcb->save.efer); + svm_set_cr0(&svm->vcpu, nested_vmcb->save.cr0); + svm_set_cr4(&svm->vcpu, nested_vmcb->save.cr4); + if (npt_enabled) { + svm->vmcb->save.cr3 = nested_vmcb->save.cr3; + svm->vcpu.arch.cr3 = nested_vmcb->save.cr3; + } else + (void)kvm_set_cr3(&svm->vcpu, nested_vmcb->save.cr3); + + /* Guest paging mode is active - reset mmu */ + kvm_mmu_reset_context(&svm->vcpu); + + svm->vmcb->save.cr2 = svm->vcpu.arch.cr2 = nested_vmcb->save.cr2; + kvm_rax_write(&svm->vcpu, nested_vmcb->save.rax); + kvm_rsp_write(&svm->vcpu, nested_vmcb->save.rsp); + kvm_rip_write(&svm->vcpu, nested_vmcb->save.rip); + + /* In case we don't even reach vcpu_run, the fields are not updated */ + svm->vmcb->save.rax = nested_vmcb->save.rax; + svm->vmcb->save.rsp = nested_vmcb->save.rsp; + svm->vmcb->save.rip = nested_vmcb->save.rip; + svm->vmcb->save.dr7 = nested_vmcb->save.dr7; + svm->vmcb->save.dr6 = nested_vmcb->save.dr6; + svm->vmcb->save.cpl = nested_vmcb->save.cpl; + + svm->nested.vmcb_msrpm = nested_vmcb->control.msrpm_base_pa & ~0x0fffULL; + svm->nested.vmcb_iopm = nested_vmcb->control.iopm_base_pa & ~0x0fffULL; + + /* cache intercepts */ + svm->nested.intercept_cr = nested_vmcb->control.intercept_cr; + svm->nested.intercept_dr = nested_vmcb->control.intercept_dr; + svm->nested.intercept_exceptions = nested_vmcb->control.intercept_exceptions; + svm->nested.intercept = nested_vmcb->control.intercept; + + svm_flush_tlb(&svm->vcpu, true); + svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK; + if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK) + svm->vcpu.arch.hflags |= HF_VINTR_MASK; + else + svm->vcpu.arch.hflags &= ~HF_VINTR_MASK; + + svm->vcpu.arch.tsc_offset += nested_vmcb->control.tsc_offset; + svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset; + + svm->vmcb->control.virt_ext = nested_vmcb->control.virt_ext; + svm->vmcb->control.int_vector = nested_vmcb->control.int_vector; + svm->vmcb->control.int_state = nested_vmcb->control.int_state; + svm->vmcb->control.event_inj = nested_vmcb->control.event_inj; + svm->vmcb->control.event_inj_err = nested_vmcb->control.event_inj_err; + + svm->vmcb->control.pause_filter_count = + nested_vmcb->control.pause_filter_count; + svm->vmcb->control.pause_filter_thresh = + nested_vmcb->control.pause_filter_thresh; + + kvm_vcpu_unmap(&svm->vcpu, map, true); + + /* Enter Guest-Mode */ + enter_guest_mode(&svm->vcpu); + + /* + * Merge guest and host intercepts - must be called with vcpu in + * guest-mode to take affect here + */ + recalc_intercepts(svm); + + svm->nested.vmcb = vmcb_gpa; + + /* + * If L1 had a pending IRQ/NMI before executing VMRUN, + * which wasn't delivered because it was disallowed (e.g. + * interrupts disabled), L0 needs to evaluate if this pending + * event should cause an exit from L2 to L1 or be delivered + * directly to L2. + * + * Usually this would be handled by the processor noticing an + * IRQ/NMI window request. However, VMRUN can unblock interrupts + * by implicitly setting GIF, so force L0 to perform pending event + * evaluation by requesting a KVM_REQ_EVENT. + */ + enable_gif(svm); + if (unlikely(evaluate_pending_interrupts)) + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + + mark_all_dirty(svm->vmcb); +} + +static int nested_svm_vmrun(struct vcpu_svm *svm) +{ + int ret; + struct vmcb *nested_vmcb; + struct vmcb *hsave = svm->nested.hsave; + struct vmcb *vmcb = svm->vmcb; + struct kvm_host_map map; + u64 vmcb_gpa; + + vmcb_gpa = svm->vmcb->save.rax; + + ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb_gpa), &map); + if (ret == -EINVAL) { + kvm_inject_gp(&svm->vcpu, 0); + return 1; + } else if (ret) { + return kvm_skip_emulated_instruction(&svm->vcpu); + } + + ret = kvm_skip_emulated_instruction(&svm->vcpu); + + nested_vmcb = map.hva; + + if (!nested_vmcb_checks(nested_vmcb)) { + nested_vmcb->control.exit_code = SVM_EXIT_ERR; + nested_vmcb->control.exit_code_hi = 0; + nested_vmcb->control.exit_info_1 = 0; + nested_vmcb->control.exit_info_2 = 0; + + kvm_vcpu_unmap(&svm->vcpu, &map, true); + + return ret; + } + + trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb_gpa, + nested_vmcb->save.rip, + nested_vmcb->control.int_ctl, + nested_vmcb->control.event_inj, + nested_vmcb->control.nested_ctl); + + trace_kvm_nested_intercepts(nested_vmcb->control.intercept_cr & 0xffff, + nested_vmcb->control.intercept_cr >> 16, + nested_vmcb->control.intercept_exceptions, + nested_vmcb->control.intercept); + + /* Clear internal status */ + kvm_clear_exception_queue(&svm->vcpu); + kvm_clear_interrupt_queue(&svm->vcpu); + + /* + * Save the old vmcb, so we don't need to pick what we save, but can + * restore everything when a VMEXIT occurs + */ + hsave->save.es = vmcb->save.es; + hsave->save.cs = vmcb->save.cs; + hsave->save.ss = vmcb->save.ss; + hsave->save.ds = vmcb->save.ds; + hsave->save.gdtr = vmcb->save.gdtr; + hsave->save.idtr = vmcb->save.idtr; + hsave->save.efer = svm->vcpu.arch.efer; + hsave->save.cr0 = kvm_read_cr0(&svm->vcpu); + hsave->save.cr4 = svm->vcpu.arch.cr4; + hsave->save.rflags = kvm_get_rflags(&svm->vcpu); + hsave->save.rip = kvm_rip_read(&svm->vcpu); + hsave->save.rsp = vmcb->save.rsp; + hsave->save.rax = vmcb->save.rax; + if (npt_enabled) + hsave->save.cr3 = vmcb->save.cr3; + else + hsave->save.cr3 = kvm_read_cr3(&svm->vcpu); + + copy_vmcb_control_area(hsave, vmcb); + + enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb, &map); + + if (!nested_svm_vmrun_msrpm(svm)) { + svm->vmcb->control.exit_code = SVM_EXIT_ERR; + svm->vmcb->control.exit_code_hi = 0; + svm->vmcb->control.exit_info_1 = 0; + svm->vmcb->control.exit_info_2 = 0; + + nested_svm_vmexit(svm); + } + + return ret; +} + +static void nested_svm_vmloadsave(struct vmcb *from_vmcb, struct vmcb *to_vmcb) +{ + to_vmcb->save.fs = from_vmcb->save.fs; + to_vmcb->save.gs = from_vmcb->save.gs; + to_vmcb->save.tr = from_vmcb->save.tr; + to_vmcb->save.ldtr = from_vmcb->save.ldtr; + to_vmcb->save.kernel_gs_base = from_vmcb->save.kernel_gs_base; + to_vmcb->save.star = from_vmcb->save.star; + to_vmcb->save.lstar = from_vmcb->save.lstar; + to_vmcb->save.cstar = from_vmcb->save.cstar; + to_vmcb->save.sfmask = from_vmcb->save.sfmask; + to_vmcb->save.sysenter_cs = from_vmcb->save.sysenter_cs; + to_vmcb->save.sysenter_esp = from_vmcb->save.sysenter_esp; + to_vmcb->save.sysenter_eip = from_vmcb->save.sysenter_eip; +} + +static int vmload_interception(struct vcpu_svm *svm) +{ + struct vmcb *nested_vmcb; + struct kvm_host_map map; + int ret; + + if (nested_svm_check_permissions(svm)) + return 1; + + ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->vmcb->save.rax), &map); + if (ret) { + if (ret == -EINVAL) + kvm_inject_gp(&svm->vcpu, 0); + return 1; + } + + nested_vmcb = map.hva; + + ret = kvm_skip_emulated_instruction(&svm->vcpu); + + nested_svm_vmloadsave(nested_vmcb, svm->vmcb); + kvm_vcpu_unmap(&svm->vcpu, &map, true); + + return ret; +} + +static int vmsave_interception(struct vcpu_svm *svm) +{ + struct vmcb *nested_vmcb; + struct kvm_host_map map; + int ret; + + if (nested_svm_check_permissions(svm)) + return 1; + + ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->vmcb->save.rax), &map); + if (ret) { + if (ret == -EINVAL) + kvm_inject_gp(&svm->vcpu, 0); + return 1; + } + + nested_vmcb = map.hva; + + ret = kvm_skip_emulated_instruction(&svm->vcpu); + + nested_svm_vmloadsave(svm->vmcb, nested_vmcb); + kvm_vcpu_unmap(&svm->vcpu, &map, true); + + return ret; +} + +static int vmrun_interception(struct vcpu_svm *svm) +{ + if (nested_svm_check_permissions(svm)) + return 1; + + return nested_svm_vmrun(svm); +} + +static int stgi_interception(struct vcpu_svm *svm) +{ + int ret; + + if (nested_svm_check_permissions(svm)) + return 1; + + /* + * If VGIF is enabled, the STGI intercept is only added to + * detect the opening of the SMI/NMI window; remove it now. + */ + if (vgif_enabled(svm)) + clr_intercept(svm, INTERCEPT_STGI); + + ret = kvm_skip_emulated_instruction(&svm->vcpu); + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + + enable_gif(svm); + + return ret; +} + +static int clgi_interception(struct vcpu_svm *svm) +{ + int ret; + + if (nested_svm_check_permissions(svm)) + return 1; + + ret = kvm_skip_emulated_instruction(&svm->vcpu); + + disable_gif(svm); + + /* After a CLGI no interrupts should come */ + if (!kvm_vcpu_apicv_active(&svm->vcpu)) + svm_clear_vintr(svm); + + return ret; +} + +static int invlpga_interception(struct vcpu_svm *svm) +{ + struct kvm_vcpu *vcpu = &svm->vcpu; + + trace_kvm_invlpga(svm->vmcb->save.rip, kvm_rcx_read(&svm->vcpu), + kvm_rax_read(&svm->vcpu)); + + /* Let's treat INVLPGA the same as INVLPG (can be optimized!) */ + kvm_mmu_invlpg(vcpu, kvm_rax_read(&svm->vcpu)); + + return kvm_skip_emulated_instruction(&svm->vcpu); +} + +static int skinit_interception(struct vcpu_svm *svm) +{ + trace_kvm_skinit(svm->vmcb->save.rip, kvm_rax_read(&svm->vcpu)); + + kvm_queue_exception(&svm->vcpu, UD_VECTOR); + return 1; +} + +static int wbinvd_interception(struct vcpu_svm *svm) +{ + return kvm_emulate_wbinvd(&svm->vcpu); +} + +static int xsetbv_interception(struct vcpu_svm *svm) +{ + u64 new_bv = kvm_read_edx_eax(&svm->vcpu); + u32 index = kvm_rcx_read(&svm->vcpu); + + if (kvm_set_xcr(&svm->vcpu, index, new_bv) == 0) { + return kvm_skip_emulated_instruction(&svm->vcpu); + } + + return 1; +} + +static int rdpru_interception(struct vcpu_svm *svm) +{ + kvm_queue_exception(&svm->vcpu, UD_VECTOR); + return 1; +} + +static int task_switch_interception(struct vcpu_svm *svm) +{ + u16 tss_selector; + int reason; + int int_type = svm->vmcb->control.exit_int_info & + SVM_EXITINTINFO_TYPE_MASK; + int int_vec = svm->vmcb->control.exit_int_info & SVM_EVTINJ_VEC_MASK; + uint32_t type = + svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_TYPE_MASK; + uint32_t idt_v = + svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_VALID; + bool has_error_code = false; + u32 error_code = 0; + + tss_selector = (u16)svm->vmcb->control.exit_info_1; + + if (svm->vmcb->control.exit_info_2 & + (1ULL << SVM_EXITINFOSHIFT_TS_REASON_IRET)) + reason = TASK_SWITCH_IRET; + else if (svm->vmcb->control.exit_info_2 & + (1ULL << SVM_EXITINFOSHIFT_TS_REASON_JMP)) + reason = TASK_SWITCH_JMP; + else if (idt_v) + reason = TASK_SWITCH_GATE; + else + reason = TASK_SWITCH_CALL; + + if (reason == TASK_SWITCH_GATE) { + switch (type) { + case SVM_EXITINTINFO_TYPE_NMI: + svm->vcpu.arch.nmi_injected = false; + break; + case SVM_EXITINTINFO_TYPE_EXEPT: + if (svm->vmcb->control.exit_info_2 & + (1ULL << SVM_EXITINFOSHIFT_TS_HAS_ERROR_CODE)) { + has_error_code = true; + error_code = + (u32)svm->vmcb->control.exit_info_2; + } + kvm_clear_exception_queue(&svm->vcpu); + break; + case SVM_EXITINTINFO_TYPE_INTR: + kvm_clear_interrupt_queue(&svm->vcpu); + break; + default: + break; + } + } + + if (reason != TASK_SWITCH_GATE || + int_type == SVM_EXITINTINFO_TYPE_SOFT || + (int_type == SVM_EXITINTINFO_TYPE_EXEPT && + (int_vec == OF_VECTOR || int_vec == BP_VECTOR))) { + if (!skip_emulated_instruction(&svm->vcpu)) + return 0; + } + + if (int_type != SVM_EXITINTINFO_TYPE_SOFT) + int_vec = -1; + + return kvm_task_switch(&svm->vcpu, tss_selector, int_vec, reason, + has_error_code, error_code); +} + +static int cpuid_interception(struct vcpu_svm *svm) +{ + return kvm_emulate_cpuid(&svm->vcpu); +} + +static int iret_interception(struct vcpu_svm *svm) +{ + ++svm->vcpu.stat.nmi_window_exits; + clr_intercept(svm, INTERCEPT_IRET); + svm->vcpu.arch.hflags |= HF_IRET_MASK; + svm->nmi_iret_rip = kvm_rip_read(&svm->vcpu); + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + return 1; +} + +static int invlpg_interception(struct vcpu_svm *svm) +{ + if (!static_cpu_has(X86_FEATURE_DECODEASSISTS)) + return kvm_emulate_instruction(&svm->vcpu, 0); + + kvm_mmu_invlpg(&svm->vcpu, svm->vmcb->control.exit_info_1); + return kvm_skip_emulated_instruction(&svm->vcpu); +} + +static int emulate_on_interception(struct vcpu_svm *svm) +{ + return kvm_emulate_instruction(&svm->vcpu, 0); +} + +static int rsm_interception(struct vcpu_svm *svm) +{ + return kvm_emulate_instruction_from_buffer(&svm->vcpu, rsm_ins_bytes, 2); +} + +static int rdpmc_interception(struct vcpu_svm *svm) +{ + int err; + + if (!nrips) + return emulate_on_interception(svm); + + err = kvm_rdpmc(&svm->vcpu); + return kvm_complete_insn_gp(&svm->vcpu, err); +} + +static bool check_selective_cr0_intercepted(struct vcpu_svm *svm, + unsigned long val) +{ + unsigned long cr0 = svm->vcpu.arch.cr0; + bool ret = false; + u64 intercept; + + intercept = svm->nested.intercept; + + if (!is_guest_mode(&svm->vcpu) || + (!(intercept & (1ULL << INTERCEPT_SELECTIVE_CR0)))) + return false; + + cr0 &= ~SVM_CR0_SELECTIVE_MASK; + val &= ~SVM_CR0_SELECTIVE_MASK; + + if (cr0 ^ val) { + svm->vmcb->control.exit_code = SVM_EXIT_CR0_SEL_WRITE; + ret = (nested_svm_exit_handled(svm) == NESTED_EXIT_DONE); + } + + return ret; +} + +#define CR_VALID (1ULL << 63) + +static int cr_interception(struct vcpu_svm *svm) +{ + int reg, cr; + unsigned long val; + int err; + + if (!static_cpu_has(X86_FEATURE_DECODEASSISTS)) + return emulate_on_interception(svm); + + if (unlikely((svm->vmcb->control.exit_info_1 & CR_VALID) == 0)) + return emulate_on_interception(svm); + + reg = svm->vmcb->control.exit_info_1 & SVM_EXITINFO_REG_MASK; + if (svm->vmcb->control.exit_code == SVM_EXIT_CR0_SEL_WRITE) + cr = SVM_EXIT_WRITE_CR0 - SVM_EXIT_READ_CR0; + else + cr = svm->vmcb->control.exit_code - SVM_EXIT_READ_CR0; + + err = 0; + if (cr >= 16) { /* mov to cr */ + cr -= 16; + val = kvm_register_read(&svm->vcpu, reg); + switch (cr) { + case 0: + if (!check_selective_cr0_intercepted(svm, val)) + err = kvm_set_cr0(&svm->vcpu, val); + else + return 1; + + break; + case 3: + err = kvm_set_cr3(&svm->vcpu, val); + break; + case 4: + err = kvm_set_cr4(&svm->vcpu, val); + break; + case 8: + err = kvm_set_cr8(&svm->vcpu, val); + break; + default: + WARN(1, "unhandled write to CR%d", cr); + kvm_queue_exception(&svm->vcpu, UD_VECTOR); + return 1; + } + } else { /* mov from cr */ + switch (cr) { + case 0: + val = kvm_read_cr0(&svm->vcpu); + break; + case 2: + val = svm->vcpu.arch.cr2; + break; + case 3: + val = kvm_read_cr3(&svm->vcpu); + break; + case 4: + val = kvm_read_cr4(&svm->vcpu); + break; + case 8: + val = kvm_get_cr8(&svm->vcpu); + break; + default: + WARN(1, "unhandled read from CR%d", cr); + kvm_queue_exception(&svm->vcpu, UD_VECTOR); + return 1; + } + kvm_register_write(&svm->vcpu, reg, val); + } + return kvm_complete_insn_gp(&svm->vcpu, err); +} + +static int dr_interception(struct vcpu_svm *svm) +{ + int reg, dr; + unsigned long val; + + if (svm->vcpu.guest_debug == 0) { + /* + * No more DR vmexits; force a reload of the debug registers + * and reenter on this instruction. The next vmexit will + * retrieve the full state of the debug registers. + */ + clr_dr_intercepts(svm); + svm->vcpu.arch.switch_db_regs |= KVM_DEBUGREG_WONT_EXIT; + return 1; + } + + if (!boot_cpu_has(X86_FEATURE_DECODEASSISTS)) + return emulate_on_interception(svm); + + reg = svm->vmcb->control.exit_info_1 & SVM_EXITINFO_REG_MASK; + dr = svm->vmcb->control.exit_code - SVM_EXIT_READ_DR0; + + if (dr >= 16) { /* mov to DRn */ + if (!kvm_require_dr(&svm->vcpu, dr - 16)) + return 1; + val = kvm_register_read(&svm->vcpu, reg); + kvm_set_dr(&svm->vcpu, dr - 16, val); + } else { + if (!kvm_require_dr(&svm->vcpu, dr)) + return 1; + kvm_get_dr(&svm->vcpu, dr, &val); + kvm_register_write(&svm->vcpu, reg, val); + } + + return kvm_skip_emulated_instruction(&svm->vcpu); +} + +static int cr8_write_interception(struct vcpu_svm *svm) +{ + struct kvm_run *kvm_run = svm->vcpu.run; + int r; + + u8 cr8_prev = kvm_get_cr8(&svm->vcpu); + /* instruction emulation calls kvm_set_cr8() */ + r = cr_interception(svm); + if (lapic_in_kernel(&svm->vcpu)) + return r; + if (cr8_prev <= kvm_get_cr8(&svm->vcpu)) + return r; + kvm_run->exit_reason = KVM_EXIT_SET_TPR; + return 0; +} + +static int svm_get_msr_feature(struct kvm_msr_entry *msr) +{ + msr->data = 0; + + switch (msr->index) { + case MSR_F10H_DECFG: + if (boot_cpu_has(X86_FEATURE_LFENCE_RDTSC)) + msr->data |= MSR_F10H_DECFG_LFENCE_SERIALIZE; + break; + default: + return 1; + } + + return 0; +} + +static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + switch (msr_info->index) { + case MSR_STAR: + msr_info->data = svm->vmcb->save.star; + break; +#ifdef CONFIG_X86_64 + case MSR_LSTAR: + msr_info->data = svm->vmcb->save.lstar; + break; + case MSR_CSTAR: + msr_info->data = svm->vmcb->save.cstar; + break; + case MSR_KERNEL_GS_BASE: + msr_info->data = svm->vmcb->save.kernel_gs_base; + break; + case MSR_SYSCALL_MASK: + msr_info->data = svm->vmcb->save.sfmask; + break; +#endif + case MSR_IA32_SYSENTER_CS: + msr_info->data = svm->vmcb->save.sysenter_cs; + break; + case MSR_IA32_SYSENTER_EIP: + msr_info->data = svm->sysenter_eip; + break; + case MSR_IA32_SYSENTER_ESP: + msr_info->data = svm->sysenter_esp; + break; + case MSR_TSC_AUX: + if (!boot_cpu_has(X86_FEATURE_RDTSCP)) + return 1; + msr_info->data = svm->tsc_aux; + break; + /* + * Nobody will change the following 5 values in the VMCB so we can + * safely return them on rdmsr. They will always be 0 until LBRV is + * implemented. + */ + case MSR_IA32_DEBUGCTLMSR: + msr_info->data = svm->vmcb->save.dbgctl; + break; + case MSR_IA32_LASTBRANCHFROMIP: + msr_info->data = svm->vmcb->save.br_from; + break; + case MSR_IA32_LASTBRANCHTOIP: + msr_info->data = svm->vmcb->save.br_to; + break; + case MSR_IA32_LASTINTFROMIP: + msr_info->data = svm->vmcb->save.last_excp_from; + break; + case MSR_IA32_LASTINTTOIP: + msr_info->data = svm->vmcb->save.last_excp_to; + break; + case MSR_VM_HSAVE_PA: + msr_info->data = svm->nested.hsave_msr; + break; + case MSR_VM_CR: + msr_info->data = svm->nested.vm_cr_msr; + break; + case MSR_IA32_SPEC_CTRL: + if (!msr_info->host_initiated && + !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) && + !guest_cpuid_has(vcpu, X86_FEATURE_AMD_STIBP) && + !guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBRS) && + !guest_cpuid_has(vcpu, X86_FEATURE_AMD_SSBD)) + return 1; + + msr_info->data = svm->spec_ctrl; + break; + case MSR_AMD64_VIRT_SPEC_CTRL: + if (!msr_info->host_initiated && + !guest_cpuid_has(vcpu, X86_FEATURE_VIRT_SSBD)) + return 1; + + msr_info->data = svm->virt_spec_ctrl; + break; + case MSR_F15H_IC_CFG: { + + int family, model; + + family = guest_cpuid_family(vcpu); + model = guest_cpuid_model(vcpu); + + if (family < 0 || model < 0) + return kvm_get_msr_common(vcpu, msr_info); + + msr_info->data = 0; + + if (family == 0x15 && + (model >= 0x2 && model < 0x20)) + msr_info->data = 0x1E; + } + break; + case MSR_F10H_DECFG: + msr_info->data = svm->msr_decfg; + break; + default: + return kvm_get_msr_common(vcpu, msr_info); + } + return 0; +} + +static int rdmsr_interception(struct vcpu_svm *svm) +{ + return kvm_emulate_rdmsr(&svm->vcpu); +} + +static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 data) +{ + struct vcpu_svm *svm = to_svm(vcpu); + int svm_dis, chg_mask; + + if (data & ~SVM_VM_CR_VALID_MASK) + return 1; + + chg_mask = SVM_VM_CR_VALID_MASK; + + if (svm->nested.vm_cr_msr & SVM_VM_CR_SVM_DIS_MASK) + chg_mask &= ~(SVM_VM_CR_SVM_LOCK_MASK | SVM_VM_CR_SVM_DIS_MASK); + + svm->nested.vm_cr_msr &= ~chg_mask; + svm->nested.vm_cr_msr |= (data & chg_mask); + + svm_dis = svm->nested.vm_cr_msr & SVM_VM_CR_SVM_DIS_MASK; + + /* check for svm_disable while efer.svme is set */ + if (svm_dis && (vcpu->arch.efer & EFER_SVME)) + return 1; + + return 0; +} + +static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + u32 ecx = msr->index; + u64 data = msr->data; + switch (ecx) { + case MSR_IA32_CR_PAT: + if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data)) + return 1; + vcpu->arch.pat = data; + svm->vmcb->save.g_pat = data; + mark_dirty(svm->vmcb, VMCB_NPT); + break; + case MSR_IA32_SPEC_CTRL: + if (!msr->host_initiated && + !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) && + !guest_cpuid_has(vcpu, X86_FEATURE_AMD_STIBP) && + !guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBRS) && + !guest_cpuid_has(vcpu, X86_FEATURE_AMD_SSBD)) + return 1; + + if (data & ~kvm_spec_ctrl_valid_bits(vcpu)) + return 1; + + svm->spec_ctrl = data; + if (!data) + break; + + /* + * For non-nested: + * When it's written (to non-zero) for the first time, pass + * it through. + * + * For nested: + * The handling of the MSR bitmap for L2 guests is done in + * nested_svm_vmrun_msrpm. + * We update the L1 MSR bit as well since it will end up + * touching the MSR anyway now. + */ + set_msr_interception(svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1); + break; + case MSR_IA32_PRED_CMD: + if (!msr->host_initiated && + !guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBPB)) + return 1; + + if (data & ~PRED_CMD_IBPB) + return 1; + if (!boot_cpu_has(X86_FEATURE_AMD_IBPB)) + return 1; + if (!data) + break; + + wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB); + set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1); + break; + case MSR_AMD64_VIRT_SPEC_CTRL: + if (!msr->host_initiated && + !guest_cpuid_has(vcpu, X86_FEATURE_VIRT_SSBD)) + return 1; + + if (data & ~SPEC_CTRL_SSBD) + return 1; + + svm->virt_spec_ctrl = data; + break; + case MSR_STAR: + svm->vmcb->save.star = data; + break; +#ifdef CONFIG_X86_64 + case MSR_LSTAR: + svm->vmcb->save.lstar = data; + break; + case MSR_CSTAR: + svm->vmcb->save.cstar = data; + break; + case MSR_KERNEL_GS_BASE: + svm->vmcb->save.kernel_gs_base = data; + break; + case MSR_SYSCALL_MASK: + svm->vmcb->save.sfmask = data; + break; +#endif + case MSR_IA32_SYSENTER_CS: + svm->vmcb->save.sysenter_cs = data; + break; + case MSR_IA32_SYSENTER_EIP: + svm->sysenter_eip = data; + svm->vmcb->save.sysenter_eip = data; + break; + case MSR_IA32_SYSENTER_ESP: + svm->sysenter_esp = data; + svm->vmcb->save.sysenter_esp = data; + break; + case MSR_TSC_AUX: + if (!boot_cpu_has(X86_FEATURE_RDTSCP)) + return 1; + + /* + * This is rare, so we update the MSR here instead of using + * direct_access_msrs. Doing that would require a rdmsr in + * svm_vcpu_put. + */ + svm->tsc_aux = data; + wrmsrl(MSR_TSC_AUX, svm->tsc_aux); + break; + case MSR_IA32_DEBUGCTLMSR: + if (!boot_cpu_has(X86_FEATURE_LBRV)) { + vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTL 0x%llx, nop\n", + __func__, data); + break; + } + if (data & DEBUGCTL_RESERVED_BITS) + return 1; + + svm->vmcb->save.dbgctl = data; + mark_dirty(svm->vmcb, VMCB_LBR); + if (data & (1ULL<<0)) + svm_enable_lbrv(svm); + else + svm_disable_lbrv(svm); + break; + case MSR_VM_HSAVE_PA: + svm->nested.hsave_msr = data; + break; + case MSR_VM_CR: + return svm_set_vm_cr(vcpu, data); + case MSR_VM_IGNNE: + vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", ecx, data); + break; + case MSR_F10H_DECFG: { + struct kvm_msr_entry msr_entry; + + msr_entry.index = msr->index; + if (svm_get_msr_feature(&msr_entry)) + return 1; + + /* Check the supported bits */ + if (data & ~msr_entry.data) + return 1; + + /* Don't allow the guest to change a bit, #GP */ + if (!msr->host_initiated && (data ^ msr_entry.data)) + return 1; + + svm->msr_decfg = data; + break; + } + case MSR_IA32_APICBASE: + if (kvm_vcpu_apicv_active(vcpu)) + avic_update_vapic_bar(to_svm(vcpu), data); + /* Fall through */ + default: + return kvm_set_msr_common(vcpu, msr); + } + return 0; +} + +static int wrmsr_interception(struct vcpu_svm *svm) +{ + return kvm_emulate_wrmsr(&svm->vcpu); +} + +static int msr_interception(struct vcpu_svm *svm) +{ + if (svm->vmcb->control.exit_info_1) + return wrmsr_interception(svm); + else + return rdmsr_interception(svm); +} + +static int interrupt_window_interception(struct vcpu_svm *svm) +{ + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + svm_clear_vintr(svm); + + /* + * For AVIC, the only reason to end up here is ExtINTs. + * In this case AVIC was temporarily disabled for + * requesting the IRQ window and we have to re-enable it. + */ + svm_toggle_avic_for_irq_window(&svm->vcpu, true); + + svm->vmcb->control.int_ctl &= ~V_IRQ_MASK; + mark_dirty(svm->vmcb, VMCB_INTR); + ++svm->vcpu.stat.irq_window_exits; + return 1; +} + +static int pause_interception(struct vcpu_svm *svm) +{ + struct kvm_vcpu *vcpu = &svm->vcpu; + bool in_kernel = (svm_get_cpl(vcpu) == 0); + + if (pause_filter_thresh) + grow_ple_window(vcpu); + + kvm_vcpu_on_spin(vcpu, in_kernel); + return 1; +} + +static int nop_interception(struct vcpu_svm *svm) +{ + return kvm_skip_emulated_instruction(&(svm->vcpu)); +} + +static int monitor_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); + return nop_interception(svm); +} + +static int mwait_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); + return nop_interception(svm); +} + +enum avic_ipi_failure_cause { + AVIC_IPI_FAILURE_INVALID_INT_TYPE, + AVIC_IPI_FAILURE_TARGET_NOT_RUNNING, + AVIC_IPI_FAILURE_INVALID_TARGET, + AVIC_IPI_FAILURE_INVALID_BACKING_PAGE, +}; + +static int avic_incomplete_ipi_interception(struct vcpu_svm *svm) +{ + u32 icrh = svm->vmcb->control.exit_info_1 >> 32; + u32 icrl = svm->vmcb->control.exit_info_1; + u32 id = svm->vmcb->control.exit_info_2 >> 32; + u32 index = svm->vmcb->control.exit_info_2 & 0xFF; + struct kvm_lapic *apic = svm->vcpu.arch.apic; + + trace_kvm_avic_incomplete_ipi(svm->vcpu.vcpu_id, icrh, icrl, id, index); + + switch (id) { + case AVIC_IPI_FAILURE_INVALID_INT_TYPE: + /* + * AVIC hardware handles the generation of + * IPIs when the specified Message Type is Fixed + * (also known as fixed delivery mode) and + * the Trigger Mode is edge-triggered. The hardware + * also supports self and broadcast delivery modes + * specified via the Destination Shorthand(DSH) + * field of the ICRL. Logical and physical APIC ID + * formats are supported. All other IPI types cause + * a #VMEXIT, which needs to emulated. + */ + kvm_lapic_reg_write(apic, APIC_ICR2, icrh); + kvm_lapic_reg_write(apic, APIC_ICR, icrl); + break; + case AVIC_IPI_FAILURE_TARGET_NOT_RUNNING: { + int i; + struct kvm_vcpu *vcpu; + struct kvm *kvm = svm->vcpu.kvm; + struct kvm_lapic *apic = svm->vcpu.arch.apic; + + /* + * At this point, we expect that the AVIC HW has already + * set the appropriate IRR bits on the valid target + * vcpus. So, we just need to kick the appropriate vcpu. + */ + kvm_for_each_vcpu(i, vcpu, kvm) { + bool m = kvm_apic_match_dest(vcpu, apic, + icrl & APIC_SHORT_MASK, + GET_APIC_DEST_FIELD(icrh), + icrl & APIC_DEST_MASK); + + if (m && !avic_vcpu_is_running(vcpu)) + kvm_vcpu_wake_up(vcpu); + } + break; + } + case AVIC_IPI_FAILURE_INVALID_TARGET: + WARN_ONCE(1, "Invalid IPI target: index=%u, vcpu=%d, icr=%#0x:%#0x\n", + index, svm->vcpu.vcpu_id, icrh, icrl); + break; + case AVIC_IPI_FAILURE_INVALID_BACKING_PAGE: + WARN_ONCE(1, "Invalid backing page\n"); + break; + default: + pr_err("Unknown IPI interception\n"); + } + + return 1; +} + +static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool flat) +{ + struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm); + int index; + u32 *logical_apic_id_table; + int dlid = GET_APIC_LOGICAL_ID(ldr); + + if (!dlid) + return NULL; + + if (flat) { /* flat */ + index = ffs(dlid) - 1; + if (index > 7) + return NULL; + } else { /* cluster */ + int cluster = (dlid & 0xf0) >> 4; + int apic = ffs(dlid & 0x0f) - 1; + + if ((apic < 0) || (apic > 7) || + (cluster >= 0xf)) + return NULL; + index = (cluster << 2) + apic; + } + + logical_apic_id_table = (u32 *) page_address(kvm_svm->avic_logical_id_table_page); + + return &logical_apic_id_table[index]; +} + +static int avic_ldr_write(struct kvm_vcpu *vcpu, u8 g_physical_id, u32 ldr) +{ + bool flat; + u32 *entry, new_entry; + + flat = kvm_lapic_get_reg(vcpu->arch.apic, APIC_DFR) == APIC_DFR_FLAT; + entry = avic_get_logical_id_entry(vcpu, ldr, flat); + if (!entry) + return -EINVAL; + + new_entry = READ_ONCE(*entry); + new_entry &= ~AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK; + new_entry |= (g_physical_id & AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK); + new_entry |= AVIC_LOGICAL_ID_ENTRY_VALID_MASK; + WRITE_ONCE(*entry, new_entry); + + return 0; +} + +static void avic_invalidate_logical_id_entry(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + bool flat = svm->dfr_reg == APIC_DFR_FLAT; + u32 *entry = avic_get_logical_id_entry(vcpu, svm->ldr_reg, flat); + + if (entry) + clear_bit(AVIC_LOGICAL_ID_ENTRY_VALID_BIT, (unsigned long *)entry); +} + +static int avic_handle_ldr_update(struct kvm_vcpu *vcpu) +{ + int ret = 0; + struct vcpu_svm *svm = to_svm(vcpu); + u32 ldr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_LDR); + u32 id = kvm_xapic_id(vcpu->arch.apic); + + if (ldr == svm->ldr_reg) + return 0; + + avic_invalidate_logical_id_entry(vcpu); + + if (ldr) + ret = avic_ldr_write(vcpu, id, ldr); + + if (!ret) + svm->ldr_reg = ldr; + + return ret; +} + +static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu) +{ + u64 *old, *new; + struct vcpu_svm *svm = to_svm(vcpu); + u32 id = kvm_xapic_id(vcpu->arch.apic); + + if (vcpu->vcpu_id == id) + return 0; + + old = avic_get_physical_id_entry(vcpu, vcpu->vcpu_id); + new = avic_get_physical_id_entry(vcpu, id); + if (!new || !old) + return 1; + + /* We need to move physical_id_entry to new offset */ + *new = *old; + *old = 0ULL; + to_svm(vcpu)->avic_physical_id_cache = new; + + /* + * Also update the guest physical APIC ID in the logical + * APIC ID table entry if already setup the LDR. + */ + if (svm->ldr_reg) + avic_handle_ldr_update(vcpu); + + return 0; +} + +static void avic_handle_dfr_update(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + u32 dfr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_DFR); + + if (svm->dfr_reg == dfr) + return; + + avic_invalidate_logical_id_entry(vcpu); + svm->dfr_reg = dfr; +} + +static int avic_unaccel_trap_write(struct vcpu_svm *svm) +{ + struct kvm_lapic *apic = svm->vcpu.arch.apic; + u32 offset = svm->vmcb->control.exit_info_1 & + AVIC_UNACCEL_ACCESS_OFFSET_MASK; + + switch (offset) { + case APIC_ID: + if (avic_handle_apic_id_update(&svm->vcpu)) + return 0; + break; + case APIC_LDR: + if (avic_handle_ldr_update(&svm->vcpu)) + return 0; + break; + case APIC_DFR: + avic_handle_dfr_update(&svm->vcpu); + break; + default: + break; + } + + kvm_lapic_reg_write(apic, offset, kvm_lapic_get_reg(apic, offset)); + + return 1; +} + +static bool is_avic_unaccelerated_access_trap(u32 offset) +{ + bool ret = false; + + switch (offset) { + case APIC_ID: + case APIC_EOI: + case APIC_RRR: + case APIC_LDR: + case APIC_DFR: + case APIC_SPIV: + case APIC_ESR: + case APIC_ICR: + case APIC_LVTT: + case APIC_LVTTHMR: + case APIC_LVTPC: + case APIC_LVT0: + case APIC_LVT1: + case APIC_LVTERR: + case APIC_TMICT: + case APIC_TDCR: + ret = true; + break; + default: + break; + } + return ret; +} + +static int avic_unaccelerated_access_interception(struct vcpu_svm *svm) +{ + int ret = 0; + u32 offset = svm->vmcb->control.exit_info_1 & + AVIC_UNACCEL_ACCESS_OFFSET_MASK; + u32 vector = svm->vmcb->control.exit_info_2 & + AVIC_UNACCEL_ACCESS_VECTOR_MASK; + bool write = (svm->vmcb->control.exit_info_1 >> 32) & + AVIC_UNACCEL_ACCESS_WRITE_MASK; + bool trap = is_avic_unaccelerated_access_trap(offset); + + trace_kvm_avic_unaccelerated_access(svm->vcpu.vcpu_id, offset, + trap, write, vector); + if (trap) { + /* Handling Trap */ + WARN_ONCE(!write, "svm: Handling trap read.\n"); + ret = avic_unaccel_trap_write(svm); + } else { + /* Handling Fault */ + ret = kvm_emulate_instruction(&svm->vcpu, 0); + } + + return ret; +} + +static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { + [SVM_EXIT_READ_CR0] = cr_interception, + [SVM_EXIT_READ_CR3] = cr_interception, + [SVM_EXIT_READ_CR4] = cr_interception, + [SVM_EXIT_READ_CR8] = cr_interception, + [SVM_EXIT_CR0_SEL_WRITE] = cr_interception, + [SVM_EXIT_WRITE_CR0] = cr_interception, + [SVM_EXIT_WRITE_CR3] = cr_interception, + [SVM_EXIT_WRITE_CR4] = cr_interception, + [SVM_EXIT_WRITE_CR8] = cr8_write_interception, + [SVM_EXIT_READ_DR0] = dr_interception, + [SVM_EXIT_READ_DR1] = dr_interception, + [SVM_EXIT_READ_DR2] = dr_interception, + [SVM_EXIT_READ_DR3] = dr_interception, + [SVM_EXIT_READ_DR4] = dr_interception, + [SVM_EXIT_READ_DR5] = dr_interception, + [SVM_EXIT_READ_DR6] = dr_interception, + [SVM_EXIT_READ_DR7] = dr_interception, + [SVM_EXIT_WRITE_DR0] = dr_interception, + [SVM_EXIT_WRITE_DR1] = dr_interception, + [SVM_EXIT_WRITE_DR2] = dr_interception, + [SVM_EXIT_WRITE_DR3] = dr_interception, + [SVM_EXIT_WRITE_DR4] = dr_interception, + [SVM_EXIT_WRITE_DR5] = dr_interception, + [SVM_EXIT_WRITE_DR6] = dr_interception, + [SVM_EXIT_WRITE_DR7] = dr_interception, + [SVM_EXIT_EXCP_BASE + DB_VECTOR] = db_interception, + [SVM_EXIT_EXCP_BASE + BP_VECTOR] = bp_interception, + [SVM_EXIT_EXCP_BASE + UD_VECTOR] = ud_interception, + [SVM_EXIT_EXCP_BASE + PF_VECTOR] = pf_interception, + [SVM_EXIT_EXCP_BASE + MC_VECTOR] = mc_interception, + [SVM_EXIT_EXCP_BASE + AC_VECTOR] = ac_interception, + [SVM_EXIT_EXCP_BASE + GP_VECTOR] = gp_interception, + [SVM_EXIT_INTR] = intr_interception, + [SVM_EXIT_NMI] = nmi_interception, + [SVM_EXIT_SMI] = nop_on_interception, + [SVM_EXIT_INIT] = nop_on_interception, + [SVM_EXIT_VINTR] = interrupt_window_interception, + [SVM_EXIT_RDPMC] = rdpmc_interception, + [SVM_EXIT_CPUID] = cpuid_interception, + [SVM_EXIT_IRET] = iret_interception, + [SVM_EXIT_INVD] = emulate_on_interception, + [SVM_EXIT_PAUSE] = pause_interception, + [SVM_EXIT_HLT] = halt_interception, + [SVM_EXIT_INVLPG] = invlpg_interception, + [SVM_EXIT_INVLPGA] = invlpga_interception, + [SVM_EXIT_IOIO] = io_interception, + [SVM_EXIT_MSR] = msr_interception, + [SVM_EXIT_TASK_SWITCH] = task_switch_interception, + [SVM_EXIT_SHUTDOWN] = shutdown_interception, + [SVM_EXIT_VMRUN] = vmrun_interception, + [SVM_EXIT_VMMCALL] = vmmcall_interception, + [SVM_EXIT_VMLOAD] = vmload_interception, + [SVM_EXIT_VMSAVE] = vmsave_interception, + [SVM_EXIT_STGI] = stgi_interception, + [SVM_EXIT_CLGI] = clgi_interception, + [SVM_EXIT_SKINIT] = skinit_interception, + [SVM_EXIT_WBINVD] = wbinvd_interception, + [SVM_EXIT_MONITOR] = monitor_interception, + [SVM_EXIT_MWAIT] = mwait_interception, + [SVM_EXIT_XSETBV] = xsetbv_interception, + [SVM_EXIT_RDPRU] = rdpru_interception, + [SVM_EXIT_NPF] = npf_interception, + [SVM_EXIT_RSM] = rsm_interception, + [SVM_EXIT_AVIC_INCOMPLETE_IPI] = avic_incomplete_ipi_interception, + [SVM_EXIT_AVIC_UNACCELERATED_ACCESS] = avic_unaccelerated_access_interception, +}; + +static void dump_vmcb(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb_control_area *control = &svm->vmcb->control; + struct vmcb_save_area *save = &svm->vmcb->save; + + if (!dump_invalid_vmcb) { + pr_warn_ratelimited("set kvm_amd.dump_invalid_vmcb=1 to dump internal KVM state.\n"); + return; + } + + pr_err("VMCB Control Area:\n"); + pr_err("%-20s%04x\n", "cr_read:", control->intercept_cr & 0xffff); + pr_err("%-20s%04x\n", "cr_write:", control->intercept_cr >> 16); + pr_err("%-20s%04x\n", "dr_read:", control->intercept_dr & 0xffff); + pr_err("%-20s%04x\n", "dr_write:", control->intercept_dr >> 16); + pr_err("%-20s%08x\n", "exceptions:", control->intercept_exceptions); + pr_err("%-20s%016llx\n", "intercepts:", control->intercept); + pr_err("%-20s%d\n", "pause filter count:", control->pause_filter_count); + pr_err("%-20s%d\n", "pause filter threshold:", + control->pause_filter_thresh); + pr_err("%-20s%016llx\n", "iopm_base_pa:", control->iopm_base_pa); + pr_err("%-20s%016llx\n", "msrpm_base_pa:", control->msrpm_base_pa); + pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset); + pr_err("%-20s%d\n", "asid:", control->asid); + pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl); + pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl); + pr_err("%-20s%08x\n", "int_vector:", control->int_vector); + pr_err("%-20s%08x\n", "int_state:", control->int_state); + pr_err("%-20s%08x\n", "exit_code:", control->exit_code); + pr_err("%-20s%016llx\n", "exit_info1:", control->exit_info_1); + pr_err("%-20s%016llx\n", "exit_info2:", control->exit_info_2); + pr_err("%-20s%08x\n", "exit_int_info:", control->exit_int_info); + pr_err("%-20s%08x\n", "exit_int_info_err:", control->exit_int_info_err); + pr_err("%-20s%lld\n", "nested_ctl:", control->nested_ctl); + pr_err("%-20s%016llx\n", "nested_cr3:", control->nested_cr3); + pr_err("%-20s%016llx\n", "avic_vapic_bar:", control->avic_vapic_bar); + pr_err("%-20s%08x\n", "event_inj:", control->event_inj); + pr_err("%-20s%08x\n", "event_inj_err:", control->event_inj_err); + pr_err("%-20s%lld\n", "virt_ext:", control->virt_ext); + pr_err("%-20s%016llx\n", "next_rip:", control->next_rip); + pr_err("%-20s%016llx\n", "avic_backing_page:", control->avic_backing_page); + pr_err("%-20s%016llx\n", "avic_logical_id:", control->avic_logical_id); + pr_err("%-20s%016llx\n", "avic_physical_id:", control->avic_physical_id); + pr_err("VMCB State Save Area:\n"); + pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", + "es:", + save->es.selector, save->es.attrib, + save->es.limit, save->es.base); + pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", + "cs:", + save->cs.selector, save->cs.attrib, + save->cs.limit, save->cs.base); + pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", + "ss:", + save->ss.selector, save->ss.attrib, + save->ss.limit, save->ss.base); + pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", + "ds:", + save->ds.selector, save->ds.attrib, + save->ds.limit, save->ds.base); + pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", + "fs:", + save->fs.selector, save->fs.attrib, + save->fs.limit, save->fs.base); + pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", + "gs:", + save->gs.selector, save->gs.attrib, + save->gs.limit, save->gs.base); + pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", + "gdtr:", + save->gdtr.selector, save->gdtr.attrib, + save->gdtr.limit, save->gdtr.base); + pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", + "ldtr:", + save->ldtr.selector, save->ldtr.attrib, + save->ldtr.limit, save->ldtr.base); + pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", + "idtr:", + save->idtr.selector, save->idtr.attrib, + save->idtr.limit, save->idtr.base); + pr_err("%-5s s: %04x a: %04x l: %08x b: %016llx\n", + "tr:", + save->tr.selector, save->tr.attrib, + save->tr.limit, save->tr.base); + pr_err("cpl: %d efer: %016llx\n", + save->cpl, save->efer); + pr_err("%-15s %016llx %-13s %016llx\n", + "cr0:", save->cr0, "cr2:", save->cr2); + pr_err("%-15s %016llx %-13s %016llx\n", + "cr3:", save->cr3, "cr4:", save->cr4); + pr_err("%-15s %016llx %-13s %016llx\n", + "dr6:", save->dr6, "dr7:", save->dr7); + pr_err("%-15s %016llx %-13s %016llx\n", + "rip:", save->rip, "rflags:", save->rflags); + pr_err("%-15s %016llx %-13s %016llx\n", + "rsp:", save->rsp, "rax:", save->rax); + pr_err("%-15s %016llx %-13s %016llx\n", + "star:", save->star, "lstar:", save->lstar); + pr_err("%-15s %016llx %-13s %016llx\n", + "cstar:", save->cstar, "sfmask:", save->sfmask); + pr_err("%-15s %016llx %-13s %016llx\n", + "kernel_gs_base:", save->kernel_gs_base, + "sysenter_cs:", save->sysenter_cs); + pr_err("%-15s %016llx %-13s %016llx\n", + "sysenter_esp:", save->sysenter_esp, + "sysenter_eip:", save->sysenter_eip); + pr_err("%-15s %016llx %-13s %016llx\n", + "gpat:", save->g_pat, "dbgctl:", save->dbgctl); + pr_err("%-15s %016llx %-13s %016llx\n", + "br_from:", save->br_from, "br_to:", save->br_to); + pr_err("%-15s %016llx %-13s %016llx\n", + "excp_from:", save->last_excp_from, + "excp_to:", save->last_excp_to); +} + +static void svm_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2) +{ + struct vmcb_control_area *control = &to_svm(vcpu)->vmcb->control; + + *info1 = control->exit_info_1; + *info2 = control->exit_info_2; +} + +static int handle_exit(struct kvm_vcpu *vcpu, + enum exit_fastpath_completion exit_fastpath) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct kvm_run *kvm_run = vcpu->run; + u32 exit_code = svm->vmcb->control.exit_code; + + trace_kvm_exit(exit_code, vcpu, KVM_ISA_SVM); + + if (!is_cr_intercept(svm, INTERCEPT_CR0_WRITE)) + vcpu->arch.cr0 = svm->vmcb->save.cr0; + if (npt_enabled) + vcpu->arch.cr3 = svm->vmcb->save.cr3; + + if (unlikely(svm->nested.exit_required)) { + nested_svm_vmexit(svm); + svm->nested.exit_required = false; + + return 1; + } + + if (is_guest_mode(vcpu)) { + int vmexit; + + trace_kvm_nested_vmexit(svm->vmcb->save.rip, exit_code, + svm->vmcb->control.exit_info_1, + svm->vmcb->control.exit_info_2, + svm->vmcb->control.exit_int_info, + svm->vmcb->control.exit_int_info_err, + KVM_ISA_SVM); + + vmexit = nested_svm_exit_special(svm); + + if (vmexit == NESTED_EXIT_CONTINUE) + vmexit = nested_svm_exit_handled(svm); + + if (vmexit == NESTED_EXIT_DONE) + return 1; + } + + svm_complete_interrupts(svm); + + if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) { + kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY; + kvm_run->fail_entry.hardware_entry_failure_reason + = svm->vmcb->control.exit_code; + dump_vmcb(vcpu); + return 0; + } + + if (is_external_interrupt(svm->vmcb->control.exit_int_info) && + exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR && + exit_code != SVM_EXIT_NPF && exit_code != SVM_EXIT_TASK_SWITCH && + exit_code != SVM_EXIT_INTR && exit_code != SVM_EXIT_NMI) + printk(KERN_ERR "%s: unexpected exit_int_info 0x%x " + "exit_code 0x%x\n", + __func__, svm->vmcb->control.exit_int_info, + exit_code); + + if (exit_fastpath == EXIT_FASTPATH_SKIP_EMUL_INS) { + kvm_skip_emulated_instruction(vcpu); + return 1; + } else if (exit_code >= ARRAY_SIZE(svm_exit_handlers) + || !svm_exit_handlers[exit_code]) { + vcpu_unimpl(vcpu, "svm: unexpected exit reason 0x%x\n", exit_code); + dump_vmcb(vcpu); + vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; + vcpu->run->internal.suberror = + KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON; + vcpu->run->internal.ndata = 1; + vcpu->run->internal.data[0] = exit_code; + return 0; + } + +#ifdef CONFIG_RETPOLINE + if (exit_code == SVM_EXIT_MSR) + return msr_interception(svm); + else if (exit_code == SVM_EXIT_VINTR) + return interrupt_window_interception(svm); + else if (exit_code == SVM_EXIT_INTR) + return intr_interception(svm); + else if (exit_code == SVM_EXIT_HLT) + return halt_interception(svm); + else if (exit_code == SVM_EXIT_NPF) + return npf_interception(svm); +#endif + return svm_exit_handlers[exit_code](svm); +} + +static void reload_tss(struct kvm_vcpu *vcpu) +{ + int cpu = raw_smp_processor_id(); + + struct svm_cpu_data *sd = per_cpu(svm_data, cpu); + sd->tss_desc->type = 9; /* available 32/64-bit TSS */ + load_TR_desc(); +} + +static void pre_sev_run(struct vcpu_svm *svm, int cpu) +{ + struct svm_cpu_data *sd = per_cpu(svm_data, cpu); + int asid = sev_get_asid(svm->vcpu.kvm); + + /* Assign the asid allocated with this SEV guest */ + svm->vmcb->control.asid = asid; + + /* + * Flush guest TLB: + * + * 1) when different VMCB for the same ASID is to be run on the same host CPU. + * 2) or this VMCB was executed on different host CPU in previous VMRUNs. + */ + if (sd->sev_vmcbs[asid] == svm->vmcb && + svm->last_cpu == cpu) + return; + + svm->last_cpu = cpu; + sd->sev_vmcbs[asid] = svm->vmcb; + svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID; + mark_dirty(svm->vmcb, VMCB_ASID); +} + +static void pre_svm_run(struct vcpu_svm *svm) +{ + int cpu = raw_smp_processor_id(); + + struct svm_cpu_data *sd = per_cpu(svm_data, cpu); + + if (sev_guest(svm->vcpu.kvm)) + return pre_sev_run(svm, cpu); + + /* FIXME: handle wraparound of asid_generation */ + if (svm->asid_generation != sd->asid_generation) + new_asid(svm, sd); +} + +static void svm_inject_nmi(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI; + vcpu->arch.hflags |= HF_NMI_MASK; + set_intercept(svm, INTERCEPT_IRET); + ++vcpu->stat.nmi_injections; +} + +static void svm_set_irq(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + BUG_ON(!(gif_set(svm))); + + trace_kvm_inj_virq(vcpu->arch.interrupt.nr); + ++vcpu->stat.irq_injections; + + svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr | + SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR; +} + +static inline bool svm_nested_virtualize_tpr(struct kvm_vcpu *vcpu) +{ + return is_guest_mode(vcpu) && (vcpu->arch.hflags & HF_VINTR_MASK); +} + +static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (svm_nested_virtualize_tpr(vcpu)) + return; + + clr_cr_intercept(svm, INTERCEPT_CR8_WRITE); + + if (irr == -1) + return; + + if (tpr >= irr) + set_cr_intercept(svm, INTERCEPT_CR8_WRITE); +} + +static void svm_set_virtual_apic_mode(struct kvm_vcpu *vcpu) +{ + return; +} + +static void svm_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr) +{ +} + +static void svm_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr) +{ +} + +static void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate) +{ + if (!avic || !lapic_in_kernel(vcpu)) + return; + + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx); + kvm_request_apicv_update(vcpu->kvm, activate, + APICV_INHIBIT_REASON_IRQWIN); + vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu); +} + +static int svm_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate) +{ + int ret = 0; + unsigned long flags; + struct amd_svm_iommu_ir *ir; + struct vcpu_svm *svm = to_svm(vcpu); + + if (!kvm_arch_has_assigned_device(vcpu->kvm)) + return 0; + + /* + * Here, we go through the per-vcpu ir_list to update all existing + * interrupt remapping table entry targeting this vcpu. + */ + spin_lock_irqsave(&svm->ir_list_lock, flags); + + if (list_empty(&svm->ir_list)) + goto out; + + list_for_each_entry(ir, &svm->ir_list, node) { + if (activate) + ret = amd_iommu_activate_guest_mode(ir->data); + else + ret = amd_iommu_deactivate_guest_mode(ir->data); + if (ret) + break; + } +out: + spin_unlock_irqrestore(&svm->ir_list_lock, flags); + return ret; +} + +static void svm_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb *vmcb = svm->vmcb; + bool activated = kvm_vcpu_apicv_active(vcpu); + + if (!avic) + return; + + if (activated) { + /** + * During AVIC temporary deactivation, guest could update + * APIC ID, DFR and LDR registers, which would not be trapped + * by avic_unaccelerated_access_interception(). In this case, + * we need to check and update the AVIC logical APIC ID table + * accordingly before re-activating. + */ + avic_post_state_restore(vcpu); + vmcb->control.int_ctl |= AVIC_ENABLE_MASK; + } else { + vmcb->control.int_ctl &= ~AVIC_ENABLE_MASK; + } + mark_dirty(vmcb, VMCB_AVIC); + + svm_set_pi_irte_mode(vcpu, activated); +} + +static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) +{ + return; +} + +static int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec) +{ + if (!vcpu->arch.apicv_active) + return -1; + + kvm_lapic_set_irr(vec, vcpu->arch.apic); + smp_mb__after_atomic(); + + if (avic_vcpu_is_running(vcpu)) { + int cpuid = vcpu->cpu; + + if (cpuid != get_cpu()) + wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpuid)); + put_cpu(); + } else + kvm_vcpu_wake_up(vcpu); + + return 0; +} + +static bool svm_dy_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu) +{ + return false; +} + +static void svm_ir_list_del(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi) +{ + unsigned long flags; + struct amd_svm_iommu_ir *cur; + + spin_lock_irqsave(&svm->ir_list_lock, flags); + list_for_each_entry(cur, &svm->ir_list, node) { + if (cur->data != pi->ir_data) + continue; + list_del(&cur->node); + kfree(cur); + break; + } + spin_unlock_irqrestore(&svm->ir_list_lock, flags); +} + +static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi) +{ + int ret = 0; + unsigned long flags; + struct amd_svm_iommu_ir *ir; + + /** + * In some cases, the existing irte is updaed and re-set, + * so we need to check here if it's already been * added + * to the ir_list. + */ + if (pi->ir_data && (pi->prev_ga_tag != 0)) { + struct kvm *kvm = svm->vcpu.kvm; + u32 vcpu_id = AVIC_GATAG_TO_VCPUID(pi->prev_ga_tag); + struct kvm_vcpu *prev_vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id); + struct vcpu_svm *prev_svm; + + if (!prev_vcpu) { + ret = -EINVAL; + goto out; + } + + prev_svm = to_svm(prev_vcpu); + svm_ir_list_del(prev_svm, pi); + } + + /** + * Allocating new amd_iommu_pi_data, which will get + * add to the per-vcpu ir_list. + */ + ir = kzalloc(sizeof(struct amd_svm_iommu_ir), GFP_KERNEL_ACCOUNT); + if (!ir) { + ret = -ENOMEM; + goto out; + } + ir->data = pi->ir_data; + + spin_lock_irqsave(&svm->ir_list_lock, flags); + list_add(&ir->node, &svm->ir_list); + spin_unlock_irqrestore(&svm->ir_list_lock, flags); +out: + return ret; +} + +/** + * Note: + * The HW cannot support posting multicast/broadcast + * interrupts to a vCPU. So, we still use legacy interrupt + * remapping for these kind of interrupts. + * + * For lowest-priority interrupts, we only support + * those with single CPU as the destination, e.g. user + * configures the interrupts via /proc/irq or uses + * irqbalance to make the interrupts single-CPU. + */ +static int +get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e, + struct vcpu_data *vcpu_info, struct vcpu_svm **svm) +{ + struct kvm_lapic_irq irq; + struct kvm_vcpu *vcpu = NULL; + + kvm_set_msi_irq(kvm, e, &irq); + + if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) || + !kvm_irq_is_postable(&irq)) { + pr_debug("SVM: %s: use legacy intr remap mode for irq %u\n", + __func__, irq.vector); + return -1; + } + + pr_debug("SVM: %s: use GA mode for irq %u\n", __func__, + irq.vector); + *svm = to_svm(vcpu); + vcpu_info->pi_desc_addr = __sme_set(page_to_phys((*svm)->avic_backing_page)); + vcpu_info->vector = irq.vector; + + return 0; +} + +/* + * svm_update_pi_irte - set IRTE for Posted-Interrupts + * + * @kvm: kvm + * @host_irq: host irq of the interrupt + * @guest_irq: gsi of the interrupt + * @set: set or unset PI + * returns 0 on success, < 0 on failure + */ +static int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq, + uint32_t guest_irq, bool set) +{ + struct kvm_kernel_irq_routing_entry *e; + struct kvm_irq_routing_table *irq_rt; + int idx, ret = -EINVAL; + + if (!kvm_arch_has_assigned_device(kvm) || + !irq_remapping_cap(IRQ_POSTING_CAP)) + return 0; + + pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n", + __func__, host_irq, guest_irq, set); + + idx = srcu_read_lock(&kvm->irq_srcu); + irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu); + WARN_ON(guest_irq >= irq_rt->nr_rt_entries); + + hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) { + struct vcpu_data vcpu_info; + struct vcpu_svm *svm = NULL; + + if (e->type != KVM_IRQ_ROUTING_MSI) + continue; + + /** + * Here, we setup with legacy mode in the following cases: + * 1. When cannot target interrupt to a specific vcpu. + * 2. Unsetting posted interrupt. + * 3. APIC virtialization is disabled for the vcpu. + * 4. IRQ has incompatible delivery mode (SMI, INIT, etc) + */ + if (!get_pi_vcpu_info(kvm, e, &vcpu_info, &svm) && set && + kvm_vcpu_apicv_active(&svm->vcpu)) { + struct amd_iommu_pi_data pi; + + /* Try to enable guest_mode in IRTE */ + pi.base = __sme_set(page_to_phys(svm->avic_backing_page) & + AVIC_HPA_MASK); + pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, + svm->vcpu.vcpu_id); + pi.is_guest_mode = true; + pi.vcpu_data = &vcpu_info; + ret = irq_set_vcpu_affinity(host_irq, &pi); + + /** + * Here, we successfully setting up vcpu affinity in + * IOMMU guest mode. Now, we need to store the posted + * interrupt information in a per-vcpu ir_list so that + * we can reference to them directly when we update vcpu + * scheduling information in IOMMU irte. + */ + if (!ret && pi.is_guest_mode) + svm_ir_list_add(svm, &pi); + } else { + /* Use legacy mode in IRTE */ + struct amd_iommu_pi_data pi; + + /** + * Here, pi is used to: + * - Tell IOMMU to use legacy mode for this interrupt. + * - Retrieve ga_tag of prior interrupt remapping data. + */ + pi.is_guest_mode = false; + ret = irq_set_vcpu_affinity(host_irq, &pi); + + /** + * Check if the posted interrupt was previously + * setup with the guest_mode by checking if the ga_tag + * was cached. If so, we need to clean up the per-vcpu + * ir_list. + */ + if (!ret && pi.prev_ga_tag) { + int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag); + struct kvm_vcpu *vcpu; + + vcpu = kvm_get_vcpu_by_id(kvm, id); + if (vcpu) + svm_ir_list_del(to_svm(vcpu), &pi); + } + } + + if (!ret && svm) { + trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id, + e->gsi, vcpu_info.vector, + vcpu_info.pi_desc_addr, set); + } + + if (ret < 0) { + pr_err("%s: failed to update PI IRTE\n", __func__); + goto out; + } + } + + ret = 0; +out: + srcu_read_unlock(&kvm->irq_srcu, idx); + return ret; +} + +static int svm_nmi_allowed(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb *vmcb = svm->vmcb; + int ret; + ret = !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) && + !(svm->vcpu.arch.hflags & HF_NMI_MASK); + ret = ret && gif_set(svm) && nested_svm_nmi(svm); + + return ret; +} + +static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + return !!(svm->vcpu.arch.hflags & HF_NMI_MASK); +} + +static void svm_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (masked) { + svm->vcpu.arch.hflags |= HF_NMI_MASK; + set_intercept(svm, INTERCEPT_IRET); + } else { + svm->vcpu.arch.hflags &= ~HF_NMI_MASK; + clr_intercept(svm, INTERCEPT_IRET); + } +} + +static int svm_interrupt_allowed(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb *vmcb = svm->vmcb; + + if (!gif_set(svm) || + (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK)) + return 0; + + if (is_guest_mode(vcpu) && (svm->vcpu.arch.hflags & HF_VINTR_MASK)) + return !!(svm->vcpu.arch.hflags & HF_HIF_MASK); + else + return !!(kvm_get_rflags(vcpu) & X86_EFLAGS_IF); +} + +static void enable_irq_window(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + /* + * In case GIF=0 we can't rely on the CPU to tell us when GIF becomes + * 1, because that's a separate STGI/VMRUN intercept. The next time we + * get that intercept, this function will be called again though and + * we'll get the vintr intercept. However, if the vGIF feature is + * enabled, the STGI interception will not occur. Enable the irq + * window under the assumption that the hardware will set the GIF. + */ + if (vgif_enabled(svm) || gif_set(svm)) { + /* + * IRQ window is not needed when AVIC is enabled, + * unless we have pending ExtINT since it cannot be injected + * via AVIC. In such case, we need to temporarily disable AVIC, + * and fallback to injecting IRQ via V_IRQ. + */ + svm_toggle_avic_for_irq_window(vcpu, false); + svm_set_vintr(svm); + } +} + +static void enable_nmi_window(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if ((svm->vcpu.arch.hflags & (HF_NMI_MASK | HF_IRET_MASK)) + == HF_NMI_MASK) + return; /* IRET will cause a vm exit */ + + if (!gif_set(svm)) { + if (vgif_enabled(svm)) + set_intercept(svm, INTERCEPT_STGI); + return; /* STGI will cause a vm exit */ + } + + if (svm->nested.exit_required) + return; /* we're not going to run the guest yet */ + + /* + * Something prevents NMI from been injected. Single step over possible + * problem (IRET or exception injection or interrupt shadow) + */ + svm->nmi_singlestep_guest_rflags = svm_get_rflags(vcpu); + svm->nmi_singlestep = true; + svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF); +} + +static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr) +{ + return 0; +} + +static int svm_set_identity_map_addr(struct kvm *kvm, u64 ident_addr) +{ + return 0; +} + +static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (static_cpu_has(X86_FEATURE_FLUSHBYASID)) + svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID; + else + svm->asid_generation--; +} + +static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + invlpga(gva, svm->vmcb->control.asid); +} + +static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu) +{ +} + +static inline void sync_cr8_to_lapic(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (svm_nested_virtualize_tpr(vcpu)) + return; + + if (!is_cr_intercept(svm, INTERCEPT_CR8_WRITE)) { + int cr8 = svm->vmcb->control.int_ctl & V_TPR_MASK; + kvm_set_cr8(vcpu, cr8); + } +} + +static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + u64 cr8; + + if (svm_nested_virtualize_tpr(vcpu) || + kvm_vcpu_apicv_active(vcpu)) + return; + + cr8 = kvm_get_cr8(vcpu); + svm->vmcb->control.int_ctl &= ~V_TPR_MASK; + svm->vmcb->control.int_ctl |= cr8 & V_TPR_MASK; +} + +static void svm_complete_interrupts(struct vcpu_svm *svm) +{ + u8 vector; + int type; + u32 exitintinfo = svm->vmcb->control.exit_int_info; + unsigned int3_injected = svm->int3_injected; + + svm->int3_injected = 0; + + /* + * If we've made progress since setting HF_IRET_MASK, we've + * executed an IRET and can allow NMI injection. + */ + if ((svm->vcpu.arch.hflags & HF_IRET_MASK) + && kvm_rip_read(&svm->vcpu) != svm->nmi_iret_rip) { + svm->vcpu.arch.hflags &= ~(HF_NMI_MASK | HF_IRET_MASK); + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + } + + svm->vcpu.arch.nmi_injected = false; + kvm_clear_exception_queue(&svm->vcpu); + kvm_clear_interrupt_queue(&svm->vcpu); + + if (!(exitintinfo & SVM_EXITINTINFO_VALID)) + return; + + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + + vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK; + type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK; + + switch (type) { + case SVM_EXITINTINFO_TYPE_NMI: + svm->vcpu.arch.nmi_injected = true; + break; + case SVM_EXITINTINFO_TYPE_EXEPT: + /* + * In case of software exceptions, do not reinject the vector, + * but re-execute the instruction instead. Rewind RIP first + * if we emulated INT3 before. + */ + if (kvm_exception_is_soft(vector)) { + if (vector == BP_VECTOR && int3_injected && + kvm_is_linear_rip(&svm->vcpu, svm->int3_rip)) + kvm_rip_write(&svm->vcpu, + kvm_rip_read(&svm->vcpu) - + int3_injected); + break; + } + if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) { + u32 err = svm->vmcb->control.exit_int_info_err; + kvm_requeue_exception_e(&svm->vcpu, vector, err); + + } else + kvm_requeue_exception(&svm->vcpu, vector); + break; + case SVM_EXITINTINFO_TYPE_INTR: + kvm_queue_interrupt(&svm->vcpu, vector, false); + break; + default: + break; + } +} + +static void svm_cancel_injection(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb_control_area *control = &svm->vmcb->control; + + control->exit_int_info = control->event_inj; + control->exit_int_info_err = control->event_inj_err; + control->event_inj = 0; + svm_complete_interrupts(svm); +} + +static void svm_vcpu_run(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX]; + svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP]; + svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP]; + + /* + * A vmexit emulation is required before the vcpu can be executed + * again. + */ + if (unlikely(svm->nested.exit_required)) + return; + + /* + * Disable singlestep if we're injecting an interrupt/exception. + * We don't want our modified rflags to be pushed on the stack where + * we might not be able to easily reset them if we disabled NMI + * singlestep later. + */ + if (svm->nmi_singlestep && svm->vmcb->control.event_inj) { + /* + * Event injection happens before external interrupts cause a + * vmexit and interrupts are disabled here, so smp_send_reschedule + * is enough to force an immediate vmexit. + */ + disable_nmi_singlestep(svm); + smp_send_reschedule(vcpu->cpu); + } + + pre_svm_run(svm); + + sync_lapic_to_cr8(vcpu); + + svm->vmcb->save.cr2 = vcpu->arch.cr2; + + clgi(); + kvm_load_guest_xsave_state(vcpu); + + if (lapic_in_kernel(vcpu) && + vcpu->arch.apic->lapic_timer.timer_advance_ns) + kvm_wait_lapic_expire(vcpu); + + /* + * If this vCPU has touched SPEC_CTRL, restore the guest's value if + * it's non-zero. Since vmentry is serialising on affected CPUs, there + * is no need to worry about the conditional branch over the wrmsr + * being speculatively taken. + */ + x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl); + + local_irq_enable(); + + asm volatile ( + "push %%" _ASM_BP "; \n\t" + "mov %c[rbx](%[svm]), %%" _ASM_BX " \n\t" + "mov %c[rcx](%[svm]), %%" _ASM_CX " \n\t" + "mov %c[rdx](%[svm]), %%" _ASM_DX " \n\t" + "mov %c[rsi](%[svm]), %%" _ASM_SI " \n\t" + "mov %c[rdi](%[svm]), %%" _ASM_DI " \n\t" + "mov %c[rbp](%[svm]), %%" _ASM_BP " \n\t" +#ifdef CONFIG_X86_64 + "mov %c[r8](%[svm]), %%r8 \n\t" + "mov %c[r9](%[svm]), %%r9 \n\t" + "mov %c[r10](%[svm]), %%r10 \n\t" + "mov %c[r11](%[svm]), %%r11 \n\t" + "mov %c[r12](%[svm]), %%r12 \n\t" + "mov %c[r13](%[svm]), %%r13 \n\t" + "mov %c[r14](%[svm]), %%r14 \n\t" + "mov %c[r15](%[svm]), %%r15 \n\t" +#endif + + /* Enter guest mode */ + "push %%" _ASM_AX " \n\t" + "mov %c[vmcb](%[svm]), %%" _ASM_AX " \n\t" + __ex("vmload %%" _ASM_AX) "\n\t" + __ex("vmrun %%" _ASM_AX) "\n\t" + __ex("vmsave %%" _ASM_AX) "\n\t" + "pop %%" _ASM_AX " \n\t" + + /* Save guest registers, load host registers */ + "mov %%" _ASM_BX ", %c[rbx](%[svm]) \n\t" + "mov %%" _ASM_CX ", %c[rcx](%[svm]) \n\t" + "mov %%" _ASM_DX ", %c[rdx](%[svm]) \n\t" + "mov %%" _ASM_SI ", %c[rsi](%[svm]) \n\t" + "mov %%" _ASM_DI ", %c[rdi](%[svm]) \n\t" + "mov %%" _ASM_BP ", %c[rbp](%[svm]) \n\t" +#ifdef CONFIG_X86_64 + "mov %%r8, %c[r8](%[svm]) \n\t" + "mov %%r9, %c[r9](%[svm]) \n\t" + "mov %%r10, %c[r10](%[svm]) \n\t" + "mov %%r11, %c[r11](%[svm]) \n\t" + "mov %%r12, %c[r12](%[svm]) \n\t" + "mov %%r13, %c[r13](%[svm]) \n\t" + "mov %%r14, %c[r14](%[svm]) \n\t" + "mov %%r15, %c[r15](%[svm]) \n\t" + /* + * Clear host registers marked as clobbered to prevent + * speculative use. + */ + "xor %%r8d, %%r8d \n\t" + "xor %%r9d, %%r9d \n\t" + "xor %%r10d, %%r10d \n\t" + "xor %%r11d, %%r11d \n\t" + "xor %%r12d, %%r12d \n\t" + "xor %%r13d, %%r13d \n\t" + "xor %%r14d, %%r14d \n\t" + "xor %%r15d, %%r15d \n\t" +#endif + "xor %%ebx, %%ebx \n\t" + "xor %%ecx, %%ecx \n\t" + "xor %%edx, %%edx \n\t" + "xor %%esi, %%esi \n\t" + "xor %%edi, %%edi \n\t" + "pop %%" _ASM_BP + : + : [svm]"a"(svm), + [vmcb]"i"(offsetof(struct vcpu_svm, vmcb_pa)), + [rbx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RBX])), + [rcx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RCX])), + [rdx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RDX])), + [rsi]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RSI])), + [rdi]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RDI])), + [rbp]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RBP])) +#ifdef CONFIG_X86_64 + , [r8]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R8])), + [r9]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R9])), + [r10]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R10])), + [r11]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R11])), + [r12]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R12])), + [r13]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R13])), + [r14]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R14])), + [r15]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R15])) +#endif + : "cc", "memory" +#ifdef CONFIG_X86_64 + , "rbx", "rcx", "rdx", "rsi", "rdi" + , "r8", "r9", "r10", "r11" , "r12", "r13", "r14", "r15" +#else + , "ebx", "ecx", "edx", "esi", "edi" +#endif + ); + + /* Eliminate branch target predictions from guest mode */ + vmexit_fill_RSB(); + +#ifdef CONFIG_X86_64 + wrmsrl(MSR_GS_BASE, svm->host.gs_base); +#else + loadsegment(fs, svm->host.fs); +#ifndef CONFIG_X86_32_LAZY_GS + loadsegment(gs, svm->host.gs); +#endif +#endif + + /* + * We do not use IBRS in the kernel. If this vCPU has used the + * SPEC_CTRL MSR it may have left it on; save the value and + * turn it off. This is much more efficient than blindly adding + * it to the atomic save/restore list. Especially as the former + * (Saving guest MSRs on vmexit) doesn't even exist in KVM. + * + * For non-nested case: + * If the L01 MSR bitmap does not intercept the MSR, then we need to + * save it. + * + * For nested case: + * If the L02 MSR bitmap does not intercept the MSR, then we need to + * save it. + */ + if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))) + svm->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL); + + reload_tss(vcpu); + + local_irq_disable(); + + x86_spec_ctrl_restore_host(svm->spec_ctrl, svm->virt_spec_ctrl); + + vcpu->arch.cr2 = svm->vmcb->save.cr2; + vcpu->arch.regs[VCPU_REGS_RAX] = svm->vmcb->save.rax; + vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp; + vcpu->arch.regs[VCPU_REGS_RIP] = svm->vmcb->save.rip; + + if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI)) + kvm_before_interrupt(&svm->vcpu); + + kvm_load_host_xsave_state(vcpu); + stgi(); + + /* Any pending NMI will happen here */ + + if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI)) + kvm_after_interrupt(&svm->vcpu); + + sync_cr8_to_lapic(vcpu); + + svm->next_rip = 0; + + svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING; + + /* if exit due to PF check for async PF */ + if (svm->vmcb->control.exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) + svm->vcpu.arch.apf.host_apf_reason = kvm_read_and_reset_pf_reason(); + + if (npt_enabled) { + vcpu->arch.regs_avail &= ~(1 << VCPU_EXREG_PDPTR); + vcpu->arch.regs_dirty &= ~(1 << VCPU_EXREG_PDPTR); + } + + /* + * We need to handle MC intercepts here before the vcpu has a chance to + * change the physical cpu + */ + if (unlikely(svm->vmcb->control.exit_code == + SVM_EXIT_EXCP_BASE + MC_VECTOR)) + svm_handle_mce(svm); + + mark_all_clean(svm->vmcb); +} +STACK_FRAME_NON_STANDARD(svm_vcpu_run); + +static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long root) +{ + struct vcpu_svm *svm = to_svm(vcpu); + bool update_guest_cr3 = true; + unsigned long cr3; + + cr3 = __sme_set(root); + if (npt_enabled) { + svm->vmcb->control.nested_cr3 = cr3; + mark_dirty(svm->vmcb, VMCB_NPT); + + /* Loading L2's CR3 is handled by enter_svm_guest_mode. */ + if (is_guest_mode(vcpu)) + update_guest_cr3 = false; + else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail)) + cr3 = vcpu->arch.cr3; + else /* CR3 is already up-to-date. */ + update_guest_cr3 = false; + } + + if (update_guest_cr3) { + svm->vmcb->save.cr3 = cr3; + mark_dirty(svm->vmcb, VMCB_CR); + } +} + +static int is_disabled(void) +{ + u64 vm_cr; + + rdmsrl(MSR_VM_CR, vm_cr); + if (vm_cr & (1 << SVM_VM_CR_SVM_DISABLE)) + return 1; + + return 0; +} + +static void +svm_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) +{ + /* + * Patch in the VMMCALL instruction: + */ + hypercall[0] = 0x0f; + hypercall[1] = 0x01; + hypercall[2] = 0xd9; +} + +static int __init svm_check_processor_compat(void) +{ + return 0; +} + +static bool svm_cpu_has_accelerated_tpr(void) +{ + return false; +} + +static bool svm_has_emulated_msr(int index) +{ + switch (index) { + case MSR_IA32_MCG_EXT_CTL: + case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC: + return false; + default: + break; + } + + return true; +} + +static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) +{ + return 0; +} + +static void svm_cpuid_update(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + vcpu->arch.xsaves_enabled = guest_cpuid_has(vcpu, X86_FEATURE_XSAVE) && + boot_cpu_has(X86_FEATURE_XSAVE) && + boot_cpu_has(X86_FEATURE_XSAVES); + + /* Update nrips enabled cache */ + svm->nrips_enabled = kvm_cpu_cap_has(X86_FEATURE_NRIPS) && + guest_cpuid_has(&svm->vcpu, X86_FEATURE_NRIPS); + + if (!kvm_vcpu_apicv_active(vcpu)) + return; + + /* + * AVIC does not work with an x2APIC mode guest. If the X2APIC feature + * is exposed to the guest, disable AVIC. + */ + if (guest_cpuid_has(vcpu, X86_FEATURE_X2APIC)) + kvm_request_apicv_update(vcpu->kvm, false, + APICV_INHIBIT_REASON_X2APIC); + + /* + * Currently, AVIC does not work with nested virtualization. + * So, we disable AVIC when cpuid for SVM is set in the L1 guest. + */ + if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM)) + kvm_request_apicv_update(vcpu->kvm, false, + APICV_INHIBIT_REASON_NESTED); +} + +static bool svm_has_wbinvd_exit(void) +{ + return true; +} + +#define PRE_EX(exit) { .exit_code = (exit), \ + .stage = X86_ICPT_PRE_EXCEPT, } +#define POST_EX(exit) { .exit_code = (exit), \ + .stage = X86_ICPT_POST_EXCEPT, } +#define POST_MEM(exit) { .exit_code = (exit), \ + .stage = X86_ICPT_POST_MEMACCESS, } + +static const struct __x86_intercept { + u32 exit_code; + enum x86_intercept_stage stage; +} x86_intercept_map[] = { + [x86_intercept_cr_read] = POST_EX(SVM_EXIT_READ_CR0), + [x86_intercept_cr_write] = POST_EX(SVM_EXIT_WRITE_CR0), + [x86_intercept_clts] = POST_EX(SVM_EXIT_WRITE_CR0), + [x86_intercept_lmsw] = POST_EX(SVM_EXIT_WRITE_CR0), + [x86_intercept_smsw] = POST_EX(SVM_EXIT_READ_CR0), + [x86_intercept_dr_read] = POST_EX(SVM_EXIT_READ_DR0), + [x86_intercept_dr_write] = POST_EX(SVM_EXIT_WRITE_DR0), + [x86_intercept_sldt] = POST_EX(SVM_EXIT_LDTR_READ), + [x86_intercept_str] = POST_EX(SVM_EXIT_TR_READ), + [x86_intercept_lldt] = POST_EX(SVM_EXIT_LDTR_WRITE), + [x86_intercept_ltr] = POST_EX(SVM_EXIT_TR_WRITE), + [x86_intercept_sgdt] = POST_EX(SVM_EXIT_GDTR_READ), + [x86_intercept_sidt] = POST_EX(SVM_EXIT_IDTR_READ), + [x86_intercept_lgdt] = POST_EX(SVM_EXIT_GDTR_WRITE), + [x86_intercept_lidt] = POST_EX(SVM_EXIT_IDTR_WRITE), + [x86_intercept_vmrun] = POST_EX(SVM_EXIT_VMRUN), + [x86_intercept_vmmcall] = POST_EX(SVM_EXIT_VMMCALL), + [x86_intercept_vmload] = POST_EX(SVM_EXIT_VMLOAD), + [x86_intercept_vmsave] = POST_EX(SVM_EXIT_VMSAVE), + [x86_intercept_stgi] = POST_EX(SVM_EXIT_STGI), + [x86_intercept_clgi] = POST_EX(SVM_EXIT_CLGI), + [x86_intercept_skinit] = POST_EX(SVM_EXIT_SKINIT), + [x86_intercept_invlpga] = POST_EX(SVM_EXIT_INVLPGA), + [x86_intercept_rdtscp] = POST_EX(SVM_EXIT_RDTSCP), + [x86_intercept_monitor] = POST_MEM(SVM_EXIT_MONITOR), + [x86_intercept_mwait] = POST_EX(SVM_EXIT_MWAIT), + [x86_intercept_invlpg] = POST_EX(SVM_EXIT_INVLPG), + [x86_intercept_invd] = POST_EX(SVM_EXIT_INVD), + [x86_intercept_wbinvd] = POST_EX(SVM_EXIT_WBINVD), + [x86_intercept_wrmsr] = POST_EX(SVM_EXIT_MSR), + [x86_intercept_rdtsc] = POST_EX(SVM_EXIT_RDTSC), + [x86_intercept_rdmsr] = POST_EX(SVM_EXIT_MSR), + [x86_intercept_rdpmc] = POST_EX(SVM_EXIT_RDPMC), + [x86_intercept_cpuid] = PRE_EX(SVM_EXIT_CPUID), + [x86_intercept_rsm] = PRE_EX(SVM_EXIT_RSM), + [x86_intercept_pause] = PRE_EX(SVM_EXIT_PAUSE), + [x86_intercept_pushf] = PRE_EX(SVM_EXIT_PUSHF), + [x86_intercept_popf] = PRE_EX(SVM_EXIT_POPF), + [x86_intercept_intn] = PRE_EX(SVM_EXIT_SWINT), + [x86_intercept_iret] = PRE_EX(SVM_EXIT_IRET), + [x86_intercept_icebp] = PRE_EX(SVM_EXIT_ICEBP), + [x86_intercept_hlt] = POST_EX(SVM_EXIT_HLT), + [x86_intercept_in] = POST_EX(SVM_EXIT_IOIO), + [x86_intercept_ins] = POST_EX(SVM_EXIT_IOIO), + [x86_intercept_out] = POST_EX(SVM_EXIT_IOIO), + [x86_intercept_outs] = POST_EX(SVM_EXIT_IOIO), + [x86_intercept_xsetbv] = PRE_EX(SVM_EXIT_XSETBV), +}; + +#undef PRE_EX +#undef POST_EX +#undef POST_MEM + +static int svm_check_intercept(struct kvm_vcpu *vcpu, + struct x86_instruction_info *info, + enum x86_intercept_stage stage, + struct x86_exception *exception) +{ + struct vcpu_svm *svm = to_svm(vcpu); + int vmexit, ret = X86EMUL_CONTINUE; + struct __x86_intercept icpt_info; + struct vmcb *vmcb = svm->vmcb; + + if (info->intercept >= ARRAY_SIZE(x86_intercept_map)) + goto out; + + icpt_info = x86_intercept_map[info->intercept]; + + if (stage != icpt_info.stage) + goto out; + + switch (icpt_info.exit_code) { + case SVM_EXIT_READ_CR0: + if (info->intercept == x86_intercept_cr_read) + icpt_info.exit_code += info->modrm_reg; + break; + case SVM_EXIT_WRITE_CR0: { + unsigned long cr0, val; + u64 intercept; + + if (info->intercept == x86_intercept_cr_write) + icpt_info.exit_code += info->modrm_reg; + + if (icpt_info.exit_code != SVM_EXIT_WRITE_CR0 || + info->intercept == x86_intercept_clts) + break; + + intercept = svm->nested.intercept; + + if (!(intercept & (1ULL << INTERCEPT_SELECTIVE_CR0))) + break; + + cr0 = vcpu->arch.cr0 & ~SVM_CR0_SELECTIVE_MASK; + val = info->src_val & ~SVM_CR0_SELECTIVE_MASK; + + if (info->intercept == x86_intercept_lmsw) { + cr0 &= 0xfUL; + val &= 0xfUL; + /* lmsw can't clear PE - catch this here */ + if (cr0 & X86_CR0_PE) + val |= X86_CR0_PE; + } + + if (cr0 ^ val) + icpt_info.exit_code = SVM_EXIT_CR0_SEL_WRITE; + + break; + } + case SVM_EXIT_READ_DR0: + case SVM_EXIT_WRITE_DR0: + icpt_info.exit_code += info->modrm_reg; + break; + case SVM_EXIT_MSR: + if (info->intercept == x86_intercept_wrmsr) + vmcb->control.exit_info_1 = 1; + else + vmcb->control.exit_info_1 = 0; + break; + case SVM_EXIT_PAUSE: + /* + * We get this for NOP only, but pause + * is rep not, check this here + */ + if (info->rep_prefix != REPE_PREFIX) + goto out; + break; + case SVM_EXIT_IOIO: { + u64 exit_info; + u32 bytes; + + if (info->intercept == x86_intercept_in || + info->intercept == x86_intercept_ins) { + exit_info = ((info->src_val & 0xffff) << 16) | + SVM_IOIO_TYPE_MASK; + bytes = info->dst_bytes; + } else { + exit_info = (info->dst_val & 0xffff) << 16; + bytes = info->src_bytes; + } + + if (info->intercept == x86_intercept_outs || + info->intercept == x86_intercept_ins) + exit_info |= SVM_IOIO_STR_MASK; + + if (info->rep_prefix) + exit_info |= SVM_IOIO_REP_MASK; + + bytes = min(bytes, 4u); + + exit_info |= bytes << SVM_IOIO_SIZE_SHIFT; + + exit_info |= (u32)info->ad_bytes << (SVM_IOIO_ASIZE_SHIFT - 1); + + vmcb->control.exit_info_1 = exit_info; + vmcb->control.exit_info_2 = info->next_rip; + + break; + } + default: + break; + } + + /* TODO: Advertise NRIPS to guest hypervisor unconditionally */ + if (static_cpu_has(X86_FEATURE_NRIPS)) + vmcb->control.next_rip = info->next_rip; + vmcb->control.exit_code = icpt_info.exit_code; + vmexit = nested_svm_exit_handled(svm); + + ret = (vmexit == NESTED_EXIT_DONE) ? X86EMUL_INTERCEPTED + : X86EMUL_CONTINUE; + +out: + return ret; +} + +static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu, + enum exit_fastpath_completion *exit_fastpath) +{ + if (!is_guest_mode(vcpu) && + to_svm(vcpu)->vmcb->control.exit_code == SVM_EXIT_MSR && + to_svm(vcpu)->vmcb->control.exit_info_1) + *exit_fastpath = handle_fastpath_set_msr_irqoff(vcpu); +} + +static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ + if (pause_filter_thresh) + shrink_ple_window(vcpu); +} + +static inline void avic_post_state_restore(struct kvm_vcpu *vcpu) +{ + if (avic_handle_apic_id_update(vcpu) != 0) + return; + avic_handle_dfr_update(vcpu); + avic_handle_ldr_update(vcpu); +} + +static void svm_setup_mce(struct kvm_vcpu *vcpu) +{ + /* [63:9] are reserved. */ + vcpu->arch.mcg_cap &= 0x1ff; +} + +static int svm_smi_allowed(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + /* Per APM Vol.2 15.22.2 "Response to SMI" */ + if (!gif_set(svm)) + return 0; + + if (is_guest_mode(&svm->vcpu) && + svm->nested.intercept & (1ULL << INTERCEPT_SMI)) { + /* TODO: Might need to set exit_info_1 and exit_info_2 here */ + svm->vmcb->control.exit_code = SVM_EXIT_SMI; + svm->nested.exit_required = true; + return 0; + } + + return 1; +} + +static int svm_pre_enter_smm(struct kvm_vcpu *vcpu, char *smstate) +{ + struct vcpu_svm *svm = to_svm(vcpu); + int ret; + + if (is_guest_mode(vcpu)) { + /* FED8h - SVM Guest */ + put_smstate(u64, smstate, 0x7ed8, 1); + /* FEE0h - SVM Guest VMCB Physical Address */ + put_smstate(u64, smstate, 0x7ee0, svm->nested.vmcb); + + svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX]; + svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP]; + svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP]; + + ret = nested_svm_vmexit(svm); + if (ret) + return ret; + } + return 0; +} + +static int svm_pre_leave_smm(struct kvm_vcpu *vcpu, const char *smstate) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb *nested_vmcb; + struct kvm_host_map map; + u64 guest; + u64 vmcb; + + guest = GET_SMSTATE(u64, smstate, 0x7ed8); + vmcb = GET_SMSTATE(u64, smstate, 0x7ee0); + + if (guest) { + if (kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb), &map) == -EINVAL) + return 1; + nested_vmcb = map.hva; + enter_svm_guest_mode(svm, vmcb, nested_vmcb, &map); + } + return 0; +} + +static int enable_smi_window(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (!gif_set(svm)) { + if (vgif_enabled(svm)) + set_intercept(svm, INTERCEPT_STGI); + /* STGI will cause a vm exit */ + return 1; + } + return 0; +} + +static int sev_flush_asids(void) +{ + int ret, error; + + /* + * DEACTIVATE will clear the WBINVD indicator causing DF_FLUSH to fail, + * so it must be guarded. + */ + down_write(&sev_deactivate_lock); + + wbinvd_on_all_cpus(); + ret = sev_guest_df_flush(&error); + + up_write(&sev_deactivate_lock); + + if (ret) + pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error); + + return ret; +} + +/* Must be called with the sev_bitmap_lock held */ +static bool __sev_recycle_asids(void) +{ + int pos; + + /* Check if there are any ASIDs to reclaim before performing a flush */ + pos = find_next_bit(sev_reclaim_asid_bitmap, + max_sev_asid, min_sev_asid - 1); + if (pos >= max_sev_asid) + return false; + + if (sev_flush_asids()) + return false; + + bitmap_xor(sev_asid_bitmap, sev_asid_bitmap, sev_reclaim_asid_bitmap, + max_sev_asid); + bitmap_zero(sev_reclaim_asid_bitmap, max_sev_asid); + + return true; +} + +static int sev_asid_new(void) +{ + bool retry = true; + int pos; + + mutex_lock(&sev_bitmap_lock); + + /* + * SEV-enabled guest must use asid from min_sev_asid to max_sev_asid. + */ +again: + pos = find_next_zero_bit(sev_asid_bitmap, max_sev_asid, min_sev_asid - 1); + if (pos >= max_sev_asid) { + if (retry && __sev_recycle_asids()) { + retry = false; + goto again; + } + mutex_unlock(&sev_bitmap_lock); + return -EBUSY; + } + + __set_bit(pos, sev_asid_bitmap); + + mutex_unlock(&sev_bitmap_lock); + + return pos + 1; +} + +static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + int asid, ret; + + ret = -EBUSY; + if (unlikely(sev->active)) + return ret; + + asid = sev_asid_new(); + if (asid < 0) + return ret; + + ret = sev_platform_init(&argp->error); + if (ret) + goto e_free; + + sev->active = true; + sev->asid = asid; + INIT_LIST_HEAD(&sev->regions_list); + + return 0; + +e_free: + sev_asid_free(asid); + return ret; +} + +static int sev_bind_asid(struct kvm *kvm, unsigned int handle, int *error) +{ + struct sev_data_activate *data; + int asid = sev_get_asid(kvm); + int ret; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + return -ENOMEM; + + /* activate ASID on the given handle */ + data->handle = handle; + data->asid = asid; + ret = sev_guest_activate(data, error); + kfree(data); + + return ret; +} + +static int __sev_issue_cmd(int fd, int id, void *data, int *error) +{ + struct fd f; + int ret; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = sev_issue_cmd_external_user(f.file, id, data, error); + + fdput(f); + return ret; +} + +static int sev_issue_cmd(struct kvm *kvm, int id, void *data, int *error) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + + return __sev_issue_cmd(sev->fd, id, data, error); +} + +static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct sev_data_launch_start *start; + struct kvm_sev_launch_start params; + void *dh_blob, *session_blob; + int *error = &argp->error; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) + return -EFAULT; + + start = kzalloc(sizeof(*start), GFP_KERNEL_ACCOUNT); + if (!start) + return -ENOMEM; + + dh_blob = NULL; + if (params.dh_uaddr) { + dh_blob = psp_copy_user_blob(params.dh_uaddr, params.dh_len); + if (IS_ERR(dh_blob)) { + ret = PTR_ERR(dh_blob); + goto e_free; + } + + start->dh_cert_address = __sme_set(__pa(dh_blob)); + start->dh_cert_len = params.dh_len; + } + + session_blob = NULL; + if (params.session_uaddr) { + session_blob = psp_copy_user_blob(params.session_uaddr, params.session_len); + if (IS_ERR(session_blob)) { + ret = PTR_ERR(session_blob); + goto e_free_dh; + } + + start->session_address = __sme_set(__pa(session_blob)); + start->session_len = params.session_len; + } + + start->handle = params.handle; + start->policy = params.policy; + + /* create memory encryption context */ + ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_LAUNCH_START, start, error); + if (ret) + goto e_free_session; + + /* Bind ASID to this guest */ + ret = sev_bind_asid(kvm, start->handle, error); + if (ret) + goto e_free_session; + + /* return handle to userspace */ + params.handle = start->handle; + if (copy_to_user((void __user *)(uintptr_t)argp->data, ¶ms, sizeof(params))) { + sev_unbind_asid(kvm, start->handle); + ret = -EFAULT; + goto e_free_session; + } + + sev->handle = start->handle; + sev->fd = argp->sev_fd; + +e_free_session: + kfree(session_blob); +e_free_dh: + kfree(dh_blob); +e_free: + kfree(start); + return ret; +} + +static unsigned long get_num_contig_pages(unsigned long idx, + struct page **inpages, unsigned long npages) +{ + unsigned long paddr, next_paddr; + unsigned long i = idx + 1, pages = 1; + + /* find the number of contiguous pages starting from idx */ + paddr = __sme_page_pa(inpages[idx]); + while (i < npages) { + next_paddr = __sme_page_pa(inpages[i++]); + if ((paddr + PAGE_SIZE) == next_paddr) { + pages++; + paddr = next_paddr; + continue; + } + break; + } + + return pages; +} + +static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i; + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct kvm_sev_launch_update_data params; + struct sev_data_launch_update_data *data; + struct page **inpages; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) + return -EFAULT; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + return -ENOMEM; + + vaddr = params.uaddr; + size = params.len; + vaddr_end = vaddr + size; + + /* Lock the user memory. */ + inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1); + if (!inpages) { + ret = -ENOMEM; + goto e_free; + } + + /* + * The LAUNCH_UPDATE command will perform in-place encryption of the + * memory content (i.e it will write the same memory region with C=1). + * It's possible that the cache may contain the data with C=0, i.e., + * unencrypted so invalidate it first. + */ + sev_clflush_pages(inpages, npages); + + for (i = 0; vaddr < vaddr_end; vaddr = next_vaddr, i += pages) { + int offset, len; + + /* + * If the user buffer is not page-aligned, calculate the offset + * within the page. + */ + offset = vaddr & (PAGE_SIZE - 1); + + /* Calculate the number of pages that can be encrypted in one go. */ + pages = get_num_contig_pages(i, inpages, npages); + + len = min_t(size_t, ((pages * PAGE_SIZE) - offset), size); + + data->handle = sev->handle; + data->len = len; + data->address = __sme_page_pa(inpages[i]) + offset; + ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_DATA, data, &argp->error); + if (ret) + goto e_unpin; + + size -= len; + next_vaddr = vaddr + len; + } + +e_unpin: + /* content of memory is updated, mark pages dirty */ + for (i = 0; i < npages; i++) { + set_page_dirty_lock(inpages[i]); + mark_page_accessed(inpages[i]); + } + /* unlock the user pages */ + sev_unpin_memory(kvm, inpages, npages); +e_free: + kfree(data); + return ret; +} + +static int sev_launch_measure(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + void __user *measure = (void __user *)(uintptr_t)argp->data; + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct sev_data_launch_measure *data; + struct kvm_sev_launch_measure params; + void __user *p = NULL; + void *blob = NULL; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (copy_from_user(¶ms, measure, sizeof(params))) + return -EFAULT; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + return -ENOMEM; + + /* User wants to query the blob length */ + if (!params.len) + goto cmd; + + p = (void __user *)(uintptr_t)params.uaddr; + if (p) { + if (params.len > SEV_FW_BLOB_MAX_SIZE) { + ret = -EINVAL; + goto e_free; + } + + ret = -ENOMEM; + blob = kmalloc(params.len, GFP_KERNEL); + if (!blob) + goto e_free; + + data->address = __psp_pa(blob); + data->len = params.len; + } + +cmd: + data->handle = sev->handle; + ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_MEASURE, data, &argp->error); + + /* + * If we query the session length, FW responded with expected data. + */ + if (!params.len) + goto done; + + if (ret) + goto e_free_blob; + + if (blob) { + if (copy_to_user(p, blob, params.len)) + ret = -EFAULT; + } + +done: + params.len = data->len; + if (copy_to_user(measure, ¶ms, sizeof(params))) + ret = -EFAULT; +e_free_blob: + kfree(blob); +e_free: + kfree(data); + return ret; +} + +static int sev_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct sev_data_launch_finish *data; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + return -ENOMEM; + + data->handle = sev->handle; + ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_FINISH, data, &argp->error); + + kfree(data); + return ret; +} + +static int sev_guest_status(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct kvm_sev_guest_status params; + struct sev_data_guest_status *data; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + return -ENOMEM; + + data->handle = sev->handle; + ret = sev_issue_cmd(kvm, SEV_CMD_GUEST_STATUS, data, &argp->error); + if (ret) + goto e_free; + + params.policy = data->policy; + params.state = data->state; + params.handle = data->handle; + + if (copy_to_user((void __user *)(uintptr_t)argp->data, ¶ms, sizeof(params))) + ret = -EFAULT; +e_free: + kfree(data); + return ret; +} + +static int __sev_issue_dbg_cmd(struct kvm *kvm, unsigned long src, + unsigned long dst, int size, + int *error, bool enc) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct sev_data_dbg *data; + int ret; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + return -ENOMEM; + + data->handle = sev->handle; + data->dst_addr = dst; + data->src_addr = src; + data->len = size; + + ret = sev_issue_cmd(kvm, + enc ? SEV_CMD_DBG_ENCRYPT : SEV_CMD_DBG_DECRYPT, + data, error); + kfree(data); + return ret; +} + +static int __sev_dbg_decrypt(struct kvm *kvm, unsigned long src_paddr, + unsigned long dst_paddr, int sz, int *err) +{ + int offset; + + /* + * Its safe to read more than we are asked, caller should ensure that + * destination has enough space. + */ + src_paddr = round_down(src_paddr, 16); + offset = src_paddr & 15; + sz = round_up(sz + offset, 16); + + return __sev_issue_dbg_cmd(kvm, src_paddr, dst_paddr, sz, err, false); +} + +static int __sev_dbg_decrypt_user(struct kvm *kvm, unsigned long paddr, + unsigned long __user dst_uaddr, + unsigned long dst_paddr, + int size, int *err) +{ + struct page *tpage = NULL; + int ret, offset; + + /* if inputs are not 16-byte then use intermediate buffer */ + if (!IS_ALIGNED(dst_paddr, 16) || + !IS_ALIGNED(paddr, 16) || + !IS_ALIGNED(size, 16)) { + tpage = (void *)alloc_page(GFP_KERNEL); + if (!tpage) + return -ENOMEM; + + dst_paddr = __sme_page_pa(tpage); + } + + ret = __sev_dbg_decrypt(kvm, paddr, dst_paddr, size, err); + if (ret) + goto e_free; + + if (tpage) { + offset = paddr & 15; + if (copy_to_user((void __user *)(uintptr_t)dst_uaddr, + page_address(tpage) + offset, size)) + ret = -EFAULT; + } + +e_free: + if (tpage) + __free_page(tpage); + + return ret; +} + +static int __sev_dbg_encrypt_user(struct kvm *kvm, unsigned long paddr, + unsigned long __user vaddr, + unsigned long dst_paddr, + unsigned long __user dst_vaddr, + int size, int *error) +{ + struct page *src_tpage = NULL; + struct page *dst_tpage = NULL; + int ret, len = size; + + /* If source buffer is not aligned then use an intermediate buffer */ + if (!IS_ALIGNED(vaddr, 16)) { + src_tpage = alloc_page(GFP_KERNEL); + if (!src_tpage) + return -ENOMEM; + + if (copy_from_user(page_address(src_tpage), + (void __user *)(uintptr_t)vaddr, size)) { + __free_page(src_tpage); + return -EFAULT; + } + + paddr = __sme_page_pa(src_tpage); + } + + /* + * If destination buffer or length is not aligned then do read-modify-write: + * - decrypt destination in an intermediate buffer + * - copy the source buffer in an intermediate buffer + * - use the intermediate buffer as source buffer + */ + if (!IS_ALIGNED(dst_vaddr, 16) || !IS_ALIGNED(size, 16)) { + int dst_offset; + + dst_tpage = alloc_page(GFP_KERNEL); + if (!dst_tpage) { + ret = -ENOMEM; + goto e_free; + } + + ret = __sev_dbg_decrypt(kvm, dst_paddr, + __sme_page_pa(dst_tpage), size, error); + if (ret) + goto e_free; + + /* + * If source is kernel buffer then use memcpy() otherwise + * copy_from_user(). + */ + dst_offset = dst_paddr & 15; + + if (src_tpage) + memcpy(page_address(dst_tpage) + dst_offset, + page_address(src_tpage), size); + else { + if (copy_from_user(page_address(dst_tpage) + dst_offset, + (void __user *)(uintptr_t)vaddr, size)) { + ret = -EFAULT; + goto e_free; + } + } + + paddr = __sme_page_pa(dst_tpage); + dst_paddr = round_down(dst_paddr, 16); + len = round_up(size, 16); + } + + ret = __sev_issue_dbg_cmd(kvm, paddr, dst_paddr, len, error, true); + +e_free: + if (src_tpage) + __free_page(src_tpage); + if (dst_tpage) + __free_page(dst_tpage); + return ret; +} + +static int sev_dbg_crypt(struct kvm *kvm, struct kvm_sev_cmd *argp, bool dec) +{ + unsigned long vaddr, vaddr_end, next_vaddr; + unsigned long dst_vaddr; + struct page **src_p, **dst_p; + struct kvm_sev_dbg debug; + unsigned long n; + unsigned int size; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (copy_from_user(&debug, (void __user *)(uintptr_t)argp->data, sizeof(debug))) + return -EFAULT; + + if (!debug.len || debug.src_uaddr + debug.len < debug.src_uaddr) + return -EINVAL; + if (!debug.dst_uaddr) + return -EINVAL; + + vaddr = debug.src_uaddr; + size = debug.len; + vaddr_end = vaddr + size; + dst_vaddr = debug.dst_uaddr; + + for (; vaddr < vaddr_end; vaddr = next_vaddr) { + int len, s_off, d_off; + + /* lock userspace source and destination page */ + src_p = sev_pin_memory(kvm, vaddr & PAGE_MASK, PAGE_SIZE, &n, 0); + if (!src_p) + return -EFAULT; + + dst_p = sev_pin_memory(kvm, dst_vaddr & PAGE_MASK, PAGE_SIZE, &n, 1); + if (!dst_p) { + sev_unpin_memory(kvm, src_p, n); + return -EFAULT; + } + + /* + * The DBG_{DE,EN}CRYPT commands will perform {dec,en}cryption of the + * memory content (i.e it will write the same memory region with C=1). + * It's possible that the cache may contain the data with C=0, i.e., + * unencrypted so invalidate it first. + */ + sev_clflush_pages(src_p, 1); + sev_clflush_pages(dst_p, 1); + + /* + * Since user buffer may not be page aligned, calculate the + * offset within the page. + */ + s_off = vaddr & ~PAGE_MASK; + d_off = dst_vaddr & ~PAGE_MASK; + len = min_t(size_t, (PAGE_SIZE - s_off), size); + + if (dec) + ret = __sev_dbg_decrypt_user(kvm, + __sme_page_pa(src_p[0]) + s_off, + dst_vaddr, + __sme_page_pa(dst_p[0]) + d_off, + len, &argp->error); + else + ret = __sev_dbg_encrypt_user(kvm, + __sme_page_pa(src_p[0]) + s_off, + vaddr, + __sme_page_pa(dst_p[0]) + d_off, + dst_vaddr, + len, &argp->error); + + sev_unpin_memory(kvm, src_p, n); + sev_unpin_memory(kvm, dst_p, n); + + if (ret) + goto err; + + next_vaddr = vaddr + len; + dst_vaddr = dst_vaddr + len; + size -= len; + } +err: + return ret; +} + +static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct sev_data_launch_secret *data; + struct kvm_sev_launch_secret params; + struct page **pages; + void *blob, *hdr; + unsigned long n; + int ret, offset; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) + return -EFAULT; + + pages = sev_pin_memory(kvm, params.guest_uaddr, params.guest_len, &n, 1); + if (!pages) + return -ENOMEM; + + /* + * The secret must be copied into contiguous memory region, lets verify + * that userspace memory pages are contiguous before we issue command. + */ + if (get_num_contig_pages(0, pages, n) != n) { + ret = -EINVAL; + goto e_unpin_memory; + } + + ret = -ENOMEM; + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + goto e_unpin_memory; + + offset = params.guest_uaddr & (PAGE_SIZE - 1); + data->guest_address = __sme_page_pa(pages[0]) + offset; + data->guest_len = params.guest_len; + + blob = psp_copy_user_blob(params.trans_uaddr, params.trans_len); + if (IS_ERR(blob)) { + ret = PTR_ERR(blob); + goto e_free; + } + + data->trans_address = __psp_pa(blob); + data->trans_len = params.trans_len; + + hdr = psp_copy_user_blob(params.hdr_uaddr, params.hdr_len); + if (IS_ERR(hdr)) { + ret = PTR_ERR(hdr); + goto e_free_blob; + } + data->hdr_address = __psp_pa(hdr); + data->hdr_len = params.hdr_len; + + data->handle = sev->handle; + ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_SECRET, data, &argp->error); + + kfree(hdr); + +e_free_blob: + kfree(blob); +e_free: + kfree(data); +e_unpin_memory: + sev_unpin_memory(kvm, pages, n); + return ret; +} + +static int svm_mem_enc_op(struct kvm *kvm, void __user *argp) +{ + struct kvm_sev_cmd sev_cmd; + int r; + + if (!svm_sev_enabled()) + return -ENOTTY; + + if (!argp) + return 0; + + if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd))) + return -EFAULT; + + mutex_lock(&kvm->lock); + + switch (sev_cmd.id) { + case KVM_SEV_INIT: + r = sev_guest_init(kvm, &sev_cmd); + break; + case KVM_SEV_LAUNCH_START: + r = sev_launch_start(kvm, &sev_cmd); + break; + case KVM_SEV_LAUNCH_UPDATE_DATA: + r = sev_launch_update_data(kvm, &sev_cmd); + break; + case KVM_SEV_LAUNCH_MEASURE: + r = sev_launch_measure(kvm, &sev_cmd); + break; + case KVM_SEV_LAUNCH_FINISH: + r = sev_launch_finish(kvm, &sev_cmd); + break; + case KVM_SEV_GUEST_STATUS: + r = sev_guest_status(kvm, &sev_cmd); + break; + case KVM_SEV_DBG_DECRYPT: + r = sev_dbg_crypt(kvm, &sev_cmd, true); + break; + case KVM_SEV_DBG_ENCRYPT: + r = sev_dbg_crypt(kvm, &sev_cmd, false); + break; + case KVM_SEV_LAUNCH_SECRET: + r = sev_launch_secret(kvm, &sev_cmd); + break; + default: + r = -EINVAL; + goto out; + } + + if (copy_to_user(argp, &sev_cmd, sizeof(struct kvm_sev_cmd))) + r = -EFAULT; + +out: + mutex_unlock(&kvm->lock); + return r; +} + +static int svm_register_enc_region(struct kvm *kvm, + struct kvm_enc_region *range) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct enc_region *region; + int ret = 0; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (range->addr > ULONG_MAX || range->size > ULONG_MAX) + return -EINVAL; + + region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT); + if (!region) + return -ENOMEM; + + region->pages = sev_pin_memory(kvm, range->addr, range->size, ®ion->npages, 1); + if (!region->pages) { + ret = -ENOMEM; + goto e_free; + } + + /* + * The guest may change the memory encryption attribute from C=0 -> C=1 + * or vice versa for this memory range. Lets make sure caches are + * flushed to ensure that guest data gets written into memory with + * correct C-bit. + */ + sev_clflush_pages(region->pages, region->npages); + + region->uaddr = range->addr; + region->size = range->size; + + mutex_lock(&kvm->lock); + list_add_tail(®ion->list, &sev->regions_list); + mutex_unlock(&kvm->lock); + + return ret; + +e_free: + kfree(region); + return ret; +} + +static struct enc_region * +find_enc_region(struct kvm *kvm, struct kvm_enc_region *range) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct list_head *head = &sev->regions_list; + struct enc_region *i; + + list_for_each_entry(i, head, list) { + if (i->uaddr == range->addr && + i->size == range->size) + return i; + } + + return NULL; +} + + +static int svm_unregister_enc_region(struct kvm *kvm, + struct kvm_enc_region *range) +{ + struct enc_region *region; + int ret; + + mutex_lock(&kvm->lock); + + if (!sev_guest(kvm)) { + ret = -ENOTTY; + goto failed; + } + + region = find_enc_region(kvm, range); + if (!region) { + ret = -EINVAL; + goto failed; + } + + /* + * Ensure that all guest tagged cache entries are flushed before + * releasing the pages back to the system for use. CLFLUSH will + * not do this, so issue a WBINVD. + */ + wbinvd_on_all_cpus(); + + __unregister_enc_region_locked(kvm, region); + + mutex_unlock(&kvm->lock); + return 0; + +failed: + mutex_unlock(&kvm->lock); + return ret; +} + +static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu) +{ + unsigned long cr4 = kvm_read_cr4(vcpu); + bool smep = cr4 & X86_CR4_SMEP; + bool smap = cr4 & X86_CR4_SMAP; + bool is_user = svm_get_cpl(vcpu) == 3; + + /* + * Detect and workaround Errata 1096 Fam_17h_00_0Fh. + * + * Errata: + * When CPU raise #NPF on guest data access and vCPU CR4.SMAP=1, it is + * possible that CPU microcode implementing DecodeAssist will fail + * to read bytes of instruction which caused #NPF. In this case, + * GuestIntrBytes field of the VMCB on a VMEXIT will incorrectly + * return 0 instead of the correct guest instruction bytes. + * + * This happens because CPU microcode reading instruction bytes + * uses a special opcode which attempts to read data using CPL=0 + * priviledges. The microcode reads CS:RIP and if it hits a SMAP + * fault, it gives up and returns no instruction bytes. + * + * Detection: + * We reach here in case CPU supports DecodeAssist, raised #NPF and + * returned 0 in GuestIntrBytes field of the VMCB. + * First, errata can only be triggered in case vCPU CR4.SMAP=1. + * Second, if vCPU CR4.SMEP=1, errata could only be triggered + * in case vCPU CPL==3 (Because otherwise guest would have triggered + * a SMEP fault instead of #NPF). + * Otherwise, vCPU CR4.SMEP=0, errata could be triggered by any vCPU CPL. + * As most guests enable SMAP if they have also enabled SMEP, use above + * logic in order to attempt minimize false-positive of detecting errata + * while still preserving all cases semantic correctness. + * + * Workaround: + * To determine what instruction the guest was executing, the hypervisor + * will have to decode the instruction at the instruction pointer. + * + * In non SEV guest, hypervisor will be able to read the guest + * memory to decode the instruction pointer when insn_len is zero + * so we return true to indicate that decoding is possible. + * + * But in the SEV guest, the guest memory is encrypted with the + * guest specific key and hypervisor will not be able to decode the + * instruction pointer so we will not able to workaround it. Lets + * print the error and request to kill the guest. + */ + if (smap && (!smep || is_user)) { + if (!sev_guest(vcpu->kvm)) + return true; + + pr_err_ratelimited("KVM: SEV Guest triggered AMD Erratum 1096\n"); + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + } + + return false; +} + +static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + /* + * TODO: Last condition latch INIT signals on vCPU when + * vCPU is in guest-mode and vmcb12 defines intercept on INIT. + * To properly emulate the INIT intercept, SVM should implement + * kvm_x86_ops.check_nested_events() and call nested_svm_vmexit() + * there if an INIT signal is pending. + */ + return !gif_set(svm) || + (svm->vmcb->control.intercept & (1ULL << INTERCEPT_INIT)); +} + +static bool svm_check_apicv_inhibit_reasons(ulong bit) +{ + ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) | + BIT(APICV_INHIBIT_REASON_HYPERV) | + BIT(APICV_INHIBIT_REASON_NESTED) | + BIT(APICV_INHIBIT_REASON_IRQWIN) | + BIT(APICV_INHIBIT_REASON_PIT_REINJ) | + BIT(APICV_INHIBIT_REASON_X2APIC); + + return supported & BIT(bit); +} + +static void svm_pre_update_apicv_exec_ctrl(struct kvm *kvm, bool activate) +{ + avic_update_access_page(kvm, activate); +} + +static struct kvm_x86_ops svm_x86_ops __initdata = { + .hardware_unsetup = svm_hardware_teardown, + .hardware_enable = svm_hardware_enable, + .hardware_disable = svm_hardware_disable, + .cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr, + .has_emulated_msr = svm_has_emulated_msr, + + .vcpu_create = svm_create_vcpu, + .vcpu_free = svm_free_vcpu, + .vcpu_reset = svm_vcpu_reset, + + .vm_size = sizeof(struct kvm_svm), + .vm_init = svm_vm_init, + .vm_destroy = svm_vm_destroy, + + .prepare_guest_switch = svm_prepare_guest_switch, + .vcpu_load = svm_vcpu_load, + .vcpu_put = svm_vcpu_put, + .vcpu_blocking = svm_vcpu_blocking, + .vcpu_unblocking = svm_vcpu_unblocking, + + .update_bp_intercept = update_bp_intercept, + .get_msr_feature = svm_get_msr_feature, + .get_msr = svm_get_msr, + .set_msr = svm_set_msr, + .get_segment_base = svm_get_segment_base, + .get_segment = svm_get_segment, + .set_segment = svm_set_segment, + .get_cpl = svm_get_cpl, + .get_cs_db_l_bits = kvm_get_cs_db_l_bits, + .decache_cr0_guest_bits = svm_decache_cr0_guest_bits, + .decache_cr4_guest_bits = svm_decache_cr4_guest_bits, + .set_cr0 = svm_set_cr0, + .set_cr4 = svm_set_cr4, + .set_efer = svm_set_efer, + .get_idt = svm_get_idt, + .set_idt = svm_set_idt, + .get_gdt = svm_get_gdt, + .set_gdt = svm_set_gdt, + .get_dr6 = svm_get_dr6, + .set_dr6 = svm_set_dr6, + .set_dr7 = svm_set_dr7, + .sync_dirty_debug_regs = svm_sync_dirty_debug_regs, + .cache_reg = svm_cache_reg, + .get_rflags = svm_get_rflags, + .set_rflags = svm_set_rflags, + + .tlb_flush = svm_flush_tlb, + .tlb_flush_gva = svm_flush_tlb_gva, + + .run = svm_vcpu_run, + .handle_exit = handle_exit, + .skip_emulated_instruction = skip_emulated_instruction, + .update_emulated_instruction = NULL, + .set_interrupt_shadow = svm_set_interrupt_shadow, + .get_interrupt_shadow = svm_get_interrupt_shadow, + .patch_hypercall = svm_patch_hypercall, + .set_irq = svm_set_irq, + .set_nmi = svm_inject_nmi, + .queue_exception = svm_queue_exception, + .cancel_injection = svm_cancel_injection, + .interrupt_allowed = svm_interrupt_allowed, + .nmi_allowed = svm_nmi_allowed, + .get_nmi_mask = svm_get_nmi_mask, + .set_nmi_mask = svm_set_nmi_mask, + .enable_nmi_window = enable_nmi_window, + .enable_irq_window = enable_irq_window, + .update_cr8_intercept = update_cr8_intercept, + .set_virtual_apic_mode = svm_set_virtual_apic_mode, + .refresh_apicv_exec_ctrl = svm_refresh_apicv_exec_ctrl, + .check_apicv_inhibit_reasons = svm_check_apicv_inhibit_reasons, + .pre_update_apicv_exec_ctrl = svm_pre_update_apicv_exec_ctrl, + .load_eoi_exitmap = svm_load_eoi_exitmap, + .hwapic_irr_update = svm_hwapic_irr_update, + .hwapic_isr_update = svm_hwapic_isr_update, + .sync_pir_to_irr = kvm_lapic_find_highest_irr, + .apicv_post_state_restore = avic_post_state_restore, + + .set_tss_addr = svm_set_tss_addr, + .set_identity_map_addr = svm_set_identity_map_addr, + .get_tdp_level = get_npt_level, + .get_mt_mask = svm_get_mt_mask, + + .get_exit_info = svm_get_exit_info, + + .cpuid_update = svm_cpuid_update, + + .has_wbinvd_exit = svm_has_wbinvd_exit, + + .read_l1_tsc_offset = svm_read_l1_tsc_offset, + .write_l1_tsc_offset = svm_write_l1_tsc_offset, + + .load_mmu_pgd = svm_load_mmu_pgd, + + .check_intercept = svm_check_intercept, + .handle_exit_irqoff = svm_handle_exit_irqoff, + + .request_immediate_exit = __kvm_request_immediate_exit, + + .sched_in = svm_sched_in, + + .pmu_ops = &amd_pmu_ops, + .deliver_posted_interrupt = svm_deliver_avic_intr, + .dy_apicv_has_pending_interrupt = svm_dy_apicv_has_pending_interrupt, + .update_pi_irte = svm_update_pi_irte, + .setup_mce = svm_setup_mce, + + .smi_allowed = svm_smi_allowed, + .pre_enter_smm = svm_pre_enter_smm, + .pre_leave_smm = svm_pre_leave_smm, + .enable_smi_window = enable_smi_window, + + .mem_enc_op = svm_mem_enc_op, + .mem_enc_reg_region = svm_register_enc_region, + .mem_enc_unreg_region = svm_unregister_enc_region, + + .nested_enable_evmcs = NULL, + .nested_get_evmcs_version = NULL, + + .need_emulation_on_page_fault = svm_need_emulation_on_page_fault, + + .apic_init_signal_blocked = svm_apic_init_signal_blocked, + + .check_nested_events = svm_check_nested_events, +}; + +static struct kvm_x86_init_ops svm_init_ops __initdata = { + .cpu_has_kvm_support = has_svm, + .disabled_by_bios = is_disabled, + .hardware_setup = svm_hardware_setup, + .check_processor_compatibility = svm_check_processor_compat, + + .runtime_ops = &svm_x86_ops, +}; + +static int __init svm_init(void) +{ + return kvm_init(&svm_init_ops, sizeof(struct vcpu_svm), + __alignof__(struct vcpu_svm), THIS_MODULE); +} + +static void __exit svm_exit(void) +{ + kvm_exit(); +} + +module_init(svm_init) +module_exit(svm_exit) -- cgit From 883b0a91f41ab705daa04c24e59d708e457a0bed Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Tue, 24 Mar 2020 10:41:52 +0100 Subject: KVM: SVM: Move Nested SVM Implementation to nested.c Split out the code for the nested SVM implementation and move it to a separate file. Signed-off-by: Joerg Roedel Message-Id: <20200324094154.32352-3-joro@8bytes.org> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/Makefile | 2 +- arch/x86/kvm/svm/nested.c | 823 ++++++++++++++++++++++++++++++++ arch/x86/kvm/svm/svm.c | 1155 +-------------------------------------------- arch/x86/kvm/svm/svm.h | 381 +++++++++++++++ 4 files changed, 1216 insertions(+), 1145 deletions(-) create mode 100644 arch/x86/kvm/svm/nested.c create mode 100644 arch/x86/kvm/svm/svm.h (limited to 'arch/x86') diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index 0ef050982c37..0acec73ef236 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -14,7 +14,7 @@ kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \ hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o -kvm-amd-y += svm/svm.o svm/pmu.o +kvm-amd-y += svm/svm.o svm/pmu.o svm/nested.o obj-$(CONFIG_KVM) += kvm.o obj-$(CONFIG_KVM_INTEL) += kvm-intel.o diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c new file mode 100644 index 000000000000..90a1ca939627 --- /dev/null +++ b/arch/x86/kvm/svm/nested.c @@ -0,0 +1,823 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Kernel-based Virtual Machine driver for Linux + * + * AMD SVM support + * + * Copyright (C) 2006 Qumranet, Inc. + * Copyright 2010 Red Hat, Inc. and/or its affiliates. + * + * Authors: + * Yaniv Kamay + * Avi Kivity + */ + +#define pr_fmt(fmt) "SVM: " fmt + +#include +#include +#include + +#include + +#include "kvm_emulate.h" +#include "trace.h" +#include "mmu.h" +#include "x86.h" +#include "svm.h" + +static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu, + struct x86_exception *fault) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + if (svm->vmcb->control.exit_code != SVM_EXIT_NPF) { + /* + * TODO: track the cause of the nested page fault, and + * correctly fill in the high bits of exit_info_1. + */ + svm->vmcb->control.exit_code = SVM_EXIT_NPF; + svm->vmcb->control.exit_code_hi = 0; + svm->vmcb->control.exit_info_1 = (1ULL << 32); + svm->vmcb->control.exit_info_2 = fault->address; + } + + svm->vmcb->control.exit_info_1 &= ~0xffffffffULL; + svm->vmcb->control.exit_info_1 |= fault->error_code; + + /* + * The present bit is always zero for page structure faults on real + * hardware. + */ + if (svm->vmcb->control.exit_info_1 & (2ULL << 32)) + svm->vmcb->control.exit_info_1 &= ~1; + + nested_svm_vmexit(svm); +} + +static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index) +{ + struct vcpu_svm *svm = to_svm(vcpu); + u64 cr3 = svm->nested.nested_cr3; + u64 pdpte; + int ret; + + ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(__sme_clr(cr3)), &pdpte, + offset_in_page(cr3) + index * 8, 8); + if (ret) + return 0; + return pdpte; +} + +static unsigned long nested_svm_get_tdp_cr3(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + return svm->nested.nested_cr3; +} + +static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu) +{ + WARN_ON(mmu_is_nested(vcpu)); + + vcpu->arch.mmu = &vcpu->arch.guest_mmu; + kvm_init_shadow_mmu(vcpu); + vcpu->arch.mmu->get_guest_pgd = nested_svm_get_tdp_cr3; + vcpu->arch.mmu->get_pdptr = nested_svm_get_tdp_pdptr; + vcpu->arch.mmu->inject_page_fault = nested_svm_inject_npf_exit; + vcpu->arch.mmu->shadow_root_level = kvm_x86_ops.get_tdp_level(vcpu); + reset_shadow_zero_bits_mask(vcpu, vcpu->arch.mmu); + vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu; +} + +static void nested_svm_uninit_mmu_context(struct kvm_vcpu *vcpu) +{ + vcpu->arch.mmu = &vcpu->arch.root_mmu; + vcpu->arch.walk_mmu = &vcpu->arch.root_mmu; +} + +void recalc_intercepts(struct vcpu_svm *svm) +{ + struct vmcb_control_area *c, *h; + struct nested_state *g; + + mark_dirty(svm->vmcb, VMCB_INTERCEPTS); + + if (!is_guest_mode(&svm->vcpu)) + return; + + c = &svm->vmcb->control; + h = &svm->nested.hsave->control; + g = &svm->nested; + + c->intercept_cr = h->intercept_cr; + c->intercept_dr = h->intercept_dr; + c->intercept_exceptions = h->intercept_exceptions; + c->intercept = h->intercept; + + if (svm->vcpu.arch.hflags & HF_VINTR_MASK) { + /* We only want the cr8 intercept bits of L1 */ + c->intercept_cr &= ~(1U << INTERCEPT_CR8_READ); + c->intercept_cr &= ~(1U << INTERCEPT_CR8_WRITE); + + /* + * Once running L2 with HF_VINTR_MASK, EFLAGS.IF does not + * affect any interrupt we may want to inject; therefore, + * interrupt window vmexits are irrelevant to L0. + */ + c->intercept &= ~(1ULL << INTERCEPT_VINTR); + } + + /* We don't want to see VMMCALLs from a nested guest */ + c->intercept &= ~(1ULL << INTERCEPT_VMMCALL); + + c->intercept_cr |= g->intercept_cr; + c->intercept_dr |= g->intercept_dr; + c->intercept_exceptions |= g->intercept_exceptions; + c->intercept |= g->intercept; +} + +static void copy_vmcb_control_area(struct vmcb *dst_vmcb, struct vmcb *from_vmcb) +{ + struct vmcb_control_area *dst = &dst_vmcb->control; + struct vmcb_control_area *from = &from_vmcb->control; + + dst->intercept_cr = from->intercept_cr; + dst->intercept_dr = from->intercept_dr; + dst->intercept_exceptions = from->intercept_exceptions; + dst->intercept = from->intercept; + dst->iopm_base_pa = from->iopm_base_pa; + dst->msrpm_base_pa = from->msrpm_base_pa; + dst->tsc_offset = from->tsc_offset; + dst->asid = from->asid; + dst->tlb_ctl = from->tlb_ctl; + dst->int_ctl = from->int_ctl; + dst->int_vector = from->int_vector; + dst->int_state = from->int_state; + dst->exit_code = from->exit_code; + dst->exit_code_hi = from->exit_code_hi; + dst->exit_info_1 = from->exit_info_1; + dst->exit_info_2 = from->exit_info_2; + dst->exit_int_info = from->exit_int_info; + dst->exit_int_info_err = from->exit_int_info_err; + dst->nested_ctl = from->nested_ctl; + dst->event_inj = from->event_inj; + dst->event_inj_err = from->event_inj_err; + dst->nested_cr3 = from->nested_cr3; + dst->virt_ext = from->virt_ext; + dst->pause_filter_count = from->pause_filter_count; + dst->pause_filter_thresh = from->pause_filter_thresh; +} + +static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm) +{ + /* + * This function merges the msr permission bitmaps of kvm and the + * nested vmcb. It is optimized in that it only merges the parts where + * the kvm msr permission bitmap may contain zero bits + */ + int i; + + if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT))) + return true; + + for (i = 0; i < MSRPM_OFFSETS; i++) { + u32 value, p; + u64 offset; + + if (msrpm_offsets[i] == 0xffffffff) + break; + + p = msrpm_offsets[i]; + offset = svm->nested.vmcb_msrpm + (p * 4); + + if (kvm_vcpu_read_guest(&svm->vcpu, offset, &value, 4)) + return false; + + svm->nested.msrpm[p] = svm->msrpm[p] | value; + } + + svm->vmcb->control.msrpm_base_pa = __sme_set(__pa(svm->nested.msrpm)); + + return true; +} + +static bool nested_vmcb_checks(struct vmcb *vmcb) +{ + if ((vmcb->save.efer & EFER_SVME) == 0) + return false; + + if ((vmcb->control.intercept & (1ULL << INTERCEPT_VMRUN)) == 0) + return false; + + if (vmcb->control.asid == 0) + return false; + + if ((vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && + !npt_enabled) + return false; + + return true; +} + +void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, + struct vmcb *nested_vmcb, struct kvm_host_map *map) +{ + bool evaluate_pending_interrupts = + is_intercept(svm, INTERCEPT_VINTR) || + is_intercept(svm, INTERCEPT_IRET); + + if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF) + svm->vcpu.arch.hflags |= HF_HIF_MASK; + else + svm->vcpu.arch.hflags &= ~HF_HIF_MASK; + + if (nested_vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE) { + svm->nested.nested_cr3 = nested_vmcb->control.nested_cr3; + nested_svm_init_mmu_context(&svm->vcpu); + } + + /* Load the nested guest state */ + svm->vmcb->save.es = nested_vmcb->save.es; + svm->vmcb->save.cs = nested_vmcb->save.cs; + svm->vmcb->save.ss = nested_vmcb->save.ss; + svm->vmcb->save.ds = nested_vmcb->save.ds; + svm->vmcb->save.gdtr = nested_vmcb->save.gdtr; + svm->vmcb->save.idtr = nested_vmcb->save.idtr; + kvm_set_rflags(&svm->vcpu, nested_vmcb->save.rflags); + svm_set_efer(&svm->vcpu, nested_vmcb->save.efer); + svm_set_cr0(&svm->vcpu, nested_vmcb->save.cr0); + svm_set_cr4(&svm->vcpu, nested_vmcb->save.cr4); + if (npt_enabled) { + svm->vmcb->save.cr3 = nested_vmcb->save.cr3; + svm->vcpu.arch.cr3 = nested_vmcb->save.cr3; + } else + (void)kvm_set_cr3(&svm->vcpu, nested_vmcb->save.cr3); + + /* Guest paging mode is active - reset mmu */ + kvm_mmu_reset_context(&svm->vcpu); + + svm->vmcb->save.cr2 = svm->vcpu.arch.cr2 = nested_vmcb->save.cr2; + kvm_rax_write(&svm->vcpu, nested_vmcb->save.rax); + kvm_rsp_write(&svm->vcpu, nested_vmcb->save.rsp); + kvm_rip_write(&svm->vcpu, nested_vmcb->save.rip); + + /* In case we don't even reach vcpu_run, the fields are not updated */ + svm->vmcb->save.rax = nested_vmcb->save.rax; + svm->vmcb->save.rsp = nested_vmcb->save.rsp; + svm->vmcb->save.rip = nested_vmcb->save.rip; + svm->vmcb->save.dr7 = nested_vmcb->save.dr7; + svm->vmcb->save.dr6 = nested_vmcb->save.dr6; + svm->vmcb->save.cpl = nested_vmcb->save.cpl; + + svm->nested.vmcb_msrpm = nested_vmcb->control.msrpm_base_pa & ~0x0fffULL; + svm->nested.vmcb_iopm = nested_vmcb->control.iopm_base_pa & ~0x0fffULL; + + /* cache intercepts */ + svm->nested.intercept_cr = nested_vmcb->control.intercept_cr; + svm->nested.intercept_dr = nested_vmcb->control.intercept_dr; + svm->nested.intercept_exceptions = nested_vmcb->control.intercept_exceptions; + svm->nested.intercept = nested_vmcb->control.intercept; + + svm_flush_tlb(&svm->vcpu, true); + svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK; + if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK) + svm->vcpu.arch.hflags |= HF_VINTR_MASK; + else + svm->vcpu.arch.hflags &= ~HF_VINTR_MASK; + + svm->vcpu.arch.tsc_offset += nested_vmcb->control.tsc_offset; + svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset; + + svm->vmcb->control.virt_ext = nested_vmcb->control.virt_ext; + svm->vmcb->control.int_vector = nested_vmcb->control.int_vector; + svm->vmcb->control.int_state = nested_vmcb->control.int_state; + svm->vmcb->control.event_inj = nested_vmcb->control.event_inj; + svm->vmcb->control.event_inj_err = nested_vmcb->control.event_inj_err; + + svm->vmcb->control.pause_filter_count = + nested_vmcb->control.pause_filter_count; + svm->vmcb->control.pause_filter_thresh = + nested_vmcb->control.pause_filter_thresh; + + kvm_vcpu_unmap(&svm->vcpu, map, true); + + /* Enter Guest-Mode */ + enter_guest_mode(&svm->vcpu); + + /* + * Merge guest and host intercepts - must be called with vcpu in + * guest-mode to take affect here + */ + recalc_intercepts(svm); + + svm->nested.vmcb = vmcb_gpa; + + /* + * If L1 had a pending IRQ/NMI before executing VMRUN, + * which wasn't delivered because it was disallowed (e.g. + * interrupts disabled), L0 needs to evaluate if this pending + * event should cause an exit from L2 to L1 or be delivered + * directly to L2. + * + * Usually this would be handled by the processor noticing an + * IRQ/NMI window request. However, VMRUN can unblock interrupts + * by implicitly setting GIF, so force L0 to perform pending event + * evaluation by requesting a KVM_REQ_EVENT. + */ + enable_gif(svm); + if (unlikely(evaluate_pending_interrupts)) + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); + + mark_all_dirty(svm->vmcb); +} + +int nested_svm_vmrun(struct vcpu_svm *svm) +{ + int ret; + struct vmcb *nested_vmcb; + struct vmcb *hsave = svm->nested.hsave; + struct vmcb *vmcb = svm->vmcb; + struct kvm_host_map map; + u64 vmcb_gpa; + + vmcb_gpa = svm->vmcb->save.rax; + + ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb_gpa), &map); + if (ret == -EINVAL) { + kvm_inject_gp(&svm->vcpu, 0); + return 1; + } else if (ret) { + return kvm_skip_emulated_instruction(&svm->vcpu); + } + + ret = kvm_skip_emulated_instruction(&svm->vcpu); + + nested_vmcb = map.hva; + + if (!nested_vmcb_checks(nested_vmcb)) { + nested_vmcb->control.exit_code = SVM_EXIT_ERR; + nested_vmcb->control.exit_code_hi = 0; + nested_vmcb->control.exit_info_1 = 0; + nested_vmcb->control.exit_info_2 = 0; + + kvm_vcpu_unmap(&svm->vcpu, &map, true); + + return ret; + } + + trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb_gpa, + nested_vmcb->save.rip, + nested_vmcb->control.int_ctl, + nested_vmcb->control.event_inj, + nested_vmcb->control.nested_ctl); + + trace_kvm_nested_intercepts(nested_vmcb->control.intercept_cr & 0xffff, + nested_vmcb->control.intercept_cr >> 16, + nested_vmcb->control.intercept_exceptions, + nested_vmcb->control.intercept); + + /* Clear internal status */ + kvm_clear_exception_queue(&svm->vcpu); + kvm_clear_interrupt_queue(&svm->vcpu); + + /* + * Save the old vmcb, so we don't need to pick what we save, but can + * restore everything when a VMEXIT occurs + */ + hsave->save.es = vmcb->save.es; + hsave->save.cs = vmcb->save.cs; + hsave->save.ss = vmcb->save.ss; + hsave->save.ds = vmcb->save.ds; + hsave->save.gdtr = vmcb->save.gdtr; + hsave->save.idtr = vmcb->save.idtr; + hsave->save.efer = svm->vcpu.arch.efer; + hsave->save.cr0 = kvm_read_cr0(&svm->vcpu); + hsave->save.cr4 = svm->vcpu.arch.cr4; + hsave->save.rflags = kvm_get_rflags(&svm->vcpu); + hsave->save.rip = kvm_rip_read(&svm->vcpu); + hsave->save.rsp = vmcb->save.rsp; + hsave->save.rax = vmcb->save.rax; + if (npt_enabled) + hsave->save.cr3 = vmcb->save.cr3; + else + hsave->save.cr3 = kvm_read_cr3(&svm->vcpu); + + copy_vmcb_control_area(hsave, vmcb); + + enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb, &map); + + if (!nested_svm_vmrun_msrpm(svm)) { + svm->vmcb->control.exit_code = SVM_EXIT_ERR; + svm->vmcb->control.exit_code_hi = 0; + svm->vmcb->control.exit_info_1 = 0; + svm->vmcb->control.exit_info_2 = 0; + + nested_svm_vmexit(svm); + } + + return ret; +} + +void nested_svm_vmloadsave(struct vmcb *from_vmcb, struct vmcb *to_vmcb) +{ + to_vmcb->save.fs = from_vmcb->save.fs; + to_vmcb->save.gs = from_vmcb->save.gs; + to_vmcb->save.tr = from_vmcb->save.tr; + to_vmcb->save.ldtr = from_vmcb->save.ldtr; + to_vmcb->save.kernel_gs_base = from_vmcb->save.kernel_gs_base; + to_vmcb->save.star = from_vmcb->save.star; + to_vmcb->save.lstar = from_vmcb->save.lstar; + to_vmcb->save.cstar = from_vmcb->save.cstar; + to_vmcb->save.sfmask = from_vmcb->save.sfmask; + to_vmcb->save.sysenter_cs = from_vmcb->save.sysenter_cs; + to_vmcb->save.sysenter_esp = from_vmcb->save.sysenter_esp; + to_vmcb->save.sysenter_eip = from_vmcb->save.sysenter_eip; +} + +int nested_svm_vmexit(struct vcpu_svm *svm) +{ + int rc; + struct vmcb *nested_vmcb; + struct vmcb *hsave = svm->nested.hsave; + struct vmcb *vmcb = svm->vmcb; + struct kvm_host_map map; + + trace_kvm_nested_vmexit_inject(vmcb->control.exit_code, + vmcb->control.exit_info_1, + vmcb->control.exit_info_2, + vmcb->control.exit_int_info, + vmcb->control.exit_int_info_err, + KVM_ISA_SVM); + + rc = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->nested.vmcb), &map); + if (rc) { + if (rc == -EINVAL) + kvm_inject_gp(&svm->vcpu, 0); + return 1; + } + + nested_vmcb = map.hva; + + /* Exit Guest-Mode */ + leave_guest_mode(&svm->vcpu); + svm->nested.vmcb = 0; + + /* Give the current vmcb to the guest */ + disable_gif(svm); + + nested_vmcb->save.es = vmcb->save.es; + nested_vmcb->save.cs = vmcb->save.cs; + nested_vmcb->save.ss = vmcb->save.ss; + nested_vmcb->save.ds = vmcb->save.ds; + nested_vmcb->save.gdtr = vmcb->save.gdtr; + nested_vmcb->save.idtr = vmcb->save.idtr; + nested_vmcb->save.efer = svm->vcpu.arch.efer; + nested_vmcb->save.cr0 = kvm_read_cr0(&svm->vcpu); + nested_vmcb->save.cr3 = kvm_read_cr3(&svm->vcpu); + nested_vmcb->save.cr2 = vmcb->save.cr2; + nested_vmcb->save.cr4 = svm->vcpu.arch.cr4; + nested_vmcb->save.rflags = kvm_get_rflags(&svm->vcpu); + nested_vmcb->save.rip = vmcb->save.rip; + nested_vmcb->save.rsp = vmcb->save.rsp; + nested_vmcb->save.rax = vmcb->save.rax; + nested_vmcb->save.dr7 = vmcb->save.dr7; + nested_vmcb->save.dr6 = vmcb->save.dr6; + nested_vmcb->save.cpl = vmcb->save.cpl; + + nested_vmcb->control.int_ctl = vmcb->control.int_ctl; + nested_vmcb->control.int_vector = vmcb->control.int_vector; + nested_vmcb->control.int_state = vmcb->control.int_state; + nested_vmcb->control.exit_code = vmcb->control.exit_code; + nested_vmcb->control.exit_code_hi = vmcb->control.exit_code_hi; + nested_vmcb->control.exit_info_1 = vmcb->control.exit_info_1; + nested_vmcb->control.exit_info_2 = vmcb->control.exit_info_2; + nested_vmcb->control.exit_int_info = vmcb->control.exit_int_info; + nested_vmcb->control.exit_int_info_err = vmcb->control.exit_int_info_err; + + if (svm->nrips_enabled) + nested_vmcb->control.next_rip = vmcb->control.next_rip; + + /* + * If we emulate a VMRUN/#VMEXIT in the same host #vmexit cycle we have + * to make sure that we do not lose injected events. So check event_inj + * here and copy it to exit_int_info if it is valid. + * Exit_int_info and event_inj can't be both valid because the case + * below only happens on a VMRUN instruction intercept which has + * no valid exit_int_info set. + */ + if (vmcb->control.event_inj & SVM_EVTINJ_VALID) { + struct vmcb_control_area *nc = &nested_vmcb->control; + + nc->exit_int_info = vmcb->control.event_inj; + nc->exit_int_info_err = vmcb->control.event_inj_err; + } + + nested_vmcb->control.tlb_ctl = 0; + nested_vmcb->control.event_inj = 0; + nested_vmcb->control.event_inj_err = 0; + + nested_vmcb->control.pause_filter_count = + svm->vmcb->control.pause_filter_count; + nested_vmcb->control.pause_filter_thresh = + svm->vmcb->control.pause_filter_thresh; + + /* We always set V_INTR_MASKING and remember the old value in hflags */ + if (!(svm->vcpu.arch.hflags & HF_VINTR_MASK)) + nested_vmcb->control.int_ctl &= ~V_INTR_MASKING_MASK; + + /* Restore the original control entries */ + copy_vmcb_control_area(vmcb, hsave); + + svm->vcpu.arch.tsc_offset = svm->vmcb->control.tsc_offset; + kvm_clear_exception_queue(&svm->vcpu); + kvm_clear_interrupt_queue(&svm->vcpu); + + svm->nested.nested_cr3 = 0; + + /* Restore selected save entries */ + svm->vmcb->save.es = hsave->save.es; + svm->vmcb->save.cs = hsave->save.cs; + svm->vmcb->save.ss = hsave->save.ss; + svm->vmcb->save.ds = hsave->save.ds; + svm->vmcb->save.gdtr = hsave->save.gdtr; + svm->vmcb->save.idtr = hsave->save.idtr; + kvm_set_rflags(&svm->vcpu, hsave->save.rflags); + svm_set_efer(&svm->vcpu, hsave->save.efer); + svm_set_cr0(&svm->vcpu, hsave->save.cr0 | X86_CR0_PE); + svm_set_cr4(&svm->vcpu, hsave->save.cr4); + if (npt_enabled) { + svm->vmcb->save.cr3 = hsave->save.cr3; + svm->vcpu.arch.cr3 = hsave->save.cr3; + } else { + (void)kvm_set_cr3(&svm->vcpu, hsave->save.cr3); + } + kvm_rax_write(&svm->vcpu, hsave->save.rax); + kvm_rsp_write(&svm->vcpu, hsave->save.rsp); + kvm_rip_write(&svm->vcpu, hsave->save.rip); + svm->vmcb->save.dr7 = 0; + svm->vmcb->save.cpl = 0; + svm->vmcb->control.exit_int_info = 0; + + mark_all_dirty(svm->vmcb); + + kvm_vcpu_unmap(&svm->vcpu, &map, true); + + nested_svm_uninit_mmu_context(&svm->vcpu); + kvm_mmu_reset_context(&svm->vcpu); + kvm_mmu_load(&svm->vcpu); + + /* + * Drop what we picked up for L2 via svm_complete_interrupts() so it + * doesn't end up in L1. + */ + svm->vcpu.arch.nmi_injected = false; + kvm_clear_exception_queue(&svm->vcpu); + kvm_clear_interrupt_queue(&svm->vcpu); + + return 0; +} + +static int nested_svm_exit_handled_msr(struct vcpu_svm *svm) +{ + u32 offset, msr, value; + int write, mask; + + if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT))) + return NESTED_EXIT_HOST; + + msr = svm->vcpu.arch.regs[VCPU_REGS_RCX]; + offset = svm_msrpm_offset(msr); + write = svm->vmcb->control.exit_info_1 & 1; + mask = 1 << ((2 * (msr & 0xf)) + write); + + if (offset == MSR_INVALID) + return NESTED_EXIT_DONE; + + /* Offset is in 32 bit units but need in 8 bit units */ + offset *= 4; + + if (kvm_vcpu_read_guest(&svm->vcpu, svm->nested.vmcb_msrpm + offset, &value, 4)) + return NESTED_EXIT_DONE; + + return (value & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST; +} + +/* DB exceptions for our internal use must not cause vmexit */ +static int nested_svm_intercept_db(struct vcpu_svm *svm) +{ + unsigned long dr6; + + /* if we're not singlestepping, it's not ours */ + if (!svm->nmi_singlestep) + return NESTED_EXIT_DONE; + + /* if it's not a singlestep exception, it's not ours */ + if (kvm_get_dr(&svm->vcpu, 6, &dr6)) + return NESTED_EXIT_DONE; + if (!(dr6 & DR6_BS)) + return NESTED_EXIT_DONE; + + /* if the guest is singlestepping, it should get the vmexit */ + if (svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF) { + disable_nmi_singlestep(svm); + return NESTED_EXIT_DONE; + } + + /* it's ours, the nested hypervisor must not see this one */ + return NESTED_EXIT_HOST; +} + +static int nested_svm_intercept_ioio(struct vcpu_svm *svm) +{ + unsigned port, size, iopm_len; + u16 val, mask; + u8 start_bit; + u64 gpa; + + if (!(svm->nested.intercept & (1ULL << INTERCEPT_IOIO_PROT))) + return NESTED_EXIT_HOST; + + port = svm->vmcb->control.exit_info_1 >> 16; + size = (svm->vmcb->control.exit_info_1 & SVM_IOIO_SIZE_MASK) >> + SVM_IOIO_SIZE_SHIFT; + gpa = svm->nested.vmcb_iopm + (port / 8); + start_bit = port % 8; + iopm_len = (start_bit + size > 8) ? 2 : 1; + mask = (0xf >> (4 - size)) << start_bit; + val = 0; + + if (kvm_vcpu_read_guest(&svm->vcpu, gpa, &val, iopm_len)) + return NESTED_EXIT_DONE; + + return (val & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST; +} + +static int nested_svm_intercept(struct vcpu_svm *svm) +{ + u32 exit_code = svm->vmcb->control.exit_code; + int vmexit = NESTED_EXIT_HOST; + + switch (exit_code) { + case SVM_EXIT_MSR: + vmexit = nested_svm_exit_handled_msr(svm); + break; + case SVM_EXIT_IOIO: + vmexit = nested_svm_intercept_ioio(svm); + break; + case SVM_EXIT_READ_CR0 ... SVM_EXIT_WRITE_CR8: { + u32 bit = 1U << (exit_code - SVM_EXIT_READ_CR0); + if (svm->nested.intercept_cr & bit) + vmexit = NESTED_EXIT_DONE; + break; + } + case SVM_EXIT_READ_DR0 ... SVM_EXIT_WRITE_DR7: { + u32 bit = 1U << (exit_code - SVM_EXIT_READ_DR0); + if (svm->nested.intercept_dr & bit) + vmexit = NESTED_EXIT_DONE; + break; + } + case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: { + u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE); + if (svm->nested.intercept_exceptions & excp_bits) { + if (exit_code == SVM_EXIT_EXCP_BASE + DB_VECTOR) + vmexit = nested_svm_intercept_db(svm); + else + vmexit = NESTED_EXIT_DONE; + } + /* async page fault always cause vmexit */ + else if ((exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) && + svm->vcpu.arch.exception.nested_apf != 0) + vmexit = NESTED_EXIT_DONE; + break; + } + case SVM_EXIT_ERR: { + vmexit = NESTED_EXIT_DONE; + break; + } + default: { + u64 exit_bits = 1ULL << (exit_code - SVM_EXIT_INTR); + if (svm->nested.intercept & exit_bits) + vmexit = NESTED_EXIT_DONE; + } + } + + return vmexit; +} + +int nested_svm_exit_handled(struct vcpu_svm *svm) +{ + int vmexit; + + vmexit = nested_svm_intercept(svm); + + if (vmexit == NESTED_EXIT_DONE) + nested_svm_vmexit(svm); + + return vmexit; +} + +int nested_svm_check_permissions(struct vcpu_svm *svm) +{ + if (!(svm->vcpu.arch.efer & EFER_SVME) || + !is_paging(&svm->vcpu)) { + kvm_queue_exception(&svm->vcpu, UD_VECTOR); + return 1; + } + + if (svm->vmcb->save.cpl) { + kvm_inject_gp(&svm->vcpu, 0); + return 1; + } + + return 0; +} + +int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, + bool has_error_code, u32 error_code) +{ + int vmexit; + + if (!is_guest_mode(&svm->vcpu)) + return 0; + + vmexit = nested_svm_intercept(svm); + if (vmexit != NESTED_EXIT_DONE) + return 0; + + svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr; + svm->vmcb->control.exit_code_hi = 0; + svm->vmcb->control.exit_info_1 = error_code; + + /* + * EXITINFO2 is undefined for all exception intercepts other + * than #PF. + */ + if (svm->vcpu.arch.exception.nested_apf) + svm->vmcb->control.exit_info_2 = svm->vcpu.arch.apf.nested_apf_token; + else if (svm->vcpu.arch.exception.has_payload) + svm->vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload; + else + svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2; + + svm->nested.exit_required = true; + return vmexit; +} + +static void nested_svm_intr(struct vcpu_svm *svm) +{ + svm->vmcb->control.exit_code = SVM_EXIT_INTR; + svm->vmcb->control.exit_info_1 = 0; + svm->vmcb->control.exit_info_2 = 0; + + /* nested_svm_vmexit this gets called afterwards from handle_exit */ + svm->nested.exit_required = true; + trace_kvm_nested_intr_vmexit(svm->vmcb->save.rip); +} + +static bool nested_exit_on_intr(struct vcpu_svm *svm) +{ + return (svm->nested.intercept & 1ULL); +} + +int svm_check_nested_events(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + bool block_nested_events = + kvm_event_needs_reinjection(vcpu) || svm->nested.exit_required; + + if (kvm_cpu_has_interrupt(vcpu) && nested_exit_on_intr(svm)) { + if (block_nested_events) + return -EBUSY; + nested_svm_intr(svm); + return 0; + } + + return 0; +} + +int nested_svm_exit_special(struct vcpu_svm *svm) +{ + u32 exit_code = svm->vmcb->control.exit_code; + + switch (exit_code) { + case SVM_EXIT_INTR: + case SVM_EXIT_NMI: + case SVM_EXIT_EXCP_BASE + MC_VECTOR: + return NESTED_EXIT_HOST; + case SVM_EXIT_NPF: + /* For now we are always handling NPFs when using them */ + if (npt_enabled) + return NESTED_EXIT_HOST; + break; + case SVM_EXIT_EXCP_BASE + PF_VECTOR: + /* When we're shadowing, trap PFs, but not async PF */ + if (!npt_enabled && svm->vcpu.arch.apf.host_apf_reason == 0) + return NESTED_EXIT_HOST; + break; + default: + break; + } + + return NESTED_EXIT_CONTINUE; +} diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 851e9cc79930..a24e2de4acda 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -53,6 +53,8 @@ #include #include "trace.h" +#include "svm.h" + #define __ex(x) __kvm_handle_fault_on_reboot(x) MODULE_AUTHOR("Qumranet"); @@ -82,10 +84,6 @@ MODULE_DEVICE_TABLE(x86cpu, svm_cpu_id); #define SVM_AVIC_DOORBELL 0xc001011b -#define NESTED_EXIT_HOST 0 /* Exit handled on host level */ -#define NESTED_EXIT_DONE 1 /* Exit caused nested vmexit */ -#define NESTED_EXIT_CONTINUE 2 /* Further checks needed */ - #define DEBUGCTL_RESERVED_BITS (~(0x3fULL)) #define TSC_RATIO_RSVD 0xffffff0000000000ULL @@ -119,68 +117,7 @@ MODULE_DEVICE_TABLE(x86cpu, svm_cpu_id); static bool erratum_383_found __read_mostly; -static const u32 host_save_user_msrs[] = { -#ifdef CONFIG_X86_64 - MSR_STAR, MSR_LSTAR, MSR_CSTAR, MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE, - MSR_FS_BASE, -#endif - MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, - MSR_TSC_AUX, -}; - -#define NR_HOST_SAVE_USER_MSRS ARRAY_SIZE(host_save_user_msrs) - -struct kvm_sev_info { - bool active; /* SEV enabled guest */ - unsigned int asid; /* ASID used for this guest */ - unsigned int handle; /* SEV firmware handle */ - int fd; /* SEV device fd */ - unsigned long pages_locked; /* Number of pages locked */ - struct list_head regions_list; /* List of registered regions */ -}; - -struct kvm_svm { - struct kvm kvm; - - /* Struct members for AVIC */ - u32 avic_vm_id; - struct page *avic_logical_id_table_page; - struct page *avic_physical_id_table_page; - struct hlist_node hnode; - - struct kvm_sev_info sev_info; -}; - -struct kvm_vcpu; - -struct nested_state { - struct vmcb *hsave; - u64 hsave_msr; - u64 vm_cr_msr; - u64 vmcb; - - /* These are the merged vectors */ - u32 *msrpm; - - /* gpa pointers to the real vectors */ - u64 vmcb_msrpm; - u64 vmcb_iopm; - - /* A VMEXIT is required but not yet emulated */ - bool exit_required; - - /* cache for intercepts of the guest */ - u32 intercept_cr; - u32 intercept_dr; - u32 intercept_exceptions; - u64 intercept; - - /* Nested Paging related state */ - u64 nested_cr3; -}; - -#define MSRPM_OFFSETS 16 -static u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly; +u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly; /* * Set osvw_len to higher value when updated Revision Guides @@ -188,70 +125,6 @@ static u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly; */ static uint64_t osvw_len = 4, osvw_status; -struct vcpu_svm { - struct kvm_vcpu vcpu; - struct vmcb *vmcb; - unsigned long vmcb_pa; - struct svm_cpu_data *svm_data; - uint64_t asid_generation; - uint64_t sysenter_esp; - uint64_t sysenter_eip; - uint64_t tsc_aux; - - u64 msr_decfg; - - u64 next_rip; - - u64 host_user_msrs[NR_HOST_SAVE_USER_MSRS]; - struct { - u16 fs; - u16 gs; - u16 ldt; - u64 gs_base; - } host; - - u64 spec_ctrl; - /* - * Contains guest-controlled bits of VIRT_SPEC_CTRL, which will be - * translated into the appropriate L2_CFG bits on the host to - * perform speculative control. - */ - u64 virt_spec_ctrl; - - u32 *msrpm; - - ulong nmi_iret_rip; - - struct nested_state nested; - - bool nmi_singlestep; - u64 nmi_singlestep_guest_rflags; - - unsigned int3_injected; - unsigned long int3_rip; - - /* cached guest cpuid flags for faster access */ - bool nrips_enabled : 1; - - u32 ldr_reg; - u32 dfr_reg; - struct page *avic_backing_page; - u64 *avic_physical_id_cache; - bool avic_is_running; - - /* - * Per-vcpu list of struct amd_svm_iommu_ir: - * This is used mainly to store interrupt remapping information used - * when update the vcpu affinity. This avoids the need to scan for - * IRTE and try to match ga_tag in the IOMMU driver. - */ - struct list_head ir_list; - spinlock_t ir_list_lock; - - /* which host CPU was used for running this vcpu */ - unsigned int last_cpu; -}; - /* * This is a wrapper of struct amd_iommu_ir_data. */ @@ -272,8 +145,6 @@ struct amd_svm_iommu_ir { static DEFINE_PER_CPU(u64, current_tsc_ratio); #define TSC_RATIO_DEFAULT 0x0100000000ULL -#define MSR_INVALID 0xffffffffU - static const struct svm_direct_access_msrs { u32 index; /* Index of the MSR */ bool always; /* True if intercept is always on */ @@ -299,9 +170,9 @@ static const struct svm_direct_access_msrs { /* enable NPT for AMD64 and X86 with PAE */ #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) -static bool npt_enabled = true; +bool npt_enabled = true; #else -static bool npt_enabled; +bool npt_enabled; #endif /* @@ -387,41 +258,10 @@ module_param(dump_invalid_vmcb, bool, 0644); static u8 rsm_ins_bytes[] = "\x0f\xaa"; -static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0); -static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa); static void svm_complete_interrupts(struct vcpu_svm *svm); static void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate); static inline void avic_post_state_restore(struct kvm_vcpu *vcpu); -static int nested_svm_exit_handled(struct vcpu_svm *svm); -static int nested_svm_intercept(struct vcpu_svm *svm); -static int nested_svm_vmexit(struct vcpu_svm *svm); -static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, - bool has_error_code, u32 error_code); - -enum { - VMCB_INTERCEPTS, /* Intercept vectors, TSC offset, - pause filter count */ - VMCB_PERM_MAP, /* IOPM Base and MSRPM Base */ - VMCB_ASID, /* ASID */ - VMCB_INTR, /* int_ctl, int_vector */ - VMCB_NPT, /* npt_en, nCR3, gPAT */ - VMCB_CR, /* CR0, CR3, CR4, EFER */ - VMCB_DR, /* DR6, DR7 */ - VMCB_DT, /* GDT, IDT */ - VMCB_SEG, /* CS, DS, SS, ES, CPL */ - VMCB_CR2, /* CR2 only */ - VMCB_LBR, /* DBGCTL, BR_FROM, BR_TO, LAST_EX_FROM, LAST_EX_TO */ - VMCB_AVIC, /* AVIC APIC_BAR, AVIC APIC_BACKING_PAGE, - * AVIC PHYSICAL_TABLE pointer, - * AVIC LOGICAL_TABLE pointer - */ - VMCB_DIRTY_MAX, -}; - -/* TPR and CR2 are always written before VMRUN */ -#define VMCB_ALWAYS_DIRTY_MASK ((1U << VMCB_INTR) | (1U << VMCB_CR2)) - #define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL static int sev_flush_asids(void); @@ -470,27 +310,6 @@ static inline int sev_get_asid(struct kvm *kvm) return sev->asid; } -static inline void mark_all_dirty(struct vmcb *vmcb) -{ - vmcb->control.clean = 0; -} - -static inline void mark_all_clean(struct vmcb *vmcb) -{ - vmcb->control.clean = ((1 << VMCB_DIRTY_MAX) - 1) - & ~VMCB_ALWAYS_DIRTY_MASK; -} - -static inline void mark_dirty(struct vmcb *vmcb, int bit) -{ - vmcb->control.clean &= ~(1 << bit); -} - -static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu) -{ - return container_of(vcpu, struct vcpu_svm, vcpu); -} - static inline void avic_update_vapic_bar(struct vcpu_svm *svm, u64 data) { svm->vmcb->control.avic_vapic_bar = data & VMCB_AVIC_APIC_BAR_MASK; @@ -508,183 +327,6 @@ static inline bool avic_vcpu_is_running(struct kvm_vcpu *vcpu) return (READ_ONCE(*entry) & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK); } -static void recalc_intercepts(struct vcpu_svm *svm) -{ - struct vmcb_control_area *c, *h; - struct nested_state *g; - - mark_dirty(svm->vmcb, VMCB_INTERCEPTS); - - if (!is_guest_mode(&svm->vcpu)) - return; - - c = &svm->vmcb->control; - h = &svm->nested.hsave->control; - g = &svm->nested; - - c->intercept_cr = h->intercept_cr; - c->intercept_dr = h->intercept_dr; - c->intercept_exceptions = h->intercept_exceptions; - c->intercept = h->intercept; - - if (svm->vcpu.arch.hflags & HF_VINTR_MASK) { - /* We only want the cr8 intercept bits of L1 */ - c->intercept_cr &= ~(1U << INTERCEPT_CR8_READ); - c->intercept_cr &= ~(1U << INTERCEPT_CR8_WRITE); - - /* - * Once running L2 with HF_VINTR_MASK, EFLAGS.IF does not - * affect any interrupt we may want to inject; therefore, - * interrupt window vmexits are irrelevant to L0. - */ - c->intercept &= ~(1ULL << INTERCEPT_VINTR); - } - - /* We don't want to see VMMCALLs from a nested guest */ - c->intercept &= ~(1ULL << INTERCEPT_VMMCALL); - - c->intercept_cr |= g->intercept_cr; - c->intercept_dr |= g->intercept_dr; - c->intercept_exceptions |= g->intercept_exceptions; - c->intercept |= g->intercept; -} - -static inline struct vmcb *get_host_vmcb(struct vcpu_svm *svm) -{ - if (is_guest_mode(&svm->vcpu)) - return svm->nested.hsave; - else - return svm->vmcb; -} - -static inline void set_cr_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept_cr |= (1U << bit); - - recalc_intercepts(svm); -} - -static inline void clr_cr_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept_cr &= ~(1U << bit); - - recalc_intercepts(svm); -} - -static inline bool is_cr_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - return vmcb->control.intercept_cr & (1U << bit); -} - -static inline void set_dr_intercepts(struct vcpu_svm *svm) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept_dr = (1 << INTERCEPT_DR0_READ) - | (1 << INTERCEPT_DR1_READ) - | (1 << INTERCEPT_DR2_READ) - | (1 << INTERCEPT_DR3_READ) - | (1 << INTERCEPT_DR4_READ) - | (1 << INTERCEPT_DR5_READ) - | (1 << INTERCEPT_DR6_READ) - | (1 << INTERCEPT_DR7_READ) - | (1 << INTERCEPT_DR0_WRITE) - | (1 << INTERCEPT_DR1_WRITE) - | (1 << INTERCEPT_DR2_WRITE) - | (1 << INTERCEPT_DR3_WRITE) - | (1 << INTERCEPT_DR4_WRITE) - | (1 << INTERCEPT_DR5_WRITE) - | (1 << INTERCEPT_DR6_WRITE) - | (1 << INTERCEPT_DR7_WRITE); - - recalc_intercepts(svm); -} - -static inline void clr_dr_intercepts(struct vcpu_svm *svm) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept_dr = 0; - - recalc_intercepts(svm); -} - -static inline void set_exception_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept_exceptions |= (1U << bit); - - recalc_intercepts(svm); -} - -static inline void clr_exception_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept_exceptions &= ~(1U << bit); - - recalc_intercepts(svm); -} - -static inline void set_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept |= (1ULL << bit); - - recalc_intercepts(svm); -} - -static inline void clr_intercept(struct vcpu_svm *svm, int bit) -{ - struct vmcb *vmcb = get_host_vmcb(svm); - - vmcb->control.intercept &= ~(1ULL << bit); - - recalc_intercepts(svm); -} - -static inline bool is_intercept(struct vcpu_svm *svm, int bit) -{ - return (svm->vmcb->control.intercept & (1ULL << bit)) != 0; -} - -static inline bool vgif_enabled(struct vcpu_svm *svm) -{ - return !!(svm->vmcb->control.int_ctl & V_GIF_ENABLE_MASK); -} - -static inline void enable_gif(struct vcpu_svm *svm) -{ - if (vgif_enabled(svm)) - svm->vmcb->control.int_ctl |= V_GIF_MASK; - else - svm->vcpu.arch.hflags |= HF_GIF_MASK; -} - -static inline void disable_gif(struct vcpu_svm *svm) -{ - if (vgif_enabled(svm)) - svm->vmcb->control.int_ctl &= ~V_GIF_MASK; - else - svm->vcpu.arch.hflags &= ~HF_GIF_MASK; -} - -static inline bool gif_set(struct vcpu_svm *svm) -{ - if (vgif_enabled(svm)) - return !!(svm->vmcb->control.int_ctl & V_GIF_MASK); - else - return !!(svm->vcpu.arch.hflags & HF_GIF_MASK); -} - static unsigned long iopm_base; struct kvm_ldttss_desc { @@ -720,7 +362,7 @@ static const u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000}; #define MSRS_RANGE_SIZE 2048 #define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2) -static u32 svm_msrpm_offset(u32 msr) +u32 svm_msrpm_offset(u32 msr) { u32 offset; int i; @@ -767,7 +409,7 @@ static int get_npt_level(struct kvm_vcpu *vcpu) #endif } -static void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer) +void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer) { vcpu->arch.efer = efer; @@ -1198,7 +840,7 @@ static void svm_disable_lbrv(struct vcpu_svm *svm) set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 0, 0); } -static void disable_nmi_singlestep(struct vcpu_svm *svm) +void disable_nmi_singlestep(struct vcpu_svm *svm) { svm->nmi_singlestep = false; @@ -2652,7 +2294,7 @@ static void update_cr0_intercept(struct vcpu_svm *svm) } } -static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) +void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) { struct vcpu_svm *svm = to_svm(vcpu); @@ -2686,7 +2328,7 @@ static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) update_cr0_intercept(svm); } -static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) +int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) { unsigned long host_cr4_mce = cr4_read_shadow() & X86_CR4_MCE; unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4; @@ -3022,776 +2664,6 @@ static int vmmcall_interception(struct vcpu_svm *svm) return kvm_emulate_hypercall(&svm->vcpu); } -static unsigned long nested_svm_get_tdp_cr3(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - return svm->nested.nested_cr3; -} - -static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index) -{ - struct vcpu_svm *svm = to_svm(vcpu); - u64 cr3 = svm->nested.nested_cr3; - u64 pdpte; - int ret; - - ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(__sme_clr(cr3)), &pdpte, - offset_in_page(cr3) + index * 8, 8); - if (ret) - return 0; - return pdpte; -} - -static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu, - struct x86_exception *fault) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - if (svm->vmcb->control.exit_code != SVM_EXIT_NPF) { - /* - * TODO: track the cause of the nested page fault, and - * correctly fill in the high bits of exit_info_1. - */ - svm->vmcb->control.exit_code = SVM_EXIT_NPF; - svm->vmcb->control.exit_code_hi = 0; - svm->vmcb->control.exit_info_1 = (1ULL << 32); - svm->vmcb->control.exit_info_2 = fault->address; - } - - svm->vmcb->control.exit_info_1 &= ~0xffffffffULL; - svm->vmcb->control.exit_info_1 |= fault->error_code; - - /* - * The present bit is always zero for page structure faults on real - * hardware. - */ - if (svm->vmcb->control.exit_info_1 & (2ULL << 32)) - svm->vmcb->control.exit_info_1 &= ~1; - - nested_svm_vmexit(svm); -} - -static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu) -{ - WARN_ON(mmu_is_nested(vcpu)); - - vcpu->arch.mmu = &vcpu->arch.guest_mmu; - kvm_init_shadow_mmu(vcpu); - vcpu->arch.mmu->get_guest_pgd = nested_svm_get_tdp_cr3; - vcpu->arch.mmu->get_pdptr = nested_svm_get_tdp_pdptr; - vcpu->arch.mmu->inject_page_fault = nested_svm_inject_npf_exit; - vcpu->arch.mmu->shadow_root_level = get_npt_level(vcpu); - reset_shadow_zero_bits_mask(vcpu, vcpu->arch.mmu); - vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu; -} - -static void nested_svm_uninit_mmu_context(struct kvm_vcpu *vcpu) -{ - vcpu->arch.mmu = &vcpu->arch.root_mmu; - vcpu->arch.walk_mmu = &vcpu->arch.root_mmu; -} - -static int nested_svm_check_permissions(struct vcpu_svm *svm) -{ - if (!(svm->vcpu.arch.efer & EFER_SVME) || - !is_paging(&svm->vcpu)) { - kvm_queue_exception(&svm->vcpu, UD_VECTOR); - return 1; - } - - if (svm->vmcb->save.cpl) { - kvm_inject_gp(&svm->vcpu, 0); - return 1; - } - - return 0; -} - -static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, - bool has_error_code, u32 error_code) -{ - int vmexit; - - if (!is_guest_mode(&svm->vcpu)) - return 0; - - vmexit = nested_svm_intercept(svm); - if (vmexit != NESTED_EXIT_DONE) - return 0; - - svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr; - svm->vmcb->control.exit_code_hi = 0; - svm->vmcb->control.exit_info_1 = error_code; - - /* - * EXITINFO2 is undefined for all exception intercepts other - * than #PF. - */ - if (svm->vcpu.arch.exception.nested_apf) - svm->vmcb->control.exit_info_2 = svm->vcpu.arch.apf.nested_apf_token; - else if (svm->vcpu.arch.exception.has_payload) - svm->vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload; - else - svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2; - - svm->nested.exit_required = true; - return vmexit; -} - -static void nested_svm_intr(struct vcpu_svm *svm) -{ - svm->vmcb->control.exit_code = SVM_EXIT_INTR; - svm->vmcb->control.exit_info_1 = 0; - svm->vmcb->control.exit_info_2 = 0; - - /* nested_svm_vmexit this gets called afterwards from handle_exit */ - svm->nested.exit_required = true; - trace_kvm_nested_intr_vmexit(svm->vmcb->save.rip); -} - -static bool nested_exit_on_intr(struct vcpu_svm *svm) -{ - return (svm->nested.intercept & 1ULL); -} - -static int svm_check_nested_events(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - bool block_nested_events = - kvm_event_needs_reinjection(vcpu) || svm->nested.exit_required; - - if (kvm_cpu_has_interrupt(vcpu) && nested_exit_on_intr(svm)) { - if (block_nested_events) - return -EBUSY; - nested_svm_intr(svm); - return 0; - } - - return 0; -} - -/* This function returns true if it is save to enable the nmi window */ -static inline bool nested_svm_nmi(struct vcpu_svm *svm) -{ - if (!is_guest_mode(&svm->vcpu)) - return true; - - if (!(svm->nested.intercept & (1ULL << INTERCEPT_NMI))) - return true; - - svm->vmcb->control.exit_code = SVM_EXIT_NMI; - svm->nested.exit_required = true; - - return false; -} - -static int nested_svm_intercept_ioio(struct vcpu_svm *svm) -{ - unsigned port, size, iopm_len; - u16 val, mask; - u8 start_bit; - u64 gpa; - - if (!(svm->nested.intercept & (1ULL << INTERCEPT_IOIO_PROT))) - return NESTED_EXIT_HOST; - - port = svm->vmcb->control.exit_info_1 >> 16; - size = (svm->vmcb->control.exit_info_1 & SVM_IOIO_SIZE_MASK) >> - SVM_IOIO_SIZE_SHIFT; - gpa = svm->nested.vmcb_iopm + (port / 8); - start_bit = port % 8; - iopm_len = (start_bit + size > 8) ? 2 : 1; - mask = (0xf >> (4 - size)) << start_bit; - val = 0; - - if (kvm_vcpu_read_guest(&svm->vcpu, gpa, &val, iopm_len)) - return NESTED_EXIT_DONE; - - return (val & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST; -} - -static int nested_svm_exit_handled_msr(struct vcpu_svm *svm) -{ - u32 offset, msr, value; - int write, mask; - - if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT))) - return NESTED_EXIT_HOST; - - msr = svm->vcpu.arch.regs[VCPU_REGS_RCX]; - offset = svm_msrpm_offset(msr); - write = svm->vmcb->control.exit_info_1 & 1; - mask = 1 << ((2 * (msr & 0xf)) + write); - - if (offset == MSR_INVALID) - return NESTED_EXIT_DONE; - - /* Offset is in 32 bit units but need in 8 bit units */ - offset *= 4; - - if (kvm_vcpu_read_guest(&svm->vcpu, svm->nested.vmcb_msrpm + offset, &value, 4)) - return NESTED_EXIT_DONE; - - return (value & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST; -} - -/* DB exceptions for our internal use must not cause vmexit */ -static int nested_svm_intercept_db(struct vcpu_svm *svm) -{ - unsigned long dr6; - - /* if we're not singlestepping, it's not ours */ - if (!svm->nmi_singlestep) - return NESTED_EXIT_DONE; - - /* if it's not a singlestep exception, it's not ours */ - if (kvm_get_dr(&svm->vcpu, 6, &dr6)) - return NESTED_EXIT_DONE; - if (!(dr6 & DR6_BS)) - return NESTED_EXIT_DONE; - - /* if the guest is singlestepping, it should get the vmexit */ - if (svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF) { - disable_nmi_singlestep(svm); - return NESTED_EXIT_DONE; - } - - /* it's ours, the nested hypervisor must not see this one */ - return NESTED_EXIT_HOST; -} - -static int nested_svm_exit_special(struct vcpu_svm *svm) -{ - u32 exit_code = svm->vmcb->control.exit_code; - - switch (exit_code) { - case SVM_EXIT_INTR: - case SVM_EXIT_NMI: - case SVM_EXIT_EXCP_BASE + MC_VECTOR: - return NESTED_EXIT_HOST; - case SVM_EXIT_NPF: - /* For now we are always handling NPFs when using them */ - if (npt_enabled) - return NESTED_EXIT_HOST; - break; - case SVM_EXIT_EXCP_BASE + PF_VECTOR: - /* When we're shadowing, trap PFs, but not async PF */ - if (!npt_enabled && svm->vcpu.arch.apf.host_apf_reason == 0) - return NESTED_EXIT_HOST; - break; - default: - break; - } - - return NESTED_EXIT_CONTINUE; -} - -static int nested_svm_intercept(struct vcpu_svm *svm) -{ - u32 exit_code = svm->vmcb->control.exit_code; - int vmexit = NESTED_EXIT_HOST; - - switch (exit_code) { - case SVM_EXIT_MSR: - vmexit = nested_svm_exit_handled_msr(svm); - break; - case SVM_EXIT_IOIO: - vmexit = nested_svm_intercept_ioio(svm); - break; - case SVM_EXIT_READ_CR0 ... SVM_EXIT_WRITE_CR8: { - u32 bit = 1U << (exit_code - SVM_EXIT_READ_CR0); - if (svm->nested.intercept_cr & bit) - vmexit = NESTED_EXIT_DONE; - break; - } - case SVM_EXIT_READ_DR0 ... SVM_EXIT_WRITE_DR7: { - u32 bit = 1U << (exit_code - SVM_EXIT_READ_DR0); - if (svm->nested.intercept_dr & bit) - vmexit = NESTED_EXIT_DONE; - break; - } - case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: { - u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE); - if (svm->nested.intercept_exceptions & excp_bits) { - if (exit_code == SVM_EXIT_EXCP_BASE + DB_VECTOR) - vmexit = nested_svm_intercept_db(svm); - else - vmexit = NESTED_EXIT_DONE; - } - /* async page fault always cause vmexit */ - else if ((exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) && - svm->vcpu.arch.exception.nested_apf != 0) - vmexit = NESTED_EXIT_DONE; - break; - } - case SVM_EXIT_ERR: { - vmexit = NESTED_EXIT_DONE; - break; - } - default: { - u64 exit_bits = 1ULL << (exit_code - SVM_EXIT_INTR); - if (svm->nested.intercept & exit_bits) - vmexit = NESTED_EXIT_DONE; - } - } - - return vmexit; -} - -static int nested_svm_exit_handled(struct vcpu_svm *svm) -{ - int vmexit; - - vmexit = nested_svm_intercept(svm); - - if (vmexit == NESTED_EXIT_DONE) - nested_svm_vmexit(svm); - - return vmexit; -} - -static inline void copy_vmcb_control_area(struct vmcb *dst_vmcb, struct vmcb *from_vmcb) -{ - struct vmcb_control_area *dst = &dst_vmcb->control; - struct vmcb_control_area *from = &from_vmcb->control; - - dst->intercept_cr = from->intercept_cr; - dst->intercept_dr = from->intercept_dr; - dst->intercept_exceptions = from->intercept_exceptions; - dst->intercept = from->intercept; - dst->iopm_base_pa = from->iopm_base_pa; - dst->msrpm_base_pa = from->msrpm_base_pa; - dst->tsc_offset = from->tsc_offset; - dst->asid = from->asid; - dst->tlb_ctl = from->tlb_ctl; - dst->int_ctl = from->int_ctl; - dst->int_vector = from->int_vector; - dst->int_state = from->int_state; - dst->exit_code = from->exit_code; - dst->exit_code_hi = from->exit_code_hi; - dst->exit_info_1 = from->exit_info_1; - dst->exit_info_2 = from->exit_info_2; - dst->exit_int_info = from->exit_int_info; - dst->exit_int_info_err = from->exit_int_info_err; - dst->nested_ctl = from->nested_ctl; - dst->event_inj = from->event_inj; - dst->event_inj_err = from->event_inj_err; - dst->nested_cr3 = from->nested_cr3; - dst->virt_ext = from->virt_ext; - dst->pause_filter_count = from->pause_filter_count; - dst->pause_filter_thresh = from->pause_filter_thresh; -} - -static int nested_svm_vmexit(struct vcpu_svm *svm) -{ - int rc; - struct vmcb *nested_vmcb; - struct vmcb *hsave = svm->nested.hsave; - struct vmcb *vmcb = svm->vmcb; - struct kvm_host_map map; - - trace_kvm_nested_vmexit_inject(vmcb->control.exit_code, - vmcb->control.exit_info_1, - vmcb->control.exit_info_2, - vmcb->control.exit_int_info, - vmcb->control.exit_int_info_err, - KVM_ISA_SVM); - - rc = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->nested.vmcb), &map); - if (rc) { - if (rc == -EINVAL) - kvm_inject_gp(&svm->vcpu, 0); - return 1; - } - - nested_vmcb = map.hva; - - /* Exit Guest-Mode */ - leave_guest_mode(&svm->vcpu); - svm->nested.vmcb = 0; - - /* Give the current vmcb to the guest */ - disable_gif(svm); - - nested_vmcb->save.es = vmcb->save.es; - nested_vmcb->save.cs = vmcb->save.cs; - nested_vmcb->save.ss = vmcb->save.ss; - nested_vmcb->save.ds = vmcb->save.ds; - nested_vmcb->save.gdtr = vmcb->save.gdtr; - nested_vmcb->save.idtr = vmcb->save.idtr; - nested_vmcb->save.efer = svm->vcpu.arch.efer; - nested_vmcb->save.cr0 = kvm_read_cr0(&svm->vcpu); - nested_vmcb->save.cr3 = kvm_read_cr3(&svm->vcpu); - nested_vmcb->save.cr2 = vmcb->save.cr2; - nested_vmcb->save.cr4 = svm->vcpu.arch.cr4; - nested_vmcb->save.rflags = kvm_get_rflags(&svm->vcpu); - nested_vmcb->save.rip = vmcb->save.rip; - nested_vmcb->save.rsp = vmcb->save.rsp; - nested_vmcb->save.rax = vmcb->save.rax; - nested_vmcb->save.dr7 = vmcb->save.dr7; - nested_vmcb->save.dr6 = vmcb->save.dr6; - nested_vmcb->save.cpl = vmcb->save.cpl; - - nested_vmcb->control.int_ctl = vmcb->control.int_ctl; - nested_vmcb->control.int_vector = vmcb->control.int_vector; - nested_vmcb->control.int_state = vmcb->control.int_state; - nested_vmcb->control.exit_code = vmcb->control.exit_code; - nested_vmcb->control.exit_code_hi = vmcb->control.exit_code_hi; - nested_vmcb->control.exit_info_1 = vmcb->control.exit_info_1; - nested_vmcb->control.exit_info_2 = vmcb->control.exit_info_2; - nested_vmcb->control.exit_int_info = vmcb->control.exit_int_info; - nested_vmcb->control.exit_int_info_err = vmcb->control.exit_int_info_err; - - if (svm->nrips_enabled) - nested_vmcb->control.next_rip = vmcb->control.next_rip; - - /* - * If we emulate a VMRUN/#VMEXIT in the same host #vmexit cycle we have - * to make sure that we do not lose injected events. So check event_inj - * here and copy it to exit_int_info if it is valid. - * Exit_int_info and event_inj can't be both valid because the case - * below only happens on a VMRUN instruction intercept which has - * no valid exit_int_info set. - */ - if (vmcb->control.event_inj & SVM_EVTINJ_VALID) { - struct vmcb_control_area *nc = &nested_vmcb->control; - - nc->exit_int_info = vmcb->control.event_inj; - nc->exit_int_info_err = vmcb->control.event_inj_err; - } - - nested_vmcb->control.tlb_ctl = 0; - nested_vmcb->control.event_inj = 0; - nested_vmcb->control.event_inj_err = 0; - - nested_vmcb->control.pause_filter_count = - svm->vmcb->control.pause_filter_count; - nested_vmcb->control.pause_filter_thresh = - svm->vmcb->control.pause_filter_thresh; - - /* We always set V_INTR_MASKING and remember the old value in hflags */ - if (!(svm->vcpu.arch.hflags & HF_VINTR_MASK)) - nested_vmcb->control.int_ctl &= ~V_INTR_MASKING_MASK; - - /* Restore the original control entries */ - copy_vmcb_control_area(vmcb, hsave); - - svm->vcpu.arch.tsc_offset = svm->vmcb->control.tsc_offset; - kvm_clear_exception_queue(&svm->vcpu); - kvm_clear_interrupt_queue(&svm->vcpu); - - svm->nested.nested_cr3 = 0; - - /* Restore selected save entries */ - svm->vmcb->save.es = hsave->save.es; - svm->vmcb->save.cs = hsave->save.cs; - svm->vmcb->save.ss = hsave->save.ss; - svm->vmcb->save.ds = hsave->save.ds; - svm->vmcb->save.gdtr = hsave->save.gdtr; - svm->vmcb->save.idtr = hsave->save.idtr; - kvm_set_rflags(&svm->vcpu, hsave->save.rflags); - svm_set_efer(&svm->vcpu, hsave->save.efer); - svm_set_cr0(&svm->vcpu, hsave->save.cr0 | X86_CR0_PE); - svm_set_cr4(&svm->vcpu, hsave->save.cr4); - if (npt_enabled) { - svm->vmcb->save.cr3 = hsave->save.cr3; - svm->vcpu.arch.cr3 = hsave->save.cr3; - } else { - (void)kvm_set_cr3(&svm->vcpu, hsave->save.cr3); - } - kvm_rax_write(&svm->vcpu, hsave->save.rax); - kvm_rsp_write(&svm->vcpu, hsave->save.rsp); - kvm_rip_write(&svm->vcpu, hsave->save.rip); - svm->vmcb->save.dr7 = 0; - svm->vmcb->save.cpl = 0; - svm->vmcb->control.exit_int_info = 0; - - mark_all_dirty(svm->vmcb); - - kvm_vcpu_unmap(&svm->vcpu, &map, true); - - nested_svm_uninit_mmu_context(&svm->vcpu); - kvm_mmu_reset_context(&svm->vcpu); - kvm_mmu_load(&svm->vcpu); - - /* - * Drop what we picked up for L2 via svm_complete_interrupts() so it - * doesn't end up in L1. - */ - svm->vcpu.arch.nmi_injected = false; - kvm_clear_exception_queue(&svm->vcpu); - kvm_clear_interrupt_queue(&svm->vcpu); - - return 0; -} - -static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm) -{ - /* - * This function merges the msr permission bitmaps of kvm and the - * nested vmcb. It is optimized in that it only merges the parts where - * the kvm msr permission bitmap may contain zero bits - */ - int i; - - if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT))) - return true; - - for (i = 0; i < MSRPM_OFFSETS; i++) { - u32 value, p; - u64 offset; - - if (msrpm_offsets[i] == 0xffffffff) - break; - - p = msrpm_offsets[i]; - offset = svm->nested.vmcb_msrpm + (p * 4); - - if (kvm_vcpu_read_guest(&svm->vcpu, offset, &value, 4)) - return false; - - svm->nested.msrpm[p] = svm->msrpm[p] | value; - } - - svm->vmcb->control.msrpm_base_pa = __sme_set(__pa(svm->nested.msrpm)); - - return true; -} - -static bool nested_vmcb_checks(struct vmcb *vmcb) -{ - if ((vmcb->save.efer & EFER_SVME) == 0) - return false; - - if ((vmcb->control.intercept & (1ULL << INTERCEPT_VMRUN)) == 0) - return false; - - if (vmcb->control.asid == 0) - return false; - - if ((vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && - !npt_enabled) - return false; - - return true; -} - -static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, - struct vmcb *nested_vmcb, struct kvm_host_map *map) -{ - bool evaluate_pending_interrupts = - is_intercept(svm, INTERCEPT_VINTR) || - is_intercept(svm, INTERCEPT_IRET); - - if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF) - svm->vcpu.arch.hflags |= HF_HIF_MASK; - else - svm->vcpu.arch.hflags &= ~HF_HIF_MASK; - - if (nested_vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE) { - svm->nested.nested_cr3 = nested_vmcb->control.nested_cr3; - nested_svm_init_mmu_context(&svm->vcpu); - } - - /* Load the nested guest state */ - svm->vmcb->save.es = nested_vmcb->save.es; - svm->vmcb->save.cs = nested_vmcb->save.cs; - svm->vmcb->save.ss = nested_vmcb->save.ss; - svm->vmcb->save.ds = nested_vmcb->save.ds; - svm->vmcb->save.gdtr = nested_vmcb->save.gdtr; - svm->vmcb->save.idtr = nested_vmcb->save.idtr; - kvm_set_rflags(&svm->vcpu, nested_vmcb->save.rflags); - svm_set_efer(&svm->vcpu, nested_vmcb->save.efer); - svm_set_cr0(&svm->vcpu, nested_vmcb->save.cr0); - svm_set_cr4(&svm->vcpu, nested_vmcb->save.cr4); - if (npt_enabled) { - svm->vmcb->save.cr3 = nested_vmcb->save.cr3; - svm->vcpu.arch.cr3 = nested_vmcb->save.cr3; - } else - (void)kvm_set_cr3(&svm->vcpu, nested_vmcb->save.cr3); - - /* Guest paging mode is active - reset mmu */ - kvm_mmu_reset_context(&svm->vcpu); - - svm->vmcb->save.cr2 = svm->vcpu.arch.cr2 = nested_vmcb->save.cr2; - kvm_rax_write(&svm->vcpu, nested_vmcb->save.rax); - kvm_rsp_write(&svm->vcpu, nested_vmcb->save.rsp); - kvm_rip_write(&svm->vcpu, nested_vmcb->save.rip); - - /* In case we don't even reach vcpu_run, the fields are not updated */ - svm->vmcb->save.rax = nested_vmcb->save.rax; - svm->vmcb->save.rsp = nested_vmcb->save.rsp; - svm->vmcb->save.rip = nested_vmcb->save.rip; - svm->vmcb->save.dr7 = nested_vmcb->save.dr7; - svm->vmcb->save.dr6 = nested_vmcb->save.dr6; - svm->vmcb->save.cpl = nested_vmcb->save.cpl; - - svm->nested.vmcb_msrpm = nested_vmcb->control.msrpm_base_pa & ~0x0fffULL; - svm->nested.vmcb_iopm = nested_vmcb->control.iopm_base_pa & ~0x0fffULL; - - /* cache intercepts */ - svm->nested.intercept_cr = nested_vmcb->control.intercept_cr; - svm->nested.intercept_dr = nested_vmcb->control.intercept_dr; - svm->nested.intercept_exceptions = nested_vmcb->control.intercept_exceptions; - svm->nested.intercept = nested_vmcb->control.intercept; - - svm_flush_tlb(&svm->vcpu, true); - svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK; - if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK) - svm->vcpu.arch.hflags |= HF_VINTR_MASK; - else - svm->vcpu.arch.hflags &= ~HF_VINTR_MASK; - - svm->vcpu.arch.tsc_offset += nested_vmcb->control.tsc_offset; - svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset; - - svm->vmcb->control.virt_ext = nested_vmcb->control.virt_ext; - svm->vmcb->control.int_vector = nested_vmcb->control.int_vector; - svm->vmcb->control.int_state = nested_vmcb->control.int_state; - svm->vmcb->control.event_inj = nested_vmcb->control.event_inj; - svm->vmcb->control.event_inj_err = nested_vmcb->control.event_inj_err; - - svm->vmcb->control.pause_filter_count = - nested_vmcb->control.pause_filter_count; - svm->vmcb->control.pause_filter_thresh = - nested_vmcb->control.pause_filter_thresh; - - kvm_vcpu_unmap(&svm->vcpu, map, true); - - /* Enter Guest-Mode */ - enter_guest_mode(&svm->vcpu); - - /* - * Merge guest and host intercepts - must be called with vcpu in - * guest-mode to take affect here - */ - recalc_intercepts(svm); - - svm->nested.vmcb = vmcb_gpa; - - /* - * If L1 had a pending IRQ/NMI before executing VMRUN, - * which wasn't delivered because it was disallowed (e.g. - * interrupts disabled), L0 needs to evaluate if this pending - * event should cause an exit from L2 to L1 or be delivered - * directly to L2. - * - * Usually this would be handled by the processor noticing an - * IRQ/NMI window request. However, VMRUN can unblock interrupts - * by implicitly setting GIF, so force L0 to perform pending event - * evaluation by requesting a KVM_REQ_EVENT. - */ - enable_gif(svm); - if (unlikely(evaluate_pending_interrupts)) - kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); - - mark_all_dirty(svm->vmcb); -} - -static int nested_svm_vmrun(struct vcpu_svm *svm) -{ - int ret; - struct vmcb *nested_vmcb; - struct vmcb *hsave = svm->nested.hsave; - struct vmcb *vmcb = svm->vmcb; - struct kvm_host_map map; - u64 vmcb_gpa; - - vmcb_gpa = svm->vmcb->save.rax; - - ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb_gpa), &map); - if (ret == -EINVAL) { - kvm_inject_gp(&svm->vcpu, 0); - return 1; - } else if (ret) { - return kvm_skip_emulated_instruction(&svm->vcpu); - } - - ret = kvm_skip_emulated_instruction(&svm->vcpu); - - nested_vmcb = map.hva; - - if (!nested_vmcb_checks(nested_vmcb)) { - nested_vmcb->control.exit_code = SVM_EXIT_ERR; - nested_vmcb->control.exit_code_hi = 0; - nested_vmcb->control.exit_info_1 = 0; - nested_vmcb->control.exit_info_2 = 0; - - kvm_vcpu_unmap(&svm->vcpu, &map, true); - - return ret; - } - - trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb_gpa, - nested_vmcb->save.rip, - nested_vmcb->control.int_ctl, - nested_vmcb->control.event_inj, - nested_vmcb->control.nested_ctl); - - trace_kvm_nested_intercepts(nested_vmcb->control.intercept_cr & 0xffff, - nested_vmcb->control.intercept_cr >> 16, - nested_vmcb->control.intercept_exceptions, - nested_vmcb->control.intercept); - - /* Clear internal status */ - kvm_clear_exception_queue(&svm->vcpu); - kvm_clear_interrupt_queue(&svm->vcpu); - - /* - * Save the old vmcb, so we don't need to pick what we save, but can - * restore everything when a VMEXIT occurs - */ - hsave->save.es = vmcb->save.es; - hsave->save.cs = vmcb->save.cs; - hsave->save.ss = vmcb->save.ss; - hsave->save.ds = vmcb->save.ds; - hsave->save.gdtr = vmcb->save.gdtr; - hsave->save.idtr = vmcb->save.idtr; - hsave->save.efer = svm->vcpu.arch.efer; - hsave->save.cr0 = kvm_read_cr0(&svm->vcpu); - hsave->save.cr4 = svm->vcpu.arch.cr4; - hsave->save.rflags = kvm_get_rflags(&svm->vcpu); - hsave->save.rip = kvm_rip_read(&svm->vcpu); - hsave->save.rsp = vmcb->save.rsp; - hsave->save.rax = vmcb->save.rax; - if (npt_enabled) - hsave->save.cr3 = vmcb->save.cr3; - else - hsave->save.cr3 = kvm_read_cr3(&svm->vcpu); - - copy_vmcb_control_area(hsave, vmcb); - - enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb, &map); - - if (!nested_svm_vmrun_msrpm(svm)) { - svm->vmcb->control.exit_code = SVM_EXIT_ERR; - svm->vmcb->control.exit_code_hi = 0; - svm->vmcb->control.exit_info_1 = 0; - svm->vmcb->control.exit_info_2 = 0; - - nested_svm_vmexit(svm); - } - - return ret; -} - -static void nested_svm_vmloadsave(struct vmcb *from_vmcb, struct vmcb *to_vmcb) -{ - to_vmcb->save.fs = from_vmcb->save.fs; - to_vmcb->save.gs = from_vmcb->save.gs; - to_vmcb->save.tr = from_vmcb->save.tr; - to_vmcb->save.ldtr = from_vmcb->save.ldtr; - to_vmcb->save.kernel_gs_base = from_vmcb->save.kernel_gs_base; - to_vmcb->save.star = from_vmcb->save.star; - to_vmcb->save.lstar = from_vmcb->save.lstar; - to_vmcb->save.cstar = from_vmcb->save.cstar; - to_vmcb->save.sfmask = from_vmcb->save.sfmask; - to_vmcb->save.sysenter_cs = from_vmcb->save.sysenter_cs; - to_vmcb->save.sysenter_esp = from_vmcb->save.sysenter_esp; - to_vmcb->save.sysenter_eip = from_vmcb->save.sysenter_eip; -} - static int vmload_interception(struct vcpu_svm *svm) { struct vmcb *nested_vmcb; @@ -5186,11 +4058,6 @@ static void svm_set_irq(struct kvm_vcpu *vcpu) SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR; } -static inline bool svm_nested_virtualize_tpr(struct kvm_vcpu *vcpu) -{ - return is_guest_mode(vcpu) && (vcpu->arch.hflags & HF_VINTR_MASK); -} - static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) { struct vcpu_svm *svm = to_svm(vcpu); @@ -5632,7 +4499,7 @@ static int svm_set_identity_map_addr(struct kvm *kvm, u64 ident_addr) return 0; } -static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa) +void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa) { struct vcpu_svm *svm = to_svm(vcpu); diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h new file mode 100644 index 000000000000..f4c446d7a31e --- /dev/null +++ b/arch/x86/kvm/svm/svm.h @@ -0,0 +1,381 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Kernel-based Virtual Machine driver for Linux + * + * AMD SVM support + * + * Copyright (C) 2006 Qumranet, Inc. + * Copyright 2010 Red Hat, Inc. and/or its affiliates. + * + * Authors: + * Yaniv Kamay + * Avi Kivity + */ + +#ifndef __SVM_SVM_H +#define __SVM_SVM_H + +#include +#include + +#include + +static const u32 host_save_user_msrs[] = { +#ifdef CONFIG_X86_64 + MSR_STAR, MSR_LSTAR, MSR_CSTAR, MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE, + MSR_FS_BASE, +#endif + MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, + MSR_TSC_AUX, +}; + +#define NR_HOST_SAVE_USER_MSRS ARRAY_SIZE(host_save_user_msrs) + +#define MSRPM_OFFSETS 16 +extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly; +extern bool npt_enabled; + +enum { + VMCB_INTERCEPTS, /* Intercept vectors, TSC offset, + pause filter count */ + VMCB_PERM_MAP, /* IOPM Base and MSRPM Base */ + VMCB_ASID, /* ASID */ + VMCB_INTR, /* int_ctl, int_vector */ + VMCB_NPT, /* npt_en, nCR3, gPAT */ + VMCB_CR, /* CR0, CR3, CR4, EFER */ + VMCB_DR, /* DR6, DR7 */ + VMCB_DT, /* GDT, IDT */ + VMCB_SEG, /* CS, DS, SS, ES, CPL */ + VMCB_CR2, /* CR2 only */ + VMCB_LBR, /* DBGCTL, BR_FROM, BR_TO, LAST_EX_FROM, LAST_EX_TO */ + VMCB_AVIC, /* AVIC APIC_BAR, AVIC APIC_BACKING_PAGE, + * AVIC PHYSICAL_TABLE pointer, + * AVIC LOGICAL_TABLE pointer + */ + VMCB_DIRTY_MAX, +}; + +/* TPR and CR2 are always written before VMRUN */ +#define VMCB_ALWAYS_DIRTY_MASK ((1U << VMCB_INTR) | (1U << VMCB_CR2)) + +struct kvm_sev_info { + bool active; /* SEV enabled guest */ + unsigned int asid; /* ASID used for this guest */ + unsigned int handle; /* SEV firmware handle */ + int fd; /* SEV device fd */ + unsigned long pages_locked; /* Number of pages locked */ + struct list_head regions_list; /* List of registered regions */ +}; + +struct kvm_svm { + struct kvm kvm; + + /* Struct members for AVIC */ + u32 avic_vm_id; + struct page *avic_logical_id_table_page; + struct page *avic_physical_id_table_page; + struct hlist_node hnode; + + struct kvm_sev_info sev_info; +}; + +struct kvm_vcpu; + +struct nested_state { + struct vmcb *hsave; + u64 hsave_msr; + u64 vm_cr_msr; + u64 vmcb; + + /* These are the merged vectors */ + u32 *msrpm; + + /* gpa pointers to the real vectors */ + u64 vmcb_msrpm; + u64 vmcb_iopm; + + /* A VMEXIT is required but not yet emulated */ + bool exit_required; + + /* cache for intercepts of the guest */ + u32 intercept_cr; + u32 intercept_dr; + u32 intercept_exceptions; + u64 intercept; + + /* Nested Paging related state */ + u64 nested_cr3; +}; + +struct vcpu_svm { + struct kvm_vcpu vcpu; + struct vmcb *vmcb; + unsigned long vmcb_pa; + struct svm_cpu_data *svm_data; + uint64_t asid_generation; + uint64_t sysenter_esp; + uint64_t sysenter_eip; + uint64_t tsc_aux; + + u64 msr_decfg; + + u64 next_rip; + + u64 host_user_msrs[NR_HOST_SAVE_USER_MSRS]; + struct { + u16 fs; + u16 gs; + u16 ldt; + u64 gs_base; + } host; + + u64 spec_ctrl; + /* + * Contains guest-controlled bits of VIRT_SPEC_CTRL, which will be + * translated into the appropriate L2_CFG bits on the host to + * perform speculative control. + */ + u64 virt_spec_ctrl; + + u32 *msrpm; + + ulong nmi_iret_rip; + + struct nested_state nested; + + bool nmi_singlestep; + u64 nmi_singlestep_guest_rflags; + + unsigned int3_injected; + unsigned long int3_rip; + + /* cached guest cpuid flags for faster access */ + bool nrips_enabled : 1; + + u32 ldr_reg; + u32 dfr_reg; + struct page *avic_backing_page; + u64 *avic_physical_id_cache; + bool avic_is_running; + + /* + * Per-vcpu list of struct amd_svm_iommu_ir: + * This is used mainly to store interrupt remapping information used + * when update the vcpu affinity. This avoids the need to scan for + * IRTE and try to match ga_tag in the IOMMU driver. + */ + struct list_head ir_list; + spinlock_t ir_list_lock; + + /* which host CPU was used for running this vcpu */ + unsigned int last_cpu; +}; + +void recalc_intercepts(struct vcpu_svm *svm); + +static inline void mark_all_dirty(struct vmcb *vmcb) +{ + vmcb->control.clean = 0; +} + +static inline void mark_all_clean(struct vmcb *vmcb) +{ + vmcb->control.clean = ((1 << VMCB_DIRTY_MAX) - 1) + & ~VMCB_ALWAYS_DIRTY_MASK; +} + +static inline void mark_dirty(struct vmcb *vmcb, int bit) +{ + vmcb->control.clean &= ~(1 << bit); +} + +static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu) +{ + return container_of(vcpu, struct vcpu_svm, vcpu); +} + +static inline struct vmcb *get_host_vmcb(struct vcpu_svm *svm) +{ + if (is_guest_mode(&svm->vcpu)) + return svm->nested.hsave; + else + return svm->vmcb; +} + +static inline void set_cr_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept_cr |= (1U << bit); + + recalc_intercepts(svm); +} + +static inline void clr_cr_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept_cr &= ~(1U << bit); + + recalc_intercepts(svm); +} + +static inline bool is_cr_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + return vmcb->control.intercept_cr & (1U << bit); +} + +static inline void set_dr_intercepts(struct vcpu_svm *svm) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept_dr = (1 << INTERCEPT_DR0_READ) + | (1 << INTERCEPT_DR1_READ) + | (1 << INTERCEPT_DR2_READ) + | (1 << INTERCEPT_DR3_READ) + | (1 << INTERCEPT_DR4_READ) + | (1 << INTERCEPT_DR5_READ) + | (1 << INTERCEPT_DR6_READ) + | (1 << INTERCEPT_DR7_READ) + | (1 << INTERCEPT_DR0_WRITE) + | (1 << INTERCEPT_DR1_WRITE) + | (1 << INTERCEPT_DR2_WRITE) + | (1 << INTERCEPT_DR3_WRITE) + | (1 << INTERCEPT_DR4_WRITE) + | (1 << INTERCEPT_DR5_WRITE) + | (1 << INTERCEPT_DR6_WRITE) + | (1 << INTERCEPT_DR7_WRITE); + + recalc_intercepts(svm); +} + +static inline void clr_dr_intercepts(struct vcpu_svm *svm) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept_dr = 0; + + recalc_intercepts(svm); +} + +static inline void set_exception_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept_exceptions |= (1U << bit); + + recalc_intercepts(svm); +} + +static inline void clr_exception_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept_exceptions &= ~(1U << bit); + + recalc_intercepts(svm); +} + +static inline void set_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept |= (1ULL << bit); + + recalc_intercepts(svm); +} + +static inline void clr_intercept(struct vcpu_svm *svm, int bit) +{ + struct vmcb *vmcb = get_host_vmcb(svm); + + vmcb->control.intercept &= ~(1ULL << bit); + + recalc_intercepts(svm); +} + +static inline bool is_intercept(struct vcpu_svm *svm, int bit) +{ + return (svm->vmcb->control.intercept & (1ULL << bit)) != 0; +} + +static inline bool vgif_enabled(struct vcpu_svm *svm) +{ + return !!(svm->vmcb->control.int_ctl & V_GIF_ENABLE_MASK); +} + +static inline void enable_gif(struct vcpu_svm *svm) +{ + if (vgif_enabled(svm)) + svm->vmcb->control.int_ctl |= V_GIF_MASK; + else + svm->vcpu.arch.hflags |= HF_GIF_MASK; +} + +static inline void disable_gif(struct vcpu_svm *svm) +{ + if (vgif_enabled(svm)) + svm->vmcb->control.int_ctl &= ~V_GIF_MASK; + else + svm->vcpu.arch.hflags &= ~HF_GIF_MASK; +} + +static inline bool gif_set(struct vcpu_svm *svm) +{ + if (vgif_enabled(svm)) + return !!(svm->vmcb->control.int_ctl & V_GIF_MASK); + else + return !!(svm->vcpu.arch.hflags & HF_GIF_MASK); +} + +/* svm.c */ +#define MSR_INVALID 0xffffffffU + +u32 svm_msrpm_offset(u32 msr); +void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer); +void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0); +int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4); +void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa); +void disable_nmi_singlestep(struct vcpu_svm *svm); + +/* nested.c */ + +#define NESTED_EXIT_HOST 0 /* Exit handled on host level */ +#define NESTED_EXIT_DONE 1 /* Exit caused nested vmexit */ +#define NESTED_EXIT_CONTINUE 2 /* Further checks needed */ + +/* This function returns true if it is save to enable the nmi window */ +static inline bool nested_svm_nmi(struct vcpu_svm *svm) +{ + if (!is_guest_mode(&svm->vcpu)) + return true; + + if (!(svm->nested.intercept & (1ULL << INTERCEPT_NMI))) + return true; + + svm->vmcb->control.exit_code = SVM_EXIT_NMI; + svm->nested.exit_required = true; + + return false; +} + +static inline bool svm_nested_virtualize_tpr(struct kvm_vcpu *vcpu) +{ + return is_guest_mode(vcpu) && (vcpu->arch.hflags & HF_VINTR_MASK); +} + +void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, + struct vmcb *nested_vmcb, struct kvm_host_map *map); +int nested_svm_vmrun(struct vcpu_svm *svm); +void nested_svm_vmloadsave(struct vmcb *from_vmcb, struct vmcb *to_vmcb); +int nested_svm_vmexit(struct vcpu_svm *svm); +int nested_svm_exit_handled(struct vcpu_svm *svm); +int nested_svm_check_permissions(struct vcpu_svm *svm); +int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, + bool has_error_code, u32 error_code); +int svm_check_nested_events(struct kvm_vcpu *vcpu); +int nested_svm_exit_special(struct vcpu_svm *svm); + +#endif -- cgit From ef0f64960d012cbab8f55f305ef36bb6de4e1a9b Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Tue, 31 Mar 2020 12:17:38 -0400 Subject: KVM: SVM: Move AVIC code to separate file Move the AVIC related functions from svm.c to the new avic.c file. Signed-off-by: Joerg Roedel Message-Id: <20200324094154.32352-4-joro@8bytes.org> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/Makefile | 2 +- arch/x86/kvm/svm/avic.c | 1027 +++++++++++++++++++++++++++++++++++++++++++++ arch/x86/kvm/svm/svm.c | 1050 +---------------------------------------------- arch/x86/kvm/svm/svm.h | 62 +++ 4 files changed, 1091 insertions(+), 1050 deletions(-) create mode 100644 arch/x86/kvm/svm/avic.c (limited to 'arch/x86') diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index 0acec73ef236..31aca07d8f12 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -14,7 +14,7 @@ kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \ hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o -kvm-amd-y += svm/svm.o svm/pmu.o svm/nested.o +kvm-amd-y += svm/svm.o svm/pmu.o svm/nested.o svm/avic.o obj-$(CONFIG_KVM) += kvm.o obj-$(CONFIG_KVM_INTEL) += kvm-intel.o diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c new file mode 100644 index 000000000000..e80daa98682f --- /dev/null +++ b/arch/x86/kvm/svm/avic.c @@ -0,0 +1,1027 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Kernel-based Virtual Machine driver for Linux + * + * AMD SVM support + * + * Copyright (C) 2006 Qumranet, Inc. + * Copyright 2010 Red Hat, Inc. and/or its affiliates. + * + * Authors: + * Yaniv Kamay + * Avi Kivity + */ + +#define pr_fmt(fmt) "SVM: " fmt + +#include +#include +#include +#include + +#include + +#include "trace.h" +#include "lapic.h" +#include "x86.h" +#include "irq.h" +#include "svm.h" + +/* enable / disable AVIC */ +int avic; +#ifdef CONFIG_X86_LOCAL_APIC +module_param(avic, int, S_IRUGO); +#endif + +#define SVM_AVIC_DOORBELL 0xc001011b + +#define AVIC_HPA_MASK ~((0xFFFULL << 52) | 0xFFF) + +/* + * 0xff is broadcast, so the max index allowed for physical APIC ID + * table is 0xfe. APIC IDs above 0xff are reserved. + */ +#define AVIC_MAX_PHYSICAL_ID_COUNT 255 + +#define AVIC_UNACCEL_ACCESS_WRITE_MASK 1 +#define AVIC_UNACCEL_ACCESS_OFFSET_MASK 0xFF0 +#define AVIC_UNACCEL_ACCESS_VECTOR_MASK 0xFFFFFFFF + +/* AVIC GATAG is encoded using VM and VCPU IDs */ +#define AVIC_VCPU_ID_BITS 8 +#define AVIC_VCPU_ID_MASK ((1 << AVIC_VCPU_ID_BITS) - 1) + +#define AVIC_VM_ID_BITS 24 +#define AVIC_VM_ID_NR (1 << AVIC_VM_ID_BITS) +#define AVIC_VM_ID_MASK ((1 << AVIC_VM_ID_BITS) - 1) + +#define AVIC_GATAG(x, y) (((x & AVIC_VM_ID_MASK) << AVIC_VCPU_ID_BITS) | \ + (y & AVIC_VCPU_ID_MASK)) +#define AVIC_GATAG_TO_VMID(x) ((x >> AVIC_VCPU_ID_BITS) & AVIC_VM_ID_MASK) +#define AVIC_GATAG_TO_VCPUID(x) (x & AVIC_VCPU_ID_MASK) + +/* Note: + * This hash table is used to map VM_ID to a struct kvm_svm, + * when handling AMD IOMMU GALOG notification to schedule in + * a particular vCPU. + */ +#define SVM_VM_DATA_HASH_BITS 8 +static DEFINE_HASHTABLE(svm_vm_data_hash, SVM_VM_DATA_HASH_BITS); +static u32 next_vm_id = 0; +static bool next_vm_id_wrapped = 0; +static DEFINE_SPINLOCK(svm_vm_data_hash_lock); + +/* + * This is a wrapper of struct amd_iommu_ir_data. + */ +struct amd_svm_iommu_ir { + struct list_head node; /* Used by SVM for per-vcpu ir_list */ + void *data; /* Storing pointer to struct amd_ir_data */ +}; + +enum avic_ipi_failure_cause { + AVIC_IPI_FAILURE_INVALID_INT_TYPE, + AVIC_IPI_FAILURE_TARGET_NOT_RUNNING, + AVIC_IPI_FAILURE_INVALID_TARGET, + AVIC_IPI_FAILURE_INVALID_BACKING_PAGE, +}; + +/* Note: + * This function is called from IOMMU driver to notify + * SVM to schedule in a particular vCPU of a particular VM. + */ +int avic_ga_log_notifier(u32 ga_tag) +{ + unsigned long flags; + struct kvm_svm *kvm_svm; + struct kvm_vcpu *vcpu = NULL; + u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag); + u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag); + + pr_debug("SVM: %s: vm_id=%#x, vcpu_id=%#x\n", __func__, vm_id, vcpu_id); + trace_kvm_avic_ga_log(vm_id, vcpu_id); + + spin_lock_irqsave(&svm_vm_data_hash_lock, flags); + hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) { + if (kvm_svm->avic_vm_id != vm_id) + continue; + vcpu = kvm_get_vcpu_by_id(&kvm_svm->kvm, vcpu_id); + break; + } + spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags); + + /* Note: + * At this point, the IOMMU should have already set the pending + * bit in the vAPIC backing page. So, we just need to schedule + * in the vcpu. + */ + if (vcpu) + kvm_vcpu_wake_up(vcpu); + + return 0; +} + +void avic_vm_destroy(struct kvm *kvm) +{ + unsigned long flags; + struct kvm_svm *kvm_svm = to_kvm_svm(kvm); + + if (!avic) + return; + + if (kvm_svm->avic_logical_id_table_page) + __free_page(kvm_svm->avic_logical_id_table_page); + if (kvm_svm->avic_physical_id_table_page) + __free_page(kvm_svm->avic_physical_id_table_page); + + spin_lock_irqsave(&svm_vm_data_hash_lock, flags); + hash_del(&kvm_svm->hnode); + spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags); +} + +int avic_vm_init(struct kvm *kvm) +{ + unsigned long flags; + int err = -ENOMEM; + struct kvm_svm *kvm_svm = to_kvm_svm(kvm); + struct kvm_svm *k2; + struct page *p_page; + struct page *l_page; + u32 vm_id; + + if (!avic) + return 0; + + /* Allocating physical APIC ID table (4KB) */ + p_page = alloc_page(GFP_KERNEL_ACCOUNT); + if (!p_page) + goto free_avic; + + kvm_svm->avic_physical_id_table_page = p_page; + clear_page(page_address(p_page)); + + /* Allocating logical APIC ID table (4KB) */ + l_page = alloc_page(GFP_KERNEL_ACCOUNT); + if (!l_page) + goto free_avic; + + kvm_svm->avic_logical_id_table_page = l_page; + clear_page(page_address(l_page)); + + spin_lock_irqsave(&svm_vm_data_hash_lock, flags); + again: + vm_id = next_vm_id = (next_vm_id + 1) & AVIC_VM_ID_MASK; + if (vm_id == 0) { /* id is 1-based, zero is not okay */ + next_vm_id_wrapped = 1; + goto again; + } + /* Is it still in use? Only possible if wrapped at least once */ + if (next_vm_id_wrapped) { + hash_for_each_possible(svm_vm_data_hash, k2, hnode, vm_id) { + if (k2->avic_vm_id == vm_id) + goto again; + } + } + kvm_svm->avic_vm_id = vm_id; + hash_add(svm_vm_data_hash, &kvm_svm->hnode, kvm_svm->avic_vm_id); + spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags); + + return 0; + +free_avic: + avic_vm_destroy(kvm); + return err; +} + +void avic_init_vmcb(struct vcpu_svm *svm) +{ + struct vmcb *vmcb = svm->vmcb; + struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm); + phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page)); + phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page)); + phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page)); + + vmcb->control.avic_backing_page = bpa & AVIC_HPA_MASK; + vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK; + vmcb->control.avic_physical_id = ppa & AVIC_HPA_MASK; + vmcb->control.avic_physical_id |= AVIC_MAX_PHYSICAL_ID_COUNT; + if (kvm_apicv_activated(svm->vcpu.kvm)) + vmcb->control.int_ctl |= AVIC_ENABLE_MASK; + else + vmcb->control.int_ctl &= ~AVIC_ENABLE_MASK; +} + +static u64 *avic_get_physical_id_entry(struct kvm_vcpu *vcpu, + unsigned int index) +{ + u64 *avic_physical_id_table; + struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm); + + if (index >= AVIC_MAX_PHYSICAL_ID_COUNT) + return NULL; + + avic_physical_id_table = page_address(kvm_svm->avic_physical_id_table_page); + + return &avic_physical_id_table[index]; +} + +/** + * Note: + * AVIC hardware walks the nested page table to check permissions, + * but does not use the SPA address specified in the leaf page + * table entry since it uses address in the AVIC_BACKING_PAGE pointer + * field of the VMCB. Therefore, we set up the + * APIC_ACCESS_PAGE_PRIVATE_MEMSLOT (4KB) here. + */ +static int avic_update_access_page(struct kvm *kvm, bool activate) +{ + int ret = 0; + + mutex_lock(&kvm->slots_lock); + /* + * During kvm_destroy_vm(), kvm_pit_set_reinject() could trigger + * APICv mode change, which update APIC_ACCESS_PAGE_PRIVATE_MEMSLOT + * memory region. So, we need to ensure that kvm->mm == current->mm. + */ + if ((kvm->arch.apic_access_page_done == activate) || + (kvm->mm != current->mm)) + goto out; + + ret = __x86_set_memory_region(kvm, + APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, + APIC_DEFAULT_PHYS_BASE, + activate ? PAGE_SIZE : 0); + if (ret) + goto out; + + kvm->arch.apic_access_page_done = activate; +out: + mutex_unlock(&kvm->slots_lock); + return ret; +} + +static int avic_init_backing_page(struct kvm_vcpu *vcpu) +{ + u64 *entry, new_entry; + int id = vcpu->vcpu_id; + struct vcpu_svm *svm = to_svm(vcpu); + + if (id >= AVIC_MAX_PHYSICAL_ID_COUNT) + return -EINVAL; + + if (!svm->vcpu.arch.apic->regs) + return -EINVAL; + + if (kvm_apicv_activated(vcpu->kvm)) { + int ret; + + ret = avic_update_access_page(vcpu->kvm, true); + if (ret) + return ret; + } + + svm->avic_backing_page = virt_to_page(svm->vcpu.arch.apic->regs); + + /* Setting AVIC backing page address in the phy APIC ID table */ + entry = avic_get_physical_id_entry(vcpu, id); + if (!entry) + return -EINVAL; + + new_entry = __sme_set((page_to_phys(svm->avic_backing_page) & + AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK) | + AVIC_PHYSICAL_ID_ENTRY_VALID_MASK); + WRITE_ONCE(*entry, new_entry); + + svm->avic_physical_id_cache = entry; + + return 0; +} + +int avic_incomplete_ipi_interception(struct vcpu_svm *svm) +{ + u32 icrh = svm->vmcb->control.exit_info_1 >> 32; + u32 icrl = svm->vmcb->control.exit_info_1; + u32 id = svm->vmcb->control.exit_info_2 >> 32; + u32 index = svm->vmcb->control.exit_info_2 & 0xFF; + struct kvm_lapic *apic = svm->vcpu.arch.apic; + + trace_kvm_avic_incomplete_ipi(svm->vcpu.vcpu_id, icrh, icrl, id, index); + + switch (id) { + case AVIC_IPI_FAILURE_INVALID_INT_TYPE: + /* + * AVIC hardware handles the generation of + * IPIs when the specified Message Type is Fixed + * (also known as fixed delivery mode) and + * the Trigger Mode is edge-triggered. The hardware + * also supports self and broadcast delivery modes + * specified via the Destination Shorthand(DSH) + * field of the ICRL. Logical and physical APIC ID + * formats are supported. All other IPI types cause + * a #VMEXIT, which needs to emulated. + */ + kvm_lapic_reg_write(apic, APIC_ICR2, icrh); + kvm_lapic_reg_write(apic, APIC_ICR, icrl); + break; + case AVIC_IPI_FAILURE_TARGET_NOT_RUNNING: { + int i; + struct kvm_vcpu *vcpu; + struct kvm *kvm = svm->vcpu.kvm; + struct kvm_lapic *apic = svm->vcpu.arch.apic; + + /* + * At this point, we expect that the AVIC HW has already + * set the appropriate IRR bits on the valid target + * vcpus. So, we just need to kick the appropriate vcpu. + */ + kvm_for_each_vcpu(i, vcpu, kvm) { + bool m = kvm_apic_match_dest(vcpu, apic, + icrl & APIC_SHORT_MASK, + GET_APIC_DEST_FIELD(icrh), + icrl & APIC_DEST_MASK); + + if (m && !avic_vcpu_is_running(vcpu)) + kvm_vcpu_wake_up(vcpu); + } + break; + } + case AVIC_IPI_FAILURE_INVALID_TARGET: + WARN_ONCE(1, "Invalid IPI target: index=%u, vcpu=%d, icr=%#0x:%#0x\n", + index, svm->vcpu.vcpu_id, icrh, icrl); + break; + case AVIC_IPI_FAILURE_INVALID_BACKING_PAGE: + WARN_ONCE(1, "Invalid backing page\n"); + break; + default: + pr_err("Unknown IPI interception\n"); + } + + return 1; +} + +static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool flat) +{ + struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm); + int index; + u32 *logical_apic_id_table; + int dlid = GET_APIC_LOGICAL_ID(ldr); + + if (!dlid) + return NULL; + + if (flat) { /* flat */ + index = ffs(dlid) - 1; + if (index > 7) + return NULL; + } else { /* cluster */ + int cluster = (dlid & 0xf0) >> 4; + int apic = ffs(dlid & 0x0f) - 1; + + if ((apic < 0) || (apic > 7) || + (cluster >= 0xf)) + return NULL; + index = (cluster << 2) + apic; + } + + logical_apic_id_table = (u32 *) page_address(kvm_svm->avic_logical_id_table_page); + + return &logical_apic_id_table[index]; +} + +static int avic_ldr_write(struct kvm_vcpu *vcpu, u8 g_physical_id, u32 ldr) +{ + bool flat; + u32 *entry, new_entry; + + flat = kvm_lapic_get_reg(vcpu->arch.apic, APIC_DFR) == APIC_DFR_FLAT; + entry = avic_get_logical_id_entry(vcpu, ldr, flat); + if (!entry) + return -EINVAL; + + new_entry = READ_ONCE(*entry); + new_entry &= ~AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK; + new_entry |= (g_physical_id & AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK); + new_entry |= AVIC_LOGICAL_ID_ENTRY_VALID_MASK; + WRITE_ONCE(*entry, new_entry); + + return 0; +} + +static void avic_invalidate_logical_id_entry(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + bool flat = svm->dfr_reg == APIC_DFR_FLAT; + u32 *entry = avic_get_logical_id_entry(vcpu, svm->ldr_reg, flat); + + if (entry) + clear_bit(AVIC_LOGICAL_ID_ENTRY_VALID_BIT, (unsigned long *)entry); +} + +static int avic_handle_ldr_update(struct kvm_vcpu *vcpu) +{ + int ret = 0; + struct vcpu_svm *svm = to_svm(vcpu); + u32 ldr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_LDR); + u32 id = kvm_xapic_id(vcpu->arch.apic); + + if (ldr == svm->ldr_reg) + return 0; + + avic_invalidate_logical_id_entry(vcpu); + + if (ldr) + ret = avic_ldr_write(vcpu, id, ldr); + + if (!ret) + svm->ldr_reg = ldr; + + return ret; +} + +static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu) +{ + u64 *old, *new; + struct vcpu_svm *svm = to_svm(vcpu); + u32 id = kvm_xapic_id(vcpu->arch.apic); + + if (vcpu->vcpu_id == id) + return 0; + + old = avic_get_physical_id_entry(vcpu, vcpu->vcpu_id); + new = avic_get_physical_id_entry(vcpu, id); + if (!new || !old) + return 1; + + /* We need to move physical_id_entry to new offset */ + *new = *old; + *old = 0ULL; + to_svm(vcpu)->avic_physical_id_cache = new; + + /* + * Also update the guest physical APIC ID in the logical + * APIC ID table entry if already setup the LDR. + */ + if (svm->ldr_reg) + avic_handle_ldr_update(vcpu); + + return 0; +} + +static void avic_handle_dfr_update(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + u32 dfr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_DFR); + + if (svm->dfr_reg == dfr) + return; + + avic_invalidate_logical_id_entry(vcpu); + svm->dfr_reg = dfr; +} + +static int avic_unaccel_trap_write(struct vcpu_svm *svm) +{ + struct kvm_lapic *apic = svm->vcpu.arch.apic; + u32 offset = svm->vmcb->control.exit_info_1 & + AVIC_UNACCEL_ACCESS_OFFSET_MASK; + + switch (offset) { + case APIC_ID: + if (avic_handle_apic_id_update(&svm->vcpu)) + return 0; + break; + case APIC_LDR: + if (avic_handle_ldr_update(&svm->vcpu)) + return 0; + break; + case APIC_DFR: + avic_handle_dfr_update(&svm->vcpu); + break; + default: + break; + } + + kvm_lapic_reg_write(apic, offset, kvm_lapic_get_reg(apic, offset)); + + return 1; +} + +static bool is_avic_unaccelerated_access_trap(u32 offset) +{ + bool ret = false; + + switch (offset) { + case APIC_ID: + case APIC_EOI: + case APIC_RRR: + case APIC_LDR: + case APIC_DFR: + case APIC_SPIV: + case APIC_ESR: + case APIC_ICR: + case APIC_LVTT: + case APIC_LVTTHMR: + case APIC_LVTPC: + case APIC_LVT0: + case APIC_LVT1: + case APIC_LVTERR: + case APIC_TMICT: + case APIC_TDCR: + ret = true; + break; + default: + break; + } + return ret; +} + +int avic_unaccelerated_access_interception(struct vcpu_svm *svm) +{ + int ret = 0; + u32 offset = svm->vmcb->control.exit_info_1 & + AVIC_UNACCEL_ACCESS_OFFSET_MASK; + u32 vector = svm->vmcb->control.exit_info_2 & + AVIC_UNACCEL_ACCESS_VECTOR_MASK; + bool write = (svm->vmcb->control.exit_info_1 >> 32) & + AVIC_UNACCEL_ACCESS_WRITE_MASK; + bool trap = is_avic_unaccelerated_access_trap(offset); + + trace_kvm_avic_unaccelerated_access(svm->vcpu.vcpu_id, offset, + trap, write, vector); + if (trap) { + /* Handling Trap */ + WARN_ONCE(!write, "svm: Handling trap read.\n"); + ret = avic_unaccel_trap_write(svm); + } else { + /* Handling Fault */ + ret = kvm_emulate_instruction(&svm->vcpu, 0); + } + + return ret; +} + +int avic_init_vcpu(struct vcpu_svm *svm) +{ + int ret; + struct kvm_vcpu *vcpu = &svm->vcpu; + + if (!avic || !irqchip_in_kernel(vcpu->kvm)) + return 0; + + ret = avic_init_backing_page(&svm->vcpu); + if (ret) + return ret; + + INIT_LIST_HEAD(&svm->ir_list); + spin_lock_init(&svm->ir_list_lock); + svm->dfr_reg = APIC_DFR_FLAT; + + return ret; +} + +void avic_post_state_restore(struct kvm_vcpu *vcpu) +{ + if (avic_handle_apic_id_update(vcpu) != 0) + return; + avic_handle_dfr_update(vcpu); + avic_handle_ldr_update(vcpu); +} + +void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate) +{ + if (!avic || !lapic_in_kernel(vcpu)) + return; + + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx); + kvm_request_apicv_update(vcpu->kvm, activate, + APICV_INHIBIT_REASON_IRQWIN); + vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu); +} + +void svm_set_virtual_apic_mode(struct kvm_vcpu *vcpu) +{ + return; +} + +void svm_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr) +{ +} + +void svm_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr) +{ +} + +static int svm_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate) +{ + int ret = 0; + unsigned long flags; + struct amd_svm_iommu_ir *ir; + struct vcpu_svm *svm = to_svm(vcpu); + + if (!kvm_arch_has_assigned_device(vcpu->kvm)) + return 0; + + /* + * Here, we go through the per-vcpu ir_list to update all existing + * interrupt remapping table entry targeting this vcpu. + */ + spin_lock_irqsave(&svm->ir_list_lock, flags); + + if (list_empty(&svm->ir_list)) + goto out; + + list_for_each_entry(ir, &svm->ir_list, node) { + if (activate) + ret = amd_iommu_activate_guest_mode(ir->data); + else + ret = amd_iommu_deactivate_guest_mode(ir->data); + if (ret) + break; + } +out: + spin_unlock_irqrestore(&svm->ir_list_lock, flags); + return ret; +} + +void svm_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb *vmcb = svm->vmcb; + bool activated = kvm_vcpu_apicv_active(vcpu); + + if (!avic) + return; + + if (activated) { + /** + * During AVIC temporary deactivation, guest could update + * APIC ID, DFR and LDR registers, which would not be trapped + * by avic_unaccelerated_access_interception(). In this case, + * we need to check and update the AVIC logical APIC ID table + * accordingly before re-activating. + */ + avic_post_state_restore(vcpu); + vmcb->control.int_ctl |= AVIC_ENABLE_MASK; + } else { + vmcb->control.int_ctl &= ~AVIC_ENABLE_MASK; + } + mark_dirty(vmcb, VMCB_AVIC); + + svm_set_pi_irte_mode(vcpu, activated); +} + +void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) +{ + return; +} + +int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec) +{ + if (!vcpu->arch.apicv_active) + return -1; + + kvm_lapic_set_irr(vec, vcpu->arch.apic); + smp_mb__after_atomic(); + + if (avic_vcpu_is_running(vcpu)) { + int cpuid = vcpu->cpu; + + if (cpuid != get_cpu()) + wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpuid)); + put_cpu(); + } else + kvm_vcpu_wake_up(vcpu); + + return 0; +} + +bool svm_dy_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu) +{ + return false; +} + +static void svm_ir_list_del(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi) +{ + unsigned long flags; + struct amd_svm_iommu_ir *cur; + + spin_lock_irqsave(&svm->ir_list_lock, flags); + list_for_each_entry(cur, &svm->ir_list, node) { + if (cur->data != pi->ir_data) + continue; + list_del(&cur->node); + kfree(cur); + break; + } + spin_unlock_irqrestore(&svm->ir_list_lock, flags); +} + +static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi) +{ + int ret = 0; + unsigned long flags; + struct amd_svm_iommu_ir *ir; + + /** + * In some cases, the existing irte is updaed and re-set, + * so we need to check here if it's already been * added + * to the ir_list. + */ + if (pi->ir_data && (pi->prev_ga_tag != 0)) { + struct kvm *kvm = svm->vcpu.kvm; + u32 vcpu_id = AVIC_GATAG_TO_VCPUID(pi->prev_ga_tag); + struct kvm_vcpu *prev_vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id); + struct vcpu_svm *prev_svm; + + if (!prev_vcpu) { + ret = -EINVAL; + goto out; + } + + prev_svm = to_svm(prev_vcpu); + svm_ir_list_del(prev_svm, pi); + } + + /** + * Allocating new amd_iommu_pi_data, which will get + * add to the per-vcpu ir_list. + */ + ir = kzalloc(sizeof(struct amd_svm_iommu_ir), GFP_KERNEL_ACCOUNT); + if (!ir) { + ret = -ENOMEM; + goto out; + } + ir->data = pi->ir_data; + + spin_lock_irqsave(&svm->ir_list_lock, flags); + list_add(&ir->node, &svm->ir_list); + spin_unlock_irqrestore(&svm->ir_list_lock, flags); +out: + return ret; +} + +/** + * Note: + * The HW cannot support posting multicast/broadcast + * interrupts to a vCPU. So, we still use legacy interrupt + * remapping for these kind of interrupts. + * + * For lowest-priority interrupts, we only support + * those with single CPU as the destination, e.g. user + * configures the interrupts via /proc/irq or uses + * irqbalance to make the interrupts single-CPU. + */ +static int +get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e, + struct vcpu_data *vcpu_info, struct vcpu_svm **svm) +{ + struct kvm_lapic_irq irq; + struct kvm_vcpu *vcpu = NULL; + + kvm_set_msi_irq(kvm, e, &irq); + + if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) || + !kvm_irq_is_postable(&irq)) { + pr_debug("SVM: %s: use legacy intr remap mode for irq %u\n", + __func__, irq.vector); + return -1; + } + + pr_debug("SVM: %s: use GA mode for irq %u\n", __func__, + irq.vector); + *svm = to_svm(vcpu); + vcpu_info->pi_desc_addr = __sme_set(page_to_phys((*svm)->avic_backing_page)); + vcpu_info->vector = irq.vector; + + return 0; +} + +/* + * svm_update_pi_irte - set IRTE for Posted-Interrupts + * + * @kvm: kvm + * @host_irq: host irq of the interrupt + * @guest_irq: gsi of the interrupt + * @set: set or unset PI + * returns 0 on success, < 0 on failure + */ +int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq, + uint32_t guest_irq, bool set) +{ + struct kvm_kernel_irq_routing_entry *e; + struct kvm_irq_routing_table *irq_rt; + int idx, ret = -EINVAL; + + if (!kvm_arch_has_assigned_device(kvm) || + !irq_remapping_cap(IRQ_POSTING_CAP)) + return 0; + + pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n", + __func__, host_irq, guest_irq, set); + + idx = srcu_read_lock(&kvm->irq_srcu); + irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu); + WARN_ON(guest_irq >= irq_rt->nr_rt_entries); + + hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) { + struct vcpu_data vcpu_info; + struct vcpu_svm *svm = NULL; + + if (e->type != KVM_IRQ_ROUTING_MSI) + continue; + + /** + * Here, we setup with legacy mode in the following cases: + * 1. When cannot target interrupt to a specific vcpu. + * 2. Unsetting posted interrupt. + * 3. APIC virtialization is disabled for the vcpu. + * 4. IRQ has incompatible delivery mode (SMI, INIT, etc) + */ + if (!get_pi_vcpu_info(kvm, e, &vcpu_info, &svm) && set && + kvm_vcpu_apicv_active(&svm->vcpu)) { + struct amd_iommu_pi_data pi; + + /* Try to enable guest_mode in IRTE */ + pi.base = __sme_set(page_to_phys(svm->avic_backing_page) & + AVIC_HPA_MASK); + pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, + svm->vcpu.vcpu_id); + pi.is_guest_mode = true; + pi.vcpu_data = &vcpu_info; + ret = irq_set_vcpu_affinity(host_irq, &pi); + + /** + * Here, we successfully setting up vcpu affinity in + * IOMMU guest mode. Now, we need to store the posted + * interrupt information in a per-vcpu ir_list so that + * we can reference to them directly when we update vcpu + * scheduling information in IOMMU irte. + */ + if (!ret && pi.is_guest_mode) + svm_ir_list_add(svm, &pi); + } else { + /* Use legacy mode in IRTE */ + struct amd_iommu_pi_data pi; + + /** + * Here, pi is used to: + * - Tell IOMMU to use legacy mode for this interrupt. + * - Retrieve ga_tag of prior interrupt remapping data. + */ + pi.is_guest_mode = false; + ret = irq_set_vcpu_affinity(host_irq, &pi); + + /** + * Check if the posted interrupt was previously + * setup with the guest_mode by checking if the ga_tag + * was cached. If so, we need to clean up the per-vcpu + * ir_list. + */ + if (!ret && pi.prev_ga_tag) { + int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag); + struct kvm_vcpu *vcpu; + + vcpu = kvm_get_vcpu_by_id(kvm, id); + if (vcpu) + svm_ir_list_del(to_svm(vcpu), &pi); + } + } + + if (!ret && svm) { + trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id, + e->gsi, vcpu_info.vector, + vcpu_info.pi_desc_addr, set); + } + + if (ret < 0) { + pr_err("%s: failed to update PI IRTE\n", __func__); + goto out; + } + } + + ret = 0; +out: + srcu_read_unlock(&kvm->irq_srcu, idx); + return ret; +} + +bool svm_check_apicv_inhibit_reasons(ulong bit) +{ + ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) | + BIT(APICV_INHIBIT_REASON_HYPERV) | + BIT(APICV_INHIBIT_REASON_NESTED) | + BIT(APICV_INHIBIT_REASON_IRQWIN) | + BIT(APICV_INHIBIT_REASON_PIT_REINJ) | + BIT(APICV_INHIBIT_REASON_X2APIC); + + return supported & BIT(bit); +} + +void svm_pre_update_apicv_exec_ctrl(struct kvm *kvm, bool activate) +{ + avic_update_access_page(kvm, activate); +} + +static inline int +avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r) +{ + int ret = 0; + unsigned long flags; + struct amd_svm_iommu_ir *ir; + struct vcpu_svm *svm = to_svm(vcpu); + + if (!kvm_arch_has_assigned_device(vcpu->kvm)) + return 0; + + /* + * Here, we go through the per-vcpu ir_list to update all existing + * interrupt remapping table entry targeting this vcpu. + */ + spin_lock_irqsave(&svm->ir_list_lock, flags); + + if (list_empty(&svm->ir_list)) + goto out; + + list_for_each_entry(ir, &svm->ir_list, node) { + ret = amd_iommu_update_ga(cpu, r, ir->data); + if (ret) + break; + } +out: + spin_unlock_irqrestore(&svm->ir_list_lock, flags); + return ret; +} + +void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu) +{ + u64 entry; + /* ID = 0xff (broadcast), ID > 0xff (reserved) */ + int h_physical_id = kvm_cpu_get_apicid(cpu); + struct vcpu_svm *svm = to_svm(vcpu); + + if (!kvm_vcpu_apicv_active(vcpu)) + return; + + /* + * Since the host physical APIC id is 8 bits, + * we can support host APIC ID upto 255. + */ + if (WARN_ON(h_physical_id > AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK)) + return; + + entry = READ_ONCE(*(svm->avic_physical_id_cache)); + WARN_ON(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK); + + entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK; + entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK); + + entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK; + if (svm->avic_is_running) + entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK; + + WRITE_ONCE(*(svm->avic_physical_id_cache), entry); + avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, + svm->avic_is_running); +} + +void avic_vcpu_put(struct kvm_vcpu *vcpu) +{ + u64 entry; + struct vcpu_svm *svm = to_svm(vcpu); + + if (!kvm_vcpu_apicv_active(vcpu)) + return; + + entry = READ_ONCE(*(svm->avic_physical_id_cache)); + if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK) + avic_update_iommu_vcpu_affinity(vcpu, -1, 0); + + entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK; + WRITE_ONCE(*(svm->avic_physical_id_cache), entry); +} + +/** + * This function is called during VCPU halt/unhalt. + */ +static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + svm->avic_is_running = is_run; + if (is_run) + avic_vcpu_load(vcpu, vcpu->cpu); + else + avic_vcpu_put(vcpu); +} + +void svm_vcpu_blocking(struct kvm_vcpu *vcpu) +{ + avic_set_running(vcpu, false); +} + +void svm_vcpu_unblocking(struct kvm_vcpu *vcpu) +{ + if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu)) + kvm_vcpu_update_apicv(vcpu); + avic_set_running(vcpu, true); +} diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index a24e2de4acda..e32b4956fa06 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1,17 +1,3 @@ -// SPDX-License-Identifier: GPL-2.0-only -/* - * Kernel-based Virtual Machine driver for Linux - * - * AMD SVM support - * - * Copyright (C) 2006 Qumranet, Inc. - * Copyright 2010 Red Hat, Inc. and/or its affiliates. - * - * Authors: - * Yaniv Kamay - * Avi Kivity - */ - #define pr_fmt(fmt) "SVM: " fmt #include @@ -28,10 +14,10 @@ #include #include #include +#include #include #include #include -#include #include #include #include @@ -82,39 +68,12 @@ MODULE_DEVICE_TABLE(x86cpu, svm_cpu_id); #define SVM_FEATURE_DECODE_ASSIST (1 << 7) #define SVM_FEATURE_PAUSE_FILTER (1 << 10) -#define SVM_AVIC_DOORBELL 0xc001011b - #define DEBUGCTL_RESERVED_BITS (~(0x3fULL)) #define TSC_RATIO_RSVD 0xffffff0000000000ULL #define TSC_RATIO_MIN 0x0000000000000001ULL #define TSC_RATIO_MAX 0x000000ffffffffffULL -#define AVIC_HPA_MASK ~((0xFFFULL << 52) | 0xFFF) - -/* - * 0xff is broadcast, so the max index allowed for physical APIC ID - * table is 0xfe. APIC IDs above 0xff are reserved. - */ -#define AVIC_MAX_PHYSICAL_ID_COUNT 255 - -#define AVIC_UNACCEL_ACCESS_WRITE_MASK 1 -#define AVIC_UNACCEL_ACCESS_OFFSET_MASK 0xFF0 -#define AVIC_UNACCEL_ACCESS_VECTOR_MASK 0xFFFFFFFF - -/* AVIC GATAG is encoded using VM and VCPU IDs */ -#define AVIC_VCPU_ID_BITS 8 -#define AVIC_VCPU_ID_MASK ((1 << AVIC_VCPU_ID_BITS) - 1) - -#define AVIC_VM_ID_BITS 24 -#define AVIC_VM_ID_NR (1 << AVIC_VM_ID_BITS) -#define AVIC_VM_ID_MASK ((1 << AVIC_VM_ID_BITS) - 1) - -#define AVIC_GATAG(x, y) (((x & AVIC_VM_ID_MASK) << AVIC_VCPU_ID_BITS) | \ - (y & AVIC_VCPU_ID_MASK)) -#define AVIC_GATAG_TO_VMID(x) ((x >> AVIC_VCPU_ID_BITS) & AVIC_VM_ID_MASK) -#define AVIC_GATAG_TO_VCPUID(x) (x & AVIC_VCPU_ID_MASK) - static bool erratum_383_found __read_mostly; u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly; @@ -125,23 +84,6 @@ u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly; */ static uint64_t osvw_len = 4, osvw_status; -/* - * This is a wrapper of struct amd_iommu_ir_data. - */ -struct amd_svm_iommu_ir { - struct list_head node; /* Used by SVM for per-vcpu ir_list */ - void *data; /* Storing pointer to struct amd_ir_data */ -}; - -#define AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK (0xFF) -#define AVIC_LOGICAL_ID_ENTRY_VALID_BIT 31 -#define AVIC_LOGICAL_ID_ENTRY_VALID_MASK (1 << 31) - -#define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK (0xFFULL) -#define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK (0xFFFFFFFFFFULL << 12) -#define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK (1ULL << 62) -#define AVIC_PHYSICAL_ID_ENTRY_VALID_MASK (1ULL << 63) - static DEFINE_PER_CPU(u64, current_tsc_ratio); #define TSC_RATIO_DEFAULT 0x0100000000ULL @@ -231,12 +173,6 @@ module_param(npt, int, S_IRUGO); static int nested = true; module_param(nested, int, S_IRUGO); -/* enable / disable AVIC */ -static int avic; -#ifdef CONFIG_X86_LOCAL_APIC -module_param(avic, int, S_IRUGO); -#endif - /* enable/disable Next RIP Save */ static int nrips = true; module_param(nrips, int, 0444); @@ -259,10 +195,6 @@ module_param(dump_invalid_vmcb, bool, 0644); static u8 rsm_ins_bytes[] = "\x0f\xaa"; static void svm_complete_interrupts(struct vcpu_svm *svm); -static void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate); -static inline void avic_post_state_restore(struct kvm_vcpu *vcpu); - -#define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL static int sev_flush_asids(void); static DECLARE_RWSEM(sev_deactivate_lock); @@ -282,11 +214,6 @@ struct enc_region { }; -static inline struct kvm_svm *to_kvm_svm(struct kvm *kvm) -{ - return container_of(kvm, struct kvm_svm, kvm); -} - static inline bool svm_sev_enabled(void) { return IS_ENABLED(CONFIG_KVM_AMD_SEV) ? max_sev_asid : 0; @@ -310,23 +237,6 @@ static inline int sev_get_asid(struct kvm *kvm) return sev->asid; } -static inline void avic_update_vapic_bar(struct vcpu_svm *svm, u64 data) -{ - svm->vmcb->control.avic_vapic_bar = data & VMCB_AVIC_APIC_BAR_MASK; - mark_dirty(svm->vmcb, VMCB_AVIC); -} - -static inline bool avic_vcpu_is_running(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - u64 *entry = svm->avic_physical_id_cache; - - if (!entry) - return false; - - return (READ_ONCE(*entry) & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK); -} - static unsigned long iopm_base; struct kvm_ldttss_desc { @@ -853,52 +763,6 @@ void disable_nmi_singlestep(struct vcpu_svm *svm) } } -/* Note: - * This hash table is used to map VM_ID to a struct kvm_svm, - * when handling AMD IOMMU GALOG notification to schedule in - * a particular vCPU. - */ -#define SVM_VM_DATA_HASH_BITS 8 -static DEFINE_HASHTABLE(svm_vm_data_hash, SVM_VM_DATA_HASH_BITS); -static u32 next_vm_id = 0; -static bool next_vm_id_wrapped = 0; -static DEFINE_SPINLOCK(svm_vm_data_hash_lock); - -/* Note: - * This function is called from IOMMU driver to notify - * SVM to schedule in a particular vCPU of a particular VM. - */ -static int avic_ga_log_notifier(u32 ga_tag) -{ - unsigned long flags; - struct kvm_svm *kvm_svm; - struct kvm_vcpu *vcpu = NULL; - u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag); - u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag); - - pr_debug("SVM: %s: vm_id=%#x, vcpu_id=%#x\n", __func__, vm_id, vcpu_id); - trace_kvm_avic_ga_log(vm_id, vcpu_id); - - spin_lock_irqsave(&svm_vm_data_hash_lock, flags); - hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) { - if (kvm_svm->avic_vm_id != vm_id) - continue; - vcpu = kvm_get_vcpu_by_id(&kvm_svm->kvm, vcpu_id); - break; - } - spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags); - - /* Note: - * At this point, the IOMMU should have already set the pending - * bit in the vAPIC backing page. So, we just need to schedule - * in the vcpu. - */ - if (vcpu) - kvm_vcpu_wake_up(vcpu); - - return 0; -} - static __init int sev_hardware_setup(void) { struct sev_user_data_status *status; @@ -1227,24 +1091,6 @@ static u64 svm_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) return svm->vmcb->control.tsc_offset; } -static void avic_init_vmcb(struct vcpu_svm *svm) -{ - struct vmcb *vmcb = svm->vmcb; - struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm); - phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page)); - phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page)); - phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page)); - - vmcb->control.avic_backing_page = bpa & AVIC_HPA_MASK; - vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK; - vmcb->control.avic_physical_id = ppa & AVIC_HPA_MASK; - vmcb->control.avic_physical_id |= AVIC_MAX_PHYSICAL_ID_COUNT; - if (kvm_apicv_activated(svm->vcpu.kvm)) - vmcb->control.int_ctl |= AVIC_ENABLE_MASK; - else - vmcb->control.int_ctl &= ~AVIC_ENABLE_MASK; -} - static void init_vmcb(struct vcpu_svm *svm) { struct vmcb_control_area *control = &svm->vmcb->control; @@ -1404,92 +1250,6 @@ static void init_vmcb(struct vcpu_svm *svm) } -static u64 *avic_get_physical_id_entry(struct kvm_vcpu *vcpu, - unsigned int index) -{ - u64 *avic_physical_id_table; - struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm); - - if (index >= AVIC_MAX_PHYSICAL_ID_COUNT) - return NULL; - - avic_physical_id_table = page_address(kvm_svm->avic_physical_id_table_page); - - return &avic_physical_id_table[index]; -} - -/** - * Note: - * AVIC hardware walks the nested page table to check permissions, - * but does not use the SPA address specified in the leaf page - * table entry since it uses address in the AVIC_BACKING_PAGE pointer - * field of the VMCB. Therefore, we set up the - * APIC_ACCESS_PAGE_PRIVATE_MEMSLOT (4KB) here. - */ -static int avic_update_access_page(struct kvm *kvm, bool activate) -{ - int ret = 0; - - mutex_lock(&kvm->slots_lock); - /* - * During kvm_destroy_vm(), kvm_pit_set_reinject() could trigger - * APICv mode change, which update APIC_ACCESS_PAGE_PRIVATE_MEMSLOT - * memory region. So, we need to ensure that kvm->mm == current->mm. - */ - if ((kvm->arch.apic_access_page_done == activate) || - (kvm->mm != current->mm)) - goto out; - - ret = __x86_set_memory_region(kvm, - APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, - APIC_DEFAULT_PHYS_BASE, - activate ? PAGE_SIZE : 0); - if (ret) - goto out; - - kvm->arch.apic_access_page_done = activate; -out: - mutex_unlock(&kvm->slots_lock); - return ret; -} - -static int avic_init_backing_page(struct kvm_vcpu *vcpu) -{ - u64 *entry, new_entry; - int id = vcpu->vcpu_id; - struct vcpu_svm *svm = to_svm(vcpu); - - if (id >= AVIC_MAX_PHYSICAL_ID_COUNT) - return -EINVAL; - - if (!svm->vcpu.arch.apic->regs) - return -EINVAL; - - if (kvm_apicv_activated(vcpu->kvm)) { - int ret; - - ret = avic_update_access_page(vcpu->kvm, true); - if (ret) - return ret; - } - - svm->avic_backing_page = virt_to_page(svm->vcpu.arch.apic->regs); - - /* Setting AVIC backing page address in the phy APIC ID table */ - entry = avic_get_physical_id_entry(vcpu, id); - if (!entry) - return -EINVAL; - - new_entry = __sme_set((page_to_phys(svm->avic_backing_page) & - AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK) | - AVIC_PHYSICAL_ID_ENTRY_VALID_MASK); - WRITE_ONCE(*entry, new_entry); - - svm->avic_physical_id_cache = entry; - - return 0; -} - static void sev_asid_free(int asid) { struct svm_cpu_data *sd; @@ -1665,84 +1425,12 @@ static void sev_vm_destroy(struct kvm *kvm) sev_asid_free(sev->asid); } -static void avic_vm_destroy(struct kvm *kvm) -{ - unsigned long flags; - struct kvm_svm *kvm_svm = to_kvm_svm(kvm); - - if (!avic) - return; - - if (kvm_svm->avic_logical_id_table_page) - __free_page(kvm_svm->avic_logical_id_table_page); - if (kvm_svm->avic_physical_id_table_page) - __free_page(kvm_svm->avic_physical_id_table_page); - - spin_lock_irqsave(&svm_vm_data_hash_lock, flags); - hash_del(&kvm_svm->hnode); - spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags); -} - static void svm_vm_destroy(struct kvm *kvm) { avic_vm_destroy(kvm); sev_vm_destroy(kvm); } -static int avic_vm_init(struct kvm *kvm) -{ - unsigned long flags; - int err = -ENOMEM; - struct kvm_svm *kvm_svm = to_kvm_svm(kvm); - struct kvm_svm *k2; - struct page *p_page; - struct page *l_page; - u32 vm_id; - - if (!avic) - return 0; - - /* Allocating physical APIC ID table (4KB) */ - p_page = alloc_page(GFP_KERNEL_ACCOUNT); - if (!p_page) - goto free_avic; - - kvm_svm->avic_physical_id_table_page = p_page; - clear_page(page_address(p_page)); - - /* Allocating logical APIC ID table (4KB) */ - l_page = alloc_page(GFP_KERNEL_ACCOUNT); - if (!l_page) - goto free_avic; - - kvm_svm->avic_logical_id_table_page = l_page; - clear_page(page_address(l_page)); - - spin_lock_irqsave(&svm_vm_data_hash_lock, flags); - again: - vm_id = next_vm_id = (next_vm_id + 1) & AVIC_VM_ID_MASK; - if (vm_id == 0) { /* id is 1-based, zero is not okay */ - next_vm_id_wrapped = 1; - goto again; - } - /* Is it still in use? Only possible if wrapped at least once */ - if (next_vm_id_wrapped) { - hash_for_each_possible(svm_vm_data_hash, k2, hnode, vm_id) { - if (k2->avic_vm_id == vm_id) - goto again; - } - } - kvm_svm->avic_vm_id = vm_id; - hash_add(svm_vm_data_hash, &kvm_svm->hnode, kvm_svm->avic_vm_id); - spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags); - - return 0; - -free_avic: - avic_vm_destroy(kvm); - return err; -} - static int svm_vm_init(struct kvm *kvm) { if (avic) { @@ -1755,98 +1443,6 @@ static int svm_vm_init(struct kvm *kvm) return 0; } -static inline int -avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r) -{ - int ret = 0; - unsigned long flags; - struct amd_svm_iommu_ir *ir; - struct vcpu_svm *svm = to_svm(vcpu); - - if (!kvm_arch_has_assigned_device(vcpu->kvm)) - return 0; - - /* - * Here, we go through the per-vcpu ir_list to update all existing - * interrupt remapping table entry targeting this vcpu. - */ - spin_lock_irqsave(&svm->ir_list_lock, flags); - - if (list_empty(&svm->ir_list)) - goto out; - - list_for_each_entry(ir, &svm->ir_list, node) { - ret = amd_iommu_update_ga(cpu, r, ir->data); - if (ret) - break; - } -out: - spin_unlock_irqrestore(&svm->ir_list_lock, flags); - return ret; -} - -static void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu) -{ - u64 entry; - /* ID = 0xff (broadcast), ID > 0xff (reserved) */ - int h_physical_id = kvm_cpu_get_apicid(cpu); - struct vcpu_svm *svm = to_svm(vcpu); - - if (!kvm_vcpu_apicv_active(vcpu)) - return; - - /* - * Since the host physical APIC id is 8 bits, - * we can support host APIC ID upto 255. - */ - if (WARN_ON(h_physical_id > AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK)) - return; - - entry = READ_ONCE(*(svm->avic_physical_id_cache)); - WARN_ON(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK); - - entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK; - entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK); - - entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK; - if (svm->avic_is_running) - entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK; - - WRITE_ONCE(*(svm->avic_physical_id_cache), entry); - avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, - svm->avic_is_running); -} - -static void avic_vcpu_put(struct kvm_vcpu *vcpu) -{ - u64 entry; - struct vcpu_svm *svm = to_svm(vcpu); - - if (!kvm_vcpu_apicv_active(vcpu)) - return; - - entry = READ_ONCE(*(svm->avic_physical_id_cache)); - if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK) - avic_update_iommu_vcpu_affinity(vcpu, -1, 0); - - entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK; - WRITE_ONCE(*(svm->avic_physical_id_cache), entry); -} - -/** - * This function is called during VCPU halt/unhalt. - */ -static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run) -{ - struct vcpu_svm *svm = to_svm(vcpu); - - svm->avic_is_running = is_run; - if (is_run) - avic_vcpu_load(vcpu, vcpu->cpu); - else - avic_vcpu_put(vcpu); -} - static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) { struct vcpu_svm *svm = to_svm(vcpu); @@ -1871,25 +1467,6 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) avic_update_vapic_bar(svm, APIC_DEFAULT_PHYS_BASE); } -static int avic_init_vcpu(struct vcpu_svm *svm) -{ - int ret; - struct kvm_vcpu *vcpu = &svm->vcpu; - - if (!avic || !irqchip_in_kernel(vcpu->kvm)) - return 0; - - ret = avic_init_backing_page(&svm->vcpu); - if (ret) - return ret; - - INIT_LIST_HEAD(&svm->ir_list); - spin_lock_init(&svm->ir_list_lock); - svm->dfr_reg = APIC_DFR_FLAT; - - return ret; -} - static int svm_create_vcpu(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm; @@ -2046,18 +1623,6 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu) wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]); } -static void svm_vcpu_blocking(struct kvm_vcpu *vcpu) -{ - avic_set_running(vcpu, false); -} - -static void svm_vcpu_unblocking(struct kvm_vcpu *vcpu) -{ - if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu)) - kvm_vcpu_update_apicv(vcpu); - avic_set_running(vcpu, true); -} - static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -3437,276 +3002,6 @@ static int mwait_interception(struct vcpu_svm *svm) return nop_interception(svm); } -enum avic_ipi_failure_cause { - AVIC_IPI_FAILURE_INVALID_INT_TYPE, - AVIC_IPI_FAILURE_TARGET_NOT_RUNNING, - AVIC_IPI_FAILURE_INVALID_TARGET, - AVIC_IPI_FAILURE_INVALID_BACKING_PAGE, -}; - -static int avic_incomplete_ipi_interception(struct vcpu_svm *svm) -{ - u32 icrh = svm->vmcb->control.exit_info_1 >> 32; - u32 icrl = svm->vmcb->control.exit_info_1; - u32 id = svm->vmcb->control.exit_info_2 >> 32; - u32 index = svm->vmcb->control.exit_info_2 & 0xFF; - struct kvm_lapic *apic = svm->vcpu.arch.apic; - - trace_kvm_avic_incomplete_ipi(svm->vcpu.vcpu_id, icrh, icrl, id, index); - - switch (id) { - case AVIC_IPI_FAILURE_INVALID_INT_TYPE: - /* - * AVIC hardware handles the generation of - * IPIs when the specified Message Type is Fixed - * (also known as fixed delivery mode) and - * the Trigger Mode is edge-triggered. The hardware - * also supports self and broadcast delivery modes - * specified via the Destination Shorthand(DSH) - * field of the ICRL. Logical and physical APIC ID - * formats are supported. All other IPI types cause - * a #VMEXIT, which needs to emulated. - */ - kvm_lapic_reg_write(apic, APIC_ICR2, icrh); - kvm_lapic_reg_write(apic, APIC_ICR, icrl); - break; - case AVIC_IPI_FAILURE_TARGET_NOT_RUNNING: { - int i; - struct kvm_vcpu *vcpu; - struct kvm *kvm = svm->vcpu.kvm; - struct kvm_lapic *apic = svm->vcpu.arch.apic; - - /* - * At this point, we expect that the AVIC HW has already - * set the appropriate IRR bits on the valid target - * vcpus. So, we just need to kick the appropriate vcpu. - */ - kvm_for_each_vcpu(i, vcpu, kvm) { - bool m = kvm_apic_match_dest(vcpu, apic, - icrl & APIC_SHORT_MASK, - GET_APIC_DEST_FIELD(icrh), - icrl & APIC_DEST_MASK); - - if (m && !avic_vcpu_is_running(vcpu)) - kvm_vcpu_wake_up(vcpu); - } - break; - } - case AVIC_IPI_FAILURE_INVALID_TARGET: - WARN_ONCE(1, "Invalid IPI target: index=%u, vcpu=%d, icr=%#0x:%#0x\n", - index, svm->vcpu.vcpu_id, icrh, icrl); - break; - case AVIC_IPI_FAILURE_INVALID_BACKING_PAGE: - WARN_ONCE(1, "Invalid backing page\n"); - break; - default: - pr_err("Unknown IPI interception\n"); - } - - return 1; -} - -static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool flat) -{ - struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm); - int index; - u32 *logical_apic_id_table; - int dlid = GET_APIC_LOGICAL_ID(ldr); - - if (!dlid) - return NULL; - - if (flat) { /* flat */ - index = ffs(dlid) - 1; - if (index > 7) - return NULL; - } else { /* cluster */ - int cluster = (dlid & 0xf0) >> 4; - int apic = ffs(dlid & 0x0f) - 1; - - if ((apic < 0) || (apic > 7) || - (cluster >= 0xf)) - return NULL; - index = (cluster << 2) + apic; - } - - logical_apic_id_table = (u32 *) page_address(kvm_svm->avic_logical_id_table_page); - - return &logical_apic_id_table[index]; -} - -static int avic_ldr_write(struct kvm_vcpu *vcpu, u8 g_physical_id, u32 ldr) -{ - bool flat; - u32 *entry, new_entry; - - flat = kvm_lapic_get_reg(vcpu->arch.apic, APIC_DFR) == APIC_DFR_FLAT; - entry = avic_get_logical_id_entry(vcpu, ldr, flat); - if (!entry) - return -EINVAL; - - new_entry = READ_ONCE(*entry); - new_entry &= ~AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK; - new_entry |= (g_physical_id & AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK); - new_entry |= AVIC_LOGICAL_ID_ENTRY_VALID_MASK; - WRITE_ONCE(*entry, new_entry); - - return 0; -} - -static void avic_invalidate_logical_id_entry(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - bool flat = svm->dfr_reg == APIC_DFR_FLAT; - u32 *entry = avic_get_logical_id_entry(vcpu, svm->ldr_reg, flat); - - if (entry) - clear_bit(AVIC_LOGICAL_ID_ENTRY_VALID_BIT, (unsigned long *)entry); -} - -static int avic_handle_ldr_update(struct kvm_vcpu *vcpu) -{ - int ret = 0; - struct vcpu_svm *svm = to_svm(vcpu); - u32 ldr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_LDR); - u32 id = kvm_xapic_id(vcpu->arch.apic); - - if (ldr == svm->ldr_reg) - return 0; - - avic_invalidate_logical_id_entry(vcpu); - - if (ldr) - ret = avic_ldr_write(vcpu, id, ldr); - - if (!ret) - svm->ldr_reg = ldr; - - return ret; -} - -static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu) -{ - u64 *old, *new; - struct vcpu_svm *svm = to_svm(vcpu); - u32 id = kvm_xapic_id(vcpu->arch.apic); - - if (vcpu->vcpu_id == id) - return 0; - - old = avic_get_physical_id_entry(vcpu, vcpu->vcpu_id); - new = avic_get_physical_id_entry(vcpu, id); - if (!new || !old) - return 1; - - /* We need to move physical_id_entry to new offset */ - *new = *old; - *old = 0ULL; - to_svm(vcpu)->avic_physical_id_cache = new; - - /* - * Also update the guest physical APIC ID in the logical - * APIC ID table entry if already setup the LDR. - */ - if (svm->ldr_reg) - avic_handle_ldr_update(vcpu); - - return 0; -} - -static void avic_handle_dfr_update(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - u32 dfr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_DFR); - - if (svm->dfr_reg == dfr) - return; - - avic_invalidate_logical_id_entry(vcpu); - svm->dfr_reg = dfr; -} - -static int avic_unaccel_trap_write(struct vcpu_svm *svm) -{ - struct kvm_lapic *apic = svm->vcpu.arch.apic; - u32 offset = svm->vmcb->control.exit_info_1 & - AVIC_UNACCEL_ACCESS_OFFSET_MASK; - - switch (offset) { - case APIC_ID: - if (avic_handle_apic_id_update(&svm->vcpu)) - return 0; - break; - case APIC_LDR: - if (avic_handle_ldr_update(&svm->vcpu)) - return 0; - break; - case APIC_DFR: - avic_handle_dfr_update(&svm->vcpu); - break; - default: - break; - } - - kvm_lapic_reg_write(apic, offset, kvm_lapic_get_reg(apic, offset)); - - return 1; -} - -static bool is_avic_unaccelerated_access_trap(u32 offset) -{ - bool ret = false; - - switch (offset) { - case APIC_ID: - case APIC_EOI: - case APIC_RRR: - case APIC_LDR: - case APIC_DFR: - case APIC_SPIV: - case APIC_ESR: - case APIC_ICR: - case APIC_LVTT: - case APIC_LVTTHMR: - case APIC_LVTPC: - case APIC_LVT0: - case APIC_LVT1: - case APIC_LVTERR: - case APIC_TMICT: - case APIC_TDCR: - ret = true; - break; - default: - break; - } - return ret; -} - -static int avic_unaccelerated_access_interception(struct vcpu_svm *svm) -{ - int ret = 0; - u32 offset = svm->vmcb->control.exit_info_1 & - AVIC_UNACCEL_ACCESS_OFFSET_MASK; - u32 vector = svm->vmcb->control.exit_info_2 & - AVIC_UNACCEL_ACCESS_VECTOR_MASK; - bool write = (svm->vmcb->control.exit_info_1 >> 32) & - AVIC_UNACCEL_ACCESS_WRITE_MASK; - bool trap = is_avic_unaccelerated_access_trap(offset); - - trace_kvm_avic_unaccelerated_access(svm->vcpu.vcpu_id, offset, - trap, write, vector); - if (trap) { - /* Handling Trap */ - WARN_ONCE(!write, "svm: Handling trap read.\n"); - ret = avic_unaccel_trap_write(svm); - } else { - /* Handling Fault */ - ret = kvm_emulate_instruction(&svm->vcpu, 0); - } - - return ret; -} - static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_READ_CR0] = cr_interception, [SVM_EXIT_READ_CR3] = cr_interception, @@ -4074,324 +3369,6 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) set_cr_intercept(svm, INTERCEPT_CR8_WRITE); } -static void svm_set_virtual_apic_mode(struct kvm_vcpu *vcpu) -{ - return; -} - -static void svm_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr) -{ -} - -static void svm_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr) -{ -} - -static void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate) -{ - if (!avic || !lapic_in_kernel(vcpu)) - return; - - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx); - kvm_request_apicv_update(vcpu->kvm, activate, - APICV_INHIBIT_REASON_IRQWIN); - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu); -} - -static int svm_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate) -{ - int ret = 0; - unsigned long flags; - struct amd_svm_iommu_ir *ir; - struct vcpu_svm *svm = to_svm(vcpu); - - if (!kvm_arch_has_assigned_device(vcpu->kvm)) - return 0; - - /* - * Here, we go through the per-vcpu ir_list to update all existing - * interrupt remapping table entry targeting this vcpu. - */ - spin_lock_irqsave(&svm->ir_list_lock, flags); - - if (list_empty(&svm->ir_list)) - goto out; - - list_for_each_entry(ir, &svm->ir_list, node) { - if (activate) - ret = amd_iommu_activate_guest_mode(ir->data); - else - ret = amd_iommu_deactivate_guest_mode(ir->data); - if (ret) - break; - } -out: - spin_unlock_irqrestore(&svm->ir_list_lock, flags); - return ret; -} - -static void svm_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) -{ - struct vcpu_svm *svm = to_svm(vcpu); - struct vmcb *vmcb = svm->vmcb; - bool activated = kvm_vcpu_apicv_active(vcpu); - - if (!avic) - return; - - if (activated) { - /** - * During AVIC temporary deactivation, guest could update - * APIC ID, DFR and LDR registers, which would not be trapped - * by avic_unaccelerated_access_interception(). In this case, - * we need to check and update the AVIC logical APIC ID table - * accordingly before re-activating. - */ - avic_post_state_restore(vcpu); - vmcb->control.int_ctl |= AVIC_ENABLE_MASK; - } else { - vmcb->control.int_ctl &= ~AVIC_ENABLE_MASK; - } - mark_dirty(vmcb, VMCB_AVIC); - - svm_set_pi_irte_mode(vcpu, activated); -} - -static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) -{ - return; -} - -static int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec) -{ - if (!vcpu->arch.apicv_active) - return -1; - - kvm_lapic_set_irr(vec, vcpu->arch.apic); - smp_mb__after_atomic(); - - if (avic_vcpu_is_running(vcpu)) { - int cpuid = vcpu->cpu; - - if (cpuid != get_cpu()) - wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpuid)); - put_cpu(); - } else - kvm_vcpu_wake_up(vcpu); - - return 0; -} - -static bool svm_dy_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu) -{ - return false; -} - -static void svm_ir_list_del(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi) -{ - unsigned long flags; - struct amd_svm_iommu_ir *cur; - - spin_lock_irqsave(&svm->ir_list_lock, flags); - list_for_each_entry(cur, &svm->ir_list, node) { - if (cur->data != pi->ir_data) - continue; - list_del(&cur->node); - kfree(cur); - break; - } - spin_unlock_irqrestore(&svm->ir_list_lock, flags); -} - -static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi) -{ - int ret = 0; - unsigned long flags; - struct amd_svm_iommu_ir *ir; - - /** - * In some cases, the existing irte is updaed and re-set, - * so we need to check here if it's already been * added - * to the ir_list. - */ - if (pi->ir_data && (pi->prev_ga_tag != 0)) { - struct kvm *kvm = svm->vcpu.kvm; - u32 vcpu_id = AVIC_GATAG_TO_VCPUID(pi->prev_ga_tag); - struct kvm_vcpu *prev_vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id); - struct vcpu_svm *prev_svm; - - if (!prev_vcpu) { - ret = -EINVAL; - goto out; - } - - prev_svm = to_svm(prev_vcpu); - svm_ir_list_del(prev_svm, pi); - } - - /** - * Allocating new amd_iommu_pi_data, which will get - * add to the per-vcpu ir_list. - */ - ir = kzalloc(sizeof(struct amd_svm_iommu_ir), GFP_KERNEL_ACCOUNT); - if (!ir) { - ret = -ENOMEM; - goto out; - } - ir->data = pi->ir_data; - - spin_lock_irqsave(&svm->ir_list_lock, flags); - list_add(&ir->node, &svm->ir_list); - spin_unlock_irqrestore(&svm->ir_list_lock, flags); -out: - return ret; -} - -/** - * Note: - * The HW cannot support posting multicast/broadcast - * interrupts to a vCPU. So, we still use legacy interrupt - * remapping for these kind of interrupts. - * - * For lowest-priority interrupts, we only support - * those with single CPU as the destination, e.g. user - * configures the interrupts via /proc/irq or uses - * irqbalance to make the interrupts single-CPU. - */ -static int -get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e, - struct vcpu_data *vcpu_info, struct vcpu_svm **svm) -{ - struct kvm_lapic_irq irq; - struct kvm_vcpu *vcpu = NULL; - - kvm_set_msi_irq(kvm, e, &irq); - - if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) || - !kvm_irq_is_postable(&irq)) { - pr_debug("SVM: %s: use legacy intr remap mode for irq %u\n", - __func__, irq.vector); - return -1; - } - - pr_debug("SVM: %s: use GA mode for irq %u\n", __func__, - irq.vector); - *svm = to_svm(vcpu); - vcpu_info->pi_desc_addr = __sme_set(page_to_phys((*svm)->avic_backing_page)); - vcpu_info->vector = irq.vector; - - return 0; -} - -/* - * svm_update_pi_irte - set IRTE for Posted-Interrupts - * - * @kvm: kvm - * @host_irq: host irq of the interrupt - * @guest_irq: gsi of the interrupt - * @set: set or unset PI - * returns 0 on success, < 0 on failure - */ -static int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq, - uint32_t guest_irq, bool set) -{ - struct kvm_kernel_irq_routing_entry *e; - struct kvm_irq_routing_table *irq_rt; - int idx, ret = -EINVAL; - - if (!kvm_arch_has_assigned_device(kvm) || - !irq_remapping_cap(IRQ_POSTING_CAP)) - return 0; - - pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n", - __func__, host_irq, guest_irq, set); - - idx = srcu_read_lock(&kvm->irq_srcu); - irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu); - WARN_ON(guest_irq >= irq_rt->nr_rt_entries); - - hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) { - struct vcpu_data vcpu_info; - struct vcpu_svm *svm = NULL; - - if (e->type != KVM_IRQ_ROUTING_MSI) - continue; - - /** - * Here, we setup with legacy mode in the following cases: - * 1. When cannot target interrupt to a specific vcpu. - * 2. Unsetting posted interrupt. - * 3. APIC virtialization is disabled for the vcpu. - * 4. IRQ has incompatible delivery mode (SMI, INIT, etc) - */ - if (!get_pi_vcpu_info(kvm, e, &vcpu_info, &svm) && set && - kvm_vcpu_apicv_active(&svm->vcpu)) { - struct amd_iommu_pi_data pi; - - /* Try to enable guest_mode in IRTE */ - pi.base = __sme_set(page_to_phys(svm->avic_backing_page) & - AVIC_HPA_MASK); - pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, - svm->vcpu.vcpu_id); - pi.is_guest_mode = true; - pi.vcpu_data = &vcpu_info; - ret = irq_set_vcpu_affinity(host_irq, &pi); - - /** - * Here, we successfully setting up vcpu affinity in - * IOMMU guest mode. Now, we need to store the posted - * interrupt information in a per-vcpu ir_list so that - * we can reference to them directly when we update vcpu - * scheduling information in IOMMU irte. - */ - if (!ret && pi.is_guest_mode) - svm_ir_list_add(svm, &pi); - } else { - /* Use legacy mode in IRTE */ - struct amd_iommu_pi_data pi; - - /** - * Here, pi is used to: - * - Tell IOMMU to use legacy mode for this interrupt. - * - Retrieve ga_tag of prior interrupt remapping data. - */ - pi.is_guest_mode = false; - ret = irq_set_vcpu_affinity(host_irq, &pi); - - /** - * Check if the posted interrupt was previously - * setup with the guest_mode by checking if the ga_tag - * was cached. If so, we need to clean up the per-vcpu - * ir_list. - */ - if (!ret && pi.prev_ga_tag) { - int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag); - struct kvm_vcpu *vcpu; - - vcpu = kvm_get_vcpu_by_id(kvm, id); - if (vcpu) - svm_ir_list_del(to_svm(vcpu), &pi); - } - } - - if (!ret && svm) { - trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id, - e->gsi, vcpu_info.vector, - vcpu_info.pi_desc_addr, set); - } - - if (ret < 0) { - pr_err("%s: failed to update PI IRTE\n", __func__); - goto out; - } - } - - ret = 0; -out: - srcu_read_unlock(&kvm->irq_srcu, idx); - return ret; -} - static int svm_nmi_allowed(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -5159,14 +4136,6 @@ static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu) shrink_ple_window(vcpu); } -static inline void avic_post_state_restore(struct kvm_vcpu *vcpu) -{ - if (avic_handle_apic_id_update(vcpu) != 0) - return; - avic_handle_dfr_update(vcpu); - avic_handle_ldr_update(vcpu); -} - static void svm_setup_mce(struct kvm_vcpu *vcpu) { /* [63:9] are reserved. */ @@ -6214,23 +5183,6 @@ static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu) (svm->vmcb->control.intercept & (1ULL << INTERCEPT_INIT)); } -static bool svm_check_apicv_inhibit_reasons(ulong bit) -{ - ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) | - BIT(APICV_INHIBIT_REASON_HYPERV) | - BIT(APICV_INHIBIT_REASON_NESTED) | - BIT(APICV_INHIBIT_REASON_IRQWIN) | - BIT(APICV_INHIBIT_REASON_PIT_REINJ) | - BIT(APICV_INHIBIT_REASON_X2APIC); - - return supported & BIT(bit); -} - -static void svm_pre_update_apicv_exec_ctrl(struct kvm *kvm, bool activate) -{ - avic_update_access_page(kvm, activate); -} - static struct kvm_x86_ops svm_x86_ops __initdata = { .hardware_unsetup = svm_hardware_teardown, .hardware_enable = svm_hardware_enable, diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index f4c446d7a31e..c7abc1fede97 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -173,6 +173,11 @@ struct vcpu_svm { void recalc_intercepts(struct vcpu_svm *svm); +static inline struct kvm_svm *to_kvm_svm(struct kvm *kvm) +{ + return container_of(kvm, struct kvm_svm, kvm); +} + static inline void mark_all_dirty(struct vmcb *vmcb) { vmcb->control.clean = 0; @@ -378,4 +383,61 @@ int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, int svm_check_nested_events(struct kvm_vcpu *vcpu); int nested_svm_exit_special(struct vcpu_svm *svm); +/* avic.c */ + +#define AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK (0xFF) +#define AVIC_LOGICAL_ID_ENTRY_VALID_BIT 31 +#define AVIC_LOGICAL_ID_ENTRY_VALID_MASK (1 << 31) + +#define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK (0xFFULL) +#define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK (0xFFFFFFFFFFULL << 12) +#define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK (1ULL << 62) +#define AVIC_PHYSICAL_ID_ENTRY_VALID_MASK (1ULL << 63) + +#define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL + +extern int avic; + +static inline void avic_update_vapic_bar(struct vcpu_svm *svm, u64 data) +{ + svm->vmcb->control.avic_vapic_bar = data & VMCB_AVIC_APIC_BAR_MASK; + mark_dirty(svm->vmcb, VMCB_AVIC); +} + +static inline bool avic_vcpu_is_running(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + u64 *entry = svm->avic_physical_id_cache; + + if (!entry) + return false; + + return (READ_ONCE(*entry) & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK); +} + +int avic_ga_log_notifier(u32 ga_tag); +void avic_vm_destroy(struct kvm *kvm); +int avic_vm_init(struct kvm *kvm); +void avic_init_vmcb(struct vcpu_svm *svm); +void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate); +int avic_incomplete_ipi_interception(struct vcpu_svm *svm); +int avic_unaccelerated_access_interception(struct vcpu_svm *svm); +int avic_init_vcpu(struct vcpu_svm *svm); +void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu); +void avic_vcpu_put(struct kvm_vcpu *vcpu); +void avic_post_state_restore(struct kvm_vcpu *vcpu); +void svm_set_virtual_apic_mode(struct kvm_vcpu *vcpu); +void svm_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu); +bool svm_check_apicv_inhibit_reasons(ulong bit); +void svm_pre_update_apicv_exec_ctrl(struct kvm *kvm, bool activate); +void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); +void svm_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr); +void svm_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr); +int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec); +bool svm_dy_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu); +int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq, + uint32_t guest_irq, bool set); +void svm_vcpu_blocking(struct kvm_vcpu *vcpu); +void svm_vcpu_unblocking(struct kvm_vcpu *vcpu); + #endif -- cgit From eaf78265a4ab33935d3a0f1407ce4a91aac4d4d5 Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Tue, 24 Mar 2020 10:41:54 +0100 Subject: KVM: SVM: Move SEV code to separate file Move the SEV specific parts of svm.c into the new sev.c file. Signed-off-by: Joerg Roedel Message-Id: <20200324094154.32352-5-joro@8bytes.org> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/Makefile | 2 +- arch/x86/kvm/svm/sev.c | 1187 +++++++++++++++++++++++++++++++++++++++++++++ arch/x86/kvm/svm/svm.c | 1241 +----------------------------------------------- arch/x86/kvm/svm/svm.h | 48 ++ 4 files changed, 1257 insertions(+), 1221 deletions(-) create mode 100644 arch/x86/kvm/svm/sev.c (limited to 'arch/x86') diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index 31aca07d8f12..e5a71aa0967b 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -14,7 +14,7 @@ kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \ hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o -kvm-amd-y += svm/svm.o svm/pmu.o svm/nested.o svm/avic.o +kvm-amd-y += svm/svm.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o obj-$(CONFIG_KVM) += kvm.o obj-$(CONFIG_KVM_INTEL) += kvm-intel.o diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c new file mode 100644 index 000000000000..0e3fc311d7da --- /dev/null +++ b/arch/x86/kvm/svm/sev.c @@ -0,0 +1,1187 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Kernel-based Virtual Machine driver for Linux + * + * AMD SVM-SEV support + * + * Copyright 2010 Red Hat, Inc. and/or its affiliates. + */ + +#include +#include +#include +#include +#include +#include + +#include "x86.h" +#include "svm.h" + +static int sev_flush_asids(void); +static DECLARE_RWSEM(sev_deactivate_lock); +static DEFINE_MUTEX(sev_bitmap_lock); +unsigned int max_sev_asid; +static unsigned int min_sev_asid; +static unsigned long *sev_asid_bitmap; +static unsigned long *sev_reclaim_asid_bitmap; +#define __sme_page_pa(x) __sme_set(page_to_pfn(x) << PAGE_SHIFT) + +struct enc_region { + struct list_head list; + unsigned long npages; + struct page **pages; + unsigned long uaddr; + unsigned long size; +}; + +static int sev_flush_asids(void) +{ + int ret, error = 0; + + /* + * DEACTIVATE will clear the WBINVD indicator causing DF_FLUSH to fail, + * so it must be guarded. + */ + down_write(&sev_deactivate_lock); + + wbinvd_on_all_cpus(); + ret = sev_guest_df_flush(&error); + + up_write(&sev_deactivate_lock); + + if (ret) + pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error); + + return ret; +} + +/* Must be called with the sev_bitmap_lock held */ +static bool __sev_recycle_asids(void) +{ + int pos; + + /* Check if there are any ASIDs to reclaim before performing a flush */ + pos = find_next_bit(sev_reclaim_asid_bitmap, + max_sev_asid, min_sev_asid - 1); + if (pos >= max_sev_asid) + return false; + + if (sev_flush_asids()) + return false; + + bitmap_xor(sev_asid_bitmap, sev_asid_bitmap, sev_reclaim_asid_bitmap, + max_sev_asid); + bitmap_zero(sev_reclaim_asid_bitmap, max_sev_asid); + + return true; +} + +static int sev_asid_new(void) +{ + bool retry = true; + int pos; + + mutex_lock(&sev_bitmap_lock); + + /* + * SEV-enabled guest must use asid from min_sev_asid to max_sev_asid. + */ +again: + pos = find_next_zero_bit(sev_asid_bitmap, max_sev_asid, min_sev_asid - 1); + if (pos >= max_sev_asid) { + if (retry && __sev_recycle_asids()) { + retry = false; + goto again; + } + mutex_unlock(&sev_bitmap_lock); + return -EBUSY; + } + + __set_bit(pos, sev_asid_bitmap); + + mutex_unlock(&sev_bitmap_lock); + + return pos + 1; +} + +static int sev_get_asid(struct kvm *kvm) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + + return sev->asid; +} + +static void sev_asid_free(int asid) +{ + struct svm_cpu_data *sd; + int cpu, pos; + + mutex_lock(&sev_bitmap_lock); + + pos = asid - 1; + __set_bit(pos, sev_reclaim_asid_bitmap); + + for_each_possible_cpu(cpu) { + sd = per_cpu(svm_data, cpu); + sd->sev_vmcbs[pos] = NULL; + } + + mutex_unlock(&sev_bitmap_lock); +} + +static void sev_unbind_asid(struct kvm *kvm, unsigned int handle) +{ + struct sev_data_decommission *decommission; + struct sev_data_deactivate *data; + + if (!handle) + return; + + data = kzalloc(sizeof(*data), GFP_KERNEL); + if (!data) + return; + + /* deactivate handle */ + data->handle = handle; + + /* Guard DEACTIVATE against WBINVD/DF_FLUSH used in ASID recycling */ + down_read(&sev_deactivate_lock); + sev_guest_deactivate(data, NULL); + up_read(&sev_deactivate_lock); + + kfree(data); + + decommission = kzalloc(sizeof(*decommission), GFP_KERNEL); + if (!decommission) + return; + + /* decommission handle */ + decommission->handle = handle; + sev_guest_decommission(decommission, NULL); + + kfree(decommission); +} + +static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + int asid, ret; + + ret = -EBUSY; + if (unlikely(sev->active)) + return ret; + + asid = sev_asid_new(); + if (asid < 0) + return ret; + + ret = sev_platform_init(&argp->error); + if (ret) + goto e_free; + + sev->active = true; + sev->asid = asid; + INIT_LIST_HEAD(&sev->regions_list); + + return 0; + +e_free: + sev_asid_free(asid); + return ret; +} + +static int sev_bind_asid(struct kvm *kvm, unsigned int handle, int *error) +{ + struct sev_data_activate *data; + int asid = sev_get_asid(kvm); + int ret; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + return -ENOMEM; + + /* activate ASID on the given handle */ + data->handle = handle; + data->asid = asid; + ret = sev_guest_activate(data, error); + kfree(data); + + return ret; +} + +static int __sev_issue_cmd(int fd, int id, void *data, int *error) +{ + struct fd f; + int ret; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = sev_issue_cmd_external_user(f.file, id, data, error); + + fdput(f); + return ret; +} + +static int sev_issue_cmd(struct kvm *kvm, int id, void *data, int *error) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + + return __sev_issue_cmd(sev->fd, id, data, error); +} + +static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct sev_data_launch_start *start; + struct kvm_sev_launch_start params; + void *dh_blob, *session_blob; + int *error = &argp->error; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) + return -EFAULT; + + start = kzalloc(sizeof(*start), GFP_KERNEL_ACCOUNT); + if (!start) + return -ENOMEM; + + dh_blob = NULL; + if (params.dh_uaddr) { + dh_blob = psp_copy_user_blob(params.dh_uaddr, params.dh_len); + if (IS_ERR(dh_blob)) { + ret = PTR_ERR(dh_blob); + goto e_free; + } + + start->dh_cert_address = __sme_set(__pa(dh_blob)); + start->dh_cert_len = params.dh_len; + } + + session_blob = NULL; + if (params.session_uaddr) { + session_blob = psp_copy_user_blob(params.session_uaddr, params.session_len); + if (IS_ERR(session_blob)) { + ret = PTR_ERR(session_blob); + goto e_free_dh; + } + + start->session_address = __sme_set(__pa(session_blob)); + start->session_len = params.session_len; + } + + start->handle = params.handle; + start->policy = params.policy; + + /* create memory encryption context */ + ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_LAUNCH_START, start, error); + if (ret) + goto e_free_session; + + /* Bind ASID to this guest */ + ret = sev_bind_asid(kvm, start->handle, error); + if (ret) + goto e_free_session; + + /* return handle to userspace */ + params.handle = start->handle; + if (copy_to_user((void __user *)(uintptr_t)argp->data, ¶ms, sizeof(params))) { + sev_unbind_asid(kvm, start->handle); + ret = -EFAULT; + goto e_free_session; + } + + sev->handle = start->handle; + sev->fd = argp->sev_fd; + +e_free_session: + kfree(session_blob); +e_free_dh: + kfree(dh_blob); +e_free: + kfree(start); + return ret; +} + +static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr, + unsigned long ulen, unsigned long *n, + int write) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + unsigned long npages, npinned, size; + unsigned long locked, lock_limit; + struct page **pages; + unsigned long first, last; + + if (ulen == 0 || uaddr + ulen < uaddr) + return NULL; + + /* Calculate number of pages. */ + first = (uaddr & PAGE_MASK) >> PAGE_SHIFT; + last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT; + npages = (last - first + 1); + + locked = sev->pages_locked + npages; + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) { + pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit); + return NULL; + } + + /* Avoid using vmalloc for smaller buffers. */ + size = npages * sizeof(struct page *); + if (size > PAGE_SIZE) + pages = __vmalloc(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO, + PAGE_KERNEL); + else + pages = kmalloc(size, GFP_KERNEL_ACCOUNT); + + if (!pages) + return NULL; + + /* Pin the user virtual address. */ + npinned = get_user_pages_fast(uaddr, npages, FOLL_WRITE, pages); + if (npinned != npages) { + pr_err("SEV: Failure locking %lu pages.\n", npages); + goto err; + } + + *n = npages; + sev->pages_locked = locked; + + return pages; + +err: + if (npinned > 0) + release_pages(pages, npinned); + + kvfree(pages); + return NULL; +} + +static void sev_unpin_memory(struct kvm *kvm, struct page **pages, + unsigned long npages) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + + release_pages(pages, npages); + kvfree(pages); + sev->pages_locked -= npages; +} + +static void sev_clflush_pages(struct page *pages[], unsigned long npages) +{ + uint8_t *page_virtual; + unsigned long i; + + if (npages == 0 || pages == NULL) + return; + + for (i = 0; i < npages; i++) { + page_virtual = kmap_atomic(pages[i]); + clflush_cache_range(page_virtual, PAGE_SIZE); + kunmap_atomic(page_virtual); + } +} + +static unsigned long get_num_contig_pages(unsigned long idx, + struct page **inpages, unsigned long npages) +{ + unsigned long paddr, next_paddr; + unsigned long i = idx + 1, pages = 1; + + /* find the number of contiguous pages starting from idx */ + paddr = __sme_page_pa(inpages[idx]); + while (i < npages) { + next_paddr = __sme_page_pa(inpages[i++]); + if ((paddr + PAGE_SIZE) == next_paddr) { + pages++; + paddr = next_paddr; + continue; + } + break; + } + + return pages; +} + +static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i; + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct kvm_sev_launch_update_data params; + struct sev_data_launch_update_data *data; + struct page **inpages; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) + return -EFAULT; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + return -ENOMEM; + + vaddr = params.uaddr; + size = params.len; + vaddr_end = vaddr + size; + + /* Lock the user memory. */ + inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1); + if (!inpages) { + ret = -ENOMEM; + goto e_free; + } + + /* + * The LAUNCH_UPDATE command will perform in-place encryption of the + * memory content (i.e it will write the same memory region with C=1). + * It's possible that the cache may contain the data with C=0, i.e., + * unencrypted so invalidate it first. + */ + sev_clflush_pages(inpages, npages); + + for (i = 0; vaddr < vaddr_end; vaddr = next_vaddr, i += pages) { + int offset, len; + + /* + * If the user buffer is not page-aligned, calculate the offset + * within the page. + */ + offset = vaddr & (PAGE_SIZE - 1); + + /* Calculate the number of pages that can be encrypted in one go. */ + pages = get_num_contig_pages(i, inpages, npages); + + len = min_t(size_t, ((pages * PAGE_SIZE) - offset), size); + + data->handle = sev->handle; + data->len = len; + data->address = __sme_page_pa(inpages[i]) + offset; + ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_DATA, data, &argp->error); + if (ret) + goto e_unpin; + + size -= len; + next_vaddr = vaddr + len; + } + +e_unpin: + /* content of memory is updated, mark pages dirty */ + for (i = 0; i < npages; i++) { + set_page_dirty_lock(inpages[i]); + mark_page_accessed(inpages[i]); + } + /* unlock the user pages */ + sev_unpin_memory(kvm, inpages, npages); +e_free: + kfree(data); + return ret; +} + +static int sev_launch_measure(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + void __user *measure = (void __user *)(uintptr_t)argp->data; + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct sev_data_launch_measure *data; + struct kvm_sev_launch_measure params; + void __user *p = NULL; + void *blob = NULL; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (copy_from_user(¶ms, measure, sizeof(params))) + return -EFAULT; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + return -ENOMEM; + + /* User wants to query the blob length */ + if (!params.len) + goto cmd; + + p = (void __user *)(uintptr_t)params.uaddr; + if (p) { + if (params.len > SEV_FW_BLOB_MAX_SIZE) { + ret = -EINVAL; + goto e_free; + } + + ret = -ENOMEM; + blob = kmalloc(params.len, GFP_KERNEL); + if (!blob) + goto e_free; + + data->address = __psp_pa(blob); + data->len = params.len; + } + +cmd: + data->handle = sev->handle; + ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_MEASURE, data, &argp->error); + + /* + * If we query the session length, FW responded with expected data. + */ + if (!params.len) + goto done; + + if (ret) + goto e_free_blob; + + if (blob) { + if (copy_to_user(p, blob, params.len)) + ret = -EFAULT; + } + +done: + params.len = data->len; + if (copy_to_user(measure, ¶ms, sizeof(params))) + ret = -EFAULT; +e_free_blob: + kfree(blob); +e_free: + kfree(data); + return ret; +} + +static int sev_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct sev_data_launch_finish *data; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + return -ENOMEM; + + data->handle = sev->handle; + ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_FINISH, data, &argp->error); + + kfree(data); + return ret; +} + +static int sev_guest_status(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct kvm_sev_guest_status params; + struct sev_data_guest_status *data; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + return -ENOMEM; + + data->handle = sev->handle; + ret = sev_issue_cmd(kvm, SEV_CMD_GUEST_STATUS, data, &argp->error); + if (ret) + goto e_free; + + params.policy = data->policy; + params.state = data->state; + params.handle = data->handle; + + if (copy_to_user((void __user *)(uintptr_t)argp->data, ¶ms, sizeof(params))) + ret = -EFAULT; +e_free: + kfree(data); + return ret; +} + +static int __sev_issue_dbg_cmd(struct kvm *kvm, unsigned long src, + unsigned long dst, int size, + int *error, bool enc) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct sev_data_dbg *data; + int ret; + + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + return -ENOMEM; + + data->handle = sev->handle; + data->dst_addr = dst; + data->src_addr = src; + data->len = size; + + ret = sev_issue_cmd(kvm, + enc ? SEV_CMD_DBG_ENCRYPT : SEV_CMD_DBG_DECRYPT, + data, error); + kfree(data); + return ret; +} + +static int __sev_dbg_decrypt(struct kvm *kvm, unsigned long src_paddr, + unsigned long dst_paddr, int sz, int *err) +{ + int offset; + + /* + * Its safe to read more than we are asked, caller should ensure that + * destination has enough space. + */ + src_paddr = round_down(src_paddr, 16); + offset = src_paddr & 15; + sz = round_up(sz + offset, 16); + + return __sev_issue_dbg_cmd(kvm, src_paddr, dst_paddr, sz, err, false); +} + +static int __sev_dbg_decrypt_user(struct kvm *kvm, unsigned long paddr, + unsigned long __user dst_uaddr, + unsigned long dst_paddr, + int size, int *err) +{ + struct page *tpage = NULL; + int ret, offset; + + /* if inputs are not 16-byte then use intermediate buffer */ + if (!IS_ALIGNED(dst_paddr, 16) || + !IS_ALIGNED(paddr, 16) || + !IS_ALIGNED(size, 16)) { + tpage = (void *)alloc_page(GFP_KERNEL); + if (!tpage) + return -ENOMEM; + + dst_paddr = __sme_page_pa(tpage); + } + + ret = __sev_dbg_decrypt(kvm, paddr, dst_paddr, size, err); + if (ret) + goto e_free; + + if (tpage) { + offset = paddr & 15; + if (copy_to_user((void __user *)(uintptr_t)dst_uaddr, + page_address(tpage) + offset, size)) + ret = -EFAULT; + } + +e_free: + if (tpage) + __free_page(tpage); + + return ret; +} + +static int __sev_dbg_encrypt_user(struct kvm *kvm, unsigned long paddr, + unsigned long __user vaddr, + unsigned long dst_paddr, + unsigned long __user dst_vaddr, + int size, int *error) +{ + struct page *src_tpage = NULL; + struct page *dst_tpage = NULL; + int ret, len = size; + + /* If source buffer is not aligned then use an intermediate buffer */ + if (!IS_ALIGNED(vaddr, 16)) { + src_tpage = alloc_page(GFP_KERNEL); + if (!src_tpage) + return -ENOMEM; + + if (copy_from_user(page_address(src_tpage), + (void __user *)(uintptr_t)vaddr, size)) { + __free_page(src_tpage); + return -EFAULT; + } + + paddr = __sme_page_pa(src_tpage); + } + + /* + * If destination buffer or length is not aligned then do read-modify-write: + * - decrypt destination in an intermediate buffer + * - copy the source buffer in an intermediate buffer + * - use the intermediate buffer as source buffer + */ + if (!IS_ALIGNED(dst_vaddr, 16) || !IS_ALIGNED(size, 16)) { + int dst_offset; + + dst_tpage = alloc_page(GFP_KERNEL); + if (!dst_tpage) { + ret = -ENOMEM; + goto e_free; + } + + ret = __sev_dbg_decrypt(kvm, dst_paddr, + __sme_page_pa(dst_tpage), size, error); + if (ret) + goto e_free; + + /* + * If source is kernel buffer then use memcpy() otherwise + * copy_from_user(). + */ + dst_offset = dst_paddr & 15; + + if (src_tpage) + memcpy(page_address(dst_tpage) + dst_offset, + page_address(src_tpage), size); + else { + if (copy_from_user(page_address(dst_tpage) + dst_offset, + (void __user *)(uintptr_t)vaddr, size)) { + ret = -EFAULT; + goto e_free; + } + } + + paddr = __sme_page_pa(dst_tpage); + dst_paddr = round_down(dst_paddr, 16); + len = round_up(size, 16); + } + + ret = __sev_issue_dbg_cmd(kvm, paddr, dst_paddr, len, error, true); + +e_free: + if (src_tpage) + __free_page(src_tpage); + if (dst_tpage) + __free_page(dst_tpage); + return ret; +} + +static int sev_dbg_crypt(struct kvm *kvm, struct kvm_sev_cmd *argp, bool dec) +{ + unsigned long vaddr, vaddr_end, next_vaddr; + unsigned long dst_vaddr; + struct page **src_p, **dst_p; + struct kvm_sev_dbg debug; + unsigned long n; + unsigned int size; + int ret; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (copy_from_user(&debug, (void __user *)(uintptr_t)argp->data, sizeof(debug))) + return -EFAULT; + + if (!debug.len || debug.src_uaddr + debug.len < debug.src_uaddr) + return -EINVAL; + if (!debug.dst_uaddr) + return -EINVAL; + + vaddr = debug.src_uaddr; + size = debug.len; + vaddr_end = vaddr + size; + dst_vaddr = debug.dst_uaddr; + + for (; vaddr < vaddr_end; vaddr = next_vaddr) { + int len, s_off, d_off; + + /* lock userspace source and destination page */ + src_p = sev_pin_memory(kvm, vaddr & PAGE_MASK, PAGE_SIZE, &n, 0); + if (!src_p) + return -EFAULT; + + dst_p = sev_pin_memory(kvm, dst_vaddr & PAGE_MASK, PAGE_SIZE, &n, 1); + if (!dst_p) { + sev_unpin_memory(kvm, src_p, n); + return -EFAULT; + } + + /* + * The DBG_{DE,EN}CRYPT commands will perform {dec,en}cryption of the + * memory content (i.e it will write the same memory region with C=1). + * It's possible that the cache may contain the data with C=0, i.e., + * unencrypted so invalidate it first. + */ + sev_clflush_pages(src_p, 1); + sev_clflush_pages(dst_p, 1); + + /* + * Since user buffer may not be page aligned, calculate the + * offset within the page. + */ + s_off = vaddr & ~PAGE_MASK; + d_off = dst_vaddr & ~PAGE_MASK; + len = min_t(size_t, (PAGE_SIZE - s_off), size); + + if (dec) + ret = __sev_dbg_decrypt_user(kvm, + __sme_page_pa(src_p[0]) + s_off, + dst_vaddr, + __sme_page_pa(dst_p[0]) + d_off, + len, &argp->error); + else + ret = __sev_dbg_encrypt_user(kvm, + __sme_page_pa(src_p[0]) + s_off, + vaddr, + __sme_page_pa(dst_p[0]) + d_off, + dst_vaddr, + len, &argp->error); + + sev_unpin_memory(kvm, src_p, n); + sev_unpin_memory(kvm, dst_p, n); + + if (ret) + goto err; + + next_vaddr = vaddr + len; + dst_vaddr = dst_vaddr + len; + size -= len; + } +err: + return ret; +} + +static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct sev_data_launch_secret *data; + struct kvm_sev_launch_secret params; + struct page **pages; + void *blob, *hdr; + unsigned long n; + int ret, offset; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) + return -EFAULT; + + pages = sev_pin_memory(kvm, params.guest_uaddr, params.guest_len, &n, 1); + if (!pages) + return -ENOMEM; + + /* + * The secret must be copied into contiguous memory region, lets verify + * that userspace memory pages are contiguous before we issue command. + */ + if (get_num_contig_pages(0, pages, n) != n) { + ret = -EINVAL; + goto e_unpin_memory; + } + + ret = -ENOMEM; + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); + if (!data) + goto e_unpin_memory; + + offset = params.guest_uaddr & (PAGE_SIZE - 1); + data->guest_address = __sme_page_pa(pages[0]) + offset; + data->guest_len = params.guest_len; + + blob = psp_copy_user_blob(params.trans_uaddr, params.trans_len); + if (IS_ERR(blob)) { + ret = PTR_ERR(blob); + goto e_free; + } + + data->trans_address = __psp_pa(blob); + data->trans_len = params.trans_len; + + hdr = psp_copy_user_blob(params.hdr_uaddr, params.hdr_len); + if (IS_ERR(hdr)) { + ret = PTR_ERR(hdr); + goto e_free_blob; + } + data->hdr_address = __psp_pa(hdr); + data->hdr_len = params.hdr_len; + + data->handle = sev->handle; + ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_SECRET, data, &argp->error); + + kfree(hdr); + +e_free_blob: + kfree(blob); +e_free: + kfree(data); +e_unpin_memory: + sev_unpin_memory(kvm, pages, n); + return ret; +} + +int svm_mem_enc_op(struct kvm *kvm, void __user *argp) +{ + struct kvm_sev_cmd sev_cmd; + int r; + + if (!svm_sev_enabled()) + return -ENOTTY; + + if (!argp) + return 0; + + if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd))) + return -EFAULT; + + mutex_lock(&kvm->lock); + + switch (sev_cmd.id) { + case KVM_SEV_INIT: + r = sev_guest_init(kvm, &sev_cmd); + break; + case KVM_SEV_LAUNCH_START: + r = sev_launch_start(kvm, &sev_cmd); + break; + case KVM_SEV_LAUNCH_UPDATE_DATA: + r = sev_launch_update_data(kvm, &sev_cmd); + break; + case KVM_SEV_LAUNCH_MEASURE: + r = sev_launch_measure(kvm, &sev_cmd); + break; + case KVM_SEV_LAUNCH_FINISH: + r = sev_launch_finish(kvm, &sev_cmd); + break; + case KVM_SEV_GUEST_STATUS: + r = sev_guest_status(kvm, &sev_cmd); + break; + case KVM_SEV_DBG_DECRYPT: + r = sev_dbg_crypt(kvm, &sev_cmd, true); + break; + case KVM_SEV_DBG_ENCRYPT: + r = sev_dbg_crypt(kvm, &sev_cmd, false); + break; + case KVM_SEV_LAUNCH_SECRET: + r = sev_launch_secret(kvm, &sev_cmd); + break; + default: + r = -EINVAL; + goto out; + } + + if (copy_to_user(argp, &sev_cmd, sizeof(struct kvm_sev_cmd))) + r = -EFAULT; + +out: + mutex_unlock(&kvm->lock); + return r; +} + +int svm_register_enc_region(struct kvm *kvm, + struct kvm_enc_region *range) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct enc_region *region; + int ret = 0; + + if (!sev_guest(kvm)) + return -ENOTTY; + + if (range->addr > ULONG_MAX || range->size > ULONG_MAX) + return -EINVAL; + + region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT); + if (!region) + return -ENOMEM; + + region->pages = sev_pin_memory(kvm, range->addr, range->size, ®ion->npages, 1); + if (!region->pages) { + ret = -ENOMEM; + goto e_free; + } + + /* + * The guest may change the memory encryption attribute from C=0 -> C=1 + * or vice versa for this memory range. Lets make sure caches are + * flushed to ensure that guest data gets written into memory with + * correct C-bit. + */ + sev_clflush_pages(region->pages, region->npages); + + region->uaddr = range->addr; + region->size = range->size; + + mutex_lock(&kvm->lock); + list_add_tail(®ion->list, &sev->regions_list); + mutex_unlock(&kvm->lock); + + return ret; + +e_free: + kfree(region); + return ret; +} + +static struct enc_region * +find_enc_region(struct kvm *kvm, struct kvm_enc_region *range) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct list_head *head = &sev->regions_list; + struct enc_region *i; + + list_for_each_entry(i, head, list) { + if (i->uaddr == range->addr && + i->size == range->size) + return i; + } + + return NULL; +} + +static void __unregister_enc_region_locked(struct kvm *kvm, + struct enc_region *region) +{ + sev_unpin_memory(kvm, region->pages, region->npages); + list_del(®ion->list); + kfree(region); +} + +int svm_unregister_enc_region(struct kvm *kvm, + struct kvm_enc_region *range) +{ + struct enc_region *region; + int ret; + + mutex_lock(&kvm->lock); + + if (!sev_guest(kvm)) { + ret = -ENOTTY; + goto failed; + } + + region = find_enc_region(kvm, range); + if (!region) { + ret = -EINVAL; + goto failed; + } + + /* + * Ensure that all guest tagged cache entries are flushed before + * releasing the pages back to the system for use. CLFLUSH will + * not do this, so issue a WBINVD. + */ + wbinvd_on_all_cpus(); + + __unregister_enc_region_locked(kvm, region); + + mutex_unlock(&kvm->lock); + return 0; + +failed: + mutex_unlock(&kvm->lock); + return ret; +} + +void sev_vm_destroy(struct kvm *kvm) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct list_head *head = &sev->regions_list; + struct list_head *pos, *q; + + if (!sev_guest(kvm)) + return; + + mutex_lock(&kvm->lock); + + /* + * Ensure that all guest tagged cache entries are flushed before + * releasing the pages back to the system for use. CLFLUSH will + * not do this, so issue a WBINVD. + */ + wbinvd_on_all_cpus(); + + /* + * if userspace was terminated before unregistering the memory regions + * then lets unpin all the registered memory. + */ + if (!list_empty(head)) { + list_for_each_safe(pos, q, head) { + __unregister_enc_region_locked(kvm, + list_entry(pos, struct enc_region, list)); + } + } + + mutex_unlock(&kvm->lock); + + sev_unbind_asid(kvm, sev->handle); + sev_asid_free(sev->asid); +} + +int __init sev_hardware_setup(void) +{ + struct sev_user_data_status *status; + int rc; + + /* Maximum number of encrypted guests supported simultaneously */ + max_sev_asid = cpuid_ecx(0x8000001F); + + if (!max_sev_asid) + return 1; + + /* Minimum ASID value that should be used for SEV guest */ + min_sev_asid = cpuid_edx(0x8000001F); + + /* Initialize SEV ASID bitmaps */ + sev_asid_bitmap = bitmap_zalloc(max_sev_asid, GFP_KERNEL); + if (!sev_asid_bitmap) + return 1; + + sev_reclaim_asid_bitmap = bitmap_zalloc(max_sev_asid, GFP_KERNEL); + if (!sev_reclaim_asid_bitmap) + return 1; + + status = kmalloc(sizeof(*status), GFP_KERNEL); + if (!status) + return 1; + + /* + * Check SEV platform status. + * + * PLATFORM_STATUS can be called in any state, if we failed to query + * the PLATFORM status then either PSP firmware does not support SEV + * feature or SEV firmware is dead. + */ + rc = sev_platform_status(status, NULL); + if (rc) + goto err; + + pr_info("SEV supported\n"); + +err: + kfree(status); + return rc; +} + +void sev_hardware_teardown(void) +{ + bitmap_free(sev_asid_bitmap); + bitmap_free(sev_reclaim_asid_bitmap); + + sev_flush_asids(); +} + +void pre_sev_run(struct vcpu_svm *svm, int cpu) +{ + struct svm_cpu_data *sd = per_cpu(svm_data, cpu); + int asid = sev_get_asid(svm->vcpu.kvm); + + /* Assign the asid allocated with this SEV guest */ + svm->vmcb->control.asid = asid; + + /* + * Flush guest TLB: + * + * 1) when different VMCB for the same ASID is to be run on the same host CPU. + * 2) or this VMCB was executed on different host CPU in previous VMRUNs. + */ + if (sd->sev_vmcbs[asid] == svm->vmcb && + svm->last_cpu == cpu) + return; + + svm->last_cpu = cpu; + sd->sev_vmcbs[asid] = svm->vmcb; + svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID; + mark_dirty(svm->vmcb, VMCB_ASID); +} diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index e32b4956fa06..05d77b395fc1 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -196,47 +196,6 @@ static u8 rsm_ins_bytes[] = "\x0f\xaa"; static void svm_complete_interrupts(struct vcpu_svm *svm); -static int sev_flush_asids(void); -static DECLARE_RWSEM(sev_deactivate_lock); -static DEFINE_MUTEX(sev_bitmap_lock); -static unsigned int max_sev_asid; -static unsigned int min_sev_asid; -static unsigned long *sev_asid_bitmap; -static unsigned long *sev_reclaim_asid_bitmap; -#define __sme_page_pa(x) __sme_set(page_to_pfn(x) << PAGE_SHIFT) - -struct enc_region { - struct list_head list; - unsigned long npages; - struct page **pages; - unsigned long uaddr; - unsigned long size; -}; - - -static inline bool svm_sev_enabled(void) -{ - return IS_ENABLED(CONFIG_KVM_AMD_SEV) ? max_sev_asid : 0; -} - -static inline bool sev_guest(struct kvm *kvm) -{ -#ifdef CONFIG_KVM_AMD_SEV - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - - return sev->active; -#else - return false; -#endif -} - -static inline int sev_get_asid(struct kvm *kvm) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - - return sev->asid; -} - static unsigned long iopm_base; struct kvm_ldttss_desc { @@ -248,23 +207,7 @@ struct kvm_ldttss_desc { u32 zero1; } __attribute__((packed)); -struct svm_cpu_data { - int cpu; - - u64 asid_generation; - u32 max_asid; - u32 next_asid; - u32 min_asid; - struct kvm_ldttss_desc *tss_desc; - - struct page *save_area; - struct vmcb *current_vmcb; - - /* index = sev_asid, value = vmcb pointer */ - struct vmcb **sev_vmcbs; -}; - -static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data); +DEFINE_PER_CPU(struct svm_cpu_data *, svm_data); static const u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000}; @@ -763,51 +706,6 @@ void disable_nmi_singlestep(struct vcpu_svm *svm) } } -static __init int sev_hardware_setup(void) -{ - struct sev_user_data_status *status; - int rc; - - /* Maximum number of encrypted guests supported simultaneously */ - max_sev_asid = cpuid_ecx(0x8000001F); - - if (!max_sev_asid) - return 1; - - /* Minimum ASID value that should be used for SEV guest */ - min_sev_asid = cpuid_edx(0x8000001F); - - /* Initialize SEV ASID bitmaps */ - sev_asid_bitmap = bitmap_zalloc(max_sev_asid, GFP_KERNEL); - if (!sev_asid_bitmap) - return 1; - - sev_reclaim_asid_bitmap = bitmap_zalloc(max_sev_asid, GFP_KERNEL); - if (!sev_reclaim_asid_bitmap) - return 1; - - status = kmalloc(sizeof(*status), GFP_KERNEL); - if (!status) - return 1; - - /* - * Check SEV platform status. - * - * PLATFORM_STATUS can be called in any state, if we failed to query - * the PLATFORM status then either PSP firmware does not support SEV - * feature or SEV firmware is dead. - */ - rc = sev_platform_status(status, NULL); - if (rc) - goto err; - - pr_info("SEV supported\n"); - -err: - kfree(status); - return rc; -} - static void grow_ple_window(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -889,12 +787,8 @@ static void svm_hardware_teardown(void) { int cpu; - if (svm_sev_enabled()) { - bitmap_free(sev_asid_bitmap); - bitmap_free(sev_reclaim_asid_bitmap); - - sev_flush_asids(); - } + if (svm_sev_enabled()) + sev_hardware_teardown(); for_each_possible_cpu(cpu) svm_cpu_uninit(cpu); @@ -1250,199 +1144,6 @@ static void init_vmcb(struct vcpu_svm *svm) } -static void sev_asid_free(int asid) -{ - struct svm_cpu_data *sd; - int cpu, pos; - - mutex_lock(&sev_bitmap_lock); - - pos = asid - 1; - __set_bit(pos, sev_reclaim_asid_bitmap); - - for_each_possible_cpu(cpu) { - sd = per_cpu(svm_data, cpu); - sd->sev_vmcbs[pos] = NULL; - } - - mutex_unlock(&sev_bitmap_lock); -} - -static void sev_unbind_asid(struct kvm *kvm, unsigned int handle) -{ - struct sev_data_decommission *decommission; - struct sev_data_deactivate *data; - - if (!handle) - return; - - data = kzalloc(sizeof(*data), GFP_KERNEL); - if (!data) - return; - - /* deactivate handle */ - data->handle = handle; - - /* Guard DEACTIVATE against WBINVD/DF_FLUSH used in ASID recycling */ - down_read(&sev_deactivate_lock); - sev_guest_deactivate(data, NULL); - up_read(&sev_deactivate_lock); - - kfree(data); - - decommission = kzalloc(sizeof(*decommission), GFP_KERNEL); - if (!decommission) - return; - - /* decommission handle */ - decommission->handle = handle; - sev_guest_decommission(decommission, NULL); - - kfree(decommission); -} - -static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr, - unsigned long ulen, unsigned long *n, - int write) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - unsigned long npages, npinned, size; - unsigned long locked, lock_limit; - struct page **pages; - unsigned long first, last; - - if (ulen == 0 || uaddr + ulen < uaddr) - return NULL; - - /* Calculate number of pages. */ - first = (uaddr & PAGE_MASK) >> PAGE_SHIFT; - last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT; - npages = (last - first + 1); - - locked = sev->pages_locked + npages; - lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - if (locked > lock_limit && !capable(CAP_IPC_LOCK)) { - pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit); - return NULL; - } - - /* Avoid using vmalloc for smaller buffers. */ - size = npages * sizeof(struct page *); - if (size > PAGE_SIZE) - pages = __vmalloc(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO, - PAGE_KERNEL); - else - pages = kmalloc(size, GFP_KERNEL_ACCOUNT); - - if (!pages) - return NULL; - - /* Pin the user virtual address. */ - npinned = get_user_pages_fast(uaddr, npages, FOLL_WRITE, pages); - if (npinned != npages) { - pr_err("SEV: Failure locking %lu pages.\n", npages); - goto err; - } - - *n = npages; - sev->pages_locked = locked; - - return pages; - -err: - if (npinned > 0) - release_pages(pages, npinned); - - kvfree(pages); - return NULL; -} - -static void sev_unpin_memory(struct kvm *kvm, struct page **pages, - unsigned long npages) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - - release_pages(pages, npages); - kvfree(pages); - sev->pages_locked -= npages; -} - -static void sev_clflush_pages(struct page *pages[], unsigned long npages) -{ - uint8_t *page_virtual; - unsigned long i; - - if (npages == 0 || pages == NULL) - return; - - for (i = 0; i < npages; i++) { - page_virtual = kmap_atomic(pages[i]); - clflush_cache_range(page_virtual, PAGE_SIZE); - kunmap_atomic(page_virtual); - } -} - -static void __unregister_enc_region_locked(struct kvm *kvm, - struct enc_region *region) -{ - sev_unpin_memory(kvm, region->pages, region->npages); - list_del(®ion->list); - kfree(region); -} - -static void sev_vm_destroy(struct kvm *kvm) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct list_head *head = &sev->regions_list; - struct list_head *pos, *q; - - if (!sev_guest(kvm)) - return; - - mutex_lock(&kvm->lock); - - /* - * Ensure that all guest tagged cache entries are flushed before - * releasing the pages back to the system for use. CLFLUSH will - * not do this, so issue a WBINVD. - */ - wbinvd_on_all_cpus(); - - /* - * if userspace was terminated before unregistering the memory regions - * then lets unpin all the registered memory. - */ - if (!list_empty(head)) { - list_for_each_safe(pos, q, head) { - __unregister_enc_region_locked(kvm, - list_entry(pos, struct enc_region, list)); - } - } - - mutex_unlock(&kvm->lock); - - sev_unbind_asid(kvm, sev->handle); - sev_asid_free(sev->asid); -} - -static void svm_vm_destroy(struct kvm *kvm) -{ - avic_vm_destroy(kvm); - sev_vm_destroy(kvm); -} - -static int svm_vm_init(struct kvm *kvm) -{ - if (avic) { - int ret = avic_vm_init(kvm); - if (ret) - return ret; - } - - kvm_apicv_init(kvm, avic); - return 0; -} - static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) { struct vcpu_svm *svm = to_svm(vcpu); @@ -3292,30 +2993,6 @@ static void reload_tss(struct kvm_vcpu *vcpu) load_TR_desc(); } -static void pre_sev_run(struct vcpu_svm *svm, int cpu) -{ - struct svm_cpu_data *sd = per_cpu(svm_data, cpu); - int asid = sev_get_asid(svm->vcpu.kvm); - - /* Assign the asid allocated with this SEV guest */ - svm->vmcb->control.asid = asid; - - /* - * Flush guest TLB: - * - * 1) when different VMCB for the same ASID is to be run on the same host CPU. - * 2) or this VMCB was executed on different host CPU in previous VMRUNs. - */ - if (sd->sev_vmcbs[asid] == svm->vmcb && - svm->last_cpu == cpu) - return; - - svm->last_cpu = cpu; - sd->sev_vmcbs[asid] = svm->vmcb; - svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID; - mark_dirty(svm->vmcb, VMCB_ASID); -} - static void pre_svm_run(struct vcpu_svm *svm) { int cpu = raw_smp_processor_id(); @@ -4216,900 +3893,6 @@ static int enable_smi_window(struct kvm_vcpu *vcpu) return 0; } -static int sev_flush_asids(void) -{ - int ret, error; - - /* - * DEACTIVATE will clear the WBINVD indicator causing DF_FLUSH to fail, - * so it must be guarded. - */ - down_write(&sev_deactivate_lock); - - wbinvd_on_all_cpus(); - ret = sev_guest_df_flush(&error); - - up_write(&sev_deactivate_lock); - - if (ret) - pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error); - - return ret; -} - -/* Must be called with the sev_bitmap_lock held */ -static bool __sev_recycle_asids(void) -{ - int pos; - - /* Check if there are any ASIDs to reclaim before performing a flush */ - pos = find_next_bit(sev_reclaim_asid_bitmap, - max_sev_asid, min_sev_asid - 1); - if (pos >= max_sev_asid) - return false; - - if (sev_flush_asids()) - return false; - - bitmap_xor(sev_asid_bitmap, sev_asid_bitmap, sev_reclaim_asid_bitmap, - max_sev_asid); - bitmap_zero(sev_reclaim_asid_bitmap, max_sev_asid); - - return true; -} - -static int sev_asid_new(void) -{ - bool retry = true; - int pos; - - mutex_lock(&sev_bitmap_lock); - - /* - * SEV-enabled guest must use asid from min_sev_asid to max_sev_asid. - */ -again: - pos = find_next_zero_bit(sev_asid_bitmap, max_sev_asid, min_sev_asid - 1); - if (pos >= max_sev_asid) { - if (retry && __sev_recycle_asids()) { - retry = false; - goto again; - } - mutex_unlock(&sev_bitmap_lock); - return -EBUSY; - } - - __set_bit(pos, sev_asid_bitmap); - - mutex_unlock(&sev_bitmap_lock); - - return pos + 1; -} - -static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - int asid, ret; - - ret = -EBUSY; - if (unlikely(sev->active)) - return ret; - - asid = sev_asid_new(); - if (asid < 0) - return ret; - - ret = sev_platform_init(&argp->error); - if (ret) - goto e_free; - - sev->active = true; - sev->asid = asid; - INIT_LIST_HEAD(&sev->regions_list); - - return 0; - -e_free: - sev_asid_free(asid); - return ret; -} - -static int sev_bind_asid(struct kvm *kvm, unsigned int handle, int *error) -{ - struct sev_data_activate *data; - int asid = sev_get_asid(kvm); - int ret; - - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - return -ENOMEM; - - /* activate ASID on the given handle */ - data->handle = handle; - data->asid = asid; - ret = sev_guest_activate(data, error); - kfree(data); - - return ret; -} - -static int __sev_issue_cmd(int fd, int id, void *data, int *error) -{ - struct fd f; - int ret; - - f = fdget(fd); - if (!f.file) - return -EBADF; - - ret = sev_issue_cmd_external_user(f.file, id, data, error); - - fdput(f); - return ret; -} - -static int sev_issue_cmd(struct kvm *kvm, int id, void *data, int *error) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - - return __sev_issue_cmd(sev->fd, id, data, error); -} - -static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct sev_data_launch_start *start; - struct kvm_sev_launch_start params; - void *dh_blob, *session_blob; - int *error = &argp->error; - int ret; - - if (!sev_guest(kvm)) - return -ENOTTY; - - if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) - return -EFAULT; - - start = kzalloc(sizeof(*start), GFP_KERNEL_ACCOUNT); - if (!start) - return -ENOMEM; - - dh_blob = NULL; - if (params.dh_uaddr) { - dh_blob = psp_copy_user_blob(params.dh_uaddr, params.dh_len); - if (IS_ERR(dh_blob)) { - ret = PTR_ERR(dh_blob); - goto e_free; - } - - start->dh_cert_address = __sme_set(__pa(dh_blob)); - start->dh_cert_len = params.dh_len; - } - - session_blob = NULL; - if (params.session_uaddr) { - session_blob = psp_copy_user_blob(params.session_uaddr, params.session_len); - if (IS_ERR(session_blob)) { - ret = PTR_ERR(session_blob); - goto e_free_dh; - } - - start->session_address = __sme_set(__pa(session_blob)); - start->session_len = params.session_len; - } - - start->handle = params.handle; - start->policy = params.policy; - - /* create memory encryption context */ - ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_LAUNCH_START, start, error); - if (ret) - goto e_free_session; - - /* Bind ASID to this guest */ - ret = sev_bind_asid(kvm, start->handle, error); - if (ret) - goto e_free_session; - - /* return handle to userspace */ - params.handle = start->handle; - if (copy_to_user((void __user *)(uintptr_t)argp->data, ¶ms, sizeof(params))) { - sev_unbind_asid(kvm, start->handle); - ret = -EFAULT; - goto e_free_session; - } - - sev->handle = start->handle; - sev->fd = argp->sev_fd; - -e_free_session: - kfree(session_blob); -e_free_dh: - kfree(dh_blob); -e_free: - kfree(start); - return ret; -} - -static unsigned long get_num_contig_pages(unsigned long idx, - struct page **inpages, unsigned long npages) -{ - unsigned long paddr, next_paddr; - unsigned long i = idx + 1, pages = 1; - - /* find the number of contiguous pages starting from idx */ - paddr = __sme_page_pa(inpages[idx]); - while (i < npages) { - next_paddr = __sme_page_pa(inpages[i++]); - if ((paddr + PAGE_SIZE) == next_paddr) { - pages++; - paddr = next_paddr; - continue; - } - break; - } - - return pages; -} - -static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i; - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct kvm_sev_launch_update_data params; - struct sev_data_launch_update_data *data; - struct page **inpages; - int ret; - - if (!sev_guest(kvm)) - return -ENOTTY; - - if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) - return -EFAULT; - - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - return -ENOMEM; - - vaddr = params.uaddr; - size = params.len; - vaddr_end = vaddr + size; - - /* Lock the user memory. */ - inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1); - if (!inpages) { - ret = -ENOMEM; - goto e_free; - } - - /* - * The LAUNCH_UPDATE command will perform in-place encryption of the - * memory content (i.e it will write the same memory region with C=1). - * It's possible that the cache may contain the data with C=0, i.e., - * unencrypted so invalidate it first. - */ - sev_clflush_pages(inpages, npages); - - for (i = 0; vaddr < vaddr_end; vaddr = next_vaddr, i += pages) { - int offset, len; - - /* - * If the user buffer is not page-aligned, calculate the offset - * within the page. - */ - offset = vaddr & (PAGE_SIZE - 1); - - /* Calculate the number of pages that can be encrypted in one go. */ - pages = get_num_contig_pages(i, inpages, npages); - - len = min_t(size_t, ((pages * PAGE_SIZE) - offset), size); - - data->handle = sev->handle; - data->len = len; - data->address = __sme_page_pa(inpages[i]) + offset; - ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_DATA, data, &argp->error); - if (ret) - goto e_unpin; - - size -= len; - next_vaddr = vaddr + len; - } - -e_unpin: - /* content of memory is updated, mark pages dirty */ - for (i = 0; i < npages; i++) { - set_page_dirty_lock(inpages[i]); - mark_page_accessed(inpages[i]); - } - /* unlock the user pages */ - sev_unpin_memory(kvm, inpages, npages); -e_free: - kfree(data); - return ret; -} - -static int sev_launch_measure(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - void __user *measure = (void __user *)(uintptr_t)argp->data; - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct sev_data_launch_measure *data; - struct kvm_sev_launch_measure params; - void __user *p = NULL; - void *blob = NULL; - int ret; - - if (!sev_guest(kvm)) - return -ENOTTY; - - if (copy_from_user(¶ms, measure, sizeof(params))) - return -EFAULT; - - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - return -ENOMEM; - - /* User wants to query the blob length */ - if (!params.len) - goto cmd; - - p = (void __user *)(uintptr_t)params.uaddr; - if (p) { - if (params.len > SEV_FW_BLOB_MAX_SIZE) { - ret = -EINVAL; - goto e_free; - } - - ret = -ENOMEM; - blob = kmalloc(params.len, GFP_KERNEL); - if (!blob) - goto e_free; - - data->address = __psp_pa(blob); - data->len = params.len; - } - -cmd: - data->handle = sev->handle; - ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_MEASURE, data, &argp->error); - - /* - * If we query the session length, FW responded with expected data. - */ - if (!params.len) - goto done; - - if (ret) - goto e_free_blob; - - if (blob) { - if (copy_to_user(p, blob, params.len)) - ret = -EFAULT; - } - -done: - params.len = data->len; - if (copy_to_user(measure, ¶ms, sizeof(params))) - ret = -EFAULT; -e_free_blob: - kfree(blob); -e_free: - kfree(data); - return ret; -} - -static int sev_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct sev_data_launch_finish *data; - int ret; - - if (!sev_guest(kvm)) - return -ENOTTY; - - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - return -ENOMEM; - - data->handle = sev->handle; - ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_FINISH, data, &argp->error); - - kfree(data); - return ret; -} - -static int sev_guest_status(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct kvm_sev_guest_status params; - struct sev_data_guest_status *data; - int ret; - - if (!sev_guest(kvm)) - return -ENOTTY; - - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - return -ENOMEM; - - data->handle = sev->handle; - ret = sev_issue_cmd(kvm, SEV_CMD_GUEST_STATUS, data, &argp->error); - if (ret) - goto e_free; - - params.policy = data->policy; - params.state = data->state; - params.handle = data->handle; - - if (copy_to_user((void __user *)(uintptr_t)argp->data, ¶ms, sizeof(params))) - ret = -EFAULT; -e_free: - kfree(data); - return ret; -} - -static int __sev_issue_dbg_cmd(struct kvm *kvm, unsigned long src, - unsigned long dst, int size, - int *error, bool enc) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct sev_data_dbg *data; - int ret; - - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - return -ENOMEM; - - data->handle = sev->handle; - data->dst_addr = dst; - data->src_addr = src; - data->len = size; - - ret = sev_issue_cmd(kvm, - enc ? SEV_CMD_DBG_ENCRYPT : SEV_CMD_DBG_DECRYPT, - data, error); - kfree(data); - return ret; -} - -static int __sev_dbg_decrypt(struct kvm *kvm, unsigned long src_paddr, - unsigned long dst_paddr, int sz, int *err) -{ - int offset; - - /* - * Its safe to read more than we are asked, caller should ensure that - * destination has enough space. - */ - src_paddr = round_down(src_paddr, 16); - offset = src_paddr & 15; - sz = round_up(sz + offset, 16); - - return __sev_issue_dbg_cmd(kvm, src_paddr, dst_paddr, sz, err, false); -} - -static int __sev_dbg_decrypt_user(struct kvm *kvm, unsigned long paddr, - unsigned long __user dst_uaddr, - unsigned long dst_paddr, - int size, int *err) -{ - struct page *tpage = NULL; - int ret, offset; - - /* if inputs are not 16-byte then use intermediate buffer */ - if (!IS_ALIGNED(dst_paddr, 16) || - !IS_ALIGNED(paddr, 16) || - !IS_ALIGNED(size, 16)) { - tpage = (void *)alloc_page(GFP_KERNEL); - if (!tpage) - return -ENOMEM; - - dst_paddr = __sme_page_pa(tpage); - } - - ret = __sev_dbg_decrypt(kvm, paddr, dst_paddr, size, err); - if (ret) - goto e_free; - - if (tpage) { - offset = paddr & 15; - if (copy_to_user((void __user *)(uintptr_t)dst_uaddr, - page_address(tpage) + offset, size)) - ret = -EFAULT; - } - -e_free: - if (tpage) - __free_page(tpage); - - return ret; -} - -static int __sev_dbg_encrypt_user(struct kvm *kvm, unsigned long paddr, - unsigned long __user vaddr, - unsigned long dst_paddr, - unsigned long __user dst_vaddr, - int size, int *error) -{ - struct page *src_tpage = NULL; - struct page *dst_tpage = NULL; - int ret, len = size; - - /* If source buffer is not aligned then use an intermediate buffer */ - if (!IS_ALIGNED(vaddr, 16)) { - src_tpage = alloc_page(GFP_KERNEL); - if (!src_tpage) - return -ENOMEM; - - if (copy_from_user(page_address(src_tpage), - (void __user *)(uintptr_t)vaddr, size)) { - __free_page(src_tpage); - return -EFAULT; - } - - paddr = __sme_page_pa(src_tpage); - } - - /* - * If destination buffer or length is not aligned then do read-modify-write: - * - decrypt destination in an intermediate buffer - * - copy the source buffer in an intermediate buffer - * - use the intermediate buffer as source buffer - */ - if (!IS_ALIGNED(dst_vaddr, 16) || !IS_ALIGNED(size, 16)) { - int dst_offset; - - dst_tpage = alloc_page(GFP_KERNEL); - if (!dst_tpage) { - ret = -ENOMEM; - goto e_free; - } - - ret = __sev_dbg_decrypt(kvm, dst_paddr, - __sme_page_pa(dst_tpage), size, error); - if (ret) - goto e_free; - - /* - * If source is kernel buffer then use memcpy() otherwise - * copy_from_user(). - */ - dst_offset = dst_paddr & 15; - - if (src_tpage) - memcpy(page_address(dst_tpage) + dst_offset, - page_address(src_tpage), size); - else { - if (copy_from_user(page_address(dst_tpage) + dst_offset, - (void __user *)(uintptr_t)vaddr, size)) { - ret = -EFAULT; - goto e_free; - } - } - - paddr = __sme_page_pa(dst_tpage); - dst_paddr = round_down(dst_paddr, 16); - len = round_up(size, 16); - } - - ret = __sev_issue_dbg_cmd(kvm, paddr, dst_paddr, len, error, true); - -e_free: - if (src_tpage) - __free_page(src_tpage); - if (dst_tpage) - __free_page(dst_tpage); - return ret; -} - -static int sev_dbg_crypt(struct kvm *kvm, struct kvm_sev_cmd *argp, bool dec) -{ - unsigned long vaddr, vaddr_end, next_vaddr; - unsigned long dst_vaddr; - struct page **src_p, **dst_p; - struct kvm_sev_dbg debug; - unsigned long n; - unsigned int size; - int ret; - - if (!sev_guest(kvm)) - return -ENOTTY; - - if (copy_from_user(&debug, (void __user *)(uintptr_t)argp->data, sizeof(debug))) - return -EFAULT; - - if (!debug.len || debug.src_uaddr + debug.len < debug.src_uaddr) - return -EINVAL; - if (!debug.dst_uaddr) - return -EINVAL; - - vaddr = debug.src_uaddr; - size = debug.len; - vaddr_end = vaddr + size; - dst_vaddr = debug.dst_uaddr; - - for (; vaddr < vaddr_end; vaddr = next_vaddr) { - int len, s_off, d_off; - - /* lock userspace source and destination page */ - src_p = sev_pin_memory(kvm, vaddr & PAGE_MASK, PAGE_SIZE, &n, 0); - if (!src_p) - return -EFAULT; - - dst_p = sev_pin_memory(kvm, dst_vaddr & PAGE_MASK, PAGE_SIZE, &n, 1); - if (!dst_p) { - sev_unpin_memory(kvm, src_p, n); - return -EFAULT; - } - - /* - * The DBG_{DE,EN}CRYPT commands will perform {dec,en}cryption of the - * memory content (i.e it will write the same memory region with C=1). - * It's possible that the cache may contain the data with C=0, i.e., - * unencrypted so invalidate it first. - */ - sev_clflush_pages(src_p, 1); - sev_clflush_pages(dst_p, 1); - - /* - * Since user buffer may not be page aligned, calculate the - * offset within the page. - */ - s_off = vaddr & ~PAGE_MASK; - d_off = dst_vaddr & ~PAGE_MASK; - len = min_t(size_t, (PAGE_SIZE - s_off), size); - - if (dec) - ret = __sev_dbg_decrypt_user(kvm, - __sme_page_pa(src_p[0]) + s_off, - dst_vaddr, - __sme_page_pa(dst_p[0]) + d_off, - len, &argp->error); - else - ret = __sev_dbg_encrypt_user(kvm, - __sme_page_pa(src_p[0]) + s_off, - vaddr, - __sme_page_pa(dst_p[0]) + d_off, - dst_vaddr, - len, &argp->error); - - sev_unpin_memory(kvm, src_p, n); - sev_unpin_memory(kvm, dst_p, n); - - if (ret) - goto err; - - next_vaddr = vaddr + len; - dst_vaddr = dst_vaddr + len; - size -= len; - } -err: - return ret; -} - -static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct sev_data_launch_secret *data; - struct kvm_sev_launch_secret params; - struct page **pages; - void *blob, *hdr; - unsigned long n; - int ret, offset; - - if (!sev_guest(kvm)) - return -ENOTTY; - - if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) - return -EFAULT; - - pages = sev_pin_memory(kvm, params.guest_uaddr, params.guest_len, &n, 1); - if (!pages) - return -ENOMEM; - - /* - * The secret must be copied into contiguous memory region, lets verify - * that userspace memory pages are contiguous before we issue command. - */ - if (get_num_contig_pages(0, pages, n) != n) { - ret = -EINVAL; - goto e_unpin_memory; - } - - ret = -ENOMEM; - data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); - if (!data) - goto e_unpin_memory; - - offset = params.guest_uaddr & (PAGE_SIZE - 1); - data->guest_address = __sme_page_pa(pages[0]) + offset; - data->guest_len = params.guest_len; - - blob = psp_copy_user_blob(params.trans_uaddr, params.trans_len); - if (IS_ERR(blob)) { - ret = PTR_ERR(blob); - goto e_free; - } - - data->trans_address = __psp_pa(blob); - data->trans_len = params.trans_len; - - hdr = psp_copy_user_blob(params.hdr_uaddr, params.hdr_len); - if (IS_ERR(hdr)) { - ret = PTR_ERR(hdr); - goto e_free_blob; - } - data->hdr_address = __psp_pa(hdr); - data->hdr_len = params.hdr_len; - - data->handle = sev->handle; - ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_SECRET, data, &argp->error); - - kfree(hdr); - -e_free_blob: - kfree(blob); -e_free: - kfree(data); -e_unpin_memory: - sev_unpin_memory(kvm, pages, n); - return ret; -} - -static int svm_mem_enc_op(struct kvm *kvm, void __user *argp) -{ - struct kvm_sev_cmd sev_cmd; - int r; - - if (!svm_sev_enabled()) - return -ENOTTY; - - if (!argp) - return 0; - - if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd))) - return -EFAULT; - - mutex_lock(&kvm->lock); - - switch (sev_cmd.id) { - case KVM_SEV_INIT: - r = sev_guest_init(kvm, &sev_cmd); - break; - case KVM_SEV_LAUNCH_START: - r = sev_launch_start(kvm, &sev_cmd); - break; - case KVM_SEV_LAUNCH_UPDATE_DATA: - r = sev_launch_update_data(kvm, &sev_cmd); - break; - case KVM_SEV_LAUNCH_MEASURE: - r = sev_launch_measure(kvm, &sev_cmd); - break; - case KVM_SEV_LAUNCH_FINISH: - r = sev_launch_finish(kvm, &sev_cmd); - break; - case KVM_SEV_GUEST_STATUS: - r = sev_guest_status(kvm, &sev_cmd); - break; - case KVM_SEV_DBG_DECRYPT: - r = sev_dbg_crypt(kvm, &sev_cmd, true); - break; - case KVM_SEV_DBG_ENCRYPT: - r = sev_dbg_crypt(kvm, &sev_cmd, false); - break; - case KVM_SEV_LAUNCH_SECRET: - r = sev_launch_secret(kvm, &sev_cmd); - break; - default: - r = -EINVAL; - goto out; - } - - if (copy_to_user(argp, &sev_cmd, sizeof(struct kvm_sev_cmd))) - r = -EFAULT; - -out: - mutex_unlock(&kvm->lock); - return r; -} - -static int svm_register_enc_region(struct kvm *kvm, - struct kvm_enc_region *range) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct enc_region *region; - int ret = 0; - - if (!sev_guest(kvm)) - return -ENOTTY; - - if (range->addr > ULONG_MAX || range->size > ULONG_MAX) - return -EINVAL; - - region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT); - if (!region) - return -ENOMEM; - - region->pages = sev_pin_memory(kvm, range->addr, range->size, ®ion->npages, 1); - if (!region->pages) { - ret = -ENOMEM; - goto e_free; - } - - /* - * The guest may change the memory encryption attribute from C=0 -> C=1 - * or vice versa for this memory range. Lets make sure caches are - * flushed to ensure that guest data gets written into memory with - * correct C-bit. - */ - sev_clflush_pages(region->pages, region->npages); - - region->uaddr = range->addr; - region->size = range->size; - - mutex_lock(&kvm->lock); - list_add_tail(®ion->list, &sev->regions_list); - mutex_unlock(&kvm->lock); - - return ret; - -e_free: - kfree(region); - return ret; -} - -static struct enc_region * -find_enc_region(struct kvm *kvm, struct kvm_enc_region *range) -{ - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; - struct list_head *head = &sev->regions_list; - struct enc_region *i; - - list_for_each_entry(i, head, list) { - if (i->uaddr == range->addr && - i->size == range->size) - return i; - } - - return NULL; -} - - -static int svm_unregister_enc_region(struct kvm *kvm, - struct kvm_enc_region *range) -{ - struct enc_region *region; - int ret; - - mutex_lock(&kvm->lock); - - if (!sev_guest(kvm)) { - ret = -ENOTTY; - goto failed; - } - - region = find_enc_region(kvm, range); - if (!region) { - ret = -EINVAL; - goto failed; - } - - /* - * Ensure that all guest tagged cache entries are flushed before - * releasing the pages back to the system for use. CLFLUSH will - * not do this, so issue a WBINVD. - */ - wbinvd_on_all_cpus(); - - __unregister_enc_region_locked(kvm, region); - - mutex_unlock(&kvm->lock); - return 0; - -failed: - mutex_unlock(&kvm->lock); - return ret; -} - static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu) { unsigned long cr4 = kvm_read_cr4(vcpu); @@ -5183,6 +3966,24 @@ static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu) (svm->vmcb->control.intercept & (1ULL << INTERCEPT_INIT)); } +static void svm_vm_destroy(struct kvm *kvm) +{ + avic_vm_destroy(kvm); + sev_vm_destroy(kvm); +} + +static int svm_vm_init(struct kvm *kvm) +{ + if (avic) { + int ret = avic_vm_init(kvm); + if (ret) + return ret; + } + + kvm_apicv_init(kvm, avic); + return 0; +} + static struct kvm_x86_ops svm_x86_ops __initdata = { .hardware_unsetup = svm_hardware_teardown, .hardware_enable = svm_hardware_enable, diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index c7abc1fede97..df3474f4fb02 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -171,6 +171,24 @@ struct vcpu_svm { unsigned int last_cpu; }; +struct svm_cpu_data { + int cpu; + + u64 asid_generation; + u32 max_asid; + u32 next_asid; + u32 min_asid; + struct kvm_ldttss_desc *tss_desc; + + struct page *save_area; + struct vmcb *current_vmcb; + + /* index = sev_asid, value = vmcb pointer */ + struct vmcb **sev_vmcbs; +}; + +DECLARE_PER_CPU(struct svm_cpu_data *, svm_data); + void recalc_intercepts(struct vcpu_svm *svm); static inline struct kvm_svm *to_kvm_svm(struct kvm *kvm) @@ -440,4 +458,34 @@ int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq, void svm_vcpu_blocking(struct kvm_vcpu *vcpu); void svm_vcpu_unblocking(struct kvm_vcpu *vcpu); +/* sev.c */ + +extern unsigned int max_sev_asid; + +static inline bool sev_guest(struct kvm *kvm) +{ +#ifdef CONFIG_KVM_AMD_SEV + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + + return sev->active; +#else + return false; +#endif +} + +static inline bool svm_sev_enabled(void) +{ + return IS_ENABLED(CONFIG_KVM_AMD_SEV) ? max_sev_asid : 0; +} + +void sev_vm_destroy(struct kvm *kvm); +int svm_mem_enc_op(struct kvm *kvm, void __user *argp); +int svm_register_enc_region(struct kvm *kvm, + struct kvm_enc_region *range); +int svm_unregister_enc_region(struct kvm *kvm, + struct kvm_enc_region *range); +void pre_sev_run(struct vcpu_svm *svm, int cpu); +int __init sev_hardware_setup(void); +void sev_hardware_teardown(void); + #endif -- cgit From 199cd1d7b5348de4b58208420687676c658efed3 Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Mon, 30 Mar 2020 15:02:13 +0200 Subject: KVM: SVM: Split svm_vcpu_run inline assembly to separate file The compiler (GCC) does not like the situation, where there is inline assembly block that clobbers all available machine registers in the middle of the function. This situation can be found in function svm_vcpu_run in file kvm/svm.c and results in many register spills and fills to/from stack frame. This patch fixes the issue with the same approach as was done for VMX some time ago. The big inline assembly is moved to a separate assembly .S file, taking into account all ABI requirements. There are two main benefits of the above approach: * elimination of several register spills and fills to/from stack frame, and consequently smaller function .text size. The binary size of svm_vcpu_run is lowered from 2019 to 1626 bytes. * more efficient access to a register save array. Currently, register save array is accessed as: 7b00: 48 8b 98 28 02 00 00 mov 0x228(%rax),%rbx 7b07: 48 8b 88 18 02 00 00 mov 0x218(%rax),%rcx 7b0e: 48 8b 90 20 02 00 00 mov 0x220(%rax),%rdx and passing ia pointer to a register array as an argument to a function one gets: 12: 48 8b 48 08 mov 0x8(%rax),%rcx 16: 48 8b 50 10 mov 0x10(%rax),%rdx 1a: 48 8b 58 18 mov 0x18(%rax),%rbx As a result, the total size, considering that the new function size is 229 bytes, gets lowered by 164 bytes. Signed-off-by: Uros Bizjak Signed-off-by: Paolo Bonzini --- arch/x86/kvm/Makefile | 2 +- arch/x86/kvm/svm/svm.c | 92 +------------------------ arch/x86/kvm/svm/vmenter.S | 162 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 166 insertions(+), 90 deletions(-) create mode 100644 arch/x86/kvm/svm/vmenter.S (limited to 'arch/x86') diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index e5a71aa0967b..a789759b7261 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -14,7 +14,7 @@ kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \ hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o -kvm-amd-y += svm/svm.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o +kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o obj-$(CONFIG_KVM) += kvm.o obj-$(CONFIG_KVM_INTEL) += kvm-intel.o diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 05d77b395fc1..2be5bbae3a40 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3276,6 +3276,8 @@ static void svm_cancel_injection(struct kvm_vcpu *vcpu) svm_complete_interrupts(svm); } +bool __svm_vcpu_run(unsigned long vmcb_pa, unsigned long *regs); + static void svm_vcpu_run(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -3330,95 +3332,7 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu) local_irq_enable(); - asm volatile ( - "push %%" _ASM_BP "; \n\t" - "mov %c[rbx](%[svm]), %%" _ASM_BX " \n\t" - "mov %c[rcx](%[svm]), %%" _ASM_CX " \n\t" - "mov %c[rdx](%[svm]), %%" _ASM_DX " \n\t" - "mov %c[rsi](%[svm]), %%" _ASM_SI " \n\t" - "mov %c[rdi](%[svm]), %%" _ASM_DI " \n\t" - "mov %c[rbp](%[svm]), %%" _ASM_BP " \n\t" -#ifdef CONFIG_X86_64 - "mov %c[r8](%[svm]), %%r8 \n\t" - "mov %c[r9](%[svm]), %%r9 \n\t" - "mov %c[r10](%[svm]), %%r10 \n\t" - "mov %c[r11](%[svm]), %%r11 \n\t" - "mov %c[r12](%[svm]), %%r12 \n\t" - "mov %c[r13](%[svm]), %%r13 \n\t" - "mov %c[r14](%[svm]), %%r14 \n\t" - "mov %c[r15](%[svm]), %%r15 \n\t" -#endif - - /* Enter guest mode */ - "push %%" _ASM_AX " \n\t" - "mov %c[vmcb](%[svm]), %%" _ASM_AX " \n\t" - __ex("vmload %%" _ASM_AX) "\n\t" - __ex("vmrun %%" _ASM_AX) "\n\t" - __ex("vmsave %%" _ASM_AX) "\n\t" - "pop %%" _ASM_AX " \n\t" - - /* Save guest registers, load host registers */ - "mov %%" _ASM_BX ", %c[rbx](%[svm]) \n\t" - "mov %%" _ASM_CX ", %c[rcx](%[svm]) \n\t" - "mov %%" _ASM_DX ", %c[rdx](%[svm]) \n\t" - "mov %%" _ASM_SI ", %c[rsi](%[svm]) \n\t" - "mov %%" _ASM_DI ", %c[rdi](%[svm]) \n\t" - "mov %%" _ASM_BP ", %c[rbp](%[svm]) \n\t" -#ifdef CONFIG_X86_64 - "mov %%r8, %c[r8](%[svm]) \n\t" - "mov %%r9, %c[r9](%[svm]) \n\t" - "mov %%r10, %c[r10](%[svm]) \n\t" - "mov %%r11, %c[r11](%[svm]) \n\t" - "mov %%r12, %c[r12](%[svm]) \n\t" - "mov %%r13, %c[r13](%[svm]) \n\t" - "mov %%r14, %c[r14](%[svm]) \n\t" - "mov %%r15, %c[r15](%[svm]) \n\t" - /* - * Clear host registers marked as clobbered to prevent - * speculative use. - */ - "xor %%r8d, %%r8d \n\t" - "xor %%r9d, %%r9d \n\t" - "xor %%r10d, %%r10d \n\t" - "xor %%r11d, %%r11d \n\t" - "xor %%r12d, %%r12d \n\t" - "xor %%r13d, %%r13d \n\t" - "xor %%r14d, %%r14d \n\t" - "xor %%r15d, %%r15d \n\t" -#endif - "xor %%ebx, %%ebx \n\t" - "xor %%ecx, %%ecx \n\t" - "xor %%edx, %%edx \n\t" - "xor %%esi, %%esi \n\t" - "xor %%edi, %%edi \n\t" - "pop %%" _ASM_BP - : - : [svm]"a"(svm), - [vmcb]"i"(offsetof(struct vcpu_svm, vmcb_pa)), - [rbx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RBX])), - [rcx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RCX])), - [rdx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RDX])), - [rsi]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RSI])), - [rdi]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RDI])), - [rbp]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RBP])) -#ifdef CONFIG_X86_64 - , [r8]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R8])), - [r9]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R9])), - [r10]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R10])), - [r11]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R11])), - [r12]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R12])), - [r13]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R13])), - [r14]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R14])), - [r15]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R15])) -#endif - : "cc", "memory" -#ifdef CONFIG_X86_64 - , "rbx", "rcx", "rdx", "rsi", "rdi" - , "r8", "r9", "r10", "r11" , "r12", "r13", "r14", "r15" -#else - , "ebx", "ecx", "edx", "esi", "edi" -#endif - ); + __svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&svm->vcpu.arch.regs); /* Eliminate branch target predictions from guest mode */ vmexit_fill_RSB(); diff --git a/arch/x86/kvm/svm/vmenter.S b/arch/x86/kvm/svm/vmenter.S new file mode 100644 index 000000000000..fa1af90067e9 --- /dev/null +++ b/arch/x86/kvm/svm/vmenter.S @@ -0,0 +1,162 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#include +#include +#include +#include + +#define WORD_SIZE (BITS_PER_LONG / 8) + +/* Intentionally omit RAX as it's context switched by hardware */ +#define VCPU_RCX __VCPU_REGS_RCX * WORD_SIZE +#define VCPU_RDX __VCPU_REGS_RDX * WORD_SIZE +#define VCPU_RBX __VCPU_REGS_RBX * WORD_SIZE +/* Intentionally omit RSP as it's context switched by hardware */ +#define VCPU_RBP __VCPU_REGS_RBP * WORD_SIZE +#define VCPU_RSI __VCPU_REGS_RSI * WORD_SIZE +#define VCPU_RDI __VCPU_REGS_RDI * WORD_SIZE + +#ifdef CONFIG_X86_64 +#define VCPU_R8 __VCPU_REGS_R8 * WORD_SIZE +#define VCPU_R9 __VCPU_REGS_R9 * WORD_SIZE +#define VCPU_R10 __VCPU_REGS_R10 * WORD_SIZE +#define VCPU_R11 __VCPU_REGS_R11 * WORD_SIZE +#define VCPU_R12 __VCPU_REGS_R12 * WORD_SIZE +#define VCPU_R13 __VCPU_REGS_R13 * WORD_SIZE +#define VCPU_R14 __VCPU_REGS_R14 * WORD_SIZE +#define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE +#endif + + .text + +/** + * __svm_vcpu_run - Run a vCPU via a transition to SVM guest mode + * @vmcb_pa: unsigned long + * @regs: unsigned long * (to guest registers) + */ +SYM_FUNC_START(__svm_vcpu_run) + push %_ASM_BP + mov %_ASM_SP, %_ASM_BP +#ifdef CONFIG_X86_64 + push %r15 + push %r14 + push %r13 + push %r12 +#else + push %edi + push %esi +#endif + push %_ASM_BX + + /* Save @regs. */ + push %_ASM_ARG2 + + /* Save @vmcb. */ + push %_ASM_ARG1 + + /* Move @regs to RAX. */ + mov %_ASM_ARG2, %_ASM_AX + + /* Load guest registers. */ + mov VCPU_RCX(%_ASM_AX), %_ASM_CX + mov VCPU_RDX(%_ASM_AX), %_ASM_DX + mov VCPU_RBX(%_ASM_AX), %_ASM_BX + mov VCPU_RBP(%_ASM_AX), %_ASM_BP + mov VCPU_RSI(%_ASM_AX), %_ASM_SI + mov VCPU_RDI(%_ASM_AX), %_ASM_DI +#ifdef CONFIG_X86_64 + mov VCPU_R8 (%_ASM_AX), %r8 + mov VCPU_R9 (%_ASM_AX), %r9 + mov VCPU_R10(%_ASM_AX), %r10 + mov VCPU_R11(%_ASM_AX), %r11 + mov VCPU_R12(%_ASM_AX), %r12 + mov VCPU_R13(%_ASM_AX), %r13 + mov VCPU_R14(%_ASM_AX), %r14 + mov VCPU_R15(%_ASM_AX), %r15 +#endif + + /* "POP" @vmcb to RAX. */ + pop %_ASM_AX + + /* Enter guest mode */ +1: vmload %_ASM_AX + jmp 3f +2: cmpb $0, kvm_rebooting + jne 3f + ud2 + _ASM_EXTABLE(1b, 2b) + +3: vmrun %_ASM_AX + jmp 5f +4: cmpb $0, kvm_rebooting + jne 5f + ud2 + _ASM_EXTABLE(3b, 4b) + +5: vmsave %_ASM_AX + jmp 7f +6: cmpb $0, kvm_rebooting + jne 7f + ud2 + _ASM_EXTABLE(5b, 6b) +7: + /* "POP" @regs to RAX. */ + pop %_ASM_AX + + /* Save all guest registers. */ + mov %_ASM_CX, VCPU_RCX(%_ASM_AX) + mov %_ASM_DX, VCPU_RDX(%_ASM_AX) + mov %_ASM_BX, VCPU_RBX(%_ASM_AX) + mov %_ASM_BP, VCPU_RBP(%_ASM_AX) + mov %_ASM_SI, VCPU_RSI(%_ASM_AX) + mov %_ASM_DI, VCPU_RDI(%_ASM_AX) +#ifdef CONFIG_X86_64 + mov %r8, VCPU_R8 (%_ASM_AX) + mov %r9, VCPU_R9 (%_ASM_AX) + mov %r10, VCPU_R10(%_ASM_AX) + mov %r11, VCPU_R11(%_ASM_AX) + mov %r12, VCPU_R12(%_ASM_AX) + mov %r13, VCPU_R13(%_ASM_AX) + mov %r14, VCPU_R14(%_ASM_AX) + mov %r15, VCPU_R15(%_ASM_AX) +#endif + + /* + * Clear all general purpose registers except RSP and RAX to prevent + * speculative use of the guest's values, even those that are reloaded + * via the stack. In theory, an L1 cache miss when restoring registers + * could lead to speculative execution with the guest's values. + * Zeroing XORs are dirt cheap, i.e. the extra paranoia is essentially + * free. RSP and RAX are exempt as they are restored by hardware + * during VM-Exit. + */ + xor %ecx, %ecx + xor %edx, %edx + xor %ebx, %ebx + xor %ebp, %ebp + xor %esi, %esi + xor %edi, %edi +#ifdef CONFIG_X86_64 + xor %r8d, %r8d + xor %r9d, %r9d + xor %r10d, %r10d + xor %r11d, %r11d + xor %r12d, %r12d + xor %r13d, %r13d + xor %r14d, %r14d + xor %r15d, %r15d +#endif + + pop %_ASM_BX + +#ifdef CONFIG_X86_64 + pop %r12 + pop %r13 + pop %r14 + pop %r15 +#else + pop %esi + pop %edi +#endif + pop %_ASM_BP + ret +SYM_FUNC_END(__svm_vcpu_run) -- cgit From 696ac2e3bf267f5a2b2ed7d34e64131f2287d0ad Mon Sep 17 00:00:00 2001 From: Qian Cai Date: Fri, 3 Apr 2020 10:03:45 -0400 Subject: x86: ACPI: fix CPU hotplug deadlock Similar to commit 0266d81e9bf5 ("acpi/processor: Prevent cpu hotplug deadlock") except this is for acpi_processor_ffh_cstate_probe(): "The problem is that the work is scheduled on the current CPU from the hotplug thread associated with that CPU. It's not required to invoke these functions via the workqueue because the hotplug thread runs on the target CPU already. Check whether current is a per cpu thread pinned on the target CPU and invoke the function directly to avoid the workqueue." WARNING: possible circular locking dependency detected ------------------------------------------------------ cpuhp/1/15 is trying to acquire lock: ffffc90003447a28 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x4c6/0x630 but task is already holding lock: ffffffffafa1c0e8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x17/0x20 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (cpu_hotplug_lock){++++}-{0:0}: cpus_read_lock+0x3e/0xc0 irq_calc_affinity_vectors+0x5f/0x91 __pci_enable_msix_range+0x10f/0x9a0 pci_alloc_irq_vectors_affinity+0x13e/0x1f0 pci_alloc_irq_vectors_affinity at drivers/pci/msi.c:1208 pqi_ctrl_init+0x72f/0x1618 [smartpqi] pqi_pci_probe.cold.63+0x882/0x892 [smartpqi] local_pci_probe+0x7a/0xc0 work_for_cpu_fn+0x2e/0x50 process_one_work+0x57e/0xb90 worker_thread+0x363/0x5b0 kthread+0x1f4/0x220 ret_from_fork+0x27/0x50 -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}: __lock_acquire+0x2244/0x32a0 lock_acquire+0x1a2/0x680 __flush_work+0x4e6/0x630 work_on_cpu+0x114/0x160 acpi_processor_ffh_cstate_probe+0x129/0x250 acpi_processor_evaluate_cst+0x4c8/0x580 acpi_processor_get_power_info+0x86/0x740 acpi_processor_hotplug+0xc3/0x140 acpi_soft_cpu_online+0x102/0x1d0 cpuhp_invoke_callback+0x197/0x1120 cpuhp_thread_fun+0x252/0x2f0 smpboot_thread_fn+0x255/0x440 kthread+0x1f4/0x220 ret_from_fork+0x27/0x50 other info that might help us debug this: Chain exists of: (work_completion)(&wfc.work) --> cpuhp_state-up --> cpuidle_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpuidle_lock); lock(cpuhp_state-up); lock(cpuidle_lock); lock((work_completion)(&wfc.work)); *** DEADLOCK *** 3 locks held by cpuhp/1/15: #0: ffffffffaf51ab10 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x69/0x2f0 #1: ffffffffaf51ad40 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x69/0x2f0 #2: ffffffffafa1c0e8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x17/0x20 Call Trace: dump_stack+0xa0/0xea print_circular_bug.cold.52+0x147/0x14c check_noncircular+0x295/0x2d0 __lock_acquire+0x2244/0x32a0 lock_acquire+0x1a2/0x680 __flush_work+0x4e6/0x630 work_on_cpu+0x114/0x160 acpi_processor_ffh_cstate_probe+0x129/0x250 acpi_processor_evaluate_cst+0x4c8/0x580 acpi_processor_get_power_info+0x86/0x740 acpi_processor_hotplug+0xc3/0x140 acpi_soft_cpu_online+0x102/0x1d0 cpuhp_invoke_callback+0x197/0x1120 cpuhp_thread_fun+0x252/0x2f0 smpboot_thread_fn+0x255/0x440 kthread+0x1f4/0x220 ret_from_fork+0x27/0x50 Signed-off-by: Qian Cai Tested-by: Borislav Petkov [ rjw: Subject ] Signed-off-by: Rafael J. Wysocki --- arch/x86/kernel/acpi/cstate.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c index caf2edccbad2..49ae4e1ac9cd 100644 --- a/arch/x86/kernel/acpi/cstate.c +++ b/arch/x86/kernel/acpi/cstate.c @@ -161,7 +161,8 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu, /* Make sure we are running on right CPU */ - retval = work_on_cpu(cpu, acpi_processor_ffh_cstate_probe_cpu, cx); + retval = call_on_cpu(cpu, acpi_processor_ffh_cstate_probe_cpu, cx, + false); if (retval == 0) { /* Use the hint in CST */ percpu_entry->states[cx->index].eax = cx->address; -- cgit From da7e42320940059d571c233375d873c3a9de88c9 Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Mon, 6 Apr 2020 22:21:08 +0200 Subject: KVM: VMX: Remove unnecessary exception trampoline in vmx_vmenter The exception trampoline in .fixup section is not needed, the exception handling code can jump directly to the label in the .text section. Changes since v1: - Fix commit message. Cc: Sean Christopherson Cc: Paolo Bonzini Reviewed-by: Sean Christopherson Signed-off-by: Uros Bizjak Message-Id: <20200406202108.74300-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmenter.S | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S index 9651ba388ba9..87f3f24fef37 100644 --- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -58,12 +58,8 @@ SYM_FUNC_START(vmx_vmenter) ret 4: ud2 - .pushsection .fixup, "ax" -5: jmp 3b - .popsection - - _ASM_EXTABLE(1b, 5b) - _ASM_EXTABLE(2b, 5b) + _ASM_EXTABLE(1b, 3b) + _ASM_EXTABLE(2b, 3b) SYM_FUNC_END(vmx_vmenter) -- cgit From 5c8beb474665220374abcafa79e40d78778606d1 Mon Sep 17 00:00:00 2001 From: Oliver Upton Date: Mon, 6 Apr 2020 20:12:37 +0000 Subject: KVM: nVMX: don't clear mtf_pending when nested events are blocked If nested events are blocked, don't clear the mtf_pending flag to avoid missing later delivery of the MTF VM-exit. Fixes: 5ef8acbdd687c ("KVM: nVMX: Emulate MTF when performing instruction emulation") Signed-off-by: Oliver Upton Message-Id: <20200406201237.178725-1-oupton@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index de232306561a..cbc9ea2de28f 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -3645,7 +3645,8 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu) * Clear the MTF state. If a higher priority VM-exit is delivered first, * this state is discarded. */ - vmx->nested.mtf_pending = false; + if (!block_nested_events) + vmx->nested.mtf_pending = false; if (lapic_in_kernel(vcpu) && test_bit(KVM_APIC_INIT, &apic->pending_events)) { -- cgit From 4064a4c6a1f90d169f36259647be3a8ddb91fa96 Mon Sep 17 00:00:00 2001 From: Wanpeng Li Date: Thu, 2 Apr 2020 16:20:26 +0800 Subject: KVM: X86: Filter out the broadcast dest for IPI fastpath Except destination shorthand, a destination value 0xffffffff is used to broadcast interrupts, let's also filter out this for single target IPI fastpath. Reviewed-by: Vitaly Kuznetsov Signed-off-by: Wanpeng Li Message-Id: <1585815626-28370-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/lapic.c | 3 --- arch/x86/kvm/lapic.h | 3 +++ arch/x86/kvm/x86.c | 3 ++- 3 files changed, 5 insertions(+), 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index ca80daf8f878..9af25c97612a 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -59,9 +59,6 @@ #define MAX_APIC_VECTOR 256 #define APIC_VECTORS_PER_REG 32 -#define APIC_BROADCAST 0xFF -#define X2APIC_BROADCAST 0xFFFFFFFFul - static bool lapic_timer_advance_dynamic __read_mostly; #define LAPIC_TIMER_ADVANCE_ADJUST_MIN 100 /* clock cycles */ #define LAPIC_TIMER_ADVANCE_ADJUST_MAX 10000 /* clock cycles */ diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 40ed6ed22751..a0ffb4331418 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -17,6 +17,9 @@ #define APIC_BUS_CYCLE_NS 1 #define APIC_BUS_FREQUENCY (1000000000ULL / APIC_BUS_CYCLE_NS) +#define APIC_BROADCAST 0xFF +#define X2APIC_BROADCAST 0xFFFFFFFFul + enum lapic_mode { LAPIC_MODE_DISABLED = 0, LAPIC_MODE_INVALID = X2APIC_ENABLE, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b8124b562dea..027dfd278a97 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1586,7 +1586,8 @@ static int handle_fastpath_set_x2apic_icr_irqoff(struct kvm_vcpu *vcpu, u64 data if (((data & APIC_SHORT_MASK) == APIC_DEST_NOSHORT) && ((data & APIC_DEST_MASK) == APIC_DEST_PHYSICAL) && - ((data & APIC_MODE_MASK) == APIC_DM_FIXED)) { + ((data & APIC_MODE_MASK) == APIC_DM_FIXED) && + ((u32)(data >> 32) != X2APIC_BROADCAST)) { data &= ~(1 << 12); kvm_apic_send_ipi(vcpu->arch.apic, (u32)data, (u32)(data >> 32)); -- cgit From dbef2808af6c594922fe32833b30f55f35e9da6d Mon Sep 17 00:00:00 2001 From: Vitaly Kuznetsov Date: Wed, 1 Apr 2020 10:13:48 +0200 Subject: KVM: VMX: fix crash cleanup when KVM wasn't used If KVM wasn't used at all before we crash the cleanup procedure fails with BUG: unable to handle page fault for address: ffffffffffffffc8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 23215067 P4D 23215067 PUD 23217067 PMD 0 Oops: 0000 [#8] SMP PTI CPU: 0 PID: 3542 Comm: bash Kdump: loaded Tainted: G D 5.6.0-rc2+ #823 RIP: 0010:crash_vmclear_local_loaded_vmcss.cold+0x19/0x51 [kvm_intel] The root cause is that loaded_vmcss_on_cpu list is not yet initialized, we initialize it in hardware_enable() but this only happens when we start a VM. Previously, we used to have a bitmap with enabled CPUs and that was preventing [masking] the issue. Initialized loaded_vmcss_on_cpu list earlier, right before we assign crash_vmclear_loaded_vmcss pointer. blocked_vcpu_on_cpu list and blocked_vcpu_on_cpu_lock are moved altogether for consistency. Fixes: 31603d4fc2bb ("KVM: VMX: Always VMCLEAR in-use VMCSes during crash with kexec support") Signed-off-by: Vitaly Kuznetsov Message-Id: <20200401081348.1345307-1-vkuznets@redhat.com> Reviewed-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 91749f1254e8..8959514eaf0f 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2261,10 +2261,6 @@ static int hardware_enable(void) !hv_get_vp_assist_page(cpu)) return -EFAULT; - INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu)); - INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu)); - spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu)); - r = kvm_cpu_vmxon(phys_addr); if (r) return r; @@ -8044,7 +8040,7 @@ module_exit(vmx_exit); static int __init vmx_init(void) { - int r; + int r, cpu; #if IS_ENABLED(CONFIG_HYPERV) /* @@ -8098,6 +8094,12 @@ static int __init vmx_init(void) return r; } + for_each_possible_cpu(cpu) { + INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu)); + INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu)); + spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu)); + } + #ifdef CONFIG_KEXEC_CORE rcu_assign_pointer(crash_vmclear_loaded_vmcss, crash_vmclear_local_loaded_vmcss); -- cgit From 3122e80efc0faf4a2accba7a46c7ed795edbfded Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Mon, 6 Apr 2020 20:03:47 -0700 Subject: mm/vma: make vma_is_accessible() available for general use Lets move vma_is_accessible() helper to include/linux/mm.h which makes it available for general use. While here, this replaces all remaining open encodings for VMA access check with vma_is_accessible(). Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton Acked-by: Geert Uytterhoeven Acked-by: Guo Ren Acked-by: Vlastimil Babka Cc: Guo Ren Cc: Geert Uytterhoeven Cc: Ralf Baechle Cc: Paul Burton Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Yoshinori Sato Cc: Rich Felker Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Steven Rostedt Cc: Mel Gorman Cc: Alexander Viro Cc: "Aneesh Kumar K.V" Cc: Arnaldo Carvalho de Melo Cc: Arnd Bergmann Cc: Nick Piggin Cc: Paul Mackerras Cc: Will Deacon Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- arch/x86/mm/fault.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 859519f5b342..a51df516b87b 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1222,7 +1222,7 @@ access_error(unsigned long error_code, struct vm_area_struct *vma) return 1; /* read, not present: */ - if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))) + if (unlikely(!vma_is_accessible(vma))) return 1; return 0; -- cgit From 5a281062af1d43d3f3956a6b429c2d727bc92603 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Mon, 6 Apr 2020 20:05:33 -0700 Subject: userfaultfd: wp: add WP pagetable tracking to x86 Accurate userfaultfd WP tracking is possible by tracking exactly which virtual memory ranges were writeprotected by userland. We can't relay only on the RW bit of the mapped pagetable because that information is destroyed by fork() or KSM or swap. If we were to relay on that, we'd need to stay on the safe side and generate false positive wp faults for every swapped out page. [peterx@redhat.com: append _PAGE_UFD_WP to _PAGE_CHG_MASK] Signed-off-by: Andrea Arcangeli Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Reviewed-by: Mike Rapoport Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-4-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 52 ++++++++++++++++++++++++++++++++++++ arch/x86/include/asm/pgtable_64.h | 8 +++++- arch/x86/include/asm/pgtable_types.h | 12 ++++++++- 4 files changed, 71 insertions(+), 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1edf788d301c..8d078642b4be 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -149,6 +149,7 @@ config X86 select HAVE_ARCH_TRACEHOOK select HAVE_ARCH_TRANSPARENT_HUGEPAGE select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64 + select HAVE_ARCH_USERFAULTFD_WP if USERFAULTFD select HAVE_ARCH_VMAP_STACK if X86_64 select HAVE_ARCH_WITHIN_STACK_FRAMES select HAVE_ASM_MODVERSIONS diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index afda66a6d325..c37e1649fb7e 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -25,6 +25,7 @@ #include #include #include +#include extern pgd_t early_top_pgt[PTRS_PER_PGD]; int __init __early_make_pgtable(unsigned long address, pmdval_t pmd); @@ -313,6 +314,23 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear) return native_make_pte(v & ~clear); } +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP +static inline int pte_uffd_wp(pte_t pte) +{ + return pte_flags(pte) & _PAGE_UFFD_WP; +} + +static inline pte_t pte_mkuffd_wp(pte_t pte) +{ + return pte_set_flags(pte, _PAGE_UFFD_WP); +} + +static inline pte_t pte_clear_uffd_wp(pte_t pte) +{ + return pte_clear_flags(pte, _PAGE_UFFD_WP); +} +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */ + static inline pte_t pte_mkclean(pte_t pte) { return pte_clear_flags(pte, _PAGE_DIRTY); @@ -392,6 +410,23 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear) return native_make_pmd(v & ~clear); } +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP +static inline int pmd_uffd_wp(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_UFFD_WP; +} + +static inline pmd_t pmd_mkuffd_wp(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_UFFD_WP); +} + +static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_UFFD_WP); +} +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */ + static inline pmd_t pmd_mkold(pmd_t pmd) { return pmd_clear_flags(pmd, _PAGE_ACCESSED); @@ -1374,6 +1409,23 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd) #endif #endif +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP +static inline pte_t pte_swp_mkuffd_wp(pte_t pte) +{ + return pte_set_flags(pte, _PAGE_SWP_UFFD_WP); +} + +static inline int pte_swp_uffd_wp(pte_t pte) +{ + return pte_flags(pte) & _PAGE_SWP_UFFD_WP; +} + +static inline pte_t pte_swp_clear_uffd_wp(pte_t pte) +{ + return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP); +} +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */ + #define PKRU_AD_BIT 0x1 #define PKRU_WD_BIT 0x2 #define PKRU_BITS_PER_PKEY 2 diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 0b6c4042942a..df1373415f11 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -189,7 +189,7 @@ extern void sync_global_pgds(unsigned long start, unsigned long end); * * | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names - * | TYPE (59-63) | ~OFFSET (9-58) |0|0|X|X| X| X|X|SD|0| <- swp entry + * | TYPE (59-63) | ~OFFSET (9-58) |0|0|X|X| X| X|F|SD|0| <- swp entry * * G (8) is aliased and used as a PROT_NONE indicator for * !present ptes. We need to start storing swap entries above @@ -197,9 +197,15 @@ extern void sync_global_pgds(unsigned long start, unsigned long end); * erratum where they can be incorrectly set by hardware on * non-present PTEs. * + * SD Bits 1-4 are not used in non-present format and available for + * special use described below: + * * SD (1) in swp entry is used to store soft dirty bit, which helps us * remember soft dirty over page migration * + * F (2) in swp entry is used to record when a pagetable is + * writeprotected by userfaultfd WP support. + * * Bit 7 in swp entry should be 0 because pmd_present checks not only P, * but also L and G. * diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 65c2ecd730c5..b6606fe6cfdf 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -32,6 +32,7 @@ #define _PAGE_BIT_SPECIAL _PAGE_BIT_SOFTW1 #define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1 +#define _PAGE_BIT_UFFD_WP _PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */ #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 @@ -100,6 +101,14 @@ #define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0)) #endif +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP +#define _PAGE_UFFD_WP (_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP) +#define _PAGE_SWP_UFFD_WP _PAGE_USER +#else +#define _PAGE_UFFD_WP (_AT(pteval_t, 0)) +#define _PAGE_SWP_UFFD_WP (_AT(pteval_t, 0)) +#endif + #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) #define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP) @@ -118,7 +127,8 @@ */ #define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \ - _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC) + _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC | \ + _PAGE_UFFD_WP) #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE) /* -- cgit From 2e3d5dc508cf001c4fb2d15515ebe6f30df88f76 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 6 Apr 2020 20:05:57 -0700 Subject: userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Adding these missing helpers for uffd-wp operations with pmd swap/migration entries. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Reviewed-by: Mike Rapoport Cc: Andrea Arcangeli Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-10-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/x86/include/asm/pgtable.h | 15 +++++++++++++++ 1 file changed, 15 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index c37e1649fb7e..28838d790191 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1424,6 +1424,21 @@ static inline pte_t pte_swp_clear_uffd_wp(pte_t pte) { return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP); } + +static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP); +} + +static inline int pmd_swp_uffd_wp(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP; +} + +static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP); +} #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */ #define PKRU_AD_BIT 0x1 -- cgit From 12a5b00a5366467e95a25f55da751794e5fe6788 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Mon, 6 Apr 2020 20:09:30 -0700 Subject: sparc,x86: vdso: remove meaningless undefining CONFIG_OPTIMIZE_INLINING The code, #undef CONFIG_OPTIMIZE_INLINING, is not working as expected because is parsed before vclock_gettime.c since 28128c61e08e ("kconfig.h: Include compiler types to avoid missed struct attributes"). Since then, is included really early by using the '-include' option. So, you cannot negate the decision of in this way. You can confirm it by checking the pre-processed code, like this: $ make arch/x86/entry/vdso/vdso32/vclock_gettime.i There is no difference with/without CONFIG_CC_OPTIMIZE_FOR_SIZE. It is about two years since 28128c61e08e. Nobody has reported a problem (or, nobody has even noticed the fact that this code is not working). It is ugly and unreliable to attempt to undefine a CONFIG option from C files, and anyway the inlining heuristic is up to the compiler. Just remove the broken code. Signed-off-by: Masahiro Yamada Signed-off-by: Andrew Morton Reviewed-by: Nathan Chancellor Acked-by: Miguel Ojeda Cc: Arnd Bergmann Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Masahiro Yamada Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: David Miller Link: http://lkml.kernel.org/r/20200220110807.32534-1-masahiroy@kernel.org Signed-off-by: Linus Torvalds --- arch/x86/entry/vdso/vdso32/vclock_gettime.c | 4 ---- 1 file changed, 4 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/entry/vdso/vdso32/vclock_gettime.c b/arch/x86/entry/vdso/vdso32/vclock_gettime.c index 1e82bd43286c..84a4a73f77f7 100644 --- a/arch/x86/entry/vdso/vdso32/vclock_gettime.c +++ b/arch/x86/entry/vdso/vdso32/vclock_gettime.c @@ -1,10 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 #define BUILD_VDSO32 -#ifndef CONFIG_CC_OPTIMIZE_FOR_SIZE -#undef CONFIG_OPTIMIZE_INLINING -#endif - #ifdef CONFIG_X86_64 /* -- cgit From 889b3c1245de48ed0cacf7aebb25c489d3e4a3e9 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Mon, 6 Apr 2020 20:09:33 -0700 Subject: compiler: remove CONFIG_OPTIMIZE_INLINING entirely Commit ac7c3e4ff401 ("compiler: enable CONFIG_OPTIMIZE_INLINING forcibly") made this always-on option. We released v5.4 and v5.5 including that commit. Remove the CONFIG option and clean up the code now. Signed-off-by: Masahiro Yamada Signed-off-by: Andrew Morton Reviewed-by: Miguel Ojeda Reviewed-by: Nathan Chancellor Cc: Arnd Bergmann Cc: Borislav Petkov Cc: David Miller Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20200220110807.32534-2-masahiroy@kernel.org Signed-off-by: Linus Torvalds --- arch/x86/configs/i386_defconfig | 1 - arch/x86/configs/x86_64_defconfig | 1 - 2 files changed, 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig index ab8b30cb978e..550904591e94 100644 --- a/arch/x86/configs/i386_defconfig +++ b/arch/x86/configs/i386_defconfig @@ -285,7 +285,6 @@ CONFIG_EARLY_PRINTK_DBGP=y CONFIG_DEBUG_STACKOVERFLOW=y # CONFIG_DEBUG_RODATA_TEST is not set CONFIG_DEBUG_BOOT_PARAMS=y -CONFIG_OPTIMIZE_INLINING=y CONFIG_SECURITY=y CONFIG_SECURITY_NETWORK=y CONFIG_SECURITY_SELINUX=y diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig index 2d196cb49084..614961009075 100644 --- a/arch/x86/configs/x86_64_defconfig +++ b/arch/x86/configs/x86_64_defconfig @@ -282,7 +282,6 @@ CONFIG_EARLY_PRINTK_DBGP=y CONFIG_DEBUG_STACKOVERFLOW=y # CONFIG_DEBUG_RODATA_TEST is not set CONFIG_DEBUG_BOOT_PARAMS=y -CONFIG_OPTIMIZE_INLINING=y CONFIG_UNWINDER_ORC=y CONFIG_SECURITY=y CONFIG_SECURITY_NETWORK=y -- cgit From 0e1b4271078787d3408d3dd314d80b290578cc00 Mon Sep 17 00:00:00 2001 From: Jason Yan Date: Wed, 8 Apr 2020 10:46:05 +0800 Subject: x86/xen: make xen_pvmmu_arch_setup() static Fix the following sparse warning: arch/x86/xen/setup.c:998:12: warning: symbol 'xen_pvmmu_arch_setup' was not declared. Should it be static? Reported-by: Hulk Robot Signed-off-by: Jason Yan Reviewed-by: Juergen Gross Link: https://lore.kernel.org/r/20200408024605.42394-1-yanaijie@huawei.com Signed-off-by: Juergen Gross --- arch/x86/xen/setup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c index 33b0e20df7fc..1a2d8a50dac4 100644 --- a/arch/x86/xen/setup.c +++ b/arch/x86/xen/setup.c @@ -985,7 +985,7 @@ void xen_enable_syscall(void) #endif /* CONFIG_X86_64 */ } -void __init xen_pvmmu_arch_setup(void) +static void __init xen_pvmmu_arch_setup(void) { HYPERVISOR_vm_assist(VMASST_CMD_enable, VMASST_TYPE_4gb_segments); HYPERVISOR_vm_assist(VMASST_CMD_enable, VMASST_TYPE_writable_pagetables); -- cgit From 2b3b76b5ec67568da4bb475d3ce8a92ef494b5de Mon Sep 17 00:00:00 2001 From: Kan Liang Date: Thu, 2 Apr 2020 08:46:51 -0700 Subject: perf/x86/intel/uncore: Add Ice Lake server uncore support The uncore subsystem in Ice Lake server is similar to previous server. There are some differences in config register encoding and pci device IDs. The uncore PMON units in Ice Lake server include Ubox, Chabox, IIO, IRP, M2PCIE, PCU, M2M, PCIE3 and IMC. - For CHA, filter 1 register has been removed. The filter 0 register can be used by and of CHA events to be filterd by Thread/Core-ID. To do so, the control register's tid_en bit must be set to 1. - For IIO, there are some changes on event constraints. The MSR address and MSR offsets among counters are also changed. - For IRP, the MSR address and MSR offsets among counters are changed. - For M2PCIE, the counters are accessed by MSR now. Add new MSR address and MSR offsets. Change event constraints. - To determine the number of CHAs, have to read CAPID6(Low) and CAPID7 (High) now. - For M2M, update the PCICFG address and Device ID. - For UPI, update the PCICFG address, Device ID and counter address. - For M3UPI, update the PCICFG address, Device ID, counter address and event constraints. - For IMC, update the formular to calculate MMIO BAR address, which is MMIO_BASE + specific MEM_BAR offset. Signed-off-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/1585842411-150452-1-git-send-email-kan.liang@linux.intel.com --- arch/x86/events/intel/uncore.c | 8 + arch/x86/events/intel/uncore.h | 3 + arch/x86/events/intel/uncore_snbep.c | 511 +++++++++++++++++++++++++++++++++++ 3 files changed, 522 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index 1ba72c563313..cf76d6631afa 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -1476,6 +1476,12 @@ static const struct intel_uncore_init_fun tgl_l_uncore_init __initconst = { .mmio_init = tgl_l_uncore_mmio_init, }; +static const struct intel_uncore_init_fun icx_uncore_init __initconst = { + .cpu_init = icx_uncore_cpu_init, + .pci_init = icx_uncore_pci_init, + .mmio_init = icx_uncore_mmio_init, +}; + static const struct intel_uncore_init_fun snr_uncore_init __initconst = { .cpu_init = snr_uncore_cpu_init, .pci_init = snr_uncore_pci_init, @@ -1511,6 +1517,8 @@ static const struct x86_cpu_id intel_uncore_match[] __initconst = { X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, &icl_uncore_init), X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_NNPI, &icl_uncore_init), X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, &icl_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, &icx_uncore_init), + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, &icx_uncore_init), X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE_L, &tgl_l_uncore_init), X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE, &tgl_uncore_init), X86_MATCH_INTEL_FAM6_MODEL(ATOM_TREMONT_D, &snr_uncore_init), diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h index b30429f8a53a..0da4a4605536 100644 --- a/arch/x86/events/intel/uncore.h +++ b/arch/x86/events/intel/uncore.h @@ -550,6 +550,9 @@ void skx_uncore_cpu_init(void); int snr_uncore_pci_init(void); void snr_uncore_cpu_init(void); void snr_uncore_mmio_init(void); +int icx_uncore_pci_init(void); +void icx_uncore_cpu_init(void); +void icx_uncore_mmio_init(void); /* uncore_nhmex.c */ void nhmex_uncore_cpu_init(void); diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c index 01023f0d935b..07652fa20ebb 100644 --- a/arch/x86/events/intel/uncore_snbep.c +++ b/arch/x86/events/intel/uncore_snbep.c @@ -382,6 +382,42 @@ #define SNR_IMC_MMIO_MEM0_OFFSET 0xd8 #define SNR_IMC_MMIO_MEM0_MASK 0x7FF +/* ICX CHA */ +#define ICX_C34_MSR_PMON_CTR0 0xb68 +#define ICX_C34_MSR_PMON_CTL0 0xb61 +#define ICX_C34_MSR_PMON_BOX_CTL 0xb60 +#define ICX_C34_MSR_PMON_BOX_FILTER0 0xb65 + +/* ICX IIO */ +#define ICX_IIO_MSR_PMON_CTL0 0xa58 +#define ICX_IIO_MSR_PMON_CTR0 0xa51 +#define ICX_IIO_MSR_PMON_BOX_CTL 0xa50 + +/* ICX IRP */ +#define ICX_IRP0_MSR_PMON_CTL0 0xa4d +#define ICX_IRP0_MSR_PMON_CTR0 0xa4b +#define ICX_IRP0_MSR_PMON_BOX_CTL 0xa4a + +/* ICX M2PCIE */ +#define ICX_M2PCIE_MSR_PMON_CTL0 0xa46 +#define ICX_M2PCIE_MSR_PMON_CTR0 0xa41 +#define ICX_M2PCIE_MSR_PMON_BOX_CTL 0xa40 + +/* ICX UPI */ +#define ICX_UPI_PCI_PMON_CTL0 0x350 +#define ICX_UPI_PCI_PMON_CTR0 0x320 +#define ICX_UPI_PCI_PMON_BOX_CTL 0x318 +#define ICX_UPI_CTL_UMASK_EXT 0xffffff + +/* ICX M3UPI*/ +#define ICX_M3UPI_PCI_PMON_CTL0 0xd8 +#define ICX_M3UPI_PCI_PMON_CTR0 0xa8 +#define ICX_M3UPI_PCI_PMON_BOX_CTL 0xa0 + +/* ICX IMC */ +#define ICX_NUMBER_IMC_CHN 2 +#define ICX_IMC_MEM_STRIDE 0x4 + DEFINE_UNCORE_FORMAT_ATTR(event, event, "config:0-7"); DEFINE_UNCORE_FORMAT_ATTR(event2, event, "config:0-6"); DEFINE_UNCORE_FORMAT_ATTR(event_ext, event, "config:0-7,21"); @@ -390,6 +426,7 @@ DEFINE_UNCORE_FORMAT_ATTR(umask, umask, "config:8-15"); DEFINE_UNCORE_FORMAT_ATTR(umask_ext, umask, "config:8-15,32-43,45-55"); DEFINE_UNCORE_FORMAT_ATTR(umask_ext2, umask, "config:8-15,32-57"); DEFINE_UNCORE_FORMAT_ATTR(umask_ext3, umask, "config:8-15,32-39"); +DEFINE_UNCORE_FORMAT_ATTR(umask_ext4, umask, "config:8-15,32-55"); DEFINE_UNCORE_FORMAT_ATTR(qor, qor, "config:16"); DEFINE_UNCORE_FORMAT_ATTR(edge, edge, "config:18"); DEFINE_UNCORE_FORMAT_ATTR(tid_en, tid_en, "config:19"); @@ -4551,3 +4588,477 @@ void snr_uncore_mmio_init(void) } /* end of SNR uncore support */ + +/* ICX uncore support */ + +static unsigned icx_cha_msr_offsets[] = { + 0x2a0, 0x2ae, 0x2bc, 0x2ca, 0x2d8, 0x2e6, 0x2f4, 0x302, 0x310, + 0x31e, 0x32c, 0x33a, 0x348, 0x356, 0x364, 0x372, 0x380, 0x38e, + 0x3aa, 0x3b8, 0x3c6, 0x3d4, 0x3e2, 0x3f0, 0x3fe, 0x40c, 0x41a, + 0x428, 0x436, 0x444, 0x452, 0x460, 0x46e, 0x47c, 0x0, 0xe, + 0x1c, 0x2a, 0x38, 0x46, +}; + +static int icx_cha_hw_config(struct intel_uncore_box *box, struct perf_event *event) +{ + struct hw_perf_event_extra *reg1 = &event->hw.extra_reg; + bool tie_en = !!(event->hw.config & SNBEP_CBO_PMON_CTL_TID_EN); + + if (tie_en) { + reg1->reg = ICX_C34_MSR_PMON_BOX_FILTER0 + + icx_cha_msr_offsets[box->pmu->pmu_idx]; + reg1->config = event->attr.config1 & SKX_CHA_MSR_PMON_BOX_FILTER_TID; + reg1->idx = 0; + } + + return 0; +} + +static struct intel_uncore_ops icx_uncore_chabox_ops = { + .init_box = ivbep_uncore_msr_init_box, + .disable_box = snbep_uncore_msr_disable_box, + .enable_box = snbep_uncore_msr_enable_box, + .disable_event = snbep_uncore_msr_disable_event, + .enable_event = snr_cha_enable_event, + .read_counter = uncore_msr_read_counter, + .hw_config = icx_cha_hw_config, +}; + +static struct intel_uncore_type icx_uncore_chabox = { + .name = "cha", + .num_counters = 4, + .perf_ctr_bits = 48, + .event_ctl = ICX_C34_MSR_PMON_CTL0, + .perf_ctr = ICX_C34_MSR_PMON_CTR0, + .box_ctl = ICX_C34_MSR_PMON_BOX_CTL, + .msr_offsets = icx_cha_msr_offsets, + .event_mask = HSWEP_S_MSR_PMON_RAW_EVENT_MASK, + .event_mask_ext = SNR_CHA_RAW_EVENT_MASK_EXT, + .constraints = skx_uncore_chabox_constraints, + .ops = &icx_uncore_chabox_ops, + .format_group = &snr_uncore_chabox_format_group, +}; + +static unsigned icx_msr_offsets[] = { + 0x0, 0x20, 0x40, 0x90, 0xb0, 0xd0, +}; + +static struct event_constraint icx_uncore_iio_constraints[] = { + UNCORE_EVENT_CONSTRAINT(0x02, 0x3), + UNCORE_EVENT_CONSTRAINT(0x03, 0x3), + UNCORE_EVENT_CONSTRAINT(0x83, 0x3), + UNCORE_EVENT_CONSTRAINT(0xc0, 0xc), + UNCORE_EVENT_CONSTRAINT(0xc5, 0xc), + EVENT_CONSTRAINT_END +}; + +static struct intel_uncore_type icx_uncore_iio = { + .name = "iio", + .num_counters = 4, + .num_boxes = 6, + .perf_ctr_bits = 48, + .event_ctl = ICX_IIO_MSR_PMON_CTL0, + .perf_ctr = ICX_IIO_MSR_PMON_CTR0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .event_mask_ext = SNR_IIO_PMON_RAW_EVENT_MASK_EXT, + .box_ctl = ICX_IIO_MSR_PMON_BOX_CTL, + .msr_offsets = icx_msr_offsets, + .constraints = icx_uncore_iio_constraints, + .ops = &skx_uncore_iio_ops, + .format_group = &snr_uncore_iio_format_group, +}; + +static struct intel_uncore_type icx_uncore_irp = { + .name = "irp", + .num_counters = 2, + .num_boxes = 6, + .perf_ctr_bits = 48, + .event_ctl = ICX_IRP0_MSR_PMON_CTL0, + .perf_ctr = ICX_IRP0_MSR_PMON_CTR0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .box_ctl = ICX_IRP0_MSR_PMON_BOX_CTL, + .msr_offsets = icx_msr_offsets, + .ops = &ivbep_uncore_msr_ops, + .format_group = &ivbep_uncore_format_group, +}; + +static struct event_constraint icx_uncore_m2pcie_constraints[] = { + UNCORE_EVENT_CONSTRAINT(0x14, 0x3), + UNCORE_EVENT_CONSTRAINT(0x23, 0x3), + UNCORE_EVENT_CONSTRAINT(0x2d, 0x3), + EVENT_CONSTRAINT_END +}; + +static struct intel_uncore_type icx_uncore_m2pcie = { + .name = "m2pcie", + .num_counters = 4, + .num_boxes = 6, + .perf_ctr_bits = 48, + .event_ctl = ICX_M2PCIE_MSR_PMON_CTL0, + .perf_ctr = ICX_M2PCIE_MSR_PMON_CTR0, + .box_ctl = ICX_M2PCIE_MSR_PMON_BOX_CTL, + .msr_offsets = icx_msr_offsets, + .constraints = icx_uncore_m2pcie_constraints, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .ops = &ivbep_uncore_msr_ops, + .format_group = &ivbep_uncore_format_group, +}; + +enum perf_uncore_icx_iio_freerunning_type_id { + ICX_IIO_MSR_IOCLK, + ICX_IIO_MSR_BW_IN, + + ICX_IIO_FREERUNNING_TYPE_MAX, +}; + +static unsigned icx_iio_clk_freerunning_box_offsets[] = { + 0x0, 0x20, 0x40, 0x90, 0xb0, 0xd0, +}; + +static unsigned icx_iio_bw_freerunning_box_offsets[] = { + 0x0, 0x10, 0x20, 0x90, 0xa0, 0xb0, +}; + +static struct freerunning_counters icx_iio_freerunning[] = { + [ICX_IIO_MSR_IOCLK] = { 0xa55, 0x1, 0x20, 1, 48, icx_iio_clk_freerunning_box_offsets }, + [ICX_IIO_MSR_BW_IN] = { 0xaa0, 0x1, 0x10, 8, 48, icx_iio_bw_freerunning_box_offsets }, +}; + +static struct uncore_event_desc icx_uncore_iio_freerunning_events[] = { + /* Free-Running IIO CLOCKS Counter */ + INTEL_UNCORE_EVENT_DESC(ioclk, "event=0xff,umask=0x10"), + /* Free-Running IIO BANDWIDTH IN Counters */ + INTEL_UNCORE_EVENT_DESC(bw_in_port0, "event=0xff,umask=0x20"), + INTEL_UNCORE_EVENT_DESC(bw_in_port0.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port0.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port1, "event=0xff,umask=0x21"), + INTEL_UNCORE_EVENT_DESC(bw_in_port1.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port1.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port2, "event=0xff,umask=0x22"), + INTEL_UNCORE_EVENT_DESC(bw_in_port2.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port2.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port3, "event=0xff,umask=0x23"), + INTEL_UNCORE_EVENT_DESC(bw_in_port3.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port3.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port4, "event=0xff,umask=0x24"), + INTEL_UNCORE_EVENT_DESC(bw_in_port4.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port4.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port5, "event=0xff,umask=0x25"), + INTEL_UNCORE_EVENT_DESC(bw_in_port5.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port5.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port6, "event=0xff,umask=0x26"), + INTEL_UNCORE_EVENT_DESC(bw_in_port6.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port6.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port7, "event=0xff,umask=0x27"), + INTEL_UNCORE_EVENT_DESC(bw_in_port7.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port7.unit, "MiB"), + { /* end: all zeroes */ }, +}; + +static struct intel_uncore_type icx_uncore_iio_free_running = { + .name = "iio_free_running", + .num_counters = 9, + .num_boxes = 6, + .num_freerunning_types = ICX_IIO_FREERUNNING_TYPE_MAX, + .freerunning = icx_iio_freerunning, + .ops = &skx_uncore_iio_freerunning_ops, + .event_descs = icx_uncore_iio_freerunning_events, + .format_group = &skx_uncore_iio_freerunning_format_group, +}; + +static struct intel_uncore_type *icx_msr_uncores[] = { + &skx_uncore_ubox, + &icx_uncore_chabox, + &icx_uncore_iio, + &icx_uncore_irp, + &icx_uncore_m2pcie, + &skx_uncore_pcu, + &icx_uncore_iio_free_running, + NULL, +}; + +/* + * To determine the number of CHAs, it should read CAPID6(Low) and CAPID7 (High) + * registers which located at Device 30, Function 3 + */ +#define ICX_CAPID6 0x9c +#define ICX_CAPID7 0xa0 + +static u64 icx_count_chabox(void) +{ + struct pci_dev *dev = NULL; + u64 caps = 0; + + dev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x345b, dev); + if (!dev) + goto out; + + pci_read_config_dword(dev, ICX_CAPID6, (u32 *)&caps); + pci_read_config_dword(dev, ICX_CAPID7, (u32 *)&caps + 1); +out: + pci_dev_put(dev); + return hweight64(caps); +} + +void icx_uncore_cpu_init(void) +{ + u64 num_boxes = icx_count_chabox(); + + if (WARN_ON(num_boxes > ARRAY_SIZE(icx_cha_msr_offsets))) + return; + icx_uncore_chabox.num_boxes = num_boxes; + uncore_msr_uncores = icx_msr_uncores; +} + +static struct intel_uncore_type icx_uncore_m2m = { + .name = "m2m", + .num_counters = 4, + .num_boxes = 4, + .perf_ctr_bits = 48, + .perf_ctr = SNR_M2M_PCI_PMON_CTR0, + .event_ctl = SNR_M2M_PCI_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .box_ctl = SNR_M2M_PCI_PMON_BOX_CTL, + .ops = &snr_m2m_uncore_pci_ops, + .format_group = &skx_uncore_format_group, +}; + +static struct attribute *icx_upi_uncore_formats_attr[] = { + &format_attr_event.attr, + &format_attr_umask_ext4.attr, + &format_attr_edge.attr, + &format_attr_inv.attr, + &format_attr_thresh8.attr, + NULL, +}; + +static const struct attribute_group icx_upi_uncore_format_group = { + .name = "format", + .attrs = icx_upi_uncore_formats_attr, +}; + +static struct intel_uncore_type icx_uncore_upi = { + .name = "upi", + .num_counters = 4, + .num_boxes = 3, + .perf_ctr_bits = 48, + .perf_ctr = ICX_UPI_PCI_PMON_CTR0, + .event_ctl = ICX_UPI_PCI_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .event_mask_ext = ICX_UPI_CTL_UMASK_EXT, + .box_ctl = ICX_UPI_PCI_PMON_BOX_CTL, + .ops = &skx_upi_uncore_pci_ops, + .format_group = &icx_upi_uncore_format_group, +}; + +static struct event_constraint icx_uncore_m3upi_constraints[] = { + UNCORE_EVENT_CONSTRAINT(0x1c, 0x1), + UNCORE_EVENT_CONSTRAINT(0x1d, 0x1), + UNCORE_EVENT_CONSTRAINT(0x1e, 0x1), + UNCORE_EVENT_CONSTRAINT(0x1f, 0x1), + UNCORE_EVENT_CONSTRAINT(0x40, 0x7), + UNCORE_EVENT_CONSTRAINT(0x4e, 0x7), + UNCORE_EVENT_CONSTRAINT(0x4f, 0x7), + UNCORE_EVENT_CONSTRAINT(0x50, 0x7), + EVENT_CONSTRAINT_END +}; + +static struct intel_uncore_type icx_uncore_m3upi = { + .name = "m3upi", + .num_counters = 4, + .num_boxes = 3, + .perf_ctr_bits = 48, + .perf_ctr = ICX_M3UPI_PCI_PMON_CTR0, + .event_ctl = ICX_M3UPI_PCI_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .box_ctl = ICX_M3UPI_PCI_PMON_BOX_CTL, + .constraints = icx_uncore_m3upi_constraints, + .ops = &ivbep_uncore_pci_ops, + .format_group = &skx_uncore_format_group, +}; + +enum { + ICX_PCI_UNCORE_M2M, + ICX_PCI_UNCORE_UPI, + ICX_PCI_UNCORE_M3UPI, +}; + +static struct intel_uncore_type *icx_pci_uncores[] = { + [ICX_PCI_UNCORE_M2M] = &icx_uncore_m2m, + [ICX_PCI_UNCORE_UPI] = &icx_uncore_upi, + [ICX_PCI_UNCORE_M3UPI] = &icx_uncore_m3upi, + NULL, +}; + +static const struct pci_device_id icx_uncore_pci_ids[] = { + { /* M2M 0 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x344a), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(12, 0, ICX_PCI_UNCORE_M2M, 0), + }, + { /* M2M 1 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x344a), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(13, 0, ICX_PCI_UNCORE_M2M, 1), + }, + { /* M2M 2 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x344a), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(14, 0, ICX_PCI_UNCORE_M2M, 2), + }, + { /* M2M 3 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x344a), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(15, 0, ICX_PCI_UNCORE_M2M, 3), + }, + { /* UPI Link 0 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x3441), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(2, 1, ICX_PCI_UNCORE_UPI, 0), + }, + { /* UPI Link 1 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x3441), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(3, 1, ICX_PCI_UNCORE_UPI, 1), + }, + { /* UPI Link 2 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x3441), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(4, 1, ICX_PCI_UNCORE_UPI, 2), + }, + { /* M3UPI Link 0 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x3446), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(5, 1, ICX_PCI_UNCORE_M3UPI, 0), + }, + { /* M3UPI Link 1 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x3446), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(6, 1, ICX_PCI_UNCORE_M3UPI, 1), + }, + { /* M3UPI Link 2 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x3446), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(7, 1, ICX_PCI_UNCORE_M3UPI, 2), + }, + { /* end: all zeroes */ } +}; + +static struct pci_driver icx_uncore_pci_driver = { + .name = "icx_uncore", + .id_table = icx_uncore_pci_ids, +}; + +int icx_uncore_pci_init(void) +{ + /* ICX UBOX DID */ + int ret = snbep_pci2phy_map_init(0x3450, SKX_CPUNODEID, + SKX_GIDNIDMAP, true); + + if (ret) + return ret; + + uncore_pci_uncores = icx_pci_uncores; + uncore_pci_driver = &icx_uncore_pci_driver; + return 0; +} + +static void icx_uncore_imc_init_box(struct intel_uncore_box *box) +{ + unsigned int box_ctl = box->pmu->type->box_ctl + + box->pmu->type->mmio_offset * (box->pmu->pmu_idx % ICX_NUMBER_IMC_CHN); + int mem_offset = (box->pmu->pmu_idx / ICX_NUMBER_IMC_CHN) * ICX_IMC_MEM_STRIDE + + SNR_IMC_MMIO_MEM0_OFFSET; + + __snr_uncore_mmio_init_box(box, box_ctl, mem_offset); +} + +static struct intel_uncore_ops icx_uncore_mmio_ops = { + .init_box = icx_uncore_imc_init_box, + .exit_box = uncore_mmio_exit_box, + .disable_box = snr_uncore_mmio_disable_box, + .enable_box = snr_uncore_mmio_enable_box, + .disable_event = snr_uncore_mmio_disable_event, + .enable_event = snr_uncore_mmio_enable_event, + .read_counter = uncore_mmio_read_counter, +}; + +static struct intel_uncore_type icx_uncore_imc = { + .name = "imc", + .num_counters = 4, + .num_boxes = 8, + .perf_ctr_bits = 48, + .fixed_ctr_bits = 48, + .fixed_ctr = SNR_IMC_MMIO_PMON_FIXED_CTR, + .fixed_ctl = SNR_IMC_MMIO_PMON_FIXED_CTL, + .event_descs = hswep_uncore_imc_events, + .perf_ctr = SNR_IMC_MMIO_PMON_CTR0, + .event_ctl = SNR_IMC_MMIO_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .box_ctl = SNR_IMC_MMIO_PMON_BOX_CTL, + .mmio_offset = SNR_IMC_MMIO_OFFSET, + .ops = &icx_uncore_mmio_ops, + .format_group = &skx_uncore_format_group, +}; + +enum perf_uncore_icx_imc_freerunning_type_id { + ICX_IMC_DCLK, + ICX_IMC_DDR, + ICX_IMC_DDRT, + + ICX_IMC_FREERUNNING_TYPE_MAX, +}; + +static struct freerunning_counters icx_imc_freerunning[] = { + [ICX_IMC_DCLK] = { 0x22b0, 0x0, 0, 1, 48 }, + [ICX_IMC_DDR] = { 0x2290, 0x8, 0, 2, 48 }, + [ICX_IMC_DDRT] = { 0x22a0, 0x8, 0, 2, 48 }, +}; + +static struct uncore_event_desc icx_uncore_imc_freerunning_events[] = { + INTEL_UNCORE_EVENT_DESC(dclk, "event=0xff,umask=0x10"), + + INTEL_UNCORE_EVENT_DESC(read, "event=0xff,umask=0x20"), + INTEL_UNCORE_EVENT_DESC(read.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(read.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(write, "event=0xff,umask=0x21"), + INTEL_UNCORE_EVENT_DESC(write.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(write.unit, "MiB"), + + INTEL_UNCORE_EVENT_DESC(ddrt_read, "event=0xff,umask=0x30"), + INTEL_UNCORE_EVENT_DESC(ddrt_read.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(ddrt_read.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(ddrt_write, "event=0xff,umask=0x31"), + INTEL_UNCORE_EVENT_DESC(ddrt_write.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(ddrt_write.unit, "MiB"), + { /* end: all zeroes */ }, +}; + +static void icx_uncore_imc_freerunning_init_box(struct intel_uncore_box *box) +{ + int mem_offset = box->pmu->pmu_idx * ICX_IMC_MEM_STRIDE + + SNR_IMC_MMIO_MEM0_OFFSET; + + __snr_uncore_mmio_init_box(box, uncore_mmio_box_ctl(box), mem_offset); +} + +static struct intel_uncore_ops icx_uncore_imc_freerunning_ops = { + .init_box = icx_uncore_imc_freerunning_init_box, + .exit_box = uncore_mmio_exit_box, + .read_counter = uncore_mmio_read_counter, + .hw_config = uncore_freerunning_hw_config, +}; + +static struct intel_uncore_type icx_uncore_imc_free_running = { + .name = "imc_free_running", + .num_counters = 5, + .num_boxes = 4, + .num_freerunning_types = ICX_IMC_FREERUNNING_TYPE_MAX, + .freerunning = icx_imc_freerunning, + .ops = &icx_uncore_imc_freerunning_ops, + .event_descs = icx_uncore_imc_freerunning_events, + .format_group = &skx_uncore_iio_freerunning_format_group, +}; + +static struct intel_uncore_type *icx_mmio_uncores[] = { + &icx_uncore_imc, + &icx_uncore_imc_free_running, + NULL, +}; + +void icx_uncore_mmio_init(void) +{ + uncore_mmio_uncores = icx_mmio_uncores; +} + +/* end of ICX uncore support */ -- cgit From b5432a699fdff9266a475771bd46d740f40f76aa Mon Sep 17 00:00:00 2001 From: Jason Yan Date: Wed, 8 Apr 2020 10:44:12 +0800 Subject: ACPI, x86/boot: make acpi_nobgrt static Fix the following sparse warning: arch/x86/kernel/acpi/boot.c:48:5: warning: symbol 'acpi_nobgrt' was not declared. Should it be static? Reported-by: Hulk Robot Signed-off-by: Jason Yan Signed-off-by: Rafael J. Wysocki --- arch/x86/kernel/acpi/boot.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 1ae5439a9a85..683ed9e12e6b 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -45,7 +45,7 @@ EXPORT_SYMBOL(acpi_disabled); #define PREFIX "ACPI: " int acpi_noirq; /* skip ACPI IRQ initialization */ -int acpi_nobgrt; /* skip ACPI BGRT */ +static int acpi_nobgrt; /* skip ACPI BGRT */ int acpi_pci_disabled; /* skip ACPI PCI scan and IRQ initialization */ EXPORT_SYMBOL(acpi_pci_disabled); -- cgit From 418d6e295e4375e1097c511da11795e592aedbeb Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 26 Mar 2020 17:00:50 +0900 Subject: x86: remove unneeded defined(__ASSEMBLY__) check from asm/dwarf2.h This header file has the following check at the top: #ifndef __ASSEMBLY__ #warning "asm/dwarf2.h should be only included in pure assembly files" #endif So, we expect defined(__ASSEMBLY__) is always true. Signed-off-by: Masahiro Yamada Reviewed-by: Jason A. Donenfeld Reviewed-by: Nick Desaulniers Acked-by: Ingo Molnar --- arch/x86/include/asm/dwarf2.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h index f71a0cce9373..87054cc8272e 100644 --- a/arch/x86/include/asm/dwarf2.h +++ b/arch/x86/include/asm/dwarf2.h @@ -36,7 +36,7 @@ #define CFI_SIGNAL_FRAME #endif -#if defined(CONFIG_AS_CFI_SECTIONS) && defined(__ASSEMBLY__) +#if defined(CONFIG_AS_CFI_SECTIONS) #ifndef BUILD_VDSO /* * Emit CFI data in .debug_frame sections, not .eh_frame sections. -- cgit From 0f2661c4b935e5aa7d309f6c5072d4c49465137f Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 26 Mar 2020 17:00:51 +0900 Subject: x86: remove always-defined CONFIG_AS_CFI CONFIG_AS_CFI was introduced by commit e2414910f212 ("[PATCH] x86: Detect CFI support in the assembler at runtime"), and extended by commit f0f12d85af85 ("x86_64: Check for .cfi_rel_offset in CFI probe"). We raise the minimal supported binutils version from time to time. The last bump was commit 1fb12b35e5ff ("kbuild: Raise the minimum required binutils version to 2.21"). I confirmed the code in $(call as-instr,...) can be assembled by the binutils 2.21 assembler and also by LLVM integrated assembler. Remove CONFIG_AS_CFI, which is always defined. Signed-off-by: Masahiro Yamada Reviewed-by: Jason A. Donenfeld Reviewed-by: Nick Desaulniers Acked-by: Ingo Molnar --- arch/x86/Makefile | 10 ++-------- arch/x86/include/asm/dwarf2.h | 36 ------------------------------------ 2 files changed, 2 insertions(+), 44 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 513a55562d75..72f8f744ebd7 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -177,12 +177,6 @@ ifeq ($(ACCUMULATE_OUTGOING_ARGS), 1) KBUILD_CFLAGS += $(call cc-option,-maccumulate-outgoing-args,) endif -# Stackpointer is addressed different for 32 bit and 64 bit x86 -sp-$(CONFIG_X86_32) := esp -sp-$(CONFIG_X86_64) := rsp - -# do binutils support CFI? -cfi := $(call as-instr,.cfi_startproc\n.cfi_rel_offset $(sp-y)$(comma)0\n.cfi_endproc,-DCONFIG_AS_CFI=1) # is .cfi_signal_frame supported too? cfi-sigframe := $(call as-instr,.cfi_startproc\n.cfi_signal_frame\n.cfi_endproc,-DCONFIG_AS_CFI_SIGNAL_FRAME=1) cfi-sections := $(call as-instr,.cfi_sections .debug_frame,-DCONFIG_AS_CFI_SECTIONS=1) @@ -196,8 +190,8 @@ sha1_ni_instr :=$(call as-instr,sha1msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA1_NI= sha256_ni_instr :=$(call as-instr,sha256msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA256_NI=1) adx_instr := $(call as-instr,adox %r10$(comma)%r10,-DCONFIG_AS_ADX=1) -KBUILD_AFLAGS += $(cfi) $(cfi-sigframe) $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) -KBUILD_CFLAGS += $(cfi) $(cfi-sigframe) $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) +KBUILD_AFLAGS += $(cfi-sigframe) $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) +KBUILD_CFLAGS += $(cfi-sigframe) $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) KBUILD_LDFLAGS := -m elf_$(UTS_MACHINE) diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h index 87054cc8272e..76cc07ff1002 100644 --- a/arch/x86/include/asm/dwarf2.h +++ b/arch/x86/include/asm/dwarf2.h @@ -6,15 +6,6 @@ #warning "asm/dwarf2.h should be only included in pure assembly files" #endif -/* - * Macros for dwarf2 CFI unwind table entries. - * See "as.info" for details on these pseudo ops. Unfortunately - * they are only supported in very new binutils, so define them - * away for older version. - */ - -#ifdef CONFIG_AS_CFI - #define CFI_STARTPROC .cfi_startproc #define CFI_ENDPROC .cfi_endproc #define CFI_DEF_CFA .cfi_def_cfa @@ -55,31 +46,4 @@ #endif #endif -#else - -/* - * Due to the structure of pre-exisiting code, don't use assembler line - * comment character # to ignore the arguments. Instead, use a dummy macro. - */ -.macro cfi_ignore a=0, b=0, c=0, d=0 -.endm - -#define CFI_STARTPROC cfi_ignore -#define CFI_ENDPROC cfi_ignore -#define CFI_DEF_CFA cfi_ignore -#define CFI_DEF_CFA_REGISTER cfi_ignore -#define CFI_DEF_CFA_OFFSET cfi_ignore -#define CFI_ADJUST_CFA_OFFSET cfi_ignore -#define CFI_OFFSET cfi_ignore -#define CFI_REL_OFFSET cfi_ignore -#define CFI_REGISTER cfi_ignore -#define CFI_RESTORE cfi_ignore -#define CFI_REMEMBER_STATE cfi_ignore -#define CFI_RESTORE_STATE cfi_ignore -#define CFI_UNDEFINED cfi_ignore -#define CFI_ESCAPE cfi_ignore -#define CFI_SIGNAL_FRAME cfi_ignore - -#endif - #endif /* _ASM_X86_DWARF2_H */ -- cgit From 46427f658e90eda40b85313b5cff05bdc56c1988 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 26 Mar 2020 17:00:52 +0900 Subject: x86: remove unneeded (CONFIG_AS_)CFI_SIGNAL_FRAME Commit 131484c8da97 ("x86/debug: Remove perpetually broken, unmaintainable dwarf annotations") removes all the users of CFI_SIGNAL_FRAME. Remove the CFI_SIGNAL_FRAME and CONFIG_AS_CFI_SIGNAL_FRAME. Signed-off-by: Masahiro Yamada Reviewed-by: Jason A. Donenfeld Reviewed-by: Nick Desaulniers Acked-by: Ingo Molnar --- arch/x86/Makefile | 6 ++---- arch/x86/include/asm/dwarf2.h | 6 ------ 2 files changed, 2 insertions(+), 10 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 72f8f744ebd7..dd275008fc59 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -177,8 +177,6 @@ ifeq ($(ACCUMULATE_OUTGOING_ARGS), 1) KBUILD_CFLAGS += $(call cc-option,-maccumulate-outgoing-args,) endif -# is .cfi_signal_frame supported too? -cfi-sigframe := $(call as-instr,.cfi_startproc\n.cfi_signal_frame\n.cfi_endproc,-DCONFIG_AS_CFI_SIGNAL_FRAME=1) cfi-sections := $(call as-instr,.cfi_sections .debug_frame,-DCONFIG_AS_CFI_SECTIONS=1) # does binutils support specific instructions? @@ -190,8 +188,8 @@ sha1_ni_instr :=$(call as-instr,sha1msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA1_NI= sha256_ni_instr :=$(call as-instr,sha256msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA256_NI=1) adx_instr := $(call as-instr,adox %r10$(comma)%r10,-DCONFIG_AS_ADX=1) -KBUILD_AFLAGS += $(cfi-sigframe) $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) -KBUILD_CFLAGS += $(cfi-sigframe) $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) +KBUILD_AFLAGS += $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) +KBUILD_CFLAGS += $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) KBUILD_LDFLAGS := -m elf_$(UTS_MACHINE) diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h index 76cc07ff1002..882db09d5dfb 100644 --- a/arch/x86/include/asm/dwarf2.h +++ b/arch/x86/include/asm/dwarf2.h @@ -21,12 +21,6 @@ #define CFI_UNDEFINED .cfi_undefined #define CFI_ESCAPE .cfi_escape -#ifdef CONFIG_AS_CFI_SIGNAL_FRAME -#define CFI_SIGNAL_FRAME .cfi_signal_frame -#else -#define CFI_SIGNAL_FRAME -#endif - #if defined(CONFIG_AS_CFI_SECTIONS) #ifndef BUILD_VDSO /* -- cgit From 48e24723d0b54a2c9e53dd96c8864cc0afc7b0ee Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 26 Mar 2020 17:00:53 +0900 Subject: x86: remove always-defined CONFIG_AS_CFI_SECTIONS CONFIG_AS_CFI_SECTIONS was introduced by commit 9e565292270a ("x86: Use .cfi_sections for assembly code"). We raise the minimal supported binutils version from time to time. The last bump was commit 1fb12b35e5ff ("kbuild: Raise the minimum required binutils version to 2.21"). I confirmed the code in $(call as-instr,...) can be assembled by the binutils 2.21 assembler and also by LLVM integrated assembler. Remove CONFIG_AS_CFI_SECTIONS, which is always defined. Signed-off-by: Masahiro Yamada Reviewed-by: Jason A. Donenfeld Reviewed-by: Nick Desaulniers Acked-by: Ingo Molnar --- arch/x86/Makefile | 6 ++---- arch/x86/include/asm/dwarf2.h | 2 -- 2 files changed, 2 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/Makefile b/arch/x86/Makefile index dd275008fc59..e4a062313bb0 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -177,8 +177,6 @@ ifeq ($(ACCUMULATE_OUTGOING_ARGS), 1) KBUILD_CFLAGS += $(call cc-option,-maccumulate-outgoing-args,) endif -cfi-sections := $(call as-instr,.cfi_sections .debug_frame,-DCONFIG_AS_CFI_SECTIONS=1) - # does binutils support specific instructions? asinstr += $(call as-instr,pshufb %xmm0$(comma)%xmm0,-DCONFIG_AS_SSSE3=1) avx_instr := $(call as-instr,vxorps %ymm0$(comma)%ymm1$(comma)%ymm2,-DCONFIG_AS_AVX=1) @@ -188,8 +186,8 @@ sha1_ni_instr :=$(call as-instr,sha1msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA1_NI= sha256_ni_instr :=$(call as-instr,sha256msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA256_NI=1) adx_instr := $(call as-instr,adox %r10$(comma)%r10,-DCONFIG_AS_ADX=1) -KBUILD_AFLAGS += $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) -KBUILD_CFLAGS += $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) +KBUILD_AFLAGS += $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) +KBUILD_CFLAGS += $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) KBUILD_LDFLAGS := -m elf_$(UTS_MACHINE) diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h index 882db09d5dfb..430fca13bb56 100644 --- a/arch/x86/include/asm/dwarf2.h +++ b/arch/x86/include/asm/dwarf2.h @@ -21,7 +21,6 @@ #define CFI_UNDEFINED .cfi_undefined #define CFI_ESCAPE .cfi_escape -#if defined(CONFIG_AS_CFI_SECTIONS) #ifndef BUILD_VDSO /* * Emit CFI data in .debug_frame sections, not .eh_frame sections. @@ -38,6 +37,5 @@ */ .cfi_sections .eh_frame, .debug_frame #endif -#endif #endif /* _ASM_X86_DWARF2_H */ -- cgit From 92203b02805d99d8aca88b1c6b93c721237205fe Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 26 Mar 2020 17:00:54 +0900 Subject: x86: remove always-defined CONFIG_AS_SSSE3 CONFIG_AS_SSSE3 was introduced by commit 75aaf4c3e6a4 ("x86/raid6: correctly check for assembler capabilities"). We raise the minimal supported binutils version from time to time. The last bump was commit 1fb12b35e5ff ("kbuild: Raise the minimum required binutils version to 2.21"). I confirmed the code in $(call as-instr,...) can be assembled by the binutils 2.21 assembler and also by LLVM integrated assembler. Remove CONFIG_AS_SSSE3, which is always defined. I added ifdef CONFIG_X86 to lib/raid6/algos.c to avoid link errors on non-x86 architectures. lib/raid6/algos.c is built not only for the kernel but also for testing the library code from userspace. I added -DCONFIG_X86 to lib/raid6/test/Makefile to cator to this usecase. Signed-off-by: Masahiro Yamada Reviewed-by: Jason A. Donenfeld Reviewed-by: Nick Desaulniers Acked-by: Ingo Molnar --- arch/x86/Makefile | 5 ++--- arch/x86/crypto/blake2s-core.S | 2 -- 2 files changed, 2 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/Makefile b/arch/x86/Makefile index e4a062313bb0..94f89612e024 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -178,7 +178,6 @@ ifeq ($(ACCUMULATE_OUTGOING_ARGS), 1) endif # does binutils support specific instructions? -asinstr += $(call as-instr,pshufb %xmm0$(comma)%xmm0,-DCONFIG_AS_SSSE3=1) avx_instr := $(call as-instr,vxorps %ymm0$(comma)%ymm1$(comma)%ymm2,-DCONFIG_AS_AVX=1) avx2_instr :=$(call as-instr,vpbroadcastb %xmm0$(comma)%ymm1,-DCONFIG_AS_AVX2=1) avx512_instr :=$(call as-instr,vpmovm2b %k1$(comma)%zmm5,-DCONFIG_AS_AVX512=1) @@ -186,8 +185,8 @@ sha1_ni_instr :=$(call as-instr,sha1msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA1_NI= sha256_ni_instr :=$(call as-instr,sha256msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA256_NI=1) adx_instr := $(call as-instr,adox %r10$(comma)%r10,-DCONFIG_AS_ADX=1) -KBUILD_AFLAGS += $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) -KBUILD_CFLAGS += $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) +KBUILD_AFLAGS += $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) +KBUILD_CFLAGS += $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) KBUILD_LDFLAGS := -m elf_$(UTS_MACHINE) diff --git a/arch/x86/crypto/blake2s-core.S b/arch/x86/crypto/blake2s-core.S index 24910b766bdd..2ca79974f819 100644 --- a/arch/x86/crypto/blake2s-core.S +++ b/arch/x86/crypto/blake2s-core.S @@ -46,7 +46,6 @@ SIGMA2: #endif /* CONFIG_AS_AVX512 */ .text -#ifdef CONFIG_AS_SSSE3 SYM_FUNC_START(blake2s_compress_ssse3) testq %rdx,%rdx je .Lendofloop @@ -174,7 +173,6 @@ SYM_FUNC_START(blake2s_compress_ssse3) .Lendofloop: ret SYM_FUNC_END(blake2s_compress_ssse3) -#endif /* CONFIG_AS_SSSE3 */ #ifdef CONFIG_AS_AVX512 SYM_FUNC_START(blake2s_compress_avx512) -- cgit From 42251572c4687813d8e7b1d363a23d0f9201e69f Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 26 Mar 2020 17:00:55 +0900 Subject: x86: remove always-defined CONFIG_AS_AVX CONFIG_AS_AVX was introduced by commit ea4d26ae24e5 ("raid5: add AVX optimized RAID5 checksumming"). We raise the minimal supported binutils version from time to time. The last bump was commit 1fb12b35e5ff ("kbuild: Raise the minimum required binutils version to 2.21"). I confirmed the code in $(call as-instr,...) can be assembled by the binutils 2.21 assembler and also by LLVM integrated assembler. Remove CONFIG_AS_AVX, which is always defined. Signed-off-by: Masahiro Yamada Reviewed-by: Jason A. Donenfeld Acked-by: Ingo Molnar --- arch/x86/Makefile | 5 ++--- arch/x86/crypto/Makefile | 32 ++++++++++----------------- arch/x86/crypto/aesni-intel_avx-x86_64.S | 3 --- arch/x86/crypto/aesni-intel_glue.c | 14 +----------- arch/x86/crypto/poly1305-x86_64-cryptogams.pl | 8 ------- arch/x86/crypto/poly1305_glue.c | 6 ++--- arch/x86/crypto/sha1_ssse3_asm.S | 4 ---- arch/x86/crypto/sha1_ssse3_glue.c | 9 +------- arch/x86/crypto/sha256-avx-asm.S | 3 --- arch/x86/crypto/sha256_ssse3_glue.c | 8 +------ arch/x86/crypto/sha512-avx-asm.S | 2 -- arch/x86/crypto/sha512_ssse3_glue.c | 7 +----- arch/x86/include/asm/xor_avx.h | 9 -------- 13 files changed, 21 insertions(+), 89 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 94f89612e024..f32ef7b8d5ca 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -178,15 +178,14 @@ ifeq ($(ACCUMULATE_OUTGOING_ARGS), 1) endif # does binutils support specific instructions? -avx_instr := $(call as-instr,vxorps %ymm0$(comma)%ymm1$(comma)%ymm2,-DCONFIG_AS_AVX=1) avx2_instr :=$(call as-instr,vpbroadcastb %xmm0$(comma)%ymm1,-DCONFIG_AS_AVX2=1) avx512_instr :=$(call as-instr,vpmovm2b %k1$(comma)%zmm5,-DCONFIG_AS_AVX512=1) sha1_ni_instr :=$(call as-instr,sha1msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA1_NI=1) sha256_ni_instr :=$(call as-instr,sha256msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA256_NI=1) adx_instr := $(call as-instr,adox %r10$(comma)%r10,-DCONFIG_AS_ADX=1) -KBUILD_AFLAGS += $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) -KBUILD_CFLAGS += $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) +KBUILD_AFLAGS += $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) +KBUILD_CFLAGS += $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) KBUILD_LDFLAGS := -m elf_$(UTS_MACHINE) diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index 8c2e9eadee8a..1a044908d42d 100644 --- a/arch/x86/crypto/Makefile +++ b/arch/x86/crypto/Makefile @@ -5,7 +5,6 @@ OBJECT_FILES_NON_STANDARD := y -avx_supported := $(call as-instr,vpxor %xmm0$(comma)%xmm0$(comma)%xmm0,yes,no) avx2_supported := $(call as-instr,vpgatherdd %ymm0$(comma)(%eax$(comma)%ymm1\ $(comma)4)$(comma)%ymm2,yes,no) avx512_supported :=$(call as-instr,vpmovm2b %k1$(comma)%zmm5,yes,no) @@ -47,15 +46,12 @@ ifeq ($(adx_supported),yes) endif # These modules require assembler to support AVX. -ifeq ($(avx_supported),yes) - obj-$(CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64) += \ - camellia-aesni-avx-x86_64.o - obj-$(CONFIG_CRYPTO_CAST5_AVX_X86_64) += cast5-avx-x86_64.o - obj-$(CONFIG_CRYPTO_CAST6_AVX_X86_64) += cast6-avx-x86_64.o - obj-$(CONFIG_CRYPTO_TWOFISH_AVX_X86_64) += twofish-avx-x86_64.o - obj-$(CONFIG_CRYPTO_SERPENT_AVX_X86_64) += serpent-avx-x86_64.o - obj-$(CONFIG_CRYPTO_BLAKE2S_X86) += blake2s-x86_64.o -endif +obj-$(CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64) += camellia-aesni-avx-x86_64.o +obj-$(CONFIG_CRYPTO_CAST5_AVX_X86_64) += cast5-avx-x86_64.o +obj-$(CONFIG_CRYPTO_CAST6_AVX_X86_64) += cast6-avx-x86_64.o +obj-$(CONFIG_CRYPTO_TWOFISH_AVX_X86_64) += twofish-avx-x86_64.o +obj-$(CONFIG_CRYPTO_SERPENT_AVX_X86_64) += serpent-avx-x86_64.o +obj-$(CONFIG_CRYPTO_BLAKE2S_X86) += blake2s-x86_64.o # These modules require assembler to support AVX2. ifeq ($(avx2_supported),yes) @@ -83,16 +79,12 @@ ifneq ($(CONFIG_CRYPTO_POLY1305_X86_64),) targets += poly1305-x86_64-cryptogams.S endif -ifeq ($(avx_supported),yes) - camellia-aesni-avx-x86_64-y := camellia-aesni-avx-asm_64.o \ - camellia_aesni_avx_glue.o - cast5-avx-x86_64-y := cast5-avx-x86_64-asm_64.o cast5_avx_glue.o - cast6-avx-x86_64-y := cast6-avx-x86_64-asm_64.o cast6_avx_glue.o - twofish-avx-x86_64-y := twofish-avx-x86_64-asm_64.o \ - twofish_avx_glue.o - serpent-avx-x86_64-y := serpent-avx-x86_64-asm_64.o \ - serpent_avx_glue.o -endif +camellia-aesni-avx-x86_64-y := camellia-aesni-avx-asm_64.o \ + camellia_aesni_avx_glue.o +cast5-avx-x86_64-y := cast5-avx-x86_64-asm_64.o cast5_avx_glue.o +cast6-avx-x86_64-y := cast6-avx-x86_64-asm_64.o cast6_avx_glue.o +twofish-avx-x86_64-y := twofish-avx-x86_64-asm_64.o twofish_avx_glue.o +serpent-avx-x86_64-y := serpent-avx-x86_64-asm_64.o serpent_avx_glue.o ifeq ($(avx2_supported),yes) camellia-aesni-avx2-y := camellia-aesni-avx2-asm_64.o camellia_aesni_avx2_glue.o diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index bfa1c0b3e5b4..cc56ee43238b 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -886,7 +886,6 @@ _less_than_8_bytes_left_\@: _partial_block_done_\@: .endm # PARTIAL_BLOCK -#ifdef CONFIG_AS_AVX ############################################################################### # GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0) # Input: A and B (128-bits each, bit-reflected) @@ -1869,8 +1868,6 @@ key_256_finalize: ret SYM_FUNC_END(aesni_gcm_finalize_avx_gen2) -#endif /* CONFIG_AS_AVX */ - #ifdef CONFIG_AS_AVX2 ############################################################################### # GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0) diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c index 75b6ea20491e..655ad6bc8810 100644 --- a/arch/x86/crypto/aesni-intel_glue.c +++ b/arch/x86/crypto/aesni-intel_glue.c @@ -185,7 +185,6 @@ static const struct aesni_gcm_tfm_s aesni_gcm_tfm_sse = { .finalize = &aesni_gcm_finalize, }; -#ifdef CONFIG_AS_AVX asmlinkage void aes_ctr_enc_128_avx_by8(const u8 *in, u8 *iv, void *keys, u8 *out, unsigned int num_bytes); asmlinkage void aes_ctr_enc_192_avx_by8(const u8 *in, u8 *iv, @@ -234,8 +233,6 @@ static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_gen2 = { .finalize = &aesni_gcm_finalize_avx_gen2, }; -#endif - #ifdef CONFIG_AS_AVX2 /* * asmlinkage void aesni_gcm_init_avx_gen4() @@ -476,7 +473,6 @@ static void ctr_crypt_final(struct crypto_aes_ctx *ctx, crypto_inc(ctrblk, AES_BLOCK_SIZE); } -#ifdef CONFIG_AS_AVX static void aesni_ctr_enc_avx_tfm(struct crypto_aes_ctx *ctx, u8 *out, const u8 *in, unsigned int len, u8 *iv) { @@ -493,7 +489,6 @@ static void aesni_ctr_enc_avx_tfm(struct crypto_aes_ctx *ctx, u8 *out, else aes_ctr_enc_256_avx_by8(in, iv, (void *)ctx, out, len); } -#endif static int ctr_crypt(struct skcipher_request *req) { @@ -715,10 +710,8 @@ static int gcmaes_crypt_by_sg(bool enc, struct aead_request *req, if (left < AVX_GEN4_OPTSIZE && gcm_tfm == &aesni_gcm_tfm_avx_gen4) gcm_tfm = &aesni_gcm_tfm_avx_gen2; #endif -#ifdef CONFIG_AS_AVX if (left < AVX_GEN2_OPTSIZE && gcm_tfm == &aesni_gcm_tfm_avx_gen2) gcm_tfm = &aesni_gcm_tfm_sse; -#endif /* Linearize assoc, if not already linear */ if (req->src->length >= assoclen && req->src->length && @@ -1082,24 +1075,19 @@ static int __init aesni_init(void) aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen4; } else #endif -#ifdef CONFIG_AS_AVX if (boot_cpu_has(X86_FEATURE_AVX)) { pr_info("AVX version of gcm_enc/dec engaged.\n"); aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen2; - } else -#endif - { + } else { pr_info("SSE version of gcm_enc/dec engaged.\n"); aesni_gcm_tfm = &aesni_gcm_tfm_sse; } aesni_ctr_enc_tfm = aesni_ctr_enc; -#ifdef CONFIG_AS_AVX if (boot_cpu_has(X86_FEATURE_AVX)) { /* optimize performance of ctr mode encryption transform */ aesni_ctr_enc_tfm = aesni_ctr_enc_avx_tfm; pr_info("AES CTR mode by8 optimization enabled\n"); } -#endif #endif err = crypto_register_alg(&aesni_cipher_alg); diff --git a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl index 7a6b5380a46f..5bac2d533104 100644 --- a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl +++ b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl @@ -404,10 +404,6 @@ ___ &end_function("poly1305_emit_x86_64"); if ($avx) { -if($kernel) { - $code .= "#ifdef CONFIG_AS_AVX\n"; -} - ######################################################################## # Layout of opaque area is following. # @@ -1516,10 +1512,6 @@ $code.=<<___; ___ &end_function("poly1305_emit_avx"); -if ($kernel) { - $code .= "#endif\n"; -} - if ($avx>1) { if ($kernel) { diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c index 79bb58737d52..4a6226e1d15e 100644 --- a/arch/x86/crypto/poly1305_glue.c +++ b/arch/x86/crypto/poly1305_glue.c @@ -94,7 +94,7 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len, BUILD_BUG_ON(PAGE_SIZE < POLY1305_BLOCK_SIZE || PAGE_SIZE % POLY1305_BLOCK_SIZE); - if (!IS_ENABLED(CONFIG_AS_AVX) || !static_branch_likely(&poly1305_use_avx) || + if (!static_branch_likely(&poly1305_use_avx) || (len < (POLY1305_BLOCK_SIZE * 18) && !state->is_base2_26) || !crypto_simd_usable()) { convert_to_base2_64(ctx); @@ -123,7 +123,7 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len, static void poly1305_simd_emit(void *ctx, u8 mac[POLY1305_DIGEST_SIZE], const u32 nonce[4]) { - if (!IS_ENABLED(CONFIG_AS_AVX) || !static_branch_likely(&poly1305_use_avx)) + if (!static_branch_likely(&poly1305_use_avx)) poly1305_emit_x86_64(ctx, mac, nonce); else poly1305_emit_avx(ctx, mac, nonce); @@ -261,7 +261,7 @@ static struct shash_alg alg = { static int __init poly1305_simd_mod_init(void) { - if (IS_ENABLED(CONFIG_AS_AVX) && boot_cpu_has(X86_FEATURE_AVX) && + if (boot_cpu_has(X86_FEATURE_AVX) && cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL)) static_branch_enable(&poly1305_use_avx); if (IS_ENABLED(CONFIG_AS_AVX2) && boot_cpu_has(X86_FEATURE_AVX) && diff --git a/arch/x86/crypto/sha1_ssse3_asm.S b/arch/x86/crypto/sha1_ssse3_asm.S index 12e2d19d7402..d25668d2a1e9 100644 --- a/arch/x86/crypto/sha1_ssse3_asm.S +++ b/arch/x86/crypto/sha1_ssse3_asm.S @@ -467,8 +467,6 @@ W_PRECALC_SSSE3 */ SHA1_VECTOR_ASM sha1_transform_ssse3 -#ifdef CONFIG_AS_AVX - .macro W_PRECALC_AVX .purgem W_PRECALC_00_15 @@ -553,5 +551,3 @@ W_PRECALC_AVX * const u8 *data, int blocks); */ SHA1_VECTOR_ASM sha1_transform_avx - -#endif diff --git a/arch/x86/crypto/sha1_ssse3_glue.c b/arch/x86/crypto/sha1_ssse3_glue.c index d70b40ad594c..275b65dd30c9 100644 --- a/arch/x86/crypto/sha1_ssse3_glue.c +++ b/arch/x86/crypto/sha1_ssse3_glue.c @@ -114,7 +114,6 @@ static void unregister_sha1_ssse3(void) crypto_unregister_shash(&sha1_ssse3_alg); } -#ifdef CONFIG_AS_AVX asmlinkage void sha1_transform_avx(struct sha1_state *state, const u8 *data, int blocks); @@ -175,13 +174,7 @@ static void unregister_sha1_avx(void) crypto_unregister_shash(&sha1_avx_alg); } -#else /* CONFIG_AS_AVX */ -static inline int register_sha1_avx(void) { return 0; } -static inline void unregister_sha1_avx(void) { } -#endif /* CONFIG_AS_AVX */ - - -#if defined(CONFIG_AS_AVX2) && (CONFIG_AS_AVX) +#if defined(CONFIG_AS_AVX2) #define SHA1_AVX2_BLOCK_OPTSIZE 4 /* optimal 4*64 bytes of SHA1 blocks */ asmlinkage void sha1_transform_avx2(struct sha1_state *state, diff --git a/arch/x86/crypto/sha256-avx-asm.S b/arch/x86/crypto/sha256-avx-asm.S index fcbc30f58c38..4739cd31b9db 100644 --- a/arch/x86/crypto/sha256-avx-asm.S +++ b/arch/x86/crypto/sha256-avx-asm.S @@ -47,7 +47,6 @@ # This code schedules 1 block at a time, with 4 lanes per block ######################################################################## -#ifdef CONFIG_AS_AVX #include ## assume buffers not aligned @@ -498,5 +497,3 @@ _SHUF_00BA: # shuffle xDxC -> DC00 _SHUF_DC00: .octa 0x0b0a090803020100FFFFFFFFFFFFFFFF - -#endif diff --git a/arch/x86/crypto/sha256_ssse3_glue.c b/arch/x86/crypto/sha256_ssse3_glue.c index 03ad657c04bd..8bdc3be31f64 100644 --- a/arch/x86/crypto/sha256_ssse3_glue.c +++ b/arch/x86/crypto/sha256_ssse3_glue.c @@ -144,7 +144,6 @@ static void unregister_sha256_ssse3(void) ARRAY_SIZE(sha256_ssse3_algs)); } -#ifdef CONFIG_AS_AVX asmlinkage void sha256_transform_avx(struct sha256_state *state, const u8 *data, int blocks); @@ -221,12 +220,7 @@ static void unregister_sha256_avx(void) ARRAY_SIZE(sha256_avx_algs)); } -#else -static inline int register_sha256_avx(void) { return 0; } -static inline void unregister_sha256_avx(void) { } -#endif - -#if defined(CONFIG_AS_AVX2) && defined(CONFIG_AS_AVX) +#if defined(CONFIG_AS_AVX2) asmlinkage void sha256_transform_rorx(struct sha256_state *state, const u8 *data, int blocks); diff --git a/arch/x86/crypto/sha512-avx-asm.S b/arch/x86/crypto/sha512-avx-asm.S index 90ea945ba5e6..63470fd6ae32 100644 --- a/arch/x86/crypto/sha512-avx-asm.S +++ b/arch/x86/crypto/sha512-avx-asm.S @@ -47,7 +47,6 @@ # ######################################################################## -#ifdef CONFIG_AS_AVX #include .text @@ -424,4 +423,3 @@ K512: .quad 0x3c9ebe0a15c9bebc,0x431d67c49c100d4c .quad 0x4cc5d4becb3e42b6,0x597f299cfc657e2a .quad 0x5fcb6fab3ad6faec,0x6c44198c4a475817 -#endif diff --git a/arch/x86/crypto/sha512_ssse3_glue.c b/arch/x86/crypto/sha512_ssse3_glue.c index 1c444f41037c..75214982a633 100644 --- a/arch/x86/crypto/sha512_ssse3_glue.c +++ b/arch/x86/crypto/sha512_ssse3_glue.c @@ -142,7 +142,6 @@ static void unregister_sha512_ssse3(void) ARRAY_SIZE(sha512_ssse3_algs)); } -#ifdef CONFIG_AS_AVX asmlinkage void sha512_transform_avx(struct sha512_state *state, const u8 *data, int blocks); static bool avx_usable(void) @@ -218,12 +217,8 @@ static void unregister_sha512_avx(void) crypto_unregister_shashes(sha512_avx_algs, ARRAY_SIZE(sha512_avx_algs)); } -#else -static inline int register_sha512_avx(void) { return 0; } -static inline void unregister_sha512_avx(void) { } -#endif -#if defined(CONFIG_AS_AVX2) && defined(CONFIG_AS_AVX) +#if defined(CONFIG_AS_AVX2) asmlinkage void sha512_transform_rorx(struct sha512_state *state, const u8 *data, int blocks); diff --git a/arch/x86/include/asm/xor_avx.h b/arch/x86/include/asm/xor_avx.h index d61ddf3d052b..0c4e5b5e3852 100644 --- a/arch/x86/include/asm/xor_avx.h +++ b/arch/x86/include/asm/xor_avx.h @@ -11,8 +11,6 @@ * Based on Ingo Molnar and Zach Brown's respective MMX and SSE routines */ -#ifdef CONFIG_AS_AVX - #include #include @@ -170,11 +168,4 @@ do { \ #define AVX_SELECT(FASTEST) \ (boot_cpu_has(X86_FEATURE_AVX) && boot_cpu_has(X86_FEATURE_OSXSAVE) ? &xor_block_avx : FASTEST) -#else - -#define AVX_XOR_SPEED {} - -#define AVX_SELECT(FASTEST) (FASTEST) - -#endif #endif -- cgit From 5e8ebd841a44b895e2bab5e874ff7d333ca31135 Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Thu, 26 Mar 2020 17:00:58 +0900 Subject: x86: probe assembler capabilities via kconfig instead of makefile Doing this probing inside of the Makefiles means we have a maze of ifdefs inside the source code and child Makefiles that need to make proper decisions on this too. Instead, we do it at Kconfig time, like many other compiler and assembler options, which allows us to set up the dependencies normally for full compilation units. In the process, the ADX test changes to use %eax instead of %r10 so that it's valid in both 32-bit and 64-bit mode. Signed-off-by: Jason A. Donenfeld Acked-by: Ingo Molnar Reviewed-by: Nick Desaulniers Signed-off-by: Masahiro Yamada --- arch/x86/Kconfig | 2 ++ arch/x86/Kconfig.assembler | 17 +++++++++++++++++ arch/x86/Makefile | 10 ---------- 3 files changed, 19 insertions(+), 10 deletions(-) create mode 100644 arch/x86/Kconfig.assembler (limited to 'arch/x86') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 8d078642b4be..70898a68fa1a 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2931,3 +2931,5 @@ config HAVE_ATOMIC_IOMAP source "drivers/firmware/Kconfig" source "arch/x86/kvm/Kconfig" + +source "arch/x86/Kconfig.assembler" diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler new file mode 100644 index 000000000000..91230bf11a14 --- /dev/null +++ b/arch/x86/Kconfig.assembler @@ -0,0 +1,17 @@ +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) 2020 Jason A. Donenfeld . All Rights Reserved. + +config AS_AVX2 + def_bool $(as-instr,vpbroadcastb %xmm0$(comma)%ymm1) + +config AS_AVX512 + def_bool $(as-instr,vpmovm2b %k1$(comma)%zmm5) + +config AS_SHA1_NI + def_bool $(as-instr,sha1msg1 %xmm0$(comma)%xmm1) + +config AS_SHA256_NI + def_bool $(as-instr,sha256msg1 %xmm0$(comma)%xmm1) + +config AS_ADX + def_bool $(as-instr,adox %eax$(comma)%eax) diff --git a/arch/x86/Makefile b/arch/x86/Makefile index f32ef7b8d5ca..b65ec63c7db7 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -177,16 +177,6 @@ ifeq ($(ACCUMULATE_OUTGOING_ARGS), 1) KBUILD_CFLAGS += $(call cc-option,-maccumulate-outgoing-args,) endif -# does binutils support specific instructions? -avx2_instr :=$(call as-instr,vpbroadcastb %xmm0$(comma)%ymm1,-DCONFIG_AS_AVX2=1) -avx512_instr :=$(call as-instr,vpmovm2b %k1$(comma)%zmm5,-DCONFIG_AS_AVX512=1) -sha1_ni_instr :=$(call as-instr,sha1msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA1_NI=1) -sha256_ni_instr :=$(call as-instr,sha256msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA256_NI=1) -adx_instr := $(call as-instr,adox %r10$(comma)%r10,-DCONFIG_AS_ADX=1) - -KBUILD_AFLAGS += $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) -KBUILD_CFLAGS += $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr) $(adx_instr) - KBUILD_LDFLAGS := -m elf_$(UTS_MACHINE) # -- cgit From e9e070cfe141e73588f1e48ca3f51d1183b8b646 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 26 Mar 2020 17:00:59 +0900 Subject: x86: add comments about the binutils version to support code in as-instr We raise the minimal supported binutils version from time to time. The last bump was commit 1fb12b35e5ff ("kbuild: Raise the minimum required binutils version to 2.21"). We have these as-instr tests because binutils 2.21 does not support them. When we bump the binutils version next time, this will be a good hint to find out which one can be dropped. As for the Clang/LLVM builds, we require very new LLVM version, so the LLVM integrated assembler supports all of them. Signed-off-by: Masahiro Yamada Acked-by: Jason A. Donenfeld Acked-by: Ingo Molnar Acked-by: Nick Desaulniers --- arch/x86/Kconfig.assembler | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler index 91230bf11a14..a5a1d2766b3a 100644 --- a/arch/x86/Kconfig.assembler +++ b/arch/x86/Kconfig.assembler @@ -3,15 +3,25 @@ config AS_AVX2 def_bool $(as-instr,vpbroadcastb %xmm0$(comma)%ymm1) + help + Supported by binutils >= 2.22 and LLVM integrated assembler config AS_AVX512 def_bool $(as-instr,vpmovm2b %k1$(comma)%zmm5) + help + Supported by binutils >= 2.25 and LLVM integrated assembler config AS_SHA1_NI def_bool $(as-instr,sha1msg1 %xmm0$(comma)%xmm1) + help + Supported by binutils >= 2.24 and LLVM integrated assembler config AS_SHA256_NI def_bool $(as-instr,sha256msg1 %xmm0$(comma)%xmm1) + help + Supported by binutils >= 2.24 and LLVM integrated assembler config AS_ADX def_bool $(as-instr,adox %eax$(comma)%eax) + help + Supported by binutils >= 2.23 and LLVM integrated assembler -- cgit From 4dcbfc35f7da66609cf56690586a72ca0ad279f8 Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Thu, 26 Mar 2020 17:01:00 +0900 Subject: crypto: x86 - rework configuration based on Kconfig Now that assembler capabilities are probed inside of Kconfig, we can set up proper Kconfig-based dependencies. We also take this opportunity to reorder the Makefile, so that items are grouped logically by primitive. Signed-off-by: Jason A. Donenfeld Acked-by: Herbert Xu Acked-by: Ingo Molnar Signed-off-by: Masahiro Yamada --- arch/x86/crypto/Makefile | 152 ++++++++++++++++++++--------------------------- 1 file changed, 65 insertions(+), 87 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index 1a044908d42d..2f23f08fdd4b 100644 --- a/arch/x86/crypto/Makefile +++ b/arch/x86/crypto/Makefile @@ -1,122 +1,100 @@ # SPDX-License-Identifier: GPL-2.0 # -# Arch-specific CryptoAPI modules. -# +# x86 crypto algorithms OBJECT_FILES_NON_STANDARD := y -avx2_supported := $(call as-instr,vpgatherdd %ymm0$(comma)(%eax$(comma)%ymm1\ - $(comma)4)$(comma)%ymm2,yes,no) -avx512_supported :=$(call as-instr,vpmovm2b %k1$(comma)%zmm5,yes,no) -sha1_ni_supported :=$(call as-instr,sha1msg1 %xmm0$(comma)%xmm1,yes,no) -sha256_ni_supported :=$(call as-instr,sha256msg1 %xmm0$(comma)%xmm1,yes,no) -adx_supported := $(call as-instr,adox %r10$(comma)%r10,yes,no) - obj-$(CONFIG_CRYPTO_GLUE_HELPER_X86) += glue_helper.o obj-$(CONFIG_CRYPTO_TWOFISH_586) += twofish-i586.o +twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o +obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o +twofish-x86_64-y := twofish-x86_64-asm_64.o twofish_glue.o +obj-$(CONFIG_CRYPTO_TWOFISH_X86_64_3WAY) += twofish-x86_64-3way.o +twofish-x86_64-3way-y := twofish-x86_64-asm_64-3way.o twofish_glue_3way.o +obj-$(CONFIG_CRYPTO_TWOFISH_AVX_X86_64) += twofish-avx-x86_64.o +twofish-avx-x86_64-y := twofish-avx-x86_64-asm_64.o twofish_avx_glue.o + obj-$(CONFIG_CRYPTO_SERPENT_SSE2_586) += serpent-sse2-i586.o +serpent-sse2-i586-y := serpent-sse2-i586-asm_32.o serpent_sse2_glue.o +obj-$(CONFIG_CRYPTO_SERPENT_SSE2_X86_64) += serpent-sse2-x86_64.o +serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o +obj-$(CONFIG_CRYPTO_SERPENT_AVX_X86_64) += serpent-avx-x86_64.o +serpent-avx-x86_64-y := serpent-avx-x86_64-asm_64.o serpent_avx_glue.o +obj-$(CONFIG_CRYPTO_SERPENT_AVX2_X86_64) += serpent-avx2.o +serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o obj-$(CONFIG_CRYPTO_DES3_EDE_X86_64) += des3_ede-x86_64.o +des3_ede-x86_64-y := des3_ede-asm_64.o des3_ede_glue.o + obj-$(CONFIG_CRYPTO_CAMELLIA_X86_64) += camellia-x86_64.o +camellia-x86_64-y := camellia-x86_64-asm_64.o camellia_glue.o +obj-$(CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64) += camellia-aesni-avx-x86_64.o +camellia-aesni-avx-x86_64-y := camellia-aesni-avx-asm_64.o camellia_aesni_avx_glue.o +obj-$(CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64) += camellia-aesni-avx2.o +camellia-aesni-avx2-y := camellia-aesni-avx2-asm_64.o camellia_aesni_avx2_glue.o + obj-$(CONFIG_CRYPTO_BLOWFISH_X86_64) += blowfish-x86_64.o -obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o -obj-$(CONFIG_CRYPTO_TWOFISH_X86_64_3WAY) += twofish-x86_64-3way.o +blowfish-x86_64-y := blowfish-x86_64-asm_64.o blowfish_glue.o + +obj-$(CONFIG_CRYPTO_CAST5_AVX_X86_64) += cast5-avx-x86_64.o +cast5-avx-x86_64-y := cast5-avx-x86_64-asm_64.o cast5_avx_glue.o + +obj-$(CONFIG_CRYPTO_CAST6_AVX_X86_64) += cast6-avx-x86_64.o +cast6-avx-x86_64-y := cast6-avx-x86_64-asm_64.o cast6_avx_glue.o + +obj-$(CONFIG_CRYPTO_AEGIS128_AESNI_SSE2) += aegis128-aesni.o +aegis128-aesni-y := aegis128-aesni-asm.o aegis128-aesni-glue.o + obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha-x86_64.o -obj-$(CONFIG_CRYPTO_SERPENT_SSE2_X86_64) += serpent-sse2-x86_64.o +chacha-x86_64-y := chacha-ssse3-x86_64.o chacha_glue.o +chacha-x86_64-$(CONFIG_AS_AVX2) += chacha-avx2-x86_64.o +chacha-x86_64-$(CONFIG_AS_AVX512) += chacha-avx512vl-x86_64.o + obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o -obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o +aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o +aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o -obj-$(CONFIG_CRYPTO_CRC32C_INTEL) += crc32c-intel.o obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o -obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o -obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o -obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o -obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o -obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o - -obj-$(CONFIG_CRYPTO_AEGIS128_AESNI_SSE2) += aegis128-aesni.o +sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o +sha1-ssse3-$(CONFIG_AS_AVX2) += sha1_avx2_x86_64_asm.o +sha1-ssse3-$(CONFIG_AS_SHA1_NI) += sha1_ni_asm.o -obj-$(CONFIG_CRYPTO_NHPOLY1305_SSE2) += nhpoly1305-sse2.o -obj-$(CONFIG_CRYPTO_NHPOLY1305_AVX2) += nhpoly1305-avx2.o +obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o +sha256-ssse3-y := sha256-ssse3-asm.o sha256-avx-asm.o sha256-avx2-asm.o sha256_ssse3_glue.o +sha256-ssse3-$(CONFIG_AS_SHA256_NI) += sha256_ni_asm.o -# These modules require the assembler to support ADX. -ifeq ($(adx_supported),yes) - obj-$(CONFIG_CRYPTO_CURVE25519_X86) += curve25519-x86_64.o -endif +obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o +sha512-ssse3-y := sha512-ssse3-asm.o sha512-avx-asm.o sha512-avx2-asm.o sha512_ssse3_glue.o -# These modules require assembler to support AVX. -obj-$(CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64) += camellia-aesni-avx-x86_64.o -obj-$(CONFIG_CRYPTO_CAST5_AVX_X86_64) += cast5-avx-x86_64.o -obj-$(CONFIG_CRYPTO_CAST6_AVX_X86_64) += cast6-avx-x86_64.o -obj-$(CONFIG_CRYPTO_TWOFISH_AVX_X86_64) += twofish-avx-x86_64.o -obj-$(CONFIG_CRYPTO_SERPENT_AVX_X86_64) += serpent-avx-x86_64.o obj-$(CONFIG_CRYPTO_BLAKE2S_X86) += blake2s-x86_64.o +blake2s-x86_64-y := blake2s-core.o blake2s-glue.o -# These modules require assembler to support AVX2. -ifeq ($(avx2_supported),yes) - obj-$(CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64) += camellia-aesni-avx2.o - obj-$(CONFIG_CRYPTO_SERPENT_AVX2_X86_64) += serpent-avx2.o -endif +obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o +ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o -twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o -serpent-sse2-i586-y := serpent-sse2-i586-asm_32.o serpent_sse2_glue.o +obj-$(CONFIG_CRYPTO_CRC32C_INTEL) += crc32c-intel.o +crc32c-intel-y := crc32c-intel_glue.o +crc32c-intel-$(CONFIG_64BIT) += crc32c-pcl-intel-asm_64.o -des3_ede-x86_64-y := des3_ede-asm_64.o des3_ede_glue.o -camellia-x86_64-y := camellia-x86_64-asm_64.o camellia_glue.o -blowfish-x86_64-y := blowfish-x86_64-asm_64.o blowfish_glue.o -twofish-x86_64-y := twofish-x86_64-asm_64.o twofish_glue.o -twofish-x86_64-3way-y := twofish-x86_64-asm_64-3way.o twofish_glue_3way.o -chacha-x86_64-y := chacha-ssse3-x86_64.o chacha_glue.o -serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o +obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o +crc32-pclmul-y := crc32-pclmul_asm.o crc32-pclmul_glue.o -aegis128-aesni-y := aegis128-aesni-asm.o aegis128-aesni-glue.o +obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o +crct10dif-pclmul-y := crct10dif-pcl-asm_64.o crct10dif-pclmul_glue.o -nhpoly1305-sse2-y := nh-sse2-x86_64.o nhpoly1305-sse2-glue.o -blake2s-x86_64-y := blake2s-core.o blake2s-glue.o +obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o poly1305-x86_64-y := poly1305-x86_64-cryptogams.o poly1305_glue.o ifneq ($(CONFIG_CRYPTO_POLY1305_X86_64),) targets += poly1305-x86_64-cryptogams.S endif -camellia-aesni-avx-x86_64-y := camellia-aesni-avx-asm_64.o \ - camellia_aesni_avx_glue.o -cast5-avx-x86_64-y := cast5-avx-x86_64-asm_64.o cast5_avx_glue.o -cast6-avx-x86_64-y := cast6-avx-x86_64-asm_64.o cast6_avx_glue.o -twofish-avx-x86_64-y := twofish-avx-x86_64-asm_64.o twofish_avx_glue.o -serpent-avx-x86_64-y := serpent-avx-x86_64-asm_64.o serpent_avx_glue.o - -ifeq ($(avx2_supported),yes) - camellia-aesni-avx2-y := camellia-aesni-avx2-asm_64.o camellia_aesni_avx2_glue.o - chacha-x86_64-y += chacha-avx2-x86_64.o - serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o - - nhpoly1305-avx2-y := nh-avx2-x86_64.o nhpoly1305-avx2-glue.o -endif - -ifeq ($(avx512_supported),yes) - chacha-x86_64-y += chacha-avx512vl-x86_64.o -endif +obj-$(CONFIG_CRYPTO_NHPOLY1305_SSE2) += nhpoly1305-sse2.o +nhpoly1305-sse2-y := nh-sse2-x86_64.o nhpoly1305-sse2-glue.o +obj-$(CONFIG_CRYPTO_NHPOLY1305_AVX2) += nhpoly1305-avx2.o +nhpoly1305-avx2-y := nh-avx2-x86_64.o nhpoly1305-avx2-glue.o -aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o -aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o -ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o -sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o -ifeq ($(avx2_supported),yes) -sha1-ssse3-y += sha1_avx2_x86_64_asm.o -endif -ifeq ($(sha1_ni_supported),yes) -sha1-ssse3-y += sha1_ni_asm.o -endif -crc32c-intel-y := crc32c-intel_glue.o -crc32c-intel-$(CONFIG_64BIT) += crc32c-pcl-intel-asm_64.o -crc32-pclmul-y := crc32-pclmul_asm.o crc32-pclmul_glue.o -sha256-ssse3-y := sha256-ssse3-asm.o sha256-avx-asm.o sha256-avx2-asm.o sha256_ssse3_glue.o -ifeq ($(sha256_ni_supported),yes) -sha256-ssse3-y += sha256_ni_asm.o -endif -sha512-ssse3-y := sha512-ssse3-asm.o sha512-avx-asm.o sha512-avx2-asm.o sha512_ssse3_glue.o -crct10dif-pclmul-y := crct10dif-pcl-asm_64.o crct10dif-pclmul_glue.o +obj-$(CONFIG_CRYPTO_CURVE25519_X86) += curve25519-x86_64.o quiet_cmd_perlasm = PERLASM $@ cmd_perlasm = $(PERL) $< > $@ -- cgit From d7e40ea83eb9155bd1d6bbbb48ffb843a8f56120 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 26 Mar 2020 17:01:04 +0900 Subject: crypto: x86 - clean up poly1305-x86_64-cryptogams.S by 'make clean' poly1305-x86_64-cryptogams.S is a generated file, so it should be cleaned up by 'make clean'. Assigning it to the variable 'targets' teaches Kbuild that it is a generated file. However, this line is not evaluated when cleaning because scripts/Makefile.clean does not include include/config/auto.conf. Remove the ifneq-conditional, so this file is correctly cleaned up. Signed-off-by: Masahiro Yamada Acked-by: Herbert Xu Acked-by: Ingo Molnar --- arch/x86/crypto/Makefile | 2 -- 1 file changed, 2 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index 2f23f08fdd4b..d1ded16c1dcd 100644 --- a/arch/x86/crypto/Makefile +++ b/arch/x86/crypto/Makefile @@ -85,9 +85,7 @@ crct10dif-pclmul-y := crct10dif-pcl-asm_64.o crct10dif-pclmul_glue.o obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o poly1305-x86_64-y := poly1305-x86_64-cryptogams.o poly1305_glue.o -ifneq ($(CONFIG_CRYPTO_POLY1305_X86_64),) targets += poly1305-x86_64-cryptogams.S -endif obj-$(CONFIG_CRYPTO_NHPOLY1305_SSE2) += nhpoly1305-sse2.o nhpoly1305-sse2-y := nh-sse2-x86_64.o nhpoly1305-sse2-glue.o -- cgit From e6abef610c7363cbd25205674b962031ef3bc790 Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Thu, 26 Mar 2020 14:26:00 -0600 Subject: x86: update AS_* macros to binutils >=2.23, supporting ADX and AVX2 Now that the kernel specifies binutils 2.23 as the minimum version, we can remove ifdefs for AVX2 and ADX throughout. Signed-off-by: Jason A. Donenfeld Acked-by: Ingo Molnar Reviewed-by: Nick Desaulniers Signed-off-by: Masahiro Yamada --- arch/x86/Kconfig.assembler | 10 ---------- arch/x86/crypto/Makefile | 6 ++---- arch/x86/crypto/aesni-intel_avx-x86_64.S | 3 --- arch/x86/crypto/aesni-intel_glue.c | 7 ------- arch/x86/crypto/chacha_glue.c | 6 ++---- arch/x86/crypto/poly1305-x86_64-cryptogams.pl | 8 -------- arch/x86/crypto/poly1305_glue.c | 5 ++--- arch/x86/crypto/sha1_ssse3_glue.c | 6 ------ arch/x86/crypto/sha256-avx2-asm.S | 3 --- arch/x86/crypto/sha256_ssse3_glue.c | 6 ------ arch/x86/crypto/sha512-avx2-asm.S | 3 --- arch/x86/crypto/sha512_ssse3_glue.c | 5 ----- 12 files changed, 6 insertions(+), 62 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler index a5a1d2766b3a..13de0db38d4e 100644 --- a/arch/x86/Kconfig.assembler +++ b/arch/x86/Kconfig.assembler @@ -1,11 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 # Copyright (C) 2020 Jason A. Donenfeld . All Rights Reserved. -config AS_AVX2 - def_bool $(as-instr,vpbroadcastb %xmm0$(comma)%ymm1) - help - Supported by binutils >= 2.22 and LLVM integrated assembler - config AS_AVX512 def_bool $(as-instr,vpmovm2b %k1$(comma)%zmm5) help @@ -20,8 +15,3 @@ config AS_SHA256_NI def_bool $(as-instr,sha256msg1 %xmm0$(comma)%xmm1) help Supported by binutils >= 2.24 and LLVM integrated assembler - -config AS_ADX - def_bool $(as-instr,adox %eax$(comma)%eax) - help - Supported by binutils >= 2.23 and LLVM integrated assembler diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index d1ded16c1dcd..a31de0c6ccde 100644 --- a/arch/x86/crypto/Makefile +++ b/arch/x86/crypto/Makefile @@ -47,8 +47,7 @@ obj-$(CONFIG_CRYPTO_AEGIS128_AESNI_SSE2) += aegis128-aesni.o aegis128-aesni-y := aegis128-aesni-asm.o aegis128-aesni-glue.o obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha-x86_64.o -chacha-x86_64-y := chacha-ssse3-x86_64.o chacha_glue.o -chacha-x86_64-$(CONFIG_AS_AVX2) += chacha-avx2-x86_64.o +chacha-x86_64-y := chacha-avx2-x86_64.o chacha-ssse3-x86_64.o chacha_glue.o chacha-x86_64-$(CONFIG_AS_AVX512) += chacha-avx512vl-x86_64.o obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o @@ -56,8 +55,7 @@ aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o -sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o -sha1-ssse3-$(CONFIG_AS_AVX2) += sha1_avx2_x86_64_asm.o +sha1-ssse3-y := sha1_avx2_x86_64_asm.o sha1_ssse3_asm.o sha1_ssse3_glue.o sha1-ssse3-$(CONFIG_AS_SHA1_NI) += sha1_ni_asm.o obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index cc56ee43238b..0cea33295287 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -1868,7 +1868,6 @@ key_256_finalize: ret SYM_FUNC_END(aesni_gcm_finalize_avx_gen2) -#ifdef CONFIG_AS_AVX2 ############################################################################### # GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0) # Input: A and B (128-bits each, bit-reflected) @@ -2836,5 +2835,3 @@ key_256_finalize4: FUNC_RESTORE ret SYM_FUNC_END(aesni_gcm_finalize_avx_gen4) - -#endif /* CONFIG_AS_AVX2 */ diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c index 655ad6bc8810..ad8a7188a2bf 100644 --- a/arch/x86/crypto/aesni-intel_glue.c +++ b/arch/x86/crypto/aesni-intel_glue.c @@ -233,7 +233,6 @@ static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_gen2 = { .finalize = &aesni_gcm_finalize_avx_gen2, }; -#ifdef CONFIG_AS_AVX2 /* * asmlinkage void aesni_gcm_init_avx_gen4() * gcm_data *my_ctx_data, context data @@ -276,8 +275,6 @@ static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_gen4 = { .finalize = &aesni_gcm_finalize_avx_gen4, }; -#endif - static inline struct aesni_rfc4106_gcm_ctx *aesni_rfc4106_gcm_ctx_get(struct crypto_aead *tfm) { @@ -706,10 +703,8 @@ static int gcmaes_crypt_by_sg(bool enc, struct aead_request *req, if (!enc) left -= auth_tag_len; -#ifdef CONFIG_AS_AVX2 if (left < AVX_GEN4_OPTSIZE && gcm_tfm == &aesni_gcm_tfm_avx_gen4) gcm_tfm = &aesni_gcm_tfm_avx_gen2; -#endif if (left < AVX_GEN2_OPTSIZE && gcm_tfm == &aesni_gcm_tfm_avx_gen2) gcm_tfm = &aesni_gcm_tfm_sse; @@ -1069,12 +1064,10 @@ static int __init aesni_init(void) if (!x86_match_cpu(aesni_cpu_id)) return -ENODEV; #ifdef CONFIG_X86_64 -#ifdef CONFIG_AS_AVX2 if (boot_cpu_has(X86_FEATURE_AVX2)) { pr_info("AVX2 version of gcm_enc/dec engaged.\n"); aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen4; } else -#endif if (boot_cpu_has(X86_FEATURE_AVX)) { pr_info("AVX version of gcm_enc/dec engaged.\n"); aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen2; diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c index 68a74953efaf..b412c21ee06e 100644 --- a/arch/x86/crypto/chacha_glue.c +++ b/arch/x86/crypto/chacha_glue.c @@ -79,8 +79,7 @@ static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src, } } - if (IS_ENABLED(CONFIG_AS_AVX2) && - static_branch_likely(&chacha_use_avx2)) { + if (static_branch_likely(&chacha_use_avx2)) { while (bytes >= CHACHA_BLOCK_SIZE * 8) { chacha_8block_xor_avx2(state, dst, src, bytes, nrounds); bytes -= CHACHA_BLOCK_SIZE * 8; @@ -288,8 +287,7 @@ static int __init chacha_simd_mod_init(void) static_branch_enable(&chacha_use_simd); - if (IS_ENABLED(CONFIG_AS_AVX2) && - boot_cpu_has(X86_FEATURE_AVX) && + if (boot_cpu_has(X86_FEATURE_AVX) && boot_cpu_has(X86_FEATURE_AVX2) && cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL)) { static_branch_enable(&chacha_use_avx2); diff --git a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl index 5bac2d533104..137edcf038cb 100644 --- a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl +++ b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl @@ -1514,10 +1514,6 @@ ___ if ($avx>1) { -if ($kernel) { - $code .= "#ifdef CONFIG_AS_AVX2\n"; -} - my ($H0,$H1,$H2,$H3,$H4, $MASK, $T4,$T0,$T1,$T2,$T3, $D0,$D1,$D2,$D3,$D4) = map("%ymm$_",(0..15)); my $S4=$MASK; @@ -2808,10 +2804,6 @@ ___ poly1305_blocks_avxN(0); &end_function("poly1305_blocks_avx2"); -if($kernel) { - $code .= "#endif\n"; -} - ####################################################################### if ($avx>2) { # On entry we have input length divisible by 64. But since inner loop diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c index 4a6226e1d15e..6dfec19f7d57 100644 --- a/arch/x86/crypto/poly1305_glue.c +++ b/arch/x86/crypto/poly1305_glue.c @@ -108,7 +108,7 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len, kernel_fpu_begin(); if (IS_ENABLED(CONFIG_AS_AVX512) && static_branch_likely(&poly1305_use_avx512)) poly1305_blocks_avx512(ctx, inp, bytes, padbit); - else if (IS_ENABLED(CONFIG_AS_AVX2) && static_branch_likely(&poly1305_use_avx2)) + else if (static_branch_likely(&poly1305_use_avx2)) poly1305_blocks_avx2(ctx, inp, bytes, padbit); else poly1305_blocks_avx(ctx, inp, bytes, padbit); @@ -264,8 +264,7 @@ static int __init poly1305_simd_mod_init(void) if (boot_cpu_has(X86_FEATURE_AVX) && cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL)) static_branch_enable(&poly1305_use_avx); - if (IS_ENABLED(CONFIG_AS_AVX2) && boot_cpu_has(X86_FEATURE_AVX) && - boot_cpu_has(X86_FEATURE_AVX2) && + if (boot_cpu_has(X86_FEATURE_AVX) && boot_cpu_has(X86_FEATURE_AVX2) && cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL)) static_branch_enable(&poly1305_use_avx2); if (IS_ENABLED(CONFIG_AS_AVX512) && boot_cpu_has(X86_FEATURE_AVX) && diff --git a/arch/x86/crypto/sha1_ssse3_glue.c b/arch/x86/crypto/sha1_ssse3_glue.c index 275b65dd30c9..a801ffc10cbb 100644 --- a/arch/x86/crypto/sha1_ssse3_glue.c +++ b/arch/x86/crypto/sha1_ssse3_glue.c @@ -174,7 +174,6 @@ static void unregister_sha1_avx(void) crypto_unregister_shash(&sha1_avx_alg); } -#if defined(CONFIG_AS_AVX2) #define SHA1_AVX2_BLOCK_OPTSIZE 4 /* optimal 4*64 bytes of SHA1 blocks */ asmlinkage void sha1_transform_avx2(struct sha1_state *state, @@ -246,11 +245,6 @@ static void unregister_sha1_avx2(void) crypto_unregister_shash(&sha1_avx2_alg); } -#else -static inline int register_sha1_avx2(void) { return 0; } -static inline void unregister_sha1_avx2(void) { } -#endif - #ifdef CONFIG_AS_SHA1_NI asmlinkage void sha1_ni_transform(struct sha1_state *digest, const u8 *data, int rounds); diff --git a/arch/x86/crypto/sha256-avx2-asm.S b/arch/x86/crypto/sha256-avx2-asm.S index 499d9ec129de..11ff60c29c8b 100644 --- a/arch/x86/crypto/sha256-avx2-asm.S +++ b/arch/x86/crypto/sha256-avx2-asm.S @@ -48,7 +48,6 @@ # This code schedules 2 blocks at a time, with 4 lanes per block ######################################################################## -#ifdef CONFIG_AS_AVX2 #include ## assume buffers not aligned @@ -767,5 +766,3 @@ _SHUF_00BA: .align 32 _SHUF_DC00: .octa 0x0b0a090803020100FFFFFFFFFFFFFFFF,0x0b0a090803020100FFFFFFFFFFFFFFFF - -#endif diff --git a/arch/x86/crypto/sha256_ssse3_glue.c b/arch/x86/crypto/sha256_ssse3_glue.c index 8bdc3be31f64..6394b5fe8db6 100644 --- a/arch/x86/crypto/sha256_ssse3_glue.c +++ b/arch/x86/crypto/sha256_ssse3_glue.c @@ -220,7 +220,6 @@ static void unregister_sha256_avx(void) ARRAY_SIZE(sha256_avx_algs)); } -#if defined(CONFIG_AS_AVX2) asmlinkage void sha256_transform_rorx(struct sha256_state *state, const u8 *data, int blocks); @@ -295,11 +294,6 @@ static void unregister_sha256_avx2(void) ARRAY_SIZE(sha256_avx2_algs)); } -#else -static inline int register_sha256_avx2(void) { return 0; } -static inline void unregister_sha256_avx2(void) { } -#endif - #ifdef CONFIG_AS_SHA256_NI asmlinkage void sha256_ni_transform(struct sha256_state *digest, const u8 *data, int rounds); diff --git a/arch/x86/crypto/sha512-avx2-asm.S b/arch/x86/crypto/sha512-avx2-asm.S index 3dd886b14e7d..3a44bdcfd583 100644 --- a/arch/x86/crypto/sha512-avx2-asm.S +++ b/arch/x86/crypto/sha512-avx2-asm.S @@ -49,7 +49,6 @@ # This code schedules 1 blocks at a time, with 4 lanes per block ######################################################################## -#ifdef CONFIG_AS_AVX2 #include .text @@ -749,5 +748,3 @@ PSHUFFLE_BYTE_FLIP_MASK: MASK_YMM_LO: .octa 0x00000000000000000000000000000000 .octa 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF - -#endif diff --git a/arch/x86/crypto/sha512_ssse3_glue.c b/arch/x86/crypto/sha512_ssse3_glue.c index 75214982a633..82cc1b3ced1d 100644 --- a/arch/x86/crypto/sha512_ssse3_glue.c +++ b/arch/x86/crypto/sha512_ssse3_glue.c @@ -218,7 +218,6 @@ static void unregister_sha512_avx(void) ARRAY_SIZE(sha512_avx_algs)); } -#if defined(CONFIG_AS_AVX2) asmlinkage void sha512_transform_rorx(struct sha512_state *state, const u8 *data, int blocks); @@ -293,10 +292,6 @@ static void unregister_sha512_avx2(void) crypto_unregister_shashes(sha512_avx2_algs, ARRAY_SIZE(sha512_avx2_algs)); } -#else -static inline int register_sha512_avx2(void) { return 0; } -static inline void unregister_sha512_avx2(void) { } -#endif static int __init sha512_ssse3_mod_init(void) { -- cgit From d6f34f4c6b4a962eb7a86c923fea206f866a40be Mon Sep 17 00:00:00 2001 From: Juergen Gross Date: Thu, 9 Apr 2020 09:00:01 +0200 Subject: x86/xen: fix booting 32-bit pv guest Commit 2f62f36e62daec ("x86/xen: Make the boot CPU idle task reliable") introduced a regression for booting 32 bit Xen PV guests: the address of the initial stack needs to be a virtual one. Fixes: 2f62f36e62daec ("x86/xen: Make the boot CPU idle task reliable") Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky Link: https://lore.kernel.org/r/20200409070001.16675-1-jgross@suse.com Signed-off-by: Juergen Gross --- arch/x86/xen/xen-head.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S index 7d1c4fcbe8f7..1ba601df3a37 100644 --- a/arch/x86/xen/xen-head.S +++ b/arch/x86/xen/xen-head.S @@ -38,7 +38,7 @@ SYM_CODE_START(startup_xen) #ifdef CONFIG_X86_64 mov initial_stack(%rip), %rsp #else - mov pa(initial_stack), %esp + mov initial_stack, %esp #endif #ifdef CONFIG_X86_64 -- cgit From cf11e85fc08cc6a4fe3ac2ba2e610c962bf20bc3 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Fri, 10 Apr 2020 14:32:45 -0700 Subject: mm: hugetlb: optionally allocate gigantic hugepages using cma Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime") has added the run-time allocation of gigantic pages. However it actually works only at early stages of the system loading, when the majority of memory is free. After some time the memory gets fragmented by non-movable pages, so the chances to find a contiguous 1GB block are getting close to zero. Even dropping caches manually doesn't help a lot. At large scale rebooting servers in order to allocate gigantic hugepages is quite expensive and complex. At the same time keeping some constant percentage of memory in reserved hugepages even if the workload isn't using it is a big waste: not all workloads can benefit from using 1 GB pages. The following solution can solve the problem: 1) On boot time a dedicated cma area* is reserved. The size is passed as a kernel argument. 2) Run-time allocations of gigantic hugepages are performed using the cma allocator and the dedicated cma area In this case gigantic hugepages can be allocated successfully with a high probability, however the memory isn't completely wasted if nobody is using 1GB hugepages: it can be used for pagecache, anon memory, THPs, etc. * On a multi-node machine a per-node cma area is allocated on each node. Following gigantic hugetlb allocation are using the first available numa node if the mask isn't specified by a user. Usage: 1) configure the kernel to allocate a cma area for hugetlb allocations: pass hugetlb_cma=10G as a kernel argument 2) allocate hugetlb pages as usual, e.g. echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages If the option isn't enabled or the allocation of the cma area failed, the current behavior of the system is preserved. x86 and arm-64 are covered by this patch, other architectures can be trivially added later. The patch contains clean-ups and fixes proposed and implemented by Aslan Bakirov and Randy Dunlap. It also contains ideas and suggestions proposed by Rik van Riel, Michal Hocko and Mike Kravetz. Thanks! Signed-off-by: Roman Gushchin Signed-off-by: Andrew Morton Tested-by: Andreas Schaufler Acked-by: Mike Kravetz Acked-by: Michal Hocko Cc: Aslan Bakirov Cc: Randy Dunlap Cc: Rik van Riel Cc: Joonsoo Kim Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com Signed-off-by: Linus Torvalds --- arch/x86/kernel/setup.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index e6b545047f38..4b3fa6cd3106 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include @@ -1157,6 +1158,9 @@ void __init setup_arch(char **cmdline_p) initmem_init(); dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT); + if (boot_cpu_has(X86_FEATURE_GBPAGES)) + hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT); + /* * Reserve memory for crash kernel after SRAT is parsed so that it * won't consume hotpluggable memory. -- cgit From c97078bd219cbe1a878b24bb4e61d312f19ece1f Mon Sep 17 00:00:00 2001 From: Arjun Roy Date: Fri, 10 Apr 2020 14:32:58 -0700 Subject: mm: define pte_index as macro for x86 pte_index() is either defined as a macro (e.g. sparc64) or as an inlined function (e.g. x86). vm_insert_pages() depends on pte_index but it is not defined on all platforms (e.g. m68k). To fix compilation of vm_insert_pages() on architectures not providing pte_index(), we perform the following fix: 0. For platforms where it is meaningful, and defined as a macro, no change is needed. 1. For platforms where it is meaningful and defined as an inlined function, and we want to use it with vm_insert_pages(), we define a degenerate macro of the form: #define pte_index pte_index 2. vm_insert_pages() checks for the existence of a pte_index macro definition. If found, it implements a batched insert. If not found, it devolves to calling vm_insert_page() in a loop. This patch implements step 1 for x86. v3 of this patch fixes a compilation warning for an unused method. v2 of this patch moved a macro definition to a more readable location. Signed-off-by: Arjun Roy Signed-off-by: Andrew Morton Cc: David Miller Cc: Eric Dumazet Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Soheil Hassas Yeganeh Cc: Stephen Rothwell Link: http://lkml.kernel.org/r/20200228054714.204424-1-arjunroy.kdev@gmail.com Signed-off-by: Linus Torvalds --- arch/x86/include/asm/pgtable.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 28838d790191..abad0da0973a 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -860,7 +860,10 @@ static inline unsigned long pmd_index(unsigned long address) * * this function returns the index of the entry in the pte page which would * control the given virtual address + * + * Also define macro so we can test if pte_index is defined for arch. */ +#define pte_index pte_index static inline unsigned long pte_index(unsigned long address) { return (address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1); -- cgit From c62da0c35d58518ddb26ff641d2485596567fd96 Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Fri, 10 Apr 2020 14:33:05 -0700 Subject: mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS There are many platforms with exact same value for VM_DATA_DEFAULT_FLAGS This creates a default value for VM_DATA_DEFAULT_FLAGS in line with the existing VM_STACK_DEFAULT_FLAGS. While here, also define some more macros with standard VMA access flag combinations that are used frequently across many platforms. Apart from simplification, this reduces code duplication as well. Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton Reviewed-by: Vlastimil Babka Acked-by: Geert Uytterhoeven Cc: Richard Henderson Cc: Vineet Gupta Cc: Russell King Cc: Catalin Marinas Cc: Mark Salter Cc: Guo Ren Cc: Yoshinori Sato Cc: Brian Cain Cc: Tony Luck Cc: Michal Simek Cc: Ralf Baechle Cc: Paul Burton Cc: Nick Hu Cc: Ley Foon Tan Cc: Jonas Bonn Cc: "James E.J. Bottomley" Cc: Michael Ellerman Cc: Paul Walmsley Cc: Heiko Carstens Cc: Rich Felker Cc: "David S. Miller" Cc: Guan Xuetao Cc: Thomas Gleixner Cc: Jeff Dike Cc: Chris Zankel Link: http://lkml.kernel.org/r/1583391014-8170-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- arch/x86/include/asm/page_types.h | 4 +--- arch/x86/um/asm/vm-flags.h | 10 ++-------- 2 files changed, 3 insertions(+), 11 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h index c85e15010f48..e27aa6be6320 100644 --- a/arch/x86/include/asm/page_types.h +++ b/arch/x86/include/asm/page_types.h @@ -35,9 +35,7 @@ #define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET) -#define VM_DATA_DEFAULT_FLAGS \ - (((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0 ) | \ - VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) +#define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC #define __PHYSICAL_START ALIGN(CONFIG_PHYSICAL_START, \ CONFIG_PHYSICAL_ALIGN) diff --git a/arch/x86/um/asm/vm-flags.h b/arch/x86/um/asm/vm-flags.h index 7c297e9e2413..df7a3896f5dd 100644 --- a/arch/x86/um/asm/vm-flags.h +++ b/arch/x86/um/asm/vm-flags.h @@ -9,17 +9,11 @@ #ifdef CONFIG_X86_32 -#define VM_DATA_DEFAULT_FLAGS \ - (VM_READ | VM_WRITE | \ - ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0 ) | \ - VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) +#define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC #else -#define VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \ - VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) -#define VM_STACK_DEFAULT_FLAGS (VM_GROWSDOWN | VM_READ | VM_WRITE | \ - VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) +#define VM_STACK_DEFAULT_FLAGS (VM_GROWSDOWN | VM_DATA_FLAGS_EXEC) #endif #endif -- cgit From 6cb4d9a2870d2062e34c93bfef4d52fca3fe42d1 Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Fri, 10 Apr 2020 14:33:09 -0700 Subject: mm/vma: introduce VM_ACCESS_FLAGS There are many places where all basic VMA access flags (read, write, exec) are initialized or checked against as a group. One such example is during page fault. Existing vma_is_accessible() wrapper already creates the notion of VMA accessibility as a group access permissions. Hence lets just create VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC) which will not only reduce code duplication but also extend the VMA accessibility concept in general. Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton Reviewed-by: Vlastimil Babka Cc: Russell King Cc: Catalin Marinas Cc: Mark Salter Cc: Nick Hu Cc: Ley Foon Tan Cc: Michael Ellerman Cc: Heiko Carstens Cc: Yoshinori Sato Cc: Guan Xuetao Cc: Dave Hansen Cc: Thomas Gleixner Cc: Rob Springer Cc: Greg Kroah-Hartman Cc: Geert Uytterhoeven Link: http://lkml.kernel.org/r/1583391014-8170-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- arch/x86/mm/pkeys.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index c6f84c0b5d7a..8873ed1438a9 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -63,7 +63,7 @@ int __execute_only_pkey(struct mm_struct *mm) static inline bool vma_is_pkey_exec_only(struct vm_area_struct *vma) { /* Do this check first since the vm_flags should be hot */ - if ((vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) != VM_EXEC) + if ((vma->vm_flags & VM_ACCESS_FLAGS) != VM_EXEC) return false; if (vma_pkey(vma) != vma->vm_mm->context.execute_only_pkey) return false; -- cgit From f5637d3b42ab0465ef71d5fb8461bce97fba95e8 Mon Sep 17 00:00:00 2001 From: Logan Gunthorpe Date: Fri, 10 Apr 2020 14:33:21 -0700 Subject: mm/memory_hotplug: rename mhp_restrictions to mhp_params The mhp_restrictions struct really doesn't specify anything resembling a restriction anymore so rename it to be mhp_params as it is a list of extended parameters. Signed-off-by: Logan Gunthorpe Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Reviewed-by: Dan Williams Acked-by: Michal Hocko Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Dave Hansen Cc: Eric Badger Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Michael Ellerman Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.com Signed-off-by: Linus Torvalds --- arch/x86/mm/init_32.c | 4 ++-- arch/x86/mm/init_64.c | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index de73992b8432..d736c8625503 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -819,12 +819,12 @@ void __init mem_init(void) #ifdef CONFIG_MEMORY_HOTPLUG int arch_add_memory(int nid, u64 start, u64 size, - struct mhp_restrictions *restrictions) + struct mhp_params *params) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; - return __add_pages(nid, start_pfn, nr_pages, restrictions); + return __add_pages(nid, start_pfn, nr_pages, params); } void arch_remove_memory(int nid, u64 start, u64 size, diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 0a14711d3a93..faa86a9a3b0d 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -843,11 +843,11 @@ static void update_end_of_memory_vars(u64 start, u64 size) } int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages, - struct mhp_restrictions *restrictions) + struct mhp_params *params) { int ret; - ret = __add_pages(nid, start_pfn, nr_pages, restrictions); + ret = __add_pages(nid, start_pfn, nr_pages, params); WARN_ON_ONCE(ret); /* update max_pfn, max_low_pfn and high_memory */ @@ -858,14 +858,14 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages, } int arch_add_memory(int nid, u64 start, u64 size, - struct mhp_restrictions *restrictions) + struct mhp_params *params) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; init_memory_mapping(start, start + size); - return add_pages(nid, start_pfn, nr_pages, restrictions); + return add_pages(nid, start_pfn, nr_pages, params); } #define PAGE_INUSE 0xFD -- cgit From c164fbb40c43f8041f4d05ec9996d8ee343c92b1 Mon Sep 17 00:00:00 2001 From: Logan Gunthorpe Date: Fri, 10 Apr 2020 14:33:24 -0700 Subject: x86/mm: thread pgprot_t through init_memory_mapping() In preparation to support a pgprot_t argument for arch_add_memory(). It's required to move the prototype of init_memory_mapping() seeing the original location came before the definition of pgprot_t. Signed-off-by: Logan Gunthorpe Signed-off-by: Andrew Morton Reviewed-by: Dan Williams Acked-by: Michal Hocko Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Benjamin Herrenschmidt Cc: Catalin Marinas Cc: Christoph Hellwig Cc: David Hildenbrand Cc: Eric Badger Cc: Jason Gunthorpe Cc: Michael Ellerman Cc: Paul Mackerras Cc: Will Deacon Link: http://lkml.kernel.org/r/20200306170846.9333-4-logang@deltatee.com Signed-off-by: Linus Torvalds --- arch/x86/include/asm/page_types.h | 3 --- arch/x86/include/asm/pgtable.h | 3 +++ arch/x86/kernel/amd_gart_64.c | 3 ++- arch/x86/mm/init.c | 9 +++++---- arch/x86/mm/init_32.c | 3 ++- arch/x86/mm/init_64.c | 32 ++++++++++++++++++-------------- arch/x86/mm/mm_internal.h | 3 ++- arch/x86/platform/uv/bios_uv.c | 3 ++- 8 files changed, 34 insertions(+), 25 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h index e27aa6be6320..a506a411474d 100644 --- a/arch/x86/include/asm/page_types.h +++ b/arch/x86/include/asm/page_types.h @@ -71,9 +71,6 @@ static inline phys_addr_t get_max_mapped(void) bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn); -extern unsigned long init_memory_mapping(unsigned long start, - unsigned long end); - extern void initmem_init(void); #endif /* !__ASSEMBLY__ */ diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index abad0da0973a..4d02e64af1b3 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1081,6 +1081,9 @@ static inline void __meminit init_trampoline_default(void) void __init poking_init(void); +unsigned long init_memory_mapping(unsigned long start, + unsigned long end, pgprot_t prot); + # ifdef CONFIG_RANDOMIZE_MEMORY void __meminit init_trampoline(void); # else diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c index 4e5f50236048..16133819415c 100644 --- a/arch/x86/kernel/amd_gart_64.c +++ b/arch/x86/kernel/amd_gart_64.c @@ -744,7 +744,8 @@ int __init gart_iommu_init(void) start_pfn = PFN_DOWN(aper_base); if (!pfn_range_is_mapped(start_pfn, end_pfn)) - init_memory_mapping(start_pfn<> PAGE_SHIFT, ret >> PAGE_SHIFT); @@ -521,7 +522,7 @@ static unsigned long __init init_range_memory_mapping( */ can_use_brk_pgt = max(start, (u64)pgt_buf_end<= min(end, (u64)pgt_buf_top<> PAGE_SHIFT, - PAGE_KERNEL_LARGE), + prot), init); spin_unlock(&init_mm.page_table_lock); paddr_last = paddr_next; @@ -669,7 +672,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end, static unsigned long __meminit phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end, - unsigned long page_size_mask, bool init) + unsigned long page_size_mask, pgprot_t prot, bool init) { unsigned long vaddr, vaddr_end, vaddr_next, paddr_next, paddr_last; @@ -679,7 +682,7 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end, if (!pgtable_l5_enabled()) return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, - page_size_mask, init); + page_size_mask, prot, init); for (; vaddr < vaddr_end; vaddr = vaddr_next) { p4d_t *p4d = p4d_page + p4d_index(vaddr); @@ -702,13 +705,13 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end, if (!p4d_none(*p4d)) { pud = pud_offset(p4d, 0); paddr_last = phys_pud_init(pud, paddr, __pa(vaddr_end), - page_size_mask, init); + page_size_mask, prot, init); continue; } pud = alloc_low_page(); paddr_last = phys_pud_init(pud, paddr, __pa(vaddr_end), - page_size_mask, init); + page_size_mask, prot, init); spin_lock(&init_mm.page_table_lock); p4d_populate_init(&init_mm, p4d, pud, init); @@ -722,7 +725,7 @@ static unsigned long __meminit __kernel_physical_mapping_init(unsigned long paddr_start, unsigned long paddr_end, unsigned long page_size_mask, - bool init) + pgprot_t prot, bool init) { bool pgd_changed = false; unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last; @@ -743,13 +746,13 @@ __kernel_physical_mapping_init(unsigned long paddr_start, paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end), page_size_mask, - init); + prot, init); continue; } p4d = alloc_low_page(); paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end), - page_size_mask, init); + page_size_mask, prot, init); spin_lock(&init_mm.page_table_lock); if (pgtable_l5_enabled()) @@ -778,10 +781,10 @@ __kernel_physical_mapping_init(unsigned long paddr_start, unsigned long __meminit kernel_physical_mapping_init(unsigned long paddr_start, unsigned long paddr_end, - unsigned long page_size_mask) + unsigned long page_size_mask, pgprot_t prot) { return __kernel_physical_mapping_init(paddr_start, paddr_end, - page_size_mask, true); + page_size_mask, prot, true); } /* @@ -796,7 +799,8 @@ kernel_physical_mapping_change(unsigned long paddr_start, unsigned long page_size_mask) { return __kernel_physical_mapping_init(paddr_start, paddr_end, - page_size_mask, false); + page_size_mask, PAGE_KERNEL, + false); } #ifndef CONFIG_NUMA @@ -863,7 +867,7 @@ int arch_add_memory(int nid, u64 start, u64 size, unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; - init_memory_mapping(start, start + size); + init_memory_mapping(start, start + size, PAGE_KERNEL); return add_pages(nid, start_pfn, nr_pages, params); } diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h index eeae142062ed..3f37b5c80bb3 100644 --- a/arch/x86/mm/mm_internal.h +++ b/arch/x86/mm/mm_internal.h @@ -12,7 +12,8 @@ void early_ioremap_page_table_range_init(void); unsigned long kernel_physical_mapping_init(unsigned long start, unsigned long end, - unsigned long page_size_mask); + unsigned long page_size_mask, + pgprot_t prot); unsigned long kernel_physical_mapping_change(unsigned long start, unsigned long end, unsigned long page_size_mask); diff --git a/arch/x86/platform/uv/bios_uv.c b/arch/x86/platform/uv/bios_uv.c index 607f58147311..c60255da5a6c 100644 --- a/arch/x86/platform/uv/bios_uv.c +++ b/arch/x86/platform/uv/bios_uv.c @@ -352,7 +352,8 @@ void __iomem *__init efi_ioremap(unsigned long phys_addr, unsigned long size, if (type == EFI_MEMORY_MAPPED_IO) return ioremap(phys_addr, size); - last_map_pfn = init_memory_mapping(phys_addr, phys_addr + size); + last_map_pfn = init_memory_mapping(phys_addr, phys_addr + size, + PAGE_KERNEL); if ((last_map_pfn << PAGE_SHIFT) < phys_addr + size) { unsigned long top = last_map_pfn << PAGE_SHIFT; efi_ioremap(top, size - (top - phys_addr), type, attribute); -- cgit From 30796e18c29942c4d64bf89c4135c975393ec1ad Mon Sep 17 00:00:00 2001 From: Logan Gunthorpe Date: Fri, 10 Apr 2020 14:33:28 -0700 Subject: x86/mm: introduce __set_memory_prot() For use in the 32bit arch_add_memory() to set the pgprot type of the memory to add. Signed-off-by: Logan Gunthorpe Signed-off-by: Andrew Morton Reviewed-by: Dan Williams Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Benjamin Herrenschmidt Cc: Catalin Marinas Cc: Christoph Hellwig Cc: David Hildenbrand Cc: Eric Badger Cc: Jason Gunthorpe Cc: Michael Ellerman Cc: Michal Hocko Cc: Paul Mackerras Cc: Will Deacon Link: http://lkml.kernel.org/r/20200306170846.9333-5-logang@deltatee.com Signed-off-by: Linus Torvalds --- arch/x86/include/asm/set_memory.h | 1 + arch/x86/mm/pat/set_memory.c | 13 +++++++++++++ 2 files changed, 14 insertions(+) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h index 950532ccbc4a..ec2c0a094b5d 100644 --- a/arch/x86/include/asm/set_memory.h +++ b/arch/x86/include/asm/set_memory.h @@ -34,6 +34,7 @@ * The caller is required to take care of these. */ +int __set_memory_prot(unsigned long addr, int numpages, pgprot_t prot); int _set_memory_uc(unsigned long addr, int numpages); int _set_memory_wc(unsigned long addr, int numpages); int _set_memory_wt(unsigned long addr, int numpages); diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 6d5424069e2b..59eca6a94ce7 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -1795,6 +1795,19 @@ static inline int cpa_clear_pages_array(struct page **pages, int numpages, CPA_PAGES_ARRAY, pages); } +/* + * _set_memory_prot is an internal helper for callers that have been passed + * a pgprot_t value from upper layers and a reservation has already been taken. + * If you want to set the pgprot to a specific page protocol, use the + * set_memory_xx() functions. + */ +int __set_memory_prot(unsigned long addr, int numpages, pgprot_t prot) +{ + return change_page_attr_set_clr(&addr, numpages, prot, + __pgprot(~pgprot_val(prot)), 0, 0, + NULL); +} + int _set_memory_uc(unsigned long addr, int numpages) { /* -- cgit From bfeb022f8fe4c5afdcfd7a3d868fac9765f9bcad Mon Sep 17 00:00:00 2001 From: Logan Gunthorpe Date: Fri, 10 Apr 2020 14:33:36 -0700 Subject: mm/memory_hotplug: add pgprot_t to mhp_params devm_memremap_pages() is currently used by the PCI P2PDMA code to create struct page mappings for IO memory. At present, these mappings are created with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on x86, an mtrr register will typically override this and force the cache type to be UC-. In the case firmware doesn't set this register it is effectively WB and will typically result in a machine check exception when it's accessed. Other arches are not currently likely to function correctly seeing they don't have any MTRR registers to fall back on. To solve this, provide a way to specify the pgprot value explicitly to arch_add_memory(). Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a simple change to pass the pgprot_t down to their respective functions which set up the page tables. For x86_32, set the page tables explicitly using _set_memory_prot() (seeing they are already mapped). For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this should be fine, for now, seeing these architectures don't support ZONE_DEVICE. A check in __add_pages() is also added to ensure the pgprot parameter was set for all arches. Signed-off-by: Logan Gunthorpe Signed-off-by: Andrew Morton Acked-by: David Hildenbrand Acked-by: Michal Hocko Acked-by: Dan Williams Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Dave Hansen Cc: Eric Badger Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Michael Ellerman Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com Signed-off-by: Linus Torvalds --- arch/x86/mm/init_32.c | 12 ++++++++++++ arch/x86/mm/init_64.c | 2 +- 2 files changed, 13 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index ac75a8397804..4222a010057a 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -824,6 +824,18 @@ int arch_add_memory(int nid, u64 start, u64 size, { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; + int ret; + + /* + * The page tables were already mapped at boot so if the caller + * requests a different mapping type then we must change all the + * pages with __set_memory_prot(). + */ + if (params->pgprot.pgprot != PAGE_KERNEL.pgprot) { + ret = __set_memory_prot(start, nr_pages, params->pgprot); + if (ret) + return ret; + } return __add_pages(nid, start_pfn, nr_pages, params); } diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 7480de743105..3b289c2f75cd 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -867,7 +867,7 @@ int arch_add_memory(int nid, u64 start, u64 size, unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; - init_memory_mapping(start, start + size, PAGE_KERNEL); + init_memory_mapping(start, start + size, params->pgprot); return add_pages(nid, start_pfn, nr_pages, params); } -- cgit From d7e94dbdac1a40924626b0efc7ff530c8baf5e4a Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 10 Apr 2020 13:54:00 +0200 Subject: x86/split_lock: Provide handle_guest_split_lock() Without at least minimal handling for split lock detection induced #AC, VMX will just run into the same problem as the VMWare hypervisor, which was reported by Kenneth. It will inject the #AC blindly into the guest whether the guest is prepared or not. Provide a function for guest mode which acts depending on the host SLD mode. If mode == sld_warn, treat it like user space, i.e. emit a warning, disable SLD and mark the task accordingly. Otherwise force SIGBUS. [ bp: Add a !CPU_SUP_INTEL stub for handle_guest_split_lock(). ] Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Acked-by: Paolo Bonzini Link: https://lkml.kernel.org/r/20200410115516.978037132@linutronix.de Link: https://lkml.kernel.org/r/20200402123258.895628824@linutronix.de --- arch/x86/include/asm/cpu.h | 6 ++++++ arch/x86/kernel/cpu/intel.c | 33 ++++++++++++++++++++++++++++----- 2 files changed, 34 insertions(+), 5 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h index ff6f3ca649b3..dd17c2da1af5 100644 --- a/arch/x86/include/asm/cpu.h +++ b/arch/x86/include/asm/cpu.h @@ -44,6 +44,7 @@ unsigned int x86_stepping(unsigned int sig); extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c); extern void switch_to_sld(unsigned long tifn); extern bool handle_user_split_lock(struct pt_regs *regs, long error_code); +extern bool handle_guest_split_lock(unsigned long ip); #else static inline void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) {} static inline void switch_to_sld(unsigned long tifn) {} @@ -51,5 +52,10 @@ static inline bool handle_user_split_lock(struct pt_regs *regs, long error_code) { return false; } + +static inline bool handle_guest_split_lock(unsigned long ip) +{ + return false; +} #endif #endif /* _ASM_X86_CPU_H */ diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 9a26e972cdea..bf08d4508ecb 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -21,6 +21,7 @@ #include #include #include +#include #ifdef CONFIG_X86_64 #include @@ -1066,13 +1067,10 @@ static void split_lock_init(void) split_lock_verify_msr(sld_state != sld_off); } -bool handle_user_split_lock(struct pt_regs *regs, long error_code) +static void split_lock_warn(unsigned long ip) { - if ((regs->flags & X86_EFLAGS_AC) || sld_state == sld_fatal) - return false; - pr_warn_ratelimited("#AC: %s/%d took a split_lock trap at address: 0x%lx\n", - current->comm, current->pid, regs->ip); + current->comm, current->pid, ip); /* * Disable the split lock detection for this task so it can make @@ -1081,6 +1079,31 @@ bool handle_user_split_lock(struct pt_regs *regs, long error_code) */ sld_update_msr(false); set_tsk_thread_flag(current, TIF_SLD); +} + +bool handle_guest_split_lock(unsigned long ip) +{ + if (sld_state == sld_warn) { + split_lock_warn(ip); + return true; + } + + pr_warn_once("#AC: %s/%d %s split_lock trap at address: 0x%lx\n", + current->comm, current->pid, + sld_state == sld_fatal ? "fatal" : "bogus", ip); + + current->thread.error_code = 0; + current->thread.trap_nr = X86_TRAP_AC; + force_sig_fault(SIGBUS, BUS_ADRALN, NULL); + return false; +} +EXPORT_SYMBOL_GPL(handle_guest_split_lock); + +bool handle_user_split_lock(struct pt_regs *regs, long error_code) +{ + if ((regs->flags & X86_EFLAGS_AC) || sld_state == sld_fatal) + return false; + split_lock_warn(regs->ip); return true; } -- cgit From 9de6fe3c28d6d8feadfad907961f1f31b85c6985 Mon Sep 17 00:00:00 2001 From: Xiaoyao Li Date: Fri, 10 Apr 2020 13:54:01 +0200 Subject: KVM: x86: Emulate split-lock access as a write in emulator Emulate split-lock accesses as writes if split lock detection is on to avoid #AC during emulation, which will result in a panic(). This should never occur for a well-behaved guest, but a malicious guest can manipulate the TLB to trigger emulation of a locked instruction[1]. More discussion can be found at [2][3]. [1] https://lkml.kernel.org/r/8c5b11c9-58df-38e7-a514-dc12d687b198@redhat.com [2] https://lkml.kernel.org/r/20200131200134.GD18946@linux.intel.com [3] https://lkml.kernel.org/r/20200227001117.GX9940@linux.intel.com Suggested-by: Sean Christopherson Signed-off-by: Xiaoyao Li Signed-off-by: Sean Christopherson Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Acked-by: Paolo Bonzini Link: https://lkml.kernel.org/r/20200410115517.084300242@linutronix.de --- arch/x86/kvm/x86.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 027dfd278a97..3bf2ecafd027 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5839,6 +5839,7 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt, { struct kvm_host_map map; struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); + u64 page_line_mask; gpa_t gpa; char *kaddr; bool exchanged; @@ -5853,7 +5854,16 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt, (gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE) goto emul_write; - if (((gpa + bytes - 1) & PAGE_MASK) != (gpa & PAGE_MASK)) + /* + * Emulate the atomic as a straight write to avoid #AC if SLD is + * enabled in the host and the access splits a cache line. + */ + if (boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT)) + page_line_mask = ~(cache_line_size() - 1); + else + page_line_mask = PAGE_MASK; + + if (((gpa + bytes - 1) & page_line_mask) != (gpa & page_line_mask)) goto emul_write; if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map)) -- cgit From e6f8b6c12f03818baacc5f504fe83fa5e20771d6 Mon Sep 17 00:00:00 2001 From: Xiaoyao Li Date: Fri, 10 Apr 2020 13:54:02 +0200 Subject: KVM: VMX: Extend VMXs #AC interceptor to handle split lock #AC in guest Two types of #AC can be generated in Intel CPUs: 1. legacy alignment check #AC 2. split lock #AC Reflect #AC back into the guest if the guest has legacy alignment checks enabled or if split lock detection is disabled. If the #AC is not a legacy one and split lock detection is enabled, then invoke handle_guest_split_lock() which will either warn and disable split lock detection for this task or force SIGBUS on it. [ tglx: Switch it to handle_guest_split_lock() and rename the misnamed helper function. ] Suggested-by: Sean Christopherson Signed-off-by: Xiaoyao Li Signed-off-by: Sean Christopherson Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Acked-by: Paolo Bonzini Link: https://lkml.kernel.org/r/20200410115517.176308876@linutronix.de --- arch/x86/kvm/vmx/vmx.c | 37 ++++++++++++++++++++++++++++++++++--- 1 file changed, 34 insertions(+), 3 deletions(-) (limited to 'arch/x86') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 8959514eaf0f..83050977490c 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -4588,6 +4588,26 @@ static int handle_machine_check(struct kvm_vcpu *vcpu) return 1; } +/* + * If the host has split lock detection disabled, then #AC is + * unconditionally injected into the guest, which is the pre split lock + * detection behaviour. + * + * If the host has split lock detection enabled then #AC is + * only injected into the guest when: + * - Guest CPL == 3 (user mode) + * - Guest has #AC detection enabled in CR0 + * - Guest EFLAGS has AC bit set + */ +static inline bool guest_inject_ac(struct kvm_vcpu *vcpu) +{ + if (!boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT)) + return true; + + return vmx_get_cpl(vcpu) == 3 && kvm_read_cr0_bits(vcpu, X86_CR0_AM) && + (kvm_get_rflags(vcpu) & X86_EFLAGS_AC); +} + static int handle_exception_nmi(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -4653,9 +4673,6 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu) return handle_rmode_exception(vcpu, ex_no, error_code); switch (ex_no) { - case AC_VECTOR: - kvm_queue_exception_e(vcpu, AC_VECTOR, error_code); - return 1; case DB_VECTOR: dr6 = vmcs_readl(EXIT_QUALIFICATION); if (!(vcpu->guest_debug & @@ -4684,6 +4701,20 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu) kvm_run->debug.arch.pc = vmcs_readl(GUEST_CS_BASE) + rip; kvm_run->debug.arch.exception = ex_no; break; + case AC_VECTOR: + if (guest_inject_ac(vcpu)) { + kvm_queue_exception_e(vcpu, AC_VECTOR, error_code); + return 1; + } + + /* + * Handle split lock. Depending on detection mode this will + * either warn and disable split lock detection for this + * task or force SIGBUS on it. + */ + if (handle_guest_split_lock(kvm_rip_read(vcpu))) + return 1; + fallthrough; default: kvm_run->exit_reason = KVM_EXIT_EXCEPTION; kvm_run->ex.exception = ex_no; -- cgit