summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-01-20mm: percpu: add generic pcpu_fc_alloc/free funcitonKefeng Wang
With the previous patch, we could add a generic pcpu first chunk allocate and free function to cleanup the duplicated definations on each architecture. Link: https://lkml.kernel.org/r/20211216112359.103822-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-20mm: percpu: add pcpu_fc_cpu_to_node_fn_t typedefKefeng Wang
Add pcpu_fc_cpu_to_node_fn_t and pass it into pcpu_fc_alloc_fn_t, pcpu first chunk allocation will call it to alloc memblock on the corresponding node by it, this is prepare for the next patch. Link: https://lkml.kernel.org/r/20211216112359.103822-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-20mm: percpu: generalize percpu related configKefeng Wang
Patch series "mm: percpu: Cleanup percpu first chunk function". When supporting page mapping percpu first chunk allocator on arm64, we found there are lots of duplicated codes in percpu embed/page first chunk allocator. This patchset is aimed to cleanup them and should no function change. The currently supported status about 'embed' and 'page' in Archs shows below, embed: NEED_PER_CPU_PAGE_FIRST_CHUNK page: NEED_PER_CPU_EMBED_FIRST_CHUNK embed page ------------------------ arm64 Y Y mips Y N powerpc Y Y riscv Y N sparc Y Y x86 Y Y ------------------------ There are two interfaces about percpu first chunk allocator, extern int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size, size_t atom_size, pcpu_fc_cpu_distance_fn_t cpu_distance_fn, - pcpu_fc_alloc_fn_t alloc_fn, - pcpu_fc_free_fn_t free_fn); + pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn); extern int __init pcpu_page_first_chunk(size_t reserved_size, - pcpu_fc_alloc_fn_t alloc_fn, - pcpu_fc_free_fn_t free_fn, - pcpu_fc_populate_pte_fn_t populate_pte_fn); + pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn); The pcpu_fc_alloc_fn_t/pcpu_fc_free_fn_t is killed, we provide generic pcpu_fc_alloc() and pcpu_fc_free() function, which are called in the pcpu_embed/page_first_chunk(). 1) For pcpu_embed_first_chunk(), pcpu_fc_cpu_to_node_fn_t is needed to be provided when archs supported NUMA. 2) For pcpu_page_first_chunk(), the pcpu_fc_populate_pte_fn_t is killed too, a generic pcpu_populate_pte() which marked '__weak' is provided, if you need a different function to populate pte on the arch(like x86), please provide its own implementation. [1] https://github.com/kevin78/linux.git percpu-cleanup This patch (of 4): The HAVE_SETUP_PER_CPU_AREA/NEED_PER_CPU_EMBED_FIRST_CHUNK/ NEED_PER_CPU_PAGE_FIRST_CHUNK/USE_PERCPU_NUMA_NODE_ID configs, which have duplicate definitions on platforms that subscribe it. Move them into mm, drop these redundant definitions and instead just select it on applicable platforms. Link: https://lkml.kernel.org/r/20211216112359.103822-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20211216112359.103822-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Will Deacon <will@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-19cifs: update internal module numberSteve French
To 2.35 Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-19smb3: send NTLMSSP version informationSteve French
For improved debugging it can be helpful to send version information as other clients do during NTLMSSP negotiation. See protocol document MS-NLMP section 2.2.1.1 Set the major and minor versions based on the kernel version, and the BuildNumber based on the internal cifs.ko module version number, and following the recommendation in the protocol documentation (MS-NLMP section 2.2.10) we set the NTLMRevisionCurrent field to 15. Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-19riscv: fix boolconv.cocci warningskernel test robot
arch/riscv/mm/init.c:48:11-16: WARNING: conversion to bool not needed here Remove unneeded conversion to bool Semantic patch information: Relational and logical operators evaluate to bool, explicit conversion is overly verbose and unneeded. Generated by: scripts/coccinelle/misc/boolconv.cocci Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: kernel test robot <lkp@intel.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19RISC-V: Introduce sv48 support without relocatable kernelPalmer Dabbelt
This patchset allows to have a single kernel for sv39 and sv48 without being relocatable. The idea comes from Arnd Bergmann who suggested to do the same as x86, that is mapping the kernel to the end of the address space, which allows the kernel to be linked at the same address for both sv39 and sv48 and then does not require to be relocated at runtime. This implements sv48 support at runtime. The kernel will try to boot with 4-level page table and will fallback to 3-level if the HW does not support it. Folding the 4th level into a 3-level page table has almost no cost at runtime. Note that kasan region had to be moved to the end of the address space since its location must be known at compile-time and then be valid for both sv39 and sv48 (and sv57 that is coming). * riscv-sv48-v3: riscv: Explicit comment about user virtual address space size riscv: Use pgtable_l4_enabled to output mmu_type in cpuinfo riscv: Implement sv48 support asm-generic: Prepare for riscv use of pud_alloc_one and pud_free riscv: Allow to dynamically define VA_BITS riscv: Introduce functions to switch pt_ops riscv: Split early kasan mapping to prepare sv48 introduction riscv: Move KASAN mapping next to the kernel mapping riscv: Get rid of MAXPHYSMEM configs Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19riscv: Explicit comment about user virtual address space sizeAlexandre Ghiti
Define precisely the size of the user accessible virtual space size for sv32/39/48 mmu types and explain why the whole virtual address space is split into 2 equal chunks between kernel and user space. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19riscv: Use pgtable_l4_enabled to output mmu_type in cpuinfoAlexandre Ghiti
Now that the mmu type is determined at runtime using SATP characteristic, use the global variable pgtable_l4_enabled to output mmu type of the processor through /proc/cpuinfo instead of relying on device tree infos. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19riscv: Implement sv48 supportAlexandre Ghiti
By adding a new 4th level of page table, give the possibility to 64bit kernel to address 2^48 bytes of virtual address: in practice, that offers 128TB of virtual address space to userspace and allows up to 64TB of physical memory. If the underlying hardware does not support sv48, we will automatically fallback to a standard 3-level page table by folding the new PUD level into PGDIR level. In order to detect HW capabilities at runtime, we use SATP feature that ignores writes with an unsupported mode. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19asm-generic: Prepare for riscv use of pud_alloc_one and pud_freeAlexandre Ghiti
In the following commits, riscv will almost use the generic versions of pud_alloc_one and pud_free but an additional check is required since those functions are only relevant when using at least a 4-level page table, which will be determined at runtime on riscv. So move the content of those functions into other functions that riscv can use without duplicating code. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19riscv: Allow to dynamically define VA_BITSAlexandre Ghiti
With 4-level page table folding at runtime, we don't know at compile time the size of the virtual address space so we must set VA_BITS dynamically so that sparsemem reserves the right amount of memory for struct pages. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19riscv: Introduce functions to switch pt_opsAlexandre Ghiti
This simply gathers the different pt_ops initialization in functions where a comment was added to explain why the page table operations must be changed along the boot process. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19riscv: Split early kasan mapping to prepare sv48 introductionAlexandre Ghiti
Now that kasan shadow region is next to the kernel, for sv48, this region won't be aligned on PGDIR_SIZE and then when populating this region, we'll need to get down to lower levels of the page table. So instead of reimplementing the page table walk for the early population, take advantage of the existing functions used for the final population. Note that kasan swapper initialization must also be split since memblock is not initialized at this point and as the last PGD is shared with the kernel, we'd need to allocate a PUD so postpone the kasan final population after the kernel population is done. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19riscv: Move KASAN mapping next to the kernel mappingAlexandre Ghiti
Now that KASAN_SHADOW_OFFSET is defined at compile time as a config, this value must remain constant whatever the size of the virtual address space, which is only possible by pushing this region at the end of the address space next to the kernel mapping. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19riscv: Get rid of MAXPHYSMEM configsAlexandre Ghiti
CONFIG_MAXPHYSMEM_* are actually never used, even the nommu defconfigs selecting the MAXPHYSMEM_2GB had no effects on PAGE_OFFSET since it was preempted by !MMU case right before. In addition, the move of the kernel mapping at the end of the address space broke the use of MAXPHYSMEM_2G with MMU since it defines PAGE_OFFSET at the same address as the kernel mapping. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 2bfc6cd81bd1 ("riscv: Move kernel mapping outside of linear mapping") Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Conor Dooley <Conor.Dooley@microchip.com> Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19xfs: flush inodegc workqueue tasks before cancelBrian Foster
The xfs_inodegc_stop() helper performs a high level flush of pending work on the percpu queues and then runs a cancel_work_sync() on each of the percpu work tasks to ensure all work has completed before returning. While cancel_work_sync() waits for wq tasks to complete, it does not guarantee work tasks have started. This means that the _stop() helper can queue and instantly cancel a wq task without having completed the associated work. This can be observed by tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>" test: xfs_destroy_inode: ... ino 0x83 ... xfs_inode_set_need_inactive: ... ino 0x83 ... xfs_inodegc_stop: ... ... xfs_inodegc_start: ... xfs_inodegc_worker: ... xfs_inode_inactivating: ... ino 0x83 ... The first few lines show that the inode is removed and need inactive state set, but the inactivation work has not completed before the inodegc mechanism stops. The inactivation doesn't actually occur until the fs is unfrozen and the gc mechanism starts back up. Note that this test requires fsfreeze to reproduce because xfs_freeze indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush(). When this occurs, the workqueue try_to_grab_pending() logic first tries to steal the pending bit, which does not succeed because the bit has been set by queue_work_on(). Subsequently, it checks for association of a pool workqueue from the work item under the pool lock. This association is set at the point a work item is queued and cleared when dequeued for processing. If the association exists, the work item is removed from the queue and cancel_work_sync() returns true. If the pwq association is cleared, the remove attempt assumes the task is busy and retries (eventually returning false to the caller after waiting for the work task to complete). To avoid this race, we can flush each work item explicitly before cancel. However, since the _queue_all() already schedules each underlying work item, the workqueue level helpers are sufficient to achieve the same ordering effect. E.g., the inodegc enabled flag prevents scheduling any further work in the _stop() case. Use the drain_workqueue() helper in this particular case to make the intent a bit more self explanatory. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-19Merge branch 'bpf: allow cgroup progs to export custom retval to userspace'Alexei Starovoitov
YiFei Zhu says: ==================== Right now, most cgroup hooks are best used for permission checks. They can only reject a syscall with -EPERM, so a cause of a rejection, if the rejected by eBPF cgroup hooks, is ambiguous to userspace. Additionally, if the syscalls are implemented in eBPF, all permission checks and the implementation has to happen within the same filter, as programs executed later in the series of progs are unaware of the return values return by the previous progs. This patch series adds two helpers, bpf_get_retval and bpf_set_retval, that allows hooks to get/set the return value of syscall to userspace. This also allows later progs to retrieve retval set by previous progs. For legacy programs that rejects a syscall without setting the retval, for backwards compatibility, if a prog rejects without itself or a prior prog setting retval to an -err, the retval is set by the kernel to -EPERM. For getsockopt hooks that has ctx->retval, this variable mirrors that that accessed by the helpers. Additionally, the following user-visible behavior for getsockopt hooks has changed: - If a prior filter rejected the syscall, it will be visible in ctx->retval. - Attempting to change the retval arbitrarily is now allowed and will not cause an -EFAULT. - If kernel rejects a getsockopt syscall before running the hooks, the error will be visible in ctx->retval. Returning 0 from the prog will not overwrite the error to -EPERM unless there is an explicit call of bpf_set_retval(-EPERM) Tests have been added in this series to test the behavior of the helper with cgroup setsockopt getsockopt hooks. Patch 1 changes the API of macros to prepare for the next patch and should be a no-op. Patch 2 moves ctx->retval to a struct pointed to by current task_struct. Patch 3 implements the helpers. Patch 4 tests the behaviors of the helpers. Patch 5 updates a test after the test broke due to the visible changes. v1 -> v2: - errno -> retval - split one helper to get & set helpers - allow retval to be set arbitrarily in the general case - made the helper retval and context retval mirror each other ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-19selftests/bpf: Update sockopt_sk test to the use bpf_set_retvalYiFei Zhu
The tests would break without this patch, because at one point it calls getsockopt(fd, SOL_TCP, TCP_ZEROCOPY_RECEIVE, &buf, &optlen) This getsockopt receives the kernel-set -EINVAL. Prior to this patch series, the eBPF getsockopt hook's -EPERM would override kernel's -EINVAL, however, after this patch series, return 0's automatic -EPERM will not; the eBPF prog has to explicitly bpf_set_retval(-EPERM) if that is wanted. I also removed the explicit mentions of EPERM in the comments in the prog. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/4f20b77cb46812dbc2bdcd7e3fa87c7573bde55e.1639619851.git.zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-19selftests/bpf: Test bpf_{get,set}_retval behavior with cgroup/sockoptYiFei Zhu
The tests checks how different ways of interacting with the helpers (getting retval, setting EUNATCH, EISCONN, and legacy reject returning 0 without setting retval), produce different results in both the setsockopt syscall and the retval returned by the helper. A few more tests verify the interaction between the retval of the helper and the retval in getsockopt context. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/43ec60d679ae3f4f6fd2460559c28b63cb93cd12.1639619851.git.zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-19bpf: Add cgroup helpers bpf_{get,set}_retval to get/set syscall return valueYiFei Zhu
The helpers continue to use int for retval because all the hooks are int-returning rather than long-returning. The return value of bpf_set_retval is int for future-proofing, in case in the future there may be errors trying to set the retval. After the previous patch, if a program rejects a syscall by returning 0, an -EPERM will be generated no matter if the retval is already set to -err. This patch change it being forced only if retval is not -err. This is because we want to support, for example, invoking bpf_set_retval(-EINVAL) and return 0, and have the syscall return value be -EINVAL not -EPERM. For BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY, the prior behavior is that, if the return value is NET_XMIT_DROP, the packet is silently dropped. We preserve this behavior for backward compatibility reasons, so even if an errno is set, the errno does not return to caller. However, setting a non-err to retval cannot propagate so this is not allowed and we return a -EFAULT in that case. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/b4013fd5d16bed0b01977c1fafdeae12e1de61fb.1639619851.git.zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-19bpf: Move getsockopt retval to struct bpf_cg_run_ctxYiFei Zhu
The retval value is moved to struct bpf_cg_run_ctx for ease of access in different prog types with different context structs layouts. The helper implementation (to be added in a later patch in the series) can simply perform a container_of from current->bpf_ctx to retrieve bpf_cg_run_ctx. Unfortunately, there is no easy way to access the current task_struct via the verifier BPF bytecode rewrite, aside from possibly calling a helper, so a pointer to current task is added to struct bpf_sockopt_kern so that the rewritten BPF bytecode can access struct bpf_cg_run_ctx with an indirection. For backward compatibility, if a getsockopt program rejects a syscall by returning 0, an -EPERM will be generated, by having the BPF_PROG_RUN_ARRAY_CG family macros automatically set the retval to -EPERM. Unlike prior to this patch, this -EPERM will be visible to ctx->retval for any other hooks down the line in the prog array. Additionally, the restriction that getsockopt filters can only set the retval to 0 is removed, considering that certain getsockopt implementations may return optlen. Filters are now able to set the value arbitrarily. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/73b0325f5c29912ccea7ea57ec1ed4d388fc1d37.1639619851.git.zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-19bpf: Make BPF_PROG_RUN_ARRAY return -err instead of allow booleanYiFei Zhu
Right now BPF_PROG_RUN_ARRAY and related macros return 1 or 0 for whether the prog array allows or rejects whatever is being hooked. The caller of these macros then return -EPERM or continue processing based on thw macro's return value. Unforunately this is inflexible, since -EPERM is the only err that can be returned. This patch should be a no-op; it prepares for the next patch. The returning of the -EPERM is moved to inside the macros, so the outer functions are directly returning what the macros returned if they are non-zero. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/788abcdca55886d1f43274c918eaa9f792a9f33b.1639619851.git.zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-19io-wq: delete dead lock shuffling codeJens Axboe
We used to have more code around the work loop, but now the goto and lock juggling just makes it less readable than it should. Get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-19clk: mediatek: relicense mt7986 clock driver to GPL-2.0Sam Shih
The previous mt7986 clock drivers were incorrectly marked as GPL-1.0. This patch changes the driver to the standard GPL-2.0 license. Signed-off-by: Sam Shih <sam.shih@mediatek.com> Link: https://lore.kernel.org/r/20220119123658.10095-2-sam.shih@mediatek.com Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2022-01-19libbpf: Improve btf__add_btf() with an additional hashmap for strings.Kui-Feng Lee
Add a hashmap to map the string offsets from a source btf to the string offsets from a target btf to reduce overheads. btf__add_btf() calls btf__add_str() to add strings from a source to a target btf. It causes many string comparisons, and it is a major hotspot when adding a big btf. btf__add_str() uses strcmp() to check if a hash entry is the right one. The extra hashmap here compares offsets of strings, that are much cheaper. It remembers the results of btf__add_str() for later uses to reduce the cost. We are parallelizing BTF encoding for pahole by creating separated btf instances for worker threads. These per-thread btf instances will be added to the btf instance of the main thread by calling btf__add_str() to deduplicate and write out. With this patch and -j4, the running time of pahole drops to about 6.0s from 6.6s. The following lines are the summary of 'perf stat' w/o the change. 6.668126396 seconds time elapsed 13.451054000 seconds user 0.715520000 seconds sys The following lines are the summary w/ the change. 5.986973919 seconds time elapsed 12.939903000 seconds user 0.724152000 seconds sys V4 fixes a bug of error checking against the pointer returned by hashmap__new(). [v3] https://lore.kernel.org/bpf/20220118232053.2113139-1-kuifeng@fb.com/ [v2] https://lore.kernel.org/bpf/20220114193713.461349-1-kuifeng@fb.com/ Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220119180214.255634-1-kuifeng@fb.com
2022-01-19riscv: bpf: Fix eBPF's exception tablesJisheng Zhang
eBPF's exception tables needs to be modified to relative synchronously. Suggested-by: Tong Tiangen <tongtiangen@huawei.com> Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Fixes: 1f77ed9422cb ("riscv: switch to relative extable and other improvements") Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19kvm: selftests: Do not indent with spacesPaolo Bonzini
Some indentation with spaces crept in, likely due to terminal-based cut and paste. Clean it up. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19kvm: selftests: sync uapi/linux/kvm.h with Linux headerPaolo Bonzini
KVM_CAP_XSAVE2 is out of sync due to a conflict. Copy the whole file while at it. Reported-by: Yang Zhong <yang.zhong@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19bpf/scripts: Raise an exception if the correct number of sycalls are not ↵Usama Arif
generated Currently the syscalls rst and subsequently man page are auto-generated using function documentation present in bpf.h. If the documentation for the syscall is missing or doesn't follow a specific format, then that syscall is not dumped in the auto-generated rst. This patch checks the number of syscalls documented within the header file with those present as part of the enum bpf_cmd and raises an Exception if they don't match. It is not needed with the currently documented upstream syscalls, but can help in debugging when developing new syscalls when there might be missing or misformatted documentation. The function helper_number_check is moved to the Printer parent class and renamed to elem_number_check as all the most derived children classes are using this function now. Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20220119114442.1452088-3-usama.arif@bytedance.com
2022-01-19bpf/scripts: Make description and returns section for helpers/syscalls mandatoryUsama Arif
This enforce a minimal formatting consistency for the documentation. The description and returns missing for a few helpers have also been added. Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20220119114442.1452088-2-usama.arif@bytedance.com
2022-01-19uapi/bpf: Add missing description and returns for helper documentationUsama Arif
Both description and returns section will become mandatory for helpers and syscalls in a later commit to generate man pages. This commit also adds in the documentation that BPF_PROG_RUN is an alias for BPF_PROG_TEST_RUN for anyone searching for the syscall in the generated man pages. Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220119114442.1452088-1-usama.arif@bytedance.com
2022-01-19bpftool: Adding support for BTF program namesRaman Shukhau
`bpftool prog list` and other bpftool subcommands that show BPF program names currently get them from bpf_prog_info.name. That field is limited to 16 (BPF_OBJ_NAME_LEN) chars which leads to truncated names since many progs have much longer names. The idea of this change is to improve all bpftool commands that output prog name so that bpftool uses info from BTF to print program names if available. It tries bpf_prog_info.name first and fall back to btf only if the name is suspected to be truncated (has 15 chars length). Right now `bpftool p show id <id>` returns capped prog name <id>: kprobe name example_cap_cap tag 712e... ... With this change it would return <id>: kprobe name example_cap_capable tag 712e... ... Note, other commands that print prog names (e.g. "bpftool cgroup tree") are also addressed in this change. Signed-off-by: Raman Shukhau <ramasha@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220119100255.1068997-1-ramasha@fb.com
2022-01-19riscv: mm: init: try best to remove #ifdef CONFIG_XIP_KERNEL usageJisheng Zhang
Currently, the #ifdef CONFIG_XIP_KERNEL usage can be divided into the following three types: The first one is for functions/declarations only used in XIP case. The second one is for XIP_FIXUP case. Something as below: |foo_type foo; |#ifdef CONFIG_XIP_KERNEL |#define foo (*(foo_type *)XIP_FIXUP(&foo)) |#endif Usually, it's better to let the foo macro sit with the foo var together. But if various foos are defined adjacently, we can save some #ifdef CONFIG_XIP_KERNEL usage by grouping them together. The third one is for different implementations for XIP, usually, this is a #ifdef...#else...#endif case. This patch moves the pt_ops macro to adjacent #ifdef CONFIG_XIP_KERNEL and group first type usage cases into one. Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Reviewed-by: Alexandre Ghiti <alex@ghiti.fr> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19riscv: mm: init: try IS_ENABLED(CONFIG_XIP_KERNEL) instead of #ifdefJisheng Zhang
Try our best to replace the conditional compilation using "#ifdef CONFIG_XIP_KERNEL" with "IS_ENABLED(CONFIG_XIP_KERNEL)", to simplify the code and to increase compile coverage. Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Reviewed-by: Alexandre Ghiti <alex@ghiti.fr> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19riscv: mm: init: remove _pt_ops and use pt_ops directlyJisheng Zhang
Except "pt_ops", other global vars when CONFIG_XIP_KERNEL=y is defined as below: |foo_type foo; |#ifdef CONFIG_XIP_KERNEL |#define foo (*(foo_type *)XIP_FIXUP(&foo)) |#endif Follow the same way for pt_ops to unify the style and to simplify code. Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Reviewed-by: Alexandre Ghiti <alex@ghiti.fr> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19riscv: mm: init: try best to use IS_ENABLED(CONFIG_64BIT) instead of #ifdefJisheng Zhang
Try our best to replace the conditional compilation using "#ifdef CONFIG_64BIT" by a check for "IS_ENABLED(CONFIG_64BIT)", to simplify the code and to increase compile coverage. Now we can also remove the __maybe_unused used in max_mapped_addr declaration. We also remove the BUG_ON check of mapping the last 4K bytes of the addressable memory since this is always true for every kernel actually. Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Reviewed-by: Alexandre Ghiti <alex@ghiti.fr> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19riscv: mm: init: remove unnecessary "#ifdef CONFIG_CRASH_DUMP"Jisheng Zhang
The is_kdump_kernel() returns false for !CRASH_DUMP case, so we don't need the #ifdef CONFIG_CRASH_DUMP for is_kdump_kernel() checking. Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Reviewed-by: Alexandre Ghiti <alex@ghiti.fr> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-19tools headers UAPI: Sync x86 arch prctl headers with the kernel sourcesArnaldo Carvalho de Melo
To pick the changes in this cset: 980fe2fddcff2193 ("x86/fpu: Extend fpu_xstate_prctl() with guest permissions") This picks these new prctls: $ tools/perf/trace/beauty/x86_arch_prctl.sh > /tmp/before $ cp arch/x86/include/uapi/asm/prctl.h tools/arch/x86/include/uapi/asm/prctl.h $ tools/perf/trace/beauty/x86_arch_prctl.sh > /tmp/after $ diff -u /tmp/before /tmp/after --- /tmp/before 2022-01-19 14:40:05.049394977 -0300 +++ /tmp/after 2022-01-19 14:40:35.628154565 -0300 @@ -9,6 +9,8 @@ [0x1021 - 0x1001]= "GET_XCOMP_SUPP", [0x1022 - 0x1001]= "GET_XCOMP_PERM", [0x1023 - 0x1001]= "REQ_XCOMP_PERM", + [0x1024 - 0x1001]= "GET_XCOMP_GUEST_PERM", + [0x1025 - 0x1001]= "REQ_XCOMP_GUEST_PERM", }; #define x86_arch_prctl_codes_2_offset 0x2001 $ With this 'perf trace' can translate those numbers into strings and use the strings in filter expressions: # perf trace -e prctl 0.000 ( 0.011 ms): DOM Worker/3722622 prctl(option: SET_NAME, arg2: 0x7f9c014b7df5) = 0 0.032 ( 0.002 ms): DOM Worker/3722622 prctl(option: SET_NAME, arg2: 0x7f9bb6b51580) = 0 5.452 ( 0.003 ms): StreamT~ns #30/3722623 prctl(option: SET_NAME, arg2: 0x7f9bdbdfeb70) = 0 5.468 ( 0.002 ms): StreamT~ns #30/3722623 prctl(option: SET_NAME, arg2: 0x7f9bdbdfea70) = 0 24.494 ( 0.009 ms): IndexedDB #556/3722624 prctl(option: SET_NAME, arg2: 0x7f562a32ae28) = 0 24.540 ( 0.002 ms): IndexedDB #556/3722624 prctl(option: SET_NAME, arg2: 0x7f563c6d4b30) = 0 670.281 ( 0.008 ms): systemd-userwo/3722339 prctl(option: SET_NAME, arg2: 0x564be30805c8) = 0 670.293 ( 0.002 ms): systemd-userwo/3722339 prctl(option: SET_NAME, arg2: 0x564be30800f0) = 0 ^C# This addresses these perf build warnings: Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/prctl.h' differs from latest version at 'arch/x86/include/uapi/asm/prctl.h' diff -u tools/arch/x86/include/uapi/asm/prctl.h arch/x86/include/uapi/asm/prctl.h Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-01-19btrfs: defrag: properly update range->start for autodefragQu Wenruo
[BUG] After commit 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") autodefrag no longer properly re-defrag the file from previously finished location. [CAUSE] The recent refactoring of defrag only focuses on defrag ioctl subpage support, doesn't take autodefrag into consideration. There are two problems involved which prevents autodefrag to restart its scan: - No range.start update Previously when one defrag target is found, range->start will be updated to indicate where next search should start from. But now btrfs_defrag_file() doesn't update it anymore, making all autodefrag to rescan from file offset 0. This would also make autodefrag to mark the same range dirty again and again, causing extra IO. - No proper quick exit for defrag_one_cluster() Currently if we reached or exceed @max_sectors limit, we just exit defrag_one_cluster(), and let next defrag_one_cluster() call to do a quick exit. This makes @cur increase, thus no way to properly know which range is defragged and which range is skipped. [FIX] The fix involves two modifications: - Update range->start to next cluster start This is a little different from the old behavior. Previously range->start is updated to the next defrag target. But in the end, the behavior should still be pretty much the same, as now we skip to next defrag target inside btrfs_defrag_file(). Thus if auto-defrag determines to re-scan, then we still do the skip, just at a different timing. - Make defrag_one_cluster() to return >0 to indicate a quick exit So that btrfs_defrag_file() can also do a quick exit, without increasing @cur to the range end, and re-use @cur to update @range->start. - Add comment for btrfs_defrag_file() to mention the range->start update Currently only autodefrag utilize this behavior, as defrag ioctl won't set @max_to_defrag parameter, thus unless interrupted it will always try to defrag the whole range. Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/ CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-19btrfs: defrag: fix wrong number of defragged sectorsQu Wenruo
[BUG] There are users using autodefrag mount option reporting obvious increase in IO: > If I compare the write average (in total, I don't have it per process) > when taking idle periods on the same machine: > Linux 5.16: > without autodefrag: ~ 10KiB/s > with autodefrag: between 1 and 2MiB/s. > > Linux 5.15: > with autodefrag:~ 10KiB/s (around the same as without > autodefrag on 5.16) [CAUSE] When autodefrag mount option is enabled, btrfs_defrag_file() will be called with @max_sectors = BTRFS_DEFRAG_BATCH (1024) to limit how many sectors we can defrag in one try. And then use the number of sectors defragged to determine if we need to re-defrag. But commit b18c3ab2343d ("btrfs: defrag: introduce helper to defrag one cluster") uses wrong unit to increase @sectors_defragged, which should be in unit of sector, not byte. This means, if we have defragged any sector, then @sectors_defragged will be >= sectorsize (normally 4096), which is larger than BTRFS_DEFRAG_BATCH. This makes the @max_sectors check in defrag_one_cluster() to underflow, rendering the whole @max_sectors check useless. Thus causing way more IO for autodefrag mount options, as now there is no limit on how many sectors can really be defragged. [FIX] Fix the problems by: - Use sector as unit when increasing @sectors_defragged - Include @sectors_defragged > @max_sectors case to break the loop - Add extra comment on the return value of btrfs_defrag_file() Reported-by: Anthony Ruhier <aruhier@mailbox.org> Fixes: b18c3ab2343d ("btrfs: defrag: introduce helper to defrag one cluster") Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/ CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-19cifs: Support fscache indexing rewriteDavid Howells
Change the cifs filesystem to take account of the changes to fscache's indexing rewrite and reenable caching in cifs. The following changes have been made: (1) The fscache_netfs struct is no more, and there's no need to register the filesystem as a whole. (2) The session cookie is now an fscache_volume cookie, allocated with fscache_acquire_volume(). That takes three parameters: a string representing the "volume" in the index, a string naming the cache to use (or NULL) and a u64 that conveys coherency metadata for the volume. For cifs, I've made it render the volume name string as: "cifs,<ipaddress>,<sharename>" where the sharename has '/' characters replaced with ';'. This probably needs rethinking a bit as the total name could exceed the maximum filename component length. Further, the coherency data is currently just set to 0. It needs something else doing with it - I wonder if it would suffice simply to sum the resource_id, vol_create_time and vol_serial_number or maybe hash them. (3) The fscache_cookie_def is no more and needed information is passed directly to fscache_acquire_cookie(). The cache no longer calls back into the filesystem, but rather metadata changes are indicated at other times. fscache_acquire_cookie() is passed the same keying and coherency information as before. (4) The functions to set/reset cookies are removed and fscache_use_cookie() and fscache_unuse_cookie() are used instead. fscache_use_cookie() is passed a flag to indicate if the cookie is opened for writing. fscache_unuse_cookie() is passed updates for the metadata if we changed it (ie. if the file was opened for writing). These are called when the file is opened or closed. (5) cifs_setattr_*() are made to call fscache_resize() to change the size of the cache object. (6) The functions to read and write data are stubbed out pending a conversion to use netfslib. Changes ======= ver #8: - Abstract cache invalidation into a helper function. - Fix some checkpatch warnings[3]. ver #7: - Removed the accidentally added-back call to get the super cookie in cifs_root_iget(). - Fixed the right call to cifs_fscache_get_super_cookie() to take account of the "-o fsc" mount flag. ver #6: - Moved the change of gfpflags_allow_blocking() to current_is_kswapd() for cifs here. - Fixed one of the error paths in cifs_atomic_open() to jump around the call to use the cookie. - Fixed an additional successful return in the middle of cifs_open() to use the cookie on the way out. - Only get a volume cookie (and thus inode cookies) when "-o fsc" is supplied to mount. ver #5: - Fixed a couple of bits of cookie handling[2]: - The cookie should be released in cifs_evict_inode(), not cifsFileInfo_put_final(). The cookie needs to persist beyond file closure so that writepages will be able to write to it. - fscache_use_cookie() needs to be called in cifs_atomic_open() as it is for cifs_open(). ver #4: - Fixed the use of sizeof with memset. - tcon->vol_create_time is __le64 so doesn't need cpu_to_le64(). ver #3: - Canonicalise the cifs coherency data to make the cache portable. - Set volume coherency data. ver #2: - Use gfpflags_allow_blocking() rather than using flag directly. - Upgraded to -rc4 to allow for upstream changes[1]. - fscache_acquire_volume() now returns errors. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@kernel.org> cc: Steve French <smfrench@gmail.com> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: linux-cifs@vger.kernel.org cc: linux-cachefs@redhat.com Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23b55d673d7527b093cd97b7c217c82e70cd1af0 [1] Link: https://lore.kernel.org/r/3419813.1641592362@warthog.procyon.org.uk/ [2] Link: https://lore.kernel.org/r/CAH2r5muTanw9pJqzAHd01d9A8keeChkzGsCEH6=0rHutVLAF-A@mail.gmail.com/ [3] Link: https://lore.kernel.org/r/163819671009.215744.11230627184193298714.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906982979.143852.10672081929614953210.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967187187.1823006.247415138444991444.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021579335.640689.2681324337038770579.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/3462849.1641593783@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/1318953.1642024578@warthog.procyon.org.uk/ # v6 Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-19btrfs: allow defrag to be interruptibleFilipe Manana
During defrag, at btrfs_defrag_file(), we have this loop that iterates over a file range in steps no larger than 256K subranges. If the range is too long, there's no way to interrupt it. So make the loop check in each iteration if there's signal pending, and if there is, break and return -AGAIN to userspace. Before kernel 5.16, we used to allow defrag to be cancelled through a signal, but that was lost with commit 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()"). This change adds back the possibility to cancel a defrag with a signal and keeps the same semantics, returning -EAGAIN to user space (and not the usually more expected -EINTR). This is also motivated by a recent bug on 5.16 where defragging a 1 byte file resulted in iterating from file range 0 to (u64)-1, as hitting the bug triggered a too long loop, basically requiring one to reboot the machine, as it was not possible to cancel defrag. Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") CC: stable@vger.kernel.org # 5.16 Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-19btrfs: fix too long loop when defragging a 1 byte fileFilipe Manana
When attempting to defrag a file with a single byte, we can end up in a too long loop, which is nearly infinite because at btrfs_defrag_file() we end up with the variable last_byte assigned with a value of 18446744073709551615 (which is (u64)-1). The problem comes from the fact we end up doing: last_byte = round_up(last_byte, fs_info->sectorsize) - 1; So if last_byte was assigned 0, which is i_size - 1, we underflow and end up with the value 18446744073709551615. This is trivial to reproduce and the following script triggers it: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT echo -n "X" > $MNT/foobar btrfs filesystem defragment $MNT/foobar umount $MNT So fix this by not decrementing last_byte by 1 before doing the sector size round up. Also, to make it easier to follow, make the round up right after computing last_byte. Reported-by: Anthony Ruhier <aruhier@mailbox.org> Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/ CC: stable@vger.kernel.org # 5.16 Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-19selftests: kvm: add amx_test to .gitignoreMuhammad Usama Anjum
amx_test's binary should be present in the .gitignore file for the git to ignore it. Fixes: bf70636d9443 ("selftest: kvm: Add amx selftest") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Message-Id: <20220118122053.1941915-1-usama.anjum@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabledSean Christopherson
Nullify svm_x86_ops.vcpu_(un)blocking if AVIC/APICv is disabled as the hooks are necessary only to clear the vCPU's IsRunning entry in the Physical APIC and to update IRTE entries if the VM has a pass-through device attached. Opportunistically rename the helpers to clarify their AVIC relationship. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-24-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_opsSean Christopherson
Move svm_hardware_setup() below svm_x86_ops so that KVM can modify ops during setup, e.g. the vcpu_(un)blocking hooks can be nullified if AVIC is disabled or unsupported. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: SVM: Drop AVIC's intermediate avic_set_running() helperSean Christopherson
Drop avic_set_running() in favor of calling avic_vcpu_{load,put}() directly, and modify the block+put path to use preempt_disable/enable() instead of get/put_cpu(), as it doesn't actually care about the current pCPU associated with the vCPU. Opportunistically add lockdep assertions as being preempted in avic_vcpu_put() would lead to consuming stale data, even though doing so _in the current code base_ would not be fatal. Add a much needed comment explaining why svm_vcpu_blocking() needs to unload the AVIC and update the IRTE _before_ the vCPU starts blocking. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: VMX: Don't do full kick when handling posted interrupt wakeupSean Christopherson
When waking vCPUs in the posted interrupt wakeup handling, do exactly that and no more. There is no need to kick the vCPU as the wakeup handler just needs to get the vCPU task running, and if it's in the guest then it's definitely running. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: VMX: Fold fallback path into triggering posted IRQ helperSean Christopherson
Move the fallback "wake_up" path into the helper to trigger posted interrupt helper now that the nested and non-nested paths are identical. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>