Age | Commit message (Collapse) | Author |
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-10-08
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) sk_lookup_[tcp|udp] and sk_release helpers from Joe Stringer which allow
BPF programs to perform lookups for sockets in a network namespace. This would
allow programs to determine early on in processing whether the stack is
expecting to receive the packet, and perform some action (eg drop,
forward somewhere) based on this information.
2) per-cpu cgroup local storage from Roman Gushchin.
Per-cpu cgroup local storage is very similar to simple cgroup storage
except all the data is per-cpu. The main goal of per-cpu variant is to
implement super fast counters (e.g. packet counters), which don't require
neither lookups, neither atomic operations in a fast path.
The example of these hybrid counters is in selftests/bpf/netcnt_prog.c
3) allow HW offload of programs with BPF-to-BPF function calls from Quentin Monnet
4) support more than 64-byte key/value in HW offloaded BPF maps from Jakub Kicinski
5) rename of libbpf interfaces from Andrey Ignatov.
libbpf is maturing as a library and should follow good practices in
library design and implementation to play well with other libraries.
This patch set brings consistent naming convention to global symbols.
6) relicense libbpf as LGPL-2.1 OR BSD-2-Clause from Alexei Starovoitov
to let Apache2 projects use libbpf
7) various AF_XDP fixes from Björn and Magnus
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Remove the leftover pglist_data::numabalancing_migrate_lock and its
initialization, we stopped using this lock with:
efaffc5e40ae ("mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration")
[ mingo: Rewrote the changelog. ]
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1538824999-31230-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
While the DOC at the beginning of lib/bitmap.c explicitly states that
"The number of valid bits in a given bitmap does _not_ need to be an
exact multiple of BITS_PER_LONG.", some of the bitmap operations do
indeed access BITS_PER_LONG portions of the provided bitmap no matter
the size of the provided bitmap. For example, if bitmap_intersects()
is provided with an 8 bit bitmap the operation will access
BITS_PER_LONG bits from the provided bitmap. While the operation
ensures that these extra bits do not affect the result, the memory
is still accessed.
The capacity bitmasks (CBMs) are typically stored in u32 since they
can never exceed 32 bits. A few instances exist where a bitmap_*
operation is performed on a CBM by simply pointing the bitmap operation
to the stored u32 value.
The consequence of this pattern is that some bitmap_* operations will
access out-of-bounds memory when interacting with the provided CBM. This
is confirmed with a KASAN test that reports:
BUG: KASAN: stack-out-of-bounds in __bitmap_intersects+0xa2/0x100
and
BUG: KASAN: stack-out-of-bounds in __bitmap_weight+0x58/0x90
Fix this by moving any CBM provided to a bitmap operation needing
BITS_PER_LONG to an 'unsigned long' variable.
[ tglx: Changed related function arguments to unsigned long and got rid
of the _cbm extra step ]
Fixes: 72d505056604 ("x86/intel_rdt: Add utilities to test pseudo-locked region possibility")
Fixes: 49f7b4efa110 ("x86/intel_rdt: Enable setting of exclusive mode")
Fixes: d9b48c86eb38 ("x86/intel_rdt: Display resource groups' allocations' size in bytes")
Fixes: 95f0b77efa57 ("x86/intel_rdt: Initialize new resource group with sane defaults")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/69a428613a53f10e80594679ac726246020ff94f.1538686926.git.reinette.chatre@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Fix a grammar mistake in <linux/interrupt.h>.
[ mingo: While at it also fix another similar error in another comment as well. ]
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jiri Kosina <trivial@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181008111726.26286-1-geert%2Brenesas@glider.be
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
What we are doing here isn't quite obvious, so add a comment explaining
it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
- Fix building the python bindings with python3, which fixes some
problems with building with clang on Clear Linux (Eduardo Habkost)
- Fix coverity warnings, fixing up some error paths and plugging
some temporary small buffer leaks (Sanskriti Sharma)
- Adopt a wrapper for strerror_r() for the same reasons as recently
for libbpf (Steven Rostedt)
- S390 does not support watchpoints in perf test 22', check if
that test is supported by the arch. (Thomas Richter)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This adds a KVM_PPC_NO_HASH flag to the flags field of the
kvm_ppc_smmu_info struct, and arranges for it to be set when
running as a nested hypervisor, as an unambiguous indication
to userspace that HPT guests are not supported. Reporting the
KVM_CAP_PPC_MMU_HASH_V3 capability as false could be taken as
indicating only that the new HPT features in ISA V3.0 are not
supported, leaving it ambiguous whether pre-V3.0 HPT features
are supported.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
With this, userspace can enable a KVM-HV guest to run nested guests
under it.
The administrator can control whether any nested guests can be run;
setting the "nested" module parameter to false prevents any guests
becoming nested hypervisors (that is, any attempt to enable the nested
capability on a guest will fail). Guests which are already nested
hypervisors will continue to be so.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
This merges in the "ppc-kvm" topic branch of the powerpc tree to get a
series of commits that touch both general arch/powerpc code and KVM
code. These commits will be merged both via the KVM tree and the
powerpc tree.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
This adds a list of valid shadow PTEs for each nested guest to
the 'radix' file for the guest in debugfs. This can be useful for
debugging.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
With this, the KVM-HV module can be loaded in a guest running under
KVM-HV, and if the hypervisor supports nested virtualization, this
guest can now act as a nested hypervisor and run nested guests.
This also adds some checks to inform userspace that HPT guests are not
supported by nested hypervisors (by returning false for the
KVM_CAP_PPC_MMU_HASH_V3 capability), and to prevent userspace from
configuring a guest to use HPT mode.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The hcall H_ENTER_NESTED takes two parameters: the address in L1 guest
memory of a hv_regs struct and the address of a pt_regs struct. The
hcall requests the L0 hypervisor to use the register values in these
structs to run a L2 guest and to return the exit state of the L2 guest
in these structs. These are in the endianness of the L1 guest, rather
than being always big-endian as is usually the case for PAPR
hypercalls.
This is convenient because it means that the L1 guest can pass the
address of the regs field in its kvm_vcpu_arch struct. This also
improves performance slightly by avoiding the need for two copies of
the pt_regs struct.
When reading/writing these structures, this patch handles the case
where the endianness of the L1 guest differs from that of the L0
hypervisor, by byteswapping the structures after reading and before
writing them back.
Since all the fields of the pt_regs are of the same type, i.e.,
unsigned long, we treat it as an array of unsigned longs. The fields
of struct hv_guest_state are not all the same, so its fields are
byteswapped individually.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
restore_hv_regs() is used to copy the hv_regs L1 wants to set to run the
nested (L2) guest into the vcpu structure. We need to sanitise these
values to ensure we don't let the L1 guest hypervisor do things we don't
want it to.
We don't let data address watchpoints or completed instruction address
breakpoints be set to match in hypervisor state.
We also don't let L1 enable features in the hypervisor facility status
and control register (HFSCR) for L2 which we have disabled for L1. That
is L2 will get the subset of features which the L0 hypervisor has
enabled for L1 and the features L1 wants to enable for L2. This could
mean we give L1 a hypervisor facility unavailable interrupt for a
facility it thinks it has enabled, however it shouldn't have enabled a
facility it itself doesn't have for the L2 guest.
We sanitise the registers when copying in the L2 hv_regs. We don't need
to sanitise when copying back the L1 hv_regs since these shouldn't be
able to contain invalid values as they're just what was copied out.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This adds a one-reg register identifier which can be used to read and
set the virtual PTCR for the guest. This register identifies the
address and size of the virtual partition table for the guest, which
contains information about the nested guests under this guest.
Migrating this value is the only extra requirement for migrating a
guest which has nested guests (assuming of course that the destination
host supports nested virtualization in the kvm-hv module).
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
When running as a nested hypervisor, this avoids reading hypervisor
privileged registers (specifically HFSCR, LPIDR and LPCR) at startup;
instead reasonable default values are used. This also avoids writing
LPIDR in the single-vcpu entry/exit path.
Also, this removes the check for CPU_FTR_HVMODE in kvmppc_mmu_hv_init()
since its only caller already checks this.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This is only done at level 0, since only level 0 knows which physical
CPU a vcpu is running on. This does for nested guests what L0 already
did for its own guests, which is to flush the TLB on a pCPU when it
goes to run a vCPU there, and there is another vCPU in the same VM
which previously ran on this pCPU and has now started to run on another
pCPU. This is to handle the situation where the other vCPU touched
a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
on that new pCPU and thus left behind a stale TLB entry on this pCPU.
This introduces a limit on the the vcpu_token values used in the
H_ENTER_NESTED hcall -- they must now be less than NR_CPUS.
[paulus@ozlabs.org - made prev_cpu array be short[] to reduce
memory consumption.]
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This adds code to call the H_TLB_INVALIDATE hypercall when running as
a guest, in the cases where we need to invalidate TLBs (or other MMU
caches) as part of managing the mappings for a nested guest. Calling
H_TLB_INVALIDATE lets the nested hypervisor inform the parent
hypervisor about changes to partition-scoped page tables or the
partition table without needing to do hypervisor-privileged tlbie
instructions.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
When running a nested (L2) guest the guest (L1) hypervisor will use
the H_TLB_INVALIDATE hcall when it needs to change the partition
scoped page tables or the partition table which it manages. It will
use this hcall in the situations where it would use a partition-scoped
tlbie instruction if it were running in hypervisor mode.
The H_TLB_INVALIDATE hcall can invalidate different scopes:
Invalidate TLB for a given target address:
- This invalidates a single L2 -> L1 pte
- We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2
address space which is being invalidated. This is because a single
L2 -> L1 pte may have been mapped with more than one pte in the
L2 -> L0 page tables.
Invalidate the entire TLB for a given LPID or for all LPIDs:
- Invalidate the entire shadow_pgtable for a given nested guest, or
for all nested guests.
Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs:
- We don't cache the PWC, so nothing to do.
Invalidate the entire TLB, PWC and partition table for a given/all LPIDs:
- Here we re-read the partition table entry and remove the nested state
for any nested guest for which the first doubleword of the partition
table entry is now zero.
The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction
word (of which only the RIC, PRS and R fields are used), the rS value
(giving the lpid, where required) and the rB value (giving the IS, AP
and EPN values).
[paulus@ozlabs.org - adapted to having the partition table in guest
memory, added the H_TLB_INVALIDATE implementation, removed tlbie
instruction emulation, reworded the commit message.]
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
When a host (L0) page which is mapped into a (L1) guest is in turn
mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
so that these mappings can be retrieved later.
Whenever we create an entry in a shadow_pgtable for a nested guest we
create a corresponding rmap entry and add it to the list for the
L1 guest memslot at the index of the L1 guest page it maps. This means
at the L1 guest memslot we end up with lists of rmaps.
When we are notified of a host page being invalidated which has been
mapped through to a (L1) guest, we can then walk the rmap list for that
guest page, and find and invalidate all of the corresponding
shadow_pgtable entries.
In order to reduce memory consumption, we compress the information for
each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
for the guest real page frame number -- which will fit in a single
unsigned long. To avoid a scenario where a guest can trigger
unbounded memory allocations, we scan the list when adding an entry to
see if there is already an entry with the contents we need. This can
occur, because we don't ever remove entries from the middle of a list.
A struct nested guest rmap is a list pointer and an rmap entry;
----------------
| next pointer |
----------------
| rmap entry |
----------------
Thus the rmap pointer for each guest frame number in the memslot can be
either NULL, a single entry, or a pointer to a list of nested rmap entries.
gfn memslot rmap array
-------------------------
0 | NULL | (no rmap entry)
-------------------------
1 | single rmap entry | (rmap entry with low bit set)
-------------------------
2 | list head pointer | (list of rmap entries)
-------------------------
The final entry always has the lowest bit set and is stored in the next
pointer of the last list entry, or as a single rmap entry.
With a list of rmap entries looking like;
----------------- ----------------- -------------------------
| list head ptr | ----> | next pointer | ----> | single rmap entry |
----------------- ----------------- -------------------------
| rmap entry | | rmap entry |
----------------- -------------------------
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Consider a normal (L1) guest running under the main hypervisor (L0),
and then a nested guest (L2) running under the L1 guest which is acting
as a nested hypervisor. L0 has page tables to map the address space for
L1 providing the translation from L1 real address -> L0 real address;
L1
|
| (L1 -> L0)
|
----> L0
There are also page tables in L1 used to map the address space for L2
providing the translation from L2 real address -> L1 read address. Since
the hardware can only walk a single level of page table, we need to
maintain in L0 a "shadow_pgtable" for L2 which provides the translation
from L2 real address -> L0 real address. Which looks like;
L2 L2
| |
| (L2 -> L1) |
| |
----> L1 | (L2 -> L0)
| |
| (L1 -> L0) |
| |
----> L0 --------> L0
When a page fault occurs while running a nested (L2) guest we need to
insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
do this we need to:
1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
provide a page fault to L1 if this mapping doesn't exist.
2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
or try to insert a pte for that mapping if it doesn't exist.
3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable
Once this mapping exists we can take rc faults when hardware is unable
to automatically set the reference and change bits in the pte. On these
we need to:
1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect
the fault down to L1.
2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
host page.
3. Set the rc bits in the L2 -> L0 pte.
As we reuse a large number of functions in book3s_64_mmu_radix.c for
this we also needed to refactor a number of these functions to take
an lpid parameter so that the correct lpid is used for tlb invalidations.
The functionality however has remained the same.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
When we are running as a nested hypervisor, we use a hypercall to
enter the guest rather than code in book3s_hv_rmhandlers.S. This means
that the hypercall handlers listed in hcall_real_table never get called.
There are some hypercalls that are handled there and not in
kvmppc_pseries_do_hcall(), which therefore won't get processed for
a nested guest.
To fix this, we add cases to kvmppc_pseries_do_hcall() to handle those
hypercalls, with the following exceptions:
- The HPT hypercalls (H_ENTER, H_REMOVE, etc.) are not handled because
we only support radix mode for nested guests.
- H_CEDE has to be handled specially because the cede logic in
kvmhv_run_single_vcpu assumes that it has been processed by the time
that kvmhv_p9_guest_entry() returns. Therefore we put a special
case for H_CEDE in kvmhv_p9_guest_entry().
For the XICS hypercalls, if real-mode processing is enabled, then the
virtual-mode handlers assume that they are being called only to finish
up the operation. Therefore we turn off the real-mode flag in the XICS
code when running as a nested hypervisor.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This adds code to call the H_IPI and H_EOI hypercalls when we are
running as a nested hypervisor (i.e. without the CPU_FTR_HVMODE cpu
feature) and we would otherwise access the XICS interrupt controller
directly or via an OPAL call.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
hypervisor to enter one of its nested guests. The hypercall supplies
register values in two structs. Those values are copied by the level 0
(L0) hypervisor (the one which is running in hypervisor mode) into the
vcpu struct of the L1 guest, and then the guest is run until an
interrupt or error occurs which needs to be reported to L1 via the
hypercall return value.
Currently this assumes that the L0 and L1 hypervisors are the same
endianness, and the structs passed as arguments are in native
endianness. If they are of different endianness, the version number
check will fail and the hcall will be rejected.
Nested hypervisors do not support indep_threads_mode=N, so this adds
code to print a warning message if the administrator has set
indep_threads_mode=N, and treat it as Y.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This starts the process of adding the code to support nested HV-style
virtualization. It defines a new H_SET_PARTITION_TABLE hypercall which
a nested hypervisor can use to set the base address and size of a
partition table in its memory (analogous to the PTCR register).
On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE
hypercall from the guest is handled by code that saves the virtual
PTCR value for the guest.
This also adds code for creating and destroying nested guests and for
reading the partition table entry for a nested guest from L1 memory.
Each nested guest has its own shadow LPID value, different in general
from the LPID value used by the nested hypervisor to refer to it. The
shadow LPID value is allocated at nested guest creation time.
Nested hypervisor functionality is only available for a radix guest,
which therefore means a radix host on a POWER9 (or later) processor.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
kvmppc_unmap_pte() does a sequence of operations that are open-coded in
kvm_unmap_radix(). This extends kvmppc_unmap_pte() a little so that it
can be used by kvm_unmap_radix(), and makes kvm_unmap_radix() call it.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The radix page fault handler accounts for all cases, including just
needing to insert a pte. This breaks it up into separate functions for
the two main cases; setting rc and inserting a pte.
This allows us to make the setting of rc and inserting of a pte
generic for any pgtable, not specific to the one for this guest.
[paulus@ozlabs.org - reduced diffs from previous code]
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
agnostic
kvmppc_mmu_radix_xlate() is used to translate an effective address
through the process tables. The process table and partition tables have
identical layout. Exploit this fact to make the kvmppc_mmu_radix_xlate()
function able to translate either an effective address through the
process tables or a guest real address through the partition tables.
[paulus@ozlabs.org - reduced diffs from previous code]
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
When destroying a VM we return the LPID to the pool, however we never
zero the partition table entry. This is instead done when we reallocate
the LPID.
Zero the partition table entry on VM teardown before returning the LPID
to the pool. This means if we were running as a nested hypervisor the
real hypervisor could use this to determine when it can free resources.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
When the 'regs' field was added to struct kvm_vcpu_arch, the code
was changed to use several of the fields inside regs (e.g., gpr, lr,
etc.) but not the ccr field, because the ccr field in struct pt_regs
is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
only 32 bits. This changes the code to use the regs.ccr field
instead of cr, and changes the assembly code on 64-bit platforms to
use 64-bit loads and stores instead of 32-bit ones.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This adds a file called 'radix' in the debugfs directory for the
guest, which when read gives all of the valid leaf PTEs in the
partition-scoped radix tree for a radix guest, in human-readable
format. It is analogous to the existing 'htab' file which dumps
the HPT entries for a HPT guest.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Currently the code for handling hypervisor instruction page faults
passes 0 for the flags indicating the type of fault, which is OK in
the usual case that the page is not mapped in the partition-scoped
page tables. However, there are other causes for hypervisor
instruction page faults, such as not being to update a reference
(R) or change (C) bit. The cause is indicated in bits in HSRR1,
including a bit which indicates that the fault is due to not being
able to write to a page (for example to update an R or C bit).
Not handling these other kinds of faults correctly can lead to a
loop of continual faults without forward progress in the guest.
In order to handle these faults better, this patch constructs a
"DSISR-like" value from the bits which DSISR and SRR1 (for a HISI)
have in common, and passes it to kvmppc_book3s_hv_page_fault() so
that it knows what caused the fault.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This creates an alternative guest entry/exit path which is used for
radix guests on POWER9 systems when we have indep_threads_mode=Y. In
these circumstances there is exactly one vcpu per vcore and there is
no coordination required between vcpus or vcores; the vcpu can enter
the guest without needing to synchronize with anything else.
The new fast path is implemented almost entirely in C in book3s_hv.c
and runs with the MMU on until the guest is entered. On guest exit
we use the existing path until the point where we are committed to
exiting the guest (as distinct from handling an interrupt in the
low-level code and returning to the guest) and we have pulled the
guest context from the XIVE. At that point we check a flag in the
stack frame to see whether we came in via the old path and the new
path; if we came in via the new path then we go back to C code to do
the rest of the process of saving the guest context and restoring the
host context.
The C code is split into separate functions for handling the
OS-accessible state and the hypervisor state, with the idea that the
latter can be replaced by a hypercall when we implement nested
virtualization.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix CONFIG_ALTIVEC=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Currently kvmppc_handle_exit_hv() is called with the vcore lock held
because it is called within a for_each_runnable_thread loop.
However, we already unlock the vcore within kvmppc_handle_exit_hv()
under certain circumstances, and this is safe because (a) any vcpus
that become runnable and are added to the runnable set by
kvmppc_run_vcpu() have their vcpu->arch.trap == 0 and can't actually
run in the guest (because the vcore state is VCORE_EXITING), and
(b) for_each_runnable_thread is safe against addition or removal
of vcpus from the runnable set.
Therefore, in order to simplify things for following patches, let's
drop the vcore lock in the for_each_runnable_thread loop, so
kvmppc_handle_exit_hv() gets called without the vcore lock held.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This adds a parameter to __kvmppc_save_tm and __kvmppc_restore_tm
which allows the caller to indicate whether it wants the nonvolatile
register state to be preserved across the call, as required by the C
calling conventions. This parameter being non-zero also causes the
MSR bits that enable TM, FP, VMX and VSX to be preserved. The
condition register and DSCR are now always preserved.
With this, kvmppc_save_tm_hv and kvmppc_restore_tm_hv can be called
from C code provided the 3rd parameter is non-zero. So that these
functions can be called from modules, they now include code to set
the TOC pointer (r2) on entry, as they can call other built-in C
functions which will assume the TOC to have been set.
Also, the fake suspend code in kvmppc_save_tm_hv is modified here to
assume that treclaim in fake-suspend state does not modify any registers,
which is the case on POWER9. This enables the code to be simplified
quite a bit.
_kvmppc_save_tm_pr and _kvmppc_restore_tm_pr become much simpler with
this change, since they now only need to save and restore TAR and pass
1 for the 3rd argument to __kvmppc_{save,restore}_tm.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This streamlines the first part of the code that handles a hypervisor
interrupt that occurred in the guest. With this, all of the real-mode
handling that occurs is done before the "guest_exit_cont" label; once
we get to that label we are committed to exiting to host virtual mode.
Thus the machine check and HMI real-mode handling is moved before that
label.
Also, the code to handle external interrupts is moved out of line, as
is the code that calls kvmppc_realmode_hmi_handler().
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This pulls out the assembler code that is responsible for saving and
restoring the PMU state for the host and guest into separate functions
so they can be used from an alternate entry path. The calling
convention is made compatible with C.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This is based on a patch by Suraj Jitindar Singh.
This moves the code in book3s_hv_rmhandlers.S that generates an
external, decrementer or privileged doorbell interrupt just before
entering the guest to C code in book3s_hv_builtin.c. This is to
make future maintenance and modification easier. The algorithm
expressed in the C code is almost identical to the previous
algorithm.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This removes code that clears the external interrupt pending bit in
the pending_exceptions bitmap. This is left over from an earlier
iteration of the code where this bit was set when an escalation
interrupt arrived in order to wake the vcpu from cede. Currently
we set the vcpu->arch.irq_pending flag instead for this purpose.
Therefore there is no need to do anything with the pending_exceptions
bitmap.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Currently we use two bits in the vcpu pending_exceptions bitmap to
indicate that an external interrupt is pending for the guest, one
for "one-shot" interrupts that are cleared when delivered, and one
for interrupts that persist until cleared by an explicit action of
the OS (e.g. an acknowledge to an interrupt controller). The
BOOK3S_IRQPRIO_EXTERNAL bit is used for one-shot interrupt requests
and BOOK3S_IRQPRIO_EXTERNAL_LEVEL is used for persisting interrupts.
In practice BOOK3S_IRQPRIO_EXTERNAL never gets used, because our
Book3S platforms generally, and pseries in particular, expect
external interrupt requests to persist until they are acknowledged
at the interrupt controller. That combined with the confusion
introduced by having two bits for what is essentially the same thing
makes it attractive to simplify things by only using one bit. This
patch does that.
With this patch there is only BOOK3S_IRQPRIO_EXTERNAL, and by default
it has the semantics of a persisting interrupt. In order to avoid
breaking the ABI, we introduce a new "external_oneshot" flag which
preserves the behaviour of the KVM_INTERRUPT ioctl with the
KVM_INTERRUPT_SET argument.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
When doing nested virtualization, it is only necessary to do the
transactional memory hypervisor assist at level 0, that is, when
we are in hypervisor mode. Nested hypervisors can just use the TM
facilities as architected. Therefore we should clear the
CPU_FTR_P9_TM_HV_ASSIST bit when we are not in hypervisor mode,
along with the CPU_FTR_HVMODE bit.
Doing this will not change anything at this stage because the only
code that tests CPU_FTR_P9_TM_HV_ASSIST is in HV KVM, which currently
can only be used when when CPU_FTR_HVMODE is set.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The kvmppc_gpa_to_ua() helper itself takes care of the permission
bits in the TCE and yet every single caller removes them.
This changes semantics of kvmppc_gpa_to_ua() so it takes TCEs
(which are GPAs + TCE permission bits) to make the callers simpler.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
At the moment if the PUT_TCE{_INDIRECT} handlers fail to update
the hardware tables, we print a warning once, clear the entry and
continue. This is so as at the time the assumption was that if
a VFIO device is hotplugged into the guest, and the userspace replays
virtual DMA mappings (i.e. TCEs) to the hardware tables and if this fails,
then there is nothing useful we can do about it.
However the assumption is not valid as these handlers are not called for
TCE replay (VFIO ioctl interface is used for that) and these handlers
are for new TCEs.
This returns an error to the guest if there is a request which cannot be
processed. By now the only possible failure must be H_TOO_HARD.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The userspace can request an arbitrary supported page size for a DMA
window and this works fine as long as the mapped memory is backed with
the pages of the same or bigger size; if this is not the case,
mm_iommu_ua_to_hpa{_rm}() fail and tables do not populated with
dangerously incorrect TCEs.
However since it is quite easy to misconfigure the KVM and we do not do
reverts to all changes made to TCE tables if an error happens in a middle,
we better do the acceptable page size validation before we even touch
the tables.
This enhances kvmppc_tce_validate() to check the hardware IOMMU page sizes
against the preregistered memory page sizes.
Since the new check uses real/virtual mode helpers, this renames
kvmppc_tce_validate() to kvmppc_rm_tce_validate() to handle the real mode
case and mirrors it for the virtual mode under the old name. The real
mode handler is not used for the virtual mode as:
1. it uses _lockless() list traversing primitives instead of RCU;
2. realmode's mm_iommu_ua_to_hpa_rm() uses vmalloc_to_phys() which
virtual mode does not have to use and since on POWER9+radix only virtual
mode handlers actually work, we do not want to slow down that path even
a bit.
This removes EXPORT_SYMBOL_GPL(kvmppc_tce_validate) as the validators
are static now.
From now on the attempts on mapping IOMMU pages bigger than allowed
will result in KVM exit.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix KVM_HV=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree:
1) Support for matching on ipsec policy already set in the route, from
Florian Westphal.
2) Split set destruction into deactivate and destroy phase to make it
fit better into the transaction infrastructure, also from Florian.
This includes a patch to warn on imbalance when setting the new
activate and deactivate interfaces.
3) Release transaction list from the workqueue to remove expensive
synchronize_rcu() from configuration plane path. This speeds up
configuration plane quite a bit. From Florian Westphal.
4) Add new xfrm/ipsec extension, this new extension allows you to match
for ipsec tunnel keys such as source and destination address, spi and
reqid. From Máté Eckl and Florian Westphal.
5) Add secmark support, this includes connsecmark too, patches
from Christian Gottsche.
6) Allow to specify remaining bytes in xt_quota, from Chenbo Feng.
One follow up patch to calm a clang warning for this one, from
Nathan Chancellor.
7) Flush conntrack entries based on layer 3 family, from Kristian Evensen.
8) New revision for cgroups2 to shrink the path field.
9) Get rid of obsolete need_conntrack(), as a result from recent
demodularization works.
10) Use WARN_ON instead of BUG_ON, from Florian Westphal.
11) Unused exported symbol in nf_nat_ipv4_fn(), from Florian.
12) Remove superfluous check for timeout netlink parser and dump
functions in layer 4 conntrack helpers.
13) Unnecessary redundant rcu read side locks in NAT redirect,
from Taehee Yoo.
14) Pass nf_hook_state structure to error handlers, patch from
Florian Westphal.
15) Remove ->new() interface from layer 4 protocol trackers. Place
them in the ->packet() interface. From Florian.
16) Place conntrack ->error() handling in the ->packet() interface.
Patches from Florian Westphal.
17) Remove unused parameter in the pernet initialization path,
also from Florian.
18) Remove additional parameter to specify layer 3 protocol when
looking up for protocol tracker. From Florian.
19) Shrink array of layer 4 protocol trackers, from Florian.
20) Check for linear skb only once from the ALG NAT mangling
codebase, from Taehee Yoo.
21) Use rhashtable_walk_enter() instead of deprecated
rhashtable_walk_init(), also from Taehee.
22) No need to flush all conntracks when only one single address
is gone, from Tan Hu.
23) Remove redundant check for NAT flags in flowtable code, from
Taehee Yoo.
24) Use rhashtable_lookup() instead of rhashtable_lookup_fast()
from netfilter codebase, since rcu read lock side is already
assumed in this path.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Clang warns when one enumerated type is implicitly converted to another.
drivers/spi/spi-ep93xx.c:342:62: warning: implicit conversion from
enumeration type 'enum dma_transfer_direction' to different enumeration
type 'enum dma_data_direction' [-Wenum-conversion]
nents = dma_map_sg(chan->device->dev, sgt->sgl, sgt->nents, dir);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
./include/linux/dma-mapping.h:428:58: note: expanded from macro
'dma_map_sg'
#define dma_map_sg(d, s, n, r) dma_map_sg_attrs(d, s, n, r, 0)
~~~~~~~~~~~~~~~~ ^
drivers/spi/spi-ep93xx.c:348:57: warning: implicit conversion from
enumeration type 'enum dma_transfer_direction' to different enumeration
type 'enum dma_data_direction' [-Wenum-conversion]
dma_unmap_sg(chan->device->dev, sgt->sgl, sgt->nents, dir);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
./include/linux/dma-mapping.h:429:62: note: expanded from macro
'dma_unmap_sg'
#define dma_unmap_sg(d, s, n, r) dma_unmap_sg_attrs(d, s, n, r, 0)
~~~~~~~~~~~~~~~~~~ ^
drivers/spi/spi-ep93xx.c:377:56: warning: implicit conversion from
enumeration type 'enum dma_transfer_direction' to different enumeration
type 'enum dma_data_direction' [-Wenum-conversion]
dma_unmap_sg(chan->device->dev, sgt->sgl, sgt->nents, dir);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
./include/linux/dma-mapping.h:429:62: note: expanded from macro
'dma_unmap_sg'
#define dma_unmap_sg(d, s, n, r) dma_unmap_sg_attrs(d, s, n, r, 0)
~~~~~~~~~~~~~~~~~~ ^
3 warnings generated.
dma_{,un}map_sg expect an enum of type dma_data_direction but this
driver uses dma_transfer_direction for everything. Convert the driver to
use dma_data_direction for these two functions.
There are two places that strictly require an enum of type
dma_transfer_direction: the direction member in struct dma_slave_config
and the direction parameter in dmaengine_prep_slave_sg. To avoid using
an explicit cast, add a simple function, ep93xx_dma_data_to_trans_dir,
to safely map between the two types because they are not 1 to 1 in
meaning.
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
The newly added TCP and UDP handling fails to link when CONFIG_INET
is disabled:
net/core/filter.o: In function `sk_lookup':
filter.c:(.text+0x7ff8): undefined reference to `tcp_hashinfo'
filter.c:(.text+0x7ffc): undefined reference to `tcp_hashinfo'
filter.c:(.text+0x8020): undefined reference to `__inet_lookup_established'
filter.c:(.text+0x8058): undefined reference to `__inet_lookup_listener'
filter.c:(.text+0x8068): undefined reference to `udp_table'
filter.c:(.text+0x8070): undefined reference to `udp_table'
filter.c:(.text+0x808c): undefined reference to `__udp4_lib_lookup'
net/core/filter.o: In function `bpf_sk_release':
filter.c:(.text+0x82e8): undefined reference to `sock_gen_put'
Wrap the related sections of code in #ifdefs for the config option.
Furthermore, sk_lookup() should always have been marked 'static', this
also avoids a warning about a missing prototype when building with
'make W=1'.
Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Clang warns:
net/netfilter/xt_quota.c:47:44: warning: 'aligned' attribute ignored
when parsing type [-Wignored-attributes]
BUILD_BUG_ON(sizeof(atomic64_t) != sizeof(__aligned_u64));
^~~~~~~~~~~~~
Use 'sizeof(__u64)' instead, as the alignment doesn't affect the size
of the type.
Fixes: e9837e55b020 ("netfilter: xt_quota: fix the behavior of xt_quota module")
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The rxrpc_input_packet() function and its call tree was built around the
assumption that data_ready() handler called from UDP to inform a kernel
service that there is data to be had was non-reentrant. This means that
certain locking could be dispensed with.
This, however, turns out not to be the case with a multi-queue network card
that can deliver packets to multiple cpus simultaneously. Each of those
cpus can be in the rxrpc_input_packet() function at the same time.
Fix by adding or changing some structure members:
(1) Add peer->rtt_input_lock to serialise access to the RTT buffer.
(2) Make conn->service_id into a 32-bit variable so that it can be
cmpxchg'd on all arches.
(3) Add call->input_lock to serialise access to the Rx/Tx state. Note
that although the Rx and Tx states are (almost) entirely separate,
there's no point completing the separation and having separate locks
since it's a bi-phasal RPC protocol rather than a bi-direction
streaming protocol. Data transmission and data reception do not take
place simultaneously on any particular call.
and making the following functional changes:
(1) In rxrpc_input_data(), hold call->input_lock around the core to
prevent simultaneous producing of packets into the Rx ring and
updating of tracking state for a particular call.
(2) In rxrpc_input_ping_response(), only read call->ping_serial once, and
check it before checking RXRPC_CALL_PINGING as that's a cheaper test.
The bit test and bit clear can then be combined. No further locking
is needed here.
(3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of
the ACK packet. The superseded ACK check is then done both before and
after the lock is taken.
The handing of ackinfo data is split, parsing before the lock is taken
and processing with it held. This is keyed on rxMTU being non-zero.
Congestion management is also done within the locked section.
(4) In rxrpc_input_ackall(), take call->input_lock around the Tx window
rotation. The ACKALL packet carries no information and is only really
useful after all packets have been transmitted since it's imprecise.
(5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to
prevent calls being simultaneously implicitly ended on two cpus and
also to prevent any races with incoming call setup.
(6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade
on a connection. It is only permitted to happen once for a
connection.
(7) In rxrpc_new_incoming_call(), we have to recheck the routing inside
rx->incoming_lock to see if someone else set up the call, connection
or peer whilst we were getting there. We can't trust the values from
the earlier routing check unless we pin refs on them - which we want
to avoid.
Further, we need to allow for an incoming call to have its state
changed on another CPU between us making it live and us adjusting it
because the conn is now in the RXRPC_CONN_SERVICE state.
(8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access
to the RTT buffer. Don't need to lock around setting peer->rtt.
For reference, the inventory of state-accessing or state-altering functions
used by the packet input procedure is:
> rxrpc_input_packet()
* PACKET CHECKING
* ROUTING
> rxrpc_post_packet_to_local()
> rxrpc_find_connection_rcu() - uses RCU
> rxrpc_lookup_peer_rcu() - uses RCU
> rxrpc_find_service_conn_rcu() - uses RCU
> idr_find() - uses RCU
* CONNECTION-LEVEL PROCESSING
- Service upgrade
- Can only happen once per conn
! Changed to use cmpxchg
> rxrpc_post_packet_to_conn()
- Setting conn->hi_serial
- Probably safe not using locks
- Maybe use cmpxchg
* CALL-LEVEL PROCESSING
> Old-call checking
> rxrpc_input_implicit_end_call()
> rxrpc_call_completed()
> rxrpc_queue_call()
! Need to take rx->incoming_lock
> __rxrpc_disconnect_call()
> rxrpc_notify_socket()
> rxrpc_new_incoming_call()
- Uses rx->incoming_lock for the entire process
- Might be able to drop this earlier in favour of the call lock
> rxrpc_incoming_call()
! Conflicts with rxrpc_input_implicit_end_call()
> rxrpc_send_ping()
- Don't need locks to check rtt state
> rxrpc_propose_ACK
* PACKET DISTRIBUTION
> rxrpc_input_call_packet()
> rxrpc_input_data()
* QUEUE DATA PACKET ON CALL
> rxrpc_reduce_call_timer()
- Uses timer_reduce()
! Needs call->input_lock()
> rxrpc_receiving_reply()
! Needs locking around ack state
> rxrpc_rotate_tx_window()
> rxrpc_end_tx_phase()
> rxrpc_proto_abort()
> rxrpc_input_dup_data()
- Fills the Rx buffer
- rxrpc_propose_ACK()
- rxrpc_notify_socket()
> rxrpc_input_ack()
* APPLY ACK PACKET TO CALL AND DISCARD PACKET
> rxrpc_input_ping_response()
- Probably doesn't need any extra locking
! Need READ_ONCE() on call->ping_serial
> rxrpc_input_check_for_lost_ack()
- Takes call->lock to consult Tx buffer
> rxrpc_peer_add_rtt()
! Needs to take a lock (peer->rtt_input_lock)
! Could perhaps manage with cmpxchg() and xadd() instead
> rxrpc_input_requested_ack
- Consults Tx buffer
! Probably needs a lock
> rxrpc_peer_add_rtt()
> rxrpc_propose_ack()
> rxrpc_input_ackinfo()
- Changes call->tx_winsize
! Use cmpxchg to handle change
! Should perhaps track serial number
- Uses peer->lock to record MTU specification changes
> rxrpc_proto_abort()
! Need to take call->input_lock
> rxrpc_rotate_tx_window()
> rxrpc_end_tx_phase()
> rxrpc_input_soft_acks()
- Consults the Tx buffer
> rxrpc_congestion_management()
- Modifies the Tx annotations
! Needs call->input_lock()
> rxrpc_queue_call()
> rxrpc_input_abort()
* APPLY ABORT PACKET TO CALL AND DISCARD PACKET
> rxrpc_set_call_completion()
> rxrpc_notify_socket()
> rxrpc_input_ackall()
* APPLY ACKALL PACKET TO CALL AND DISCARD PACKET
! Need to take call->input_lock
> rxrpc_rotate_tx_window()
> rxrpc_end_tx_phase()
> rxrpc_reject_packet()
There are some functions used by the above that queue the packet, after
which the procedure is terminated:
- rxrpc_post_packet_to_local()
- local->event_queue is an sk_buff_head
- local->processor is a work_struct
- rxrpc_post_packet_to_conn()
- conn->rx_queue is an sk_buff_head
- conn->processor is a work_struct
- rxrpc_reject_packet()
- local->reject_queue is an sk_buff_head
- local->processor is a work_struct
And some that offload processing to process context:
- rxrpc_notify_socket()
- Uses RCU lock
- Uses call->notify_lock to call call->notify_rx
- Uses call->recvmsg_lock to queue recvmsg side
- rxrpc_queue_call()
- call->processor is a work_struct
- rxrpc_propose_ACK()
- Uses call->lock to wrap __rxrpc_propose_ACK()
And a bunch that complete a call, all of which use call->state_lock to
protect the call state:
- rxrpc_call_completed()
- rxrpc_set_call_completion()
- rxrpc_abort_call()
- rxrpc_proto_abort()
- Also uses rxrpc_queue_call()
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Signed-off-by: David Howells <dhowells@redhat.com>
|