Age | Commit message (Collapse) | Author |
|
Since 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded
unreplied tuple"), flowtable GC pushes back flows with IPS_SEEN_REPLY
back to classic path in every run, ie. every second. This is because of
a new check for NF_FLOW_HW_ESTABLISHED which is specific of sched/act_ct.
In Netfilter's flowtable case, NF_FLOW_HW_ESTABLISHED never gets set on
and IPS_SEEN_REPLY is unreliable since users decide when to offload the
flow before, such bit might be set on at a later stage.
Fix it by adding a custom .gc handler that sched/act_ct can use to
deal with its NF_FLOW_HW_ESTABLISHED bit.
Fixes: 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple")
Reported-by: Vladimir Smelhaus <vl.sm@email.cz>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
net->ct.labels_used was meant to convey 'number of ip/nftables rules
that need the label extension allocated'.
act_ct enables this for each net namespace, which voids all attempts
to avoid ct->ext allocation when possible.
Move this increment to the control plane to request label extension
space allocation only when its needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add comment to indicate that the callback function target_destroy in the
scsi_host_template must not sleep.
Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
Link: https://lore.kernel.org/r/20231018113746.1940197-3-haowenchao2@huawei.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
This work adds a new, minimal BPF-programmable device called "netkit"
(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
core idea is that BPF programs are executed within the drivers xmit routine
and therefore e.g. in case of containers/Pods moving BPF processing closer
to the source.
One of the goals was that in case of Pod egress traffic, this allows to
move BPF programs from hostns tcx ingress into the device itself, providing
earlier drop or forward mechanisms, for example, if the BPF program
determines that the skb must be sent out of the node, then a redirect to
the physical device can take place directly without going through per-CPU
backlog queue. This helps to shift processing for such traffic from softirq
to process context, leading to better scheduling decisions/performance (see
measurements in the slides).
In this initial version, the netkit device ships as a pair, but we plan to
extend this further so it can also operate in single device mode. The pair
comes with a primary and a peer device. Only the primary device, typically
residing in hostns, can manage BPF programs for itself and its peer. The
peer device is designated for containers/Pods and cannot attach/detach
BPF programs. Upon the device creation, the user can set the default policy
to 'pass' or 'drop' for the case when no BPF program is attached.
Additionally, the device can be operated in L3 (default) or L2 mode. The
management of BPF programs is done via bpf_mprog, so that multi-attach is
supported right from the beginning with similar API and dependency controls
as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
so that existing programs can be easily migrated.
Going forward, we plan to use netkit devices in Cilium as the main device
type for connecting Pods. They will be operated in L3 mode in order to
simplify a Pod's neighbor management and the peer will operate in default
drop mode, so that no traffic is leaving between the time when a Pod is
brought up by the CNI plugin and programs attached by the agent.
Additionally, the programs we attach via tcx on the physical devices are
using bpf_redirect_peer() for inbound traffic into netkit device, hence the
latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
directly. Also, BIG TCP is supported on netkit device. For the follow-up
work in single device mode, we plan to convert Cilium's cilium_host/_net
devices into a single one.
An extensive test suite for checking device operations and the BPF program
and link management API comes as BPF selftests in this series.
Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://github.com/borkmann/iproute2/tree/pr/netkit
Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
For unimplemented counters, the registers PM{C,I}NTEN{SET,CLR}
and PMOVS{SET,CLR} are expected to have the corresponding bits RAZ.
Hence to ensure correct KVM's PMU emulation, mask out the RES0 bits.
Defer this work to the point that userspace can no longer change the
number of advertised PMCs.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231020214053.2144305-7-rananta@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The number of PMU event counters is indicated in PMCR_EL0.N.
For a vCPU with PMUv3 configured, the value is set to the same
value as the current PE on every vCPU reset. Unless the vCPU is
pinned to PEs that has the PMU associated to the guest from the
initial vCPU reset, the value might be different from the PMU's
PMCR_EL0.N on heterogeneous PMU systems.
Fix this by setting the vCPU's PMCR_EL0.N to the PMU's PMCR_EL0.N
value. Track the PMCR_EL0.N per guest, as only one PMU can be set
for the guest (PMCR_EL0.N must be the same for all vCPUs of the
guest), and it is convenient for updating the value.
To achieve this, the patch introduces a helper,
kvm_arm_pmu_get_max_counters(), that reads the maximum number of
counters from the arm_pmu associated to the VM. Make the function
global as upcoming patches will be interested to know the value
while setting the PMCR.N of the guest from userspace.
KVM does not yet support userspace modifying PMCR_EL0.N.
The following patch will add support for that.
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Co-developed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Link: https://lore.kernel.org/r/20231020214053.2144305-5-rananta@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Add a helper to read a vCPU's PMCR_EL0, and use it whenever KVM
reads a vCPU's PMCR_EL0.
Currently, the PMCR_EL0 value is tracked per vCPU. The following
patches will make (only) PMCR_EL0.N track per guest. Having the
new helper will be useful to combine the PMCR_EL0.N field
(tracked per guest) and the other fields (tracked per vCPU)
to provide the value of PMCR_EL0.
No functional change intended.
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231020214053.2144305-4-rananta@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Future changes to KVM's sysreg emulation will rely on having a valid PMU
instance to determine the number of implemented counters (PMCR_EL0.N).
This is earlier than when userspace is expected to modify the vPMU
device attributes, where the default is selected today.
Select the default PMU when handling KVM_ARM_VCPU_INIT such that it is
available in time for sysreg emulation.
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Co-developed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Link: https://lore.kernel.org/r/20231020214053.2144305-3-rananta@google.com
[Oliver: rewrite changelog]
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Use FIELD_GET() to remove dependences on the field position, i.e., the
shift value. No functional change intended.
Link: https://lore.kernel.org/r/20231010204436.1000644-8-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
|
|
Use FIELD_GET() to remove dependences on the field position, i.e., the
shift value. No functional change intended.
Link: https://lore.kernel.org/r/20231010204436.1000644-6-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
|
|
The PASID Capability and Control registers are both 16 bits wide. Use
16-bit wide constants in field names to match the register width. No
functional change intended.
Link: https://lore.kernel.org/r/20231010204436.1000644-5-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
|
|
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct crash_mem.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: kexec@lists.infradead.org
Acked-by: Baoquan He <bhe@redhat.com>
Link: https://lore.kernel.org/r/20230922175224.work.712-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Three more fixes:
- don't drop all unprotected public action frames since
some don't have a protected dual
- fix pointer confusion in scanning code
- fix warning in some connections with multiple links
* tag 'wireless-2023-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: don't drop all unprotected public action frames
wifi: cfg80211: fix assoc response warning on failed links
wifi: cfg80211: pass correct pointer to rdev_inform_bss()
====================
Link: https://lore.kernel.org/r/20231024103540.19198-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This preserves the existing IFLA_DSA_MASTER which is part of the uAPI
and creates an alias named IFLA_DSA_CONDUIT.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/20231023181729.1191071-3-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use more inclusive terms throughout the DSA subsystem by moving away
from "master" which is replaced by "conduit" and "slave" which is
replaced by "user". No functional changes.
Acked-by: Rob Herring <robh@kernel.org>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/20231023181729.1191071-2-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
generated with:
$ ./tools/net/ynl/ynl-gen-c.py --mode uapi \
> --spec Documentation/netlink/specs/mptcp.yaml \
> --header -o include/uapi/linux/mptcp_pm.h
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-5-16b1f701f900@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
in the current MPTCP control plane, all operations use a netlink
attribute of the same type "MPTCP_PM_ATTR". However, add/del/get/flush
operations only parse the first element in the message _ the one that
describes MPTCP endpoints (that was named MPTCP_PM_ATTR_ADDR and
mostly used in ADD_ADDR operations _ probably the similarity of "attr",
"addr" and "add" might cause some confusion to human readers).
Convert MPTCP from 'small_ops' to 'ops', thus allowing different attributes
for each single operation, hopefully makes all this clearer to human
readers.
- use a separate attribute set for add/del/get/flush address operation,
binary compatible with the existing one, to store the endpoint address.
MPTCP_PM_ENDPOINT_ADDR is added to the uAPI (with the same value as
MPTCP_PM_ATTR_ADDR) for these operations.
- convert mptcp_pm_ops[] and add policy files accordingly.
this prepares MPTCP control plane to be described as YAML spec.
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-3-16b1f701f900@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes. 12 are cc:stable and the remainder address post-6.5
issues or aren't considered necessary for earlier kernel versions"
* tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
maple_tree: add GFP_KERNEL to allocations in mas_expected_entries()
selftests/mm: include mman header to access MREMAP_DONTUNMAP identifier
mailmap: correct email aliasing for Oleksij Rempel
mailmap: map Bartosz's old address to the current one
mm/damon/sysfs: check DAMOS regions update progress from before_terminate()
MAINTAINERS: Ondrej has moved
kasan: disable kasan_non_canonical_hook() for HW tags
kasan: print the original fault addr when access invalid shadow
hugetlbfs: close race between MADV_DONTNEED and page fault
hugetlbfs: extend hugetlb_vma_lock to private VMAs
hugetlbfs: clear resv_map pointer if mmap fails
mm: zswap: fix pool refcount bug around shrink_worker()
mm/migrate: fix do_pages_move for compat pointers
riscv: fix set_huge_pte_at() for NAPOT mappings when a swap entry is set
riscv: handle VM_FAULT_[HWPOISON|HWPOISON_LARGE] faults instead of panicking
mmap: fix error paths with dup_anon_vma()
mmap: fix vma_iterator in error path of vma_merge()
mm: fix vm_brk_flags() to not bail out while holding lock
mm/mempolicy: fix set_mempolicy_home_node() previous VMA pointer
mm/page_alloc: correct start page when guard page debug is enabled
|
|
Introduce acpi_dev_uid_match() helper that matches the device with
supplied _UID string.
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
DRM_FORMAT_NV20 and DRM_FORMAT_NV30 formats is the 2x1 and non-subsampled
variant of NV15, a 10-bit 2-plane YUV format that has no padding between
components. Instead, luminance and chrominance samples are grouped into 4s
so that each group is packed into an integer number of bytes:
YYYY = UVUV = 4 * 10 bits = 40 bits = 5 bytes
The '20' and '30' suffix refers to the optimum effective bits per pixel
which is achieved when the total number of luminance samples is a multiple
of 4.
V2: Added NV30 format
Signed-off-by: Jonas Karlman <jonas@kwiboo.se>
Reviewed-by: Sandy Huang <hjc@rock-chips.com>
Reviewed-by: Christopher Obbard <chris.obbard@collabora.com>
Tested-by: Christopher Obbard <chris.obbard@collabora.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20231023173718.188102-2-jonas@kwiboo.se
|
|
Suzuki noticed that KVM's PMU emulation is oblivious to the NSU and NSK
event filter bits. On systems that have EL3 these bits modify the
filter behavior in non-secure EL0 and EL1, respectively. Even though the
kernel doesn't use these bits, it is entirely possible some other guest
OS does. Additionally, it would appear that these and the M bit are
required by the architecture if EL3 is implemented.
Allow the EL3 event filter bits to be set if EL3 is advertised in the
guest's ID register. Implement the behavior of NSU and NSK according to
the pseudocode, and entirely ignore the M bit for perf event creation.
Reported-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20231019185618.3442949-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The NSH bit, which filters event counting at EL2, is required by the
architecture if an implementation has EL2. Even though KVM doesn't
support nested virt yet, it makes no effort to hide the existence of EL2
from the ID registers. Userspace can, however, change the value of PFR0
to hide EL2. Align KVM's sysreg emulation with the architecture and make
NSH RES0 if EL2 isn't advertised. Keep in mind the bit is ignored when
constructing the backing perf event.
While at it, build the event type mask using explicit field definitions
instead of relying on ARMV8_PMU_EVTYPE_MASK. KVM probably should've been
doing this in the first place, as it avoids changes to the
aforementioned mask affecting sysreg emulation.
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20231019185618.3442949-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The logic of whether filesystem can encode/decode file handles is open
coded in many places.
In preparation to changing the logic, move the open coded logic into
inline helpers.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023180801.2953446-2-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add new defines for DPC reason fields and use them instead of literals.
Link: https://lore.kernel.org/r/20231018113254.17616-7-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
[bhelgaas: shorten comments]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
Use FIELD_GET() to remove dependencies on the field position, i.e., the
shift value. No functional change intended.
Link: https://lore.kernel.org/r/20231018113254.17616-5-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
Convert open-coded variants of PCI field access into FIELD_GET/PREP()
to make the code easier to understand.
Add two missing defines into pci_regs.h. Logically, the Max No-Snoop
Latency Register is a separate word sized register in the PCIe spec,
but the pre-existing LTR defines in pci_regs.h with dword long values
seem to consider the registers together (the same goes for the only
user). Thus, follow the custom and make the new values also take both
word long LTR registers as a joint dword register.
Link: https://lore.kernel.org/r/20231024110336.26264-1-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
VFIO has an operation where it unmaps an IOVA while returning a bitmap with
the dirty data. In reality the operation doesn't quite query the IO
pagetables that the PTE was dirty or not. Instead it marks as dirty on
anything that was mapped, and doing so in one syscall.
In IOMMUFD the equivalent is done in two operations by querying with
GET_DIRTY_IOVA followed by UNMAP_IOVA. However, this would incur two TLB
flushes given that after clearing dirty bits IOMMU implementations require
invalidating their IOTLB, plus another invalidation needed for the UNMAP.
To allow dirty bits to be queried faster, add a flag
(IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR) that requests to not clear the dirty
bits from the PTE (but just reading them), under the expectation that the
next operation is the unmap. An alternative is to unmap and just
perpectually mark as dirty as that's the same behaviour as today. So here
equivalent functionally can be provided with unmap alone, and if real dirty
info is required it will amortize the cost while querying.
There's still a race against DMA where in theory the unmap of the IOVA
(when the guest invalidates the IOTLB via emulated iommu) would race
against the VF performing DMA on the same IOVA. As discussed in [0], we are
accepting to resolve this race as throwing away the DMA and it doesn't
matter if it hit physical DRAM or not, the VM can't tell if we threw it
away because the DMA was blocked or because we failed to copy the DRAM.
[0] https://lore.kernel.org/linux-iommu/20220502185239.GR8364@nvidia.com/
Link: https://lore.kernel.org/r/20231024135109.73787-10-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Extend IOMMUFD_CMD_GET_HW_INFO op to query generic iommu capabilities for a
given device.
Capabilities are IOMMU agnostic and use device_iommu_capable() API passing
one of the IOMMU_CAP_*. Enumerate IOMMU_CAP_DIRTY_TRACKING for now in the
out_capabilities field returned back to userspace.
Link: https://lore.kernel.org/r/20231024135109.73787-9-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Connect a hw_pagetable to the IOMMU core dirty tracking
read_and_clear_dirty iommu domain op. It exposes all of the functionality
for the UAPI that read the dirtied IOVAs while clearing the Dirty bits from
the PTEs.
In doing so, add an IO pagetable API iopt_read_and_clear_dirty_data() that
performs the reading of dirty IOPTEs for a given IOVA range and then
copying back to userspace bitmap.
Underneath it uses the IOMMU domain kernel API which will read the dirty
bits, as well as atomically clearing the IOPTE dirty bit and flushing the
IOTLB at the end. The IOVA bitmaps usage takes care of the iteration of the
bitmaps user pages efficiently and without copies. Within the iterator
function we iterate over io-pagetable contigous areas that have been
mapped.
Contrary to past incantation of a similar interface in VFIO the IOVA range
to be scanned is tied in to the bitmap size, thus the application needs to
pass a appropriately sized bitmap address taking into account the iova
range being passed *and* page size ... as opposed to allowing bitmap-iova
!= iova.
Link: https://lore.kernel.org/r/20231024135109.73787-8-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Every IOMMU driver should be able to implement the needed iommu domain ops
to control dirty tracking.
Connect a hw_pagetable to the IOMMU core dirty tracking ops, specifically
the ability to enable/disable dirty tracking on an IOMMU domain
(hw_pagetable id). To that end add an io_pagetable kernel API to toggle
dirty tracking:
* iopt_set_dirty_tracking(iopt, [domain], state)
The intended caller of this is via the hw_pagetable object that is created.
Internally it will ensure the leftover dirty state is cleared /right
before/ dirty tracking starts. This is also useful for iommu drivers which
may decide that dirty tracking is always-enabled at boot without wanting to
toggle dynamically via corresponding iommu domain op.
Link: https://lore.kernel.org/r/20231024135109.73787-7-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Throughout IOMMU domain lifetime that wants to use dirty tracking, some
guarantees are needed such that any device attached to the iommu_domain
supports dirty tracking.
The idea is to handle a case where IOMMU in the system are assymetric
feature-wise and thus the capability may not be supported for all devices.
The enforcement is done by adding a flag into HWPT_ALLOC namely:
IOMMU_HWPT_ALLOC_DIRTY_TRACKING
.. Passed in HWPT_ALLOC ioctl() flags. The enforcement is done by creating
a iommu_domain via domain_alloc_user() and validating the requested flags
with what the device IOMMU supports (and failing accordingly) advertised).
Advertising the new IOMMU domain feature flag requires that the individual
iommu driver capability is supported when a future device attachment
happens.
Link: https://lore.kernel.org/r/20231024135109.73787-6-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Add to iommu domain operations a set of callbacks to perform dirty
tracking, particulary to start and stop tracking and to read and clear the
dirty data.
Drivers are generally expected to dynamically change its translation
structures to toggle the tracking and flush some form of control state
structure that stands in the IOVA translation path. Though it's not
mandatory, as drivers can also enable dirty tracking at boot, and just
clear the dirty bits before setting dirty tracking. For each of the newly
added IOMMU core APIs:
iommu_cap::IOMMU_CAP_DIRTY_TRACKING: new device iommu_capable value when
probing for capabilities of the device.
.set_dirty_tracking(): an iommu driver is expected to change its
translation structures and enable dirty tracking for the devices in the
iommu_domain. For drivers making dirty tracking always-enabled, it should
just return 0.
.read_and_clear_dirty(): an iommu driver is expected to walk the pagetables
for the iova range passed in and use iommu_dirty_bitmap_record() to record
dirty info per IOVA. When detecting that a given IOVA is dirty it should
also clear its dirty state from the PTE, *unless* the flag
IOMMU_DIRTY_NO_CLEAR is passed in -- flushing is steered from the caller of
the domain_op via iotlb_gather. The iommu core APIs use the same data
structure in use for dirty tracking for VFIO device dirty (struct
iova_bitmap) abstracted by iommu_dirty_bitmap_record() helper function.
domain::dirty_ops: IOMMU domains will store the dirty ops depending on
whether the iommu device supports dirty tracking or not. iommu drivers can
then use this field to figure if the dirty tracking is supported+enforced
on attach. The enforcement is enable via domain_alloc_user() which is done
via IOMMUFD hwpt flag introduced later.
Link: https://lore.kernel.org/r/20231024135109.73787-5-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Both VFIO and IOMMUFD will need iova bitmap for storing dirties and walking
the user bitmaps, so move to the common dependency into IOMMUFD. In doing
so, create the symbol IOMMUFD_DRIVER which designates the builtin code that
will be used by drivers when selected. Today this means MLX5_VFIO_PCI and
PDS_VFIO_PCI. IOMMU drivers will do the same (in future patches) when
supporting dirty tracking and select IOMMUFD_DRIVER accordingly.
Given that the symbol maybe be disabled, add header definitions in
iova_bitmap.h for when IOMMUFD_DRIVER=n
Link: https://lore.kernel.org/r/20231024135109.73787-3-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
ACPI, irqchip and the architecture code all inspect the MADT
enabled bit for a GICC entry in the MADT.
The addition of an 'online capable' bit means all these sites need
updating.
Move the current checks behind a helper to make future updates easier.
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk>
Acked-by: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/E1quv5D-00AeNJ-U8@rmk-PC.armlinux.org.uk
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
EEXIST
Return struct nft_elem_priv instead of struct nft_set_ext for
consistency with ("netfilter: nf_tables: expose opaque set element as
struct nft_elem_priv") and to prepare the introduction of element
timeout updates from control path.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Instead of copying struct nft_set_elem into struct nft_trans_elem, store
the pointer to the opaque set element object in the transaction. Adapt
set backend API (and set backend implementations) to take the pointer to
opaque set element representation whenever required.
This patch deconstifies .remove() and .activate() set backend API since
these modify the set element opaque object. And it also constify
nft_set_elem_ext() this provides access to the nft_set_ext struct
without updating the object.
According to pahole on x86_64, this patch shrinks struct nft_trans_elem
size from 216 to 24 bytes.
This patch also reduces stack memory consumption by removing the
template struct nft_set_elem object, using the opaque set element object
instead such as from the set iterator API, catchall elements and the get
element command.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Add placeholder structure and place it at the beginning of each struct
nft_*_elem for each existing set backend, instead of exposing elements
as void type to the frontend which defeats compiler type checks. Use
this pointer to this new type to replace void *.
This patch updates the following set backend API to use this new struct
nft_elem_priv placeholder structure:
- update
- deactivate
- flush
- get
as well as the following helper functions:
- nft_set_elem_ext()
- nft_set_elem_init()
- nft_set_elem_destroy()
- nf_tables_set_elem_destroy()
This patch adds nft_elem_priv_cast() to cast struct nft_elem_priv to
native element representation from the corresponding set backend.
BUILD_BUG_ON() makes sure this .priv placeholder is always at the top
of the opaque set element representation.
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
.flush is always successful since this results from iterating over the
set elements to toggle mark the element as inactive in the next
generation.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The spinlock is back from the day when connabels did not have
a fixed size and reallocation had to be supported.
Remove it. This change also allows to call the helpers from
softirq or timers without deadlocks.
Also add WARN()s to catch refcounting imbalances.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Merge the immutable branch genpd_dt into next, to allow the DT bindings to
be tested together with new pmdomain changes that are targeted for v6.7.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
The MSM8917, MSM8937 and QM215 SoCs have VDDCX and VDDMX power domains
controlled in voltage level mode. Define the MSM8937 and QM215 power
domains as aliases because these SoCs are similar to MSM8917 and may
share some parts of the device tree.
Also add the compatibles for these SoCs to the documentation, with
qcom,msm8937-rpmpd using qcom,msm8917-rpmpd as a fallback compatible
because there are no known differences. QM215 is not compatible with
these because it uses different regulators.
Signed-off-by: Otto Pflüger <otto.pflueger@abscue.de>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20231014133823.14088-2-otto.pflueger@abscue.de
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
Subtract one to __GTPA_MAX, otherwise GTPA_MAX is off by 2.
Fixes: 459aa660eb1d ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In the previous implementation, when multiple xsk sockets were
associated with a single xsk_buff_pool, a situation could arise
where the xsk_tx_list maintained data at the front for one xsk
socket while starving the xsk sockets at the back of the list.
This could result in issues such as the inability to transmit packets,
increased latency, and jitter. To address this problem, we introduce
a new variable called tx_budget_spent, which limits each xsk to transmit
a maximum of MAX_PER_SOCKET_BUDGET tx descriptors. This allocation ensures
equitable opportunities for subsequent xsk sockets to send tx descriptors.
The value of MAX_PER_SOCKET_BUDGET is set to 32.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20231023125732.82261-1-huangjie.albert@bytedance.com
|
|
SIS_UTIL seems to work well, lets remove the old thing.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231020134337.GD33965@noisy.programming.kicks-ass.net
|
|
Add cpus_share_resources() API. This is the preparation for the
optimization of select_idle_cpu() on platforms with cluster scheduler
level.
On a machine with clusters cpus_share_resources() will test whether
two cpus are within the same cluster. On a non-cluster machine it
will behaves the same as cpus_share_cache(). So we use "resources"
here for cache resources.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com
|
|
It turns out that .branches > 0 in is_state_visited() is not a
sufficient condition to identify if two verifier states form a loop
when iterators convergence is computed. This commit adds logic to
distinguish situations like below:
(I) initial (II) initial
| |
V V
.---------> hdr ..
| | |
| V V
| .------... .------..
| | | | |
| V V V V
| ... ... .-> hdr ..
| | | | | |
| V V | V V
| succ <- cur | succ <- cur
| | | |
| V | V
| ... | ...
| | | |
'----' '----'
For both (I) and (II) successor 'succ' of the current state 'cur' was
previously explored and has branches count at 0. However, loop entry
'hdr' corresponding to 'succ' might be a part of current DFS path.
If that is the case 'succ' and 'cur' are members of the same loop
and have to be compared exactly.
Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Convergence for open coded iterators is computed in is_state_visited()
by examining states with branches count > 1 and using states_equal().
states_equal() computes sub-state relation using read and precision marks.
Read and precision marks are propagated from children states,
thus are not guaranteed to be complete inside a loop when branches
count > 1. This could be demonstrated using the following unsafe program:
1. r7 = -16
2. r6 = bpf_get_prandom_u32()
3. while (bpf_iter_num_next(&fp[-8])) {
4. if (r6 != 42) {
5. r7 = -32
6. r6 = bpf_get_prandom_u32()
7. continue
8. }
9. r0 = r10
10. r0 += r7
11. r8 = *(u64 *)(r0 + 0)
12. r6 = bpf_get_prandom_u32()
13. }
Here verifier would first visit path 1-3, create a checkpoint at 3
with r7=-16, continue to 4-7,3 with r7=-32.
Because instructions at 9-12 had not been visitied yet existing
checkpoint at 3 does not have read or precision mark for r7.
Thus states_equal() would return true and verifier would discard
current state, thus unsafe memory access at 11 would not be caught.
This commit fixes this loophole by introducing exact state comparisons
for iterator convergence logic:
- registers are compared using regs_exact() regardless of read or
precision marks;
- stack slots have to have identical type.
Unfortunately, this is too strict even for simple programs like below:
i = 0;
while(iter_next(&it))
i++;
At each iteration step i++ would produce a new distinct state and
eventually instruction processing limit would be reached.
To avoid such behavior speculatively forget (widen) range for
imprecise scalar registers, if those registers were not precise at the
end of the previous iteration and do not match exactly.
This a conservative heuristic that allows to verify wide range of
programs, however it precludes verification of programs that conjure
an imprecise value on the first loop iteration and use it as precise
on the second.
Test case iter_task_vma_for_each() presents one of such cases:
unsigned int seen = 0;
...
bpf_for_each(task_vma, vma, task, 0) {
if (seen >= 1000)
break;
...
seen++;
}
Here clang generates the following code:
<LBB0_4>:
24: r8 = r6 ; stash current value of
... body ... 'seen'
29: r1 = r10
30: r1 += -0x8
31: call bpf_iter_task_vma_next
32: r6 += 0x1 ; seen++;
33: if r0 == 0x0 goto +0x2 <LBB0_6> ; exit on next() == NULL
34: r7 += 0x10
35: if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000
<LBB0_6>:
... exit ...
Note that counter in r6 is copied to r8 and then incremented,
conditional jump is done using r8. Because of this precision mark for
r6 lags one state behind of precision mark on r8 and widening logic
kicks in.
Adding barrier_var(seen) after conditional is sufficient to force
clang use the same register for both counting and conditional jump.
This issue was discussed in the thread [1] which was started by
Andrew Werner <awerner32@gmail.com> demonstrating a similar bug
in callback functions handling. The callbacks would be addressed
in a followup patch.
[1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/
Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
As more drivers begin to use the fragment API, update the
document about how to decide which API to use for the
driver author.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Liang Chen <liangchen.linux@gmail.com>
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
CC: Dima Tisnek <dimaqq@gmail.com>
Link: https://lore.kernel.org/r/20231020095952.11055-5-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently page pool supports the below use cases:
use case 1: allocate page without page splitting using
page_pool_alloc_pages() API if the driver knows
that the memory it need is always bigger than
half of the page allocated from page pool.
use case 2: allocate page frag with page splitting using
page_pool_alloc_frag() API if the driver knows
that the memory it need is always smaller than
or equal to the half of the page allocated from
page pool.
There is emerging use case [1] & [2] that is a mix of the
above two case: the driver doesn't know the size of memory it
need beforehand, so the driver may use something like below to
allocate memory with least memory utilization and performance
penalty:
if (size << 1 > max_size)
page = page_pool_alloc_pages();
else
page = page_pool_alloc_frag();
To avoid the driver doing something like above, add the
page_pool_alloc() API to support the above use case, and update
the true size of memory that is acctually allocated by updating
'*size' back to the driver in order to avoid exacerbating
truesize underestimate problem.
Rename page_pool_free() which is used in the destroy process to
__page_pool_destroy() to avoid confusion with the newly added
API.
1. https://lore.kernel.org/all/d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626.git.lorenzo@kernel.org/
2. https://lore.kernel.org/all/20230526054621.18371-3-liangchen.linux@gmail.com/
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Liang Chen <liangchen.linux@gmail.com>
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20231020095952.11055-4-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
PP_FLAG_PAGE_FRAG is not really needed after pp_frag_count
handling is unified and page_pool_alloc_frag() is supported
in 32-bit arch with 64-bit DMA, so remove it.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Liang Chen <liangchen.linux@gmail.com>
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20231020095952.11055-3-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|