Age | Commit message (Collapse) | Author |
|
If the phy is configured to use pause-based rate matching, ensure that
the link is full duplex with pause frame reception enabled. As
suggested, if pause-based rate matching is enabled by the phy, then
pause reception is unconditionally enabled.
The interface duplex is determined based on the rate matching type.
When rate matching is enabled, so is the speed. We assume the maximum
interface speed is used. This is only relevant for MLO_AN_PHY. For
MLO_AN_INBAND, the MAC/PCS's view of the interface speed will be used.
Although there are no RATE_ADAPT_CRS phys in-tree, it has been added for
comparison (and the implementation is quite simple).
Co-developed-by: Russell King <linux@armlinux.org.uk>
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This adds support for rate matching (also known as rate adaptation) to
the phy subsystem. The general idea is that the phy interface runs at
one speed, and the MAC throttles the rate at which it sends packets to
the link speed. There's a good overview of several techniques for
achieving this at [1]. This patch adds support for three: pause-frame
based (such as in Aquantia phys), CRS-based (such as in 10PASS-TS and
2BASE-TL), and open-loop-based (such as in 10GBASE-W).
This patch makes a few assumptions and a few non assumptions about the
types of rate matching available. First, it assumes that different phys
may use different forms of rate matching. Second, it assumes that phys
can use rate matching for any of their supported link speeds (e.g. if a
phy supports 10BASE-T and XGMII, then it can adapt XGMII to 10BASE-T).
Third, it does not assume that all interface modes will use the same
form of rate matching. Fourth, it does not assume that all phy devices
will support rate matching (even if some do). Relaxing or strengthening
these (non-)assumptions could result in a different API. For example, if
all interface modes were assumed to use the same form of rate matching,
then a bitmask of interface modes supportting rate matching would
suffice.
For some better visibility into the process, the current rate matching
mode is exposed as part of the ethtool ksettings. For the moment, only
read access is supported. I'm not sure what userspace might want to
configure yet (disable it altogether, disable just one mode, specify the
mode to use, etc.). For the moment, since only pause-based rate
adaptation support is added in the next few commits, rate matching can
be disabled altogether by adjusting the advertisement.
802.3 calls this feature "rate adaptation" in clause 49 (10GBASE-R) and
"rate matching" in clause 61 (10PASS-TL and 2BASE-TS). Aquantia also calls
this feature "rate adaptation". I chose "rate matching" because it is
shorter, and because Russell doesn't think "adaptation" is correct in this
context.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If we call phylink_caps_to_linkmodes directly from
phylink_get_linkmodes, it is difficult to re-use this functionality in
MAC drivers. This is because MAC drivers must then work with an ethtool
linkmode bitmap, instead of with mac capabilities. Instead, let the
caller of phylink_get_linkmodes do the conversion. To reflect this
change, rename the function to phylink_get_capabilities.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This function is convenient for MAC drivers. They can use it to add or
remove particular link modes based on capabilities (such as if half
duplex is not supported for a particular interface mode).
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This documents the possible MLO_PAUSE_* settings which can result from
different combinations of MAC_(A)SYM_PAUSE. Special note is paid to
settings which can result from user configuration (MLO_PAUSE_AN). The
autonegotiation results are more-or-less a direct consequence of IEEE
802.3 Table 28B-2.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To abstract the rmid counters into a helper that returns the number
of bytes counted, architecture specific per-rmid state is needed.
It needs to be possible to reset this hidden state, as the values
may outlive the life of an rmid, or the mount time of the filesystem.
mon_event_read() is called with first = true when an rmid is first
allocated in mkdir_mondata_subdir(). Add resctrl_arch_reset_rmid()
and call it from __mon_event_count()'s rr->first check.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-15-james.morse@arm.com
|
|
kmalloc's API family is critical for mm, with one nature that it will
round up the request size to a fixed one (mostly power of 2). Say
when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
could be allocated, so in worst case, there is around 50% memory
space waste.
The wastage is not a big issue for requests that get allocated/freed
quickly, but may cause problems with objects that have longer life
time.
We've met a kernel boot OOM panic (v5.10), and from the dumped slab
info:
[ 26.062145] kmalloc-2k 814056KB 814056KB
From debug we found there are huge number of 'struct iova_magazine',
whose size is 1032 bytes (1024 + 8), so each allocation will waste
1016 bytes. Though the issue was solved by giving the right (bigger)
size of RAM, it is still nice to optimize the size (either use a
kmalloc friendly size or create a dedicated slab for it).
And from lkml archive, there was another crash kernel OOM case [1]
back in 2019, which seems to be related with the similar slab waste
situation, as the log is similar:
[ 4.332648] iommu: Adding device 0000:20:02.0 to group 16
[ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
...
[ 4.857565] kmalloc-2048 59164KB 59164KB
The crash kernel only has 256M memory, and 59M is pretty big here.
(Note: the related code has been changed and optimised in recent
kernel [2], these logs are just picked to demo the problem, also
a patch changing its size to 1024 bytes has been merged)
So add an way to track each kmalloc's memory waste info, and
leverage the existing SLUB debug framework (specifically
SLUB_STORE_USER) to show its call stack of original allocation,
so that user can evaluate the waste situation, identify some hot
spots and optimize accordingly, for a better utilization of memory.
The waste info is integrated into existing interface:
'/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of
'kmalloc-4k' after boot is:
126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1
__kmem_cache_alloc_node+0x11f/0x4e0
__kmalloc_node+0x4e/0x140
ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe]
ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe]
ixgbe_probe+0x165f/0x1d20 [ixgbe]
local_pci_probe+0x78/0xc0
work_for_cpu_fn+0x26/0x40
...
which means in 'kmalloc-4k' slab, there are 126 requests of
2240 bytes which got a 4KB space (wasting 1856 bytes each
and 233856 bytes in total), from ixgbe_alloc_q_vector().
And when system starts some real workload like multiple docker
instances, there could are more severe waste.
[1]. https://lkml.org/lkml/2019/8/12/266
[2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/
[Thanks Hyeonggon for pointing out several bugs about sorting/format]
[Thanks Vlastimil for suggesting way to reduce memory usage of
orig_size and keep it only for kmalloc objects]
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
MACsec EPN splits the packet number (PN) into two 32-bits fields,
epn_lsb (32 least significant bits (LSBs) of PN) and epn_msb (32
most significant bits (MSBs) of PN).
Epn_msb bits are managed by SW and for that HW is required to send
an object change event of type EPN event notifying the SW to update
the epn_msb in addition, once epn_msb is updated SW update HW with
the new epn_msb value for HW to perform replay protection.
To prevent HW from stopping while handling the event, SW manages
another bit for HW called epn_overlap, HW uses the latter to get
an indication regarding how to read the epn_msb value correctly
while still receiving packets.
Add epn event handling that updates the epn_overlap and epn_msb for
every 2^31 packets according to the following logic:
if epn_lsb crosses 2^31 (half sequence number wraparound) upon HW
relevant event, SW updates the esn_overlap value to OLD (value = 1).
When the epn_lsb crosses 2^32 (full sequence number wraparound)
upon HW relevant event, SW updates the esn_overlap to NEW
(value = 0) and increment the esn_msb.
When using MACsec EPN a salt and short secure channel id (ssci)
needs to be provided by the user, when offloading EPN need to pass
this salt and ssci to the HW to be used in the initial vector (IV)
calculations.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
protection
Add ifc bits related to advanced steering operations (ASO) and general
object modify for macsec to use as part of offloading EPN and replay
protection features.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix ifc fields name to be consistent with the device spec document.
Fixes: 8385c51ff5bc ("net/mlx5: Introduce MACsec Connect-X offload hardware bits and structures")
Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
drivers/net/ethernet/freescale/fec.h
7b15515fc1ca ("Revert "fec: Restart PPS after link state change"")
40c79ce13b03 ("net: fec: add stop mode support for imx8 platform")
https://lore.kernel.org/all/20220921105337.62b41047@canb.auug.org.au/
drivers/pinctrl/pinctrl-ocelot.c
c297561bc98a ("pinctrl: ocelot: Fix interrupt controller")
181f604b33cd ("pinctrl: ocelot: add ability to be used in a non-mmio configuration")
https://lore.kernel.org/all/20220921110032.7cd28114@canb.auug.org.au/
tools/testing/selftests/drivers/net/bonding/Makefile
bbb774d921e2 ("net: Add tests for bonding and team address list management")
152e8ec77640 ("selftests/bonding: add a test for bonding lladdr target")
https://lore.kernel.org/all/20220921110437.5b7dbd82@canb.auug.org.au/
drivers/net/can/usb/gs_usb.c
5440428b3da6 ("can: gs_usb: gs_can_open(): fix race dev->can.state condition")
45dfa45f52e6 ("can: gs_usb: add RX and TX hardware timestamp support")
https://lore.kernel.org/all/84f45a7d-92b6-4dc5-d7a1-072152fab6ff@tessares.net/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"Another set of fixes for fixes for the soc tree:
- A fix for the interrupt number on at91/lan966 ethernet PHYs
- A second round of fixes for NXP i.MX series, including a couple of
build issues, and board specific DT corrections on TQMa8MPQL,
imx8mp-venice-gw74xx and imx8mm-verdin for reliability and
partially broken functionality
- Several fixes for Rockchip SoCs, addressing a USB issue on
BPI-R2-Pro, wakeup on Gru-Bob and reliability of high-speed SD
cards, among other minor issues
- A fix for a long-running naming mistake that prevented the moxart
mmc driver from working at all
- Multiple Arm SCMI firmware fixes for hardening some corner cases"
* tag 'soc-fixes-6.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (30 commits)
arm64: dts: imx8mp-venice-gw74xx: fix port/phy validation
ARM: dts: lan966x: Fix the interrupt number for internal PHYs
arm64: dts: imx8mp-venice-gw74xx: fix ksz9477 cpu port
arm64: dts: imx8mp-venice-gw74xx: fix CAN STBY polarity
dt-bindings: memory-controllers: fsl,imx8m-ddrc: drop Leonard Crestez
arm64: dts: tqma8mqml: Include phy-imx8-pcie.h header
arm64: defconfig: enable ARCH_NXP
arm64: dts: imx8mp-tqma8mpql-mba8mpxl: add missing pinctrl for RTC alarm
ARM: dts: fix Moxa SDIO 'compatible', remove 'sdhci' misnomer
arm64: dts: imx8mm-verdin: extend pmic voltages
arm64: dts: rockchip: Remove 'enable-active-low' from rk3566-quartz64-a
arm64: dts: rockchip: Remove 'enable-active-low' from rk3399-puma
arm64: dts: rockchip: fix property for usb2 phy supply on rk3568-evb1-v10
arm64: dts: rockchip: fix property for usb2 phy supply on rock-3a
arm64: dts: imx8ulp: add #reset-cells for pcc
arm64: dts: tqma8mpxl-ba8mpxl: Fix button GPIOs
arm64: dts: imx8mn: remove GPU power domain reset
arm64: dts: rockchip: Set RK3399-Gru PCLK_EDP to 24 MHz
arm64: dts: imx8mm: Reverse CPLD_Dn GPIO label mapping on MX8Menlo
arm64: dts: rockchip: fix upper usb port on BPI-R2-Pro
...
|
|
update_mba_bw() calculates a new control value for the MBA resource
based on the user provided mbps_val and the current measured
bandwidth. Some control values need remapping by delay_bw_map().
It does this by calling wrmsrl() directly. This needs splitting
up to be done by an architecture specific helper, so that the
remainder can eventually be moved to /fs/.
Add resctrl_arch_update_one() to apply one configuration value
to the provided resource and domain. This avoids the staging
and cross-calling that is only needed with changes made by
user-space. delay_bw_map() moves to be part of the arch code,
to maintain the 'percentage control' view of MBA resources
in resctrl.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-12-james.morse@arm.com
|
|
To support resctrl's MBA software controller, the architecture must provide
a second configuration array to hold the mbps_val[] from user-space.
This complicates the interface between the architecture specific code and
the filesystem portions of resctrl that will move to /fs/, to allow
multiple architectures to support resctrl.
Make the filesystem parts of resctrl create an array for the mba_sc
values. The software controller can be changed to use this, allowing
the architecture code to only consider the values configured in hardware.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-9-james.morse@arm.com
|
|
Invoking TIOCVHANGUP on 8250_mid port on Ice Lake-D and then reopening
the port triggers these faults during serial8250_do_startup():
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Write NO_PASID] Request device [00:1a.0] fault addr 0x0 [fault reason 0x05] PTE Write access is not set
If the IRQ hasn't been set up yet, the UART will have zeroes in its MSI
address/data registers. Disabling the IRQ at the interrupt controller
won't stop the UART from performing a DMA write to the address programmed
in its MSI address register (zero) when it wants to signal an interrupt.
The UARTs (in Ice Lake-D) implement PCI 2.1 style MSI without masking
capability, so there is no way to mask the interrupt at the source PCI
function level, except disabling the MSI capability entirely, but that
would cause it to fall back to INTx# assertion, and the PCI specification
prohibits disabling the MSI capability as a way to mask a function's
interrupt service request.
The MSI address register is zeroed by the hangup as the irq is freed.
The interrupt is signalled during serial8250_do_startup() performing a
THRE test that temporarily toggles THRI in IER. The THRE test currently
occurs before UART's irq (and MSI address) is properly set up.
Refactor serial8250_do_startup() such that irq is set up before the
THRE test. The current irq setup code is intermixed with the timer
setup code. As THRE test must be performed prior to the timer setup,
extract it into own function and call it only after the THRE test.
The ->setup_timer() needs to be part of the struct uart_8250_ops in
order to not create circular dependency between 8250 and 8250_base
modules.
Fixes: 40b36daad0ac ("[PATCH] 8250 UART backup timer")
Reported-by: Lennert Buytenhek <buytenh@arista.com>
Tested-by: Lennert Buytenhek <buytenh@arista.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20220922070005.2965-1-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Use BIT_ULL rather than encoding bits explicitly where applicable with
UART port flags. This makes a (__force upf_t) cast redundant, but keep
it for visual consistency with the flags defined in terms of userspace
macros.
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2209210007030.41633@angie.orcam.me.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
A SERIAL_8250_16550A_VARIANTS configuration option has been recently
defined that lets one request the 8250 driver not to probe for 16550A
device features so as to reduce the driver's device startup time in
virtual machines.
Some actual hardware devices require these features to have been fully
determined however for their driver to work correctly, so define a flag
to let drivers request full 16550A feature probing on a device-by-device
basis if required regardless of the SERIAL_8250_16550A_VARIANTS option
setting chosen.
Fixes: dc56ecb81a0a ("serial: 8250: Support disabling mdelay-filled probes of 16550A variants")
Cc: stable@vger.kernel.org # v5.6+
Reported-by: Anders Blomdell <anders.blomdell@control.lth.se>
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2209202357520.41633@angie.orcam.me.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
According to Greg, in the context of magic numbers as defined in
magic-number.rst, "the tty layer should not need this and I'll gladly
take patches"
Acked-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Ref: https://lore.kernel.org/linux-doc/YyMlovoskUcHLEb7@kroah.com/
Link: https://lore.kernel.org/r/723478a270a3858f27843cbec621df4d5d44efcc.1663288066.git.nabijaczleweli@nabijaczleweli.xyz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
According to Greg, in the context of magic numbers as defined in
magic-number.rst, "the tty layer should not need this and I'll gladly
take patches"
Acked-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Ref: https://lore.kernel.org/linux-doc/YyMlovoskUcHLEb7@kroah.com/
Link: https://lore.kernel.org/r/476d024cd6b04160a5de381ea2b9856b60088cbd.1663288066.git.nabijaczleweli@nabijaczleweli.xyz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Because domains are exposed to user-space via resctrl, the filesystem
must update its state when CPU hotplug callbacks are triggered.
Some of this work is common to any architecture that would support
resctrl, but the work is tied up with the architecture code to
free the memory.
Move the monitor subdir removal and the cancelling of the mbm/limbo
works into a new resctrl_offline_domain() call. These bits are not
specific to the architecture. Grouping them in one function allows
that code to be moved to /fs/ and re-used by another architecture.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-6-james.morse@arm.com
|
|
Because domains are exposed to user-space via resctrl, the filesystem
must update its state when CPU hotplug callbacks are triggered.
Some of this work is common to any architecture that would support
resctrl, but the work is tied up with the architecture code to
allocate the memory.
Move domain_setup_mon_state(), the monitor subdir creation call and the
mbm/limbo workers into a new resctrl_online_domain() call. These bits
are not specific to the architecture. Grouping them in one function
allows that code to be moved to /fs/ and re-used by another architecture.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-4-james.morse@arm.com
|
|
Other than pcie bus, introduce support for axi bus to mtk wed driver.
Axi bus is used to connect mt7986-wmac soc chip available on mt7986
device.
Tested-by: Daniel Golle <daniel@makrotopia.org>
Co-developed-by: Bo Jiao <Bo.Jiao@mediatek.com>
Signed-off-by: Bo Jiao <Bo.Jiao@mediatek.com>
Co-developed-by: Sujuan Chen <sujuan.chen@mediatek.com>
Signed-off-by: Sujuan Chen <sujuan.chen@mediatek.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Introduce Wireless Etherne Dispatcher support on transmission side
for mt7986 chipset
Tested-by: Daniel Golle <daniel@makrotopia.org>
Co-developed-by: Bo Jiao <Bo.Jiao@mediatek.com>
Signed-off-by: Bo Jiao <Bo.Jiao@mediatek.com>
Co-developed-by: Sujuan Chen <sujuan.chen@mediatek.com>
Signed-off-by: Sujuan Chen <sujuan.chen@mediatek.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
mon_enabled and mon_capable are always set as a pair by
rdt_get_mon_l3_config().
There is no point having two values.
Merge them together.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-3-james.morse@arm.com
|
|
rdt_resources_all[] used to have extra entries for L2CODE/L2DATA.
These were hidden from resctrl by the alloc_enabled value.
Now that the L2/L2CODE/L2DATA resources have been merged together,
alloc_enabled doesn't mean anything, it always has the same value as
alloc_capable which indicates allocation is supported by this resource.
Remove alloc_enabled and its helpers.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-2-james.morse@arm.com
|
|
The member documentation comment lines for struct counter_comp extend
past the 80-characters column boundary due to extra identation at the
start of each section. This patch realigns the comment block within the
80-characters boundary by removing these superfluous indents.
Reviewed-by: Yanteng Si <siyanteng@loongson.cn>
Link: https://lore.kernel.org/r/20220902120839.4260-1-william.gray@linaro.org/
Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
Link: https://lore.kernel.org/r/8294b04153c33602e9c3dd21ac90c1e99bd0fdaf.1663844776.git.william.gray@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
As PAGE_SIZE is unsigned long, -1 > PAGE_SIZE when NR_CPUS <= 3.
This leads to very large file sizes:
topology$ ls -l
total 0
-r--r--r-- 1 root root 18446744073709551615 Sep 5 11:59 core_cpus
-r--r--r-- 1 root root 4096 Sep 5 11:59 core_cpus_list
-r--r--r-- 1 root root 4096 Sep 5 10:58 core_id
-r--r--r-- 1 root root 18446744073709551615 Sep 5 10:10 core_siblings
-r--r--r-- 1 root root 4096 Sep 5 11:59 core_siblings_list
-r--r--r-- 1 root root 18446744073709551615 Sep 5 11:59 die_cpus
-r--r--r-- 1 root root 4096 Sep 5 11:59 die_cpus_list
-r--r--r-- 1 root root 4096 Sep 5 11:59 die_id
-r--r--r-- 1 root root 18446744073709551615 Sep 5 11:59 package_cpus
-r--r--r-- 1 root root 4096 Sep 5 11:59 package_cpus_list
-r--r--r-- 1 root root 4096 Sep 5 10:58 physical_package_id
-r--r--r-- 1 root root 18446744073709551615 Sep 5 10:10 thread_siblings
-r--r--r-- 1 root root 4096 Sep 5 11:59 thread_siblings_list
Adjust the inequality to catch the case when NR_CPUS is configured
to a small value.
Fixes: 7ee951acd31a ("drivers/base: fix userspace break from using bin_attributes for cpumap and cpulist")
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: stable@vger.kernel.org
Cc: feng xiangjun <fengxj325@gmail.com>
Reported-by: feng xiangjun <fengxj325@gmail.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Link: https://lore.kernel.org/r/20220906203542.1796629-1-pauld@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
request_queues are a block layer implementation detail that should not
leak into file systems. Change the fscrypt inline crypto code to
retrieve block devices instead of request_queues from the file system.
As part of that, clean up the interaction with multi-device file systems
by returning both the number of devices and the actual device array in a
single method call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ebiggers: bug fixes and minor tweaks]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220901193208.138056-4-ebiggers@kernel.org
|
|
The approach of fs/crypto/ internally managing the fscrypt_master_key
structs as the payloads of "struct key" objects contained in a
"struct key" keyring has outlived its usefulness. The original idea was
to simplify the code by reusing code from the keyrings subsystem.
However, several issues have arisen that can't easily be resolved:
- When a master key struct is destroyed, blk_crypto_evict_key() must be
called on any per-mode keys embedded in it. (This started being the
case when inline encryption support was added.) Yet, the keyrings
subsystem can arbitrarily delay the destruction of keys, even past the
time the filesystem was unmounted. Therefore, currently there is no
easy way to call blk_crypto_evict_key() when a master key is
destroyed. Currently, this is worked around by holding an extra
reference to the filesystem's request_queue(s). But it was overlooked
that the request_queue reference is *not* guaranteed to pin the
corresponding blk_crypto_profile too; for device-mapper devices that
support inline crypto, it doesn't. This can cause a use-after-free.
- When the last inode that was using an incompletely-removed master key
is evicted, the master key removal is completed by removing the key
struct from the keyring. Currently this is done via key_invalidate().
Yet, key_invalidate() takes the key semaphore. This can deadlock when
called from the shrinker, since in fscrypt_ioctl_add_key(), memory is
allocated with GFP_KERNEL under the same semaphore.
- More generally, the fact that the keyrings subsystem can arbitrarily
delay the destruction of keys (via garbage collection delay, or via
random processes getting temporary key references) is undesirable, as
it means we can't strictly guarantee that all secrets are ever wiped.
- Doing the master key lookups via the keyrings subsystem results in the
key_permission LSM hook being called. fscrypt doesn't want this, as
all access control for encrypted files is designed to happen via the
files themselves, like any other files. The workaround which SELinux
users are using is to change their SELinux policy to grant key search
access to all domains. This works, but it is an odd extra step that
shouldn't really have to be done.
The fix for all these issues is to change the implementation to what I
should have done originally: don't use the keyrings subsystem to keep
track of the filesystem's fscrypt_master_key structs. Instead, just
store them in a regular kernel data structure, and rework the reference
counting, locking, and lifetime accordingly. Retain support for
RCU-mode key lookups by using a hash table. Replace fscrypt_sb_free()
with fscrypt_sb_delete(), which releases the keys synchronously and runs
a bit earlier during unmount, so that block devices are still available.
A side effect of this patch is that neither the master keys themselves
nor the filesystem keyrings will be listed in /proc/keys anymore.
("Master key users" and the master key users keyrings will still be
listed.) However, this was mostly an implementation detail, and it was
intended just for debugging purposes. I don't know of anyone using it.
This patch does *not* change how "master key users" (->mk_users) works;
that still uses the keyrings subsystem. That is still needed for key
quotas, and changing that isn't necessary to solve the issues listed
above. If we decide to change that too, it would be a separate patch.
I've marked this as fixing the original commit that added the fscrypt
keyring, but as noted above the most important issue that this patch
fixes wasn't introduced until the addition of inline encryption support.
Fixes: 22d94f493bfb ("fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220901193208.138056-2-ebiggers@kernel.org
|
|
We got report from sysbot [1] about warnings that were caused by
bpf program attached to contention_begin raw tracepoint triggering
the same tracepoint by using bpf_trace_printk helper that takes
trace_printk_lock lock.
Call Trace:
<TASK>
? trace_event_raw_event_bpf_trace_printk+0x5f/0x90
bpf_trace_printk+0x2b/0xe0
bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
bpf_trace_run2+0x26/0x90
native_queued_spin_lock_slowpath+0x1c6/0x2b0
_raw_spin_lock_irqsave+0x44/0x50
bpf_trace_printk+0x3f/0xe0
bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
bpf_trace_run2+0x26/0x90
native_queued_spin_lock_slowpath+0x1c6/0x2b0
_raw_spin_lock_irqsave+0x44/0x50
bpf_trace_printk+0x3f/0xe0
bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
bpf_trace_run2+0x26/0x90
native_queued_spin_lock_slowpath+0x1c6/0x2b0
_raw_spin_lock_irqsave+0x44/0x50
bpf_trace_printk+0x3f/0xe0
bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
bpf_trace_run2+0x26/0x90
native_queued_spin_lock_slowpath+0x1c6/0x2b0
_raw_spin_lock_irqsave+0x44/0x50
__unfreeze_partials+0x5b/0x160
...
The can be reproduced by attaching bpf program as raw tracepoint on
contention_begin tracepoint. The bpf prog calls bpf_trace_printk
helper. Then by running perf bench the spin lock code is forced to
take slow path and call contention_begin tracepoint.
Fixing this by skipping execution of the bpf program if it's
already running, Using bpf prog 'active' field, which is being
currently used by trampoline programs for the same reason.
Moving bpf_prog_inc_misses_counter to syscall.c because
trampoline.c is compiled in just for CONFIG_BPF_JIT option.
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Reported-by: syzbot+2251879aa068ad9c960d@syzkaller.appspotmail.com
[1] https://lore.kernel.org/bpf/YxhFe3EwqchC%2FfYf@krava/T/#t
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20220916071914.7156-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add the bpf_lookup_user_key(), bpf_lookup_system_key() and bpf_key_put()
kfuncs, to respectively search a key with a given key handle serial number
and flags, obtain a key from a pre-determined ID defined in
include/linux/verification.h, and cleanup.
Introduce system_keyring_id_check() to validate the keyring ID parameter of
bpf_lookup_system_key().
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20220920075951.929132-8-roberto.sassu@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In preparation for the patch that introduces the bpf_lookup_user_key() eBPF
kfunc, move KEY_LOOKUP_ definitions to include/linux/key.h, to be able to
validate the kfunc parameters. Add them to enum key_lookup_flag, so that
all the current ones and the ones defined in the future are automatically
exported through BTF and available to eBPF programs.
Also, add KEY_LOOKUP_ALL to the enum, with the logical OR of currently
defined flags as value, to facilitate checking whether a variable contains
only those flags.
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20220920075951.929132-7-roberto.sassu@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Export bpf_dynptr_get_size(), so that kernel code dealing with eBPF dynamic
pointers can obtain the real size of data carried by this data structure.
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Acked-by: KP Singh <kpsingh@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220920075951.929132-6-roberto.sassu@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Allow dynamic pointers (struct bpf_dynptr_kern *) to be specified as
parameters in kfuncs. Also, ensure that dynamic pointers passed as argument
are valid and initialized, are a pointer to the stack, and of the type
local. More dynamic pointer types can be supported in the future.
To properly detect whether a parameter is of the desired type, introduce
the stringify_struct() macro to compare the returned structure name with
the desired name. In addition, protect against structure renames, by
halting the build with BUILD_BUG_ON(), so that developers have to revisit
the code.
To check if a dynamic pointer passed to the kfunc is valid and initialized,
and if its type is local, export the existing functions
is_dynptr_reg_valid_init() and is_dynptr_type_expected().
Cc: Joanne Koong <joannelkoong@gmail.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220920075951.929132-5-roberto.sassu@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In a prior change, we added a new BPF_MAP_TYPE_USER_RINGBUF map type which
will allow user-space applications to publish messages to a ring buffer
that is consumed by a BPF program in kernel-space. In order for this
map-type to be useful, it will require a BPF helper function that BPF
programs can invoke to drain samples from the ring buffer, and invoke
callbacks on those samples. This change adds that capability via a new BPF
helper function:
bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void *ctx,
u64 flags)
BPF programs may invoke this function to run callback_fn() on a series of
samples in the ring buffer. callback_fn() has the following signature:
long callback_fn(struct bpf_dynptr *dynptr, void *context);
Samples are provided to the callback in the form of struct bpf_dynptr *'s,
which the program can read using BPF helper functions for querying
struct bpf_dynptr's.
In order to support bpf_ringbuf_drain(), a new PTR_TO_DYNPTR register
type is added to the verifier to reflect a dynptr that was allocated by
a helper function and passed to a BPF program. Unlike PTR_TO_STACK
dynptrs which are allocated on the stack by a BPF program, PTR_TO_DYNPTR
dynptrs need not use reference tracking, as the BPF helper is trusted to
properly free the dynptr before returning. The verifier currently only
supports PTR_TO_DYNPTR registers that are also DYNPTR_TYPE_LOCAL.
Note that while the corresponding user-space libbpf logic will be added
in a subsequent patch, this patch does contain an implementation of the
.map_poll() callback for BPF_MAP_TYPE_USER_RINGBUF maps. This
.map_poll() callback guarantees that an epoll-waiting user-space
producer will receive at least one event notification whenever at least
one sample is drained in an invocation of bpf_user_ringbuf_drain(),
provided that the function is not invoked with the BPF_RB_NO_WAKEUP
flag. If the BPF_RB_FORCE_WAKEUP flag is provided, a wakeup
notification is sent even if no sample was drained.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220920000100.477320-3-void@manifault.com
|
|
We want to support a ringbuf map type where samples are published from
user-space, to be consumed by BPF programs. BPF currently supports a
kernel -> user-space circular ring buffer via the BPF_MAP_TYPE_RINGBUF
map type. We'll need to define a new map type for user-space -> kernel,
as none of the helpers exported for BPF_MAP_TYPE_RINGBUF will apply
to a user-space producer ring buffer, and we'll want to add one or
more helper functions that would not apply for a kernel-producer
ring buffer.
This patch therefore adds a new BPF_MAP_TYPE_USER_RINGBUF map type
definition. The map type is useless in its current form, as there is no
way to access or use it for anything until we one or more BPF helpers. A
follow-on patch will therefore add a new helper function that allows BPF
programs to run callbacks on samples that are published to the ring
buffer.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220920000100.477320-2-void@manifault.com
|
|
and replace kref. With it a 'vfio-dev/vfioX' node is created under the
sysfs path of the parent, indicating the device is bound to a vfio
driver, e.g.:
/sys/devices/pci0000\:6f/0000\:6f\:01.0/vfio-dev/vfio0
It is also a preparatory step toward adding cdev for supporting future
device-oriented uAPI.
Add Documentation/ABI/testing/sysfs-devices-vfio-dev.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20220921104401.38898-16-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
ccw is the only exception which cannot use vfio_alloc_device() because
its private device structure is designed to serve both mdev and parent.
Life cycle of the parent is managed by css_driver so vfio_ccw_private
must be allocated/freed in css_driver probe/remove path instead of
conforming to vfio core life cycle for mdev.
Given that use a wait/completion scheme so the mdev remove path waits
after vfio_put_device() until receiving a completion notification from
@release. The completion indicates that all active references on
vfio_device have been released.
After that point although free of vfio_ccw_private is delayed to
css_driver it's at least guaranteed to have no parallel reference on
released vfio device part from other code paths.
memset() in @probe is removed. vfio_device is either already cleared
when probed for the first time or cleared in @release from last probe.
The right fix is to introduce separate structures for mdev and parent,
but this won't happen in short term per prior discussions.
Remove vfio_init/uninit_group_dev() as no user now.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Link: https://lore.kernel.org/r/20220921104401.38898-14-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Tidy up @probe so all migration specific initialization logic is moved
to migration specific @init callback.
Remove vfio_pci_core_{un}init_device() given no user now.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20220921104401.38898-5-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Also introduce two pci core helpers as @init/@release for pci drivers:
- vfio_pci_core_init_dev()
- vfio_pci_core_release_dev()
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20220921104401.38898-3-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The idea is to let vfio core manage the vfio_device life cycle instead
of duplicating the logic cross drivers. This is also a preparatory
step for adding struct device into vfio_device.
New pair of helpers together with a kref in vfio_device:
- vfio_alloc_device()
- vfio_put_device()
Drivers can register @init/@release callbacks to manage any private
state wrapping the vfio_device.
However vfio-ccw doesn't fit this model due to a life cycle mess
that its private structure mixes both parent and mdev info hence must
be allocated/freed outside of the life cycle of vfio device.
Per prior discussions this won't be fixed in short term by IBM folks.
Instead of waiting for those modifications introduce another helper
vfio_init_device() so ccw can call it to initialize a pre-allocated
vfio_device.
Further implication of the ccw trick is that vfio_device cannot be
freed uniformly in vfio core. Instead, require *EVERY* driver to
implement @release and free vfio_device inside. Then ccw can choose
to delay the free at its own discretion.
Another trick down the road is that kvzalloc() is used to accommodate
the need of gvt which uses vzalloc() while all others use kzalloc().
So drivers should call a helper vfio_free_device() to free the
vfio_device instead of assuming that kfree() or vfree() is appliable.
Later once the ccw mess is fixed we can remove those tricks and
fully handle structure alloc/free in vfio core.
Existing vfio_{un}init_group_dev() will be deprecated after all
existing usages are converted to the new model.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Co-developed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20220921104401.38898-2-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Over the past couple years, the function _find_next_bit() was extended
with parameters that modify its behavior to implement and- zero- and le-
flavors. The parameters are passed at compile time, but current design
prevents a compiler from optimizing out the conditionals.
As find_next_bit() API grows, I expect that more parameters will be added.
Current design would require more conditional code in _find_next_bit(),
which would bloat the helper even more and make it barely readable.
This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
a set of wrappers, so that the compile-time optimizations become possible.
The common logic is moved to the new macro, and all flavors may be
generated by providing a FETCH macro parameter, like in this example:
#define FIND_NEXT_BIT(FETCH, MUNGE, size, start) ...
find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
{
return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx],
/* nop */, size, start);
}
The FETCH may be of any complexity, as soon as it only refers the bitmap(s)
and an iterator idx.
MUNGE is here to support _le code generation for BE builds. May be
empty.
I ran find_bit_benchmark 16 times on top of 6.0-rc2 and 16 times on top
of 6.0-rc2 + this series. The results for kvm/x86_64 are:
v6.0-rc2 Optimized Difference Z-score
Random dense bitmap ns ns ns %
find_next_bit: 787735 670546 117189 14.9 3.97
find_next_zero_bit: 777492 664208 113284 14.6 10.51
find_last_bit: 830925 687573 143352 17.3 2.35
find_first_bit: 3874366 3306635 567731 14.7 1.84
find_first_and_bit: 40677125 37739887 2937238 7.2 1.36
find_next_and_bit: 347865 304456 43409 12.5 1.35
Random sparse bitmap
find_next_bit: 19816 14021 5795 29.2 6.10
find_next_zero_bit: 1318901 1223794 95107 7.2 1.41
find_last_bit: 14573 13514 1059 7.3 6.92
find_first_bit: 1313321 1249024 64297 4.9 1.53
find_first_and_bit: 8921 8098 823 9.2 4.56
find_next_and_bit: 9796 7176 2620 26.7 5.39
Where the statistics is significant (z-score > 3), the improvement
is ~15%.
According to the bloat-o-meter, the Image size is 10-11K less:
x86_64/defconfig:
add/remove: 32/14 grow/shrink: 61/782 up/down: 6344/-16521 (-10177)
arm64/defconfig:
add/remove: 3/2 grow/shrink: 50/714 up/down: 608/-11556 (-10948)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
|
|
find_first_zero_bit_le() is an alias to find_next_zero_bit_le(),
despite that 'next' is known to be slower than 'first' version.
Now that we have common FIND_FIRST_BIT() macro helper, it's trivial
to implement find_first_zero_bit_le() as a real function.
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
|
|
Define a platform driver to manage the remoteproc virtio device as
a platform devices.
The platform device allows to pass rproc_vdev_data platform data to
specify properties that are stored in the rproc_vdev structure.
Such approach will allow to preserve legacy remoteproc virtio device
creation but also to probe the device using device tree mechanism.
remoteproc_virtio.c update:
- Add rproc_virtio_driver platform driver. The probe ops replaces
the rproc_rvdev_add_device function.
- All reference to the rvdev->dev has been updated to rvdev-pdev->dev.
- rproc_rvdev_release is removed as associated to the rvdev device.
- The use of rvdev->kref counter is replaced by get/put_device on the
remoteproc virtio platform device.
- The vdev device no longer increments rproc device counter.
increment/decrement is done in rproc_virtio_probe/rproc_virtio_remove
function in charge of the vrings allocation/free.
remoteproc_core.c update:
Migrate from the rvdev device to the rvdev platform device.
From this patch, when a vdev resource is found in the resource table
the remoteproc core register a platform device.
Signed-off-by: Arnaud Pouliquen <arnaud.pouliquen@foss.st.com>
Link: https://lore.kernel.org/r/20220921135044.917140-5-arnaud.pouliquen@foss.st.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
|
|
We need the poll_flags to know how to poll for the IO, and we should
have the batch structure in preparation for supporting batched
completions with iopoll.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is in preparation to support iopoll for nvme passthrough.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-4-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Put this up in the same way as iopoll is done for regular read/write IO.
Make place for storing a cookie into struct io_uring_cmd on submission.
Perform the completion using the ->uring_cmd_iopoll handler.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_uring will invoke this to do completion polling on uring-cmd
operations.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-2-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Some workloads rely on a registered eventfd (via
io_uring_register_eventfd(3)) in order to wake up and process the
io_uring.
In the case of a ring setup with IORING_SETUP_DEFER_TASKRUN, that eventfd
also needs to be signalled when there are tasks to run.
This changes an old behaviour which assumed 1 eventfd signal implied at
least 1 CQE, however only when this new flag is set (and so old users will
not notice). This should be expected with the IORING_SETUP_DEFER_TASKRUN
flag as it is not guaranteed that every task will result in a CQE.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-7-dylany@fb.com
[axboe: fold in call_rcu() serialization fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Allow deferring async tasks until the user calls io_uring_enter(2) with
the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
io_uring_setup time. This functionality requires that the later
io_uring_enter will be called from the same submission task, and therefore
restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also
set.
Being able to hand pick when tasks are run prevents the problem where
there is current work to be done, however task work runs anyway.
For example, a common workload would obtain a batch of CQEs, and process
each one. Interrupting this to additional taskwork would add latency but
not gain anything. If instead task work is deferred to just before more
CQEs are obtained then no additional latency is added.
The way this is implemented is by trying to keep task work local to a
io_ring_ctx, rather than to the submission task. This is required, as the
application will want to wake up only a single io_ring_ctx at a time to
process work, and so the lists of work have to be kept separate.
This has some other benefits like not having to check the task continually
in handle_tw_list (and potentially unlocking/locking those), and reducing
locks in the submit & process completions path.
There are networking cases where using this option can reduce request
latency by 50%. For example a contrived example using [1] where the client
sends 2k data and receives the same data back while doing some system
calls (to trigger task work) shows this reduction. The reason ends up
being that if sending responses is delayed by processing task work, then
the client side sits idle. Whereas reordering the sends first means that
the client runs it's workload in parallel with the local task work.
[1]:
Using https://github.com/DylanZA/netbench/tree/defer_run
Client:
./netbench --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
Server:
./netbench --server_only 1 --control_port 10000 --rx "io_uring --defer_taskrun 0 --workload 100" --rx "io_uring --defer_taskrun 1 --workload 100"
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|