|
clk_has_parent() doesn't modify the clocks being passed, so let's make
them const.
Suggested-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Link: https://lore.kernel.org/r/20220816112530.1837489-21-maxime@cerno.tech
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
clk-divider instantiates clk_rate_request internally for its round_rate
implementations to share the code with its determine_rate
implementations.
However, it's missing a few fields (min_rate, max_rate) that would be
initialized properly if it were using clk_core_init_rate_req().
Let's create the clk_hw_init_rate_request() function so that clock
providers are able to share the framework's code to instantiate
clk_rate_requests. This will also be useful for some tests introduced
in later patches.
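As an illustration, a round_rate() implementation can then share its
determine_rate() logic roughly like this (a sketch: the foo_* names are
hypothetical, only clk_hw_init_rate_request() is the new framework helper):
static long foo_round_rate(struct clk_hw *hw, unsigned long rate,
			   unsigned long *prate)
{
	struct clk_rate_request req;
	int ret;

	/* fills rate, min_rate, max_rate and parent data consistently */
	clk_hw_init_rate_request(hw, &req, rate);

	ret = foo_determine_rate(hw, &req);
	if (ret)
		return ret;

	*prate = req.best_parent_rate;
	return req.rate;
}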
Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com> # imx8mp
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> # exynos4210, meson g12b
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Link: https://lore.kernel.org/r/20220816112530.1837489-17-maxime@cerno.tech
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
Multiple platforms (amlogic, imx8) return 0 when the clock rate cannot
be determined properly by the recalc_rate hook. Mention in the
documentation that the framework is OK with that.
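For illustration only (foo_clk, foo->reg and FOO_DIV_MASK are made-up
driver details), such a hook can look like:
static unsigned long foo_recalc_rate(struct clk_hw *hw,
				     unsigned long parent_rate)
{
	struct foo_clk *foo = container_of(hw, struct foo_clk, hw);
	u32 div = readl(foo->reg) & FOO_DIV_MASK;

	if (!div)
		return 0;	/* rate unknown; the framework accepts 0 */

	return parent_rate / div;
}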
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Link: https://lore.kernel.org/r/20220816112530.1837489-5-maxime@cerno.tech
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
From: Michael S. Tsirkin <mst@redhat.com>
The check for __LINUX_SPINLOCK_H within rwlock.h (and other files)
detects the direct include of the header file if it is at the very
beginning of the include section.
If it is listed later then chances are high that spinlock.h was already
included (including rwlock.h) and the additional listing of rwlock.h
will not cause any failure.
On PREEMPT_RT this additional rwlock.h will lead to compile failures
since it uses a different rwlock implementation.
Add __LINUX_INSIDE_SPINLOCK_H to spinlock.h and check for this instead
of __LINUX_SPINLOCK_H to detect wrong includes. This will help detect
direct includes of rwlock.h even when PREEMPT_RT is not enabled.
[ bigeasy: add remaining __LINUX_SPINLOCK_H user and rewrite
commit description. ]
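A simplified sketch of the resulting guard pattern (the real headers
carry more logic around these lines):
/* include/linux/spinlock.h */
#define __LINUX_INSIDE_SPINLOCK_H
#include <linux/rwlock.h>
#undef __LINUX_INSIDE_SPINLOCK_H

/* include/linux/rwlock.h */
#ifndef __LINUX_INSIDE_SPINLOCK_H
# error "please don't include this file directly"
#endif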
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YweemHxJx7O8rjBx@linutronix.de
|
|
BPF_PTR_POISON was added in commit c0a5a21c25f37 ("bpf: Allow storing
referenced kptr in map") to denote a bpf_func_proto btf_id which the
verifier will replace with a dynamically-determined btf_id at verification
time.
This patch adds verifier 'poison' functionality to BPF_PTR_POISON in
order to prepare for expanded use of the value to poison ret- and
arg-btf_id in ongoing work, namely rbtree and linked list patchsets
[0, 1]. Specifically, when the verifier checks helper calls, it assumes
that BPF_PTR_POISON'ed ret type will be replaced with a valid type before
- or in lieu of - the default ret_btf_id logic. Similarly for arg btf_id.
If a poisoned btf_id reaches the default handling block for either,
consider this a verifier internal error and fail verification. Otherwise
a helper with a poisoned btf_id but no verifier logic replacing the type
would cause a crash as the invalid pointer is dereferenced.
Also move BPF_PTR_POISON to the existing include/linux/poison.h header
and remove an unnecessary shift.
[0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
[1]: lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com
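A hedged sketch of the poison value and the default-handling check
(simplified; the real verifier code prints a more detailed message):
/* include/linux/poison.h */
#define BPF_PTR_POISON ((void *)(0xeB9FUL + POISON_POINTER_DELTA))

/* check_helper_call(), default ret_btf_id handling (sketch) */
if (fn->ret_btf_id == BPF_PTR_POISON) {
	verbose(env, "verifier internal error: poisoned ret_btf_id\n");
	return -EINVAL;
}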
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220912154544.1398199-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Merge series from Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>:
Hi,
Dependencies/merging
====================
1. The DTS patches are independent.
2. The binding patches should come together, because of context changes. Could
be one of: Qualcomm SoC, ASoC or DT tree.
Changes since v3
================
1. Patches 9-10: re-order, so that first apr.yaml is corrected and then we
convert to DT schema. This makes the patchset fully bisectable at the
expense of changing the same lines twice.
2. Patch 11: New patch.
Changes since v2
================
1. Patch 9: rename and extend commit msg.
2. Add Rb tags.
Changes since v1
================
1. Patch 9: New patch.
2. Patch 10: Correct also sound/qcom,q6apm-dai.yaml (Rob).
3. Patch 13: New patch.
4. Add Rb/Tb tags.
Best regards,
Krzysztof
Krzysztof Kozlowski (15):
arm64: dts: qcom: sdm630: align APR services node names with dtschema
arm64: dts: qcom: sdm845: align APR services node names with dtschema
arm64: dts: qcom: sm8250: align APR services node names with dtschema
arm64: dts: qcom: msm8996: fix APR services nodes
arm64: dts: qcom: sdm845: align dai node names with dtschema
arm64: dts: qcom: msm8996: align dai node names with dtschema
arm64: dts: qcom: qrb5165-rb5: align dai node names with dtschema
arm64: dts: qcom: sm8250: use generic name for LPASS clock controller
dt-bindings: soc: qcom: apr: correct service children
ASoC: dt-bindings: qcom,q6asm: convert to dtschema
ASoC: dt-bindings: qcom,q6adm: convert to dtschema
ASoC: dt-bindings: qcom,q6dsp-lpass-ports: cleanup example
ASoC: dt-bindings: qcom,q6dsp-lpass-clocks: cleanup example
ASoC: dt-bindings: qcom,q6apm-dai: adjust indentation in example
dt-bindings: soc: qcom: apr: add missing properties
.../bindings/soc/qcom/qcom,apr.yaml | 112 ++++++++++++++++--
.../bindings/sound/qcom,q6adm-routing.yaml | 52 ++++++++
.../devicetree/bindings/sound/qcom,q6adm.txt | 39 ------
.../bindings/sound/qcom,q6apm-dai.yaml | 21 ++--
.../bindings/sound/qcom,q6asm-dais.yaml | 112 ++++++++++++++++++
.../devicetree/bindings/sound/qcom,q6asm.txt | 70 -----------
.../sound/qcom,q6dsp-lpass-clocks.yaml | 36 +++---
.../sound/qcom,q6dsp-lpass-ports.yaml | 64 +++++-----
arch/arm64/boot/dts/qcom/msm8996.dtsi | 10 +-
arch/arm64/boot/dts/qcom/qrb5165-rb5.dts | 4 +-
arch/arm64/boot/dts/qcom/sdm630.dtsi | 8 +-
arch/arm64/boot/dts/qcom/sdm845-db845c.dts | 2 +-
.../boot/dts/qcom/sdm845-xiaomi-beryllium.dts | 2 +-
.../boot/dts/qcom/sdm845-xiaomi-polaris.dts | 4 +-
arch/arm64/boot/dts/qcom/sdm845.dtsi | 8 +-
arch/arm64/boot/dts/qcom/sm8250.dtsi | 10 +-
16 files changed, 346 insertions(+), 208 deletions(-)
create mode 100644 Documentation/devicetree/bindings/sound/qcom,q6adm-routing.yaml
delete mode 100644 Documentation/devicetree/bindings/sound/qcom,q6adm.txt
create mode 100644 Documentation/devicetree/bindings/sound/qcom,q6asm-dais.yaml
delete mode 100644 Documentation/devicetree/bindings/sound/qcom,q6asm.txt
--
2.34.1
|
|
Several ISA drivers feature IRQ support that can be configured via an "irq"
array module parameter. This array typically matches directly with the
respective "base" array module parameter. To reduce code repetition, a
module_isa_driver_with_irq helper macro is introduced to provide a check
ensuring that the number of "irq" values passed to the module matches the
respective number of "base" values.
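A hedged usage sketch (MAX_NUM and the foo_* driver are made up;
module_param_hw_array() is the usual way ISA drivers take these
parameters):
static unsigned int base[MAX_NUM];
static unsigned int num_base;
module_param_hw_array(base, uint, ioport, &num_base, 0);

static unsigned int irq[MAX_NUM];
static unsigned int num_irq;
module_param_hw_array(irq, uint, irq, &num_irq, 0);

static struct isa_driver foo_driver = {
	/* ... probe/remove callbacks ... */
};

/* registers the driver and checks num_irq against num_base */
module_isa_driver_with_irq(foo_driver, num_base, num_irq);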
Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
Acked-by: William Breathitt Gray <william.gray@linaro.org>
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
|
|
Make use of the new can_skb_get_data_len() helper.
Add support for the variable CANXL MTU using the new can_is_canxl_dev_mtu().
Acked-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20220912170725.120748-7-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
- add new ETH_P_CANXL ethernet protocol type
- update skb checks for CAN XL
- add alloc_canxl_skb() which now needs a data length parameter
- introduce init_can_skb_reserve() to reduce code duplication
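A hedged usage sketch of the new allocator (signature per this
description; error handling trimmed):
struct canxl_frame *cxl;
struct sk_buff *skb;

skb = alloc_canxl_skb(dev, &cxl, data_len);
if (!skb)
	return NULL;

cxl->len = data_len;	/* CAN XL carries a real data length, not a DLC */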
Acked-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20220912170725.120748-6-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Add two helpers to retrieve the data length from CAN sk_buffs and prepare
the length information to be a uint16 value for the CAN XL support.
Acked-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20220912170725.120748-3-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Replace open coded checks for sk_buffs containing Classical CAN and
CAN FD frame structures as a preparation for CAN XL support.
With the added length check, the unintended processing of CAN XL frames
having the CANXL_XLF bit set can be suppressed even when the skb->len
matches non-CAN-XL frames.
The CAN_RAW socket needs a rework to use these helpers. Therefore the
use of these helpers is postponed to the CAN_RAW CAN XL integration.
The J1939 protocol gets a check for Classical CAN frames too.
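A hedged sketch of the shape of such a helper (recalled and simplified,
not the verbatim implementation):
static inline bool can_is_can_skb(const struct sk_buff *skb)
{
	struct can_frame *cf = (struct can_frame *)skb->data;

	/* identified by skb->len plus the added length check */
	return (skb->len == CAN_MTU && cf->len <= CAN_MAX_DLEN);
}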
Acked-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20220912170725.120748-2-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 into drm-misc-next
Immutable backlight-detect-refactor branch between acpi, drm-* and pdx86
Tag (immutable branch) with v6.0-rc1 + the (acpi/x86) backlight
detect refactor work. For merging into the acpi, drm-* and pdx86
subsystems.
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
From: Hans de Goede <hdegoede@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/261afe3d-7790-e945-adf6-a2c96c9b1eff@redhat.com
|
|
The delay parameter isn't set by any user, therefore simplify the code
and switch to the basic workqueue API without delay support. This also
reduces the size of struct mmc_host.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/13d8200a-e2a8-d907-38ce-a16fc5ce14aa@gmail.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
- Update some stale binding maintainer emails
- Fix property name error in apple,aic binding
- Add missing param to of_dma_configure_id() stub
- Fix an off-by-one error in unflatten_dt_nodes()
* tag 'devicetree-fixes-for-6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: pinctrl: qcom: drop non-working codeaurora.org emails
dt-bindings: power: qcom,rpmpd: drop non-working codeaurora.org emails
dt-bindings: apple,aic: Fix required item "apple,fiq-index" in affinity description
dt-bindings: interconnect: fsl,imx8m-noc: drop Leonard Crestez
of/device: Fix up of_dma_configure_id() stub
MAINTAINERS: Update email of Neil Armstrong
of: fdt: fix off-by-one error in unflatten_dt_nodes()
|
|
Merge series from AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>:
In an effort to give some love to the apparently forgotten MT6795 SoC,
I am upstreaming more components that are necessary to support platforms
powered by this SoC beyond a simple boot to serial console.
This series adds support for the regulators found in the MT6331 and MT6332
main/companion PMICs.
Adding support to each driver in each subsystem is done in separate patch
series so as to avoid spamming maintainers with uninteresting patches.
Tested on a MT6795 Sony Xperia M5 (codename "Holly") smartphone.
|
|
Linux 6.0-rc4 so we can test on BeagleBone again.
|
|
Add a driver for the regulators found in the MT6332 PMIC,
including six buck and four LDO regulators.
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link: https://lore.kernel.org/r/20220913123456.384513-5-angelogioacchino.delregno@collabora.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Add a driver for the regulators found in the MT6331 PMIC.
This PMIC features six buck and 21 Low DropOut (LDO) regulators.
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link: https://lore.kernel.org/r/20220913123456.384513-3-angelogioacchino.delregno@collabora.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Fix spelling typo in comment.
Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Signed-off-by: Helge Deller <deller@gmx.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Fix an error handling issue in DRM driver (Christophe JAILLET)
- Fix some issues in framebuffer driver (Vitaly Kuznetsov)
- Two typo fixes (Jason Wang, Shaomin Deng)
- Drop unnecessary casting in kvp tool (Zhou Jie)
* tag 'hyperv-fixes-signed-20220912' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Drivers: hv: Never allocate anything besides framebuffer from framebuffer memory region
Drivers: hv: Always reserve framebuffer region for Gen1 VMs
PCI: Move PCI_VENDOR_ID_MICROSOFT/PCI_DEVICE_ID_HYPERV_VIDEO definitions to pci_ids.h
tools: hv: kvp: remove unnecessary (void*) conversions
Drivers: hv: remove duplicate word in a comment
tools: hv: Remove an extraneous "the"
drm/hyperv: Fix an error handling path in hyperv_vmbus_probe()
|
|
We disable PTM during suspend because that allows some Root Ports to enter
lower-power PM states, which means we also need to disable PTM for all
downstream devices. Add pci_suspend_ptm() and pci_resume_ptm() for this
purpose.
pci_enable_ptm() and pci_disable_ptm() are for drivers to use to enable or
disable PTM. They use dev->ptm_enabled to keep track of whether PTM should
be enabled.
pci_suspend_ptm() and pci_resume_ptm() are PCI core-internal functions to
temporarily disable PTM during suspend and (depending on dev->ptm_enabled)
re-enable PTM during resume.
Enable/disable/suspend/resume all use internal __pci_enable_ptm() and
__pci_disable_ptm() functions that only update the PTM Control register.
Outline:
pci_enable_ptm(struct pci_dev *dev)
{
	__pci_enable_ptm(dev);
	dev->ptm_enabled = 1;
	pci_ptm_info(dev);
}

pci_disable_ptm(struct pci_dev *dev)
{
	if (dev->ptm_enabled) {
		__pci_disable_ptm(dev);
		dev->ptm_enabled = 0;
	}
}

pci_suspend_ptm(struct pci_dev *dev)
{
	if (dev->ptm_enabled)
		__pci_disable_ptm(dev);
}

pci_resume_ptm(struct pci_dev *dev)
{
	if (dev->ptm_enabled)
		__pci_enable_ptm(dev);
}
Nothing currently calls pci_resume_ptm(); the suspend path saves the PTM
state before disabling PTM, so the PTM state restore in the resume path
implicitly re-enables it. A future change will use pci_resume_ptm() to fix
some problems with this approach.
Link: https://lore.kernel.org/r/20220909202505.314195-5-helgaas@kernel.org
Tested-by: Rajvi Jingar <rajvi.jingar@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
|
|
Cache the PTM Capability offset instead of searching for it every time we
enable/disable PTM or save/restore PTM state. No functional change
intended.
Link: https://lore.kernel.org/r/20220909202505.314195-2-helgaas@kernel.org
Tested-by: Rajvi Jingar <rajvi.jingar@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl into arm/drivers
Memory controller drivers for v6.1 - MediaTek
Add support for the mt8188 SMI memory controller.
* tag 'memory-controller-drv-mediatek-6.1' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl:
memory: mtk-smi: mt8188: Add SMI Support
memory: mtk-smi: Add enable IOMMU SMC command for MM master
memory: mtk-smi: Add return value for configure port function
dt-bindings: memory: mediatek: Add mt8188 smi binding
Link: https://lore.kernel.org/r/20220909153037.824092-4-krzysztof.kozlowski@linaro.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
We need the driver core and debugfs changes in this branch.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
1. Retrieve extended operational memory physical pointers from the
auxiliary device info.
2. Set up memory registers.
3. Notify firmware that the memory is ready by sending the memory
ready command.
4. Disable PXP device if GSC is not in PXP mode.
CC: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220907215113.1596567-12-tomas.winkler@intel.com
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
|
|
Add a slow_firmware flag to the mei auxiliary device info
to inform the mei driver about slow underlying firmware.
Such firmware requires the use of larger operation timeouts.
Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220907215113.1596567-4-tomas.winkler@intel.com
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
|
|
struct mei_aux_device is an interface structure and requires
proper documentation.
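A hedged sketch of the kind of kernel-doc this adds (field list per the
driver's existing structure; the exact wording is assumed):
/**
 * struct mei_aux_device - mei auxiliary device
 * @aux_dev: auxiliary device object
 * @irq: interrupt driving the mei device
 * @bar: mmio resource bar
 */
struct mei_aux_device {
	struct auxiliary_device aux_dev;
	int irq;
	struct resource bar;
};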
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220907215113.1596567-3-tomas.winkler@intel.com
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
|
|
Test scripts:
cd /sys/fs/cgroup/blkio/
echo "8:0 1024" > blkio.throttle.write_bps_device
echo $$ > cgroup.procs
dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
Test result:
10240 bytes (10 kB, 10 KiB) copied, 10.0134 s, 1.0 kB/s
10240 bytes (10 kB, 10 KiB) copied, 10.0135 s, 1.0 kB/s
The problem is that the second bio is finished after 10s instead of 20s.
Root cause:
1) second bio will be flagged:

__blk_throtl_bio
	while (true) {
		...
		if (sq->nr_queued[rw]) -> some bio is throttled already
			break
	};
	bio_set_flag(bio, BIO_THROTTLED); -> flag the bio

2) flagged bio will be dispatched without waiting:

throtl_dispatch_tg
	tg_may_dispatch
		tg_with_in_bps_limit
			if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED))
				*wait = 0; -> wait time is zero
				return true;
commit 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
added support to count split bios for the iops limit, and thus added the
flagged-bio check in tg_with_in_bps_limit() so that split bios are only
counted once for the bps limit. However, it introduced a new problem:
IO throttling won't work if multiple bios are throttled.
In order to fix the problem, handle the iops/bps limits in different ways:
1) for the iops limit, there is no flag to record whether the bio was
throttled, and iops is always applied.
2) for the bps limit, the original bio will be flagged with
BIO_BPS_THROTTLED, and IO throttling will ignore bios with the flag.
Note that this patch also removes the code that sets the flag in
__bio_clone(); it was introduced in commit 111be8839817 ("block-throttle:
avoid double charge"), where the author assumed a split bio could be
resubmitted and throttled again. That is wrong because a split bio will
continue to be dispatched from the caller.
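A hedged sketch of the resulting checks (simplified; function names as
in the trace above):
tg_with_in_bps_limit():
	/* bps is charged once, on the original bio */
	if (bps_limit == U64_MAX || bio_flagged(bio, BIO_BPS_THROTTLED)) {
		if (wait)
			*wait = 0;
		return true;
	}

tg_with_in_iops_limit():
	/* no flag check here - iops always applies, split bios included */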
Fixes: 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Batched completions can clear multiple bits, but we're only decrementing
the wait_cnt by one each time. This can cause waiters to never be woken,
stalling IO. Use the batched count instead.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220909184022.1709476-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use try_cmpxchg instead of cmpxchg(*ptr, old, new) == old in
set_mask_bits and bit_clear_unless. The x86 CMPXCHG instruction returns
success in the ZF flag, so this change saves a compare after cmpxchg (and
the related move instruction in front of cmpxchg).
Also, try_cmpxchg implicitly assigns the old *ptr value to "old" when
cmpxchg fails, enabling further code simplifications.
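The generic shape of the transformation (a sketch; the real
set_mask_bits macro is written with typeof()):
/* before */
do {
	old = READ_ONCE(*ptr);
	new = (old & ~mask) | bits;
} while (cmpxchg(ptr, old, new) != old);

/* after: on failure, try_cmpxchg() refreshes 'old' from *ptr itself */
old = READ_ONCE(*ptr);
do {
	new = (old & ~mask) | bits;
} while (!try_cmpxchg(ptr, &old, new));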
Link: https://lkml.kernel.org/r/20220822143851.3290-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use atomic64_try_cmpxchg instead of atomic64_cmpxchg(*ptr, old, new) ==
old in inode_set_max_iversion_raw, inode_maybe_inc_iversion and
inode_query_iversion. The x86 CMPXCHG instruction returns success in the
ZF flag, so this change saves a compare after cmpxchg (and the related
move instruction in front of cmpxchg).
Also, try_cmpxchg implicitly assigns the old *ptr value to "old" when
cmpxchg fails, enabling further code simplifications.
The loop in inode_maybe_inc_iversion improves from:
5563: 48 89 ca mov %rcx,%rdx
5566: 48 89 c8 mov %rcx,%rax
5569: 48 83 e2 fe and $0xfffffffffffffffe,%rdx
556d: 48 83 c2 02 add $0x2,%rdx
5571: f0 48 0f b1 16 lock cmpxchg %rdx,(%rsi)
5576: 48 39 c1 cmp %rax,%rcx
5579: 0f 84 85 fc ff ff je 5204 <...>
557f: 48 89 c1 mov %rax,%rcx
5582: eb df jmp 5563 <...>
to:
5563: 48 89 c2 mov %rax,%rdx
5566: 48 83 e2 fe and $0xfffffffffffffffe,%rdx
556a: 48 83 c2 02 add $0x2,%rdx
556e: f0 48 0f b1 11 lock cmpxchg %rdx,(%rcx)
5573: 0f 84 8b fc ff ff je 5204 <...>
5579: eb e8 jmp 5563 <...>
Link: https://lkml.kernel.org/r/20220821193011.88208-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Only x86 has its own release_thread(); introduce a new weak
release_thread() function to clean up the empty definitions in other
architectures.
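A minimal sketch of the weak default (placement in a common file such as
kernel/exit.c is an assumption here):
/* no-op default; x86 keeps its own strong definition */
void __weak release_thread(struct task_struct *dead_task)
{
}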
Link: https://lkml.kernel.org/r/20220819014406.32266-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Guo Ren <guoren@kernel.org> [csky]
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Brian Cain <bcain@quicinc.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Stafford Horne <shorne@gmail.com> [openrisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Huacai Chen <chenhuacai@kernel.org> [LoongArch]
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Guo Ren <guoren@kernel.org> [csky]
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Xuerui Wang <kernel@xen0n.name>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "kexec, panic: Making crash_kexec() NMI safe", v4.
This patch (of 2):
Most acquisitions of kexec_mutex are done via mutex_trylock() - those were
a direct "translation" from:
8c5a1cf0ad3a ("kexec: use a mutex for locking rather than xchg()")
There have, however, been two additions since then that use mutex_lock():
crash_get_memory_size() and crash_shrink_memory().
A later commit will replace said mutex with an atomic variable, and
locking operations will become atomic_cmpxchg(). Rather than having those
mutex_lock() calls become while (atomic_cmpxchg(&lock, 0, 1)), turn them
into trylocks that can return -EBUSY on acquisition failure.
This does halve the printable size of the crash kernel, but that's still
neighbouring 2G for 32bit kernels which should be ample enough.
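A hedged sketch of the conversion for one of the two call sites (the
signed return type is what the size remark above refers to; surrounding
details may differ from the actual patch):
ssize_t crash_get_memory_size(void)
{
	ssize_t size = 0;

	if (!mutex_trylock(&kexec_mutex))
		return -EBUSY;	/* callers now handle acquisition failure */

	if (crashk_res.end != crashk_res.start)
		size = resource_size(&crashk_res);

	mutex_unlock(&kexec_mutex);
	return size;
}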
Link: https://lkml.kernel.org/r/20220630223258.4144112-1-vschneid@redhat.com
Link: https://lkml.kernel.org/r/20220630223258.4144112-2-vschneid@redhat.com
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
commit 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with
PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
cleared during temporary unmapping of a page, the PTE is
cleared/invalidated and the TLB is flushed.
What we want to achieve in all cases is that we cannot end up with a pin on
an anonymous page that may be shared, because such pins would be
unreliable and could result in memory corruptions when the mapped page
and the pin go out of sync due to a write fault.
That TLB flush handling was inspired by an outdated comment in
mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
the past to synchronize with GUP-fast. However, ever since general RCU GUP
fast was introduced in commit 2667f50e8b81 ("mm: introduce a general RCU
get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
concurrent GUP-fast in all cases -- it only handles traditional IPI-based
GUP-fast correctly.
Peter Xu (thankfully) questioned whether that TLB flush is really
required. On architectures that send an IPI broadcast on TLB flush,
it works as expected. To synchronize with RCU GUP-fast properly, we're
conceptually fine, however, we have to enforce a certain memory order and
are missing memory barriers.
Let's document that, avoid the TLB flush where possible and use proper
explicit memory barriers where required. We shouldn't really care about the
additional memory barriers here, as we're not on extremely hot paths --
and we're getting rid of some TLB flushes.
We use a smp_mb() pair for handling concurrent pinning and a
smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
PTE changes but permanent PageAnonExclusive changes.
One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to
convert an exclusive anonymous page to a KSM page, and that page is already
mapped write-protected (-> no PTE change) would be:
Thread 0 (KSM)                  Thread 1 (GUP-fast)

                                (B1) Read the PTE
                                # (B2) skipped without FOLL_WRITE
(A1) Clear PTE
smp_mb()
(A2) Check pinned
                                (B3) Pin the mapped page
                                smp_mb()
(A3) Clear PageAnonExclusive
smp_wmb()
(A4) Restore PTE
                                (B4) Check if the PTE changed
                                smp_rmb()
                                (B5) Check PageAnonExclusive
Thread 1 will properly detect that PageAnonExclusive was cleared and
back off.
Note that we don't need a memory barrier between checking if the page is
pinned and clearing PageAnonExclusive, because stores are not
speculated.
The possible issues due to reordering are of theoretical nature so far
and attempts to reproduce the race failed.
Especially the "no PTE change" case isn't the common case, because we'd
need an exclusive anonymous page that's mapped R/O and the PTE is clean
in KSM code -- and using KSM with page pinning isn't extremely common.
Further, the clear+TLB flush we used for now implies a memory barrier.
So the problematic missing part should be the missing memory barrier
after pinning but before checking if the PTE changed.
Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Christoph von Recklinghausen <crecklin@redhat.com>
Cc: Don Dutile <ddutile@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "A few cleanup patches for hugetlb", v2.
This series contains a few cleanup patches to use helper functions to
simplify the code, remove an unneeded nid parameter and so on. More
details can be found in the respective changelogs.
This patch (of 10):
Make hugetlb_cma_check() static as it's only used inside mm/hugetlb.c.
Link: https://lkml.kernel.org/r/20220901120030.63318-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220901120030.63318-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
bh_submit_read() has no users anymore, just remove it.
Link: https://lkml.kernel.org/r/20220901133505.2510834-15-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now that all ll_rw_block() users have been replaced with the new safe
helpers, just remove it here.
Link: https://lkml.kernel.org/r/20220901133505.2510834-13-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The current ll_rw_block() helper is fragile because it assumes that a
locked buffer means it's under IO submitted by some other path that
holds the lock, and it skips the buffer if it fails to get the lock, so
it's only safe on the readahead path. Unfortunately, most filesystems
still use this helper mistakenly on the sync metadata read path. There
is no guarantee that the one who holds the buffer lock always submits
IO (e.g. buffer_migrate_folio_norefs() after commit 88dbcbb3a484
("blkdev: avoid migration stalls for blkdev pages")), so it can lead to
a false-positive -EIO when submitting read IO.
This patch adds some friendly buffer read helpers to prepare for
replacing ll_rw_block() and similar calls. The bh_readahead_[*] helpers
may only be called on the readahead paths.
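A hedged usage sketch of a sync metadata read with one of the new
helpers (bh_read() waits for completion and returns a negative value on
IO error; it is a no-op for an already-uptodate buffer):
struct buffer_head *bh = sb_getblk(sb, block);

if (!bh)
	return -ENOMEM;

if (bh_read(bh, 0) < 0) {
	brelse(bh);
	return -EIO;
}
/* buffer contents are now valid */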
Link: https://lkml.kernel.org/r/20220901133505.2510834-3-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "fs/buffer: remove ll_rw_block()", v2.
ll_rw_block() will skip a locked buffer before submitting IO, as it
assumes that a locked buffer means it is under IO. This assumption is
not always true because we cannot guarantee that every buffer lock path
would submit IO. After commit 88dbcbb3a484 ("blkdev: avoid migration
stalls for blkdev pages"), buffer_migrate_folio_norefs() became one
exceptional case, and there may be others. So ll_rw_block() is not safe
on the sync read path; we could get a false-positive EIO return value
when a filesystem reads metadata. It seems that it could only be used
on the readahead path.
Unfortunately, many filesystems misuse ll_rw_block() on the sync read
path. This patch set removes ll_rw_block() and adds new friendly
helpers, which prevent false-positive EIO on the read metadata path.
Thanks for the suggestion from Jan; the original discussion is at [1].
patch 1: remove unused helpers in fs/buffer.c
patch 2: add new bh_read_[*] helpers
patch 3-11: remove all ll_rw_block() calls in filesystems
patch 12-14: do some leftover cleanups.
[1]. https://lore.kernel.org/linux-mm/20220825080146.2021641-1-chengzhihao1@huawei.com/
This patch (of 14):
No one uses __breadahead_gfp() and sb_breadahead_unmovable() any more;
remove them.
Link: https://lkml.kernel.org/r/20220901133505.2510834-1-yi.zhang@huawei.com
Link: https://lkml.kernel.org/r/20220901133505.2510834-2-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Heming Zhao <ocfs2-devel@oss.oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Simplify code by removing a redundant CONFIG_TRANSPARENT_HUGEPAGE check.
No functional change.
Link: https://lkml.kernel.org/r/20220829095125.3284567-1-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Simplify the code of the has_transparent_hugepage define by using
IS_BUILTIN.
No functional change.
Link: https://lkml.kernel.org/r/20220829095709.3287462-1-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
kswapd_run/stop() will set pgdat->kswapd to NULL, which could race with
kswapd_is_running() in kcompactd():

kswapd_run/stop()                  kcompactd()
                                     kswapd_is_running()
                                       pgdat->kswapd
                                       // error or normal ptr
                                       verify pgdat->kswapd
                                       // load non-NULL pgdat->kswapd
pgdat->kswapd = NULL
                                       task_is_running(pgdat->kswapd)
                                       // NULL pointer dereference
KASAN reports the null-ptr-deref shown below,
vmscan: Failed to start kswapd on node 0
...
BUG: KASAN: null-ptr-deref in kcompactd+0x440/0x504
Read of size 8 at addr 0000000000000024 by task kcompactd0/37
CPU: 0 PID: 37 Comm: kcompactd0 Kdump: loaded Tainted: G OE 5.10.60 #1
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
dump_backtrace+0x0/0x394
show_stack+0x34/0x4c
dump_stack+0x158/0x1e4
__kasan_report+0x138/0x140
kasan_report+0x44/0xdc
__asan_load8+0x94/0xd0
kcompactd+0x440/0x504
kthread+0x1a4/0x1f0
ret_from_fork+0x10/0x18
At present kswapd/kcompactd_run() and kswapd/kcompactd_stop() are protected
by mem_hotplug_begin/done(), but kcompactd() itself is not. There is no
need to involve the memory hotplug lock in kcompactd(), so let's add a new
mutex to protect pgdat->kswapd accesses.
Also, because the kcompactd task will check the state of the kswapd task,
it's better to call kcompactd_stop() before kswapd_stop() to reduce lock
conflicts.
[akpm@linux-foundation.org: add comments]
Link: https://lkml.kernel.org/r/20220827111959.186838-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Directly check the state of struct memory_block; there is no need for a
separate helper function.
Link: https://lkml.kernel.org/r/20220827112043.187028-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All callers of find_get_pages_contig() have been removed, so it is no
longer needed.
Link: https://lkml.kernel.org/r/20220824004023.77310-8-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: David Sterba <dsterb@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Convert to filemap_get_folios_contig()", v3.
This patch series replaces find_get_pages_contig() with
filemap_get_folios_contig().
This patch (of 7):
This function is meant to replace find_get_pages_contig().
Unlike find_get_pages_contig(), filemap_get_folios_contig() no longer
takes in a target number of pages to find - it returns up to 15
contiguous folios.
To be more consistent with filemap_get_folios(),
filemap_get_folios_contig() now also updates the start index passed in,
and takes an end index.
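A hedged usage sketch (process_folio() is made up; the rest follows the
behavior described above):
struct folio_batch fbatch;
pgoff_t index = start;
unsigned int i, nr;

folio_batch_init(&fbatch);
/* 'index' is advanced past the folios returned */
nr = filemap_get_folios_contig(mapping, &index, end, &fbatch);
for (i = 0; i < nr; i++)
	process_folio(fbatch.folios[i]);
folio_batch_release(&fbatch);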
Link: https://lkml.kernel.org/r/20220824004023.77310-1-vishal.moola@gmail.com
Link: https://lkml.kernel.org/r/20220824004023.77310-2-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Sterba <dsterb@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In commit 2f1ee0913ce5 ("Revert "mm: use early_pfn_to_nid in
page_ext_init""), we call page_ext_init() after page_alloc_init_late() to
avoid a panic problem. It seems that we cannot track early page
allocations in the current kernel even if the page structures have been
initialized early.
This patch introduces a new boot parameter 'early_page_ext' to resolve
this problem. If we pass it to the kernel, page_ext_init() will be moved
up and the feature 'deferred initialization of struct pages' will be
disabled to initialize the page allocator early and prevent the panic
problem above. It can help us to catch early page allocations. This is
especially useful when we find that the free memory value is not the same
right after different kernel boots.
[akpm@linux-foundation.org: fix section issue by removing __meminitdata]
Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
For several years, MEMCG_CHARGE_BATCH was kept at 32, but with bigger
machines and network-intensive workloads requiring throughput in Gbps,
32 is too small and makes the memcg charging path a bottleneck. For now,
increase it to 64 for easy acceptance into 6.0. We will need to revisit
this in the future for the ever-increasing demand for higher performance.
Please note that the memcg charge path drains the per-cpu memcg charge
stock, so there should not be any OOM behavior change. Though it does
have an impact on rstat flushing and high-limit reclaim backoff.
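The change itself boils down to a one-liner (value per this message;
the header location is the usual memcg one):
/* include/linux/memcontrol.h */
#define MEMCG_CHARGE_BATCH 64U	/* was 32 */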
To evaluate the impact of this optimization, on a 72 CPUs machine, we
ran the following workload in a three level of cgroup hierarchy.
$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
Results (average throughput of netperf):
Without (6.0-rc1) 10482.7 Mbps
With patch 17064.7 Mbps (62.7% improvement)
With the patch, the throughput improved by 62.7%.
Link: https://lkml.kernel.org/r/20220825000506.239406-4-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
With memcg v2 enabled, memcg->memory.usage is a very hot member for the
workloads doing memcg charging on multiple CPUs concurrently.
Particularly the network intensive workloads. In addition, there is a
false cache sharing between memory.usage and memory.high on the charge
path. This patch moves the usage into a separate cacheline and moves all
the read-mostly fields into a separate cacheline.
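A hedged sketch of the layout idea (the concrete struct is the page
counter behind memory.usage; field names here are illustrative):
struct page_counter {
	/* hot, concurrently-written counter on its own cacheline */
	atomic_long_t usage ____cacheline_aligned_in_smp;

	/* read-mostly limits grouped on a separate cacheline */
	unsigned long min ____cacheline_aligned_in_smp;
	unsigned long low;
	unsigned long high;
	unsigned long max;
};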
To evaluate the impact of this optimization, on a 72 CPUs machine, we ran
the following workload in a three level of cgroup hierarchy.
$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
Results (average throughput of netperf):
Without (6.0-rc1) 10482.7 Mbps
With patch 12413.7 Mbps (18.4% improvement)
With the patch, the throughput improved by 18.4%.
One side-effect of this patch is the increase in the size of struct
mem_cgroup. For example with this patch on 64 bit build, the size of
struct mem_cgroup increased from 4032 bytes to 4416 bytes. However for
the performance improvement, this additional size is worth it. In
addition there are opportunities to reduce the size of struct mem_cgroup
like deprecation of kmem and tcpmem page counters and better packing.
Link: https://lkml.kernel.org/r/20220825000506.239406-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Empty PTEs are passed to the pte_entry callback, not to pte_hole.
Link: https://lkml.kernel.org/r/3695521.kQq0lBPeGt@devpool047
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The below is one path where a race between page_ext access and offlining
of the respective memory blocks will cause a use-after-free on the access
of the page_ext structure.

process1                              process2
---------                             ---------
a) doing /proc/page_owner             doing memory offline
                                      through offline_pages.

b) PageBuddy check fails, thus
   proceed to get the page_owner
   information through page_ext
   access:
   page_ext = lookup_page_ext(page);

                                      migrate_pages();
                                      .................
                                      Since all pages were successfully
                                      migrated as part of the offline
                                      operation, send the MEM_OFFLINE
                                      notification, which for page_ext
                                      calls:
                                      offline_page_ext() -->
                                        __free_page_ext() -->
                                          free_page_ext() -->
                                            vfree(ms->page_ext)
                                            mem_section->page_ext = NULL

c) Check for the PAGE_EXT flags in
   page_ext->flags; the access results
   in a use-after-free (leading to
   translation faults).
As mentioned above, there is really no synchronization between page_ext
access and its freeing in memory offline.
The memory offline steps (roughly) on a memory block are as below:
1) Isolate all the pages.
2) while(1)
   try to free the pages to buddy (->free_list[MIGRATE_ISOLATE]).
3) Delete the pages from this buddy list.
4) Then free page_ext. (Note: the struct page is still alive as it is
   freed only during hot remove of the memory, which frees the memmap,
   a step the user might not perform.)
This design leads to a state where the struct page is alive but the
struct page_ext is freed, where the latter is ideally part of the former,
which just represents the page flags (check [3] for why this design was
chosen).
The above-mentioned race is just one example __but the problem persists
in the other paths involving page_ext->flags access too (e.g.
page_is_idle())__.
Fix all the paths where offline races with page_ext access by
maintaining synchronization with an RCU lock; this is achieved in 3
steps:
1) Invalidate all the page_ext's of the sections of a memory block by
storing a flag in the LSB of mem_section->page_ext.
2) Wait until all the existing readers finish working with the
->page_ext's with synchronize_rcu(). Any parallel process that starts
after this call will not get a page_ext, through lookup_page_ext(), for
the block on which the parallel offline operation is being performed.
3) Now safely free all sections ->page_ext's of the block on which
offline operation is being performed.
Note: If synchronize_rcu() takes time then optimizations can be done in
this path through call_rcu()[2].
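A hedged sketch of what a reader-side path looks like after this change
(lookup_page_ext() per the changelog; the lock/unlock pairing is the RCU
synchronization described in step 2):
rcu_read_lock();
page_ext = lookup_page_ext(page);
if (!page_ext) {
	/* the section's page_ext was invalidated by a parallel offline */
	rcu_read_unlock();
	return;
}
/* ... safely access page_ext->flags ... */
rcu_read_unlock();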
Thanks to David Hildenbrand for his views/suggestions on the initial
discussion[1] and Pavan kondeti for various inputs on this patch.
[1] https://lore.kernel.org/linux-mm/59edde13-4167-8550-86f0-11fc67882107@quicinc.com/
[2] https://lore.kernel.org/all/a26ce299-aed1-b8ad-711e-a49e82bdd180@quicinc.com/T/#u
[3] https://lore.kernel.org/all/6fa6b7aa-731e-891c-3efb-a03d6a700efa@redhat.com/
[quic_charante@quicinc.com: rename label `loop' to `ext_put_continue' per David]
Link: https://lkml.kernel.org/r/1661496993-11473-1-git-send-email-quic_charante@quicinc.com
Link: https://lkml.kernel.org/r/1660830600-9068-1-git-send-email-quic_charante@quicinc.com
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Fernand Sieber <sieberf@amazon.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|