summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2022-09-28ARM: configs: aspeed: Add support for USB flash drivesAdriana Kobylak
Add support to detect USB flash drives and create the /dev/sd* devices. Also add support for vfat to support USB drives formatted as FAT32. This support will be used to enable firmware updates via USB flash drives where the firmware image is stored in the USB drive and it's plugged into the BMC USB port. Signed-off-by: Adriana Kobylak <anoo@us.ibm.com> Tested-by: Adriana Kobylak <anoo@us.ibm.com> Link: https://lore.kernel.org/r/20211112202931.2379145-1-anoo@linux.ibm.com Signed-off-by: Joel Stanley <joel@jms.id.au>
2022-09-28ARM: dts: aspeed: ast2600-evb-a1: Add compatibleJoel Stanley
The AST2600 EVB A1 is an AST2600 EVB. Signed-off-by: Joel Stanley <joel@jms.id.au>
2022-09-28ARM: dts: aspeed: ast2600evb: Fix compatible stringJoel Stanley
The AST2600 EVB is not an A1. Signed-off-by: Joel Stanley <joel@jms.id.au>
2022-09-28ARM: dts: aspeed: ast2600-evb: Enable Quad SPI RX tranfersCédric Le Goater
Now that the pinctrl definitions of the ast2600 SoC have been fixed, see commit 925fbe1f7eb6 ("dt-bindings: pinctrl: aspeed-g6: add FWQSPI function/group"), it is safe to activate QSPI on the ast2600 evb. Cc: Chin-Ting Kuo <chin-ting_kuo@aspeedtech.com> Tested-by: Jae Hyun Yoo <quic_jaehyoo@quicinc.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Link: https://lore.kernel.org/r/20220603073705.1624351-1-clg@kaod.org Signed-off-by: Joel Stanley <joel@jms.id.au>
2022-09-28ARM: dts: aspeed-g6: Enable more UART controllersKen Chen
Setup the configuration of UART6, UART7, UART8, and UART9 in aspeed-g6.dtsi. Signed-off-by: Ken Chen <j220584470k@gmail.com> Link: https://lore.kernel.org/r/20220805090957.470434-1-j220584470k@gmail.com Signed-off-by: Joel Stanley <joel@jms.id.au>
2022-09-28ARM: dts: aspeed: yosemitev2: Disable the EEPROM driverKarthikeyan Pasupathi
Removed NIC EEPROM driver IPMB-12 channel and enabled it as generic i2c EEPROM. Signed-off-by: Karthikeyan Pasupathi <pkarthikeyan1509@gmail.com> Link: https://lore.kernel.org/r/20220914115307.GA339@hcl-ThinkPad-T495 Signed-off-by: Joel Stanley <joel@jms.id.au>
2022-09-28ARM: dts: aspeed: Add AMD DaytonaX BMCKonstantin Aladyshev
Add initial version of device tree for the BMC in the AMD DaytonaX platform. AMD DaytonaX platform is a customer reference board (CRB) with an Aspeed ast2500 BMC manufactured by AMD. Signed-off-by: Konstantin Aladyshev <aladyshev22@gmail.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20220921210950.10568-3-aladyshev22@gmail.com Signed-off-by: Joel Stanley <joel@jms.id.au>
2022-09-28ARM: dts: aspeed: Yosemite V2: Enable OCP debug cardKarthikeyan Pasupathi
Added IPMB-13 channel for Debug Card communication which improves the readability of the machine and makes it easier to debug the server and it will display some pieces of information about the server like "system info", "Critical sensors" and "critical sel". Signed-off-by: Karthikeyan Pasupathi <pkarthikeyan1509@gmail.com> Reviewed-by: Patrick Williams <patrick@stwcx.xyz> Link: https://lore.kernel.org/r/20220926124313.GA8400@hcl-ThinkPad-T495 Signed-off-by: Joel Stanley <joel@jms.id.au>
2022-09-28ARM: dts: aspeed: mtjade: Remove gpio-keys entriesQuan Nguyen
Remove the gpio-keys entries from the Ampere's Mt. Jade BMC device tree. The user space applications are going to change from using libevdev to libgpiod. Signed-off-by: Quan Nguyen <quan@os.amperecomputing.com> Link: https://lore.kernel.org/r/20220915080828.2894070-1-quan@os.amperecomputing.com Signed-off-by: Joel Stanley <joel@jms.id.au>
2022-09-28ARM: dts: aspeed: Add device tree for Ampere's Mt. Mitchell BMCQuan Nguyen
The Mt. Mitchell BMC is an ASPEED AST2600-based BMC for the Mt. Mitchell hardware reference platform with AmpereOne(TM) processor. Signed-off-by: Quan Nguyen <quan@os.amperecomputing.com> Signed-off-by: Phong Vo <phong@os.amperecomputing.com> Signed-off-by: Thang Q. Nguyen <thang@os.amperecomputing.com> Reviewed-by: Joel Stanley <joel@jms.id.au> Link: https://lore.kernel.org/r/20220817071539.176110-3-quan@os.amperecomputing.com Signed-off-by: Joel Stanley <joel@jms.id.au>
2022-09-27Merge tag 'soc-fixes-6.0-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC fixes from Arnd Bergmann: "This should be the last set of bugfixes in the SoC tree: - Two fixes for Arm integrator, dealing with a regression caused by invalid DT properties combined with a change in dma address translation, and missing device_type annotations on the PCI bus - Fixes for drivers/reset/, addressing bugs in i.MX8MP, Sparx5 and NPCM8XX platforms - Bjorn Andersson's email address changes in the MAINTAINERS file - Multiple minor fixes to Qualcomm dts files, and a change to the remoteproc firmware filename that did not match the actual path in the linux-firmware package - Minor code fixes for the Allwinner/sunxi SRAM driver, and the broadcom STB Bus Interface Unit driver - A build fix for the sunplus sp7021 platform - Two dts fixes for TI OMAP family SoCs, addressing an extraneous usb4 device node and an incorrect DMA handle" * tag 'soc-fixes-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: ARM: dts: integrator: Fix DMA ranges ARM: dts: integrator: Tag PCI host with device_type ARM: sunplus: fix serial console kconfig and build problems reset: npcm: fix iprst2 and iprst4 setting arm64: dts: qcom: sm8350: fix UFS PHY serdes size soc: bcm: brcmstb: biuctrl: Avoid double of_node_put() arm64: dts: qcom: sc8280xp-x13s: Update firmware location soc: sunxi: sram: Fix debugfs info for A64 SRAM C soc: sunxi: sram: Fix probe function ordering issues soc: sunxi: sram: Prevent the driver from being unbound soc: sunxi: sram: Actually claim SRAM regions ARM: dts: am5748: keep usb4_tm disabled reset: microchip-sparx5: issue a reset on startup reset: imx7: Fix the iMX8MP PCIe PHY PERST support MAINTAINERS: Update Bjorn's email address arm64: dts: qcom: sc7280: move USB wakeup-source property arm64: dts: qcom: thinkpad-x13s: Fix firmware location arm64: dts: qcom: sm8150: Fix fastrpc iommu values ARM: dts: am33xx: Fix MMCHS0 dma properties
2022-09-27x86/alternative: Fix race in try_get_desc()Nadav Amit
I encountered some occasional crashes of poke_int3_handler() when kprobes are set, while accessing desc->vec. The text poke mechanism claims to have an RCU-like behavior, but it does not appear that there is any quiescent state to ensure that nobody holds reference to desc. As a result, the following race appears to be possible, which can lead to memory corruption. CPU0 CPU1 ---- ---- text_poke_bp_batch() -> smp_store_release(&bp_desc, &desc) [ notice that desc is on the stack ] poke_int3_handler() [ int3 might be kprobe's so sync events are do not help ] -> try_get_desc(descp=&bp_desc) desc = __READ_ONCE(bp_desc) if (!desc) [false, success] WRITE_ONCE(bp_desc, NULL); atomic_dec_and_test(&desc.refs) [ success, desc space on the stack is being reused and might have non-zero value. ] arch_atomic_inc_not_zero(&desc->refs) [ might succeed since desc points to stack memory that was freed and might be reused. ] Fix this issue with small backportable patch. Instead of trying to make RCU-like behavior for bp_desc, just eliminate the unnecessary level of indirection of bp_desc, and hold the whole descriptor as a global. Anyhow, there is only a single descriptor at any given moment. Fixes: 1f676247f36a4 ("x86/alternatives: Implement a better poke_int3_handler() completion scheme") Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@kernel.org Link: https://lkml.kernel.org/r/20220920224743.3089-1-namit@vmware.com
2022-09-27perf: Use sample_flags for raw_dataNamhyung Kim
Use the new sample_flags to indicate whether the raw data field is filled by the PMU driver. Although it could check with the NULL, follow the same rule with other fields. Remove the raw field from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220921220032.2858517-2-namhyung@kernel.org
2022-09-27perf: Use sample_flags for addrNamhyung Kim
Use the new sample_flags to indicate whether the addr field is filled by the PMU driver. As most PMU drivers pass 0, it can set the flag only if it has a non-zero value. And use 0 in perf_sample_output() if it's not filled already. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220921220032.2858517-1-namhyung@kernel.org
2022-09-27Merge branch 'master' into i2c/for-mergewindowWolfram Sang
2022-09-27MIPS: Lantiq: vmmc: fix compile break introduced by gpiod patchDmitry Torokhov
"MIPS: Lantiq: switch vmmc to use gpiod API" patch introduced compile errors, this patch fixes them. Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2022-09-27x86: kprobes: Remove unused macro stack_addrChen Zhongjin
An unused macro reported by [-Wunused-macros]. This macro is used to access the sp in pt_regs because at that time x86_32 can only get sp by kernel_stack_pointer(regs). '3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs")' This commit have unified the pt_regs and from them we can get sp from pt_regs with regs->sp easily. Nowhere is using this macro anymore. Refrencing pt_regs directly is more clear. Remove this macro for code cleaning. Link: https://lkml.kernel.org/r/20220924072629.104759-1-chenzhongjin@huawei.com Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-27riscv: dts: microchip: update memory configuration for v2022.10Conor Dooley
In the v2022.10 reference design, the seg registers are going to be changed, resulting in a required change to the memory map in Linux. A small 4M reservation is made at the end of 32-bit DDR to provide some memory for the HSS to use, so that it can cache its payload.bin between reboots of a specific context. Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
2022-09-27riscv: dts: microchip: add a devicetree for aries' m100pfsevpConor Dooley
Add device trees for both configs used by the Aries Embedded M100PFSEVP. The M100OFSEVP consists of a MPFS250T on a SOM, featuring: - 2GB DDR4 SDRAM dedicated to the HMS - 512MB DDR4 SDRAM dedicated to the FPGA - 32 MB SPI NOR Flash - 4 GByte eMMC and a carrier board with: - 2x Gigabit Ethernet - USB - 2x UART - 2x CAN - TFT connector - HSMC extension connector - 3x PMOD extension connectors - microSD-card slot Link: https://www.aries-embedded.com/polarfire-soc-fpga-microsemi-m100pfs-som-mpfs025t-pcie-serdes Link: https://www.aries-embedded.com/evaluation-kit/fpga/polarfire-microchip-soc-fpga-m100pfsevp-riscv-hsmc-pmod Link: https://downloads.aries-embedded.de/products/M100PFS/Hardware/M100PFSEVP-Schematics.pdf Co-developed-by: Wolfgang Grandegger <wg@aries-embedded.de> Signed-off-by: Wolfgang Grandegger <wg@aries-embedded.de> Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
2022-09-27riscv: dts: microchip: add sevkit device treeVattipalli Praveen
Add a basic dts for the Microchip Smart Embedded Vision dev kit. The SEV kit is an upcoming first party board, featuring an MPFS250T and: - Dual Sony Camera Sensors (IMX334) - IEEE 802.11 b/g/n 20MHz (1x1) Wi-Fi - Bluetooth 5 Low Energy - 4 GB DDR4 x64 - 2 GB LPDDR4 x32 - 1 GB SPI Flash - 8 GB eMMC flash & SD card slot (multiplexed) - HDMI2.0 Video Input/Output - MIPI DSI Output - MIPI CSI-2 Input Link: https://onlinedocs.microchip.com/pr/GUID-404D3738-DC76-46BA-8683-6A77E837C2DD-en-US-1/index.html?GUID-065AEBEE-7B2C-4895-8579-B1D73D797F06 Signed-off-by: Vattipalli Praveen <praveen.kumar@microchip.com> Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
2022-09-27riscv: dts: microchip: reduce the fic3 clock rateConor Dooley
For the v2022.09 release of the reference design, the fic3 clock rate been reduced from 62.5 MHz to 50 MHz as it allows timing to be closed significantly more quickly by customers who chose to build the reference design themselves. Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
2022-09-27riscv: dts: microchip: icicle: re-jig fabric peripheral addressesConor Dooley
When users try to add onto the reference design, they find that the current addresses that peripherals connected to Fabric InterConnect (FIC) 3 use are restrictive. For the v2022.09 reference design, the peripherals have been shifted down, leaving more contiguous address space for their custom IP/peripherals. Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
2022-09-27riscv: dts: microchip: icicle: update pci address propertiesConor Dooley
For the v2022.09 reference design the PCI root port's data region has been moved to FIC1 from FIC0. This is a shorter path, allowing for higher clock rates and improved through-put. As a result, the address at which the PCIe's data region appears to the core complex has changed. The config region's address is unchanged. As FIC0 is no longer used, its clock can be removed too. Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
2022-09-27riscv: dts: microchip: move the mpfs' pci node to -fabric.dtsiConor Dooley
In today's edition of moving things around: The PCIe root port on PolarFire SoC is more part of the FPGA than of the Core Complex. It is located on the other side of the chip and, apart from its interrupts, most of its configuration is determined by the FPGA bitstream rather. This includes: - address translation in both directions - the addresses at which the config and data regions appear to the core complex - the clocks used by the AXI bus - the plic interrupt used Moving the PCIe node to the -fabric.dtsi makes it clearer than a singular configuration for root port is not correct & allows the base SoC dtsi to be more easily included. Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
2022-09-27riscv: dts: microchip: add pci dma ranges for the icicle kitConor Dooley
The recently removed, accidentally included, "matr0" property was used in place of a dma-ranges property. The PCI controller is non-functional with mainline Linux in the v2022.02 or later reference designs and has not worked without configuration of address-translation since v2021.08. Add the address translation that will be used by the v2022.09 reference design & update the compatible used by the dts. Since this change is not backwards compatible, update the compatible to denote this, jumping over v2022.09 directly to v2022.10. Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
2022-09-28KVM: PPC: Book3S HV P9: Restore stolen time logging in dtlNicholas Piggin
Stolen time logging in dtl was removed from the P9 path, so guests had no stolen time accounting. Add it back in a simpler way that still avoids locks and per-core accounting code. Fixes: ecb6a7207f92 ("KVM: PPC: Book3S HV P9: Remove most of the vcore logic") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220908132545.4085849-4-npiggin@gmail.com
2022-09-28KVM: PPC: Book3S HV: Update guest state entry/exit accounting to new APINicholas Piggin
Update the guest state and timing entry/exit accounting to use the new API, which was introduced following issues found[1]. KVM HV does possibly call instrumented code inside the guest context, and it does call srcu inside the guest context which is fragile at best. Switch to the new API, moving the guest context inside the srcu_read_lock/unlock region. [1] https://lore.kernel.org/lkml/20220201132926.3301912-1-mark.rutland@arm.com/ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220908132545.4085849-3-npiggin@gmail.com
2022-09-28KVM: PPC: Book3S HV P9: Fix irq disabling in tick accountingNicholas Piggin
kvmhv_run_single_vcpu() disables PMIs as well as Linux irqs, however the tick time accounting code enables and disables irqs and not PMIs within this region. By chance this might not actually cause a bug, but it is clearly an incorrect use of the APIs. Fixes: 2251fbe76395e ("KVM: PPC: Book3S HV P9: Improve mtmsrd scheduling by delaying MSR[EE] disable") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220908132545.4085849-2-npiggin@gmail.com
2022-09-28KVM: PPC: Book3S HV P9: Clear vcpu cpu fields before enabling host irqsNicholas Piggin
On guest entry, vcpu->cpu and vcpu->arch.thread_cpu are set after disabling host irqs. On guest exit there is a window whre tick time accounting briefly enables irqs before these fields are cleared. Move them up to ensure they are cleared before host irqs are run. This is possibly not a problem, but is more symmetric and makes the fields less surprising. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220908132545.4085849-1-npiggin@gmail.com
2022-09-28KVM: PPC: Book3S HV: Fix decrementer migrationFabiano Rosas
We used to have a workaround[1] for a hang during migration that was made ineffective when we converted the decrementer expiry to be relative to guest timebase. The point of the workaround was that in the absence of an explicit decrementer expiry value provided by userspace during migration, KVM needs to initialize dec_expires to a value that will result in an expired decrementer after subtracting the current guest timebase. That stops the vcpu from hanging after migration due to a decrementer that's too large. If the dec_expires is now relative to guest timebase, its initialization needs to be guest timebase-relative as well, otherwise we end up with a decrementer expiry that is still larger than the guest timebase. 1- https://git.kernel.org/torvalds/c/5855564c8ab2 Fixes: 3c1a4322bba7 ("KVM: PPC: Book3S HV: Change dec_expires to be relative to guest timebase") Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220816222517.1916391-1-farosas@linux.ibm.com
2022-09-27a.out: Remove the a.out implementationEric W. Biederman
In commit 19e8b701e258 ("a.out: Stop building a.out/osf1 support on alpha and m68k") the last users of a.out were disabled. As nothing has turned up to cause this change to be reverted, let's remove the code implementing a.out support as well. There may be userspace users of the uapi bits left so the uapi headers have been left untouched. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Arnd Bergmann <arnd@arndb.de> # arm defconfigs Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/871qrx3hq3.fsf@email.froward.int.ebiederm.org
2022-09-27efi/arm: libstub: move ARM specific code out of generic routinesArd Biesheuvel
Move some code that is only reachable when IS_ENABLED(CONFIG_ARM) into the ARM EFI arch code. Cc: Russell King <linux@armlinux.org.uk> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2022-09-27Merge tag 'efi-loongarch-for-v6.1-2' into HEADArd Biesheuvel
Second shared stable tag between EFI and LoongArch trees This is necessary because the EFI libstub refactoring patches are mostly directed at enabling LoongArch to wire up generic EFI boot support without being forced to consume DT properties that conflict with information that EFI also provides, e.g., memory map and reservations, etc.
2022-09-27efi/loongarch: libstub: remove dependency on flattened DTArd Biesheuvel
LoongArch does not use FDT or DT natively [yet], and the only reason it currently uses it is so that it can reuse the existing EFI stub code. Overloading the DT with data passed between the EFI stub and the core kernel has been a source of problems: there is the overlap between information provided by EFI which DT can also provide (initrd base/size, command line, memory descriptions), requiring us to reason about which is which and what to prioritize. It has also resulted in ABI leaks, i.e., internal ABI being promoted to external ABI inadvertently because the bootloader can set the EFI stub's DT properties as well (e.g., "kaslr-seed"). This has become especially problematic with boot environments that want to pretend that EFI boot is being done (to access ACPI and SMBIOS tables, for instance) but have no ability to execute the EFI stub, and so the environment that the EFI stub creates is emulated [poorly, in some cases]. Another downside of treating DT like this is that the DT binary that the kernel receives is different from the one created by the firmware, which is undesirable in the context of secure and measured boot. Given that LoongArch support in Linux is brand new, we can avoid these pitfalls, and treat the DT strictly as a hardware description, and use a separate handover method between the EFI stub and the kernel. Now that initrd loading and passing the EFI memory map have been refactored into pure EFI routines that use EFI configuration tables, the only thing we need to pass directly is the kernel command line (even if we could pass this via a config table as well, it is used extremely early, so passing it directly is preferred in this case.) Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Huacai Chen <chenhuacai@loongson.cn>
2022-09-26bpf: use bpf_prog_pack for bpf_dispatcherSong Liu
Allocate bpf_dispatcher with bpf_prog_pack_alloc so that bpf_dispatcher can share pages with bpf programs. arch_prepare_bpf_dispatcher() is updated to provide a RW buffer as working area for arch code to write to. This also fixes CPA W^X warnning like: CPA refuse W^X violation: 8000000000000163 -> 0000000000000163 range: ... Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20220926184739.3512547-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-26bpf: Use given function address for trampoline ip argJiri Olsa
Using function address given at the generation time as the trampoline ip argument. This way we get directly the function address that we need, so we don't need to: - read the ip from the stack - subtract X86_PATCH_SIZE - subtract ENDBR_INSN_SIZE if CONFIG_X86_KERNEL_IBT is enabled which is not even implemented yet ;-) Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20220926153340.1621984-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-26riscv: use vma iterator for vdsoLiam R. Howlett
Remove the linked list use in favour of the vma iterator. Link: https://lkml.kernel.org/r/20220906194824.2110408-67-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26um: remove vma linked list walkMatthew Wilcox (Oracle)
Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-40-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26xtensa: remove vma linked list walksMatthew Wilcox (Oracle)
Use the VMA iterator instead. Since VMA can no longer be NULL in the loop, then deal with out-of-memory outside the loop. This means a slightly longer run time in the failure case (-ENOMEM) - it will run to the end of the VMAs before erroring instead of in the middle of the loop. Link: https://lkml.kernel.org/r/20220906194824.2110408-37-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26x86: remove vma linked list walksMatthew Wilcox (Oracle)
Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-36-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26s390: remove vma linked list walksMatthew Wilcox (Oracle)
Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-35-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26powerpc: remove mmap linked list walksMatthew Wilcox (Oracle)
Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-34-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26parisc: remove mmap linked list from cache handlingMatthew Wilcox (Oracle)
Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-33-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26arm64: Change elfcore for_each_mte_vma() to use VMA iteratorLiam R. Howlett
Rework for_each_mte_vma() to use a VMA iterator instead of an explicit linked-list. Link: https://lkml.kernel.org/r/20220906194824.2110408-32-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20220218023650.672072-1-Liam.Howlett@oracle.com Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26arm64: remove mmap linked list from vdsoMatthew Wilcox (Oracle)
Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-31-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26mm: remove rb tree.Liam R. Howlett
Remove the RB tree and start using the maple tree for vm_area_struct tracking. Drop validate_mm() calls in expand_upwards() and expand_downwards() as the lock is not held. Link: https://lkml.kernel.org/r/20220906194824.2110408-18-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26mm: start tracking VMAs with maple treeLiam R. Howlett
Start tracking the VMAs with the new maple tree structure in parallel with the rb_tree. Add debug and trace events for maple tree operations and duplicate the rb_tree that is created on forks into the maple tree. The maple tree is added to the mm_struct including the mm_init struct, added support in required mm/mmap functions, added tracking in kernel/fork for process forking, and used to find the unmapped_area and checked against what the rbtree finds. This also moves the mmap_lock() in exit_mmap() since the oom reaper call does walk the VMAs. Otherwise lockdep will be unhappy if oom happens. When splitting a vma fails due to allocations of the maple tree nodes, the error path in __split_vma() calls new->vm_ops->close(new). The page accounting for hugetlb is actually in the close() operation, so it accounts for the removal of 1/2 of the VMA which was not adjusted. This results in a negative exit value. To avoid the negative charge, set vm_start = vm_end and vm_pgoff = 0. There is also a potential accounting issue in special mappings from insert_vm_struct() failing to allocate, so reverse the charge there in the failure scenario. Link: https://lkml.kernel.org/r/20220906194824.2110408-9-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNGYu Zhao
Some architectures support the accessed bit in non-leaf PMD entries, e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it as part of linear address translation [1]. Page table walkers that clear the accessed bit may use this capability to reduce their search space. Note that: 1. Although an inline function is preferable, this capability is added as a configuration option for consistency with the existing macros. 2. Due to the little interest in other varieties, this capability was only tested on Intel and AMD CPUs. Thanks to the following developers for their efforts [2][3]. Randy Dunlap <rdunlap@infradead.org> Stephen Rothwell <sfr@canb.auug.org.au> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (June 2021), section 4.8 [2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/ [3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/ Link: https://lkml.kernel.org/r/20220918080010.2920238-3-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26mm: x86, arm64: add arch_has_hw_pte_young()Yu Zhao
Patch series "Multi-Gen LRU Framework", v14. What's new ========== 1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS, Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15. 2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs) machines. The old direct reclaim backoff, which tries to enforce a minimum fairness among all eligible memcgs, over-swapped by about (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which pulls the plug on swapping once the target is met, trades some fairness for curtailed latency: https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/ 3. Fixed minior build warnings and conflicts. More comments and nits. TLDR ==== The current page reclaim is too expensive in terms of CPU usage and it often makes poor choices about what to evict. This patchset offers an alternative solution that is performant, versatile and straightforward. Patchset overview ================= The design and implementation overview is in patch 14: https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/ 01. mm: x86, arm64: add arch_has_hw_pte_young() 02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Take advantage of hardware features when trying to clear the accessed bit in many PTEs. 03. mm/vmscan.c: refactor shrink_node() 04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" Minor refactors to improve readability for the following patches. 05. mm: multi-gen LRU: groundwork Adds the basic data structure and the functions that insert pages to and remove pages from the multi-gen LRU (MGLRU) lists. 06. mm: multi-gen LRU: minimal implementation A minimal implementation without optimizations. 07. mm: multi-gen LRU: exploit locality in rmap Exploits spatial locality to improve efficiency when using the rmap. 08. mm: multi-gen LRU: support page table walks Further exploits spatial locality by optionally scanning page tables. 09. mm: multi-gen LRU: optimize multiple memcgs Optimizes the overall performance for multiple memcgs running mixed types of workloads. 10. mm: multi-gen LRU: kill switch Adds a kill switch to enable or disable MGLRU at runtime. 11. mm: multi-gen LRU: thrashing prevention 12. mm: multi-gen LRU: debugfs interface Provide userspace with features like thrashing prevention, working set estimation and proactive reclaim. 13. mm: multi-gen LRU: admin guide 14. mm: multi-gen LRU: design doc Add an admin guide and a design doc. Benchmark results ================= Independent lab results ----------------------- Based on the popularity of searches [01] and the memory usage in Google's public cloud, the most popular open-source memory-hungry applications, in alphabetical order, are: Apache Cassandra Memcached Apache Hadoop MongoDB Apache Spark PostgreSQL MariaDB (MySQL) Redis An independent lab evaluated MGLRU with the most widely used benchmark suites for the above applications. They posted 960 data points along with kernel metrics and perf profiles collected over more than 500 hours of total benchmark time. Their final reports show that, with 95% confidence intervals (CIs), the above applications all performed significantly better for at least part of their benchmark matrices. On 5.14: 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]% less wall time to sort three billion random integers, respectively, under the medium- and the high-concurrency conditions, when overcommitting memory. There were no statistically significant changes in wall time for the rest of the benchmark matrix. 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]% more transactions per minute (TPM), respectively, under the medium- and the high-concurrency conditions, when overcommitting memory. There were no statistically significant changes in TPM for the rest of the benchmark matrix. 3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]% and [21.59, 30.02]% more operations per second (OPS), respectively, for sequential access, random access and Gaussian (distribution) access, when THP=always; 95% CIs [13.85, 15.97]% and [23.94, 29.92]% more OPS, respectively, for random access and Gaussian access, when THP=never. There were no statistically significant changes in OPS for the rest of the benchmark matrix. 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and [2.16, 3.55]% more operations per second (OPS), respectively, for exponential (distribution) access, random access and Zipfian (distribution) access, when underutilizing memory; 95% CIs [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS, respectively, for exponential access, random access and Zipfian access, when overcommitting memory. On 5.15: 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]% and [4.11, 7.50]% more operations per second (OPS), respectively, for exponential (distribution) access, random access and Zipfian (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%, [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for exponential access, random access and Zipfian access, when swap was on. 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]% less average wall time to finish twelve parallel TeraSort jobs, respectively, under the medium- and the high-concurrency conditions, when swap was on. There were no statistically significant changes in average wall time for the rest of the benchmark matrix. 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per minute (TPM) under the high-concurrency condition, when swap was off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM, respectively, under the medium- and the high-concurrency conditions, when swap was on. There were no statistically significant changes in TPM for the rest of the benchmark matrix. 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and [11.47, 19.36]% more total operations per second (OPS), respectively, for sequential access, random access and Gaussian (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%, [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively, for sequential access, random access and Gaussian access, when THP=never. Our lab results --------------- To supplement the above results, we ran the following benchmark suites on 5.16-rc7 and found no regressions [10]. fs_fio_bench_hdd_mq pft fs_lmbench pgsql-hammerdb fs_parallelio redis fs_postmark stream hackbench sysbenchthread kernbench tpcc_spark memcached unixbench multichase vm-scalability mutilate will-it-scale nginx [01] https://trends.google.com [02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/ [03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/ [04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/ [05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/ [06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/ [07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/ [08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/ [09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/ [10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/ Read-world applications ======================= Third-party testimonials ------------------------ Konstantin reported [11]: I have Archlinux with 8G RAM + zswap + swap. While developing, I have lots of apps opened such as multiple LSP-servers for different langs, chats, two browsers, etc... Usually, my system gets quickly to a point of SWAP-storms, where I have to kill LSP-servers, restart browsers to free memory, etc, otherwise the system lags heavily and is barely usable. 1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU patchset, and I started up by opening lots of apps to create memory pressure, and worked for a day like this. Till now I had not a single SWAP-storm, and mind you I got 3.4G in SWAP. I was never getting to the point of 3G in SWAP before without a single SWAP-storm. Vaibhav from IBM reported [12]: In a synthetic MongoDB Benchmark, seeing an average of ~19% throughput improvement on POWER10(Radix MMU + 64K Page Size) with MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across three different request distributions, namely, Exponential, Uniform and Zipfan. Shuang from U of Rochester reported [13]: With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]% and [9.26, 10.36]% higher throughput, respectively, for random access, Zipfian (distribution) access and Gaussian (distribution) access, when the average number of jobs per CPU is 1; 95% CIs [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher throughput, respectively, for random access, Zipfian access and Gaussian access, when the average number of jobs per CPU is 2. Daniel from Michigan Tech reported [14]: With Memcached allocating ~100GB of byte-addressable Optante, performance improvement in terms of throughput (measured as queries per second) was about 10% for a series of workloads. Large-scale deployments ----------------------- We've rolled out MGLRU to tens of millions of ChromeOS users and about a million Android users. Google's fleetwide profiling [15] shows an overall 40% decrease in kswapd CPU usage, in addition to improvements in other UX metrics, e.g., an 85% decrease in the number of low-memory kills at the 75th percentile and an 18% decrease in app launch time at the 50th percentile. The downstream kernels that have been using MGLRU include: 1. Android [16] 2. Arch Linux Zen [17] 3. Armbian [18] 4. ChromeOS [19] 5. Liquorix [20] 6. OpenWrt [21] 7. post-factum [22] 8. XanMod [23] [11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/ [12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/ [13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/ [14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/ [15] https://dl.acm.org/doi/10.1145/2749469.2750392 [16] https://android.com [17] https://archlinux.org [18] https://armbian.com [19] https://chromium.org [20] https://liquorix.net [21] https://openwrt.org [22] https://codeberg.org/pf-kernel [23] https://xanmod.org Summary ======= The facts are: 1. The independent lab results and the real-world applications indicate substantial improvements; there are no known regressions. 2. Thrashing prevention, working set estimation and proactive reclaim work out of the box; there are no equivalent solutions. 3. There is a lot of new code; no smaller changes have been demonstrated similar effects. Our options, accordingly, are: 1. Given the amount of evidence, the reported improvements will likely materialize for a wide range of workloads. 2. Gauging the interest from the past discussions, the new features will likely be put to use for both personal computers and data centers. 3. Based on Google's track record, the new code will likely be well maintained in the long term. It'd be more difficult if not impossible to achieve similar effects with other approaches. This patch (of 14): Some architectures automatically set the accessed bit in PTEs, e.g., x86 and arm64 v8.2. On architectures that do not have this capability, clearing the accessed bit in a PTE usually triggers a page fault following the TLB miss of this PTE (to emulate the accessed bit). Being aware of this capability can help make better decisions, e.g., whether to spread the work out over a period of time to reduce bursty page faults when trying to clear the accessed bit in many PTEs. Note that theoretically this capability can be unreliable, e.g., hotplugged CPUs might be different from builtin ones. Therefore it should not be used in architecture-independent code that involves correctness, e.g., to determine whether TLB flushes are required (in combination with the accessed bit). Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Acked-by: Will Deacon <will@kernel.org> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arm-kernel@lists.infradead.org Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26mm/swap: cache maximum swapfile size when init swapPeter Xu
We used to have swapfile_maximum_size() fetching a maximum value of swapfile size per-arch. As the caller of max_swapfile_size() grows, this patch introduce a variable "swapfile_maximum_size" and cache the value of old max_swapfile_size(), so that we don't need to calculate the value every time. Caching the value in swapfile_init() is safe because when reaching the phase we should have initialized all the relevant information. Here the major arch to take care of is x86, which defines the max swapfile size based on L1TF mitigation. Here both X86_BUG_L1TF or l1tf_mitigation should have been setup properly when reaching swapfile_init(). As a reference, the code path looks like this for x86: - start_kernel - setup_arch - early_cpu_init - early_identify_cpu --> setup X86_BUG_L1TF - parse_early_param - l1tf_cmdline --> set l1tf_mitigation - check_bugs - l1tf_select_mitigation --> set l1tf_mitigation - arch_call_rest_init - rest_init - kernel_init - kernel_init_freeable - do_basic_setup - do_initcalls --> calls swapfile_init() (initcall level 4) The swapfile size only depends on swp pte format on non-x86 archs, so caching it is safe too. Since at it, rename max_swapfile_size() to arch_max_swapfile_size() because arch can define its own function, so it's more straightforward to have "arch_" as its prefix. At the meantime, export swapfile_maximum_size to replace the old usages of max_swapfile_size(). [peterx@redhat.com: declare arch_max_swapfile_size) in swapfile.h] Link: https://lkml.kernel.org/r/YxTh1GuC6ro5fKL5@xz-m1.local Link: https://lkml.kernel.org/r/20220811161331.37055-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>