path: root/include/linux
2024-04-26  pwm: Move contents of sysfs.c into core.c  (Uwe Kleine-König)
With the upcoming restructuring, having everything in a single file simplifies things a bit. The relevant and somewhat visible changes are:

 - Some prototypes are dropped from include/linux/pwm.h that were only needed so that core.c had a declaration of the symbols defined in sysfs.c. The respective functions are static now.

 - The pwm class now also exists if CONFIG_SYSFS isn't enabled. Having CONFIG_SYSFS disabled is not very relevant today, but even without it the class and device code still provides lifetime tracking.

 - Both files had an initcall; these are merged into a single one now.

Instead of a big #ifdef block for CONFIG_DEBUG_FS, a single if (IS_ENABLED(CONFIG_DEBUG_FS)) is used now. This increases compile coverage a bit and is a tad nicer on the eyes.

Link: https://lore.kernel.org/r/9e2d39a5280d7dda5bfc6682a8aef510148635b2.1710670958.git.u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
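As an editorial illustration of the IS_ENABLED() pattern mentioned above (a minimal sketch, not the actual pwm core hunk; pwm_debugfs_init() is a stand-in name):

    /* Always compiled now, which is what improves compile coverage. */
    static void pwm_debugfs_init(void)
    {
            /* debugfs_create_*() calls become no-op stubs when CONFIG_DEBUG_FS=n */
    }

    static int __init pwm_init(void)
    {
            /* Replaces a big #ifdef CONFIG_DEBUG_FS block; the call is
             * constant-folded away when debugfs is disabled. */
            if (IS_ENABLED(CONFIG_DEBUG_FS))
                    pwm_debugfs_init();
            return 0;
    }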
2024-04-26  pwm: Ensure that pwm_chips are allocated using pwmchip_alloc()  (Uwe Kleine-König)
Memory holding a struct device must not be freed before the reference count drops to zero. So a struct pwm_chip must not live in memory freed by a driver on unbind. All in-tree drivers were fixed accordingly, but out-of-tree drivers that were not adapted still compile fine, so catch these in pwmchip_add().

Link: https://lore.kernel.org/r/35f5b229c98f78b2f6ce2397259a4a936be477c0.1707900770.git.u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
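For driver authors, the expected pattern is to let the PWM core allocate the chip so that the embedded struct device owns the lifetime. A hedged sketch (the foo_* names and FOO_NPWM are hypothetical; devm_pwmchip_alloc()/devm_pwmchip_add() are the core's managed helpers):

    struct foo_pwm {
            void __iomem *base;
    };

    static int foo_pwm_probe(struct platform_device *pdev)
    {
            struct pwm_chip *chip;
            struct foo_pwm *priv;

            /* Chip and driver-private data are allocated by the PWM core,
             * never embedded in driver-owned memory freed on unbind. */
            chip = devm_pwmchip_alloc(&pdev->dev, FOO_NPWM, sizeof(*priv));
            if (IS_ERR(chip))
                    return PTR_ERR(chip);

            priv = pwmchip_get_drvdata(chip);
            chip->ops = &foo_pwm_ops;

            return devm_pwmchip_add(&pdev->dev, chip);
    }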
2024-04-26  cpufreq: amd-pstate: Add quirk for the pstate CPPC capabilities missing  (Perry Yuan)
Add a quirks table to fix broken CPPC capabilities by providing correct perf or frequency values while the driver loads. If the CPPC capabilities are not defined in the ACPI tables, or are wrongly defined by the platform firmware, a quirk is needed to apply correct workaround values so that the pstate driver can still be loaded despite the CPPC capability errors.

The workaround matches broken BIOSes that lack the nominal_freq and lowest_freq CPPC capability definitions in the ACPI table:

    $ cat /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_freq
    0
    $ cat /sys/devices/system/cpu/cpu0/acpi_cppc/nominal_freq
    0

Acked-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com>
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
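For illustration, such a quirk table plausibly pairs a DMI match with override values along these lines (the struct, field, and function names below are assumptions, not the actual hunk):

    static struct quirk_entry quirk_missing_cppc_freq = {
            .nominal_freq = 2600,   /* sane fallback values for the missing capabilities */
            .lowest_freq  = 550,
    };

    static const struct dmi_system_id amd_pstate_quirks_table[] __initconst = {
            {
                    /* BIOS lacks nominal_freq and lowest_freq in _CPC */
                    .callback    = dmi_matched_broken_cppc,
                    .matches     = {
                            DMI_MATCH(DMI_BIOS_VENDOR, "AMI"),
                            DMI_MATCH(DMI_BIOS_VERSION, "5.14"),
                    },
                    .driver_data = &quirk_missing_cppc_freq,
            },
            {}
    };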
2024-04-26  cpufreq: amd-pstate: Document the units for freq variables in amd_cpudata  (Gautham R. Shenoy)
The min_limit_freq, max_limit_freq, min_freq, max_freq, nominal_freq and lowest_nominal_freq members of struct cpudata store the frequency value in kHz to be consistent with the cpufreq core. Update the comments to document this.

Reviewed-by: Li Meng <li.meng@amd.com>
Tested-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com>
Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Acked-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-04-26  cpufreq: amd-pstate: Document *_limit_* fields in struct amd_cpudata  (Gautham R. Shenoy)
The four fields of struct cpudata, namely min_limit_perf, max_limit_perf, min_limit_freq and max_limit_freq, introduced in commit febab20caeba ("cpufreq/amd-pstate: Fix scaling_min_freq and scaling_max_freq update"), are currently undocumented. Add comments describing these fields.

Acked-by: Huang Rui <ray.huang@amd.com>
Fixes: febab20caeba ("cpufreq/amd-pstate: Fix scaling_min_freq and scaling_max_freq update")
Reviewed-by: Li Meng <li.meng@amd.com>
Tested-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com>
Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
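The added documentation presumably takes roughly this shape (a paraphrased sketch of the struct comments, not the verbatim hunk), which also makes the kHz unit from the previous entry explicit:

    struct amd_cpudata {
            /* ... */
            u32     min_limit_perf; /* Cached perf value corresponding to policy->min */
            u32     max_limit_perf; /* Cached perf value corresponding to policy->max */
            u32     min_limit_freq; /* Cached value of policy->min (in kHz) */
            u32     max_limit_freq; /* Cached value of policy->max (in kHz) */
            /* ... */
    };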
2024-04-26  bpf: verifier: prevent userspace memory access  (Puranjay Mohan)
With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To thwart invalid memory accesses, the JITs add an exception table entry for all such accesses. But in case the src_reg + offset is a userspace address, the BPF program might read that memory if the user has mapped it.

Make the verifier add guard instructions around such memory accesses and skip the load if the address falls into the userspace region.

The JITs need to implement bpf_arch_uaddress_limit() to define where the userspace addresses end for that architecture, or TASK_SIZE is taken as the default.

The implementation is as follows:

    REG_AX = SRC_REG
    if (offset)
        REG_AX += offset;
    REG_AX >>= 32;
    if (REG_AX <= (uaddress_limit >> 32))
        DST_REG = 0;
    else
        DST_REG = *(size *)(SRC_REG + offset);

Comparing just the upper 32 bits of the load address with the upper 32 bits of uaddress_limit implies that the values are being aligned down to a 4GB boundary before comparison.

The above means that all loads with address <= uaddress_limit + 4GB are skipped. This is acceptable because there is a large hole (much larger than 4GB) between userspace and kernel space memory, therefore a correctly functioning BPF program should not access this 4GB memory above the userspace.

Let's analyze what this patch does to the following fentry program dereferencing an untrusted pointer:

    SEC("fentry/tcp_v4_connect")
    int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
    {
        *(volatile long *)sk;
        return 0;
    }

BPF program before:

    0: (79) r1 = *(u64 *)(r1 +0)
    1: (79) r1 = *(u64 *)(r1 +0)
    2: (b7) r0 = 0
    3: (95) exit

BPF program after (the untrusted load at index 1 is now guarded):

    0: (79) r1 = *(u64 *)(r1 +0)
    1: (bf) r11 = r1
    2: (77) r11 >>= 32
    3: (b5) if r11 <= 0x8000 goto pc+2
    4: (79) r1 = *(u64 *)(r1 +0)
    5: (05) goto pc+1
    6: (b7) r1 = 0
    7: (b7) r0 = 0
    8: (95) exit

As you can see from above, in the best case (off=0), 5 extra instructions are emitted.

Now, we analyze the same program after it has gone through the JITs of ARM64 and RISC-V architectures. We follow the single load instruction that has the untrusted pointer and see what instrumentation has been added around it.

x86-64 JIT
==========
JIT's instrumentation (upstream):

    0:  nopl   0x0(%rax,%rax,1)
    5:  xchg   %ax,%ax
    7:  push   %rbp
    8:  mov    %rsp,%rbp
    b:  mov    0x0(%rdi),%rdi
    f:  movabs $0x800000000000,%r11
    19: cmp    %r11,%rdi
    1c: jb     0x000000000000002a
    1e: mov    %rdi,%r11
    21: add    $0x0,%r11
    28: jae    0x000000000000002e
    2a: xor    %edi,%edi
    2c: jmp    0x0000000000000032
    2e: mov    0x0(%rdi),%rdi
    32: xor    %eax,%eax
    34: leave
    35: ret

The x86-64 JIT already emits some instructions to protect against user memory access. This patch doesn't make any changes for the x86-64 JIT.

ARM64 JIT
=========
No instrumentation (upstream):

    0:  add  x9, x30, #0x0
    4:  nop
    8:  paciasp
    c:  stp  x29, x30, [sp, #-16]!
    10: mov  x29, sp
    14: stp  x19, x20, [sp, #-16]!
    18: stp  x21, x22, [sp, #-16]!
    1c: stp  x25, x26, [sp, #-16]!
    20: stp  x27, x28, [sp, #-16]!
    24: mov  x25, sp
    28: mov  x26, #0x0
    2c: sub  x27, x25, #0x0
    30: sub  sp, sp, #0x0
    34: ldr  x0, [x0]
    38: ldr  x0, [x0]
    3c: mov  x7, #0x0
    40: mov  sp, sp
    44: ldp  x27, x28, [sp], #16
    48: ldp  x25, x26, [sp], #16
    4c: ldp  x21, x22, [sp], #16
    50: ldp  x19, x20, [sp], #16
    54: ldp  x29, x30, [sp], #16
    58: add  x0, x7, #0x0
    5c: autiasp
    60: ret
    64: nop
    68: ldr  x10, 0x0000000000000070
    6c: br   x10

Verifier's instrumentation (this patch), where the guarded sequence replaces the untrusted load at offset 0x38 above:

    0:  add  x9, x30, #0x0
    4:  nop
    8:  paciasp
    c:  stp  x29, x30, [sp, #-16]!
    10: mov  x29, sp
    14: stp  x19, x20, [sp, #-16]!
    18: stp  x21, x22, [sp, #-16]!
    1c: stp  x25, x26, [sp, #-16]!
    20: stp  x27, x28, [sp, #-16]!
    24: mov  x25, sp
    28: mov  x26, #0x0
    2c: sub  x27, x25, #0x0
    30: sub  sp, sp, #0x0
    34: ldr  x0, [x0]
    38: add  x9, x0, #0x0
    3c: lsr  x9, x9, #32
    40: cmp  x9, #0x10, lsl #12
    44: b.ls 0x0000000000000050
    48: ldr  x0, [x0]
    4c: b    0x0000000000000054
    50: mov  x0, #0x0
    54: mov  x7, #0x0
    58: mov  sp, sp
    5c: ldp  x27, x28, [sp], #16
    60: ldp  x25, x26, [sp], #16
    64: ldp  x21, x22, [sp], #16
    68: ldp  x19, x20, [sp], #16
    6c: ldp  x29, x30, [sp], #16
    70: add  x0, x7, #0x0
    74: autiasp
    78: ret
    7c: nop
    80: ldr  x10, 0x0000000000000088
    84: br   x10

There are 6 extra instructions added in ARM64 in the best case. This will become 7 in the worst case (off != 0).

RISC-V JIT (RISCV_ISA_C disabled)
=================================
No instrumentation (upstream):

    0:  nop
    4:  nop
    8:  li     a6, 33
    c:  addi   sp, sp, -16
    10: sd     s0, 8(sp)
    14: addi   s0, sp, 16
    18: ld     a0, 0(a0)
    1c: ld     a0, 0(a0)
    20: li     a5, 0
    24: ld     s0, 8(sp)
    28: addi   sp, sp, 16
    2c: sext.w a0, a5
    30: ret

Verifier's instrumentation (this patch), where the guarded sequence replaces the untrusted load at offset 0x1c above:

    0:  nop
    4:  nop
    8:  li     a6, 33
    c:  addi   sp, sp, -16
    10: sd     s0, 8(sp)
    14: addi   s0, sp, 16
    18: ld     a0, 0(a0)
    1c: mv     t0, a0
    20: srli   t0, t0, 32
    24: lui    t1, 4096
    28: sext.w t1, t1
    2c: bgeu   t1, t0, 12
    30: ld     a0, 0(a0)
    34: j      8
    38: li     a0, 0
    3c: li     a5, 0
    40: ld     s0, 8(sp)
    44: addi   sp, sp, 16
    48: sext.w a0, a5
    4c: ret

There are 7 extra instructions added in RISC-V.

Fixes: 800834285361 ("bpf, arm64: Add BPF exception tables")
Reported-by: Breno Leitao <leitao@debian.org>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240424100210.11982-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-26  Merge tag 'qcom-drivers-fixes-for-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into for-next  (Arnd Bergmann)
Qualcomm driver fix for v6.9

This reworks the memory layout of the argument buffers passed to trusted applications in QSEECOM, to avoid failures and system crashes.

* tag 'qcom-drivers-fixes-for-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
  firmware: qcom: uefisecapp: Fix memory related IO errors and crashes

Link: https://lore.kernel.org/r/20240420163816.1133528-1-andersson@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-04-26  virtio: add debugfs infrastructure to allow to debug virtio features  (Jiri Pirko)
Currently there is no way for the user to set which features the driver should obey or not; it is hard wired in the code. In order to be able to debug the device behavior in case some feature is disabled, introduce a debugfs infrastructure with a couple of files allowing the user to see what features the device advertises and to set a filter for the features used by the driver.

Example:

    $ cat /sys/bus/virtio/devices/virtio0/features
    1110010111111111111101010000110010000000100000000000000000000000
    $ echo "5" > /sys/kernel/debug/virtio/virtio0/filter_feature_add
    $ cat /sys/kernel/debug/virtio/virtio0/filter_features
    5
    $ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/unbind
    $ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/bind
    $ cat /sys/bus/virtio/devices/virtio0/features
    1110000111111111111101010000110010000000100000000000000000000000

Note that the sysfs "features" file already exists; this patch does not touch it.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-26  Merge branch 'memory-observability' into x86/amd  (Joerg Roedel)
2024-04-26  iommu: Add ops->domain_alloc_sva()  (Jason Gunthorpe)
Make a new op that receives the device and the mm_struct that the SVA domain should be created for. Unlike domain_alloc_paging() the dev argument is never NULL here. This allows drivers to fully initialize the SVA domain and allocate the mmu_notifier during allocation. It allows the notifier lifetime to follow the lifetime of the iommu_domain. Since we have only one call site, upgrade the new op to return ERR_PTR instead of NULL. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> [Removed smmu3 related changes - Vasant] Signed-off-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Tina Zhang <tina.zhang@intel.com> Link: https://lore.kernel.org/r/20240418103400.6229-15-vasant.hegde@amd.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
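A minimal sketch of a driver wiring up the new op (the foo_* identifiers are hypothetical; the op shape follows the description above):

    struct foo_sva_domain {
            struct iommu_domain domain;
            /* per-domain state, e.g. the mmu_notifier, set up at allocation */
    };

    static struct iommu_domain *foo_domain_alloc_sva(struct device *dev,
                                                     struct mm_struct *mm)
    {
            struct foo_sva_domain *sva;

            sva = kzalloc(sizeof(*sva), GFP_KERNEL);
            if (!sva)
                    return ERR_PTR(-ENOMEM);    /* ERR_PTR, not NULL, per the new contract */

            /* dev is never NULL here, so the domain can be fully initialised
             * and the mmu_notifier registered right away. */
            sva->domain.ops = &foo_sva_domain_ops;
            return &sva->domain;
    }

    static const struct iommu_ops foo_iommu_ops = {
            /* ... */
            .domain_alloc_sva = foo_domain_alloc_sva,
    };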
2024-04-26  dma-mapping: Simplify arch_setup_dma_ops()  (Robin Murphy)
The dma_base, size and iommu arguments are only used by ARM, and can now easily be deduced from the device itself, so there's no need to pass them through the callchain as well. Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Michael Kelley <mhklinux@outlook.com> # For Hyper-V Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/5291c2326eab405b1aa7693aa964e8d3cb7193de.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
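As a sketch of the interface change being described (the exact post-patch prototype is an assumption, not quoted from the patch):

    /* Before: callers had to thread ARM-specific information through. */
    void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
                            const struct iommu_ops *iommu, bool coherent);

    /* After (assumed shape): everything else is deduced from dev itself,
     * e.g. from dev->dma_range_map and the device's IOMMU association. */
    void arch_setup_dma_ops(struct device *dev, bool coherent);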
2024-04-26  iommu/dma: Centralise iommu_setup_dma_ops()  (Robin Murphy)
It's somewhat hard to see, but arm64's arch_setup_dma_ops() should only ever call iommu_setup_dma_ops() after a successful iommu_probe_device(), which means there should be no harm in achieving the same order of operations by running it off the back of iommu_probe_device() itself. This then puts it in line with the x86 and s390 .probe_finalize bodges, letting us pull it all into the main flow properly. As a bonus this lets us fold in and de-scope the PCI workaround setup as well. At this point we can also then pull the call up inside the group mutex, and avoid having to think about whether iommu_group_store_type() could theoretically race and free the domain if iommu_setup_dma_ops() ran just *before* iommu_device_use_default_domain() claims it... Furthermore we replace one .probe_finalize call completely, since the only remaining implementations are now one which only needs to run once for the initial boot-time probe, and two which themselves render that path unreachable. This leaves us a big step closer to realistically being able to unpick the variety of different things that iommu_setup_dma_ops() has been muddling together, and further streamline iommu-dma into core API flows in future. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> # For Intel IOMMU Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/bebea331c1d688b34d9862eefd5ede47503961b8.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-04-26  dma-mapping: Add helpers for dma_range_map bounds  (Robin Murphy)
Several places want to compute the lower and/or upper bounds of a dma_range_map, so let's factor that out into reusable helpers. Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> # For arm64 Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/45ec52f033ec4dfb364e23f48abaf787f612fa53.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
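The helpers plausibly look something like the sketch below (names and implementation are assumptions based on the description; they walk the zero-size-terminated bus_dma_region array):

    /* Lowest CPU physical address covered by the map (assumed sketch). */
    static inline phys_addr_t dma_range_map_min(const struct bus_dma_region *map)
    {
            phys_addr_t ret = PHYS_ADDR_MAX;

            for (; map->size; map++)
                    ret = min(ret, map->cpu_start);
            return ret;
    }

    /* Highest CPU physical address covered by the map (exclusive end). */
    static inline phys_addr_t dma_range_map_max(const struct bus_dma_region *map)
    {
            phys_addr_t ret = 0;

            for (; map->size; map++)
                    ret = max(ret, map->cpu_start + map->size);
            return ret;
    }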
2024-04-26  ACPI/IORT: Handle memory address size limits as limits  (Robin Murphy)
Return the Root Complex/Named Component memory address size limit as an inclusive limit value, rather than an exclusive size. This saves having to fudge an off-by-one for the 64-bit case, and simplifies our caller. Acked-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/284ae9fbadb12f2e3b5a30cd4d037d0e6843a8f4.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-04-26  iommu: Add ops->domain_alloc_sva()  (Jason Gunthorpe)
Make a new op that receives the device and the mm_struct that the SVA domain should be created for. Unlike domain_alloc_paging() the dev argument is never NULL here. This allows drivers to fully initialize the SVA domain and allocate the mmu_notifier during allocation. It allows the notifier lifetime to follow the lifetime of the iommu_domain. Since we have only one call site, upgrade the new op to return ERR_PTR instead of NULL. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Tina Zhang <tina.zhang@intel.com> Link: https://lore.kernel.org/r/20240311090843.133455-15-vasant.hegde@amd.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240416080656.60968-12-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-04-26  iommu/vt-d: Remove private data use in fault message  (Jingqi Liu)
According to the Intel VT-d specification revision 4.0, the "Private Data" field has been removed from the Page Request/Response. Since the private data field is not used in the fault message, remove the related definitions in the page request descriptor and remove the related code in the page request/response handler, as Intel hasn't shipped any products which support private data in the page request message.

Signed-off-by: Jingqi Liu <Jingqi.liu@intel.com>
Link: https://lore.kernel.org/r/20240308103811.76744-3-Jingqi.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-04-26  iommu/vt-d: Allocate DMAR fault interrupts locally  (Dimitri Sivanich)
The Intel IOMMU code currently tries to allocate all DMAR fault interrupt vectors on the boot cpu. On large systems with high DMAR counts this results in vector exhaustion, and most of the vectors are not initially allocated socket local. Instead, have a cpu on each node do the vector allocation for the DMARs on that node. The boot cpu still does the allocation for its node during its boot sequence. Signed-off-by: Dimitri Sivanich <sivanich@hpe.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/Zfydpp2Hm+as16TY@hpe.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-04-26  fs: Create anon_inode_getfile_fmode()  (Dawid Osuchowski)
Creates an anon_inode_getfile_fmode() function that works similarly to anon_inode_getfile() with the addition of being able to set the fmode member. Signed-off-by: Dawid Osuchowski <linux@osuchow.ski> Link: https://lore.kernel.org/r/20240426075854.4723-1-linux@osuchow.ski Signed-off-by: Christian Brauner <brauner@kernel.org>
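A hedged usage sketch (the trailing fmode_t argument follows from the description; foo_fops and priv are placeholders):

    struct file *file;

    /* Like anon_inode_getfile(), but also seeds f_mode, e.g. to allow pread/pwrite. */
    file = anon_inode_getfile_fmode("[foo]", &foo_fops, priv,
                                    O_RDWR | O_CLOEXEC,
                                    FMODE_PREAD | FMODE_PWRITE);
    if (IS_ERR(file))
            return PTR_ERR(file);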
2024-04-26  drivers/perf: riscv: Implement SBI PMU snapshot function  (Atish Patra)
SBI v2.0 introduced the PMU snapshot feature, which adds the following capabilities:

 1. Read counter values directly from the shared memory instead of a CSR read.
 2. Start multiple counters with initial values with one SBI call.

These functionalities optimize the number of traps to the higher privilege mode. If the kernel is in VS mode while the hypervisor deploys a trap & emulate method, this would minimize all the hpmcounter CSR read traps. If the kernel is running in S-mode, the benefit is reduced to CSR latency vs DRAM/cache latency as there is no trap involved while accessing the hpmcounter CSRs.

In both modes, it does save the number of ecalls when starting multiple counters together with initial values. This is a likely scenario if multiple counters overflow at the same time.

Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-10-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26  mmc: slot-gpio: Use irq_handler_t type  (Andy Shevchenko)
The irq_handler_t is already defined globally, let's use it in slot-gpio code. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Alexander Stein <alexander.stein@ew.tq-group.com> Link: https://lore.kernel.org/r/20240410195618.1632778-1-andriy.shevchenko@linux.intel.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-04-26  mmc: core: Add mmc_gpiod_set_cd_config() function  (Hans de Goede)
Some mmc host drivers may need to fix up a card-detection GPIO's config to e.g. enable the GPIO controller's built-in pull-up resistor on devices where the firmware description of the GPIO is broken (e.g. GpioInt with PullNone instead of PullUp in the ACPI DSDT).

Since this is the exception rather than the rule, adding a config parameter to mmc_gpiod_request_cd() seems undesirable, so instead add a new mmc_gpiod_set_cd_config() function. This is simply a wrapper to call gpiod_set_config() on the card-detect GPIO acquired through mmc_gpiod_request_cd().

Reviewed-by: Andy Shevchenko <andy@kernel.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240410191639.526324-2-hdegoede@redhat.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
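A hedged example of the intended use from a host driver's probe path (the call site and the pull-up strength value are illustrative; PIN_CONF_PACKED() is the generic pinconf packing):

    /* Firmware described the card-detect GpioInt with PullNone; ask the GPIO
     * controller to enable its built-in pull-up instead. */
    ret = mmc_gpiod_set_cd_config(mmc,
                                  PIN_CONF_PACKED(PIN_CONFIG_BIAS_PULL_UP, 20000));
    if (ret && ret != -ENOTSUPP)
            return ret;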
2024-04-26  xfrm: Preserve vlan tags for transport mode software GRO  (Paul Davey)
The software GRO path for esp transport mode uses skb_mac_header_rebuild prior to re-injecting the packet via the xfrm_napi_dev. This only copies skb->mac_len bytes of header which may not be sufficient if the packet contains 802.1Q tags or other VLAN tags. Worse copying only the initial header will leave a packet marked as being VLAN tagged but without the corresponding tag leading to mangling when it is later untagged. The VLAN tags are important when receiving the decrypted esp transport mode packet after GRO processing to ensure it is received on the correct interface. Therefore record the full mac header length in xfrm*_transport_input for later use in corresponding xfrm*_transport_finish to copy the entire mac header when rebuilding the mac header for GRO. The skb->data pointer is left pointing skb->mac_header bytes after the start of the mac header as is expected by the network stack and network and transport header offsets reset to this location. Fixes: 7785bba299a8 ("esp: Add a software GRO codepath") Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-04-25  instrumented.h: add instrument_memcpy_before, instrument_memcpy_after  (Alexander Potapenko)
Bug detection tools based on compiler instrumentation may miss memory accesses in custom memcpy implementations (such as copy_mc_to_kernel). Provide instrumentation hooks that tell KASAN, KCSAN, and KMSAN about such accesses. Link: https://lore.kernel.org/all/3b7dbd88-0861-4638-b2d2-911c97a4cadf@I-love.SAKURA.ne.jp/ Link: https://lkml.kernel.org/r/20240320101851.2589698-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
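A hedged sketch of how a custom copy routine would use the new hooks (arch_copy_mc() is a hypothetical arch helper; check the header for the exact prototypes before relying on them):

    static unsigned long foo_copy_mc(void *dst, const void *src, size_t len)
    {
            unsigned long left;

            instrument_memcpy_before(dst, src, len);        /* KASAN/KCSAN: check the ranges up front */
            left = arch_copy_mc(dst, src, len);             /* uninstrumented copy, may stop early */
            instrument_memcpy_after(dst, src, len, left);   /* KMSAN: account for what was actually copied */
            return left;
    }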
2024-04-25  mm: kmsan: implement kmsan_memmove()  (Alexander Potapenko)
Provide a hook that can be used by custom memcpy implementations to tell KMSAN that the metadata needs to be copied. Without that, false positive reports are possible in the cases where KMSAN fails to intercept memory initialization. Link: https://lore.kernel.org/all/3b7dbd88-0861-4638-b2d2-911c97a4cadf@I-love.SAKURA.ne.jp/ Link: https://lkml.kernel.org/r/20240320101851.2589698-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: set pageblock_order to HPAGE_PMD_ORDER in case with !CONFIG_HUGETLB_PAGE but THP enabled  (Baolin Wang)
As Vlastimil suggested in previous discussion[1], it doesn't make sense to set pageblock_order as MAX_PAGE_ORDER when hugetlbfs is not enabled and THP is enabled. Instead, it should be set to HPAGE_PMD_ORDER.

[1] https://lore.kernel.org/all/76457ec5-d789-449b-b8ca-dcb6ceb12445@suse.cz/

Link: https://lkml.kernel.org/r/3d57d253070035bdc0f6d6e5681ce1ed0e1934f7.1712286863.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
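The resulting definition presumably ends up along these lines (a sketch of the pageblock_order selection; treat it as an assumption rather than the exact hunk):

    #if defined(CONFIG_HUGETLB_PAGE) && !defined(CONFIG_HUGETLB_PAGE_SIZE_VARIABLE)
    #define pageblock_order         HUGETLB_PAGE_ORDER
    #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
    /* THP but no hugetlbfs: group pages at the PMD-mappable granularity. */
    #define pageblock_order         min_t(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
    #else
    #define pageblock_order         MAX_PAGE_ORDER
    #endif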
2024-04-25  mm: inline destroy_large_folio() into __folio_put_large()  (Matthew Wilcox (Oracle))
destroy_large_folio() has only one caller, move its contents there. Link: https://lkml.kernel.org/r/20240405153228.2563754-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm/ksm: remove redundant code in ksm_fork  (Jinjiang Tu)
Since commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl"), when a child process is forked, the MMF_VM_MERGE_ANY flag will be inherited in mm_init(). So, it's unnecessary to set the flag in ksm_fork(). Link: https://lkml.kernel.org/r/20240402024934.1093361-1-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST  (David Hildenbrand)
Nowadays, we call it "GUP-fast", the external interface includes functions like "get_user_pages_fast()", and we renamed all internal functions to reflect that as well. Let's make the config option reflect that. Link: https://lkml.kernel.org/r/20240402125516.223131-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: madvise: avoid split during MADV_PAGEOUT and MADV_COLD  (Ryan Roberts)
Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large folio that is fully and contiguously mapped in the pageout/cold vm range. This change means that large folios will be maintained all the way to swap storage. This both improves performance during swap-out, by eliding the cost of splitting the folio, and sets us up nicely for maintaining the large folio when it is swapped back in (to be covered in a separate series). Folios that are not fully mapped in the target range are still split, but note that behavior is changed so that if the split fails for any reason (folio locked, shared, etc) we now leave it as is and move to the next pte in the range and continue work on the proceeding folios. Previously any failure of this sort would cause the entire operation to give up and no folios mapped at higher addresses were paged out or made cold. Given large folios are becoming more common, this old behavior would have likely lead to wasted opportunities. While we are at it, change the code that clears young from the ptes to use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper function. This is more efficent than get_and_clear/modify/set, especially for contpte mappings on arm64, where the old approach would require unfolding/refolding and the new approach can be done in place. Link: https://lkml.kernel.org/r/20240408183946.2991168-8-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: swap: allow storage of all mTHP orders  (Ryan Roberts)
Multi-size THP enables performance improvements by allocating large, pte-mapped folios for anonymous memory. However I've observed that on an arm64 system running a parallel workload (e.g. kernel compilation) across many cores, under high memory pressure, the speed regresses. This is due to bottlenecking on the increased number of TLBIs added due to all the extra folio splitting when the large folios are swapped out. Therefore, solve this regression by adding support for swapping out mTHP without needing to split the folio, just like is already done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, and when the swap backing store is a non-rotating block device. These are the same constraints as for the existing PMD-sized THP swap-out support. Note that no attempt is made to swap-in (m)THP here - this is still done page-by-page, like for PMD-sized THP. But swapping-out mTHP is a prerequisite for swapping-in mTHP. The main change here is to improve the swap entry allocator so that it can allocate any power-of-2 number of contiguous entries between [1, (1 << PMD_ORDER)]. This is done by allocating a cluster for each distinct order and allocating sequentially from it until the cluster is full. This ensures that we don't need to search the map and we get no fragmentation due to alignment padding for different orders in the cluster. If there is no current cluster for a given order, we attempt to allocate a free cluster from the list. If there are no free clusters, we fail the allocation and the caller can fall back to splitting the folio and allocates individual entries (as per existing PMD-sized THP fallback). The per-order current clusters are maintained per-cpu using the existing infrastructure. This is done to avoid interleving pages from different tasks, which would prevent IO being batched. This is already done for the order-0 allocations so we follow the same pattern. As is done for order-0 per-cpu clusters, the scanner now can steal order-0 entries from any per-cpu-per-order reserved cluster. This ensures that when the swap file is getting full, space doesn't get tied up in the per-cpu reserves. This change only modifies swap to be able to accept any order mTHP. It doesn't change the callers to elide doing the actual split. That will be done in separate changes. Link: https://lkml.kernel.org/r/20240408183946.2991168-6-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: swap: update get_swap_pages() to take folio order  (Ryan Roberts)
We are about to allow swap storage of any mTHP size. To prepare for that, let's change get_swap_pages() to take a folio order parameter instead of nr_pages. This makes the interface self-documenting; a power-of-2 number of pages must be provided. We will also need the order internally so this simplifies accessing it. Link: https://lkml.kernel.org/r/20240408183946.2991168-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: swap: simplify struct percpu_cluster  (Ryan Roberts)
struct percpu_cluster stores the index of cpu's current cluster and the offset of the next entry that will be allocated for the cpu. These two pieces of information are redundant because the cluster index is just (offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the cluster index is because the structure used for it also has a flag to indicate "no cluster". However this data structure also contains a spin lock, which is never used in this context, as a side effect the code copies the spinlock_t structure, which is questionable coding practice in my view. So let's clean this up and store only the next offset, and use a sentinal value (SWAP_NEXT_INVALID) to indicate "no cluster". SWAP_NEXT_INVALID is chosen to be 0, because 0 will never be seen legitimately; The first page in the swap file is the swap header, which is always marked bad to prevent it from being allocated as an entry. This also prevents the cluster to which it belongs being marked free, so it will never appear on the free list. This change saves 16 bytes per cpu. And given we are shortly going to extend this mechanism to be per-cpu-AND-per-order, we will end up saving 16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the system. Link: https://lkml.kernel.org/r/20240408183946.2991168-4-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
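A sketch of the simplified structure as described (the field name is an assumption):

    /* 0 is never a valid offset: the first swap page is the header, always marked bad. */
    #define SWAP_NEXT_INVALID       0

    struct percpu_cluster {
            unsigned int next;      /* Likely offset for the next allocation in the
                                       cpu's current cluster, or SWAP_NEXT_INVALID */
    };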
2024-04-25  mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()  (Ryan Roberts)
Now that we no longer have a convenient flag in the cluster to determine if a folio is large, free_swap_and_cache() will take a reference and lock a large folio much more often, which could lead to contention and (e.g.) failure to split large folios, etc. Let's solve that problem by batch freeing swap and cache with a new function, free_swap_and_cache_nr(), to free a contiguous range of swap entries together. This allows us to first drop a reference to each swap slot before we try to release the cache folio. This means we only try to release the folio once, only taking the reference and lock once - much better than the previous 512 times for the 2M THP case. Contiguous swap entries are gathered in zap_pte_range() and madvise_free_pte_range() in a similar way to how present ptes are already gathered in zap_pte_range(). While we are at it, let's simplify by converting the return type of both functions to void. The return value was used only by zap_pte_range() to print a bad pte, and was ignored by everyone else, so the extra reporting wasn't exactly guaranteed. We will still get the warning with most of the information from get_swap_device(). With the batch version, we wouldn't know which pte was bad anyway so could print the wrong one. [ryan.roberts@arm.com: fix a build warning on parisc] Link: https://lkml.kernel.org/r/20240409111840.3173122-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240408183946.2991168-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: swap: remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags  (Ryan Roberts)
Patch series "Swap-out mTHP without splitting", v7. This series adds support for swapping out multi-size THP (mTHP) without needing to first split the large folio via split_huge_page_to_list_to_order(). It closely follows the approach already used to swap-out PMD-sized THP. There are a couple of reasons for swapping out mTHP without splitting: - Performance: It is expensive to split a large folio and under extreme memory pressure some workloads regressed performance when using 64K mTHP vs 4K small folios because of this extra cost in the swap-out path. This series not only eliminates the regression but makes it faster to swap out 64K mTHP vs 4K small folios. - Memory fragmentation avoidance: If we can avoid splitting a large folio memory is less likely to become fragmented, making it easier to re-allocate a large folio in future. - Performance: Enables a separate series [7] to swap-in whole mTHPs, which means we won't lose the TLB-efficiency benefits of mTHP once the memory has been through a swap cycle. I've done what I thought was the smallest change possible, and as a result, this approach is only employed when the swap is backed by a non-rotating block device (just as PMD-sized THP is supported today). Discussion against the RFC concluded that this is sufficient. Performance Testing =================== I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The VM is set up with a 35G block ram device as the swap device and the test is run from inside a memcg limited to 40G memory. I've then run `usemem` from vm-scalability with 70 processes, each allocating and writing 1G of memory. I've repeated everything 6 times and taken the mean performance improvement relative to 4K page baseline: | alloc size | baseline | + this series | | | mm-unstable (~v6.9-rc1) | | |:-----------|------------------------:|------------------------:| | 4K Page | 0.0% | 1.3% | | 64K THP | -13.6% | 46.3% | | 2M THP | 91.4% | 89.6% | So with this change, the 64K swap performance goes from a 14% regression to a 46% improvement. While 2M shows a small regression I'm confident that this is just noise. [1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/ [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/ [3] https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/ [4] https://lore.kernel.org/linux-mm/20240311150058.1122862-1-ryan.roberts@arm.com/ [5] https://lore.kernel.org/linux-mm/20240327144537.4165578-1-ryan.roberts@arm.com/ [6] https://lore.kernel.org/linux-mm/20240403114032.1162100-1-ryan.roberts@arm.com/ [7] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/ [8] https://lore.kernel.org/linux-mm/CAGsJ_4yMOow27WDvN2q=E4HAtDd2PJ=OQ5Pj9DG+6FLWwNuXUw@mail.gmail.com/ [9] https://lore.kernel.org/linux-mm/579d5127-c763-4001-9625-4563a9316ac3@redhat.com/ This patch (of 7): As preparation for supporting small-sized THP in the swap-out path, without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE, which, when present, always implies PMD-sized THP, which is the same as the cluster size. The only use of the flag was to determine whether a swap entry refers to a single page or a PMD-sized THP in swap_page_trans_huge_swapped(). Instead of relying on the flag, we now pass in order, which originates from the folio's order. This allows the logic to work for folios of any order. The one snag is that one of the swap_page_trans_huge_swapped() call sites does not have the folio. 
But it was only being called there to shortcut a call __try_to_reclaim_swap() in some cases. __try_to_reclaim_swap() gets the folio and (via some other functions) calls swap_page_trans_huge_swapped(). So I've removed the problematic call site and believe the new logic should be functionally equivalent. That said, removing the fast path means that we will take a reference and trylock a large folio much more often, which we would like to avoid. The next patch will solve this. Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster() which used to be called during folio splitting, since split_swap_cluster()'s only job was to remove the flag. Link: https://lkml.kernel.org/r/20240408183946.2991168-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240408183946.2991168-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: generate PAGE_IDLE_FLAG definitions  (Matthew Wilcox (Oracle))
If CONFIG_PAGE_IDLE_FLAG is not set, we can use FOLIO_FLAG_FALSE() to generate these definitions. Link: https://lkml.kernel.org/r/20240402201252.917342-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: remove page_idle and page_young wrappers  (Matthew Wilcox (Oracle))
All users have now been converted to the folio equivalents, so remove the page wrappers. Link: https://lkml.kernel.org/r/20240402201252.917342-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: init_mlocked_on_free_v3  (York Jasper Niebuhr)
Implements the "init_mlocked_on_free" boot option. When this boot option is enabled, any mlock'ed pages are zeroed on free. If the pages are munlock'ed beforehand, no initialization takes place. This boot option is meant to combat the performance hit of "init_on_free" as reported in commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options"). With "init_mlocked_on_free=1" only relevant data is freed while everything else is left untouched by the kernel. Correspondingly, this patch introduces no performance hit for unmapping non-mlock'ed memory. The unmapping overhead for purely mlocked memory was measured to be approximately 13%. Realistically, most systems mlock only a fraction of the total memory so the real-world system overhead should be close to zero. Optimally, userspace programs clear any key material or other confidential memory before exit and munlock the according memory regions. If a program crashes, userspace key managers fail to do this job. Accordingly, no munlock operations are performed so the data is caught and zeroed by the kernel. Should the program not crash, all memory will ideally be munlocked so no overhead is caused. CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON can be set to enable "init_mlocked_on_free" by default. Link: https://lkml.kernel.org/r/20240329145605.149917-1-yjnworkstation@gmail.com Signed-off-by: York Jasper Niebuhr <yjnworkstation@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: York Jasper Niebuhr <yjnworkstation@gmail.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm/ksm: fix ksm exec support for prctl  (Jinjiang Tu)
Patch series "mm/ksm: fix ksm exec support for prctl", v4. commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits MMF_VM_MERGE_ANY flag when a task calls execve(). However, it doesn't create the mm_slot, so ksmd will not try to scan this task. The first patch fixes the issue. The second patch refactors to prepare for the third patch. The third patch extends the selftests of ksm to verfity the deduplication really happens after fork/exec inherits ths KSM setting. This patch (of 3): commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits MMF_VM_MERGE_ANY flag when a task calls execve(). Howerver, it doesn't create the mm_slot, so ksmd will not try to scan this task. To fix it, allocate and add the mm_slot to ksm_mm_head in __bprm_mm_init() when the mm has MMF_VM_MERGE_ANY flag. Link: https://lkml.kernel.org/r/20240328111010.1502191-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20240328111010.1502191-2-tujinjiang@huawei.com Fixes: 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: take placement mappings gap into account  (Rick Edgecombe)
When memory is being placed, mmap() will take care to respect the guard gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap() needs to consider two things: 1. That the new mapping isn't placed in an any existing mappings guard gaps. 2. That the new mapping isn't placed such that any existing mappings are not in *its* guard gaps. The longstanding behavior of mmap() is to ensure 1, but not take any care around 2. So for example, if there is a PAGE_SIZE free area, and a mmap() with a PAGE_SIZE size, and a type that has a guard gap is being placed, mmap() may place the shadow stack in the PAGE_SIZE free area. Then the mapping that is supposed to have a guard gap will not have a gap to the adjacent VMA. For MAP_GROWSDOWN/VM_GROWSDOWN and MAP_GROWSUP/VM_GROWSUP this has not been a problem in practice because applications place these kinds of mappings very early, when there is not many mappings to find a space between. But for shadow stacks, they may be placed throughout the lifetime of the application. Use the start_gap field to find a space that includes the guard gap for the new mapping. Take care to not interfere with the alignment. Link: https://lkml.kernel.org/r/20240326021656.202649-12-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  thp: add thp_get_unmapped_area_vmflags()  (Rick Edgecombe)
When memory is being placed, mmap() will take care to respect the guard gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap() needs to consider two things: 1. That the new mapping isn't placed in an any existing mappings guard gaps. 2. That the new mapping isn't placed such that any existing mappings are not in *its* guard gaps. The longstanding behavior of mmap() is to ensure 1, but not take any care around 2. So for example, if there is a PAGE_SIZE free area, and a mmap() with a PAGE_SIZE size, and a type that has a guard gap is being placed, mmap() may place the shadow stack in the PAGE_SIZE free area. Then the mapping that is supposed to have a guard gap will not have a gap to the adjacent VMA. Add a THP implementations of the vm_flags variant of get_unmapped_area(). Future changes will call this from mmap.c in the do_mmap() path to allow shadow stacks to be placed with consideration taken for the start guard gap. Shadow stack memory is always private and anonymous and so special guard gap logic is not needed in a lot of caseis, but it can be mapped by THP, so needs to be handled. Link: https://lkml.kernel.org/r/20240326021656.202649-7-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: use get_unmapped_area_vmflags()  (Rick Edgecombe)
When memory is being placed, mmap() will take care to respect the guard gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap() needs to consider two things: 1. That the new mapping isn't placed in an any existing mappings guard gaps. 2. That the new mapping isn't placed such that any existing mappings are not in *its* guard gaps. The long standing behavior of mmap() is to ensure 1, but not take any care around 2. So for example, if there is a PAGE_SIZE free area, and a mmap() with a PAGE_SIZE size, and a type that has a guard gap is being placed, mmap() may place the shadow stack in the PAGE_SIZE free area. Then the mapping that is supposed to have a guard gap will not have a gap to the adjacent VMA. Use mm_get_unmapped_area_vmflags() in the do_mmap() so future changes can cause shadow stack mappings to be placed with a guard gap. Also use the THP variant that takes vm_flags, such that THP shadow stack can get the same treatment. Adjust the vm_flags calculation to happen earlier so that the vm_flags can be passed into __get_unmapped_area(). Link: https://lkml.kernel.org/r/20240326021656.202649-6-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: introduce arch_get_unmapped_area_vmflags()  (Rick Edgecombe)
When memory is being placed, mmap() will take care to respect the guard gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap() needs to consider two things: 1. That the new mapping isn't placed in an any existing mappings guard gaps. 2. That the new mapping isn't placed such that any existing mappings are not in *its* guard gaps. The longstanding behavior of mmap() is to ensure 1, but not take any care around 2. So for example, if there is a PAGE_SIZE free area, and a mmap() with a PAGE_SIZE size, and a type that has a guard gap is being placed, mmap() may place the shadow stack in the PAGE_SIZE free area. Then the mapping that is supposed to have a guard gap will not have a gap to the adjacent VMA. In order to take the start gap into account, the maple tree search needs to know the size of start gap the new mapping will need. The call chain from do_mmap() to the actual maple tree search looks like this: do_mmap(size, vm_flags, map_flags, ..) mm/mmap.c:get_unmapped_area(size, map_flags, ...) arch_get_unmapped_area(size, map_flags, ...) vm_unmapped_area(struct vm_unmapped_area_info) One option would be to add another MAP_ flag to mean a one page start gap (as is for shadow stack), but this consumes a flag unnecessarily. Another option could be to simply increase the size passed in do_mmap() by the start gap size, and adjust after the fact, but this will interfere with the alignment requirements passed in struct vm_unmapped_area_info, and unknown to mmap.c. Instead, introduce variants of arch_get_unmapped_area/_topdown() that take vm_flags. In future changes, these variants can be used in mmap.c:get_unmapped_area() to allow the vm_flags to be passed through to vm_unmapped_area(), while preserving the normal arch_get_unmapped_area/_topdown() for the existing callers. Link: https://lkml.kernel.org/r/20240326021656.202649-4-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: switch mm->get_unmapped_area() to a flag  (Rick Edgecombe)
The mm_struct contains a function pointer *get_unmapped_area(), which is set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() during the initialization of the mm. Since the function pointer only ever points to two functions that are named the same across all architectures, a function pointer is not really required. In addition, future changes will want to add versions of the functions that take additional arguments. So, to save a pointer's worth of bytes in mm_struct, and to prevent adding additional function pointers to mm_struct in future changes, remove it and keep the information about which get_unmapped_area() to use in a flag. Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing behavior, mm->get_unmapped_area() would get copied to the new mm in dup_mm(), so not clobbering the flag preserves the existing behavior around inheriting the topdown-ness. Introduce a helper, mm_get_unmapped_area(), to easily convert code that refers to the old function pointer to instead select and call either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the flag. Then drop the mm->get_unmapped_area() function pointer. Leave the get_unmapped_area() pointer in struct file_operations alone. The main purpose of this change is to reorganize in preparation for future changes, but it also converts the calls of mm->get_unmapped_area() from indirect branches into direct ones. The stress-ng bigheap benchmark calls realloc a lot, which calls through get_unmapped_area() in the kernel. On x86, the change yielded a ~1% improvement there on a retpoline config. In testing a few x86 configs, removing the pointer unfortunately didn't result in any actual size reductions in the compiled layout of mm_struct. But depending on compiler or arch alignment requirements, the change could shrink the size of mm_struct. Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
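A hedged sketch of the helper described above; the flag name MMF_TOPDOWN is inferred from the description and the real definition may differ in detail:

    unsigned long
    mm_get_unmapped_area(struct mm_struct *mm, struct file *file,
                         unsigned long addr, unsigned long len,
                         unsigned long pgoff, unsigned long flags)
    {
            /* a single mm flag replaces the old per-mm function pointer */
            if (test_bit(MMF_TOPDOWN, &mm->flags))
                    return arch_get_unmapped_area_topdown(file, addr, len,
                                                          pgoff, flags);
            return arch_get_unmapped_area(file, addr, len, pgoff, flags);
    }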
2024-04-25mm: remove __set_page_dirty_nobuffers()Kefeng Wang
There are no more callers of __set_page_dirty_nobuffers(), remove it. Link: https://lkml.kernel.org/r/20240327143008.3739435-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: remove "prot" parameter from move_pte()David Hildenbrand
The "prot" parameter is unused, and using it instead of what's stored in that particular PTE would very likely be wrong. Let's simply remove it. Link: https://lkml.kernel.org/r/20240327143301.741807-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andreas Larsson <andreas@gaisler.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: optimize CONFIG_PER_VMA_LOCK member placement in vm_area_structDavid Hildenbrand
Currently, we end up wasting some memory in each vm_area_struct. Pahole states that: [...] int vm_lock_seq; /* 40 4 */ /* XXX 4 bytes hole, try to pack */ struct vma_lock * vm_lock; /* 48 8 */ bool detached; /* 56 1 */ /* XXX 7 bytes hole, try to pack */ [...] Let's reduce the holes and memory wastage by moving the bool: [...] bool detached; /* 40 1 */ /* XXX 3 bytes hole, try to pack */ int vm_lock_seq; /* 44 4 */ struct vma_lock * vm_lock; /* 48 8 */ [...] This effectively shrinks vm_area_struct with CONFIG_PER_VMA_LOCK by 8 bytes. Likely, we could place "detached" in the lowest bit of vm_lock, but at least on 64-bit that won't really make a difference, so keep it simple. Link: https://lkml.kernel.org/r/20240327143548.744070-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25filemap: remove __set_page_dirty()Matthew Wilcox (Oracle)
All callers have been converted to use folios; remove this wrapper. Link: https://lkml.kernel.org/r/20240327185447.1076689-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: use rwsem assertion macros for mmap_lockMatthew Wilcox (Oracle)
This slightly strengthens our write assertion when lockdep is disabled. It also downgrades us from BUG_ON to WARN_ON, but I think that's an improvement. I don't think dumping the mm_struct was all that valuable; the call chain is what's important. Link: https://lkml.kernel.org/r/20240327190701.1082560-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
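A simplified sketch of what the resulting assertions could look like, assuming the rwsem assertion helpers named in the subject; the upstream code may differ in detail:

    static inline void mmap_assert_locked(const struct mm_struct *mm)
    {
            /* WARNs (rather than BUGs) if the lock is not held at all */
            rwsem_assert_held(&mm->mmap_lock);
    }

    static inline void mmap_assert_write_locked(const struct mm_struct *mm)
    {
            /* even without lockdep, checks the lock is held for *write* */
            rwsem_assert_held_write(&mm->mmap_lock);
    }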
2024-04-25mm: allow anon exclusive check over hugetlb tail pagesPeter Xu
PageAnonExclusive() used to forbid tail pages for hugetlbfs, as it used to be called mostly in hugetlb-specific paths and the head page was guaranteed. As we move forward towards merging hugetlb paths into generic mm, we may start to pass in tail hugetlb pages (e.g. with cont-pte/cont-pmd huge pages) for such a check. Allow it to properly fetch the head, in which case the anon-exclusiveness of the head will always represent the tail page. There's already a sign of this in GUP-fast, which already contains the hugetlb processing altogether: we used to have a specific commit 5805192c7b72 ("mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast") covering that area. Now with this more generic change, that can also go away. [akpm@linux-foundation.org: simplify PageAnonExclusive(), per Matthew] Link: https://lkml.kernel.org/r/Zg3u5Sh9EbbYPhaI@casper.infradead.org Link: https://lkml.kernel.org/r/20240403013249.1418299-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
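A simplified sketch of the tail-page handling described above; the real check lives in include/linux/page-flags.h and goes through the page-flags policy macros, so this is illustrative only:

    static __always_inline bool page_anon_exclusive(struct page *page)
    {
            /* hugetlb keeps PG_anon_exclusive on the head page only, so a
             * tail page is redirected to its head before testing the bit */
            if (PageHuge(page))
                    page = compound_head(page);
            return test_bit(PG_anon_exclusive, &page->flags);
    }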
2024-04-25mm/gup: handle hugetlb in the generic follow_page_mask codePeter Xu
Now follow_page() is ready to handle hugetlb pages in whatever form, across all architectures. Switch to the generic code path. Time to retire hugetlb_follow_page_mask(), following the previous retirement of follow_hugetlb_page() in 4849807114b8. There may be a slight difference in how the loops run when processing slow GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: with the patch applied, each iteration of __get_user_pages() will resolve one page table entry, rather than relying on the size of the hugetlb hstate, which may cover multiple entries in one iteration. A quick performance test on an aarch64 VM on an M1 chip shows a 15% degradation over a tight loop of slow GUP after the path is switched. That shouldn't be a problem because slow GUP should not be a hot path in general: when the page is present, fast GUP will already succeed, while when the page is indeed missing and requires a follow-up page fault, the slow-GUP degradation will probably be buried in the fault paths anyway. It also explains why slow GUP for THP used to be very slow before 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") landed; that observation is a side note rather than part of a performance analysis. If the performance becomes a concern, we can consider handling CONT_PTE in follow_page(). Until that is justified as necessary, keep everything clean and simple. Link: https://lkml.kernel.org/r/20240327152332.950956-14-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>