summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2024-02-25bdev: make bdev_{release, open_by_dev}() private to block layerChristian Brauner
Move both of them to the private block header. There's no caller in the tree anymore that uses them directly. Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-28-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25bdev: remove bdev_open_by_path()Christian Brauner
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-27-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25pktcdvd: port block device access to fileChristian Brauner
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-9-adbd023e19cc@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25swap: port block device usage to fileChristian Brauner
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-5-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25md: port block device access to fileChristian Brauner
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-4-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25bdev: open block device as filesChristian Brauner
Add two new helpers to allow opening block devices as files. This is not the final infrastructure. This still opens the block device before opening a struct a file. Until we have removed all references to struct bdev_handle we can't switch the order: * Introduce blk_to_file_flags() to translate from block specific to flags usable to pen a new file. * Introduce bdev_file_open_by_{dev,path}(). * Introduce temporary sb_bdev_handle() helper to retrieve a struct bdev_handle from a block device file and update places that directly reference struct bdev_handle to rely on it. * Don't count block device openes against the number of open files. A bdev_file_open_by_{dev,path}() file is never installed into any file descriptor table. One idea that came to mind was to use kernel_tmpfile_open() which would require us to pass a path and it would then call do_dentry_open() going through the regular fops->open::blkdev_open() path. But then we're back to the problem of routing block specific flags such as BLK_OPEN_RESTRICT_WRITES through the open path and would have to waste FMODE_* flags every time we add a new one. With this we can avoid using a flag bit and we have more leeway in how we open block devices from bdev_open_by_{dev,path}(). Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-1-adbd023e19cc@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25file: add alloc_file_pseudo_noaccount()Christian Brauner
When we open block devices as files we want to make sure to not charge them against the open file limit of the caller as that can cause spurious failures. Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-1-adbd023e19cc@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25soc: samsung: exynos-pmu: Add regmap support for SoCs that protect PMU regsPeter Griffin
Some Exynos based SoCs like Tensor gs101 protect the PMU registers for security hardening reasons so that they are only write accessible in el3 via an SMC call. As most Exynos drivers that need to write PMU registers currently obtain a regmap via syscon (phys, pinctrl, watchdog). Support for the above usecase is implemented in this driver using a custom regmap similar to syscon to handle the SMC call. Platforms that don't secure PMU registers, get a mmio regmap like before. As regmaps abstract out the underlying register access changes to the leaf drivers are minimal. A new API exynos_get_pmu_regmap_by_phandle() is provided for leaf drivers that currently use syscon_regmap_lookup_by_phandle(). This also handles deferred probing. Tested-by: Sam Protsenko <semen.protsenko@linaro.org> Tested-by: Alexey Klimov <alexey.klimov@linaro.org> Reviewed-by: Sam Protsenko <semen.protsenko@linaro.org> Signed-off-by: Peter Griffin <peter.griffin@linaro.org> Link: https://lore.kernel.org/r/20240220220613.797068-2-peter.griffin@linaro.org Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
2024-02-25kbuild: Use -fmin-function-alignment when availablePetr Pavlu
GCC recently added option -fmin-function-alignment, which should appear in GCC 14. Unlike -falign-functions, this option causes all functions to be aligned at the specified value, including the cold ones. In particular, when an arm64 kernel is built with DYNAMIC_FTRACE_WITH_CALL_OPS=y, the 8-byte function alignment is required for correct functionality. This was done by -falign-functions=8 and having workarounds in the kernel to force the compiler to follow this alignment. The new -fmin-function-alignment option directly guarantees it. Detect availability of -fmin-function-alignment and use it instead of -falign-functions when present. Introduce CC_HAS_SANE_FUNCTION_ALIGNMENT and enable __cold to work as expected when it is set. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-02-25procfs: make freeing proc_fs_info rcu-delayedAl Viro
makes proc_pid_ns() safe from rcu pathwalk (put_pid_ns() is still synchronous, but that's not a problem - it does rcu-delay everything that needs to be) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25nfs: fix UAF on pathwalk running into umountAl Viro
NFS ->d_revalidate(), ->permission() and ->get_link() need to access some parts of nfs_server when called in RCU mode: server->flags server->caps *(server->io_stats) and, worst of all, call server->nfs_client->rpc_ops->have_delegation (the last one - as NFS_PROTO(inode)->have_delegation()). We really don't want to RCU-delay the entire nfs_free_server() (it would have to be done with schedule_work() from RCU callback, since it can't be made to run from interrupt context), but actual freeing of nfs_server and ->io_stats can be done via call_rcu() just fine. nfs_client part is handled simply by making nfs_free_client() use kfree_rcu(). Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-24Merge tag 'iommu-fixes-v6.8-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu fixes from Joerg Roedel: - Intel VT-d fixes for nested domain handling: - Cache invalidation for changes in a parent domain - Dirty tracking setting for parent and nested domains - Fix a constant-out-of-range warning - ARM SMMU fixes: - Fix CD allocation from atomic context when using SVA with SMMUv3 - Revert the conversion of SMMUv2 to domain_alloc_paging(), as it breaks the boot for Qualcomm MSM8996 devices - Restore SVA handle sharing in core code as it turned out there are still drivers relying on it * tag 'iommu-fixes-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/sva: Restore SVA handle sharing iommu/arm-smmu-v3: Do not use GFP_KERNEL under as spinlock iommu/vt-d: Fix constant-out-of-range warning iommu/vt-d: Set SSADE when attaching to a parent with dirty tracking iommu/vt-d: Add missing dirty tracking set for parent domain iommu/vt-d: Wrap the dirty tracking loop to be a helper iommu/vt-d: Remove domain parameter for intel_pasid_setup_dirty_tracking() iommu/vt-d: Add missing device iotlb flush for parent domain iommu/vt-d: Update iotlb in nested domain attach iommu/vt-d: Add missing iotlb flush for parent domain iommu/vt-d: Add __iommu_flush_iotlb_psi() iommu/vt-d: Track nested domains in parent Revert "iommu/arm-smmu: Convert to domain_alloc_paging()"
2024-02-24Merge tag 'cxl-fixes-6.8-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull cxl fixes from Dan Williams: "A collection of significant fixes for the CXL subsystem. The largest change in this set, that bordered on "new development", is the fix for the fact that the location of the new qos_class attribute did not match the Documentation. The fix ends up deleting more code than it added, and it has a new unit test to backstop basic errors in this interface going forward. So the "red-diff" and unit test saved the "rip it out and try again" response. In contrast, the new notification path for firmware reported CXL errors (CXL CPER notifications) has a locking context bug that can not be fixed with a red-diff. Given where the release cycle stands, it is not comfortable to squeeze in that fix in these waning days. So, that receives the "back it out and try again later" treatment. There is a regression fix in the code that establishes memory NUMA nodes for platform CXL regions. That has an ack from x86 folks. There are a couple more fixups for Linux to understand (reassemble) CXL regions instantiated by platform firmware. The policy around platforms that do not match host-physical-address with system-physical-address (i.e. systems that have an address translation mechanism between the address range reported in the ACPI CEDT.CFMWS and endpoint decoders) has been softened to abort driver load rather than teardown the memory range (can cause system hangs). Lastly, there is a robustness / regression fix for cases where the driver would previously continue in the face of error, and a fixup for PCI error notification handling. Summary: - Fix NUMA initialization from ACPI CEDT.CFMWS - Fix region assembly failures due to async init order - Fix / simplify export of qos_class information - Fix cxl_acpi initialization vs single-window-init failures - Fix handling of repeated 'pci_channel_io_frozen' notifications - Workaround platforms that violate host-physical-address == system-physical address assumptions - Defer CXL CPER notification handling to v6.9" * tag 'cxl-fixes-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl/acpi: Fix load failures due to single window creation failure acpi/ghes: Remove CXL CPER notifications cxl/pci: Fix disabling memory if DVSEC CXL Range does not match a CFMWS window cxl/test: Add support for qos_class checking cxl: Fix sysfs export of qos_class for memdev cxl: Remove unnecessary type cast in cxl_qos_class_verify() cxl: Change 'struct cxl_memdev_state' *_perf_list to single 'struct cxl_dpa_perf' cxl/region: Allow out of order assembly of autodiscovered regions cxl/region: Handle endpoint decoders in cxl_region_find_decoder() x86/numa: Fix the sort compare func used in numa_fill_memblks() x86/numa: Fix the address overlap check in numa_fill_memblks() cxl/pci: Skip to handle RAS errors if CXL.mem device is detached
2024-02-24sched: Add a new function to compare if two cpus have the same capacityQais Yousef
The new helper function is needed to help blk-mq check if it needs to dispatch the softirq on another CPU to match the performance level the IO requester is running at. This is important on HMP systems where not all CPUs have the same compute capacity. Signed-off-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240223155749.2958009-2-qyousef@layalina.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-24mm: Introduce new flag to indicate wc safeAnkit Agrawal
The VM_ALLOW_ANY_UNCACHED flag is implemented for ARM64, allowing KVM stage 2 device mapping attributes to use NormalNC rather than DEVICE_nGnRE, which allows guest mappings supporting write-combining attributes (WC). ARM does not architecturally guarantee this is safe, and indeed some MMIO regions like the GICv2 VCPU interface can trigger uncontained faults if NormalNC is used. Even worse, the expectation is that there are platforms where even DEVICE_nGnRE can allow uncontained faults in corner cases. Unfortunately existing ARM IP requires platform integration to take responsibility to prevent this. To safely use VFIO in KVM the platform must guarantee full safety in the guest where no action taken against a MMIO mapping can trigger an uncontained failure. The assumption is that most VFIO PCI platforms support this for both mapping types, at least in common flows, based on some expectations of how PCI IP is integrated. This can be enabled more broadly, for instance into vfio-platform drivers, but only after the platform vendor completes auditing for safety. The VMA flag VM_ALLOW_ANY_UNCACHED was found to be the simplest and cleanest way to communicate the information from VFIO to KVM that mapping the region in S2 as NormalNC is safe. KVM consumes it to activate the code that does the S2 mapping as NormalNC. Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20240224150546.368-3-ankita@nvidia.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-24cpufreq: Remove references to 10ms min sampling ratePierre Gondois
A minimum sampling rate value of 10ms was introduced in: commit cef9615a853e ("[CPUFREQ] ondemand: Uncouple minimal sampling rate from HZ in NO_HZ case") The use of this value was removed in: commit ed4676e25463 ("cpufreq: Replace "max_transition_latency" with "dynamic_switching"") Remove: - a comment referencing this value - an unused macro associated to this value Signed-off-by: Pierre Gondois <pierre.gondois@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-02-23writeback: add a writeback iteratorChristoph Hellwig
Refactor the code left in write_cache_pages into an iterator that the file system can call to get the next folio for a writeback operation: struct folio *folio = NULL; while ((folio = writeback_iter(mapping, wbc, folio, &error))) { error = <do per-folio writeback>; } The twist here is that the error value is passed by reference, so that the iterator can restore it when breaking out of the loop. Handling of the magic AOP_WRITEPAGE_ACTIVATE value stays outside the iterator and needs is just kept in the write_cache_pages legacy wrapper. in preparation for eventually killing it off. Heavily based on a for_each* based iterator from Matthew Wilcox. Link: https://lkml.kernel.org/r/20240215063649.2164017-14-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23pagevec: add ability to iterate a queueMatthew Wilcox (Oracle)
Add a loop counter inside the folio_batch to let us iterate from 0-nr instead of decrementing nr and treating the batch as a stack. It would generate some very weird and suboptimal I/O patterns for page writeback to iterate over the batch as a stack. Link: https://lkml.kernel.org/r/20240215063649.2164017-11-hch@lst.de Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23writeback: factor writeback_get_batch() out of write_cache_pages()Matthew Wilcox (Oracle)
This simple helper will be the basis of the writeback iterator. To make this work, we need to remember the current index and end positions in writeback_control. [hch@lst.de: heavily rebased, add helpers to get the tag and end index, don't keep the end index in struct writeback_control] Link: https://lkml.kernel.org/r/20240215063649.2164017-9-hch@lst.de Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23writeback: remove a duplicate prototype for tag_pages_for_writebackMatthew Wilcox (Oracle)
[hch@lst.de: split from a larger patch] Link: https://lkml.kernel.org/r/20240215063649.2164017-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23mm/damon/core: implement PSI metric DAMOS quota goalSeongJae Park
Extend DAMOS quota goal metric with system wide memory pressure stall time. Specifically, the system level 'some' PSI for memory is used. The target value can be set in microseconds. DAMOS measures the increased amount of the PSI metric in last quota_reset_interval and use the ratio of it versus the user-specified target PSI value as the score for the auto-tuning feedback loop. Link: https://lkml.kernel.org/r/20240219194431.159606-14-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23mm/damon/core: support multiple metrics for quota goalSeongJae Park
DAMOS quota auto-tuning asks users to assess the current tuned quota and provide the feedback in a manual and repeated way. It allows users generate the feedback from a source that the kernel cannot access, and writing a script or a function for doing the manual and repeated feeding is not a big deal. However, additional works are additional works, and it could be more efficient if DAMOS could do the fetch itself, especially in case of DAMON sysfs interface use case, since it can avoid the context switches between the user-space and the kernel-space, though the overhead would be only trivial in most cases. Also in many cases, feedbacks could be made from kernel-accessible sources, such as PSI, CPU usage, etc. Make the quota goal to support multiple types of metrics including such ones. Link: https://lkml.kernel.org/r/20240219194431.159606-13-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23mm/damon/core: let goal specified with only target and current valuesSeongJae Park
DAMOS quota auto-tuning feature let users to set the goal by providing a function for getting the current score of the tuned quota. It allows flexible goal setup, but only simple user-set quota is currently being used. As a result, the only user of the DAMOS quota auto-tuning is using a silly void pointer casting based score value passing function. Simplify the interface and the user code by letting user directly set the target and the current value. Link: https://lkml.kernel.org/r/20240219194431.159606-12-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23mm/damon/core: remove ->goal field of damos_quotaSeongJae Park
DAMOS quota auto-tuning feature supports static signle goal and dynamic multiple goals via DAMON kernel API, specifically via ->goal and ->goals fields of damos_quota struct, respectively. All in-tree DAMOS kernel API users are using only the dynamic multiple goals now. Remove the unsued static single goal interface. Link: https://lkml.kernel.org/r/20240219194431.159606-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23mm/damon/core: add multiple goals per damos_quota and helpers for thoseSeongJae Park
The feedback-driven DAMOS quota auto-tuning feature allows only single goal to the DAMON kernel API users. The API users could implement multiple goals for the end-users on their level, and that's what DAMON sysfs interface is doing. More DAMON kernel API users such as DAMON_RECLAIM would need to do similar work. To reduce unnecessary future duplciated efforts, support multiple goals from DAMOS core layer. To make the support in minimum non-destructive change, keep the old single goal setup interface, and add multiple goals setup. The single goal will treated as one of the multiple goals, so old API users are not required to make any change. Link: https://lkml.kernel.org/r/20240219194431.159606-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23mm/damon/core: split out quota goal related fields to a structSeongJae Park
'struct damos_quota' is not small now. Split out fields for quota goal to a separate struct for easier reading. Link: https://lkml.kernel.org/r/20240219194431.159606-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23mm/damon: move comments and fields for damos-quota-prioritization to the endSeongJae Park
The comments and definition of 'struct damos_quota' lists a few fields for effective quota generation first, fields for regions prioritization under the quota, and then remaining fields for effective quota generation. Readers' should unnecesssarily switch their context in the middle. List all the fields for the effective quota first, and then fields for the prioritization for making it easier to read. Link: https://lkml.kernel.org/r/20240219194431.159606-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23mm/damon/core: set damos_quota->esz as public field and documentSeongJae Park
Patch series "mm/damon: let DAMOS feeds and tame/auto-tune itself". The Aim-oriented Feedback-driven DAMOS Aggressiveness Auto-tuning patchset[1] which has merged since commit 9294a037c015 ("mm/damon/core: implement goal-oriented feedback-driven quota auto-tuning") made the mechanism and the policy separated. That is, users can set a part of DAMOS control policies without a deep understanding of the mechanism but just their demands such as SLA. However, users are still required to do some additional work of manually collecting their target metric and feeding it to DAMOS. In the case of end-users who use DAMON sysfs interface, the context switches between user-space and kernel-space could also make it inefficient. The overhead is supposed to be only trivial in common cases, though. Meanwhile, in simple use cases, the target metric could be common system metrics that the kernel can efficiently self-retrieve, such as memory pressure stall time (PSI). Extend DAMOS quota auto-tuning to support multiple types of metrics including the DAMOS self-retrievable ones, and add support for memory pressure stall time metric. Different types of metrics can be supported in future. The auto-tuning capability is currently supported for only users of DAMOS kernel API and DAMON sysfs interface. Extend the support to DAMON_RECLAIM. Patches Sequence ================ First five patches are for helping debugging and fine-tuning existing quota control features. The first one (patch 1) exposes the effective quota that is made with given user inputs to DAMOS kernel API users and kernel-doc documents. Following four patches implement (patches 1, 2 and 3) and document (patches 4 and 5) a new DAMON sysfs file that exposes the value. Following six patches cleanup and simplify the existing DAMOS quota auto-tuning code by improving layout of comments and data structures (patches 6 and 7), supporting common use cases, namely multiple goals (patches 8, 9 and 10), and simplifying the interface (patch 11). Then six patches for the main purpose of this patchset follow. The first three changes extend the core logic for various target metrics (patch 12), implement memory pressure stall time-based target metric support (patch 13), and update DAMON sysfs interface to support the new target metric (patch 14). Then, documentation updates for the features on design (patch 15), ABI (patch 16), and usage (patch 17) follow. Last three patches add auto-tuning support on DAMON_RECLAIM. The patches implement DAMON_RECLAIM parameters for user-feedback driven quota auto-tuning (patch 18), memory pressure stall time-driven quota self-tuning (patch 19), and finally update the DAMON_RECLAIM usage document for the new parameters (patch 20). [1] https://lore.kernel.org/all/20231130023652.50284-1-sj@kernel.org/ This patch (of 20): DAMOS allow users to specify the quota as they want in multiple ways including time quota, size quota, and feedback-based auto-tuning. DAMOS makes one effective quota out of the inputs and use it at the end. Knowing the current effective quota helps understanding DAMOS' internal mechanism and fine-tuning quotas. DAMON kernel API users can get the information from ->esz field of damos_quota struct, but the field is marked as private purpose, and not kernel-doc documented. Make it public and document. Link: https://lkml.kernel.org/r/20240219194431.159606-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240219194431.159606-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23crash: split crash dumping code out from kexec_core.cBaoquan He
Currently, KEXEC_CORE select CRASH_CORE automatically because crash codes need be built in to avoid compiling error when building kexec code even though the crash dumping functionality is not enabled. E.g -------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC=y CONFIG_KEXEC_FILE=y --------------------- After splitting out crashkernel reservation code and vmcoreinfo exporting code, there's only crash related code left in kernel/crash_core.c. Now move crash related codes from kexec_core.c to crash_core.c and only build it in when CONFIG_CRASH_DUMP=y. And also wrap up crash codes inside CONFIG_CRASH_DUMP ifdeffery scope, or replace inappropriate CONFIG_KEXEC_CORE ifdef with CONFIG_CRASH_DUMP ifdef in generic kernel files. With these changes, crash_core codes are abstracted from kexec codes and can be disabled at all if only kexec reboot feature is wanted. Link: https://lkml.kernel.org/r/20240124051254.67105-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23crash: split vmcoreinfo exporting code out from crash_core.cBaoquan He
Now move the relevant codes into separate files: kernel/crash_reserve.c, include/linux/crash_reserve.h. And add config item CRASH_RESERVE to control its enabling. And also update the old ifdeffery of CONFIG_CRASH_CORE, including of <linux/crash_core.h> and config item dependency on CRASH_CORE accordingly. And also do renaming as follows: - arch/xxx/kernel/{crash_core.c => vmcore_info.c} because they are only related to vmcoreinfo exporting on x86, arm64, riscv. And also Remove config item CRASH_CORE, and rely on CONFIG_KEXEC_CORE to decide if build in crash_core.c. [yang.lee@linux.alibaba.com: remove duplicated include in vmcore_info.c] Link: https://lkml.kernel.org/r/20240126005744.16561-1-yang.lee@linux.alibaba.com Link: https://lkml.kernel.org/r/20240124051254.67105-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23kexec: split crashkernel reservation code out from crash_core.cBaoquan He
Patch series "Split crash out from kexec and clean up related config items", v3. Motivation: ============= Previously, LKP reported a building error. When investigating, it can't be resolved reasonablly with the present messy kdump config items. https://lore.kernel.org/oe-kbuild-all/202312182200.Ka7MzifQ-lkp@intel.com/ The kdump (crash dumping) related config items could causes confusions: Firstly, CRASH_CORE enables codes including - crashkernel reservation; - elfcorehdr updating; - vmcoreinfo exporting; - crash hotplug handling; Now fadump of powerpc, kcore dynamic debugging and kdump all selects CRASH_CORE, while fadump - fadump needs crashkernel parsing, vmcoreinfo exporting, and accessing global variable 'elfcorehdr_addr'; - kcore only needs vmcoreinfo exporting; - kdump needs all of the current kernel/crash_core.c. So only enabling PROC_CORE or FA_DUMP will enable CRASH_CORE, this mislead people that we enable crash dumping, actual it's not. Secondly, It's not reasonable to allow KEXEC_CORE select CRASH_CORE. Because KEXEC_CORE enables codes which allocate control pages, copy kexec/kdump segments, and prepare for switching. These codes are shared by both kexec reboot and kdump. We could want kexec reboot, but disable kdump. In that case, CRASH_CORE should not be selected. -------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC=y CONFIG_KEXEC_FILE=y --------------------- Thirdly, It's not reasonable to allow CRASH_DUMP select KEXEC_CORE. That could make KEXEC_CORE, CRASH_DUMP are enabled independently from KEXEC or KEXEC_FILE. However, w/o KEXEC or KEXEC_FILE, the KEXEC_CORE code built in doesn't make any sense because no kernel loading or switching will happen to utilize the KEXEC_CORE code. --------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_CRASH_DUMP=y --------------------- In this case, what is worse, on arch sh and arm, KEXEC relies on MMU, while CRASH_DUMP can still be enabled when !MMU, then compiling error is seen as the lkp test robot reported in above link. ------arch/sh/Kconfig------ config ARCH_SUPPORTS_KEXEC def_bool MMU config ARCH_SUPPORTS_CRASH_DUMP def_bool BROKEN_ON_SMP --------------------------- Changes: =========== 1, split out crash_reserve.c from crash_core.c; 2, split out vmcore_infoc. from crash_core.c; 3, move crash related codes in kexec_core.c into crash_core.c; 4, remove dependency of FA_DUMP on CRASH_DUMP; 5, clean up kdump related config items; 6, wrap up crash codes in crash related ifdefs on all 8 arch-es which support crash dumping, except of ppc; Achievement: =========== With above changes, I can rearrange the config item logic as below (the right item depends on or is selected by the left item): PROC_KCORE -----------> VMCORE_INFO |----------> VMCORE_INFO FA_DUMP----| |----------> CRASH_RESERVE ---->VMCORE_INFO / |---->CRASH_RESERVE KEXEC --| /| |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE KEXEC_FILE --| \ | \---->CRASH_HOTPLUG KEXEC --| |--> KEXEC_CORE (for kexec reboot only) KEXEC_FILE --| Test ======== On all 8 architectures, including x86_64, arm64, s390x, sh, arm, mips, riscv, loongarch, I did below three cases of config item setting and building all passed. Take configs on x86_64 as exampmle here: (1) Both CONFIG_KEXEC and KEXEC_FILE is unset, then all kexec/kdump items are unset automatically: # Kexec and crash features # CONFIG_KEXEC is not set # CONFIG_KEXEC_FILE is not set # end of Kexec and crash features (2) set CONFIG_KEXEC_FILE and 'make olddefconfig': --------------- # Kexec and crash features CONFIG_CRASH_RESERVE=y CONFIG_VMCORE_INFO=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC_FILE=y CONFIG_CRASH_DUMP=y CONFIG_CRASH_HOTPLUG=y CONFIG_CRASH_MAX_MEMORY_RANGES=8192 # end of Kexec and crash features --------------- (3) unset CONFIG_CRASH_DUMP in case 2 and execute 'make olddefconfig': ------------------------ # Kexec and crash features CONFIG_KEXEC_CORE=y CONFIG_KEXEC_FILE=y # end of Kexec and crash features ------------------------ Note: For ppc, it needs investigation to make clear how to split out crash code in arch folder. Hope Hari and Pingfan can help have a look, see if it's doable. Now, I make it either have both kexec and crash enabled, or disable both of them altogether. This patch (of 14): Both kdump and fa_dump of ppc rely on crashkernel reservation. Move the relevant codes into separate files: crash_reserve.c, include/linux/crash_reserve.h. And also add config item CRASH_RESERVE to control its enabling of the codes. And update config items which has relationship with crashkernel reservation. And also change ifdeffery from CONFIG_CRASH_CORE to CONFIG_CRASH_RESERVE when those scopes are only crashkernel reservation related. And also rename arch/XXX/include/asm/{crash_core.h => crash_reserve.h} on arm64, x86 and risc-v because those architectures' crash_core.h is only related to crashkernel reservation. [akpm@linux-foundation.org: s/CRASH_RESEERVE/CRASH_RESERVE/, per Klara Modin] Link: https://lkml.kernel.org/r/20240124051254.67105-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20240124051254.67105-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23mm/vmalloc: remove vmap_area_listBaoquan He
Earlier, vmap_area_list is exported to vmcoreinfo so that makedumpfile get the base address of vmalloc area. Now, vmap_area_list is empty, so export VMALLOC_START to vmcoreinfo instead, and remove vmap_area_list. [urezki@gmail.com: fix a warning in the crash_save_vmcoreinfo_init()] Link: https://lkml.kernel.org/r/20240111192329.449189-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20240102184633.748113-6-urezki@gmail.com Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kazuhito Hagio <k-hagio-ab@nec.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23mm,page_owner: maintain own list of stack_records structsOscar Salvador
page_owner needs to increment a stack_record refcount when a new allocation occurs, and decrement it on a free operation. In order to do that, we need to have a way to get a stack_record from a handle. Implement __stack_depot_get_stack_record() which just does that, and make it public so page_owner can use it. Also, traversing all stackdepot buckets comes with its own complexity, plus we would have to implement a way to mark only those stack_records that were originated from page_owner, as those are the ones we are interested in. For that reason, page_owner maintains its own list of stack_records, because traversing that list is faster than traversing all buckets while keeping at the same time a low complexity. For now, add to stack_list only the stack_records of dummy_handle and failure_handle, and set their refcount of 1. Further patches will add code to increment or decrement stack_records count on allocation and free operation. Link: https://lkml.kernel.org/r/20240215215907.20121-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23lib/stackdepot: move stack_record struct definition into the headerOscar Salvador
In order to move the heavy lifting into page_owner code, this one needs to have access to the stack_record structure, which right now sits in lib/stackdepot.c. Move it to the stackdepot.h header so page_owner can access stack_record's struct fields. Link: https://lkml.kernel.org/r/20240215215907.20121-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23merge mm-hotfixes-stable into mm-nonmm-stable to pick up stackdepot changesAndrew Morton
2024-02-23stackdepot: use variable size records for non-evictable entriesMarco Elver
With the introduction of stack depot evictions, each stack record is now fixed size, so that future reuse after an eviction can safely store differently sized stack traces. In all cases that do not make use of evictions, this wastes lots of space. Fix it by re-introducing variable size stack records (up to the max allowed size) for entries that will never be evicted. We know if an entry will never be evicted if the flag STACK_DEPOT_FLAG_GET is not provided, since a later stack_depot_put() attempt is undefined behavior. With my current kernel config that enables KASAN and also SLUB owner tracking, I observe (after a kernel boot) a whopping reduction of 296 stack depot pools, which translates into 4736 KiB saved. The savings here are from SLUB owner tracking only, because KASAN generic mode still uses refcounting. Before: pools: 893 allocations: 29841 frees: 6524 in_use: 23317 freelist_size: 3454 After: pools: 597 refcounted_allocations: 17547 refcounted_frees: 6477 refcounted_in_use: 11070 freelist_size: 3497 persistent_count: 12163 persistent_bytes: 1717008 [elver@google.com: fix -Wstringop-overflow warning] Link: https://lore.kernel.org/all/20240201135747.18eca98e@canb.auug.org.au/ Link: https://lkml.kernel.org/r/20240201090434.1762340-1-elver@google.com Link: https://lore.kernel.org/all/CABXGCsOzpRPZGg23QqJAzKnqkZPKzvieeg=W7sgjgi3q0pBo0g@mail.gmail.com/ Link: https://lkml.kernel.org/r/20240129100708.39460-1-elver@google.com Link: https://lore.kernel.org/all/CABXGCsOzpRPZGg23QqJAzKnqkZPKzvieeg=W7sgjgi3q0pBo0g@mail.gmail.com/ Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces") Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23net: mpls: error out if inner headers are not setFlorian Westphal
mpls_gso_segment() assumes skb_inner_network_header() returns a valid result: mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb); if (unlikely(!mpls_hlen || mpls_hlen % MPLS_HLEN)) goto out; if (unlikely(!pskb_may_pull(skb, mpls_hlen))) With syzbot reproducer, skb_inner_network_header() yields 0, skb_network_header() returns 108, so this will "pskb_may_pull(skb, -108)))" which triggers a newly added DEBUG_NET_WARN_ON_ONCE() check: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 5068 at include/linux/skbuff.h:2723 pskb_may_pull_reason include/linux/skbuff.h:2723 [inline] WARNING: CPU: 0 PID: 5068 at include/linux/skbuff.h:2723 pskb_may_pull include/linux/skbuff.h:2739 [inline] WARNING: CPU: 0 PID: 5068 at include/linux/skbuff.h:2723 mpls_gso_segment+0x773/0xaa0 net/mpls/mpls_gso.c:34 [..] skb_mac_gso_segment+0x383/0x740 net/core/gso.c:53 nsh_gso_segment+0x40a/0xad0 net/nsh/nsh.c:108 skb_mac_gso_segment+0x383/0x740 net/core/gso.c:53 __skb_gso_segment+0x324/0x4c0 net/core/gso.c:124 skb_gso_segment include/net/gso.h:83 [inline] [..] sch_direct_xmit+0x11a/0x5f0 net/sched/sch_generic.c:327 [..] packet_sendmsg+0x46a9/0x6130 net/packet/af_packet.c:3113 [..] First iteration of this patch made mpls_hlen signed and changed test to error out to "mpls_hlen <= 0 || ..". Eric Dumazet said: > I was thinking about adding a debug check in skb_inner_network_header() > if inner_network_header is zero (that would mean it is not 'set' yet), > but this would trigger even after your patch. So add new skb_inner_network_header_was_set() helper and use that. The syzbot reproducer injects data via packet socket. The skb that gets allocated and passed down the stack has ->protocol set to NSH (0x894f) and gso_type set to SKB_GSO_UDP | SKB_GSO_DODGY. This gets passed to skb_mac_gso_segment(), which sees NSH as ptype to find a callback for. nsh_gso_segment() retrieves next type: proto = tun_p_to_eth_p(nsh_hdr(skb)->np); ... which is MPLS (TUN_P_MPLS_UC). It updates skb->protocol and then calls mpls_gso_segment(). Inner offsets are all 0, so mpls_gso_segment() ends up with a negative header size. In case more callers rely on silent handling of such large may_pull values we could also 'legalize' this behaviour, either replacing the debug check with (len > INT_MAX) test or removing it and instead adding a comment before existing if (unlikely(len > skb->len)) return SKB_DROP_REASON_PKT_TOO_SMALL; test in pskb_may_pull_reason(), saying that this check also implicitly takes care of callers that miscompute header sizes. Cc: Simon Horman <horms@kernel.org> Fixes: 219eee9c0d16 ("net: skbuff: add overflow debug check to pull/push helpers") Reported-by: syzbot+99d15fcdb0132a1e1a82@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/00000000000043b1310611e388aa@google.com/raw Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240222140321.14080-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-23KVM: Get rid of return value from kvm_arch_create_vm_debugfs()Oliver Upton
The general expectation with debugfs is that any initialization failure is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows implementations to return an error and kvm_create_vm_debugfs() allows that to fail VM creation. Change to a void return to discourage architectures from making debugfs failures fatal for the VM. Seems like everyone already had the right idea, as all implementations already return 0 unconditionally. Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240216155941.2029458-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-23KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changingSean Christopherson
Retry page faults without acquiring mmu_lock, and without even faulting the page into the primary MMU, if the resolved gfn is covered by an active invalidation. Contending for mmu_lock is especially problematic on preemptible kernels as the mmu_notifier invalidation task will yield mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and ultimately increase the latency of resolving the page fault. And in the worst case scenario, yielding will be accompanied by a remote TLB flush, e.g. if the invalidation covers a large range of memory and vCPUs are accessing addresses that were already zapped. Faulting the page into the primary MMU is similarly problematic, as doing so may acquire locks that need to be taken for the invalidation to complete (the primary MMU has finer grained locks than KVM's MMU), and/or may cause unnecessary churn (getting/putting pages, marking them accessed, etc). Alternatively, the yielding issue could be mitigated by teaching KVM's MMU iterators to perform more work before yielding, but that wouldn't solve the lock contention and would negatively affect scenarios where a vCPU is trying to fault in an address that is NOT covered by the in-progress invalidation. Add a dedicated lockess version of the range-based retry check to avoid false positives on the sanity check on start+end WARN, and so that it's super obvious that checking for a racing invalidation without holding mmu_lock is unsafe (though obviously useful). Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking invalidation in a loop won't put KVM into an infinite loop, e.g. due to caching the in-progress flag and never seeing it go to '0'. Force a load of mmu_invalidate_seq as well, even though it isn't strictly necessary to avoid an infinite loop, as doing so improves the probability that KVM will detect an invalidation that already completed before acquiring mmu_lock and bailing anyways. Do the pre-check even for non-preemptible kernels, as waiting to detect the invalidation until mmu_lock is held guarantees the vCPU will observe the worst case latency in terms of handling the fault, and can generate even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock, detect retry, drop mmu_lock, re-enter the guest, retake the fault, and eventually re-acquire mmu_lock. This behavior is also why there are no new starvation issues due to losing the fairness guarantees provided by rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting on mmu_lock doesn't guarantee forward progress in the face of _another_ mmu_notifier invalidation event. Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE() may generate a load into a register instead of doing a direct comparison (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost is a few bytes of code and maaaaybe a cycle or three. Reported-by: Yan Zhao <yan.y.zhao@intel.com> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com Reported-by: Friedrich Weber <f.weber@proxmox.com> Cc: Kai Huang <kai.huang@intel.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Yuan Yao <yuan.yao@linux.intel.com> Cc: Xu Yilun <yilun.xu@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20240222012640.2820927-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23Merge tag 'mm-hotfixes-stable-2024-02-22-15-02' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "A batch of MM (and one non-MM) hotfixes. Ten are cc:stable and the remainder address post-6.7 issues or aren't considered appropriate for backporting" * tag 'mm-hotfixes-stable-2024-02-22-15-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: kasan: guard release_free_meta() shadow access with kasan_arch_is_ready() mm/damon/lru_sort: fix quota status loss due to online tunings mm/damon/reclaim: fix quota stauts loss due to online tunings MAINTAINERS: mailmap: update Shakeel's email address mm/damon/sysfs-schemes: handle schemes sysfs dir removal before commit_schemes_quota_goals mm: memcontrol: clarify swapaccount=0 deprecation warning mm/memblock: add MEMBLOCK_RSRV_NOINIT into flagname[] array mm/zswap: invalidate duplicate entry when !zswap_enabled lib/Kconfig.debug: TEST_IOV_ITER depends on MMU mm/swap: fix race when skipping swapcache mm/swap_state: update zswap LRU's protection range with the folio locked selftests/mm: uffd-unit-test check if huge page size is 0 mm/damon/core: check apply interval in damon_do_apply_schemes() mm: zswap: fix missing folio cleanup in writeback race path
2024-02-23thermal: core: Store zone ops in struct thermal_zone_deviceRafael J. Wysocki
The current code requires thermal zone creators to pass pointers to writable ops structures to thermal_zone_device_register_with_trips() which needs to modify the target struct thermal_zone_device_ops object if the "critical" operation in it is NULL. Moreover, the callers of thermal_zone_device_register_with_trips() are required to hold on to the struct thermal_zone_device_ops object passed to it until the given thermal zone is unregistered. Both of these requirements are quite inconvenient, so modify struct thermal_zone_device to contain struct thermal_zone_device_ops as field and make thermal_zone_device_register_with_trips() copy the contents of the struct thermal_zone_device_ops passed to it via a pointer (which can be const now) to that field. Also adjust the code using thermal zone ops accordingly and modify thermal_of_zone_register() to use a local ops variable during thermal zone registration so ops do not need to be freed in thermal_of_zone_unregister() any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2024-02-23thermal: core: Store zone trips table in struct thermal_zone_deviceRafael J. Wysocki
The current code expects thermal zone creators to pass a pointer to a writable trips table to thermal_zone_device_register_with_trips() and that trips table is then used by the thermal core going forward. Consequently, the callers of thermal_zone_device_register_with_trips() are required to hold on to the trips table passed to it until the given thermal zone is unregistered, at which point the trips table can be freed, but at the same time they are not expected to access that table directly. This is both error prone and confusing. To address it, turn the trips table pointer in struct thermal_zone_device into a flex array (counted by its num_trips field), allocate it during thermal zone device allocation and copy the contents of the trips table supplied by the zone creator (which can be const now) into it, which will allow the callers of thermal_zone_device_register_with_trips() to drop their trip tables right after the zone registration. This requires the imx thermal driver to be adjusted to store the new temperature in its internal trips table in imx_set_trip_temp(), because it will be separate from the core's trips table now and it has to be explicitly kept in sync with the latter. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2024-02-23Merge branch 'thermal-core'Rafael J. Wysocki
Merge thermal core changes for 6.9: - Minor fixes for thermal governors (Rafael J. Wysocki, Di Shen). - Trip point handling fixes for the iwlwifi wireless driver (Rafael J. Wysocki). - Code cleanups (Rafael J. Wysocki, AngeloGioacchino Del Regno). * thermal-tmp: thermal: gov_power_allocator: Avoid overwriting PID coefficients from setup time thermal: sysfs: Fix up white space in trip_point_temp_store() iwlwifi: mvm: Use for_each_thermal_trip() for walking trip points iwlwifi: mvm: Populate trip table before registering thermal zone iwlwifi: mvm: Drop unused fw_trips_index[] from iwl_mvm_thermal_device thermal: core: Change governor name to const char pointer thermal: gov_bang_bang: Fix possible cooling device state ping-pong thermal: gov_fair_share: Fix dependency on trip points ordering
2024-02-23mfd: twl-core: Add power off implementation for twl603xAndreas Kemnade
If the system-power-controller property is there, enable power off. Implementation is based on a Linux v3.0 vendor kernel. Signed-off-by: Andreas Kemnade <andreas@kemnade.info> Link: https://lore.kernel.org/r/20240217082007.3238948-3-andreas@kemnade.info Signed-off-by: Lee Jones <lee@kernel.org>
2024-02-23mfd: syscon: Remove extern from function prototypesPeter Griffin
The kernel coding style does not require 'extern' in function prototypes in .h files, so remove them as they are not needed. To avoid checkpatch warnings such as CHECK: Lines should not end with a '(' +struct regmap *syscon_regmap_lookup_by_phandle( The indentation is also updated. No functional changes in this patch. Signed-off-by: Peter Griffin <peter.griffin@linaro.org> Link: https://lore.kernel.org/r/20240220115012.471689-3-peter.griffin@linaro.org Signed-off-by: Lee Jones <lee@kernel.org>
2024-02-23iommu/sva: Restore SVA handle sharingJason Gunthorpe
Prior to commit 092edaddb660 ("iommu: Support mm PASID 1:n with sva domains") the code allowed a SVA handle to be bound multiple times to the same (mm, device) pair. This was alluded to in the kdoc comment, but we had understood this to be more a remark about allowing multiple devices, not a literal same-driver re-opening the same SVA. It turns out uacce and idxd were both relying on the core code to handle reference counting for same-device same-mm scenarios. As this looks hard to resolve in the drivers bring it back to the core code. The new design has changed the meaning of the domain->users refcount to refer to the number of devices that are sharing that domain for the same mm. This is part of the design to lift the SVA domain de-duplication out of the drivers. Return the old behavior by explicitly de-duplicating the struct iommu_sva handle. The same (mm, device) will return the same handle pointer and the core code will handle tracking this. The last unbind of the handle will destroy it. Fixes: 092edaddb660 ("iommu: Support mm PASID 1:n with sva domains") Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org> Closes: https://lore.kernel.org/all/20240221110658.529-1-zhangfei.gao@linaro.org/ Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/0-v1-9455fc497a6f+3b4-iommu_sva_sharing_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-02-23mfd: cs42l43: Tidy up header includesCharles Keepax
Use more forward declarations, move header guards to cover other includes, and rely less on including headers through other headers. Suggested-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20240129152557.3221212-2-ckeepax@opensource.cirrus.com Signed-off-by: Lee Jones <lee@kernel.org>
2024-02-23mfd: sun4i-gpadc: Correct specified GPADC interrupt numbersFuyao Kashizuku
The identifiers are used as IRQ resource numbers, where 0 is treated specially. This fixes sun4i-gpadc-iio probe failed when request irq. The backstack: WARNING: CPU: 3 PID: 1 at drivers/base/platform.c:451 __platform_get_irq_byname+0xb8/0xc4 0 is an invalid IRQ number Modules linked in: CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.7.0-rc6 #9 Hardware name: Allwinner sun8i Family unwind_backtrace show_stack dump_stack_lvl __warn warn_slowpath_fmt __platform_get_irq_byname platform_get_irq_byname sun4i_irq_init sun4i_gpadc_probe platform_probe really_probe __driver_probe_device driver_probe_device __driver_attach bus_for_each_dev bus_add_driver driver_register do_one_initcall do_initcall_level do_initcalls kernel_init_freeable kernel_init Log reports: sun4i-gpadc-iio sun6i-a31-gpadc-iio.0: error -EINVAL: IRQ FIFO_DATA_PENDING not found sun4i-gpadc-iio: probe of sun6i-a31-gpadc-iio.0 failed with error -22 Signed-off-by: Fuyao Kashizuku <fuyao@sjterm.com> Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com> Link: https://lore.kernel.org/r/ZYuFbUUus9apiCpq@debian.cyg Signed-off-by: Lee Jones <lee@kernel.org>
2024-02-23phy: constify of_phandle_args in xlateKrzysztof Kozlowski
The xlate callbacks are supposed to translate of_phandle_args to proper provider without modifying the of_phandle_args. Make the argument pointer to const for code safety and readability. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Acked-by: Thierry Reding <treding@nvidia.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> #Broadcom Link: https://lore.kernel.org/r/20240217093937.58234-1-krzysztof.kozlowski@linaro.org Signed-off-by: Vinod Koul <vkoul@kernel.org>
2024-02-23Merge tag 'irq-for-riscv-02-23-24' into irq/msiThomas Gleixner
Pick up RISCV INTC changes to handle conflicts with the AIA updates. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>