Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"25 hotfixes, mainly for MM. 13 are cc:stable"
* tag 'mm-hotfixes-stable-2023-02-02-19-24-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (26 commits)
mm: memcg: fix NULL pointer in mem_cgroup_track_foreign_dirty_slowpath()
Kconfig.debug: fix the help description in SCHED_DEBUG
mm/swapfile: add cond_resched() in get_swap_pages()
mm: use stack_depot_early_init for kmemleak
Squashfs: fix handling and sanity checking of xattr_ids count
sh: define RUNTIME_DISCARD_EXIT
highmem: round down the address passed to kunmap_flush_on_unmap()
migrate: hugetlb: check for hugetlb shared PMD in node migration
mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
Revert "mm: kmemleak: alloc gray object for reserved region with direct map"
freevxfs: Kconfig: fix spelling
maple_tree: should get pivots boundary by type
.mailmap: update e-mail address for Eugen Hristev
mm, mremap: fix mremap() expanding for vma's with vm_ops->close()
squashfs: harden sanity check in squashfs_read_xattr_id_table
ia64: fix build error due to switch case label appearing next to declaration
mm: multi-gen LRU: fix crash during cgroup migration
Revert "mm: add nodes= arg to memory.reclaim"
zsmalloc: fix a race with deferred_handles storing
...
|
|
We currently pass a minimum major version to the generic EFI helper that
checks the system table magic and version, and refuse to boot if the
value is lower.
The motivation for this check is unknown, and even the code that uses
major version 2 as the minimum (ARM, arm64 and RISC-V) should make it
past this check without problems, and boot to a point where we have
access to a console or some other means to inform the user that the
firmware's major revision number made us unhappy. (Revision 2.0 of the
UEFI specification was released in January 2006, whereas ARM, arm64 and
RISC-V support where added in 2009, 2013 and 2017, respectively, so
checking for major version 2 or higher is completely arbitrary)
So just drop the check.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
A small wrapper around bvec_set_page for callers that have a virtual
address.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A smaller wrapper around bvec_set_page that takes a folio instead.
There are only two potential users for this in the tree, but the number
will grow in the future.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a helper to initialize a bvec based of a page pointer. This will help
removing various open code bvec initializations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
cgroup information only makes sense on a live gendisk that allows
file system I/O (which includes the raw block device). So move over
the cgroup related members.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Switch from a request_queue pointer and reference to a gendisk once
for the throttle information in struct task_struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Link: https://lore.kernel.org/r/20230203150400.3199230-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/matthias.bgg/linux into soc/drivers
Introduce MediaTek regulator coupler driver to ensure that the SRAM
voltage in par with the GPU voltage. This allows for a stable use of the
GPU.
mtk-mutex:
- add support for MT8188 vdosys0 path
- allow it to be build as module
- add support for MT8195 vdosys1 path
mmsys:
- add MT8188 vdosys0 path
- allow to be build as a module
- add MT8195 vdosys1 path
- add support for CMDQ
- allow for up to 64 reset bits
- add supprot for the MT8195 vppsys[0,1] pathes
pm-domains:
- keep power for the MT8186 ADSP on by default
- add support for MT8188
- add support for buck isolation needed in specific pm-domains for
MT8188 and MT8192
mtk-svs:
- enable IRQ later to allow using kexec
- several improvments on the code base
- fix modalias
pmic wrapper:
- convert binding to yaml. As this is thightly coupled to the MT6357
PMIC, I took patches regarding it as well.
* tag 'v6.2-next-soc' of https://git.kernel.org/pub/scm/linux/kernel/git/matthias.bgg/linux: (41 commits)
soc: mediatek: mtk-svs: add missing MODULE_DEVICE_TABLE
soc: mediatek: mtk-devapc: Switch to devm_clk_get_enabled()
soc: mtk-svs: mt8183: refactor o_slope calculation
soc: mediatek: mtk-svs: delete superfluous platform data entries
soc: mediatek: mtk-svs: move svs_platform_probe into probe
soc: mediatek: mtk-svs: improve readability of platform_probe
soc: mediatek: mtk-svs: clean up platform probing
soc: mediatek: mtk-svs: keep svs alive if CONFIG_DEBUG_FS not supported
soc: mediatek: mtk-svs: Use pm_runtime_resume_and_get() in svs_init01()
soc: mediatek: mtk-svs: reset svs when svs_resume() fail
soc: mediatek: mtk-svs: restore default voltages when svs_init02() fail
soc: mediatek: mmsys: add support for MT8195 VPPSYS
dt-bindings: arm: mediatek: mmsys: Add support for MT8195 VPPSYS
soc: mediatek: Introduce mediatek-regulator-coupler driver
soc: mediatek: mtk-svs: Enable the IRQ later
soc: mediatek: add mtk-mutex support for mt8195 vdosys1
soc: mediatek: add mtk-mutex component - dp_intf1
soc: mediatek: mmsys: add reset control for MT8195 vdosys1
soc: mediatek: mmsys: add mmsys for support 64 reset bits
soc: mediatek: add cmdq support of mtk-mmsys config API for mt8195 vdosys1
...
Link: https://lore.kernel.org/r/396d51fc-81f3-4a2b-d7a7-b966bfe3002a@gmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Josh reported a bug:
When the object to be patched is a module, and that module is
rmmod'ed and reloaded, it fails to load with:
module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 2, loc 00000000ba0302e9, val ffffffffa03e293c
livepatch: failed to initialize patch 'livepatch_nfsd' for module 'nfsd' (-8)
livepatch: patch 'livepatch_nfsd' failed for module 'nfsd', refusing to load module 'nfsd'
The livepatch module has a relocation which references a symbol
in the _previous_ loading of nfsd. When apply_relocate_add()
tries to replace the old relocation with a new one, it sees that
the previous one is nonzero and it errors out.
He also proposed three different solutions. We could remove the error
check in apply_relocate_add() introduced by commit eda9cec4c9a1
("x86/module: Detect and skip invalid relocations"). However the check
is useful for detecting corrupted modules.
We could also deny the patched modules to be removed. If it proved to be
a major drawback for users, we could still implement a different
approach. The solution would also complicate the existing code a lot.
We thus decided to reverse the relocation patching (clear all relocation
targets on x86_64). The solution is not
universal and is too much arch-specific, but it may prove to be simpler
in the end.
Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
Originally-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Joe Lawrence <joe.lawrence@redhat.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230125185401.279042-2-song@kernel.org
|
|
The perf subsystem assumes that all counters are by default per-CPU. So
the user space tool reads a counter from each CPU. However, the IOMMU
counters are system-wide and can be read from any CPU. Here we use a CPU
mask to restrict counting to one CPU to handle the issue. (with CPU
hotplug notifier to choose a different CPU if the chosen one is taken
off-line).
The CPU is exposed to /sys/bus/event_source/devices/dmar*/cpumask for
the user space perf tool.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20230128200428.1459118-6-kan.liang@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
A new field, which indicates the size of the remapping hardware register
set for this remapping unit, is introduced in the DMA-remapping hardware
unit definition (DRHD) structure with the VT-d Spec 4.0. With this
information, SW doesn't need to 'guess' the size of the register set
anymore.
Update the struct acpi_dmar_hardware_unit to reflect the field. Store the
size of the register set in struct dmar_drhd_unit for each dmar device.
The 'size' information is ResvZ for the old BIOS and platforms. Fall back
to the old guessing method. There is nothing changed.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20230128200428.1459118-2-kan.liang@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
There's no need to have a public header for Intel SVA implementation.
The device driver should interact with Intel SVA implementation via
the IOMMU generic APIs.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20230109014955.147068-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Immutable branch from LEDs due for the v6.3 merge window
|
|
kexec allows replacing the current kernel with a different one. This is
usually a source of concerns for sysadmins that want to harden a system.
Linux already provides a way to disable loading new kexec kernel via
kexec_load_disabled, but that control is very coard, it is all or nothing
and does not make distinction between a panic kexec and a normal kexec.
This patch introduces new sysctl parameters, with finer tuning to specify
how many times a kexec kernel can be loaded. The sysadmin can set
different limits for kexec panic and kexec reboot kernels. The value can
be modified at runtime via sysctl, but only with a stricter value.
With these new parameters on place, a system with loadpin and verity
enabled, using the following kernel parameters:
sysctl.kexec_load_limit_reboot=0 sysct.kexec_load_limit_panic=1 can have a
good warranty that if initrd tries to load a panic kernel, a malitious
user will have small chances to replace that kernel with a different one,
even if they can trigger timeouts on the disk where the panic kernel
lives.
Link: https://lkml.kernel.org/r/20221114-disable-kexec-reset-v6-3-6a8531a09b9a@chromium.org
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Both syscalls (kexec and kexec_file) do the same check, let's factor it
out.
Link: https://lkml.kernel.org/r/20221114-disable-kexec-reset-v6-2-6a8531a09b9a@chromium.org
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Guilherme G. Piccoli <gpiccoli@igalia.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The header is the direct user of definitions from the math.h, include it.
Link: https://lkml.kernel.org/r/20230103121937.32085-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The percpu interface is supposed to be preempt and irq safe.
But:
The uniprocessor implementation of percpu_counter_add() is not irq safe:
if an interrupt happens during the +=, then the result is undefined.
Therefore: switch from preempt_disable() to local_irq_save().
This prevents interrupts from interrupting the +=, and as a side effect
prevents preemption.
Link: https://lkml.kernel.org/r/20221216150441.200533-2-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: "Sun, Jiebin" <jiebin.sun@intel.com>
Cc: <1vier1@web.de>
Cc: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "error-injection: Clarify the requirements of error
injectable functions".
Patches for clarifying the requirement of error injectable functions and
to remove the confusing EI_ETYPE_NONE.
This patch (of 2):
Since the EI_ETYPE_NONE is confusing type, replace it with appropriate
errno. The EI_ETYPE_NONE has been introduced for a dummy (error) value,
but it can mislead people that they can use ALLOW_ERROR_INJECTION(func,
NONE). So remove it from the EI_ETYPE and use appropriate errno instead.
[akpm@linux-foundation.org: include/linux/error-injection.h needs errno.h]
Link: https://lkml.kernel.org/r/167081319306.387937.10079195394503045678.stgit@devnote3
Link: https://lkml.kernel.org/r/167081320421.387937.4259807348852421112.stgit@devnote3
Fixes: 663faf9f7bee ("error-injection: Add injectable error types")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Florent Revest <revest@chromium.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Convert writepage_t to use a folio".
More folioisation. I split out the mpage work from everything else
because it completely dominated the patch, but some implementations I just
converted outright.
This patch (of 2):
We always write back an entire folio, but that's currently passed as the
head page. Convert all filesystems that use write_cache_pages() to expect
a folio instead of a page.
Link: https://lkml.kernel.org/r/20230126201255.1681189-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230126201255.1681189-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This is the equivalent of memcpy_from_page(). It differs in that it takes
the position in a file instead of offset in a folio, it accepts the total
number of bytes to be copied (instead of the number of bytes to be copied
from this folio) and it returns how many bytes were copied from the folio,
rather than making the caller calculate that and then checking if the
caller got it right.
[akpm@linux-foundation.org: fix typo in comment]
Link: https://lkml.kernel.org/r/20230126201552.1681588-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The ->rw_page method is a special purpose bypass of the usual bio handling
path that is limited to single-page reads and writes and synchronous which
causes a lot of extra code in the drivers, callers and the block layer.
The only remaining user is the MM swap code. Switch that swap code to
simply submit a single-vec on-stack bio an synchronously wait on it based
on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers that
currently implement ->rw_page instead. While this touches one extra cache
line and executes extra code, it simplifies the block layer and drivers
and ensures that all feastures are properly supported by all drivers, e.g.
right now ->rw_page bypassed cgroup writeback entirely.
[akpm@linux-foundation.org: fix comment typo, per Dan]
Link: https://lkml.kernel.org/r/20230125133436.447864-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Introduce per NUMA node memory error statistics", v2.
Background
==========
In the RFC for Kernel Support of Memory Error Detection [1], one advantage
of software-based scanning over hardware patrol scrubber is the ability to
make statistics visible to system administrators. The statistics include
2 categories:
* Memory error statistics, for example, how many memory error are
encountered, how many of them are recovered by the kernel. Note these
memory errors are non-fatal to kernel: during the machine check
exception (MCE) handling kernel already classified MCE's severity to be
unnecessary to panic (but either action required or optional).
* Scanner statistics, for example how many times the scanner have fully
scanned a NUMA node, how many errors are first detected by the scanner.
The memory error statistics are useful to userspace and actually not
specific to scanner detected memory errors, and are the focus of this
patchset.
Motivation
==========
Memory error stats are important to userspace but insufficient in kernel
today. Datacenter administrators can better monitor a machine's memory
health with the visible stats. For example, while memory errors are
inevitable on servers with 10+ TB memory, starting server maintenance when
there are only 1~2 recovered memory errors could be overreacting; in cloud
production environment maintenance usually means live migrate all the
workload running on the server and this usually causes nontrivial
disruption to the customer. Providing insight into the scope of memory
errors on a system helps to determine the appropriate follow-up action.
In addition, the kernel's existing memory error stats need to be
standardized so that userspace can reliably count on their usefulness.
Today kernel provides following memory error info to userspace, but they
are not sufficient or have disadvantages:
* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
not per NUMA node stats though
* ras:memory_failure_event: only available after explicitly enabled
* /dev/mcelog provides many useful info about the MCEs, but doesn't
capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text
Exposing memory error stats is also a good start for the in-kernel memory
error detector. Today the data source of memory error stats are either
direct memory error consumption, or hardware patrol scrubber detection
(either signaled as UCNA or SRAO). Once in-kernel memory scanner is
implemented, it will be the main source as it is usually configured to
scan memory DIMMs constantly and faster than hardware patrol scrubber.
How Implemented
===============
As Naoya pointed out [2], exposing memory error statistics to userspace is
useful independent of software or hardware scanner. Therefore we
implement the memory error statistics independent of the in-kernel memory
error detector. It exposes the following per NUMA node memory error
counters:
/sys/devices/system/node/node${X}/memory_failure/total
/sys/devices/system/node/node${X}/memory_failure/recovered
/sys/devices/system/node/node${X}/memory_failure/ignored
/sys/devices/system/node/node${X}/memory_failure/failed
/sys/devices/system/node/node${X}/memory_failure/delayed
These counters describe how many raw pages are poisoned and after the
attempted recoveries by the kernel, their resolutions: how many are
recovered, ignored, failed, or delayed respectively. This approach can be
easier to extend for future use cases than /proc/meminfo, trace event, and
log. The following math holds for the statistics:
* total = recovered + ignored + failed + delayed
These memory error stats are reset during machine boot.
The 1st commit introduces these sysfs entries. The 2nd commit populates
memory error stats every time memory_failure attempts memory error
recovery. The 3rd commit adds documentations for introduced stats.
[1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af
[2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6
This patch (of 3):
Today kernel provides following memory error info to userspace, but each
has its own disadvantage
* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
not per NUMA node stats though
* ras:memory_failure_event: only available after explicitly enabled
* /dev/mcelog provides many useful info about the MCEs, but
doesn't capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text
Exposes per NUMA node memory error stats as sysfs entries:
/sys/devices/system/node/node${X}/memory_failure/total
/sys/devices/system/node/node${X}/memory_failure/recovered
/sys/devices/system/node/node${X}/memory_failure/ignored
/sys/devices/system/node/node${X}/memory_failure/failed
/sys/devices/system/node/node${X}/memory_failure/delayed
These counters describe how many raw pages are poisoned and after the
attempted recoveries by the kernel, their resolutions: how many are
recovered, ignored, failed, or delayed respectively. The following math
holds for the statistics:
* total = recovered + ignored + failed + delayed
Link: https://lkml.kernel.org/r/20230120034622.2698268-1-jiaqiyan@google.com
Link: https://lkml.kernel.org/r/20230120034622.2698268-2-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Move memcg LRU code into a dedicated section. Improve the design doc to
outline its architecture.
Link: https://lkml.kernel.org/r/20230118001827.1040870-5-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: misc fixes".
This patchset contains three miscellaneous simple fixes for DAMON online
tuning.
This patch (of 3):
Commit cbeaa77b0449 ("mm/damon/core: use a dedicated struct for monitoring
attributes") moved monitoring intervals from damon_ctx to a new struct,
damon_attrs, but a comment in the header file has not updated for the
change. Update it.
Link: https://lkml.kernel.org/r/20230119013831.1911-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20230119013831.1911-2-sj@kernel.org
Fixes: cbeaa77b0449 ("mm/damon/core: use a dedicated struct for monitoring attributes")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)",
v2.
The background to this is that systemd has a configuration option called
MemoryDenyWriteExecute [2], implemented as a SECCOMP BPF filter. Its aim
is to prevent a user task from inadvertently creating an executable
mapping that is (or was) writeable. Since such BPF filter is stateless,
it cannot detect mappings that were previously writeable but subsequently
changed to read-only. Therefore the filter simply rejects any
mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI support
(Branch Target Identification), the dynamic loader cannot change an ELF
section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). For
libraries, it can resort to unmapping and re-mapping but for the main
executable it does not have a file descriptor. The original bug report in
the Red Hat bugzilla - [3] - and subsequent glibc workaround for libraries
- [4].
This series adds in-kernel support for this feature as a prctl
PR_SET_MDWE, that is inherited on fork(). The prctl denies PROT_WRITE |
PROT_EXEC mappings. Like the systemd BPF filter it also denies adding
PROT_EXEC to mappings. However unlike the BPF filter it only denies it if
the mapping didn't previous have PROT_EXEC. This allows to PROT_EXEC ->
PROT_EXEC | PROT_BTI with mprotect(), which is a problem with the BPF
filter.
This patch (of 2):
The aim of such policy is to prevent a user task from creating an
executable mapping that is also writeable.
An example of mmap() returning -EACCESS if the policy is enabled:
mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
Similarly, mprotect() would return -EACCESS below:
addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
The BPF filter that systemd MDWE uses is stateless, and disallows
mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to
be enabled if it was already PROT_EXEC, which allows the following case:
addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
where PROT_BTI enables branch tracking identification on arm64.
Link: https://lkml.kernel.org/r/20230119160344.54358-1-joey.gouly@arm.com
Link: https://lkml.kernel.org/r/20230119160344.54358-2-joey.gouly@arm.com
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeremy Linton <jeremy.linton@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: nd <nd@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Topi Miettinen <toiwoton@gmail.com>
Cc: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Shadow_nodes is for shadow nodes reclaiming of workingset handling, it is
updated when page cache add or delete since long time ago workingset only
supported page cache. But when workingset supports anonymous page
detection, we missied updating shadow nodes for it. This caused that
shadow nodes of anonymous page will never be reclaimd by
scan_shadow_nodes() even they use much memory and system memory is tense.
So update shadow_nodes of anonymous page when swap cache is add or delete
by calling xas_set_update(..workingset_update_node).
Link: https://lkml.kernel.org/r/202301182013032211005@zte.com.cn
Fixes: aae466b0052e ("mm/swap: implement workingset detection for anonymous LRU")
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Straightforward conversion of get_hwpoison_huge_page() to
get_hwpoison_hugetlb_folio(). Reduces two references to a head page in
memory-failure.c
[arnd@arndb.de: fix get_hwpoison_hugetlb_folio() stub]
Link: https://lkml.kernel.org/r/20230119111920.635260-1-arnd@kernel.org
Link: https://lkml.kernel.org/r/20230118174039.14247-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
page_ext must be initialized after all struct pages are initialized.
Therefore, page_ext is initialized after page_alloc_init_late(), and can
optionally be initialized earlier via early_page_ext kernel parameter
which as a side effect also disables deferred struct pages.
Allow to automatically init page_ext early when there are no deferred
struct pages in order to be able to use page_ext during kernel boot and
track for example page allocations early.
[pasha.tatashin@soleen.com: fix build with CONFIG_PAGE_EXTENSION=n]
Link: https://lkml.kernel.org/r/20230118155251.2522985-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20230117204617.1553748-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Only one caller doesn't have a folio, so move the page_folio() call to
that one caller from mem_cgroup_css_from_folio().
Link: https://lkml.kernel.org/r/20230116192507.2146150-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Writeback folio conversions".
Remove more calls to compound_head() by passing folios around instead of
pages.
This patch (of 2):
The only caller of inode_attach_wb() which doesn't pass NULL already has a
folio, so convert the whole call-chain to take folios.
Link: https://lkml.kernel.org/r/20230116192507.2146150-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230116192507.2146150-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Replace alloc_zeroed_user_highpage_movable(). The main difference is
returning a folio containing a single page instead of returning the page,
but take the opportunity to rename the function to match other allocation
functions a little better and rewrite the documentation to place more
emphasis on the zeroing rather than the highmem aspect.
Link: https://lkml.kernel.org/r/20230116191813.2145215-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All callers to find_get_pages_range_tag(), find_get_pages_tag(),
pagevec_lookup_range_tag(), and pagevec_lookup_tag() have been removed.
Link: https://lkml.kernel.org/r/20230104211448.4804-24-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This is the equivalent of find_get_pages_range_tag(), except for folios
instead of pages.
One noteable difference is filemap_get_folios_tag() does not take in a
maximum pages argument. It instead tries to fill a folio batch and stops
either once full (15 folios) or reaching the end of the search range.
The new function supports large folios, the initial function did not since
all callers don't use large folios.
Link: https://lkml.kernel.org/r/20230104211448.4804-3-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Matthew Wilcow (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Convert to filemap_get_folios_tag()", v5.
This patch series replaces find_get_pages_range_tag() with
filemap_get_folios_tag(). This also allows the removal of multiple calls
to compound_head() throughout.
It also makes a good chunk of the straightforward conversions to folios,
and takes the opportunity to introduce a function that grabs a folio from
the pagecache.
This patch (of 23):
Add function filemap_grab_folio() to grab a folio from the page cache.
This function is meant to serve as a folio replacement for
grab_cache_page, and is used to facilitate the removal of
find_get_pages_range_tag().
Link: https://lkml.kernel.org/r/20230104211448.4804-1-vishal.moola@gmail.com
Link: https://lkml.kernel.org/r/20230104211448.4804-2-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pass vm_flags as a parameter to shmem_is_huge, rather than reading the
flags from the vm_area_struct in question. This allows the updated flags
from hugepage_madvise to be passed to the check, which is necessary
because madvise does not update the vm_area_struct's flags until after
hugepage_madvise returns.
This fixes an issue when shmem_enabled=madvise, where MADV_HUGEPAGE on
shmem was not able to register the mm_struct with khugepaged. Prior to
cd89fb065099, the mm_struct was registered by MADV_HUGEPAGE regardless of
the value of shmem_enabled (which was only checked when scanning vmas).
Link: https://lkml.kernel.org/r/20230113023011.1784015-1-stevensd@google.com
Fixes: cd89fb065099 ("mm,thp,shmem: make khugepaged obey tmpfs mount flags")
Signed-off-by: David Stevens <stevensd@chromium.org>
Cc: David Stevens <stevensd@chromium.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
__GFP_ATOMIC serves little purpose. Its main effect is to set
ALLOC_HARDER which adds a few little boosts to increase the chance of an
allocation succeeding, one of which is to lower the water-mark at which it
will succeed.
It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
adjusts this watermark. It is probable that other users of __GFP_HIGH
should benefit from the other little bonuses that __GFP_ATOMIC gets.
__GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM.
There is little point to this. We already get a might_sleep() warning if
__GFP_DIRECT_RECLAIM is set.
__GFP_ATOMIC allows the "watermark_boost" to be side-stepped. It is
probable that testing ALLOC_HARDER is a better fit here.
__GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
sleep. This should test __GFP_DIRECT_RECLAIM instead.
This patch:
- removes __GFP_ATOMIC
- allows __GFP_HIGH allocations to ignore watermark boosting as well
as GFP_ATOMIC requests.
- makes other adjustments as suggested by the above.
The net result is not change to GFP_ATOMIC allocations. Other
allocations that use __GFP_HIGH will benefit from a few different extra
privileges. This affects:
xen, dm, md, ntfs3
the vermillion frame buffer
hibernation
ksm
swap
all of which likely produce more benefit than cost if these selected
allocation are more likely to succeed quickly.
[mgorman: Minor adjustments to rework on top of a series]
Link: https://lkml.kernel.org/r/163712397076.13692.4727608274002939094@noble.neil.brown.name
Link: https://lkml.kernel.org/r/20230113111217.14134-7-mgorman@techsingularity.net
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There is 8 byte page_ext->flags field allocated per page whenever
CONFIG_PAGE_EXTENSION is enabled. However, not every user of page_ext
uses flags. Therefore, check whether flags is needed at least by one user
and if so allocate space for it.
For example when page_table_check is enabled, on a machine with 128G
of memory before the fix:
[ 2.244288] allocated 536870912 bytes of page_ext
after the fix:
[ 2.160154] allocated 268435456 bytes of page_ext
Also, add a kernel-doc comment before page_ext_operations that describes
the fields, and remove check if need() is set, as that is now a required
field.
[pasha.tatashin@soleen.com: address comments from Mike Rapoport]
Link: https://lkml.kernel.org/r/20230117202103.1412449-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20230113154253.92480-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
__HAVE_ARCH_PTE_SWP_EXCLUSIVE is now supported by all architectures that
support swp PTEs, so let's drop it.
Link: https://lkml.kernel.org/r/20230113171026.582290-27-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "update mlock to use folios", v4.
This series updates mlock to use folios, converting the internal interface
to using folios exclusively and exposing the folio interface externally.
As a product of this we move to using a folio batch rather than a pagevec
for mlock folios, which brings it in line with the core folio batches
contained in mm/swap.c.
This patch (of 5):
This performs the same task as pagevec_reinit(), only modifying a folio
batch rather than a pagevec.
Link: https://lkml.kernel.org/r/cover.1673526881.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/9018cecacb39e34c883540f997f9be8281153613.1673526881.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Change hugetlb_clear_page_hwpoison() to folio_clear_hugetlb_hwpoison() by
changing the function to take in a folio. This converts one use of
ClearPageHWPoison and HPageRawHwpUnreliable to their folio equivalents.
Link: https://lkml.kernel.org/r/20230112204608.80136-4-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Get rid of tail page fields".
Continue the shrinkage of the struct page definition by getting rid of the
'first tail page' and 'second tail page' fields. I originally did this
patch set before Hugh's rewrite of the subpages_mapcount, so it needed
substantial updates; hope I didn't miss anything.
This patch (of 28):
commit dad6a5eb5556(mm,hugetlb: use folio fields in second tail page)
added a transitional hugetlb field to struct page and struct folio to make
room for another int in the first tail of a compound page. Hugetlb folio
conversions have changed all page users of this field to use the fields
within the folio so struct page no longer needs this hugetlb specific
field.
Link: https://lkml.kernel.org/r/20230111142915.1001531-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230111142915.1001531-29-willy@infradead.org
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now that both callers use a folio, pass the folio in and save a call to
compound_head().
Link: https://lkml.kernel.org/r/20230111142915.1001531-28-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use folio->_deferred_list directly.
Link: https://lkml.kernel.org/r/20230111142915.1001531-26-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove the entire block of definitions for the second tail page, and add
the deferred list to the struct folio. This actually moves _deferred_list
to a different offset in struct folio because I don't see a need to
include the padding.
This lets us use list_for_each_entry_safe() in deferred_split_scan()
and avoid a number of calls to compound_head().
Link: https://lkml.kernel.org/r/20230111142915.1001531-25-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Insert appropriate public: and private: markers to make the generated
kernel-doc look right.
Link: https://lkml.kernel.org/r/20230111142915.1001531-24-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All former users now use the folio equivalents, so remove them from the
definition of struct page.
Link: https://lkml.kernel.org/r/20230111142915.1001531-23-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Replace uses of compound_dtor, compound_order and compound_nr by their
folio equivalents.
Link: https://lkml.kernel.org/r/20230111142915.1001531-19-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Turn compound_nr() into a wrapper around folio_nr_pages(). Similarly to
compound_order(), casting the struct page directly to struct folio
preserves the existing behaviour, while calling page_folio() would change
the behaviour. Move thp_nr_pages() down in the file so that compound_nr()
can be after folio_nr_pages().
[willy@infradead.org: fix assertion triggering]
Link: https://lkml.kernel.org/r/Y8AFgZEEjnUIaCbf@casper.infradead.org
Link: https://lkml.kernel.org/r/20230111142915.1001531-18-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Make compound_order() use struct folio. It can't be turned into a wrapper
around folio_order() as a page can be turned into a tail page between a
check in compound_order() and the assertion in folio_test_large().
Link: https://lkml.kernel.org/r/20230111142915.1001531-17-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
folio_mapcount_ptr(), compound_mapcount_ptr() and subpages_mapcount_ptr()
are all now unused.
Link: https://lkml.kernel.org/r/20230111142915.1001531-16-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|