Age | Commit message (Collapse) | Author |
|
As its not used outside iommu.c. Also rename it as dev_flush_pasid_all().
No functional change intended.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-6-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Do not try to set max_pasids in error path as dev_data is not allocated.
Fixes: a0c47f233e68 ("iommu/amd: Introduce iommu_dev_data.max_pasids")
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-5-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
It was added in commit 52815b75682e ("iommu/amd: Add support for
IOMMUv2 domain mode"), but never used it. Hence remove these unused
macros.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-4-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
amd_iommu_is_attach_deferred() is a callback function called by
iommu_ops. Make it as static.
No functional changes intended.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-3-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Update event buffer head pointer once driver completes processing. So
that IOMMU can write new log without waiting for driver to complete
processing all event logs.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240828111029.5429-2-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-integrator into soc/arm
Integrator fixes for the v6.12 kernel cycle, some of_node_put():s
were missing in the SoC drivers.
* tag 'integrator-v6.12' of https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-integrator:
bus: integrator-lm: fix OF node leak in probe()
ARM: versatile: fix OF node leak in CPUs prepare
Link: https://lore.kernel.org/r/CACRpkdahXECZXWA5uv=SZtkzU0E++fQj7QWK8kYuH0-asLUPqg@mail.gmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Use of_property_present() to test for property presence rather than
of_(find|get)_property(). This is part of a larger effort to remove
callers of of_find_property() and similar functions. of_find_property()
leaks the DT struct property and data pointers which is a problem for
dynamically allocated nodes which may be freed.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20240731191312.1710417-6-robh@kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
QMC driver requires fsl_soc.h to use function get_immrbase().
This header is provided by powerpc architecture and the functions
it declares are defined only when FSL_SOC is selected.
Today the dependency is the following:
depends on CPM1 || QUICC_ENGINE || \
(FSL_SOC && (CPM || QUICC_ENGINE) && COMPILE_TEST)
This dependency tentatively ensure that FSL_SOC is there when doing a
COMPILE_TEST.
CPM1 is only selected by PPC_8xx and cannot be selected manually.
CPM1 selects FSL_SOC
QUICC_ENGINE on the other hand can be selected by ARM or ARM64 which
doesn't select FSL_SOC. QUICC_ENGINE can also be selected with just
COMPILE_TEST.
It is therefore possible to end up with CPM_QMC selected
without FSL_SOC.
So fix it by making it depend on FSL_SOC at all time.
The rest of the above dependency is the same as the one for CPM_TSA on
which CPM_QMC also depends, so it can go away, leaving only a simple
dependency on FSL_SOC.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/lkml/20240904104859.020fe3a9@canb.auug.org.au/
Fixes: 8655b76b7004 ("soc: fsl: cpm1: qmc: Handle QUICC Engine (QE) soft-qmc firmware")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Add documentation for the rockchip rk3568 CAN-FD controller.
Co-developed-by: Elaine Zhang <zhangqing@rock-chips.com>
Signed-off-by: Elaine Zhang <zhangqing@rock-chips.com>
Tested-by: Alibek Omarov <a1ba.omarov@gmail.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patch.msgid.link/20240904-rockchip-canfd-v5-1-8ae22bcb27cc@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Add previously omitted pwm node and possible pinctrl for Rockchip RV1126
Signed-off-by: Karthikeyan Krishnasamy <karthikeyan@linumiz.com>
Link: https://lore.kernel.org/r/20240903105245.715899-4-karthikeyan@linumiz.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
|
|
Add i2s0 node and possible pinctrl for Rockchip RV1126
Signed-off-by: Karthikeyan Krishnasamy <karthikeyan@linumiz.com>
Link: https://lore.kernel.org/r/20240903105245.715899-3-karthikeyan@linumiz.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
|
|
Add i2c3 node and possible pinctrl for Rockchip RV1126
Signed-off-by: Karthikeyan Krishnasamy <karthikeyan@linumiz.com>
Link: https://lore.kernel.org/r/20240903105245.715899-2-karthikeyan@linumiz.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
|
|
xe_bo_move() might be called in the TTM swapout path from validation
by another TTM device. If so, we are not likely to have a RPM
reference. So iff xe_pm_runtime_get() is safe to call from reclaim,
use it instead of xe_pm_runtime_get_noresume().
Strictly this is currently needed only if handle_system_ccs is true,
but use xe_pm_runtime_get() if possible anyway to increase test
coverage.
At the same time warn if handle_system_ccs is true and we can't
call xe_pm_runtime_get() from reclaim context. This will likely trip
if someone tries to enable SRIOV on LNL, without fixing Xe SRIOV
runtime resume / suspend.
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240903094232.166342-1-thomas.hellstrom@linux.intel.com
|
|
Cleanup the includes by putting them in alphabetical order.
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Martyn Welch <martyn.welch@collabora.com>
Link: https://lore.kernel.org/r/20240903154533.101258-1-brgl@bgdev.pl
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
|
|
Simplify the code in error paths by using the managed variant of the
clock getter that controls the clock state as well.
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20240819151705.37258-2-brgl@bgdev.pl
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
|
|
There are no more any board files that use the platform data for
gpio-davinci. We can remove the header defining it and port the code to
no longer store any context in pdata.
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20240819151705.37258-1-brgl@bgdev.pl
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
|
|
Sort the headers in alphabetic order in order to ease
the maintenance for this part.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240902133148.2569486-6-andriy.shevchenko@linux.intel.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
|
|
Convert the module to be property provider agnostic and allow
it to be used on non-OF platforms.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240902133148.2569486-5-andriy.shevchenko@linux.intel.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
|
|
We have a temporary variable to keep a pointer to struct device.
Utilise it where it makes sense.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240902133148.2569486-4-andriy.shevchenko@linux.intel.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
|
|
There is no evidence that the 'dev' member of struct stmpe_gpio
is used, drop it.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240902133148.2569486-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
|
|
First of all, remove duplicate message that platform_get_irq()
does already print. Second, correct the error message when unable
to register a handler, which is broken in two ways:
1) the misleading 'get' vs. 'register';
2) missing '\n' at the end.
(Yes, for the curious ones, the dev_*() cases do not require '\n'
and issue it automatically, but it's better to have them explicit)
Fix all this here.
Fixes: 1882e769362b ("gpio: stmpe: Simplify with dev_err_probe()")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240902133148.2569486-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
|
|
Use the isolate_folio_to_list() to unify hugetlb/LRU/non-LRU folio
isolation, which cleanup code a bit and save a few calls to
compound_head().
[wangkefeng.wang@huawei.com: various fixes]
Link: https://lkml.kernel.org/r/20240829150500.2599549-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20240827114728.3212578-6-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add isolate_folio_to_list() helper to try to isolate HugeTLB, no-LRU
movable and LRU folios to a list, which will be reused by
do_migrate_range() from memory hotplug soon, also drop the
mf_isolate_folio() since we could directly use new helper in the
soft_offline_in_use_page().
Link: https://lkml.kernel.org/r/20240827114728.3212578-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to
be offlined") don't handle the hugetlb pages, the endless loop still occur
if offline a hwpoison hugetlb page, luckly, after the commit e591ef7d96d6
("mm, hwpoison,hugetlb,memory_hotplug: hotremove memory section with
hwpoisoned hugepage"), the HPageMigratable of hugetlb page will be
cleared, and the hwpoison hugetlb page will be skipped in
scan_movable_pages(), so the endless loop issue is fixed.
However if the HPageMigratable() check passed(without reference and lock),
the hugetlb page may be hwpoisoned, it won't cause issue since the
hwpoisoned page will be handled correctly in the next movable pages scan
loop, and it will be isolated in do_migrate_range() but fails to migrate.
In order to avoid the unnecessary isolation and unify all hwpoisoned page
handling, let's unconditionally check hwpoison firstly, and if it is a
hwpoisoned hugetlb page, try to unmap it as the catch all safety net like
normal page does.
Link: https://lkml.kernel.org/r/20240827114728.3212578-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add unmap_poisoned_folio() helper which will be reused by
do_migrate_range() from memory hotplug soon.
[akpm@linux-foundation.org: whitespace tweak, per Miaohe Lin]
Link: https://lkml.kernel.org/r/1f80c7e3-c30d-1ac1-6a36-d1a5f5907f7c@huawei.com
Link: https://lkml.kernel.org/r/20240827114728.3212578-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: memory_hotplug: improve do_migrate_range()", v3.
Unify hwpoisoned page handling and isolation of HugeTLB/LRU/non-LRU
movable page, also convert to use folios in do_migrate_range().
This patch (of 5):
Directly use a folio for HugeTLB and THP when calculate the next pfn, then
remove unused head variable.
Link: https://lkml.kernel.org/r/20240827114728.3212578-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20240827114728.3212578-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
'--kunitconfig' option of 'kunit.py run' supports '.kunitconfig' file name
convention. Add the file for DAMON kunit tests for more convenient kunit
run.
Link: https://lkml.kernel.org/r/20240827030336.7930-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There was a discussion about better places for kunit test code[1] and test
file name suffix[2]. Folowwing the conclusion, move kunit tests for DAMON
to mm/damon/tests/ subdirectory and rename those.
[1] https://lore.kernel.org/CABVgOS=pUdWb6NDHszuwb1HYws4a1-b1UmN=i8U_ED7HbDT0mg@mail.gmail.com
[2] https://lore.kernel.org/CABVgOSmKwPq7JEpHfS6sbOwsR0B-DBDk_JP-ZD9s9ZizvpUjbQ@mail.gmail.com
Link: https://lkml.kernel.org/r/20240827030336.7930-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
registered
The test depends on registration of DAMON_OPS_PADDR. It would be
registered only when CONFIG_DAMON_PADDR is set. DAMON core kunit tests do
fake ops registration for such case. However, the functions for such fake
ops registration is not available to DAMON debugfs interface. Just skip
the test in the case.
Link: https://lkml.kernel.org/r/20240827030336.7930-8-sj@kernel.org
Fixes: 999b9467974f ("mm/damon/dbgfs-test: fix is_target_id() change")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The test depends on registration of DAMON_OPS_PADDR. It would be
registered only when CONFIG_DAMON_PADDR is set. DAMON core kunit tests do
fake ops registration for such case. However, the functions for such fake
ops registration is not available to DAMON debugfs interface. Just skip
the test in the case.
Link: https://lkml.kernel.org/r/20240827030336.7930-7-sj@kernel.org
Fixes: 999b9467974f ("mm/damon/dbgfs-test: fix is_target_id() change")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON core kunit test can be executed without CONFIG_DAMON_VADDR. In the
case, vaddr DAMON ops is not registered. Meanwhile, ops registration
kunit test assumes the vaddr ops is registered. Check and handle the case
by registrering fake vaddr ops inside the test code.
Link: https://lkml.kernel.org/r/20240827030336.7930-6-sj@kernel.org
Fixes: 4f540f5ab4f2 ("mm/damon/core-test: add a kunit test case for ops registration")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON ops registration kunit test tests both vaddr and paddr use cases in
parts of the whole test cases. Basically testing only one ops use case is
enough. Do the test with only vaddr use case.
Link: https://lkml.kernel.org/r/20240827030336.7930-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Some test scripts are missing executable permissions. It causes warnings
that make the test output unnecessarily verbose. Add executable
permissions.
Link: https://lkml.kernel.org/r/20240827030336.7930-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Python-based tests creates __pycache__/ directory. Remove it with 'make
clean' by defining it as EXTRA_CLEAN.
Link: https://lkml.kernel.org/r/20240827030336.7930-3-sj@kernel.org
Fixes: b5906f5f7359 ("selftests/damon: add a test for update_schemes_tried_regions sysfs command")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "misc fixups for DAMON {self,kunit} tests".
This patchset is for minor fixups of DAMON selftests and kunit tests.
First three patches make DAMON selftests more cleanly maintained (patches
1 and 2) without unnecessary warnings (patch 3). Following six patches
remove unnecessary test case (patch 4), handle configs combinations that
can make tests fail (patches 5-7), reorganize the test files following the
new guideline (patch 8), and add reference kunitconfig for DAMON kunit
tests (patch 9).
This patch (of 9):
DAMON selftests build access_memory_even, but its not on the .gitignore
list. Add it to make 'git status' output cleaner.
Link: https://lkml.kernel.org/r/20240827030336.7930-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20240827030336.7930-2-sj@kernel.org
Fixes: c94df805c774 ("selftests/damon: implement a program for even-numbered memory regions access")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Problem statement:
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
Numa vma scan overhead has been reduced a lot. Meanwhile, the reducing of
the vma scan might create less Numa page fault information. The
insufficient information makes it harder for the Numa balancer to make
decision. Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of
partial VMAs regardless of PID activity") and commit 84db47ca7146d7
("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found to
bring back part of the performance.
Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system, a
long duration of remote Numa node read was observed by PMU events: A few
cores having ~500MB/s remote memory access for ~20 seconds. It causes
high core-to-core variance and performance penalty. After the
investigation, it is found that many vmas are skipped due to the active
PID check. According to the trace events, in most cases,
vma_is_accessed() returns false because the history access info stored in
pids_active array has been cleared.
Proposal:
The main idea is to adjust vma_is_accessed() to let it return true easier.
Thus compare the diff between mm->numa_scan_seq and
vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold,
scan the vma.
This patch especially helps the cases where there are small number of
threads, like the process-based SPECcpu. Without this patch, if the
SPECcpu process access the vma at the beginning, then sleeps for a long
time, the pid_active array will be cleared. A a result, if this process
is woken up again, it never has a chance to set prot_none anymore.
Because only the first 2 times of access is granted for vma scan:
(current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2 to be
worse, no other threads within the task can help set the prot_none. This
causes information lost.
Raghavendra helped test current patch and got the positive result
on the AMD platform:
autonumabench NUMA01
base patched
Amean syst-NUMA01 194.05 ( 0.00%) 165.11 * 14.92%*
Amean elsp-NUMA01 324.86 ( 0.00%) 315.58 * 2.86%*
Duration User 380345.36 368252.04
Duration System 1358.89 1156.23
Duration Elapsed 2277.45 2213.25
autonumabench NUMA02
Amean syst-NUMA02 1.12 ( 0.00%) 1.09 * 2.93%*
Amean elsp-NUMA02 3.50 ( 0.00%) 3.56 * -1.84%*
Duration User 1513.23 1575.48
Duration System 8.33 8.13
Duration Elapsed 28.59 29.71
kernbench
Amean user-256 22935.42 ( 0.00%) 22535.19 * 1.75%*
Amean syst-256 7284.16 ( 0.00%) 7608.72 * -4.46%*
Amean elsp-256 159.01 ( 0.00%) 158.17 * 0.53%*
Duration User 68816.41 67615.74
Duration System 21873.94 22848.08
Duration Elapsed 506.66 504.55
Intel 256 CPUs/2 Sockets:
autonuma benchmark also shows improvements:
v6.10-rc5 v6.10-rc5
+patch
Amean syst-NUMA01 245.85 ( 0.00%) 230.84 * 6.11%*
Amean syst-NUMA01_THREADLOCAL 205.27 ( 0.00%) 191.86 * 6.53%*
Amean syst-NUMA02 18.57 ( 0.00%) 18.09 * 2.58%*
Amean syst-NUMA02_SMT 2.63 ( 0.00%) 2.54 * 3.47%*
Amean elsp-NUMA01 517.17 ( 0.00%) 526.34 * -1.77%*
Amean elsp-NUMA01_THREADLOCAL 99.92 ( 0.00%) 100.59 * -0.67%*
Amean elsp-NUMA02 15.81 ( 0.00%) 15.72 * 0.59%*
Amean elsp-NUMA02_SMT 13.23 ( 0.00%) 12.89 * 2.53%*
v6.10-rc5 v6.10-rc5
+patch
Duration User 1064010.16 1075416.23
Duration System 3307.64 3104.66
Duration Elapsed 4537.54 4604.73
The SPECcpu remote node access issue disappears with the patch applied.
Link: https://lkml.kernel.org/r/20240827112958.181388-1-yu.c.chen@intel.com
Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
Reviewed-and-tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
commit 823430c8e9d9 ("memory tier: consolidate the initialization of
memory tiers") introduces a locking change that use guard(mutex) to
instead of mutex_lock/unlock() for memory_tier_lock. It unexpectedly
expanded the locked region to include the hotplug_memory_notifier(), as a
result, it triggers an locking dependency detected of ABBA deadlock.
Exclude hotplug_memory_notifier() from the locked region to fixing it.
The deadlock scenario is that when a memory online event occurs, the
execution of memory notifier will access the read lock of the
memory_chain.rwsem, then the reigistration of the memory notifier in
memory_tier_init() acquires the write lock of the memory_chain.rwsem while
holding memory_tier_lock. Then the memory online event continues to
invoke the memory hotplug callback registered by memory_tier_init().
Since this callback tries to acquire the memory_tier_lock, a deadlock
occurs.
In fact, this deadlock can't happen because memory_tier_init() always
executes before memory online events happen due to the subsys_initcall()
has an higher priority than module_init().
[ 133.491106] WARNING: possible circular locking dependency detected
[ 133.493656] 6.11.0-rc2+ #146 Tainted: G O N
[ 133.504290] ------------------------------------------------------
[ 133.515194] (udev-worker)/1133 is trying to acquire lock:
[ 133.525715] ffffffff87044e28 (memory_tier_lock){+.+.}-{3:3}, at: memtier_hotplug_callback+0x383/0x4b0
[ 133.536449]
[ 133.536449] but task is already holding lock:
[ 133.549847] ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0
[ 133.556781]
[ 133.556781] which lock already depends on the new lock.
[ 133.556781]
[ 133.569957]
[ 133.569957] the existing dependency chain (in reverse order) is:
[ 133.577618]
[ 133.577618] -> #1 ((memory_chain).rwsem){++++}-{3:3}:
[ 133.584997] down_write+0x97/0x210
[ 133.588647] blocking_notifier_chain_register+0x71/0xd0
[ 133.592537] register_memory_notifier+0x26/0x30
[ 133.596314] memory_tier_init+0x187/0x300
[ 133.599864] do_one_initcall+0x117/0x5d0
[ 133.603399] kernel_init_freeable+0xab0/0xeb0
[ 133.606986] kernel_init+0x28/0x2f0
[ 133.610312] ret_from_fork+0x59/0x90
[ 133.613652] ret_from_fork_asm+0x1a/0x30
[ 133.617012]
[ 133.617012] -> #0 (memory_tier_lock){+.+.}-{3:3}:
[ 133.623390] __lock_acquire+0x2efd/0x5c60
[ 133.626730] lock_acquire+0x1ce/0x580
[ 133.629757] __mutex_lock+0x15c/0x1490
[ 133.632731] mutex_lock_nested+0x1f/0x30
[ 133.635717] memtier_hotplug_callback+0x383/0x4b0
[ 133.638748] notifier_call_chain+0xbf/0x370
[ 133.641647] blocking_notifier_call_chain+0x76/0xb0
[ 133.644636] memory_notify+0x2e/0x40
[ 133.647427] online_pages+0x597/0x720
[ 133.650246] memory_subsys_online+0x4f6/0x7f0
[ 133.653107] device_online+0x141/0x1d0
[ 133.655831] online_memory_block+0x4d/0x60
[ 133.658616] walk_memory_blocks+0xc0/0x120
[ 133.661419] add_memory_resource+0x51d/0x6c0
[ 133.664202] add_memory_driver_managed+0xf5/0x180
[ 133.667060] dev_dax_kmem_probe+0x7f7/0xb40 [kmem]
[ 133.669949] dax_bus_probe+0x147/0x230
[ 133.672687] really_probe+0x27f/0xac0
[ 133.675463] __driver_probe_device+0x1f3/0x460
[ 133.678493] driver_probe_device+0x56/0x1b0
[ 133.681366] __driver_attach+0x277/0x570
[ 133.684149] bus_for_each_dev+0x145/0x1e0
[ 133.686937] driver_attach+0x49/0x60
[ 133.689673] bus_add_driver+0x2f3/0x6b0
[ 133.692421] driver_register+0x170/0x4b0
[ 133.695118] __dax_driver_register+0x141/0x1b0
[ 133.697910] dax_kmem_init+0x54/0xff0 [kmem]
[ 133.700794] do_one_initcall+0x117/0x5d0
[ 133.703455] do_init_module+0x277/0x750
[ 133.706054] load_module+0x5d1d/0x74f0
[ 133.708602] init_module_from_file+0x12c/0x1a0
[ 133.711234] idempotent_init_module+0x3f1/0x690
[ 133.713937] __x64_sys_finit_module+0x10e/0x1a0
[ 133.716492] x64_sys_call+0x184d/0x20d0
[ 133.719053] do_syscall_64+0x6d/0x140
[ 133.721537] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 133.724239]
[ 133.724239] other info that might help us debug this:
[ 133.724239]
[ 133.730832] Possible unsafe locking scenario:
[ 133.730832]
[ 133.735298] CPU0 CPU1
[ 133.737759] ---- ----
[ 133.740165] rlock((memory_chain).rwsem);
[ 133.742623] lock(memory_tier_lock);
[ 133.745357] lock((memory_chain).rwsem);
[ 133.748141] lock(memory_tier_lock);
[ 133.750489]
[ 133.750489] *** DEADLOCK ***
[ 133.750489]
[ 133.756742] 6 locks held by (udev-worker)/1133:
[ 133.759179] #0: ffff888207be6158 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x26c/0x570
[ 133.762299] #1: ffffffff875b5868 (device_hotplug_lock){+.+.}-{3:3}, at: lock_device_hotplug+0x20/0x30
[ 133.765565] #2: ffff88820cf6a108 (&dev->mutex){....}-{3:3}, at: device_online+0x2f/0x1d0
[ 133.768978] #3: ffffffff86d08ff0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x17/0x30
[ 133.772312] #4: ffffffff8702dfb0 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x23/0x30
[ 133.775544] #5: ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0
[ 133.779113]
[ 133.779113] stack backtrace:
[ 133.783728] CPU: 5 UID: 0 PID: 1133 Comm: (udev-worker) Tainted: G O N 6.11.0-rc2+ #146
[ 133.787220] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 133.789948] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 133.793291] Call Trace:
[ 133.795826] <TASK>
[ 133.798284] dump_stack_lvl+0xea/0x150
[ 133.801025] dump_stack+0x19/0x20
[ 133.803609] print_circular_bug+0x477/0x740
[ 133.806341] check_noncircular+0x2f4/0x3e0
[ 133.809056] ? __pfx_check_noncircular+0x10/0x10
[ 133.811866] ? __pfx_lockdep_lock+0x10/0x10
[ 133.814670] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 133.817610] __lock_acquire+0x2efd/0x5c60
[ 133.820339] ? __pfx___lock_acquire+0x10/0x10
[ 133.823128] ? __dax_driver_register+0x141/0x1b0
[ 133.825926] ? do_one_initcall+0x117/0x5d0
[ 133.828648] lock_acquire+0x1ce/0x580
[ 133.831349] ? memtier_hotplug_callback+0x383/0x4b0
[ 133.834293] ? __pfx_lock_acquire+0x10/0x10
[ 133.837134] __mutex_lock+0x15c/0x1490
[ 133.839829] ? memtier_hotplug_callback+0x383/0x4b0
[ 133.842753] ? memtier_hotplug_callback+0x383/0x4b0
[ 133.845602] ? __this_cpu_preempt_check+0x21/0x30
[ 133.848438] ? __pfx___mutex_lock+0x10/0x10
[ 133.851200] ? __pfx_lock_acquire+0x10/0x10
[ 133.853935] ? global_dirty_limits+0xc0/0x160
[ 133.856699] ? __sanitizer_cov_trace_switch+0x58/0xa0
[ 133.859564] mutex_lock_nested+0x1f/0x30
[ 133.862251] ? mutex_lock_nested+0x1f/0x30
[ 133.864964] memtier_hotplug_callback+0x383/0x4b0
[ 133.867752] notifier_call_chain+0xbf/0x370
[ 133.870550] ? writeback_set_ratelimit+0xe8/0x160
[ 133.873372] blocking_notifier_call_chain+0x76/0xb0
[ 133.876311] memory_notify+0x2e/0x40
[ 133.879013] online_pages+0x597/0x720
[ 133.881686] ? irqentry_exit+0x3e/0xa0
[ 133.884397] ? __pfx_online_pages+0x10/0x10
[ 133.887244] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 133.890299] ? mhp_init_memmap_on_memory+0x7a/0x1c0
[ 133.893203] memory_subsys_online+0x4f6/0x7f0
[ 133.896099] ? __pfx_memory_subsys_online+0x10/0x10
[ 133.899039] ? xa_load+0x16d/0x2e0
[ 133.901667] ? __pfx_xa_load+0x10/0x10
[ 133.904366] ? __pfx_memory_subsys_online+0x10/0x10
[ 133.907218] device_online+0x141/0x1d0
[ 133.909845] online_memory_block+0x4d/0x60
[ 133.912494] walk_memory_blocks+0xc0/0x120
[ 133.915104] ? __pfx_online_memory_block+0x10/0x10
[ 133.917776] add_memory_resource+0x51d/0x6c0
[ 133.920404] ? __pfx_add_memory_resource+0x10/0x10
[ 133.923104] ? _raw_write_unlock+0x31/0x60
[ 133.925781] ? register_memory_resource+0x119/0x180
[ 133.928450] add_memory_driver_managed+0xf5/0x180
[ 133.931036] dev_dax_kmem_probe+0x7f7/0xb40 [kmem]
[ 133.933665] ? __pfx_dev_dax_kmem_probe+0x10/0x10 [kmem]
[ 133.936332] ? __pfx___up_read+0x10/0x10
[ 133.938878] dax_bus_probe+0x147/0x230
[ 133.941332] ? __pfx_dax_bus_probe+0x10/0x10
[ 133.943954] really_probe+0x27f/0xac0
[ 133.946387] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 133.949106] __driver_probe_device+0x1f3/0x460
[ 133.951704] ? parse_option_str+0x149/0x190
[ 133.954241] driver_probe_device+0x56/0x1b0
[ 133.956749] __driver_attach+0x277/0x570
[ 133.959228] ? __pfx___driver_attach+0x10/0x10
[ 133.961776] bus_for_each_dev+0x145/0x1e0
[ 133.964367] ? __pfx_bus_for_each_dev+0x10/0x10
[ 133.967019] ? __kasan_check_read+0x15/0x20
[ 133.969543] ? _raw_spin_unlock+0x31/0x60
[ 133.972132] driver_attach+0x49/0x60
[ 133.974536] bus_add_driver+0x2f3/0x6b0
[ 133.977044] driver_register+0x170/0x4b0
[ 133.979480] __dax_driver_register+0x141/0x1b0
[ 133.982126] ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
[ 133.984724] dax_kmem_init+0x54/0xff0 [kmem]
[ 133.987284] ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
[ 133.989965] do_one_initcall+0x117/0x5d0
[ 133.992506] ? __pfx_do_one_initcall+0x10/0x10
[ 133.995185] ? __kasan_kmalloc+0x88/0xa0
[ 133.997748] ? kasan_poison+0x3e/0x60
[ 134.000288] ? kasan_unpoison+0x2c/0x60
[ 134.002762] ? kasan_poison+0x3e/0x60
[ 134.005202] ? __asan_register_globals+0x62/0x80
[ 134.007753] ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
[ 134.010439] do_init_module+0x277/0x750
[ 134.012953] load_module+0x5d1d/0x74f0
[ 134.015406] ? __pfx_load_module+0x10/0x10
[ 134.017887] ? __pfx_ima_post_read_file+0x10/0x10
[ 134.020470] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 134.023127] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 134.025767] ? security_kernel_post_read_file+0xa2/0xd0
[ 134.028429] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 134.031162] ? kernel_read_file+0x503/0x820
[ 134.033645] ? __pfx_kernel_read_file+0x10/0x10
[ 134.036232] ? __pfx___lock_acquire+0x10/0x10
[ 134.038766] init_module_from_file+0x12c/0x1a0
[ 134.041291] ? init_module_from_file+0x12c/0x1a0
[ 134.043936] ? __pfx_init_module_from_file+0x10/0x10
[ 134.046516] ? __this_cpu_preempt_check+0x21/0x30
[ 134.049091] ? __kasan_check_read+0x15/0x20
[ 134.051551] ? do_raw_spin_unlock+0x60/0x210
[ 134.054077] idempotent_init_module+0x3f1/0x690
[ 134.056643] ? __pfx_idempotent_init_module+0x10/0x10
[ 134.059318] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 134.061995] ? __fget_light+0x17d/0x210
[ 134.064428] __x64_sys_finit_module+0x10e/0x1a0
[ 134.066976] x64_sys_call+0x184d/0x20d0
[ 134.069405] do_syscall_64+0x6d/0x140
[ 134.071926] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[yanfei.xu@intel.com: add mutex_lock/unlock() pair back]
Link: https://lkml.kernel.org/r/20240830102447.1445296-1-yanfei.xu@intel.com
Link: https://lkml.kernel.org/r/20240827113614.1343049-1-yanfei.xu@intel.com
Fixes: 823430c8e9d9 ("memory tier: consolidate the initialization of memory tiers")
Signed-off-by: Yanfei Xu <yanfei.xu@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Ho-Ren (Jack) Chuang <horen.chuang@linux.dev>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The aim is to simplify and making the vm_area_alloc_pages()
function less confusing as it became more clogged nowadays:
- eliminate a "bulk_gfp" variable and do not overwrite a gfp
flag for bulk allocator;
- drop __GFP_NOFAIL flag for high-order-page requests on upper
layer. It becomes less spread between levels when it comes to
__GFP_NOFAIL allocations;
- add a comment about a fallback path if high-order attempt is
unsuccessful because for such cases __GFP_NOFAIL is dropped;
- fix a typo in a commit message.
Link: https://lkml.kernel.org/r/20240827190916.34242-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In commit 714965ca8252 ("mm/mmap: start distinguishing if vma can be
removed in mergeability test") we relaxed the VMA merge rules for VMAs
possessing a vm_ops->close() hook, permitting this operation in instances
where we wouldn't delete the VMA as part of the merge operation.
This was later corrected in commit fc0c8f9089c2 ("mm, mmap: fix
vma_merge() case 7 with vma_ops->close") to account for a subtle case that
the previous commit had not taken into account.
In both instances, we first rely on is_mergeable_vma() to determine
whether we might be dealing with a VMA that might be removed, taking
advantage of the fact that a 'previous' VMA will never be deleted, only
VMAs that follow it.
The second patch corrects the instance where a merge of the previous VMA
into a subsequent one did not correctly check whether the subsequent VMA
had a vm_ops->close() handler.
Both changes prevent merge cases that are actually permissible (for
instance a merge of a VMA into a following VMA with a vm_ops->close(), but
with no previous VMA, which would result in the next VMA being extended,
not deleted).
In addition, both changes fail to consider the case where a VMA that would
otherwise be merged with the previous and next VMA might have
vm_ops->close(), on the assumption that for this to be the case, all three
would have to have the same vma->vm_file to be mergeable and thus the same
vm_ops.
And in addition both changes operate at 50,000 feet, trying to guess
whether a VMA will be deleted.
As we have majorly refactored the VMA merge operation and de-duplicated
code to the point where we know precisely where deletions will occur, this
patch removes the aforementioned checks altogether and instead explicitly
checks whether a VMA will be deleted.
In cases where a reduced merge is still possible (where we merge both
previous and next VMA but the next VMA has a vm_ops->close hook, meaning
we could just merge the previous and current VMA), we do so, otherwise the
merge is not permitted.
We take advantage of our userland testing to assert that this functions
correctly - replacing the previous limited vm_ops->close() tests with
tests for every single case where we delete a VMA.
We also update all testing for both new and modified VMAs to set
vma->vm_ops->close() in every single instance where this would not prevent
the merge, to assert that we never do so.
Link: https://lkml.kernel.org/r/9f96b8cfeef3d14afabddac3d6144afdfbef2e22.1725040657.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The existing vma_merge() function is no longer required to handle what
were previously referred to as cases 1-3 (i.e. the merging of a new VMA),
as this is now handled by vma_merge_new_vma().
Additionally, simplify the convoluted control flow of the original,
maintaining identical logic only expressed more clearly and doing away
with a complicated set of cases, rather logically examining each possible
outcome - merging of both the previous and subsequent VMA, merging of the
previous VMA and merging of the subsequent VMA alone.
We now utilise the previously implemented commit_merge() function to share
logic with vma_expand() de-duplicating code and providing less surface
area for bugs and confusion. In order to do so, we adjust this function
to accept parameters specific to merging existing ranges.
Link: https://lkml.kernel.org/r/2cf6016b7bfcc4965fc3cde10827560c42e4f12c.1725040657.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pull the part of vma_expand() which actually commits the merge operation,
that is inserts it into the maple tree and sets the VMA's vma->vm_start
and vma->vm_end parameters, into its own function.
We implement only the parts needed for vma_expand() which now as a result
of previous work is also the means by which new VMA ranges are merged.
The next commit in the series will implement merging of existing ranges
which will extend commit_merge() to accommodate this case and result in
all merges using this common code.
Link: https://lkml.kernel.org/r/7b985a20dfa549e3c370cd274d732b64c44f6dbd.1725040657.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now we have abstracted merge behaviour for new VMA ranges, we are able to
render vma_prepare(), init_vma_prep(), vma_complete(),
can_vma_merge_before() and can_vma_merge_after() static and internal to
vma.c.
These are internal implementation details of kernel VMA manipulation and
merging mechanisms and thus should not be exposed. This also renders the
functions userland testable.
Link: https://lkml.kernel.org/r/7f7f1c34ce10405a6aab2714c505af3cf41b7851.1725040657.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Abstract vma_merge_new_vma() to use vma_merge_struct and rename the
resultant function vma_merge_new_range() to be clear what the purpose of
this function is - a new VMA is desired in the specified range, and we
wish to see if it is possible to 'merge' surrounding VMAs into this range
rather than having to allocate a new VMA.
Note that this function uses vma_extend() exclusively, so adopts its
requirement that the iterator point at or before the gap. We add an
assert to this effect.
This is as opposed to vma_merge_existing_range(), which will be introduced
in a subsequent commit, and provide the same functionality for cases in
which we are modifying an existing VMA.
In mmap_region() and do_brk_flags() we open code scenarios where we prefer
to use vma_expand() rather than invoke a full vma_merge() operation.
Abstract this logic and eliminate all of the open-coding, and also use the
same logic for all cases where we add new VMAs to, rather than ultimately
use vma_merge(), rather use vma_expand().
Doing so removes duplication and simplifies VMA merging in all such cases,
laying the ground for us to eliminate the merging of new VMAs in
vma_merge() altogether.
Also add the ability for the vmg to track state, and able to report
errors, allowing for us to differentiate a failed merge from an inability
to allocate memory in callers.
This makes it far easier to understand what is happening in these cases
avoiding confusion, bugs and allowing for future optimisation.
Also introduce vma_iter_next_rewind() to allow for retrieval of the next,
and (optionally) the prev VMA, rewinding to the start of the previous gap.
Introduce are_anon_vmas_compatible() to abstract individual VMA anon_vma
comparison for the case of merging on both sides where the anon_vma of the
VMA being merged maybe compatible with prev and next, but prev and next's
anon_vma's may not be compatible with each other.
Finally also introduce can_vma_merge_left() / can_vma_merge_right() to
check adjacent VMA compatibility and that they are indeed adjacent.
Link: https://lkml.kernel.org/r/49d37c0769b6b9dc03b27fe4d059173832556392.1725040657.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The purpose of the vmg is to thread merge state through functions and
avoid egregious parameter lists. We expand this to vma_expand(), which is
used for a number of merge cases.
Accordingly, adjust its callers, mmap_region() and relocate_vma_down(), to
use a vmg.
An added purpose of this change is the ability in a future commit to
perform all new VMA range merging using vma_expand().
Link: https://lkml.kernel.org/r/4bc8c9dbc9ca52452ef8e587b28fe555854ceb38.1725040657.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Both can_vma_merge_before() and can_vma_merge_after() are invoked after
checking for compatible VMA NUMA policy, we can simply move this to
is_mergeable_vma() and abstract this altogether.
In mmap_region() we set vmg->policy to NULL, so the policy comparisons
checked in can_vma_merge_before() and can_vma_merge_after() are exactly
equivalent to !vma_policy(vmg.next) and !vma_policy(vmg.prev).
Equally, in do_brk_flags(), vmg->policy is NULL, so the
can_vma_merge_after() is checking !vma_policy(vma), as we set vmg.prev to
vma.
In vma_merge(), we compare prev and next policies with vmg->policy before
checking can_vma_merge_after() and can_vma_merge_before() respectively,
which this patch causes to be checked in precisely the same way.
This therefore maintains precisely the same logic as before, only now
abstracted into is_mergeable_vma().
Link: https://lkml.kernel.org/r/0dbff286d9c4988333bc6f4ff3734cb95dd5410a.1725040657.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Rather than passing around huge numbers of parameters to numerous helper
functions, abstract them into a single struct that we thread through the
operation, the vma_merge_struct ('vmg').
Adjust vma_merge() and vma_modify() to accept this parameter, as well as
predicate functions can_vma_merge_before(), can_vma_merge_after(), and the
vma_modify_...() helper functions.
Also introduce VMG_STATE() and VMG_VMA_STATE() helper macros to allow for
easy vmg declaration.
We additionally remove the requirement that vma_merge() is passed a VMA
object representing the candidate new VMA. Previously it used this to
obtain the mm_struct, file and anon_vma properties of the proposed range
(a rather confusing state of affairs), which are now provided by the vmg
directly.
We also remove the pgoff calculation previously performed vma_modify(),
and instead calculate this in VMG_VMA_STATE() via the vma_pgoff_offset()
helper.
Link: https://lkml.kernel.org/r/a955aad09d81329f6fbeb636b2dd10cde7b73dab.1725040657.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a variety of VMA merge unit tests to assert that the behaviour of VMA
merge is correct at an abstract level and VMAs are merged or not merged as
expected.
These are intentionally added _before_ we start refactoring vma_merge() in
order that we can continually assert correctness throughout the rest of
the series.
In order to reduce churn going forward, we backport the vma_merge_struct
data type to the test code which we introduce and use in a future commit,
and add wrappers around the merge new and existing VMA cases.
Link: https://lkml.kernel.org/r/1c7a0b43cfad2c511a6b1b52f3507696478ff51a.1725040657.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: remove vma_merge()", v3.
The infamous vma_merge() function has been the cause of a great deal of
pain, bugs and confusion for a very long time.
It is subtle, contains many corner cases, tries to do far too much and is
as a result very fragile.
The fact that the function requires there to be a numbering system to
cover each possible eventuality with references to each in the many
branches of its implementation as to which case you are looking at speaks
to all this.
Some of this complexity is inherent - unfortunately there is no getting
away from the need to figure out precisely how to execute the merge,
whether we need to remove VMAs, whether it is safe to do so, what
constitutes a mergeable VMA and so on.
However, a lot of the complexity is not inherent but instead a product of
the function's 'organic' development.
Liam has gone to great lengths to improve the situation as a part of his
maple tree implementation, greatly improving the readability of the code,
and Vlastimil and myself have additionally gone to lengths to try to
improve things further.
However, with the availability of userland VMA testing, it now becomes
possible to perform a rather more significant refactoring while
maintaining confidence in its correct operation.
An attempt was previously made by Vlastimil [0] to eliminate vma_merge(),
however it was rather - brutal - and an astute reader might refer to the
date of that patch for insight as to its intent.
This series instead divides merge operations into two natural kinds -
merges which occur when a NEW vma is being added to the address space, and
merges which occur when a vma is being MODIFIED.
Happily, the vma_expand() function introduced by Liam, which has the
capacity for also deleting a subsequent VMA, covers each of the NEW vma
cases.
By abstracting the actual final commit of changes to a VMA to its own
function, commit_merge() and writing a wrapper around vma_expand() for new
VMA cases vma_merge_new_range(), we can avoid having to use vma_merge()
for these instances altogether.
By doing so we are also able to then de-duplicate all existing merge logic
in mmap_region() and do_brk_flags() and have everything invoke this new
function, so we universally take the same approach to merging new VMAs.
Having done so, we can then completely rework vma_merge() into
vma_merge_existing_range() and use this for the instances where a merge is
proposed for a region of an existing VMA.
This eliminates vma_merge() and its numbered cases and instead divides
things into logical cases - merge both, merge left, merge right (the
latter 2 being either partial or full merges).
The code is heavily annotated with ASCII diagrams and greatly simplified
in comparison to the existing vma_merge() function.
Having made this change, we take the opportunity to address an issue with
merging VMAs possessing a vm_ops->close() hook - commit 714965ca8252
("mm/mmap: start distinguishing if vma can be removed in mergeability
test") and commit fc0c8f9089c2 ("mm, mmap: fix vma_merge() case 7 with
vma_ops->close") make efforts to relax how we handle these, making
assumptions about which VMAs might end up deleted (and thus, if possessing
a vm_ops->close() hook, cannot be).
This refactor means we do not need to guess, so instead explicitly only
disallow merge in instances where a VMA with a vm_ops->close() hook would
be deleted (and try a smaller merge in cases where this is possible).
In addition to these changes, we introduce a new vma_merge_struct
abstraction to allow VMA merge state to be threaded through the operation
neatly.
There is heavy unit testing provided for all merge functionality, added
prior to the refactoring, allowing for before/after testing.
The vm_ops->close() change also introduces exhaustive testing to
demonstrate that this functions as expected, and in addition to this the
reproduction code from commit fc0c8f9089c2 ("mm, mmap: fix vma_merge()
case 7 with vma_ops->close") was tested and confirmed passing.
[0]:https://lore.kernel.org/linux-mm/20240401192623.18575-2-vbabka@suse.cz/
This patch (of 10):
Have vma.o depend on its source dependencies explicitly, as previously
these were simply being ignored as existing object files were up to date.
This now correctly re-triggers the build if mm/ source is changed as well
as local source code.
Also set clean as a phony rule.
Link: https://lkml.kernel.org/r/cover.1725040657.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/e3ea58f08364ae5432c9a074de0195a7c7e0b04a.1725040657.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The vma_munmap_struct has a hole of 4 bytes and pushes the struct to three
cachelines. Relocating the three booleans upwards allows for the struct
to only use two cachelines (as reported by pahole on amd64).
Before:
struct vma_munmap_struct {
struct vma_iterator * vmi; /* 0 8 */
struct vm_area_struct * vma; /* 8 8 */
struct vm_area_struct * prev; /* 16 8 */
struct vm_area_struct * next; /* 24 8 */
struct list_head * uf; /* 32 8 */
long unsigned int start; /* 40 8 */
long unsigned int end; /* 48 8 */
long unsigned int unmap_start; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
long unsigned int unmap_end; /* 64 8 */
int vma_count; /* 72 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int nr_pages; /* 80 8 */
long unsigned int locked_vm; /* 88 8 */
long unsigned int nr_accounted; /* 96 8 */
long unsigned int exec_vm; /* 104 8 */
long unsigned int stack_vm; /* 112 8 */
long unsigned int data_vm; /* 120 8 */
/* --- cacheline 2 boundary (128 bytes) --- */
bool unlock; /* 128 1 */
bool clear_ptes; /* 129 1 */
bool closed_vm_ops; /* 130 1 */
/* size: 136, cachelines: 3, members: 19 */
/* sum members: 127, holes: 1, sum holes: 4 */
/* padding: 5 */
/* last cacheline: 8 bytes */
};
After:
struct vma_munmap_struct {
struct vma_iterator * vmi; /* 0 8 */
struct vm_area_struct * vma; /* 8 8 */
struct vm_area_struct * prev; /* 16 8 */
struct vm_area_struct * next; /* 24 8 */
struct list_head * uf; /* 32 8 */
long unsigned int start; /* 40 8 */
long unsigned int end; /* 48 8 */
long unsigned int unmap_start; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
long unsigned int unmap_end; /* 64 8 */
int vma_count; /* 72 4 */
bool unlock; /* 76 1 */
bool clear_ptes; /* 77 1 */
bool closed_vm_ops; /* 78 1 */
/* XXX 1 byte hole, try to pack */
long unsigned int nr_pages; /* 80 8 */
long unsigned int locked_vm; /* 88 8 */
long unsigned int nr_accounted; /* 96 8 */
long unsigned int exec_vm; /* 104 8 */
long unsigned int stack_vm; /* 112 8 */
long unsigned int data_vm; /* 120 8 */
/* size: 128, cachelines: 2, members: 19 */
/* sum members: 127, holes: 1, sum holes: 1 */
};
Link: https://lkml.kernel.org/r/20240830040101.822209-22-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The comment has been outdated since 6b73cff239e52 ("mm: change munmap
splitting order and move_vma()"). The move_vma() was altered to fix the
fragile state of the accounting since then.
Link: https://lkml.kernel.org/r/20240830040101.822209-21-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|