2024-07-27  drm/nouveau: add nvif_mmu to nouveau_drm  (Ben Skeggs)

This allocates a new nvif_mmu in nouveau_drm, and uses it for TTM backend memory allocations instead of nouveau_drm.master.mmu, which is removed by a later commit that removes nouveau_drm.master entirely.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-29-bskeggs@nvidia.com

2024-07-27  drm/nouveau: move nvxx_* definitions to nouveau_drv.h  (Ben Skeggs)

These are some dodgy "convenience" macros for the DRM driver to peek into NVKM state. They're still used in a few places, but don't belong in nvif/device.h in any case. Move them to nouveau_drv.h, and modify callers to pass a nouveau_drm instead of an nvif_device.

v2:
- use drm->nvkm pointer for nvxx_*() macros, removing some void*
v3:
- add some explanation of the nvxx_*() macros

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-28-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvif: remove disp chan rd/wr  (Ben Skeggs)

There's no good reason the ioremap() that results from nvif_object_map() should fail, so add a check that the map succeeded, and remove the rd/wr methods from display channel objects. As this was the last user of the rd/wr methods, the nvif plumbing is removed at the same time.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-27-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvif: remove device rd/wr  (Ben Skeggs)

The previous commit ensures the device is always mapped, so these are unneeded.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-26-bskeggs@nvidia.com

2024-07-27  drm/nouveau: always map device  (Ben Skeggs)

The next commit removes the nvif rd/wr methods from nvif_device, which were probably a bad idea, and mostly intended as a fallback if ioremap() failed (or wasn't available, as was the case in some tools I once used).

The nv04 KMS driver already mapped the device, because it's mostly been kept alive on life-support for many years and still directly bashes PRI a lot for modesetting. Post-nv50, I tried pretty hard to keep PRI accesses out of the DRM code, but there are still a few random places where we do, and those were using the rd/wr paths prior to this commit.

This allocates and maps a new nvif_device (which will replace the usage of nouveau_drm.master.device later on), and replicates this pointer to all other possible users. This will be cleaned up by the end of another patch series, after it's been made safe to do so.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-25-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvif: remove device args  (Ben Skeggs)

These were once used by userspace tools (with nvkm built as a library) to access multiple GPUs from a single nvif_client. The DRM code just uses the driver's default device, so remove the arguments.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-24-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvif: remove client fini  (Ben Skeggs)

Does nothing. Remove it.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-23-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvif: remove client devlist  (Ben Skeggs)

This was once used by userspace tools (with nvkm built as a library), but is now unused.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-22-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvif: remove client version  (Ben Skeggs)

This is not, and has never been, used for anything. Remove it.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-21-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvif: remove client device arg  (Ben Skeggs)

This was once used by userspace tools (with nvkm built as a library) as a way to select a "default device". The DRM code doesn't need this at all, as clients only have access to a single device already, so inherit the value from the parent.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-20-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvif: remove driver keep/fini  (Ben Skeggs)

These are remnants of code long gone. Remove them.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-19-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvif: remove nvxx_client()  (Ben Skeggs)

Make use of nouveau_cli.name instead of nvkm_client.name.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-18-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvif: remove nvxx_object()  (Ben Skeggs)

This hasn't been used in a while. Move the I/O accessors from nvkm/core/os.h to nvif/os.h at the same time, to fix a compile issue that results from <nvkm/core/object.h> no longer being included.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-17-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvif: remove route/token  (Ben Skeggs)

These were a kludge used to prevent userspace's nvif ioctl from accessing objects created by the kernel for the same client. That interface was removed in a previous patch, so these are no longer useful for anything.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-16-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvif: remove support for userspace backends  (Ben Skeggs)

The tools that used libnvkm no longer exist.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-15-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvkm: remove nvkm_client_search()  (Ben Skeggs)

Has been unused for a while now.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-14-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvkm: remove perfmon  (Ben Skeggs)

This has never really been used for anything, in part because reclocking was never stable enough in general to attempt to implement dynamic clock changes based on load, etc. To avoid having to rework its interfaces, remove it entirely.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-13-bskeggs@nvidia.com

2024-07-27  drm/nouveau/nvkm: remove detect/mmio/subdev_mask from device args  (Ben Skeggs)

All callers now pass "detect=true, mmio=true, subdev_mask=~0ULL", so remove the function arguments and associated code.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-12-bskeggs@nvidia.com

2024-07-27  drm/nouveau: remove abi16->handles  (Ben Skeggs)

Hasn't been needed since 2015...

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-11-bskeggs@nvidia.com

2024-07-27  drm/nouveau: remove abi16->device  (Ben Skeggs)

The previous commit removes the last remnants of userspace's own nvif instance, so this isn't needed anymore to hide the abi16 objects from userspace, and we can use nouveau_cli.device instead.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-10-bskeggs@nvidia.com

2024-07-27  drm/nouveau: handle limited nvif ioctl in abi16  (Ben Skeggs)

nouveau_usif.c was already stripped right back a couple of years ago, limiting what userspace could do with it. A follow-on series removes the nvkm side of these interfaces entirely, in order to make it less of a nightmare to add/change internal APIs in the future.

Unfortunately, userspace uses some of this. Fortunately, userspace only ever ended up using a fraction of the APIs, so those are reimplemented here in a more direct manner, and -EINVAL is returned to userspace for everything else.

v2:
- simplified struct nouveau_abi16_obj
- added a couple of comments
v3:
- comment harder

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-9-bskeggs@nvidia.com

2024-07-27  drm/nouveau: add nouveau_cli to nouveau_abi16  (Ben Skeggs)

Store a pointer to struct nouveau_cli in struct nouveau_abi16 to avoid some dubious void casts.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-8-bskeggs@nvidia.com

2024-07-27  drm/nouveau: move allocation of root client out of nouveau_cli_init()  (Ben Skeggs)

drm->master isn't really a nouveau_cli, and is mostly just used to get at an nvif_mmu object to implement a hack around issues with the ioctl interface to nvkm.

Later patches in this series allocate nvif_device/mmu objects in nouveau_drm directly, removing the need for master. A pending series to remove the "ioctl" layer between DRM and NVKM removes the need for the above-mentioned hack entirely.

The only other member of drm->master that's needed is the nvif_client, which is a dependency of device/mmu. So the first step is to move its allocation out of the code handling nouveau_cli init.

v2:
- modified slightly due to the addition of tegra/pci cleanup patches
v3:
- move nvif init below drm_dev_alloc() to avoid changing nouveau_name()

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-7-bskeggs@nvidia.com

2024-07-27  drm/nouveau: store nvkm_device pointer in nouveau_drm  (Ben Skeggs)

There are various places in the drm code that get at the nvkm_device via various creative (and not very type-safe) means, one of those being via nvif_device.object.priv. Another patch series is going to entirely remove the ioctl-like interfaces between the drm code and nvkm, and that field will no longer exist.

This provides a safer replacement for accessing the nvkm_device, and will be used more in upcoming patches to clean up other cases.

v2:
- fixup printk macros to not oops if used before client ctor

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-6-bskeggs@nvidia.com

2024-07-27  drm/nouveau: create pci device once  (Ben Skeggs)

HW isn't touched anymore (aside from detection) until the first nvif_device has been allocated, so we no longer need a separate probe-only step before kicking efifb (etc) off the HW.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-5-bskeggs@nvidia.com

2024-07-27  drm/nouveau: replace drm_device* with nouveau_drm* as dev drvdata  (Ben Skeggs)

We almost always want to cast the pointer from dev_get_drvdata() to 'struct nouveau_drm *', so just directly store that pointer instead, simplifying callers, and fixing some clumsy naming of dev/drm_dev variables at the same time.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-4-bskeggs@nvidia.com

2024-07-27  drm/nouveau: handle pci/tegra drm_dev_{alloc, register} from common code  (Ben Skeggs)

Unify some more of the PCI/Tegra DRM driver init, both as a general cleanup, and because a subsequent commit changes the pointer stored via dev_set_drvdata(), and this allows the change to be made in one place.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-3-bskeggs@nvidia.com

2024-07-27  drm/nouveau: move nouveau_drm_device_fini() above init()  (Ben Skeggs)

The next commit wants to be able to call fini() from an init() failure path to remove the need to duplicate a bunch of cleanup. Moving fini() above init() avoids the need for a forward-declaration.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726043828.58966-2-bskeggs@nvidia.com

2024-07-26  minmax: avoid overly complicated constant expressions in VM code  (Linus Torvalds)

The minmax infrastructure is overkill for simple constants, and can cause huge expansions because those simple constants are then used by other things.

For example, 'pageblock_order' is a core VM constant, but because it was implemented using 'min_t()' and all the type-checking that involves, it actually expanded to something like 2.5kB of preprocessor noise. And when that simple constant was then used inside other expansions:

  #define pageblock_nr_pages      (1UL << pageblock_order)
  #define pageblock_start_pfn(pfn)        ALIGN_DOWN((pfn), pageblock_nr_pages)

and we then use that inside a 'max()' macro:

  case ISOLATE_SUCCESS:
          update_cached = false;
          last_migrated_pfn = max(cc->zone->zone_start_pfn,
                                  pageblock_start_pfn(cc->migrate_pfn - 1));

the end result was one single statement expanding to 253kB in size.

There are probably other cases of this, but this one case certainly stood out. I've added 'MIN_T()' and 'MAX_T()' macros for this kind of "core simple constant with specific type" use. These macros skip the type checking, and as such need to be used very sparingly, only for obvious cases that have active issues like this.

Reported-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/all/36aa2cad-1db1-4abf-8dd2-fb20484aabc3@lucifer.local/
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

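An illustrative sketch of what such type-fixed helpers can look like (see <linux/minmax.h> for the real definitions, which may differ in detail):

  /*
   * Sketch only: a plain cast-and-compare is all a compile-time constant
   * needs, and it expands to a handful of tokens instead of kilobytes of
   * type-checking machinery.  This form would double-evaluate arguments
   * with side effects, which is harmless for the simple-constant use
   * these macros are restricted to.
   */
  #define MIN_T(type, a, b)  ((type)(a) < (type)(b) ? (type)(a) : (type)(b))
  #define MAX_T(type, a, b)  ((type)(a) > (type)(b) ? (type)(a) : (type)(b))
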
2024-07-26  minmax: avoid overly complex min()/max() macro arguments in xen  (Linus Torvalds)

We have some very fancy min/max macros that have tons of sanity checking to warn about mixed signedness etc.

This is all things that a sane compiler should warn about, but there are no sane compiler interfaces for this, and '-Wsign-compare' is broken [1] and not useful. So then we compensate (some would say over-compensate) by doing the checks manually with some truly horrid macro games.

And no, we can't just use __builtin_types_compatible_p(), because the whole question of "does it make sense to compare these two values" is a lot more complicated than that. For example, it makes a ton of sense to compare unsigned values with simple constants like "5", even if that is indeed a signed type. So we have these very strange macros to try to make sensible type checking decisions on the arguments to 'min()' and 'max()'.

But that can cause enormous code expansion if the min()/max() macros are used with complicated expressions, and particularly if you nest these things so that you get the first big expansion then expanded again.

The xen setup.c file ended up ballooning to over 50MB of preprocessed noise that takes 15s to compile (obviously depending on the build host), largely due to one single line. So let's split that one single line to just be simpler. I think it ends up being more legible to humans too at the same time. Now that single file compiles in under a second.

Reported-and-reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/all/c83c17bb-be75-4c67-979d-54eee38774c6@lucifer.local/
Link: https://staticthinking.wordpress.com/2023/07/25/wsign-compare-is-garbage/ [1]
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

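A sketch of the transformation with hypothetical names (the actual xen line isn't quoted in the log): each checked min()/max() re-expands its arguments, so nesting them multiplies the preprocessed size, while splitting into named steps keeps every expansion small.

  static unsigned long pick_order(unsigned long order, unsigned long floor,
                                  unsigned long cap)
  {
          /* before: return min(cap, max(order, floor));
           * the max() expansion is pasted into every argument slot of min() */
          unsigned long wanted = max(order, floor);

          return min(cap, wanted);
  }
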
2024-07-26  drm/vblank: add dynamic per-crtc vblank configuration support  (Hamza Mahfooz)

We would like to be able to enable vblank_disable_immediate unconditionally; however, there are a handful of cases where a small off delay is necessary (e.g. with PSR enabled). So, we would like to be able to adjust the vblank off-delay and disable-immediate values dynamically for a given CRTC, since that will allow drivers to apply static screen optimizations more quickly, and consequently allow users to benefit more from the power savings those optimizations afford, while avoiding issues in cases where an off delay is still warranted. In particular, the PSR case requires a small off delay of 2 frames, otherwise display firmware isn't able to keep up with all of the requests made to amdgpu.

So, introduce drm_crtc_vblank_on_config(), which is like drm_crtc_vblank_on() but allows drivers to specify the vblank CRTC configuration before enabling vblank support for a given CRTC.

Signed-off-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20240725205109.209743-1-hamza.mahfooz@amd.com

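A hedged usage sketch from a hypothetical driver's enable path; psr_active and frame_ms are made-up driver state, and the config field names follow the description above (check struct drm_vblank_crtc_config in the drm_vblank headers for the real layout):

  /* inside a driver's crtc enable path (sketch, not the amdgpu change) */
  struct drm_vblank_crtc_config config = {
          /* PSR needs a small off delay; otherwise disable immediately */
          .offdelay_ms = psr_active ? 2 * frame_ms : 0,
          .disable_immediate = !psr_active,
  };

  drm_crtc_vblank_on_config(crtc, &config);
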
2024-07-26  nilfs2: handle inconsistent state in nilfs_btnode_create_block()  (Ryusuke Konishi)

Syzbot reported that a buffer state inconsistency was detected in nilfs_btnode_create_block(), triggering a kernel bug.

It is not appropriate to treat this inconsistency as a bug; it can occur if the argument block address (the buffer index of the newly created block) is a virtual block number and has been reallocated due to corruption of the bitmap used to manage its allocation state.

So, modify nilfs_btnode_create_block() and its callers to treat it as a possible filesystem error, rather than triggering a kernel bug.

Link: https://lkml.kernel.org/r/20240725052007.4562-1-konishi.ryusuke@gmail.com
Fixes: a60be987d45d ("nilfs2: B-tree node cache")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+89cc4f2324ed37988b60@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=89cc4f2324ed37988b60
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

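A sketch of the general pattern only, not the literal nilfs2 diff; buffer_inconsistent() is a hypothetical stand-in for the actual buffer-state checks:

  /* corrupted on-disk metadata can legally produce this state, so report
   * a filesystem error and fail the operation instead of calling BUG() */
  if (unlikely(buffer_inconsistent(bh))) {
          nilfs_error(inode->i_sb,
                      "inconsistent b-tree node block state (ino=%lu)",
                      inode->i_ino);
          brelse(bh);
          return -EIO;
  }
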
2024-07-26  selftests/mm: skip test for non-LPA2 and non-LVA systems  (Dev Jain)

After my improvement of the test in e4a4ba415419 ("selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing"), the test began to fail on 4k and 16k pages on non-LPA2 systems. To reduce noise in the CI systems, skip the test when the higher address space is not implemented.

Link: https://lkml.kernel.org/r/20240718052504.356517-1-dev.jain@arm.com
Fixes: e4a4ba415419 ("selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing")
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2024-07-26  mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist()  (Li Zhijian)

It's expected that no page should be left in pcp_list after calling zone_pcp_disable() in offline_pages(). Previously, it was observed that offline_pages() gets stuck [1] due to some pages remaining in pcp_list.

Cause: there is a race condition between drain_pages_zone() and __rmqueue_pcplist() involving the pcp->count variable. See the scenario below:

  CPU0                                   CPU1
  ----------------                       ---------------
  spin_lock(&pcp->lock);
  __rmqueue_pcplist() {
                                         zone_pcp_disable() {
  /* list is empty */
  if (list_empty(list)) {
          /* add pages to pcp_list */
          alloced = rmqueue_bulk()
                                           mutex_lock(&pcp_batch_high_lock)
                                           ...
                                           __drain_all_pages() {
                                             drain_pages_zone() {
                                               /* read pcp->count, it's 0 here */
                                               count = READ_ONCE(pcp->count)
                                               /* 0 means nothing to drain */
  /* update pcp->count */
  pcp->count += alloced << order;
  ...
  spin_unlock(&pcp->lock);

In this case, even after calling zone_pcp_disable(), there are still some pages in pcp_list. And these pages are neither movable nor isolated, so offline_pages() gets stuck as a result.

Solution: expand the scope of the pcp->lock to also protect pcp->count in drain_pages_zone(), to ensure no pages are left in the pcp list after zone_pcp_disable().

[1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/
Link: https://lkml.kernel.org/r/20240723064428.1179519-1-lizhijian@fujitsu.com
Fixes: 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

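A simplified sketch of the fixed drain_pages_zone() under the solution above (see mm/page_alloc.c for the real code): pcp->count is now read under pcp->lock, closing the window in which __rmqueue_pcplist() could refill the list after a count of 0 was observed.

  static void drain_pages_zone(unsigned int cpu, struct zone *zone)
  {
          struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
          int count;

          do {
                  spin_lock(&pcp->lock);
                  count = pcp->count;
                  if (count) {
                          int to_drain = min(count,
                                  pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);

                          free_pcppages_bulk(zone, to_drain, pcp, 0);
                          count -= to_drain;
                  }
                  spin_unlock(&pcp->lock);
          } while (count);
  }
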
2024-07-26  mm: memcg: add cacheline padding after lruvec in mem_cgroup_per_node  (Roman Gushchin)

Oliver Sang reported a performance regression caused by commit 98c9daf5ae6b ("mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_node"), which puts some fields of the mem_cgroup_per_node structure under the CONFIG_MEMCG_V1 config option. Apparently it causes false cache sharing between the lruvec and lru_zone_size members of the structure. Fix it by adding explicit padding after the lruvec member.

Even though the padding is not required with CONFIG_MEMCG_V1 set, the introduced memory overhead does not seem significant enough to warrant another divergence in the mem_cgroup_per_node layout, so the padding is added unconditionally.

Link: https://lkml.kernel.org/r/20240723171244.747521-1-roman.gushchin@linux.dev
Fixes: 98c9daf5ae6b ("mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_node")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202407121335.31a10cb6-oliver.sang@intel.com
Tested-by: Oliver Sang <oliver.sang@intel.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

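An abridged sketch of the layout change (see struct mem_cgroup_per_node in <linux/memcontrol.h> for the full definition): explicit padding keeps the mostly-read lruvec and the frequently-updated counters on separate cache lines.

  struct mem_cgroup_per_node {
          struct lruvec           lruvec;

          CACHELINE_PADDING(_pad1_);      /* avoid false sharing with lruvec */

          unsigned long           lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
          /* ... remaining fields elided ... */
  };
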
2024-07-26  alloc_tag: outline and export free_reserved_page()  (Suren Baghdasaryan)

Outline and export free_reserved_page() because modules use it, and it in turn uses page_ext_{get|put}, which should not be exported. The same result could be obtained by outlining {get|put}_page_tag_ref() instead, but that would have a higher performance impact, as those functions are used in more performance-critical paths.

Link: https://lkml.kernel.org/r/20240717212844.2749975-1-surenb@google.com
Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407080044.DWMC9N9I-lkp@intel.com/
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Sourav Panda <souravpanda@google.com>
Cc: <stable@vger.kernel.org> [6.10]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2024-07-26  decompress_bunzip2: fix rare decompression failure  (Ross Lagerwall)

The decompression code parses a huffman tree and counts the number of symbols for a given bit length. In rare cases, there may be >= 256 symbols with a given bit length, causing the unsigned char to overflow. This causes a decompression failure later when the code tries and fails to find the bit length for a given symbol.

Since the maximum number of symbols is 258, use unsigned short instead.

Link: https://lkml.kernel.org/r/20240717162016.1514077-1-ross.lagerwall@citrix.com
Fixes: bc22c17e12c1 ("bzip2/lzma: library support for gzip, bzip2 and lzma decompression")
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Cc: Alain Knaff <alain@knaff.lu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

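A standalone illustration of the overflow (not the decompressor loop itself): counting up to 258 symbols in an unsigned char wraps at 256.

  #include <stdio.h>

  int main(void)
  {
          unsigned char  narrow = 0;   /* the old counter type */
          unsigned short wide = 0;     /* the fixed counter type */

          for (int i = 0; i < 258; i++) {
                  narrow++;            /* wraps: ends up as 2 */
                  wide++;              /* ends up as 258, as intended */
          }
          printf("%u vs %u\n", narrow, wide);  /* prints "2 vs 258" */
          return 0;
  }
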
2024-07-26  mm/huge_memory: avoid PMD-size page cache if needed  (Gavin Shan)

xarray can't support an arbitrary page cache size. The largest supported page cache size is defined as MAX_PAGECACHE_ORDER by commit 099d90642a71 ("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray"). However, it's possible to have a 512MB page cache in the huge memory collapsing path on an ARM64 system whose base page size is 64KB. A 512MB page cache breaks the limitation, and a warning is raised when the xarray entry is split, as shown in the following example.

  [root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps | grep KernelPageSize
  KernelPageSize:       64 kB
  [root@dhcp-10-26-1-207 ~]# cat /tmp/test.c
   :
  int main(int argc, char **argv)
  {
          const char *filename = TEST_XFS_FILENAME;
          int fd = 0;
          void *buf = (void *)-1, *p;
          int pgsize = getpagesize();
          int ret = 0;

          if (pgsize != 0x10000) {
                  fprintf(stdout, "System with 64KB base page size is required!\n");
                  return -EPERM;
          }

          system("echo 0 > /sys/devices/virtual/bdi/253:0/read_ahead_kb");
          system("echo 1 > /proc/sys/vm/drop_caches");

          /* Open the xfs file */
          fd = open(filename, O_RDONLY);
          assert(fd > 0);

          /* Create VMA */
          buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ, MAP_SHARED, fd, 0);
          assert(buf != (void *)-1);
          fprintf(stdout, "mapped buffer at 0x%p\n", buf);

          /* Populate VMA */
          ret = madvise(buf, TEST_MEM_SIZE, MADV_NOHUGEPAGE);
          assert(ret == 0);
          ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_READ);
          assert(ret == 0);

          /* Collapse VMA */
          ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE);
          assert(ret == 0);
          ret = madvise(buf, TEST_MEM_SIZE, MADV_COLLAPSE);
          if (ret) {
                  fprintf(stdout, "Error %d to madvise(MADV_COLLAPSE)\n", errno);
                  goto out;
          }

          /* Split xarray entry. Write permission is needed */
          munmap(buf, TEST_MEM_SIZE);
          buf = (void *)-1;
          close(fd);
          fd = open(filename, O_RDWR);
          assert(fd > 0);
          fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
                    TEST_MEM_SIZE - pgsize, pgsize);
  out:
          if (buf != (void *)-1)
                  munmap(buf, TEST_MEM_SIZE);
          if (fd > 0)
                  close(fd);

          return ret;
  }

  [root@dhcp-10-26-1-207 ~]# gcc /tmp/test.c -o /tmp/test
  [root@dhcp-10-26-1-207 ~]# /tmp/test
  ------------[ cut here ]------------
  WARNING: CPU: 25 PID: 7560 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
  Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib \
    nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct \
    nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \
    ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse \
    xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 virtio_net \
    sha1_ce net_failover virtio_blk virtio_console failover dimlib virtio_mmio
  CPU: 25 PID: 7560 Comm: test Kdump: loaded Not tainted 6.10.0-rc7-gavin+ #9
  Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
  pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
  pc : xas_split_alloc+0xf8/0x128
  lr : split_huge_page_to_list_to_order+0x1c4/0x780
  sp : ffff8000ac32f660
  x29: ffff8000ac32f660 x28: ffff0000e0969eb0 x27: ffff8000ac32f6c0
  x26: 0000000000000c40 x25: ffff0000e0969eb0 x24: 000000000000000d
  x23: ffff8000ac32f6c0 x22: ffffffdfc0700000 x21: 0000000000000000
  x20: 0000000000000000 x19: ffffffdfc0700000 x18: 0000000000000000
  x17: 0000000000000000 x16: ffffd5f3708ffc70 x15: 0000000000000000
  x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
  x11: ffffffffffffffc0 x10: 0000000000000040 x9 : ffffd5f3708e692c
  x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff0000e0969eb8
  x5 : ffffd5f37289e378 x4 : 0000000000000000 x3 : 0000000000000c40
  x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
  Call trace:
   xas_split_alloc+0xf8/0x128
   split_huge_page_to_list_to_order+0x1c4/0x780
   truncate_inode_partial_folio+0xdc/0x160
   truncate_inode_pages_range+0x1b4/0x4a8
   truncate_pagecache_range+0x84/0xa0
   xfs_flush_unmap_range+0x70/0x90 [xfs]
   xfs_file_fallocate+0xfc/0x4d8 [xfs]
   vfs_fallocate+0x124/0x2f0
   ksys_fallocate+0x4c/0xa0
   __arm64_sys_fallocate+0x24/0x38
   invoke_syscall.constprop.0+0x7c/0xd8
   do_el0_svc+0xb4/0xd0
   el0_svc+0x44/0x1d8
   el0t_64_sync_handler+0x134/0x150
   el0t_64_sync+0x17c/0x180

Fix it by correcting the supported page cache orders, with different sets for DAX and other files. With this corrected, a 512MB page cache becomes disallowed for all non-DAX files on ARM64 systems where the base page size is 64KB. After this patch is applied, the test program fails with error -EINVAL returned from __thp_vma_allowable_orders() and from the madvise() system call used to collapse the page caches.

Link: https://lkml.kernel.org/r/20240715000423.316491-1-gshan@redhat.com
Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache")
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: <stable@vger.kernel.org> [5.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

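A hedged sketch of the constraint being enforced; thp_file_orders() is a hypothetical helper (the real patch adjusts the order masks used by __thp_vma_allowable_orders() in mm/huge_memory.c). On 64KB base pages, PMD order is large enough that a PMD-sized folio (512MB) exceeds what an xarray-backed page cache can represent:

  static unsigned long thp_file_orders(bool is_dax)
  {
          if (is_dax)
                  return BIT(PMD_ORDER);  /* DAX is not xarray-limited */

          if (PMD_ORDER > MAX_PAGECACHE_ORDER)
                  return 0;               /* PMD-size page cache disallowed */

          return BIT(PMD_ORDER);
  }
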
2024-07-26  mm: huge_memory: use !CONFIG_64BIT to relax huge page alignment on 32 bit machines  (Yang Shi)

Yves-Alexis Perez reported commit 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") didn't work for x86_32 [1]. It is because x86_32 uses CONFIG_X86_32 instead of CONFIG_32BIT. !CONFIG_64BIT should cover all 32 bit machines.

[1] https://lore.kernel.org/linux-mm/CAHbLzkr1LwH3pcTgM+aGQ31ip2bKqiqEQ8=FQB+t2c3dhNKNHA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240712155855.1130330-1-yang@os.amperecomputing.com
Fixes: 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit")
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Reported-by: Yves-Alexis Perez <corsac@debian.org>
Tested-by: Yves-Alexis Perez <corsac@debian.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Salvatore Bonaccorso <carnil@debian.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org> [6.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

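A paraphrased sketch of the corrected condition (see the commit for the exact diff):

  /* before: CONFIG_32BIT is only defined by some 32-bit architectures,
   * so this never fired on x86_32 (which has CONFIG_X86_32 instead) */
  if (IS_ENABLED(CONFIG_32BIT))
          return 0;       /* skip forced PMD alignment */

  /* after: !CONFIG_64BIT is reliably true on every 32-bit build */
  if (!IS_ENABLED(CONFIG_64BIT))
          return 0;
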
2024-07-26  mm: fix old/young bit handling in the faulting path  (Ram Tummala)

Commit 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()") replaced do_set_pte() with set_pte_range(), and that introduced a regression in the following faulting path of non-anonymous vmas, which caused the PTE for the faulting address to be marked as old instead of young.

  handle_pte_fault()
    do_pte_missing()
      do_fault()
        do_read_fault() || do_cow_fault() || do_shared_fault()
          finish_fault()
            set_pte_range()

The polarity of the prefault calculation is incorrect. This leads to prefault being incorrectly set for the faulting address. The following check will incorrectly mark the PTE old rather than young. On some architectures this will cause a double fault to mark it young when the access is retried.

  if (prefault && arch_wants_old_prefaulted_pte())
          entry = pte_mkold(entry);

On a subsequent fault on the same address, the faulting path will see a non-NULL vmf->pte and, instead of reaching the do_pte_missing() path, the PTE will then be correctly marked young in handle_pte_fault() itself.

Due to this bug, performance degradation in the fault handling path will be observed due to unnecessary double faulting.

Link: https://lkml.kernel.org/r/20240710014539.746200-1-rtummala@nvidia.com
Fixes: 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()")
Signed-off-by: Ram Tummala <rtummala@nvidia.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

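A paraphrased sketch of the corrected polarity in set_pte_range() (see mm/memory.c for the exact diff): only addresses other than the faulting one are prefaults, so the faulting PTE keeps the young bit.

  /* prefault is true only for the pages mapped around vmf->address */
  bool prefault = !in_range(vmf->address, addr, nr * PAGE_SIZE);

  if (prefault && arch_wants_old_prefaulted_pte())
          entry = pte_mkold(entry);
  else
          entry = pte_sw_mkyoung(entry);
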
2024-07-26  dt-bindings: arm: update James Clark's email address  (James Clark)

My new address is james.clark@linaro.org

Link: https://lkml.kernel.org/r/20240709102512.31212-3-james.clark@linaro.org
Signed-off-by: James Clark <james.clark@linaro.org>
Cc: Bjorn Andersson <quic_bjorande@quicinc.com>
Cc: Conor Dooley <conor+dt@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geliang Tang <geliang@kernel.org>
Cc: Hao Zhang <quic_hazha@quicinc.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
Cc: Mao Jinlong <quic_jinlmao@quicinc.com>
Cc: Matthieu Baerts <matttbe@kernel.org>
Cc: Matt Ranostay <matt@ranostay.sg>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Cc: Rob Herring (Arm) <robh@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2024-07-26  MAINTAINERS: mailmap: update James Clark's email address  (James Clark)

My new address is james.clark@linaro.org

Link: https://lkml.kernel.org/r/20240709102512.31212-2-james.clark@linaro.org
Signed-off-by: James Clark <james.clark@linaro.org>
Cc: Bjorn Andersson <quic_bjorande@quicinc.com>
Cc: Conor Dooley <conor+dt@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geliang Tang <geliang@kernel.org>
Cc: Hao Zhang <quic_hazha@quicinc.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
Cc: Mao Jinlong <quic_jinlmao@quicinc.com>
Cc: Matthieu Baerts <matttbe@kernel.org>
Cc: Matt Ranostay <matt@ranostay.sg>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Cc: Rob Herring (Arm) <robh@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2024-07-26  irqchip/loongarch-cpu: Fix return value of lpic_gsi_to_irq()  (Huacai Chen)

lpic_gsi_to_irq() should return a valid Linux interrupt number if acpi_register_gsi() succeeds, and return 0 otherwise. But lpic_gsi_to_irq() silently converts a negative return value of acpi_register_gsi() to a positive value. Convert the return value explicitly.

Fixes: e8bba72b396c ("irqchip / ACPI: Introduce ACPI_IRQ_MODEL_LPIC for LoongArch")
Reported-by: Miao Wang <shankerwangmiao@gmail.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20240723064508.35560-1-chenhuacai@loongson.cn

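A sketch of the fix (simplified from the driver): only a positive acpi_register_gsi() result is a valid Linux IRQ number, so failures (negative errnos) must map to 0 rather than being cast to a bogus positive IRQ.

  int ret = acpi_register_gsi(NULL, gsi, ACPI_LEVEL_SENSITIVE,
                              ACPI_ACTIVE_HIGH);

  return ret > 0 ? ret : 0;   /* never propagate a negative errno as an IRQ */
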
2024-07-26  dt-bindings: iio: adc: ad7192: Fix 'single-channel' constraints  (Rob Herring (Arm))

The 'single-channel' property is an uint32, not an array, so 'items' is an incorrect constraint. This didn't matter until dtschema recently changed how properties are decoded. This results in this warning:

  Documentation/devicetree/bindings/iio/adc/adi,ad7192.example.dtb: adc@0: \
    channel@1:single-channel: 1 is not of type 'array'

Fixes: caf7b7632b8d ("dt-bindings: iio: adc: ad7192: Add AD7194 support")
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240723230904.1299744-1-robh@kernel.org
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

2024-07-26  drm/xe/rtp: Fix off-by-one when processing rules  (Lucas De Marchi)

Gustavo noticed an odd "+ 2" in rtp_mark_active() while processing rtp rules and pointed out that it should be "+ 1". In fact, while processing entries without actions (OOB workarounds), if a WA is activated and has OR rules, it will also inadvertently activate the very next workaround.

Tested on an LNL B0 platform: moving 18024947630 on top of 16020292621 makes the latter become active:

  $ cat /sys/kernel/debug/dri/0/gt0/workarounds
  ...
  OOB Workarounds
          18024947630
          16020292621
          14018094691
          16022287689
          13011645652
          22019338487_display

In the future a kunit test will be added to cover the rtp checks for entries without actions.

Fixes: fe19328b900c ("drm/xe/rtp: Add support for entries with no action")
Cc: Gustavo Sousa <gustavo.sousa@intel.com>
Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240726064337.797576-6-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>

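A paraphrased sketch of the off-by-one (the actual function is rtp_mark_active() in the xe rtp code): an inclusive [first, last] range contains last - first + 1 entries, so the extra "+ 2" also marked the entry following the last OR rule as active.

  /* before: marked one entry too many when the WA had OR rules */
  bitmap_set(ctx->active_entries, first, last - first + 2);

  /* after: exactly the inclusive [first, last] range */
  bitmap_set(ctx->active_entries, first, last - first + 1);
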
2024-07-26  KVM: guest_memfd: abstract how prepared folios are recorded  (Paolo Bonzini)

Right now, large folios are not supported in guest_memfd, and therefore the order used by kvm_gmem_populate() is always 0. In this scenario, using the up-to-date bit to track preparedness is nice and easy because we have one bit available per page.

In the future, however, we might have large pages that are partially populated; for example, in the case of SEV-SNP, if a large page has both shared and private areas inside, it is necessary to populate it at a granularity that is smaller than that of the guest_memfd's backing store. In that case we will have to track preparedness at a 4K level, probably as a bitmap.

In preparation for that, do not use folio_test_uptodate() and folio_mark_uptodate() explicitly. Return the state of the page directly from __kvm_gmem_get_pfn(), so that it is expected to apply to 2^N pages with N=*max_order. The function to mark a range as prepared for now takes just a folio, but is expected to also take an index and order (or something like that) when large pages are introduced.

Thanks to Michael Roth for pointing out the issue with large pages.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2024-07-26  KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfns  (Paolo Bonzini)

This check is currently performed by sev_gmem_post_populate(), but it applies to all callers of kvm_gmem_populate(): the point of the function is that the memory is being encrypted and some work has to be done on all the gfns in order to encrypt them. Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior to invoking the callback, and stop the operation if a shared page is encountered.

Because CONFIG_KVM_PRIVATE_MEM in principle does not require attributes, this makes kvm_gmem_populate() depend on CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them).

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

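A simplified sketch of the hoisted check inside the kvm_gmem_populate() loop (argument order follows the extended kvm_range_has_memory_attributes() described two entries below; see virt/kvm/guest_memfd.c for the real code):

  /* refuse to populate any gfn that is not marked private, instead of
   * leaving this to each vendor's post-populate callback */
  if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
                                       KVM_MEMORY_ATTRIBUTE_PRIVATE,
                                       KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
          ret = -EINVAL;
          break;
  }
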
2024-07-26  KVM: extend kvm_range_has_memory_attributes() to check subset of attributes  (Paolo Bonzini)

While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE, KVM code such as kvm_mem_is_private() is written to expect their existence. Allow using kvm_range_has_memory_attributes() as a multi-page version of kvm_mem_is_private(), without it breaking later when more attributes are introduced.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2024-07-26  KVM: cleanup and add shortcuts to kvm_range_has_memory_attributes()  (Paolo Bonzini)

Use a guard to simplify early returns, and add two more easy shortcuts. If the requested attributes are invalid, the attributes xarray will never show them as set. And if testing a single page, kvm_get_memory_attributes() is more efficient.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

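A sketch of the two shortcuts described above (simplified; see virt/kvm/kvm_main.c for the surrounding function): unsupported attributes can never be set, and a single page is answered by the cheaper point lookup.

  if (attrs & ~kvm_supported_mem_attributes(kvm))
          return false;   /* invalid attributes are never set in the xarray */

  if (end == start + 1)
          return (kvm_get_memory_attributes(kvm, start) & attrs) == attrs;
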
2024-07-26  KVM: guest_memfd: move check for already-populated page to common code  (Paolo Bonzini)

Do not allow populating the same page twice with startup data. In the case of SEV-SNP, for example, the firmware does not allow it anyway, since the launch-update operation is only possible on pages that are still shared in the RMP.

Even if it worked, kvm_gmem_populate()'s callback is meant to have side effects such as updating launch measurements, and updating the same page twice is unlikely to have the desired results.

Races between calls to the ioctl are not possible because kvm_gmem_populate() holds slots_lock and the VM should not be running. But again, even if this worked on other confidential computing technology, it doesn't matter to guest_memfd.c whether this is something fishy such as missing synchronization in userspace, or rather something intentional. One of the racers wins, and the page is initialized by either kvm_gmem_prepare_folio() or kvm_gmem_populate().

Out of paranoia, also adjust sev_gmem_post_populate() to use the same errno that kvm_gmem_populate() is using.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

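A simplified sketch of the common-code check (see virt/kvm/guest_memfd.c for the real flow): a folio that is already up-to-date was prepared or populated before, so populating it again is refused.

  if (folio_test_uptodate(folio)) {
          folio_put(folio);
          ret = -EEXIST;  /* same errno sev_gmem_post_populate() now uses */
          break;
  }
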