path: root/include
Age  Commit message  Author
2024-09-10  iomap: pass the iomap to the punch callback  (Christoph Hellwig)
XFS will need to look at the flags in the iomap structure, so pass it down all the way to the callback. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240910043949.3481298-5-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-10  iomap: pass flags to iomap_file_buffered_write_punch_delalloc  (Christoph Hellwig)
To fix short write error handling, we'll need to figure out what operation iomap_file_buffered_write_punch_delalloc is called for. Pass the flags argument on to it, and reorder the argument list to match that of ->iomap_end so that the compiler only has to add the new punch argument to the end of it instead of reshuffling the registers. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240910043949.3481298-4-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
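Taken together with the previous entry, a hedged sketch of what the declarations might look like afterwards (callback type and argument order assumed from the ->iomap_end comparison above, not quoted from the patches):
    typedef void (*iomap_punch_t)(struct inode *inode, loff_t offset,
                                  loff_t length, struct iomap *iomap);

    void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
                    loff_t pos, loff_t length, ssize_t written,
                    unsigned int flags, struct iomap *iomap,
                    iomap_punch_t punch);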
2024-09-10  Merge branch 'for-linus' into for-next  (Takashi Iwai)
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2024-09-10  vdpa: support set mac address from vdpa tool  (Cindy Lu)
Add a new UAPI to support setting the MAC address from the vdpa tool. The function vdpa_nl_cmd_dev_attr_set_doit() gets the new MAC address from the vdpa tool and then sets it on the device. The usage is: vdpa dev set name vdpa_name mac **:**:**:**:**:** Here is an example:
root@L1# vdpa -jp dev config show vdpa0
{ "config": { "vdpa0": { "mac": "82:4d:e9:5d:d7:e6", "link ": "up", "link_announce ": false, "mtu": 1500 } } }
root@L1# vdpa dev set name vdpa0 mac 00:11:22:33:44:55
root@L1# vdpa -jp dev config show vdpa0
{ "config": { "vdpa0": { "mac": "00:11:22:33:44:55", "link ": "up", "link_announce ": false, "mtu": 1500 } } }
Signed-off-by: Cindy Lu <lulu@redhat.com> Message-Id: <20240731031653.1047692-2-lulu@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2024-09-10  virtio_balloon: introduce memory scan/reclaim info  (zhenwei pi)
Expose memory scan/reclaim information to the host side via the virtio balloon device. Now we have a metric to analyze the memory performance:
y: counter increases
n: counter does not change
h: the rate of counter change is high
l: the rate of counter change is low
OOM: VIRTIO_BALLOON_S_OOM_KILL
STALL: VIRTIO_BALLOON_S_ALLOC_STALL
ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC
DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT
ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC
DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT
- OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]: the guest runs under really critical memory pressure
- OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]: the memory allocation stalls due to cgroup, not global memory pressure.
- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]: the memory allocation stalls due to global memory pressure. Performance is hurt a lot. A high DRCLM/DSCAN ratio shows quite effective memory reclaim.
- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]: the memory allocation stalls due to global memory pressure. The DRCLM/DSCAN ratio is low and the guest OS is thrashing heavily; in serious cases this leads to poor performance and difficult troubleshooting. For example, sshd may block on memory allocation when accepting new connections, so a user can't log in to the VM via ssh.
- OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]: the low ARCLM/ASCAN ratio shows that the guest tries to reclaim more memory but can't. Once more memory is required in the future, it will struggle to reclaim it.
Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Message-Id: <20240423034109.1552866-5-pizhenwei@bytedance.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
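As a rough, hedged illustration of how these counters (including the stall and OOM-kill ones from the next two entries) could be filled from the kernel's vm_event counters — update_stat() and pages_to_bytes() are assumed to be the existing static helpers in drivers/virtio/virtio_balloon.c, and this is a sketch rather than the literal patch:
    static unsigned int update_balloon_reclaim_stats(struct virtio_balloon *vb,
                                                     unsigned int idx)
    {
            unsigned long events[NR_VM_EVENT_ITEMS];
            unsigned long stall = 0;
            int zid;

            all_vm_events(events);
            /* allocation stalls are counted per zone type; sum them up */
            for (zid = 0; zid < MAX_NR_ZONES; zid++)
                    stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid];

            update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
            update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
            update_stat(vb, idx++, VIRTIO_BALLOON_S_SCAN_ASYNC,
                        pages_to_bytes(events[PGSCAN_KSWAPD]));
            update_stat(vb, idx++, VIRTIO_BALLOON_S_SCAN_DIRECT,
                        pages_to_bytes(events[PGSCAN_DIRECT]));
            update_stat(vb, idx++, VIRTIO_BALLOON_S_RECLAIM_ASYNC,
                        pages_to_bytes(events[PGSTEAL_KSWAPD]));
            update_stat(vb, idx++, VIRTIO_BALLOON_S_RECLAIM_DIRECT,
                        pages_to_bytes(events[PGSTEAL_DIRECT]));
            return idx;
    }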
2024-09-10  virtio_balloon: introduce memory allocation stall counter  (zhenwei pi)
The memory allocation stall counter represents the performance/latency of memory allocation. Expose this counter to the host side via the virtio balloon device, in an out-of-band way. Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Message-Id: <20240423034109.1552866-4-pizhenwei@bytedance.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-10  virtio_balloon: introduce oom-kill invocations  (zhenwei pi)
When the guest OS runs under critical memory pressure, the guest starts to kill processes. A guest monitor agent may scan 'oom_kill' from /proc/vmstat and report the OOM-KILL event. However, the agent may itself be killed, and we will lose this critical event (and the later ones). For now we can also grep for magic words in the guest kernel log from the host side, but rather than relying on that unstable approach, have the virtio balloon report OOM-KILL invocations instead. Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Message-Id: <20240423034109.1552866-3-pizhenwei@bytedance.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-10  dma-mapping: add tracing for dma-mapping API calls  (Sean Anderson)
When debugging drivers, it can often be useful to trace when memory gets (un)mapped for DMA (and can be accessed by the device). Add some tracepoints for this purpose. Use u64 instead of phys_addr_t and dma_addr_t (and similarly %llx instead of %pa) because libtraceevent can't handle typedefs in all cases. Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-09-10  Merge tag 'drm-xe-next-2024-09-05' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next  (Dave Airlie)
Cross-subsystem Changes:
- Split dma fence array creation into alloc and arm (Matthew Brost)
Driver Changes:
- Move kernel_lrc to execlist backend (Ilia)
- Fix type width for pcode command (Karthik)
- Make xe_drm.h include unambiguous (Jani)
- Fixes and debug improvements for GSC load (Daniele)
- Track resources and VF state by PF (Michal Wajdeczko)
- Fix memory leak on error path (Nirmoy)
- Cleanup header includes (Matt Roper)
- Move pcode logic to tile scope (Matt Roper)
- Move hwmon logic to device scope (Matt Roper)
- Fix media TLB invalidation (Matthew Brost)
- Threshold config fixes for PF (Michal Wajdeczko)
- Remove extra "[drm]" from logs (Michal Wajdeczko)
- Add missing runtime ref (Rodrigo Vivi)
- Fix circular locking on runtime suspend (Rodrigo Vivi)
- Fix rpm in TTM swapout path (Thomas)
Signed-off-by: Dave Airlie <airlied@redhat.com> From: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/eirx5vdvoflbbqlrzi5cip6bpu3zjojm2pxseufu3rlq4pp6xv@eytjvhizfyu6
2024-09-09  net: remove dev_pick_tx_cpu_id()  (Jakub Kicinski)
dev_pick_tx_cpu_id() was introduced with two users by commit a4ea8a3dacc3 ("net: Add generic ndo_select_queue functions"). The use in AF_PACKET was removed in 2019 by commit b71b5837f871 ("packet: rework packet_pick_tx_queue() to use common code selection"). The other user was a Netlogic XLP driver, removed in 2021 by commit 47ac6f567c28 ("staging: Remove Netlogic XLP network driver"). It's relatively unlikely that any modern driver will need an .ndo_select_queue implementation which picks purely based on CPU ID and skips XPS, so delete dev_pick_tx_cpu_id(). Found by code inspection. Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240906161059.715546-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09  sched_ext: Compact struct bpf_iter_scx_dsq_kern  (Tejun Heo)
struct bpf_iter_scx_dsq is defined as 6 u64's and bpf_iter_scx_dsq_kern was using 5 of them. We want to add two more u64 fields, but it's better if we do so while staying within bpf_iter_scx_dsq to maintain binary compatibility. The way bpf_iter_scx_dsq_kern is laid out is rather inefficient - the node field takes up three u64's but only one bit of the last u64 is used. Turn the bool into u32 flags and only use the lower 16 bits, freeing up 48 bits - 16 bits for flags, 32 bits for a u32 - for use by struct bpf_iter_scx_dsq_kern. This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern into the cursor field, reducing the struct size by a full u64. No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-09  mm/codetag: add pgalloc_tag_copy()  (Yu Zhao)
Add pgalloc_tag_copy() to transfer the codetag from the old folio to the new one during migration. This makes the original allocation sites persist across migration, rather than being lumped into the get_new_folio callbacks passed into migrate_pages(), e.g., compaction_alloc():
# echo 1 >/proc/sys/vm/compact_memory
# grep compaction_alloc /proc/allocinfo
Before this patch: 132968448 32463 mm/compaction.c:1880 func:compaction_alloc
After this patch: 0 0 mm/compaction.c:1880 func:compaction_alloc
Link: https://lkml.kernel.org/r/20240906042108.1150526-3-yuzhao@google.com Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging") Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
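A hedged sketch of the shape such a copy helper can take; the page-tag helper names (pgalloc_tag_get(), get_page_tag_ref(), and friends) are assumptions based on the allocation-tagging code, not quotes from the patch:
    static inline void pgalloc_tag_copy(struct folio *new, struct folio *old)
    {
            struct alloc_tag *tag;
            union codetag_ref *ref;

            tag = pgalloc_tag_get(&old->page);
            if (!tag)
                    return;

            ref = get_page_tag_ref(&new->page);
            if (!ref)
                    return;

            /* Clear the old folio's reference to the original allocation site. */
            clear_page_tag_ref(&old->page);
            /* Drop the bytes that were charged to the get_new_folio callback. */
            alloc_tag_sub(ref, folio_size(new));
            /* Re-point the new folio at the original allocation site's tag. */
            __alloc_tag_ref_set(ref, tag);
            put_page_tag_ref(ref);
    }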
2024-09-09  mm/codetag: fix pgalloc_tag_split()  (Yu Zhao)
The current assumption is that a large folio can only be split into order-0 folios. That is not the case for hugeTLB demotion, nor for THP split: see commit c010d47f107f ("mm: thp: split huge page to any lower order pages"). When a large folio is split into ones of a lower non-zero order, only the new head pages should be tagged. Tagging tail pages can cause imbalanced "calls" counters, since only head pages are untagged by pgalloc_tag_sub() and the "calls" counts on tail pages are leaked, e.g., # echo 2048kB >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size # echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages # time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote # echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # grep alloc_gigantic_folio /proc/allocinfo Before this patch: 0 549427200 mm/hugetlb.c:1549 func:alloc_gigantic_folio real 0m2.057s user 0m0.000s sys 0m2.051s After this patch: 0 0 mm/hugetlb.c:1549 func:alloc_gigantic_folio real 0m1.711s user 0m0.000s sys 0m1.704s Not tagging tail pages also improves the splitting time, e.g., by about 15% when demoting 1GB hugeTLB folios to 2MB ones, as shown above. Link: https://lkml.kernel.org/r/20240906042108.1150526-2-yuzhao@google.com Fixes: be25d1d4e822 ("mm: create new codetag references during page splitting") Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm/codetag: fix a typo  (Yu Zhao)
Link: https://lkml.kernel.org/r/20240906042108.1150526-1-yuzhao@google.com Fixes: 22d407b164ff ("lib: add allocation tagging support for memory allocation profiling") Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm: restart if multiple traversals raced  (Kinsey Ho)
Currently, if multiple reclaimers raced on the same position, the reclaimers which detect the race will still reclaim from the same memcg. Instead, the reclaimers which detect the race should move on to the next memcg in the hierarchy. So, in the case where multiple traversals race, jump back to the start of the mem_cgroup_iter() function to find the next memcg in the hierarchy to reclaim from. Link: https://lkml.kernel.org/r/20240905003058.1859929-5-kinseyho@google.com Reported-by: syzbot+e099d407346c45275ce9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/000000000000817cf10620e20d33@google.com/ Signed-off-by: Kinsey Ho <kinseyho@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU  (Kinsey Ho)
Patch series "Improve mem_cgroup_iter()", v4. Incremental cgroup iteration is being used again [1]. This patchset improves the reliability of mem_cgroup_iter(). It also improves simplicity and code readability. [1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/ This patch (of 5): Explicitly document that css sibling/descendant linkage is protected by cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and similar functions that it isn't necessary to hold a ref on @pos. The following changes in this patchset rely on this clarification for simplification in memcg iteration code. Link: https://lkml.kernel.org/r/20240905003058.1859929-1-kinseyho@google.com Link: https://lkml.kernel.org/r/20240905003058.1859929-2-kinseyho@google.com Suggested-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Kinsey Ho <kinseyho@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  uprobes: use vm_special_mapping close() functionality  (Sven Schnelle)
The following KASAN splat was shown: [ 44.505448] ================================================================== 20:37:27 [3421/145075] [ 44.505455] BUG: KASAN: slab-use-after-free in special_mapping_close+0x9c/0xc8 [ 44.505471] Read of size 8 at addr 00000000868dac48 by task sh/1384 [ 44.505479] [ 44.505486] CPU: 51 UID: 0 PID: 1384 Comm: sh Not tainted 6.11.0-rc6-next-20240902-dirty #1496 [ 44.505503] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0) [ 44.505508] Call Trace: [ 44.505511] [<000b0324d2f78080>] dump_stack_lvl+0xd0/0x108 [ 44.505521] [<000b0324d2f5435c>] print_address_description.constprop.0+0x34/0x2e0 [ 44.505529] [<000b0324d2f5464c>] print_report+0x44/0x138 [ 44.505536] [<000b0324d1383192>] kasan_report+0xc2/0x140 [ 44.505543] [<000b0324d2f52904>] special_mapping_close+0x9c/0xc8 [ 44.505550] [<000b0324d12c7978>] remove_vma+0x78/0x120 [ 44.505557] [<000b0324d128a2c6>] exit_mmap+0x326/0x750 [ 44.505563] [<000b0324d0ba655a>] __mmput+0x9a/0x370 [ 44.505570] [<000b0324d0bbfbe0>] exit_mm+0x240/0x340 [ 44.505575] [<000b0324d0bc0228>] do_exit+0x548/0xd70 [ 44.505580] [<000b0324d0bc1102>] do_group_exit+0x132/0x390 [ 44.505586] [<000b0324d0bc13b6>] __s390x_sys_exit_group+0x56/0x60 [ 44.505592] [<000b0324d0adcbd6>] do_syscall+0x2f6/0x430 [ 44.505599] [<000b0324d2f78434>] __do_syscall+0xa4/0x170 [ 44.505606] [<000b0324d2f9454c>] system_call+0x74/0x98 [ 44.505614] [ 44.505616] Allocated by task 1384: [ 44.505621] kasan_save_stack+0x40/0x70 [ 44.505630] kasan_save_track+0x28/0x40 [ 44.505636] __kasan_kmalloc+0xa0/0xc0 [ 44.505642] __create_xol_area+0xfa/0x410 [ 44.505648] get_xol_area+0xb0/0xf0 [ 44.505652] uprobe_notify_resume+0x27a/0x470 [ 44.505657] irqentry_exit_to_user_mode+0x15e/0x1d0 [ 44.505664] pgm_check_handler+0x122/0x170 [ 44.505670] [ 44.505672] Freed by task 1384: [ 44.505676] kasan_save_stack+0x40/0x70 [ 44.505682] kasan_save_track+0x28/0x40 [ 44.505687] kasan_save_free_info+0x4a/0x70 [ 44.505693] __kasan_slab_free+0x5a/0x70 [ 44.505698] kfree+0xe8/0x3f0 [ 44.505704] __mmput+0x20/0x370 [ 44.505709] exit_mm+0x240/0x340 [ 44.505713] do_exit+0x548/0xd70 [ 44.505718] do_group_exit+0x132/0x390 [ 44.505722] __s390x_sys_exit_group+0x56/0x60 [ 44.505727] do_syscall+0x2f6/0x430 [ 44.505732] __do_syscall+0xa4/0x170 [ 44.505738] system_call+0x74/0x98 The problem is that uprobe_clear_state() kfree's struct xol_area, which contains struct vm_special_mapping *xol_mapping. This one is passed to _install_special_mapping() in xol_add_vma(). __mput reads: static inline void __mmput(struct mm_struct *mm) { VM_BUG_ON(atomic_read(&mm->mm_users)); uprobe_clear_state(mm); exit_aio(mm); ksm_exit(mm); khugepaged_exit(mm); /* must run before exit_mmap */ exit_mmap(mm); ... } So uprobe_clear_state() in the beginning free's the memory area containing the vm_special_mapping data, but exit_mmap() uses this address later via vma->vm_private_data (which was set in _install_special_mapping(). Fix this by moving uprobe_clear_state() to uprobes.c and use it as close() callback. 
[usama.anjum@collabora.com: remove unneeded condition] Link: https://lkml.kernel.org/r/20240906101825.177490-1-usama.anjum@collabora.com Link: https://lkml.kernel.org/r/20240903073629.2442754-1-svens@linux.ibm.com Fixes: 223febc6e557 ("mm: add optional close() to struct vm_special_mapping") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm: pass vm_flags to generic_get_unmapped_area()  (Mark Brown)
In preparation for using vm_flags to ensure guard pages for shadow stacks supply them as an argument to generic_get_unmapped_area(). The only user outside of the core code is the PowerPC book3s64 implementation which is trivially wrapping the generic implementation in the radix_enabled() case. No functional changes. Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-2-a46b8b6dc0ed@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Naveen N Rao <naveen@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
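A hedged sketch of the resulting prototypes (the trailing vm_flags parameter is the point of the change; exact placement assumed, not quoted from the patch):
    extern unsigned long
    generic_get_unmapped_area(struct file *filp, unsigned long addr,
                              unsigned long len, unsigned long pgoff,
                              unsigned long flags, vm_flags_t vm_flags);

    extern unsigned long
    generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
                                      unsigned long len, unsigned long pgoff,
                                      unsigned long flags, vm_flags_t vm_flags);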
2024-09-09  mm: make arch_get_unmapped_area() take vm_flags by default  (Mark Brown)
Patch series "mm: Care about shadow stack guard gap when getting an unmapped area", v2. As covered in the commit log for c44357c2e76b ("x86/mm: care about shadow stack guard gap during placement") our current mmap() implementation does not take care to ensure that a new mapping isn't placed with existing mappings inside it's own guard gaps. This is particularly important for shadow stacks since if two shadow stacks end up getting placed adjacent to each other then they can overflow into each other which weakens the protection offered by the feature. On x86 there is a custom arch_get_unmapped_area() which was updated by the above commit to cover this case by specifying a start_gap for allocations with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and use the generic implementation of arch_get_unmapped_area() so let's make the equivalent change there so they also don't get shadow stack pages placed without guard pages. The arm64 and RISC-V shadow stack implementations are currently on the list: https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec94743 https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/ Given the addition of the use of vm_flags in the generic implementation we also simplify the set of possibilities that have to be dealt with in the core code by making arch_get_unmapped_area() take vm_flags as standard. This is a bit invasive since the prototype change touches quite a few architectures but since the parameter is ignored the change is straightforward, the simplification for the generic code seems worth it. This patch (of 3): When we introduced arch_get_unmapped_area_vmflags() in 961148704acd ("mm: introduce arch_get_unmapped_area_vmflags()") we did so as part of properly supporting guard pages for shadow stacks on x86_64, which uses a custom arch_get_unmapped_area(). Equivalent features are also present on both arm64 and RISC-V, both of which use the generic implementation of arch_get_unmapped_area() and will require equivalent modification there. Rather than continue to deal with having two versions of the functions let's bite the bullet and have all implementations of arch_get_unmapped_area() take vm_flags as a parameter. The new parameter is currently ignored by all implementations other than x86. The only caller that doesn't have a vm_flags available is mm_get_unmapped_area(), as for the x86 implementation and the wrapper used on other architectures this is modified to supply no flags. No functional changes. Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-0-a46b8b6dc0ed@kernel.org Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-1-a46b8b6dc0ed@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Helge Deller <deller@gmx.de> [parisc] Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. 
Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N Rao <naveen@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm,tmpfs: consider end of file write in shmem_is_huge  (Rik van Riel)
Take the end of a file write into consideration when deciding whether or not to use huge pages for tmpfs files when the tmpfs filesystem is mounted with huge=within_size. This allows large writes that append to the end of a file to automatically use large pages. The numbers below are for 4MB sequential writes, without fallocate, to a 16GB tmpfs file with fio. The numbers without THP or with huge=always stay the same, but the performance with huge=within_size now matches that of huge=always.
huge          before       after
4kB pages     1560 MB/s    1560 MB/s
within_size   1560 MB/s    4720 MB/s
always        4720 MB/s    4720 MB/s
[akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20240903111928.7171e60c@imladris.surriel.com Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
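Conceptually (a hedged sketch with a hypothetical helper name, assuming a write_end argument is plumbed into the decision), the huge=within_size check becomes:
    /* sketch: treat the end of the pending write as the effective file size */
    static bool shmem_within_size_ok(struct inode *inode, pgoff_t index,
                                     loff_t write_end)
    {
            loff_t i_size = max(write_end, i_size_read(inode));

            index = round_up(index + 1, HPAGE_PMD_NR);
            i_size = round_up(i_size, PAGE_SIZE);
            return (i_size >> PAGE_SHIFT) >= index;
    }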
2024-09-09  lib: zstd: export API needed for dictionary support  (Sergey Senozhatsky)
Patch series "zram: introduce custom comp backends API", v7. This series introduces support for run-time compression algorithms tuning, so users, for instance, can adjust compression/acceleration levels and provide pre-trained compression/decompression dictionaries which certain algorithms support. At this point we stop supporting (old/deprecated) comp API. We may add new acomp API support in the future, but before that zram needs to undergo some major rework (we are not ready for async compression). Some benchmarks for reference (look at column #2) *** init zstd /sys/block/zram0/mm_stat 1750659072 504622188 514355200 0 514355200 1 0 34204 34204 *** init zstd dict=/home/ss/zstd-dict-amd64 /sys/block/zram0/mm_stat 1750650880 465908890 475398144 0 475398144 1 0 34185 34185 *** init zstd level=8 dict=/home/ss/zstd-dict-amd64 /sys/block/zram0/mm_stat 1750654976 430803319 439873536 0 439873536 1 0 34185 34185 *** init lz4 /sys/block/zram0/mm_stat 1750646784 664266564 677060608 0 677060608 1 0 34288 34288 *** init lz4 dict=/home/ss/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750650880 619990300 632102912 0 632102912 1 0 34278 34278 *** init lz4hc /sys/block/zram0/mm_stat 1750630400 609023822 621232128 0 621232128 1 0 34288 34288 *** init lz4hc dict=/home/ss/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750659072 505133172 515231744 0 515231744 1 0 34278 34278 Recompress init zram zstd (prio=0), zstd level=5 (prio 1), zstd with dict (prio 2) *** zstd /sys/block/zram0/mm_stat 1750982656 504630584 514269184 0 514269184 1 0 34204 34204 *** idle recompress priority=1 (zstd level=5) /sys/block/zram0/mm_stat 1750982656 488645601 525438976 0 514269184 1 0 34204 34204 *** idle recompress priority=2 (zstd dict) /sys/block/zram0/mm_stat 1750982656 460869640 517914624 0 514269184 1 0 34185 34204 This patch (of 24): We need to export a number of API functions that enable advanced zstd usage - C/D dictionaries, dictionaries sharing between contexts, etc. Link: https://lkml.kernel.org/r/20240902105656.1383858-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20240902105656.1383858-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Nick Terrell <terrelln@fb.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  maple_tree: fix comment typo on ma_flag of allocation tree  (Wei Yang)
The maple tree flag of allocation tree is MT_FLAGS_ALLOC_RANGE. Just correct it. Link: https://lkml.kernel.org/r/20240809020115.31575-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm: fix folio_alloc_noprof()  (Kent Overstreet)
folio_alloc_noprof() wasn't calling the _noprof version, causing allocations to be accounted here instead of to the caller. Link: https://lkml.kernel.org/r/20240901202459.4867-1-kent.overstreet@linux.dev Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
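The fix is essentially a one-liner; a hedged sketch of the corrected inline (the exact name of the _noprof callee is an assumption):
    static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
    {
            /* must call the _noprof variant so accounting goes to the caller */
            return __folio_alloc_node_noprof(gfp, order, numa_node_id());
    }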
2024-09-09  mm: split underused THPs  (Usama Arif)
This is an attempt to mitigate the issue of running out of memory when THP is always enabled. During runtime whenever a THP is being faulted in (__do_huge_pmd_anonymous_page) or collapsed by khugepaged (collapse_huge_page), the THP is added to _deferred_list. Whenever memory reclaim happens in linux, the kernel runs the deferred_split shrinker which goes through the _deferred_list. If the folio was partially mapped, the shrinker attempts to split it. If the folio is not partially mapped, the shrinker checks if the THP was underused, i.e. how many of the base 4K pages of the entire THP were zero-filled. If this number goes above a certain threshold (decided by /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none), the shrinker will attempt to split that THP. Then at remap time, the pages that were zero-filled are mapped to the shared zeropage, hence saving memory. Link: https://lkml.kernel.org/r/20240830100438.3623486-6-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Suggested-by: Rik van Riel <riel@surriel.com> Co-authored-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexander Zhu <alexlzhu@fb.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kairui Song <ryncsn@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuang Zhai <zhais@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Shuang Zhai <szhai2@cs.rochester.edu> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
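A hedged sketch of the underused test described above (helper name hypothetical; khugepaged_max_ptes_none stands in for the max_ptes_none knob, and the real logic lives in the deferred_split shrinker in mm/huge_memory.c):
    static bool thp_is_underused(struct folio *folio)
    {
            int num_zero_pages = 0;
            long i;

            for (i = 0; i < folio_nr_pages(folio); i++) {
                    void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);

                    if (!memchr_inv(kaddr, 0, PAGE_SIZE))
                            num_zero_pages++;
                    kunmap_local(kaddr);
            }
            /* split once more than max_ptes_none subpages are zero-filled */
            return num_zero_pages > khugepaged_max_ptes_none;
    }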
2024-09-09  mm: introduce a pageflag for partially mapped folios  (Usama Arif)
Currently folio->_deferred_list is used to keep track of partially_mapped folios that are going to be split under memory pressure. In the next patch, all THPs that are faulted in and collapsed by khugepaged are also going to be tracked using _deferred_list. This patch introduces a pageflag to be able to distinguish between partially mapped folios and others in the deferred_list at split time in deferred_split_scan. It's needed as __folio_remove_rmap decrements _mapcount, _large_mapcount and _entire_mapcount, hence it won't be possible to distinguish between partially mapped folios and others in deferred_split_scan. Even though it introduces an extra flag to track if the folio is partially mapped, there is no functional change intended with this patch and the flag is not useful in this patch itself; it will become useful in the next patch when _deferred_list has non-partially-mapped folios. Link: https://lkml.kernel.org/r/20240830100438.3623486-5-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Cc: Alexander Zhu <alexlzhu@fb.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kairui Song <ryncsn@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuang Zhai <zhais@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Shuang Zhai <szhai2@cs.rochester.edu> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm: remap unused subpages to shared zeropage when splitting isolated thp  (Yu Zhao)
Patch series "mm: split underused THPs", v5. The current upstream default policy for THP is always. However, Meta uses madvise in production as the current THP=always policy vastly overprovisions THPs in sparsely accessed memory areas, resulting in excessive memory pressure and premature OOM killing. Using madvise + relying on khugepaged has certain drawbacks over THP=always. Using madvise hints mean THPs aren't "transparent" and require userspace changes. Waiting for khugepaged to scan memory and collapse pages into THP can be slow and unpredictable in terms of performance (i.e. you dont know when the collapse will happen), while production environments require predictable performance. If there is enough memory available, its better for both performance and predictability to have a THP from fault time, i.e. THP=always rather than wait for khugepaged to collapse it, and deal with sparsely populated THPs when the system is running out of memory. This patch series is an attempt to mitigate the issue of running out of memory when THP is always enabled. During runtime whenever a THP is being faulted in or collapsed by khugepaged, the THP is added to a list. Whenever memory reclaim happens, the kernel runs the deferred_split shrinker which goes through the list and checks if the THP was underused, i.e. how many of the base 4K pages of the entire THP were zero-filled. If this number goes above a certain threshold, the shrinker will attempt to split that THP. Then at remap time, the pages that were zero-filled are mapped to the shared zeropage, hence saving memory. This method avoids the downside of wasting memory in areas where THP is sparsely filled when THP is always enabled, while still providing the upside THPs like reduced TLB misses without having to use madvise. Meta production workloads that were CPU bound (>99% CPU utilzation) were tested with THP shrinker. The results after 2 hours are as follows: | THP=madvise | THP=always | THP=always | | | + shrinker series | | | + max_ptes_none=409 ----------------------------------------------------------------------------- Performance improvement | - | +1.8% | +1.7% (over THP=madvise) | | | ----------------------------------------------------------------------------- Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%) ----------------------------------------------------------------------------- max_ptes_none=409 means that any THP that has more than 409 out of 512 (80%) zero filled filled pages will be split. To test out the patches, the below commands without the shrinker will invoke OOM killer immediately and kill stress, but will not fail with the shrinker: echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none mkdir /sys/fs/cgroup/test echo $$ > /sys/fs/cgroup/test/cgroup.procs echo 20M > /sys/fs/cgroup/test/memory.max echo 0 > /sys/fs/cgroup/test/memory.swap.max # allocate twice memory.max for each stress worker and touch 40/512 of # each THP, i.e. vm-stride 50K. # With the shrinker, max_ptes_none of 470 and below won't invoke OOM # killer. # Without the shrinker, OOM killer is invoked immediately irrespective # of max_ptes_none value and kills stress. stress --vm 1 --vm-bytes 40M --vm-stride 50K This patch (of 5): Here being unused means containing only zeros and inaccessible to userspace. When splitting an isolated thp under reclaim or migration, the unused subpages can be mapped to the shared zeropage, hence saving memory. This is particularly helpful when the internal fragmentation of a thp is high, i.e. 
it has many untouched subpages. This is also a prerequisite for THP low utilization shrinker which will be introduced in later patches, where underutilized THPs are split, and the zero-filled pages are freed saving memory. Link: https://lkml.kernel.org/r/20240830100438.3623486-1-usamaarif642@gmail.com Link: https://lkml.kernel.org/r/20240830100438.3623486-3-usamaarif642@gmail.com Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Usama Arif <usamaarif642@gmail.com> Tested-by: Shuang Zhai <zhais@google.com> Cc: Alexander Zhu <alexlzhu@fb.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kairui Song <ryncsn@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuang Zhai <szhai2@cs.rochester.edu> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner  (Barry Song)
Three points for this change:
1. We should consolidate all warnings in one place. Currently, the order > 1 warning is in the hotpath, while others are in less likely scenarios. Moving all warnings to the slowpath will reduce the overhead for order > 1 and increase the visibility of other warnings.
2. We currently have two warnings for order: one for order > 1 in the hotpath and another for order > costly_order in the laziest path. I suggest standardizing on order > 1 since it's been in use for a long time.
3. We don't need to check for __GFP_NOWARN in this case. __GFP_NOWARN is meant to suppress allocation failure reports, but here we're dealing with bug detection, not allocation failures. So replace WARN_ON_ONCE_GFP by WARN_ON_ONCE.
[v-songbaohua@oppo.com: also update the doc for __GFP_NOFAIL with order > 1] Link: https://lkml.kernel.org/r/20240903223935.1697-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240830202823.21478-4-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: "Eugenio Pérez" <eperezma@redhat.com> Cc: Hailong.Liu <hailong.liu@oppo.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Xie Yongji <xieyongji@bytedance.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm: document __GFP_NOFAIL must be blockable  (Barry Song)
Non-blocking allocation with __GFP_NOFAIL is not supported and may still result in NULL pointers (if we don't return NULL, we result in busy-loop within non-sleepable contexts):
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                       struct alloc_context *ac)
{
        ...
        /*
         * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
         * we always retry
         */
        if (gfp_mask & __GFP_NOFAIL) {
                /*
                 * All existing users of the __GFP_NOFAIL are blockable, so warn
                 * of any new users that actually require GFP_NOWAIT
                 */
                if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
                        goto fail;
                ...
        }
        ...
fail:
        warn_alloc(gfp_mask, ac->nodemask,
                   "page allocation failure: order:%u", order);
got_pg:
        return page;
}
Highlight this in the documentation of __GFP_NOFAIL so that non-mm subsystems can reject any illegal usage of __GFP_NOFAIL with GFP_ATOMIC, GFP_NOWAIT, etc. Link: https://lkml.kernel.org/r/20240830202823.21478-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: David Hildenbrand <david@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: "Eugenio Pérez" <eperezma@redhat.com> Cc: Hailong.Liu <hailong.liu@oppo.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Xie Yongji <xieyongji@bytedance.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm/damon/core: remove per-scheme region priority histogram buffer  (SeongJae Park)
Nobody is reading from or writing to the per-scheme region priorities histogram buffer. It is only wasting memory. Remove it. Link: https://lkml.kernel.org/r/20240826042323.87025-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm/damon/core: introduce per-context region priorities histogram buffer  (SeongJae Park)
Patch series "replace per-quota region priorities histogram buffer with per-context one". Each DAMOS quota (struct damos_quota) maintains a histogram for total regions size per its prioritization score. DAMOS calcultes minimum prioritization score of regions that are ok to apply the DAMOS action to while respecting the quota. The histogram is constructed only for the calculation of the minimum score in damos_adjust_quota() for each quota which called by kdamond_fn(). Hence, there is no real reason to have per-quota histogram. Only per-kdamond histogram is needed, since parallel kdamonds could have races otherwise. The current implementation is only wasting the memory, and can easily cause unintended stack usage[1]. So, introducing a per-kdamond histogram and replacing the per-quota one with it would be the right solution for the issue. However, supporting multiple DAMON contexts per kdamond is still an ongoing work[2] without a clear estimated time of arrival. Meanwhile, per-context histogram could be an effective and straightforward solution having no blocker. Let's fix the problem first in the way. This patch (of 4): Introduce per-context buffer for region priority scores-total size histogram. Same to the per-quota one (->histogram of struct damos_quota), the new buffer is hidden from DAMON API users by being defined as a private field of DAMON context structure. It is dynamically allocated and de-allocated at the beginning and ending of the execution of the kdamond by kdamond_fn() itself. [1] commit 0742cadf5e4c ("mm/damon/lru_sort: adjust local variable to dynamic allocation") [2] https://lore.kernel.org/20240531122320.909060-1-yorha.op@gmail.com Link: https://lkml.kernel.org/r/20240826042323.87025-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240826042323.87025-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm: count the number of partially mapped anonymous THPs per size  (Barry Song)
When a THP is added to the deferred_list due to partially mapped, its partial pages are unused, leading to wasted memory and potentially increasing memory reclamation pressure. Detailing the specifics of how unmapping occurs is quite difficult and not that useful, so we adopt a simple approach: each time a THP enters the deferred_list, we increment the count by 1; whenever it leaves for any reason, we decrement the count by 1. Link: https://lkml.kernel.org/r/20240824010441.21308-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuai Yuan <yuanshuai@oppo.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm: count the number of anonymous THPs per size  (Barry Song)
Patch series "mm: count the number of anonymous THPs per size", v4. Knowing the number of transparent anon THPs in the system is crucial for performance analysis. It helps in understanding the ratio and distribution of THPs versus small folios throughout the system. Additionally, partial unmapping by userspace can lead to significant waste of THPs over time and increase memory reclamation pressure. We need this information for comprehensive system tuning. This patch (of 2): Let's track for each anonymous THP size, how many of them are currently allocated. We'll track the complete lifespan of an anon THP, starting when it becomes an anon THP ("large anon folio") (->mapping gets set), until it gets freed (->mapping gets cleared). Introduce a new "nr_anon" counter per THP size and adjust the corresponding counter in the following cases: * We allocate a new THP and call folio_add_new_anon_rmap() to map it the first time and turn it into an anon THP. * We split an anon THP into multiple smaller ones. * We migrate an anon THP, when we prepare the destination. * We free an anon THP back to the buddy. Note that AnonPages in /proc/meminfo currently tracks the total number of *mapped* anonymous *pages*, and therefore has slightly different semantics. In the future, we might also want to track "nr_anon_mapped" for each THP size, which might be helpful when comparing it to the number of allocated anon THPs (long-term pinning, stuck in swapcache, memory leaks, ...). Further note that for now, we only track anon THPs after they got their ->mapping set, for example via folio_add_new_anon_rmap(). If we would allocate some in the swapcache, they will only show up in the statistics for now after they have been mapped to user space the first time, where we call folio_add_new_anon_rmap(). [akpm@linux-foundation.org: documentation fixups, per David] Link: https://lkml.kernel.org/r/3e8add35-e26b-443b-8a04-1078f4bc78f6@redhat.com Link: https://lkml.kernel.org/r/20240824010441.21308-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240824010441.21308-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuai Yuan <yuanshuai@oppo.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  mm: cleanup count_mthp_stat() definition  (Ryan Roberts)
Patch series "Shmem mTHP controls and stats improvements", v3. This is a small series to tidy up the way the shmem controls and stats are exposed. These patches were previously part of the series at [2], but I decided to split them out since they can go in independently. This patch (of 2): Let's move count_mthp_stat() so that it's always defined, even when THP is disabled. Previously uses of the function in files such as shmem.c, which are compiled even when THP is disabled, required ugly THP ifdeferry. With this cleanup, we can remove those ifdefs and the function resolves to a nop when THP is disabled. I shortly plan to call count_mthp_stat() from more THP-invariant source files. Link: https://lkml.kernel.org/r/20240808111849.651867-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240808111849.651867-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Barry Song <baohua@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09  dt-bindings: clock, reset: fix top-comment indentation rk3576 headers  (Heiko Stuebner)
Block comments should align the * on each line, as checkpatch rightfully pointed out, so fix that style issue on the newly added rk3576 headers. Fixes: 49c04453db81 ("dt-bindings: clock, reset: Add support for rk3576") Signed-off-by: Heiko Stuebner <heiko@sntech.de> Link: https://lore.kernel.org/r/20240909223149.85364-1-heiko@sntech.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2024-09-09  cxl: move cxl headers to new include/cxl/ directory  (Dave Jiang)
Group all cxl related kernel headers into include/cxl/ directory. Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://patch.msgid.link/20240905223711.1990186-2-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2024-09-09  netdevice: add netdev_tx_reset_subqueue() shorthand  (Alexander Lobakin)
Add a shorthand similar to other net*_subqueue() helpers for resetting the queue by its index w/o obtaining &netdev_tx_queue beforehand manually. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
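The shorthand is presumably a thin wrapper along these lines (a sketch; see include/linux/netdevice.h for the authoritative definition):
    static inline void netdev_tx_reset_subqueue(const struct net_device *dev,
                                                u32 qid)
    {
            netdev_tx_reset_queue(netdev_get_tx_queue(dev, qid));
    }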
2024-09-09  libeth: add Tx buffer completion helpers  (Alexander Lobakin)
Software-side Tx buffers for storing DMA, frame size, skb pointers etc. are pretty much generic and every driver defines them the same way. The same can be said for software Tx completions -- same napi_consume_skb()s and all that... Add a couple simple wrappers for doing that to stop repeating the old tale at least within the Intel code. Drivers are free to use 'priv' member at the end of the structure. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09  net/mlx5: Add missing masks and QoS bit masks for scheduling elements  (Carolina Jubran)
Add the missing masks for supported element types and Transmit Scheduling Arbiter (TSAR) types in scheduling elements. Also, add the corresponding bit masks for these types in the QoS capabilities of a NIC scheduler. Fixes: 214baf22870c ("net/mlx5e: Support HTB offload") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09  net/mlx5: Added missing definitions in preparation for HW Steering  (Yevgeny Kliteynik)
As part of preparation for HWS, added missing definitions in qp.h and fs_core.h: - FS_FT_FDB_RX/TX table types that are used by HWS in addition to an existing FS_FT_FDB - MLX5_WQE_CTRL_INITIATOR_SMALL_FENCE that is used by HWS to require fence in WQE Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09  net/mlx5: Added missing mlx5_ifc definition for HW Steering  (Yevgeny Kliteynik)
Add mlx5_ifc definitions that are required for HWS support. Note that due to change in the mlx5_ifc_flow_table_context_bits structure that now includes both SWS and HWS bits in a union, this patch also includes small change in one of SWS files that was required for compilation. Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09  firewire: core: use mutex to coordinate concurrent calls to flush completions  (Takashi Sakamoto)
In the current implementation, test_and_set_bit_lock() is used to mediate concurrent calls of ohci_flush_iso_completions(). However, such ad-hoc use of atomic operations is not preferable. This commit uses mutex_trylock() for the same purpose, and makes the core function responsible for the mediation instead of the 1394 OHCI driver. Link: https://lore.kernel.org/r/20240909140018.65289-3-o-takashi@sakamocchi.jp Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
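A hedged sketch of the pattern (the mutex field name is an assumption, not taken from the patch): the core helper takes a per-context mutex with mutex_trylock() and simply skips the flush when another caller is already running it:
    int fw_iso_context_flush_completions(struct fw_iso_context *ctx)
    {
            int err = 0;

            /* skip if another caller is already flushing this context */
            if (mutex_trylock(&ctx->flushing_completions_mutex)) {
                    err = ctx->card->driver->flush_iso_completions(ctx);
                    mutex_unlock(&ctx->flushing_completions_mutex);
            }

            return err;
    }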
2024-09-09  lib/generic-radix-tree.c: add preallocation  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09  lib/generic-radix-tree.c: genradix_ptr_inlined()  (Kent Overstreet)
Provide an inlined fast path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09  inode: make __iget() a static inline  (Kent Overstreet)
bcachefs is switching to an rhashtable for vfs inodes instead of the standard inode.c hashtable, so we need this exported - or rather, a static inline makes more sense for a single atomic_inc(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
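i.e. roughly (sketch):
    static inline void __iget(struct inode *inode)
    {
            atomic_inc(&inode->i_count);
    }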
2024-09-09  fs: add generic_llseek_cookie()  (Christian Brauner)
This is similar to generic_file_llseek() but allows the caller to specify a cookie that will be updated to indicate that a seek happened. Callers requiring that information in their readdir implementations can use it. Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-8-6d3e4816aa7b@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
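Presumably the helper's signature looks like this (a sketch; the u64 cookie type is an assumption):
    loff_t generic_llseek_cookie(struct file *file, loff_t offset, int whence,
                                 u64 *cookie);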
2024-09-09  mtd: spinand: Add support for setting plane select bits  (Cheng Ming Lin)
Add two flags for inserting the Plane Select bit into the column address during the write_to_cache and the read_from_cache operation. Add the SPINAND_HAS_PROG_PLANE_SELECT_BIT flag for serial NAND flash that require inserting the Plane Select bit into the column address during the write_to_cache operation. Add the SPINAND_HAS_READ_PLANE_SELECT_BIT flag for serial NAND flash that require inserting the Plane Select bit into the column address during the read_from_cache operation. Signed-off-by: Cheng Ming Lin <chengminglin@mxic.com.tw> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/linux-mtd/20240909092643.2434479-2-linchengming884@gmail.com
2024-09-09  wifi: cfg80211: fix kernel-doc for per-link data  (Johannes Berg)
There cannot be brackets in kernel-doc, remove them. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 62c16f219a73 ("wifi: cfg80211: move DFS related members to links[] in wireless_dev") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-09-09  Merge 6.11-rc7 into usb-next  (Greg Kroah-Hartman)
We need the USB fixes in here as well, and this also resolves the merge conflict in: drivers/usb/typec/ucsi/ucsi.c Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-09  Merge 6.11-rc7 into char-misc-next  (Greg Kroah-Hartman)
We need the char-misc fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-08  treewide: Fix wrong singular form of jiffies in comments  (Anna-Maria Behnsen)
There are several comments all over the place which use a wrong singular form of jiffies. Replace 'jiffie' by 'jiffy'. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de