path: root/include/linux
Age  Commit message  Author
2024-11-06  io_uring/rsrc: split io_kiocb node type assignments  (Jens Axboe)
Currently the io_rsrc_node assignment in io_kiocb is an array of two pointers, as two nodes may be assigned to a request - one file node, and one buffer node. However, the buffer node can co-exist with the provided buffers, as currently it's not supported to use both provided and registered buffers at the same time. This crucially brings struct io_kiocb down to 4 cache lines again, as before it spilled into the 5th cacheline. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06  ACPI: processor: Move arch_init_invariance_cppc() call later  (Mario Limonciello)
arch_init_invariance_cppc() is called at the end of acpi_cppc_processor_probe() in order to configure frequency invariance based upon the values from _CPC. This however doesn't work on AMD CPPC shared memory designs that have AMD preferred cores enabled because _CPC needs to be analyzed from all cores to judge if preferred cores are enabled. This issue manifests to users as a warning since commit 21fb59ab4b97 ("ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"): ``` Could not retrieve highest performance (-19) ``` However the warning isn't the cause of this, it was actually commit 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()") which exposed the issue. To fix this problem, change arch_init_invariance_cppc() into a new weak symbol that is called at the end of acpi_processor_driver_init(). Each architecture that supports it can declare the symbol to override the weak one. Define it for x86, in arch/x86/kernel/acpi/cppc.c, and for all of the architectures using the generic arch_topology.c code. Fixes: 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()") Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219431 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patch.msgid.link/20241104222855.3959267-1-superm1@kernel.org [ rjw: Changelog edit ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
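The weak-symbol-plus-arch-override pattern described above looks roughly like the following sketch; the file placement and the forwarded x86 helper are assumptions for illustration, not quotes of the patch:

        /* Generic weak stub (sketch): a no-op unless an architecture overrides it. */
        void __weak arch_init_invariance_cppc(void)
        {
        }

        /*
         * x86 override (sketch, assumed to live in arch/x86/kernel/acpi/cppc.c):
         * runs after acpi_processor_driver_init() has probed _CPC on all cores,
         * so AMD preferred-core detection sees every core before frequency
         * invariance is configured.
         */
        void arch_init_invariance_cppc(void)
        {
                init_freq_invariance_cppc();    /* assumed existing x86 helper */
        }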
2024-11-06  fs/xattr: add *at family syscalls  (Christian Göttsche)
Add the four syscalls setxattrat(), getxattrat(), listxattrat() and removexattrat(). Those can be used to operate on extended attributes, especially security related ones, either relative to a pinned directory or on a file descriptor without read access, avoiding a /proc/<pid>/fd/<fd> detour, requiring a mounted procfs. One use case will be setfiles(8) setting SELinux file contexts ("security.selinux") without race conditions and without a file descriptor opened with read access requiring SELinux read permission. Use the do_{name}at() pattern from fs/open.c. Pass the value of the extended attribute, its length, and for setxattrat(2) the command (XATTR_CREATE or XATTR_REPLACE) via an added struct xattr_args to not exceed six syscall arguments and not merging the AT_* and XATTR_* flags. [AV: fixes by Christian Brauner folded in, the entire thing rebased on top of {filename,file}_...xattr() primitives, treatment of empty pathnames regularized. As the result, AT_EMPTY_PATH+NULL handling is cheap, so f...(2) can use it] Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Link: https://lore.kernel.org/r/20240426162042.191916-1-cgoettsche@seltendoof.de Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <brauner@kernel.org> CC: x86@kernel.org CC: linux-alpha@vger.kernel.org CC: linux-kernel@vger.kernel.org CC: linux-arm-kernel@lists.infradead.org CC: linux-ia64@vger.kernel.org CC: linux-m68k@lists.linux-m68k.org CC: linux-mips@vger.kernel.org CC: linux-parisc@vger.kernel.org CC: linuxppc-dev@lists.ozlabs.org CC: linux-s390@vger.kernel.org CC: linux-sh@vger.kernel.org CC: sparclinux@vger.kernel.org CC: linux-fsdevel@vger.kernel.org CC: audit@vger.kernel.org CC: linux-arch@vger.kernel.org CC: linux-api@vger.kernel.org CC: linux-security-module@vger.kernel.org CC: selinux@vger.kernel.org [brauner: slight tweaks] Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
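For orientation, a hedged userspace sketch of the new setxattrat() call; the struct layout, argument order, and the raw syscall(2) invocation are assumptions inferred from the description above, not a verified ABI reference:

        #include <stdint.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Assumed layout of the added argument struct: value pointer, value size,
         * and XATTR_CREATE/XATTR_REPLACE flags. */
        struct xattr_args {
                uint64_t value;
                uint32_t size;
                uint32_t flags;
        };

        /* Set "security.selinux" relative to a pinned directory fd, without opening
         * the target for reading (hypothetical usage; __NR_setxattrat exists only
         * on kernels carrying this series). */
        static int set_label(int dirfd, const char *path, const char *ctx)
        {
                struct xattr_args args = {
                        .value = (uint64_t)(uintptr_t)ctx,
                        .size  = (uint32_t)strlen(ctx) + 1,
                        .flags = 0,
                };

                return syscall(__NR_setxattrat, dirfd, path, 0 /* AT_* flags */,
                               "security.selinux", &args, sizeof(args));
        }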
2024-11-06  Merge tag 'perf-core-for-bpf-next' from tip tree  (Andrii Nakryiko)
Stable tag for bpf-next's uprobe work. Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-06  libfs: Fix kernel-doc warning in generic_ci_validate_strict_name  (André Almeida)
Fix the indentation of the return values from generic_ci_validate_strict_name() to properly render the comment and to address a `make htmldocs` warning: Documentation/filesystems/api-summary:14: include/linux/fs.h:3504: WARNING: Bullet list ends without a blank line; unexpected unindent. Fixes: 0e152beb5aa1 ("libfs: Create the helper function generic_ci_validate_strict_name()") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/lkml/20241030162435.05425f60@canb.auug.org.au/ Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241101164251.327884-2-andrealmeid@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-06  leds: class: Protect brightness_show() with led_cdev->led_access mutex  (Mukesh Ojha)
There is NULL pointer issue observed if from Process A where hid device being added which results in adding a led_cdev addition and later a another call to access of led_cdev attribute from Process B can result in NULL pointer issue. Use mutex led_cdev->led_access to protect access to led->cdev and its attribute inside brightness_show() and max_brightness_show() and also update the comment for mutex that it should be used to protect the led class device fields. Process A Process B kthread+0x114 worker_thread+0x244 process_scheduled_works+0x248 uhid_device_add_worker+0x24 hid_add_device+0x120 device_add+0x268 bus_probe_device+0x94 device_initial_probe+0x14 __device_attach+0xfc bus_for_each_drv+0x10c __device_attach_driver+0x14c driver_probe_device+0x3c __driver_probe_device+0xa0 really_probe+0x190 hid_device_probe+0x130 ps_probe+0x990 ps_led_register+0x94 devm_led_classdev_register_ext+0x58 led_classdev_register_ext+0x1f8 device_create_with_groups+0x48 device_create_groups_vargs+0xc8 device_add+0x244 kobject_uevent+0x14 kobject_uevent_env[jt]+0x224 mutex_unlock[jt]+0xc4 __mutex_unlock_slowpath+0xd4 wake_up_q+0x70 try_to_wake_up[jt]+0x48c preempt_schedule_common+0x28 __schedule+0x628 __switch_to+0x174 el0t_64_sync+0x1a8/0x1ac el0t_64_sync_handler+0x68/0xbc el0_svc+0x38/0x68 do_el0_svc+0x1c/0x28 el0_svc_common+0x80/0xe0 invoke_syscall+0x58/0x114 __arm64_sys_read+0x1c/0x2c ksys_read+0x78/0xe8 vfs_read+0x1e0/0x2c8 kernfs_fop_read_iter+0x68/0x1b4 seq_read_iter+0x158/0x4ec kernfs_seq_show+0x44/0x54 sysfs_kf_seq_show+0xb4/0x130 dev_attr_show+0x38/0x74 brightness_show+0x20/0x4c dualshock4_led_get_brightness+0xc/0x74 [ 3313.874295][ T4013] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000060 [ 3313.874301][ T4013] Mem abort info: [ 3313.874303][ T4013] ESR = 0x0000000096000006 [ 3313.874305][ T4013] EC = 0x25: DABT (current EL), IL = 32 bits [ 3313.874307][ T4013] SET = 0, FnV = 0 [ 3313.874309][ T4013] EA = 0, S1PTW = 0 [ 3313.874311][ T4013] FSC = 0x06: level 2 translation fault [ 3313.874313][ T4013] Data abort info: [ 3313.874314][ T4013] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000 [ 3313.874316][ T4013] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 3313.874318][ T4013] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 3313.874320][ T4013] user pgtable: 4k pages, 39-bit VAs, pgdp=00000008f2b0a000 .. [ 3313.874332][ T4013] Dumping ftrace buffer: [ 3313.874334][ T4013] (ftrace buffer empty) .. .. [ dd3313.874639][ T4013] CPU: 6 PID: 4013 Comm: InputReader [ 3313.874648][ T4013] pc : dualshock4_led_get_brightness+0xc/0x74 [ 3313.874653][ T4013] lr : led_update_brightness+0x38/0x60 [ 3313.874656][ T4013] sp : ffffffc0b910bbd0 .. .. 
[ 3313.874685][ T4013] Call trace: [ 3313.874687][ T4013] dualshock4_led_get_brightness+0xc/0x74 [ 3313.874690][ T4013] brightness_show+0x20/0x4c [ 3313.874692][ T4013] dev_attr_show+0x38/0x74 [ 3313.874696][ T4013] sysfs_kf_seq_show+0xb4/0x130 [ 3313.874700][ T4013] kernfs_seq_show+0x44/0x54 [ 3313.874703][ T4013] seq_read_iter+0x158/0x4ec [ 3313.874705][ T4013] kernfs_fop_read_iter+0x68/0x1b4 [ 3313.874708][ T4013] vfs_read+0x1e0/0x2c8 [ 3313.874711][ T4013] ksys_read+0x78/0xe8 [ 3313.874714][ T4013] __arm64_sys_read+0x1c/0x2c [ 3313.874718][ T4013] invoke_syscall+0x58/0x114 [ 3313.874721][ T4013] el0_svc_common+0x80/0xe0 [ 3313.874724][ T4013] do_el0_svc+0x1c/0x28 [ 3313.874727][ T4013] el0_svc+0x38/0x68 [ 3313.874730][ T4013] el0t_64_sync_handler+0x68/0xbc [ 3313.874732][ T4013] el0t_64_sync+0x1a8/0x1ac Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Anish Kumar <yesanishhere@gmail.com> Link: https://lore.kernel.org/r/20241103160527.82487-1-quic_mojha@quicinc.com Signed-off-by: Lee Jones <lee@kernel.org>
2024-11-06  leds: Introduce ordered workqueue for LEDs events instead of system_wq  (Dmitry Rokosov)
Set up an ordered workqueue for LED events. This may be useful because the default 'system_wq' does not guarantee the execution order of individual work_structs, so for several brightness update requests (for multiple LEDs) the real brightness switches could happen in random order. For sysfs-based LEDs there is a flush_work() call inside the brightness_store() operation, but it is a blocking call, so the userspace caller can be blocked for a long time, which means an LED animation stream can be broken. An ordered workqueue has the same behaviour as system_wq + flush_work(), but all scheduled works are asynchronous and the userspace caller is not blocked, which is better for userspace animation scheduling. Signed-off-by: Alexey Romanov <avromanov@salutedevices.com> Signed-off-by: Dmitry Rokosov <ddrokosov@salutedevices.com> Link: https://lore.kernel.org/r/20240903223936.21292-1-ddrokosov@salutedevices.com [Lee: Couple of style fix-ups] Signed-off-by: Lee Jones <lee@kernel.org>
2024-11-06  clk: Add devm_clk_hw_register_gate_parent_hw()  (Claudiu Beznea)
Add devm_clk_hw_register_gate_parent_hw() macro to allow registering devres managed gate clocks providing struct clk_hw object as parent. Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Link: https://lore.kernel.org/20241101095720.2247815-3-claudiu.beznea.uj@bp.renesas.com Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
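A hedged usage sketch of the new macro; the driver function, register offset, and bit index are invented, and the argument order is assumed to mirror the existing devm_clk_hw_register_gate_parent_data() variant:

        #include <linux/clk-provider.h>

        /* Hypothetical probe helper: register a devres-managed gate clock whose
         * parent is handed over as a struct clk_hw pointer. */
        static int foo_register_gate(struct device *dev, void __iomem *base,
                                     struct clk_hw *parent_hw, spinlock_t *lock)
        {
                struct clk_hw *gate;

                gate = devm_clk_hw_register_gate_parent_hw(dev, "foo-uart-gate",
                                                           parent_hw, 0,
                                                           base + 0x10, 3, 0, lock);
                return PTR_ERR_OR_ZERO(gate);
        }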
2024-11-05  lib min_heap: avoid indirect function call by providing default swap  (Kuan-Wei Chiu)
The non-inline min heap API can result in an indirect function call to the custom swap function. This becomes particularly costly when CONFIG_MITIGATION_RETPOLINE is enabled, as indirect function calls are expensive in this case. To address this, copy the code from lib/sort.c and provide a default builtin swap implementation that performs element swaps based on the element size. This change allows most users to avoid the overhead of indirect function calls, improving efficiency. Link: https://lkml.kernel.org/r/20241020040200.939973-4-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
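A simplified sketch of the idea of a size-based default swap; the real lib/sort.c code also special-cases 4- and 8-byte aligned elements, this is only the generic byte loop:

        /* Fallback element swap used when no custom swap callback is supplied:
         * exchange two elements of 'size' bytes, one byte at a time. */
        static void builtin_swap_bytes(void *a, void *b, size_t size)
        {
                unsigned char *pa = a, *pb = b;

                while (size--) {
                        unsigned char tmp = *pa;

                        *pa++ = *pb;
                        *pb++ = tmp;
                }
        }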
2024-11-05  lib min_heap: optimize min heap by prescaling counters for better performance  (Kuan-Wei Chiu)
Improve the efficiency of the min heap by prescaling counters, eliminating the need to repeatedly compute 'index * element_size' when accessing elements. By doing so, we avoid the overhead associated with recalculating the byte offset for each heap operation. However, with prescaling, the calculation for the parent element's location is no longer as simple as '(i - 1) / 2'. To address this, we copy the parent function from 'lib/sort.c', which calculates the parent offset in a branchless manner without using any division instructions. This optimization should result in a more efficient heap implementation by reducing the computational overhead of finding parent and child offsets. Link: https://lkml.kernel.org/r/20241020040200.939973-3-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
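The branchless parent calculation borrowed from lib/sort.c operates on prescaled byte offsets; a sketch mirroring that helper:

        /*
         * 'i' is the byte offset of a child element (index * size), 'lsbit' is
         * the lowest set bit of 'size'. The parent offset is found by subtracting
         * 'size' once, subtracting it a second time when the child index is even
         * (a right-hand child), and then halving, so no division by an arbitrary
         * element size is needed.
         */
        static size_t parent_off(size_t i, unsigned int lsbit, size_t size)
        {
                i -= size;
                i -= size & -(i & lsbit);
                return i / 2;
        }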
2024-11-05  lib/min_heap: introduce non-inline versions of min heap API functions  (Kuan-Wei Chiu)
Patch series "Enhance min heap API with non-inline functions and optimizations", v2. Add non-inline versions of the min heap API functions in lib/min_heap.c and updates all users outside of kernel/events/core.c to use these non-inline versions. To mitigate the performance impact of indirect function calls caused by the non-inline versions of the swap and compare functions, a builtin swap has been introduced that swaps elements based on their size. Additionally, it micro-optimizes the efficiency of the min heap by pre-scaling the counter, following the same approach as in lib/sort.c. Documentation for the min heap API has also been added to the core-api section. This patch (of 10): All current min heap API functions are marked with '__always_inline'. However, as the number of users increases, inlining these functions everywhere leads to a increase in kernel size. In performance-critical paths, such as when perf events are enabled and min heap functions are called on every context switch, it is important to retain the inline versions for optimal performance. To balance this, the original inline functions are kept, and additional non-inline versions of the functions have been added in lib/min_heap.c. Link: https://lkml.kernel.org/r/20241020040200.939973-1-visitorckw@gmail.com Link: https://lore.kernel.org/20240522161048.8d8bbc7b153b4ecd92c50666@linux-foundation.org Link: https://lkml.kernel.org/r/20241020040200.939973-2-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  percpu: cast percpu pointer in PERCPU_PTR() via unsigned long  (Uros Bizjak)
Cast pointer from percpu address space to generic (kernel) address space in PERCPU_PTR() macro via unsigned long intermediate cast [1]. This intermediate cast is also required to avoid build failure when GCC's strict named address space checks for x86 targets [2] are enabled. Found by GCC's named address space checks. [1] https://sparse.docs.kernel.org/en/latest/annotations.html#address-space-name [2] https://gcc.gnu.org/onlinedocs/gcc/Named-Address-Spaces.html#x86-Named-Address-Spaces Link: https://lkml.kernel.org/r/20241021080856.48746-3-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  percpu: introduce PERCPU_PTR() macro  (Uros Bizjak)
Introduce PERCPU_PTR() macro to cast the percpu pointer from the percpu address space to a generic (kernel) address space. Use it in per_cpu_ptr() and related SHIFT_PERCPU_PTR() macros. Also remove common knowledge from SHIFT_PERCPU_PTR() comment, "weird cast" is just a standard way to inform sparse of a cast from the percpu address space to a generic address space. Link: https://lkml.kernel.org/r/20241021080856.48746-2-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
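A sketch of what the macro shape looks like; the in-tree definition in include/linux/percpu-defs.h may differ in detail, the point is the intermediate unsigned long:

        /*
         * Strip the __percpu address-space qualifier by casting through an
         * unsigned long, so neither sparse nor GCC's named address-space checks
         * see a direct pointer cast between address spaces.
         */
        #define PERCPU_PTR(ptr)                                                 \
                ((typeof(*(ptr)) __force __kernel *)((__force unsigned long)(ptr)))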
2024-11-05  percpu: merge VERIFY_PERCPU_PTR() into its only user  (Uros Bizjak)
Merge VERIFY_PERCPU_PTR() into non-CONFIG_SMP per_cpu_ptr() to make macro similar to CONFIG_SMP per_cpu_ptr(). This will allow a follow-up patch to refactor common code to a macro. No functional changes, non-CONFIG_SMP per_cpu_ptr() was the only user of VERIFY_PERCPU_PTR(). Link: https://lkml.kernel.org/r/20241021080856.48746-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  scatterlist: fix a typo  (Sui Jingfeng)
Replace the 'One' with 'On'. Link: https://lkml.kernel.org/r/20241012100817.323007-1-sui.jingfeng@linux.dev Fixes: af2880ec4402 ("scatterlist: add dedicated config for DMA flags") Signed-off-by: Sui Jingfeng <sui.jingfeng@linux.dev> Reviewed-by: Petr Tesarik <petr@tesarici.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  reboot: move reboot_notifier_list to kernel/reboot.c  (Thomas Weißschuh)
All the functions related to the reboot notifier list are in kernel/reboot.c. Move the list itself, too. As there are no direct users anymore, make the declaration static. Link: https://lkml.kernel.org/r/20241012-reboot_notifier_list-v1-1-6093bb9455ce@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  get rid of __get_task_comm()  (Yafang Shao)
Patch series "Improve the copy of task comm", v8. Using {memcpy,strncpy,strcpy,kstrdup} to copy the task comm relies on the length of task comm. Changes in the task comm could result in a destination string that is overflow. Therefore, we should explicitly ensure the destination string is always NUL-terminated, regardless of the task comm. This approach will facilitate future extensions to the task comm. As suggested by Linus [0], we can identify all relevant code with the following git grep command: git grep 'memcpy.*->comm\>' git grep 'kstrdup.*->comm\>' git grep 'strncpy.*->comm\>' git grep 'strcpy.*->comm\>' PATCH #2~#4: memcpy PATCH #5~#6: kstrdup PATCH #7: strcpy Please note that strncpy() is not included in this series as it is being tracked by another effort. [1] This patch (of 7): We want to eliminate the use of __get_task_comm() for the following reasons: - The task_lock() is unnecessary Quoted from Linus [0]: : Since user space can randomly change their names anyway, using locking : was always wrong for readers (for writers it probably does make sense : to have some lock - although practically speaking nobody cares there : either, but at least for a writer some kind of race could have : long-term mixed results Link: https://lkml.kernel.org/r/20241007144911.27693-1-laoar.shao@gmail.com Link: https://lkml.kernel.org/r/20241007144911.27693-2-laoar.shao@gmail.com Link: https://lore.kernel.org/all/CAHk-=wivfrF0_zvf+oj6==Sh=-npJooP8chLPEfaFV0oNYTTBA@mail.gmail.com [0] Link: https://lore.kernel.org/all/CAHk-=whWtUC-AjmGJveAETKOMeMFSTwKwu99v7+b6AyHMmaDFA@mail.gmail.com/ Link: https://lore.kernel.org/all/CAHk-=wjAmmHUg6vho1KjzQi2=psR30+CogFd4aXrThr2gsiS4g@mail.gmail.com/ [0] Link: https://github.com/KSPP/linux/issues/90 [1] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Matus Jokay <matus.jokay@stuba.sk> Cc: Alejandro Colomar <alx@kernel.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maxime Ripard <mripard@kernel.org> Cc: Ondrej Mosnacek <omosnace@redhat.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Quentin Monnet <qmo@kernel.org> Cc: Simon Horman <horms@kernel.org> Cc: Stephen Smalley <stephen.smalley.work@gmail.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  mm: migrate LRU_REFS_MASK bits in folio_migrate_flags  (Zhaoyang Huang)
Bits of LRU_REFS_MASK are not inherited during migration, which leads to the new folio starting from tier0 when MGLRU is enabled. Try to bring over as many bits of folio->flags as possible, since compaction and alloc_contig_range, which introduce migration, do happen at times. Link: https://lkml.kernel.org/r/20240926050647.5653-1-zhaoyang.huang@unisoc.com Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Suggested-by: Yu Zhao <yuzhao@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  mm: pgtable: remove pte_offset_map_nolock()  (Qi Zheng)
Now no users are using the pte_offset_map_nolock(), remove it. Link: https://lkml.kernel.org/r/d04f9bbbcde048fb6ffa6f2bdbc6f9b22d5286f9.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()  (Qi Zheng)
Patch series "introduce pte_offset_map_{ro|rw}_nolock()", v5. As proposed by David Hildenbrand [1], this series introduces the following two new helper functions to replace pte_offset_map_nolock(). 1. pte_offset_map_ro_nolock() 2. pte_offset_map_rw_nolock() As the name suggests, pte_offset_map_ro_nolock() is used for read-only case. In this case, only read-only operations will be performed on PTE page after the PTL is held. The RCU lock in pte_offset_map_nolock() will ensure that the PTE page will not be freed, and there is no need to worry about whether the pmd entry is modified. Therefore pte_offset_map_ro_nolock() is just a renamed version of pte_offset_map_nolock(). pte_offset_map_rw_nolock() is used for may-write case. In this case, the pte or pmd entry may be modified after the PTL is held, so we need to ensure that the pmd entry has not been modified concurrently. So in addition to the name change, it also outputs the pmdval when successful. The users should make sure the page table is stable like checking pte_same() or checking pmd_same() by using the output pmdval before performing the write operations. This series will convert all pte_offset_map_nolock() into the above two helper functions one by one, and finally completely delete it. This also a preparation for reclaiming the empty user PTE page table pages. This patch (of 13): Currently, the usage of pte_offset_map_nolock() can be divided into the following two cases: 1) After acquiring PTL, only read-only operations are performed on the PTE page. In this case, the RCU lock in pte_offset_map_nolock() will ensure that the PTE page will not be freed, and there is no need to worry about whether the pmd entry is modified. 2) After acquiring PTL, the pte or pmd entries may be modified. At this time, we need to ensure that the pmd entry has not been modified concurrently. To more clearing distinguish between these two cases, this commit introduces two new helper functions to replace pte_offset_map_nolock(). For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition to changing the name to pte_offset_map_rw_nolock(), it also outputs the pmdval when successful. It is applicable for may-write cases where any modification operations to the page table may happen after the corresponding spinlock is held afterwards. But the users should make sure the page table is stable like checking pte_same() or checking pmd_same() by using the output pmdval before performing the write operations. Note: "RO" / "RW" expresses the intended semantics, not that the *kmap* will be read-only/read-write protected. Subsequent commits will convert pte_offset_map_nolock() into the above two functions one by one, and finally completely delete it. Link: https://lkml.kernel.org/r/cover.1727332572.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/5aeecfa131600a454b1f3a038a1a54282ca3b856.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  mm: move mm flags to mm_types.h  (Nanyong Sun)
The types of mm flags now go far beyond the core dump related features. This patch moves the mm flags from linux/sched/coredump.h to linux/mm_types.h. Since linux/sched/coredump.h includes mm_types.h, the C files related to core dump do not need any header inclusion changes. In addition, the inclusion of sched/coredump.h can now be deleted from the C files that are irrelevant to core dump. Link: https://lkml.kernel.org/r/20240926074922.2721274-1-sunnanyong@huawei.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  mm/mempolicy: fix comments for better documentation  (Tanya Agarwal)
Fix a typo in mempolicy.h and correct the stated number of allowed memory policies. Link: https://lkml.kernel.org/r/20240926183516.4034-2-tanyaagarwal25699@gmail.com Signed-off-by: Tanya Agarwal <tanyaagarwal25699@gmail.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Cc: Anup Sharma <anupnewsmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  mm/vmalloc: combine all TLB flush operations of KASAN shadow virtual address into one operation  (Adrian Huang)
into one operation When compiling kernel source 'make -j $(nproc)' with the up-and-running KASAN-enabled kernel on a 256-core machine, the following soft lockup is shown: watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [kworker/28:1:1760] CPU: 28 PID: 1760 Comm: kworker/28:1 Kdump: loaded Not tainted 6.10.0-rc5 #95 Workqueue: events drain_vmap_area_work RIP: 0010:smp_call_function_many_cond+0x1d8/0xbb0 Code: 38 c8 7c 08 84 c9 0f 85 49 08 00 00 8b 45 08 a8 01 74 2e 48 89 f1 49 89 f7 48 c1 e9 03 41 83 e7 07 4c 01 e9 41 83 c7 03 f3 90 <0f> b6 01 41 38 c7 7c 08 84 c0 0f 85 d4 06 00 00 8b 45 08 a8 01 75 RSP: 0018:ffffc9000cb3fb60 EFLAGS: 00000202 RAX: 0000000000000011 RBX: ffff8883bc4469c0 RCX: ffffed10776e9949 RDX: 0000000000000002 RSI: ffff8883bb74ca48 RDI: ffffffff8434dc50 RBP: ffff8883bb74ca40 R08: ffff888103585dc0 R09: ffff8884533a1800 R10: 0000000000000004 R11: ffffffffffffffff R12: ffffed1077888d39 R13: dffffc0000000000 R14: ffffed1077888d38 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff8883bc400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005577b5c8d158 CR3: 0000000004850000 CR4: 0000000000350ef0 Call Trace: <IRQ> ? watchdog_timer_fn+0x2cd/0x390 ? __pfx_watchdog_timer_fn+0x10/0x10 ? __hrtimer_run_queues+0x300/0x6d0 ? sched_clock_cpu+0x69/0x4e0 ? __pfx___hrtimer_run_queues+0x10/0x10 ? srso_return_thunk+0x5/0x5f ? ktime_get_update_offsets_now+0x7f/0x2a0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? hrtimer_interrupt+0x2ca/0x760 ? __sysvec_apic_timer_interrupt+0x8c/0x2b0 ? sysvec_apic_timer_interrupt+0x6a/0x90 </IRQ> <TASK> ? asm_sysvec_apic_timer_interrupt+0x16/0x20 ? smp_call_function_many_cond+0x1d8/0xbb0 ? __pfx_do_kernel_range_flush+0x10/0x10 on_each_cpu_cond_mask+0x20/0x40 flush_tlb_kernel_range+0x19b/0x250 ? srso_return_thunk+0x5/0x5f ? kasan_release_vmalloc+0xa7/0xc0 purge_vmap_node+0x357/0x820 ? __pfx_purge_vmap_node+0x10/0x10 __purge_vmap_area_lazy+0x5b8/0xa10 drain_vmap_area_work+0x21/0x30 process_one_work+0x661/0x10b0 worker_thread+0x844/0x10e0 ? srso_return_thunk+0x5/0x5f ? __kthread_parkme+0x82/0x140 ? __pfx_worker_thread+0x10/0x10 kthread+0x2a5/0x370 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Debugging Analysis: 1. The following ftrace log shows that the lockup CPU spends too much time iterating vmap_nodes and flushing TLB when purging vm_area structures. (Some info is trimmed). kworker: funcgraph_entry: | drain_vmap_area_work() { kworker: funcgraph_entry: | mutex_lock() { kworker: funcgraph_entry: 1.092 us | __cond_resched(); kworker: funcgraph_exit: 3.306 us | } ... ... kworker: funcgraph_entry: | flush_tlb_kernel_range() { ... ... kworker: funcgraph_exit: # 7533.649 us | } ... ... kworker: funcgraph_entry: 2.344 us | mutex_unlock(); kworker: funcgraph_exit: $ 23871554 us | } The drain_vmap_area_work() spends over 23 seconds. There are 2805 flush_tlb_kernel_range() calls in the ftrace log. * One is called in __purge_vmap_area_lazy(). * Others are called by purge_vmap_node->kasan_release_vmalloc. purge_vmap_node() iteratively releases kasan vmalloc allocations and flushes TLB for each vmap_area. - [Rough calculation] Each flush_tlb_kernel_range() runs about 7.5ms. -- 2804 * 7.5ms = 21.03 seconds. -- That's why a soft lock is triggered. 2. Extending the soft lockup time can work around the issue (For example, # echo 60 > /proc/sys/kernel/watchdog_thresh). 
This confirms the above-mentioned speculation: drain_vmap_area_work() spends too much time. If we combine all TLB flush operations of the KASAN shadow virtual address into one operation in the call path 'purge_vmap_node()->kasan_release_vmalloc()', the running time of drain_vmap_area_work() can be saved greatly. The idea is from the flush_tlb_kernel_range() call in __purge_vmap_area_lazy(). And, the soft lockup won't be triggered. Here is the test result based on 6.10: [6.10 wo/ the patch] 1. ftrace latency profiling (record a trace if the latency > 20s). echo 20000000 > /sys/kernel/debug/tracing/tracing_thresh echo drain_vmap_area_work > /sys/kernel/debug/tracing/set_graph_function echo function_graph > /sys/kernel/debug/tracing/current_tracer echo 1 > /sys/kernel/debug/tracing/tracing_on 2. Run `make -j $(nproc)` to compile the kernel source 3. Once the soft lockup is reproduced, check the ftrace log: cat /sys/kernel/debug/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 76) $ 50412985 us | } /* __purge_vmap_area_lazy */ 76) $ 50412997 us | } /* drain_vmap_area_work */ 76) $ 29165911 us | } /* __purge_vmap_area_lazy */ 76) $ 29165926 us | } /* drain_vmap_area_work */ 91) $ 53629423 us | } /* __purge_vmap_area_lazy */ 91) $ 53629434 us | } /* drain_vmap_area_work */ 91) $ 28121014 us | } /* __purge_vmap_area_lazy */ 91) $ 28121026 us | } /* drain_vmap_area_work */ [6.10 w/ the patch] 1. Repeat step 1-2 in "[6.10 wo/ the patch]" 2. The soft lockup is not triggered and ftrace log is empty. cat /sys/kernel/debug/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 3. Setting 'tracing_thresh' to 10/5 seconds does not get any ftrace log. 4. Setting 'tracing_thresh' to 1 second gets ftrace log. cat /sys/kernel/debug/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 23) $ 1074942 us | } /* __purge_vmap_area_lazy */ 23) $ 1074950 us | } /* drain_vmap_area_work */ The worst execution time of drain_vmap_area_work() is about 1 second. Link: https://lore.kernel.org/lkml/ZqFlawuVnOMY2k3E@pc638.lan/ Link: https://lkml.kernel.org/r/20240726165246.31326-1-ahuang12@lenovo.com Fixes: 282631cb2447 ("mm: vmalloc: remove global purge_vmap_area_root rb-tree") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Co-developed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Jiwei Sun <sunjw10@lenovo.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  mm: shmem: fix khugepaged activation policy for shmem  (Baolin Wang)
Shmem has a separate interface (different from anonymous pages) to control huge page allocation, that means shmem THP can be enabled while anonymous THP is disabled. However, in this case, khugepaged will not start to collapse shmem THP, which is unreasonable. To fix this issue, we should call start_stop_khugepaged() to activate or deactivate the khugepaged thread when setting shmem mTHP interfaces. Moreover, add a new helper shmem_hpage_pmd_enabled() to help to check whether shmem THP is enabled, which will determine if khugepaged should be activated. Link: https://lkml.kernel.org/r/9b9c6cbc4499bf44c6455367fd9e0f6036525680.1726978977.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling  (Lorenzo Stoakes)
Currently MTE is permitted in two circumstances (desiring to use MTE having been specified by the VM_MTE flag) - where MAP_ANONYMOUS is specified, as checked by arch_calc_vm_flag_bits() and actualised by setting the VM_MTE_ALLOWED flag, or if the file backing the mapping is shmem, in which case we set VM_MTE_ALLOWED in shmem_mmap() when the mmap hook is activated in mmap_region(). The function that checks that, if VM_MTE is set, VM_MTE_ALLOWED is also set is the arm64 implementation of arch_validate_flags(). Unfortunately, we intend to refactor mmap_region() to perform this check earlier, meaning that in the case of a shmem backing we will not have invoked shmem_mmap() yet, causing the mapping to fail spuriously. It is inappropriate to set this architecture-specific flag in general mm code anyway, so a sensible resolution of this issue is to instead move the check somewhere else. We resolve this by setting VM_MTE_ALLOWED much earlier in do_mmap(), via the arch_calc_vm_flag_bits() call. This is an appropriate place to do this as we already check for the MAP_ANONYMOUS case here, and the shmem file case is simply a variant of the same idea - we permit RAM-backed memory. This requires a modification to the arch_calc_vm_flag_bits() signature to pass in a pointer to the struct file associated with the mapping, however this is not too egregious as this is only used by two architectures anyway - arm64 and parisc. So this patch performs this adjustment and removes the unnecessary assignment of VM_MTE_ALLOWED in shmem_mmap(). [akpm@linux-foundation.org: fix whitespace, per Catalin] Link: https://lkml.kernel.org/r/ec251b20ba1964fb64cf1607d2ad80c47f3873df.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andreas Larsson <andreas@gaisler.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05  mm: refactor map_deny_write_exec()  (Lorenzo Stoakes)
Refactor the map_deny_write_exec() to not unnecessarily require a VMA parameter but rather to accept VMA flags parameters, which allows us to use this function early in mmap_region() in a subsequent commit. While we're here, we refactor the function to be more readable and add some additional documentation. Link: https://lkml.kernel.org/r/6be8bb59cd7c68006ebb006eb9d8dc27104b1f70.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
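A hedged sketch of the flags-only shape this describes; the MDWE test and the two denial rules are paraphrased from the existing helper rather than quoted from the patch:

        /*
         * With the MDWE prctl active, deny mappings that are simultaneously
         * writable and executable, and deny adding exec to a mapping that was
         * not executable before. Takes old/new VMA flags instead of a VMA
         * pointer so it can run in mmap_region() before the VMA exists.
         */
        static inline bool map_deny_write_exec(unsigned long old_flags,
                                               unsigned long new_flags)
        {
                if (!test_bit(MMF_HAS_MDWE, &current->mm->flags))
                        return false;

                if ((new_flags & VM_WRITE) && (new_flags & VM_EXEC))
                        return true;

                if (!(old_flags & VM_EXEC) && (new_flags & VM_EXEC))
                        return true;

                return false;
        }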
2024-11-05  Input: introduce notion of passive observers for input handlers  (Dmitry Torokhov)
Sometimes it is useful to observe (and maybe modify) data coming from an input device, but only do that if there are other users of such input device. An example is touchpad switching functionality on Lenovo IdeaPad Z570 where it is desirable to suppress events coming from the touchpad if user toggles touchpad on/off button (on this laptop the firmware does not stop the device). Introduce notion of passive observers for input handlers to solve this issue. An input handler marked as passive observer behaves exactly like any other input handler or filter, but with one exception: it does not open/start underlying input device when attaching to it. Link: https://lore.kernel.org/r/ZxlEROX7bMo5cbZP@google.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
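A minimal registration sketch, assuming the change adds a boolean passive-observer field to struct input_handler; the handler name, callbacks, and field name are illustrative:

        /* Hypothetical handler that only watches events while other users keep
         * the device open; it never opens or starts the device itself. */
        static struct input_handler touchpad_watch_handler = {
                .event                  = touchpad_watch_event,
                .connect                = touchpad_watch_connect,
                .disconnect             = touchpad_watch_disconnect,
                .name                   = "touchpad_watch",
                .id_table               = touchpad_watch_ids,
                .passive_observer       = true, /* assumed field name for this change */
        };

        static int __init touchpad_watch_init(void)
        {
                return input_register_handler(&touchpad_watch_handler);
        }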
2024-11-05  iomap: add a merge boundary flag  (Christoph Hellwig)
File systems might have boundaries over which merges aren't possible. In fact these are very common, although most of the time some kind of header at the beginning of this region (e.g. XFS allocation groups, ext4 block groups) automatically creates a merge barrier. But if that is not present, say for a device purely used for data, we need to manually communicate that to iomap. Add an IOMAP_F_BOUNDARY flag to never merge I/O into a previous mapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
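A hedged sketch of how a filesystem's ->iomap_begin implementation might report such a boundary; the helper checking for a new allocation group is invented for illustration:

        static int foo_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
                                   unsigned int flags, struct iomap *iomap,
                                   struct iomap *srcmap)
        {
                /* ... fill in iomap->addr, iomap->offset, iomap->length, iomap->type ... */

                /* Tell iomap not to merge I/O across the start of a new group. */
                if (foo_extent_starts_new_group(inode, offset)) /* hypothetical helper */
                        iomap->flags |= IOMAP_F_BOUNDARY;

                return 0;
        }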
2024-11-05  sysfs: bin_attribute: add const read/write callback variants  (Thomas Weißschuh)
To make it possible to put struct bin_attribute into read-only memory, the sysfs core has to stop passing mutable pointers to the read() and write() callbacks. As there are numerous implementors of these callbacks throughout the tree, it's not possible to change all of them at once. To enable a step-by-step transition, add new variants of the read() and write() callbacks which differ only in the constness of the struct bin_attribute argument. As most binary attributes are defined through macros, extend these macros to transparently handle both variants of callbacks to minimize the churn during the transition. As soon as all handlers are switched to the const variant, the non-const one can be removed together with the transition machinery. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Krzysztof Wilczyński <kw@linux.com> Link: https://lore.kernel.org/r/20241103-sysfs-const-bin_attr-v2-9-71110628844c@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-05  sysfs: implement all BIN_ATTR_* macros in terms of __BIN_ATTR()  (Thomas Weißschuh)
The preparations for the upcoming constification of struct bin_attribute requires some logic in the structure definition macros. To avoid duplication of that logic in multiple macros, reimplement all other macros in terms of __BIN_ATTR(). Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Krzysztof Wilczyński <kw@linux.com> Link: https://lore.kernel.org/r/20241103-sysfs-const-bin_attr-v2-8-71110628844c@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-05  sysfs: treewide: constify attribute callback of bin_attribute::llseek()  (Thomas Weißschuh)
The llseek() callbacks should not modify the struct bin_attribute passed as argument. Enforce this by marking the argument as const. As there are not many callback implementers perform this change throughout the tree at once. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Krzysztof Wilczyński <kw@linux.com> Link: https://lore.kernel.org/r/20241103-sysfs-const-bin_attr-v2-7-71110628844c@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-05  sysfs: treewide: constify attribute callback of bin_attribute::mmap()  (Thomas Weißschuh)
The mmap() callbacks should not modify the struct bin_attribute passed as argument. Enforce this by marking the argument as const. As there are not many callback implementers perform this change throughout the tree at once. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl Acked-by: Krzysztof Wilczyński <kw@linux.com> Link: https://lore.kernel.org/r/20241103-sysfs-const-bin_attr-v2-6-71110628844c@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-05  sysfs: treewide: constify attribute callback of bin_is_visible()  (Thomas Weißschuh)
The is_bin_visible() callbacks should not modify the struct bin_attribute passed as argument. Enforce this by marking the argument as const. As there are not many callback implementers perform this change throughout the tree at once. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Krzysztof Wilczyński <kw@linux.com> Link: https://lore.kernel.org/r/20241103-sysfs-const-bin_attr-v2-5-71110628844c@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-05  sysfs: introduce callback attribute_group::bin_size  (Thomas Weißschuh)
Several drivers need to dynamically calculate the size of a binary attribute. Currently this is done by assigning attr->size from the is_bin_visible() callback. This has drawbacks: * It is not documented. * A single attribute can be instantiated multiple times, overwriting the shared size field. * It prevents the structure from being moved to read-only memory. Introduce a new dedicated callback to calculate the size of the attribute. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Krzysztof Wilczyński <kw@linux.com> Link: https://lore.kernel.org/r/20241103-sysfs-const-bin_attr-v2-2-71110628844c@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
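A hedged sketch of the new callback; the prototype is an assumption modelled on the existing is_bin_visible() convention, and the device lookup is invented:

        /* Report the attribute size per instance instead of writing the shared
         * attr->size from is_bin_visible(). */
        static size_t foo_bin_size(struct kobject *kobj,
                                   const struct bin_attribute *attr, int n)
        {
                struct foo_device *foo = kobj_to_foo(kobj);     /* hypothetical */

                return foo->eeprom_len;
        }

        static const struct attribute_group foo_group = {
                .bin_attrs = foo_bin_attrs,
                .bin_size  = foo_bin_size,      /* callback introduced by this change */
        };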
2024-11-05  iommu/vt-d: Remove unused dmar_msi_read  (Dr. David Alan Gilbert)
dmar_msi_read() has been unused since 2022 in commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture") Remove it. (dmar_msi_write still exists and is used once). Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://lore.kernel.org/r/20241022002702.302728-1-linux@treblig.org Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-11-05  perf/core: Add aux_pause, aux_resume, aux_start_paused  (Adrian Hunter)
Hardware traces, such as instruction traces, can produce a vast amount of trace data, so being able to reduce tracing to more specific circumstances can be useful. The ability to pause or resume tracing when another event happens, can do that. Add ability for an event to "pause" or "resume" AUX area tracing. Add aux_pause bit to perf_event_attr to indicate that, if the event happens, the associated AUX area tracing should be paused. Ditto aux_resume. Do not allow aux_pause and aux_resume to be set together. Add aux_start_paused bit to perf_event_attr to indicate to an AUX area event that it should start in a "paused" state. Add aux_paused to struct hw_perf_event for AUX area events to keep track of the "paused" state. aux_paused is initialized to aux_start_paused. Add PERF_EF_PAUSE and PERF_EF_RESUME modes for ->stop() and ->start() callbacks. Call as needed, during __perf_event_output(). Add aux_in_pause_resume to struct perf_buffer to prevent races with the NMI handler. Pause/resume in NMI context will miss out if it coincides with another pause/resume. To use aux_pause or aux_resume, an event must be in a group with the AUX area event as the group leader. Example (requires Intel PT and tools patches also): $ perf record --kcore -e intel_pt/aux-action=start-paused/k,syscalls:sys_enter_newuname/aux-action=resume/,syscalls:sys_exit_newuname/aux-action=pause/ uname Linux [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.043 MB perf.data ] $ perf script --call-trace uname 30805 [000] 24001.058782799: name: 0x7ffc9c1865b0 uname 30805 [000] 24001.058784424: psb offs: 0 uname 30805 [000] 24001.058784424: cbr: 39 freq: 3904 MHz (139%) uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) debug_smp_processor_id uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) __x64_sys_newuname uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) down_read uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) __cond_resched uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) preempt_count_add uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) in_lock_functions uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) preempt_count_sub uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) up_read uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) preempt_count_add uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) in_lock_functions uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) preempt_count_sub uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) _copy_to_user uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) syscall_exit_to_user_mode uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) syscall_exit_work uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) perf_syscall_exit uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) debug_smp_processor_id uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_trace_buf_alloc uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_swevent_get_recursion_context uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) debug_smp_processor_id uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) debug_smp_processor_id uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_tp_event uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_trace_buf_update uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) tracing_gen_ctx_irq_test uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_swevent_event uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) 
__perf_event_account_interrupt uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __this_cpu_preempt_check uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_event_output_forward uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_event_aux_pause uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) ring_buffer_get uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __rcu_read_lock uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __rcu_read_unlock uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) pt_event_stop uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) debug_smp_processor_id uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) debug_smp_processor_id uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) native_write_msr uname 30805 [000] 24001.058785463: ([kernel.kallsyms]) native_write_msr uname 30805 [000] 24001.058785639: 0x0 Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: James Clark <james.clark@arm.com> Link: https://lkml.kernel.org/r/20241022155920.17511-3-adrian.hunter@intel.com
2024-11-05  sched: Add Lazy preemption model  (Peter Zijlstra)
Change fair to use resched_curr_lazy(), which, when the lazy preemption model is selected, will set TIF_NEED_RESCHED_LAZY. This LAZY bit will be promoted to the full NEED_RESCHED bit on tick. As such, the average delay between setting LAZY and actually rescheduling will be TICK_NSEC/2. In short, Lazy preemption will delay preemption for fair class but will function as Full preemption for all the other classes, most notably the realtime (RR/FIFO/DEADLINE) classes. The goal is to bridge the performance gap with Voluntary, such that we might eventually remove that option entirely. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20241007075055.331243614@infradead.org
2024-11-05  sched: Add TIF_NEED_RESCHED_LAZY infrastructure  (Peter Zijlstra)
Add the basic infrastructure to split the TIF_NEED_RESCHED bit in two. Either bit will cause a resched on return-to-user, but only TIF_NEED_RESCHED will drive IRQ preemption. No behavioural change intended. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20241007075055.219540785@infradead.org
2024-11-05  sched/ext: Remove sched_fork() hack  (Thomas Gleixner)
Instead of solving the underlying problem of the double invocation of __sched_fork() for idle tasks, sched-ext decided to hack around the issue by partially clearing out the entity struct to preserve the already enqueued node. A provided analysis and solution has been ignored for four months. Now that someone else has taken care of cleaning it up, remove the disgusting hack and clear out the full structure. Remove the comment in the structure declaration as well, as there is no requirement for @node being the last element anymore. Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/87ldy82wkc.ffs@tglx
2024-11-05  kcsan, seqlock: Fix incorrect assumption in read_seqbegin()  (Marco Elver)
During testing of the preceding changes, I noticed that in some cases, current->kcsan_ctx.in_flat_atomic remained true until task exit. This is obviously wrong, because _all_ accesses for the given task will be treated as atomic, resulting in false negatives i.e. missed data races. Debugging led to fs/dcache.c, where we can see this usage of seqlock: struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name) { struct dentry *dentry; unsigned seq; do { seq = read_seqbegin(&rename_lock); dentry = __d_lookup(parent, name); if (dentry) break; } while (read_seqretry(&rename_lock, seq)); [...] As can be seen, read_seqretry() is never called if dentry != NULL; consequently, current->kcsan_ctx.in_flat_atomic will never be reset to false by read_seqretry(). Give up on the wrong assumption of "assume closing read_seqretry()", and rely on the already-present annotations in read_seqcount_begin/retry(). Fixes: 88ecd153be95 ("seqlock, kcsan: Add annotations for KCSAN") Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241104161910.780003-6-elver@google.com
2024-11-05  seqlock, treewide: Switch to non-raw seqcount_latch interface  (Marco Elver)
Switch all instrumentable users of the seqcount_latch interface over to the non-raw interface. Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241104161910.780003-5-elver@google.com
2024-11-05  kcsan, seqlock: Support seqcount_latch_t  (Marco Elver)
While fuzzing an arm64 kernel, Alexander Potapenko reported: | BUG: KCSAN: data-race in ktime_get_mono_fast_ns / timekeeping_update | | write to 0xffffffc082e74248 of 56 bytes by interrupt on cpu 0: | update_fast_timekeeper kernel/time/timekeeping.c:430 [inline] | timekeeping_update+0x1d8/0x2d8 kernel/time/timekeeping.c:768 | timekeeping_advance+0x9e8/0xb78 kernel/time/timekeeping.c:2344 | update_wall_time+0x18/0x38 kernel/time/timekeeping.c:2360 | [...] | | read to 0xffffffc082e74258 of 8 bytes by task 5260 on cpu 1: | __ktime_get_fast_ns kernel/time/timekeeping.c:372 [inline] | ktime_get_mono_fast_ns+0x88/0x174 kernel/time/timekeeping.c:489 | init_srcu_struct_fields+0x40c/0x530 kernel/rcu/srcutree.c:263 | init_srcu_struct+0x14/0x20 kernel/rcu/srcutree.c:311 | [...] | | value changed: 0x000002f875d33266 -> 0x000002f877416866 | | Reported by Kernel Concurrency Sanitizer on: | CPU: 1 UID: 0 PID: 5260 Comm: syz.2.7483 Not tainted 6.12.0-rc3-dirty #78 This is a false positive data race between a seqcount latch writer and a reader accessing stale data. Since its introduction, KCSAN has never understood the seqcount_latch interface (due to being unannotated). Unlike the regular seqlock interface, the seqcount_latch interface for latch writers never has had a well-defined critical section, making it difficult to teach tooling where the critical section starts and ends. Introduce an instrumentable (non-raw) seqcount_latch interface, with which we can clearly denote writer critical sections. This both helps readability and tooling like KCSAN to understand when the writer is done updating all latch copies. Fixes: 88ecd153be95 ("seqlock, kcsan: Add annotations for KCSAN") Reported-by: Alexander Potapenko <glider@google.com> Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241104161910.780003-4-elver@google.com
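A hedged writer-side sketch of the two-copy latch idiom, assuming the new instrumentable calls are named write_seqcount_latch_begin(), write_seqcount_latch() and write_seqcount_latch_end():

        struct latch_data {
                seqcount_latch_t        seq;
                struct payload          copy[2];        /* hypothetical payload type */
        };

        static void latch_update(struct latch_data *d, const struct payload *new)
        {
                write_seqcount_latch_begin(&d->seq);    /* readers now use copy[1] */
                d->copy[0] = *new;
                write_seqcount_latch(&d->seq);          /* readers now use copy[0] */
                d->copy[1] = *new;
                write_seqcount_latch_end(&d->seq);      /* writer done, both copies match */
        }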
2024-11-05  ACPI/IORT: Support CANWBS memory access flag  (Nicolin Chen)
The IORT spec, Issue E.f (April 2024), adds a new CANWBS bit to the Memory Access Flag field in the Memory Access Properties table, mainly for a PCI Root Complex. This CANWBS defines the coherency of memory accesses to be not marked IOWB cacheable/shareable. Its value further implies the coherency impact from a pair of mismatched memory attributes (e.g. in a nested translation case): 0x0: Use of mismatched memory attributes for accesses made by this device may lead to a loss of coherency. 0x1: Coherency of accesses made by this device to locations in Conventional memory are ensured as follows, even if the memory attributes for the accesses presented by the device or provided by the SMMU are different from Inner and Outer Write-back cacheable, Shareable. Note that the loss of coherency on a CANWBS-unsupported HW typically could occur to an SMMU that doesn't implement the S2FWB feature where additional cache flush operations would be required to prevent that from happening. Add a new ACPI_IORT_MF_CANWBS flag and set IOMMU_FWSPEC_PCI_RC_CANWBS upon the presence of this new flag. CANWBS and S2FWB are similar features, in that they both guarantee the VM can not violate coherency, however S2FWB can be bypassed by PCI No Snoop TLPs, while CANWBS cannot. Thus CANWBS meets the requirements to set IOMMU_CAP_ENFORCE_CACHE_COHERENCY. Architecturally ARM has expected that VFIO would disable No Snoop through PCI Config space, if this is done then the two would have the same protections. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Reviewed-by: Donald Dutile <ddutile@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/3-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org>
2024-11-05  vfio: Remove VFIO_TYPE1_NESTING_IOMMU  (Jason Gunthorpe)
This control causes the ARM SMMU drivers to choose a stage 2 implementation for the IO pagetable (vs the stage 1 usual default), however this choice has no significant visible impact to the VFIO user. Further qemu never implemented this and no other userspace user is known. The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide SMMU translation services to the guest operating system" however the rest of the API to set the guest table pointer for the stage 1 and manage invalidation was never completed, or at least never upstreamed, rendering this part useless dead code. Upstream has now settled on iommufd as the uAPI for controlling nested translation. Choosing the stage 2 implementation should be done by through the IOMMU_HWPT_ALLOC_NEST_PARENT flag during domain allocation. Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the enable_nesting iommu_domain_op. Just in-case there is some userspace using this continue to treat requesting it as a NOP, but do not advertise support any more. Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Reviewed-by: Donald Dutile <ddutile@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org>
2024-11-05  mm/writeback: add folio_mark_dirty_lock()  (Joanne Koong)
Add a new convenience helper folio_mark_dirty_lock() that grabs the folio lock before calling folio_mark_dirty(). Refactor set_page_dirty_lock() to directly use folio_mark_dirty_lock(). Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
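A sketch of the obvious shape of the helper, modulo the exact placement and return handling in the real patch:

        /* Lock the folio, mark it dirty, unlock; returns whether the folio was
         * newly dirtied, mirroring folio_mark_dirty(). */
        bool folio_mark_dirty_lock(struct folio *folio)
        {
                bool ret;

                folio_lock(folio);
                ret = folio_mark_dirty(folio);
                folio_unlock(folio);

                return ret;
        }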
2024-11-05  Merge 6.12-rc6 into driver-core-next  (Greg Kroah-Hartman)
We need the driver-core fix/revert in here as well to build on top of. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-05  watchdog: da9063: Do not use a global variable  (Fabio Estevam)
Using the 'use_sw_pm' variable as a global is not recommended, as it prevents multiple instances of the driver from running. Make it a member of the da9063 structure instead. Signed-off-by: Fabio Estevam <festevam@denx.de> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20241018135821.274376-1-festevam@gmail.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2024-11-05  Merge v6.12-rc6 into usb-next  (Greg Kroah-Hartman)
We need the USB fixes in here as well, and this resolves a merge conflict in: drivers/usb/typec/tcpm/tcpm.c Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://lore.kernel.org/r/20241101150730.090dc30f@canb.auug.org.au Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-05  Merge 6.12-rc6 into char-misc-next  (Greg Kroah-Hartman)
We need the char/misc/iio fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-05  misc: rtsx: Cleanup on DRV_NAME cardreader variables  (Desnes Nunes)
The rtsx_pci_ms memstick driver has been dropped, so there is no more need for the DRV_NAME_RTSX_PCI_MS variable. Additionally, this also standardizes the DRV_NAME variables in the alcor_pci and rtsx_usb drivers. Fixes: d0f459259c13 ("memstick: rtsx_pci_ms: Remove Realtek PCI memstick driver") Signed-off-by: Desnes Nunes <desnesn@redhat.com> Link: https://lore.kernel.org/r/20241031142801.1141680-1-desnesn@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>