path: root/include/linux
Age | Commit message | Author
2024-08-07 | buffer: Convert __block_write_begin() to take a folio | Matthew Wilcox (Oracle)
Almost all callers have a folio now, so change __block_write_begin() to take a folio and remove a call to compound_head(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 | fs: Convert aops->write_begin to take a folio | Matthew Wilcox (Oracle)
Convert all callers from working on a page to working on one page of a folio (support for working on an entire folio can come later). Removes a lot of folio->page->folio conversions. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 | fs: Convert aops->write_end to take a folio | Matthew Wilcox (Oracle)
Most callers have a folio, and most implementations operate on a folio, so remove the conversion from folio->page->folio to fit through this interface. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 | buffer: Convert block_write_end() to take a folio | Matthew Wilcox (Oracle)
All callers now have a folio, so pass it in instead of converting from a folio to a page and back to a folio again. Saves a call to compound_head(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-06 | x86/traps: Enable UBSAN traps on x86 | Gatlin Newhouse
Currently ARM64 extracts which specific sanitizer has caused a trap via encoded data in the trap instruction. Clang on x86 currently encodes the same data in the UD1 instruction but x86 handle_bug() and is_valid_bugaddr() currently only look at UD2. Bring x86 to parity with ARM64, similar to commit 25b84002afb9 ("arm64: Support Clang UBSAN trap codes for better reporting"). See the llvm links for information about the code generation. Enable the reporting of UBSAN sanitizer details on x86 compiled with clang when CONFIG_UBSAN_TRAP=y by analysing UD1 and retrieving the type immediate which is encoded by the compiler after the UD1. [ tglx: Simplified it by moving the printk() into handle_bug() ] Signed-off-by: Gatlin Newhouse <gatlin.newhouse@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20240724000206.451425-1-gatlin.newhouse@gmail.com Link: https://github.com/llvm/llvm-project/commit/c5978f42ec8e9#diff-bb68d7cd885f41cfc35843998b0f9f534adb60b415f647109e597ce448e92d9f Link: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86InstrSystem.td#L27
2024-08-05 | binfmt_elf, coredump: Log the reason of the failed core dumps | Roman Kisel
Missing, failed, or corrupted core dumps might impede crash investigations. To improve reliability of that process and consequently the programs themselves, one needs to trace the path from producing a core dumpfile to analyzing it. That path starts from the core dump file written to the disk by the kernel or to the standard input of a user mode helper program to which the kernel streams the coredump contents. There are cases where the kernel will interrupt writing the core out or produce a truncated/not-well-formed core dump without leaving a note. Add logging for the core dump collection failure paths to be able to reason what has gone wrong when the core dump is malformed or missing. Report the size of the data written to aid in diagnosing the user mode helper. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Link: https://lore.kernel.org/r/20240718182743.1959160-3-romank@linux.microsoft.com Signed-off-by: Kees Cook <kees@kernel.org>
2024-08-05 | coredump: Standartize and fix logging | Roman Kisel
The coredump code does not log the process ID and the comm consistently, logs unescaped comm when it does log it, and does not always use the ratelimited logging. That makes it harder to analyze logs and puts the system at risk of spamming the system log in case something crashes over and over again. Fix that by logging TGID and comm (escaped) consistently and always using the ratelimited logging. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Tested-by: Allen Pais <apais@linux.microsoft.com> Link: https://lore.kernel.org/r/20240718182743.1959160-2-romank@linux.microsoft.com Signed-off-by: Kees Cook <kees@kernel.org>
2024-08-05 | net/mlx5: Add support for MTPTM and MTCTR registers | Rahul Rameshbabu
Make Management Precision Time Measurement (MTPTM) register and Management Cross Timestamp (MTCTR) register usable in mlx5 driver. Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20240730134055.1835261-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-05 | inet6: constify 'struct net' parameter of various lookup helpers | Eric Dumazet
Following helpers do not touch their struct net argument: - bpf_sk_lookup_run_v6() - __inet6_lookup_established() - inet6_lookup_reuseport() - inet6_lookup_listener() - inet6_lookup_run_sk_lookup() - __inet6_lookup() - inet6_lookup() Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240802134029.3748005-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-05 | fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE | Al Viro
copy_fd_bitmaps(new, old, count) is expected to copy the first count/BITS_PER_LONG bits from old->full_fds_bits[] and fill the rest with zeroes. What it does is copying enough words (BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest. That works fine, *if* all bits past the cutoff point are clear. Otherwise we are risking garbage from the last word we'd copied. For most of the callers that is true - expand_fdtable() has count equal to old->max_fds, so there's no open descriptors past count, let alone fully occupied words in ->open_fds[], which is what bits in ->full_fds_bits[] correspond to. The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds), which is the smallest multiple of BITS_PER_LONG that covers all opened descriptors below max_fds. In the common case (copying on fork()) max_fds is ~0U, so all opened descriptors will be below it and we are fine, by the same reasons why the call in expand_fdtable() is safe. Unfortunately, there is a case where max_fds is less than that and where we might, indeed, end up with junk in ->full_fds_bits[] - close_range(from, to, CLOSE_RANGE_UNSHARE) with * descriptor table being currently shared * 'to' being above the current capacity of descriptor table * 'from' being just under some chunk of opened descriptors. In that case we end up with observably wrong behaviour - e.g. spawn a child with CLONE_FILES, get all descriptors in range 0..127 open, then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending up with descriptor #128, despite #64 being observably not open. The minimally invasive fix would be to deal with that in dup_fd(). If this proves to add measurable overhead, we can go that way, but let's try to fix copy_fd_bitmaps() first. * new helper: bitmap_copy_and_expand(to, from, bits_to_copy, size). 
* make copy_fd_bitmaps() take the bitmap size in words, rather than bits; its 'count' argument is always a multiple of BITS_PER_LONG, so we are not losing any information, and that way we can use the same helper for all three bitmaps - compiler will see that count is a multiple of BITS_PER_LONG for the large ones, so it'll generate plain memcpy()+memset(). Reproducer added to tools/testing/selftests/core/close_range_test.c Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
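The "copy some bits, zero the rest" behaviour that the message asks of the new helper can be sketched in userspace C. This is only an illustration under stated assumptions: `copy_and_expand()` is a hypothetical stand-in, not the kernel's `bitmap_copy_and_expand()`, and `BITS_PER_LONG` / `BITS_TO_LONGS` are redefined locally for the demo.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Local stand-ins for the kernel macros of the same name (assumption). */
#define BITS_PER_LONG    (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/*
 * Hypothetical userspace analogue: copy the first bits_to_copy bits of
 * 'from' into 'to', then clear every remaining bit up to 'size' bits.
 */
static void copy_and_expand(unsigned long *to, const unsigned long *from,
                            unsigned int bits_to_copy, unsigned int size)
{
    unsigned int words = BITS_TO_LONGS(bits_to_copy);
    unsigned int tail = bits_to_copy % BITS_PER_LONG;

    assert(bits_to_copy <= size);

    /* Copy whole words covering the requested bits. */
    memcpy(to, from, words * sizeof(unsigned long));

    /* Drop bits past the cutoff in the last copied word; without this
     * step, stale bits from the source word would leak through. */
    if (tail)
        to[words - 1] &= (1UL << tail) - 1;

    /* Zero the remaining words of the destination. */
    memset(to + words, 0,
           (BITS_TO_LONGS(size) - words) * sizeof(unsigned long));
}
```

The masking step is the part the message says was effectively missing: copying whole words and memset()ing the rest is only correct when all bits past the cutoff are already clear.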
2024-08-05 | inet: constify 'struct net' parameter of various lookup helpers | Eric Dumazet
Following helpers do not touch their struct net argument: - bpf_sk_lookup_run_v4() - inet_lookup_reuseport() - inet_lhash2_lookup() - inet_lookup_run_sk_lookup() - __inet_lookup_listener() - __inet_lookup_established() Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240802134029.3748005-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-05 | refcount: Report UAF for refcount_sub_and_test(0) when counter==0 | Petr Pavlu
When a reference counter is at zero and refcount_sub_and_test() is invoked to subtract zero, the function accepts this request without any warning and returns true. This behavior does not seem ideal because the counter being already at zero indicates a use-after-free. Furthermore, returning true by refcount_sub_and_test() in this case potentially results in a double-free done by its caller. Modify the underlying function __refcount_sub_and_test() to warn about this case as a use-after-free and have it return false to avoid the potential double-free. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240717130023.5675-1-petr.pavlu@suse.com Signed-off-by: Kees Cook <kees@kernel.org>
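A toy, non-atomic model may make the changed corner case easier to see. It is a sketch only: `struct toy_refcount` and `toy_sub_and_test()` are hypothetical names, and the kernel's real `refcount_t` uses atomic operations and saturation that are left out here.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, non-atomic stand-in for refcount_t. */
struct toy_refcount {
    int refs;
};

/*
 * Subtract i and report whether the counter dropped to zero.
 * A counter that was already zero (or negative) is treated as a
 * suspected use-after-free: warn and return false, so the caller
 * does not free the object a second time.
 */
static bool toy_sub_and_test(int i, struct toy_refcount *r)
{
    int old = r->refs;

    r->refs = old - i;

    if (old > 0 && old == i)
        return true;

    if (old <= 0 || old - i < 0)
        fprintf(stderr, "toy_refcount: underflow or use-after-free\n");

    return false;
}
```

With this shape, subtracting zero from a counter that is already zero warns and returns false instead of silently returning true.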
2024-08-05 | Input: cyttsp4 - remove driver | Dmitry Torokhov
The cyttsp4 touchscreen driver was contributed in 2013 and since then has seen no updates. The driver uses platform data (no device tree support) and there are no users of it in the mainline kernel. There were occasional fixes to it for issues either found by static code analysis tools or via visual inspection, but otherwise the driver is completely untested. Remove the driver. Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Link: https://lore.kernel.org/r/ZrAZ2cUow_z838tp@google.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2024-08-05 | hid-asus: use hid for brightness control on keyboard | Luke D. Jones
On almost all ASUS ROG series laptops the MCU used for the USB keyboard also has a HID packet used for setting the brightness. This is usually the same as the WMI method. But in some laptops the WMI method either is missing or doesn't work, so we should default to the HID control. Signed-off-by: Luke D. Jones <luke@ljones.dev> Acked-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240713074733.77334-2-luke@ljones.dev Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-08-05 | cleanup: Add usage and style documentation | Dan Williams
When proposing that PCI grow some new cleanup helpers for pci_dev_put() and pci_dev_{lock,unlock} [1], Bjorn had some fundamental questions about expectations and best practices. Upon reviewing an updated changelog with those details he recommended adding them to documentation in the header file itself. Add that documentation and link it into the rendering for Documentation/core-api/. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/171175585714.2192972.12661675876300167762.stgit@dwillia2-xfh.jf.intel.com
2024-08-05 | pmdomain: core: Enable s2idle for CPU PM domains on PREEMPT_RT | Ulf Hansson
To allow a genpd provider for a CPU PM domain to enter a domain-idle-state during s2idle on a PREEMPT_RT based configuration, we can't use regular spinlocks, as they are turned into sleepable locks on PREEMPT_RT. To address this problem, let's convert to using the raw spinlock, but only for genpd providers that have the GENPD_FLAG_CPU_DOMAIN bit set. In this way, the lock can still be acquired/released in atomic context, which is needed in the idle-path for PREEMPT_RT. Do note that the genpd power-on/off notifiers may also be fired during s2idle, but these are already prepared for PREEMPT_RT as they are based on the raw notifiers. However, consumers of them may need to adapt accordingly to work properly on PREEMPT_RT. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Raghavendra Kakarla <quic_rkakarla@quicinc.com> # qcm6490 with PREEMPT_RT set Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20240527142557.321610-2-ulf.hansson@linaro.org
2024-08-05 | bpf: kprobe: remove unused declaring of bpf_kprobe_override | Menglong Dong
After the commit 66665ad2f102 ("tracing/kprobe: bpf: Compare instruction pointer with original one"), "bpf_kprobe_override" is not used anywhere anymore, and we can remove it now. Link: https://lore.kernel.org/all/20240710085939.11520-1-dongml2@chinatelecom.cn/ Fixes: 66665ad2f102 ("tracing/kprobe: bpf: Compare instruction pointer with original one") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-08-04 | profiling: remove profile=sleep support | Tetsuo Handa
The kernel sleep profile is no longer working due to a recursive locking bug introduced by commit 42a20f86dc19 ("sched: Add wrapper for get_wchan() to keep task blocked").

Booting with the 'profile=sleep' kernel command line option added or executing

  # echo -n sleep > /sys/kernel/profiling

after boot causes the system to lock up.

Lockdep reports

  kthreadd/3 is trying to acquire lock:
  ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: get_wchan+0x32/0x70

  but task is already holding lock:
  ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: try_to_wake_up+0x53/0x370

with the call trace being

  lock_acquire+0xc8/0x2f0
  get_wchan+0x32/0x70
  __update_stats_enqueue_sleeper+0x151/0x430
  enqueue_entity+0x4b0/0x520
  enqueue_task_fair+0x92/0x6b0
  ttwu_do_activate+0x73/0x140
  try_to_wake_up+0x213/0x370
  swake_up_locked+0x20/0x50
  complete+0x2f/0x40
  kthread+0xfb/0x180

However, since nobody noticed this regression for more than two years, let's remove 'profile=sleep' support based on the assumption that nobody needs this functionality.

Fixes: 42a20f86dc19 ("sched: Add wrapper for get_wchan() to keep task blocked") Cc: stable@vger.kernel.org # v5.16+ Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-08-04 | Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.12 | Tejun Heo
Pull tip/sched/core to resolve the following four conflicts. While 2-4 are simple context conflicts, 1 is a bit subtle and easy to resolve incorrectly.

1. 2c8d046d5d51 ("sched: Add normal_policy()") vs. faa42d29419d ("sched/fair: Make SCHED_IDLE entity be preempted in strict hierarchy")
   The former converts the direct test on p->policy to use the helper normal_policy(). The latter moves the p->policy test to a different location. Resolve by converting the test on p->policy in the new location to use normal_policy().

2. a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class") vs. a110a81c52a9 ("sched/deadline: Deferrable dl server")
   Both add calls to put_prev_task_idle() and set_next_task_idle(). Simple context conflict. Resolve by taking changes from both.

3. a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class") vs. c245910049d0 ("sched/core: Add clearing of ->dl_server in put_prev_task_balance()")
   The former changes the for_each_class() iteration to use for_each_active_class(). The latter moves away the adjacent dl_server handling code. Simple context conflict. Resolve by taking changes from both.

4. 60c27fb59f6c ("sched_ext: Implement sched_ext_ops.cpu_online/offline()") vs. 31b164e2e4af ("sched/smt: Introduce sched_smt_present_inc/dec() helper") and 2f027354122f ("sched/core: Introduce sched_set_rq_on/offline() helper")
   The former adds a scx_rq_deactivate() call. The latter two change code around it. Simple context conflict. Resolve by taking changes from both.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-04 | i2c: Fix conditional for substituting empty ACPI functions | Richard Fitzgerald
Add IS_ENABLED(CONFIG_I2C) to the conditional around a bunch of ACPI functions. The conditional around these functions depended only on CONFIG_ACPI. But the functions are implemented in I2C core, so are only present if CONFIG_I2C is enabled. Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
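The general header pattern being fixed, a real declaration when every required option is enabled and an empty inline stub otherwise, might look like the sketch below. It is an assumption-laden illustration: `demo_acpi_i2c_count()` is a hypothetical function, plain `defined()` checks stand in for the kernel's `IS_ENABLED()`, and the stub's return value of 0 is chosen only for the demo.

```c
#include <stddef.h>

/*
 * The real function lives in the I2C core, so it exists only when both
 * options are enabled. Checking CONFIG_ACPI alone would declare a
 * function that is never built when CONFIG_I2C is off.
 */
#if defined(CONFIG_ACPI) && defined(CONFIG_I2C)
int demo_acpi_i2c_count(const void *adev); /* provided elsewhere */
#else
/* Empty substitute so that callers still compile and link. */
static inline int demo_acpi_i2c_count(const void *adev)
{
    (void)adev;
    return 0; /* hypothetical "nothing found" result */
}
#endif
```

Built without either macro defined, callers pick up the inline stub and need no `#ifdef` of their own.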
2024-08-04 | pinctrl: pinconf-generic: Add support for "input-schmitt-microvolt" property | Inochi Amaoto
Add "input-schmitt-microvolt" property to generic options used for DT parsing files. This enables drivers, which use generic pin configurations, to get the value passed to this property. Signed-off-by: Inochi Amaoto <inochiama@outlook.com> Link: https://lore.kernel.org/IA1PR20MB4953806785BA04E075DC4F03BBAC2@IA1PR20MB4953.namprd20.prod.outlook.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2024-08-03 | iio: backend: add a modified prbs23 support | Nuno Sa
Support the ADI-specific prbs23 sequence, which can be used for both calibrating and debugging digital interfaces. Signed-off-by: Nuno Sa <nuno.sa@analog.com> Link: https://patch.msgid.link/20240802-dev-iio-backend-add-debugfs-v2-3-4cb62852f0d0@analog.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-08-03 | iio: backend: add debugFs interface | Nuno Sa
This adds a basic debugfs interface for backends. Two new ops are being added: * debugfs_reg_access: Analogous to the core IIO one but for backend devices. * debugfs_print_chan_status: One useful use case for this one is testing test tones in a digital interface and "asking" the backend to dump more details on why a test tone might have errors. Signed-off-by: Nuno Sa <nuno.sa@analog.com> Link: https://patch.msgid.link/20240802-dev-iio-backend-add-debugfs-v2-2-4cb62852f0d0@analog.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-08-03 | iio: backend: introduce struct iio_backend_info | Nuno Sa
Instead of only passing the backend ops when calling devm_iio_backend_register(), pass an info-like structure that will contain the ops and additional information. For now, the backend name is being added as that will be used by the debugFS interface introduced in a later patch. It also opens the door for further customizations passed by backends. All users of devm_iio_backend_register() were updated accordingly. Signed-off-by: Nuno Sa <nuno.sa@analog.com> Link: https://patch.msgid.link/20240802-dev-iio-backend-add-debugfs-v2-1-4cb62852f0d0@analog.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-08-03 | iio: core: add function to retrieve active_scan_mask index | Julien Stephan
Add a function to retrieve the index of the active scan mask inside the available scan masks array. As in iio_scan_mask_match and iio_sanity_check_avail_scan_masks, this function does not handle multi-long masks correctly. It only checks the first long to be zero, and will use such a mask as a terminator even if there were bits set after the first long. This should be fine since the available_scan_mask has already been sanity tested using iio_sanity_check_avail_scan_masks. See iio_scan_mask_match and iio_sanity_check_avail_scan_masks for more details. Signed-off-by: Julien Stephan <jstephan@baylibre.com> Reviewed-by: David Lechner <dlechner@baylibre.com> Link: https://patch.msgid.link/20240731-ad7380-add-single-ended-chips-v2-2-cd63bf05744c@baylibre.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
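The lookup described above, including its "first long is zero means end of list" limitation, can be sketched for single-long masks as follows. The helper name `find_mask_index()` and the `-1` "not found" value are assumptions made for this illustration and are not taken from the IIO core.

```c
#include <stddef.h>

/*
 * Return the position of 'active' inside 'avail', a list of single-long
 * scan masks terminated by an all-zero entry, or -1 if it is not there.
 * Only one long per mask is examined, so a zero first long always ends
 * the search; this mirrors the limitation noted in the message.
 */
static int find_mask_index(const unsigned long *avail, unsigned long active)
{
    for (int i = 0; avail[i] != 0; i++) {
        if (avail[i] == active)
            return i;
    }
    return -1;
}
```

For example, with the available masks `{0x3, 0xc, 0}` the active mask `0xc` is found at index 1, while `0x1` is not found.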
2024-08-03 | Merge tag 'spi-mosi-config' into togreg | Jonathan Cameron
spi: Support MOSI idle configuration Add support for configuring the idle state of the MOSI signal in controllers.
2024-08-03 | iio: core: annotate masklength as __private | Nuno Sa
Now that all users are using the proper accessors, we can mark masklength as __private so that no one tries to write. We also get help from checkers in warning us in case someone does it. To access the private field from IIO core code, we need to use the ACCESS_PRIVATE() macro. Signed-off-by: Nuno Sa <nuno.sa@analog.com> Link: https://patch.msgid.link/20240726-dev-iio-masklength-private3-v1-23-82913fc0fb87@analog.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-08-03 | iio: backend: spelling: continuous -> continuous | David Lechner
This fixes the spelling in IIO_BACKEND_INTERNAL_CONTINUOUS_WAVE. Signed-off-by: David Lechner <dlechner@baylibre.com> Link: https://patch.msgid.link/20240726-iio-backend-spelling-continuous-v1-1-467c6e3f78ff@baylibre.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-08-03 | iio: backend: remove unused parameter | Nuno Sa
Indio_dev was not being used in iio_backend_extend_chan_spec() so remove it. Signed-off-by: Nuno Sa <nuno.sa@analog.com> Link: https://patch.msgid.link/20240709-dev-iio-backend-add-debugfs-v1-1-fb4b8f2373c7@analog.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-08-02 | linkmode: Change return type of linkmode_andnot to bool | Simon Horman
linkmode_andnot() simply returns the result of bitmap_andnot(). And the return type of bitmap_andnot() is bool. So it makes sense for the return type of linkmode_andnot() to also be bool. I checked all call-sites and they either ignore the return value or treat it as a bool. Compile tested only. Link: https://lore.kernel.org/netdev/68088998-4486-4930-90a4-96a32f08c490@lunn.ch/ Signed-off-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240801-linkfield-bowl-v1-1-d58f68967802@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-02 | net: remove IFF_* re-definition | Jakub Kicinski
We re-define values of enum netdev_priv_flags as preprocessor macros with the same name. I guess this was done to avoid breaking out of tree modules which may use #ifdef X for kernel compatibility? Commit 7aa98047df95 ("net: move net_device priv_flags out from UAPI") which added the enum doesn't say. In any case, the flags with defines are quite old now, and defines for new flags don't get added. OOT drivers have to resort to code greps for compat detection, anyway. Let's delete these defines, save LoC, help LXR link to the right place. Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20240801163401.378723-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-02 | spi: Add dummy definitions for ACPI lookup functions | Mark Brown
Merge series from Richard Fitzgerald <rf@opensource.cirrus.com>: Provide empty versions of acpi_spi_count_resources(), acpi_spi_device_alloc() and acpi_spi_find_controller_by_adev() if the real functions are not being built. This commit fixes two problems with the original definitions: 1) There wasn't an empty version of these functions 2) The #if only depended on CONFIG_ACPI. But the functions are implemented in the core spi.c so CONFIG_SPI_MASTER must also be enabled for the real functions to exist.
2024-08-02 | sched_ext: Allow p->scx.disallow only while loading | Tejun Heo
From 1232da7eced620537a78f19c8cf3d4a3508e2419 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Wed, 31 Jul 2024 09:14:52 -1000 p->scx.disallow provides a way for the BPF scheduler to reject certain tasks from attaching. It's currently allowed for both the load and fork paths; however, the latter doesn't actually work as p->sched_class is already set by the time scx_ops_init_task() is called during fork. This is a convenience feature which is mostly useful from the load path anyway. Allow it only from the load path. v2: Trigger scx_ops_error() iff @p->policy == SCHED_EXT to make it a bit easier for the BPF scheduler (David). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: "Zhangqiao (2012 lab)" <zhangqiao22@huawei.com> Link: http://lkml.kernel.org/r/20240711110720.1285-1-zhangqiao22@huawei.com Fixes: 7bb6f0810ecf ("sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT") Acked-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-02 | spi: Add empty versions of ACPI functions | Richard Fitzgerald
Provide empty versions of acpi_spi_count_resources(), acpi_spi_device_alloc() and acpi_spi_find_controller_by_adev() if the real functions are not being built. This commit fixes two problems with the original definitions: 1) There wasn't an empty version of these functions 2) The #if only depended on CONFIG_ACPI. But the functions are implemented in the core spi.c so CONFIG_SPI_MASTER must also be enabled for the real functions to exist. Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Link: https://patch.msgid.link/20240802152215.20831-2-rf@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-08-02 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds
Pull kvm updates from Paolo Bonzini:

"The bulk of the changes here is a largish change to guest_memfd, delaying the clearing and encryption of guest-private pages until they are actually added to guest page tables. This started as "let's make it impossible to misuse the API" for SEV-SNP; but then it ballooned a bit. The new logic is generally simpler and more ready for hugepage support in guest_memfd.

Summary:
- fix latent bug in how usage of large pages is determined for confidential VMs
- fix "underline too short" in docs
- eliminate log spam from limited APIC timer periods
- disallow pre-faulting of memory before SEV-SNP VMs are initialized
- delay clearing and encrypting private memory until it is added to guest page tables
- this change also enables another small cleanup: the checks in SNP_LAUNCH_UPDATE that limit it to non-populated, private pages can now be moved in the common kvm_gmem_populate() function
- fix compilation error that the RISC-V merge introduced in selftests"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86/mmu: fix determination of max NPT mapping level for private pages
  KVM: riscv: selftests: Fix compile error
  KVM: guest_memfd: abstract how prepared folios are recorded
  KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfns
  KVM: extend kvm_range_has_memory_attributes() to check subset of attributes
  KVM: cleanup and add shortcuts to kvm_range_has_memory_attributes()
  KVM: guest_memfd: move check for already-populated page to common code
  KVM: remove kvm_arch_gmem_prepare_needed()
  KVM: guest_memfd: make kvm_gmem_prepare_folio() operate on a single struct kvm
  KVM: guest_memfd: delay kvm_gmem_prepare_folio() until the memory is passed to the guest
  KVM: guest_memfd: return locked folio from __kvm_gmem_get_pfn
  KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_*
  KVM: guest_memfd: do not go through struct page
  KVM: guest_memfd: delay folio_mark_uptodate() until after successful preparation
  KVM: guest_memfd: return folio from __kvm_gmem_get_pfn()
  KVM: x86: disallow pre-fault for SNP VMs before initialization
  KVM: Documentation: Fix title underline too short warning
  KVM: x86: Eliminate log spam from limited APIC timer periods
2024-08-02 | Merge branch 'kvm-fixes' into HEAD | Paolo Bonzini
* fix latent bug in how usage of large pages is determined for confidential VMs * fix "underline too short" in docs * eliminate log spam from limited APIC timer periods * disallow pre-faulting of memory before SEV-SNP VMs are initialized * delay clearing and encrypting private memory until it is added to guest page tables * this change also enables another small cleanup: the checks in SNP_LAUNCH_UPDATE that limit it to non-populated, private pages can now be moved in the common kvm_gmem_populate() function
2024-08-02 | Merge tag 'riscv-for-linus-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux | Linus Torvalds
Pull RISC-V fixes from Palmer Dabbelt:
- A fix to avoid dropping some of the internal pseudo-extensions, which breaks *envcfg dependency parsing
- The kernel entry address is now aligned in purgatory, which avoids a misaligned load that can lead to a crash on systems that don't support misaligned accesses early in boot
- The FW_SFENCE_VMA_RECEIVED perf event was duplicated in a handful of perf JSON configurations; one of them has been updated to FW_SFENCE_VMA_ASID_SENT
- The starfive cache driver is now restricted to 64-bit systems, as it isn't 32-bit clean
- A fix to avoid aliasing legacy-mode perf counters with software perf counters
- VM_FAULT_SIGSEGV is now handled in the page fault code
- A fix for stalls during CPU hotplug due to IPIs being disabled
- A fix for memblock bounds checking. This manifests as a crash on systems with discontinuous memory maps that have regions that don't fit in the linear map

* tag 'riscv-for-linus-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Fix linear mapping checks for non-contiguous memory regions
  RISC-V: Enable the IPI before workqueue_online_cpu()
  riscv/mm: Add handling for VM_FAULT_SIGSEGV in mm_fault_error()
  perf: riscv: Fix selecting counters in legacy mode
  cache: StarFive: Require a 64-bit system
  perf arch events: Fix duplicate RISC-V SBI firmware event name
  riscv/purgatory: align riscv_kernel_entry
  riscv: cpufeature: Do not drop Linux-internal extensions
2024-08-02 | clockevents/drivers/i8253: Fix stop sequence for timer 0 | David Woodhouse
According to the data sheet, writing the MODE register should stop the counter (and thus the interrupts). This appears to work on real hardware, at least modern Intel and AMD systems. It should also work on Hyper-V.

However, on some buggy virtual machines the mode change doesn't have any effect until the counter is subsequently loaded (or perhaps when the IRQ next fires).

So, set MODE 0 and then load the counter, to ensure that those buggy VMs do the right thing and the interrupts stop. And then write MODE 0 *again* to stop the counter on compliant implementations too.

Apparently, Hyper-V keeps firing the IRQ *repeatedly* even in mode zero when it should only happen once, but the second MODE write stops that too.

Userspace test program (mostly written by tglx):

=====
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/io.h>

#define BUILDIO(bwl, bw, type) \
static __always_inline void __out##bwl(type value, uint16_t port) \
{ \
	asm volatile("out" #bwl " %" #bw "0, %w1" \
		     : : "a"(value), "Nd"(port)); \
} \
\
static __always_inline type __in##bwl(uint16_t port) \
{ \
	type value; \
	asm volatile("in" #bwl " %w1, %" #bw "0" \
		     : "=a"(value) : "Nd"(port)); \
	return value; \
}

BUILDIO(b, b, uint8_t)

#define inb __inb
#define outb __outb

#define PIT_MODE 0x43
#define PIT_CH0 0x40
#define PIT_CH2 0x42

static int is8254;

static void dump_pit(void)
{
	if (is8254) {
		// Latch and output counter and status
		outb(0xC2, PIT_MODE);
		printf("%02x %02x %02x\n", inb(PIT_CH0), inb(PIT_CH0), inb(PIT_CH0));
	} else {
		// Latch and output counter
		outb(0x0, PIT_MODE);
		printf("%02x %02x\n", inb(PIT_CH0), inb(PIT_CH0));
	}
}

int main(int argc, char* argv[])
{
	int nr_counts = 2;

	if (argc > 1)
		nr_counts = atoi(argv[1]);

	if (argc > 2)
		is8254 = 1;

	if (ioperm(0x40, 4, 1) != 0)
		return 1;

	dump_pit();

	printf("Set oneshot\n");
	outb(0x38, PIT_MODE);
	outb(0x00, PIT_CH0);
	outb(0x0F, PIT_CH0);

	dump_pit();
	usleep(1000);
	dump_pit();

	printf("Set periodic\n");
	outb(0x34, PIT_MODE);
	outb(0x00, PIT_CH0);
	outb(0x0F, PIT_CH0);

	dump_pit();
	usleep(1000);
	dump_pit();
	dump_pit();
	usleep(100000);
	dump_pit();
	usleep(100000);
	dump_pit();

	printf("Set stop (%d counter writes)\n", nr_counts);
	outb(0x30, PIT_MODE);
	while (nr_counts--)
		outb(0xFF, PIT_CH0);

	dump_pit();
	usleep(100000);
	dump_pit();
	usleep(100000);
	dump_pit();

	printf("Set MODE 0\n");
	outb(0x30, PIT_MODE);

	dump_pit();
	usleep(100000);
	dump_pit();
	usleep(100000);
	dump_pit();

	return 0;
}
=====

Suggested-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michael Kelley <mhkelley@outlook.com> Link: https://lore.kernel.org/all/20240802135555.564941-2-dwmw2@infradead.org
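The stop sequence this message describes (write MODE 0, load the counter, write MODE 0 again) can be sketched without touching hardware by recording the port writes. Everything here is illustrative: `log_outb()` is a hypothetical stand-in for `outb()`, the zero counter value is an assumption, and only the ordering of the writes is taken from the text.

```c
#include <stddef.h>
#include <stdint.h>

#define PIT_MODE 0x43
#define PIT_CH0  0x40

struct io_write {
    uint8_t value;
    uint16_t port;
};

static struct io_write io_log[8];
static size_t io_log_len;

/* Stand-in for outb(): records the write instead of issuing port I/O. */
static void log_outb(uint8_t value, uint16_t port)
{
    if (io_log_len < sizeof(io_log) / sizeof(io_log[0]))
        io_log[io_log_len++] = (struct io_write){ value, port };
}

/* Sketch of the ordering only; the counter value is an assumption. */
static void pit_stop_sketch(void)
{
    log_outb(0x30, PIT_MODE); /* channel 0, lobyte/hibyte access, mode 0 */
    log_outb(0x00, PIT_CH0);  /* load counter, low byte */
    log_outb(0x00, PIT_CH0);  /* load counter, high byte */
    log_outb(0x30, PIT_MODE); /* mode 0 again, for compliant devices */
}
```

The counter load in the middle covers implementations where the first mode write has no effect until the counter is loaded, and the trailing mode write covers compliant ones.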
2024-08-02x86/i8253: Disable PIT timer 0 when not in useDavid Woodhouse
Leaving the PIT interrupt running can cause noticeable steal time for virtual guests. The VMM generally has a timer which toggles the IRQ input to the PIC and I/O APIC, which takes CPU time away from the guest. Even on real hardware, running the counter may use power needlessly (albeit not much). Make sure it's turned off if it isn't going to be used. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michael Kelley <mhkelley@outlook.com> Link: https://lore.kernel.org/all/20240802135555.564941-1-dwmw2@infradead.org
2024-08-02cpufreq: Remove LATENCY_MULTIPLIERQais Yousef
The current LATENCY_MULTIPLIER, which has been around for nearly 20 years, causes rate_limit_us to always be in the ms range. On an M1 Mac mini I get 50us and 56us transition latency, but due to the 1000 multiplier we end up setting rate_limit_us to 50ms and 56ms, which gets capped to 2ms (the cap was 10ms before e13aa799c2a6 ("cpufreq: Change default transition delay to 2ms")). On an Intel i5 system the transition latency is 20us, but due to the multiplier we end up with 20ms, which again is capped to 2ms. Given how good modern hardware is, how modern workloads require systems to be more responsive to cater for sudden changes in workload (tasks sleeping/waking up/migrating, uclamp causing a sudden boost or cap), and that 2ms is about a quarter of the frame time of a 120Hz refresh rate system, drop the old logic in favour of providing 50% headroom: rate_limit_us = 1.5 * latency. I considered not adding any headroom, which could mean that we end up with infinite back-to-back requests. I also considered providing a constant headroom (e.g. 100us), assuming that any h/w or f/w dealing with the request shouldn't require a large headroom when transition_latency is actually high. But in both cases I wasn't sure whether h/w or f/w could end up being overwhelmed dealing with the freq requests in a potentially busy system. So I opted for providing 50% breathing room. This is expected to impact schedutil only, as the other user, dbs_governor, takes the max(2*tick, transition_delay_us), and the former was at least 2ms with a 1ms TICK, which is equivalent to the max_delay_us before applying this patch. For systems with a TICK of 4ms, this value would have almost always ended up as an 8ms sampling rate. For systems that report 0 transition latency, we still default to returning 1ms as the transition delay. This helps in eliminating a source of latency for applying requests as mentioned in [1].
For example if we have a 1ms tick, most systems will miss sending an update at tick when updating the util_avg for a task/CPU (rate_limit_us will be 2ms for most systems). Link: https://lore.kernel.org/lkml/20240724212255.mfr2ybiv2j2uqek7@airbuntu/ # [1] Link: https://lore.kernel.org/lkml/20240205022500.2232124-1-qyousef@layalina.io/ Signed-off-by: Qais Yousef <qyousef@layalina.io> Link: https://patch.msgid.link/20240728192659.58115-1-qyousef@layalina.io [ rjw: Subject edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
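The arithmetic in the message above can be illustrated with a small userspace sketch. The names `old_delay_us()` and `new_delay_us()` and the three constants are hypothetical; they only model the behaviour the message describes (a 1000x multiplier capped at 2ms versus 50% headroom, with a 1ms default when the reported latency is 0) and are not the kernel's actual functions.

```c
/* Constants taken from the figures quoted in the commit message. */
#define OLD_MULTIPLIER   1000u  /* the 1000 multiplier (LATENCY_MULTIPLIER) */
#define OLD_MAX_DELAY_US 2000u  /* the 2ms cap */
#define DEFAULT_DELAY_US 1000u  /* 1ms default when latency is reported as 0 */

/* Hypothetical model of the old rule: latency * 1000, capped at 2ms. */
static unsigned int old_delay_us(unsigned int latency_us)
{
	unsigned long long delay;

	if (!latency_us)
		return DEFAULT_DELAY_US;

	/* Widen before multiplying so large latencies cannot overflow. */
	delay = (unsigned long long)latency_us * OLD_MULTIPLIER;
	return delay < OLD_MAX_DELAY_US ? (unsigned int)delay : OLD_MAX_DELAY_US;
}

/* Hypothetical model of the new rule: rate_limit_us = 1.5 * latency. */
static unsigned int new_delay_us(unsigned int latency_us)
{
	if (!latency_us)
		return DEFAULT_DELAY_US;

	return latency_us + latency_us / 2;
}
```

With the latencies quoted in the message, `new_delay_us(50)` gives 75us and `new_delay_us(20)` gives 30us, where the modelled old rule gives 2000us in both cases.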
2024-08-02thermal: trip: Drop thermal_zone_get_trip()Rafael J. Wysocki
There are no more callers of thermal_zone_get_trip() in the tree, so drop it. No functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://patch.msgid.link/2220301.Mh6RI2rZIc@rjwysocki.net
2024-08-02thermal: trip: Get rid of thermal_zone_get_num_trips()Rafael J. Wysocki
The only existing caller of thermal_zone_get_num_trips(), which is rcar_gen3_thermal_probe(), uses this function to put the number of trip points into a kernel log message, but this information is also available from the thermal sysfs interface. For this reason, remove the thermal_zone_get_num_trips() call from rcar_gen3_thermal_probe() and drop the former altogether. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://patch.msgid.link/2636988.Lt9SDvczpP@rjwysocki.net
2024-08-02HID: core: add helper for finding a field with a certain usageKerem Karabay
This helper will allow HID drivers to easily determine if they should bind to a hid_device by checking for the presence of a certain field when its ID is not enough, which can be the case on USB devices with multiple interfaces and/or configurations. Convert google-hammer driver to use it, and remove now superfluous hammer_has_usage(). [jkosina@suse.com: expand changelog with the information about google-hammer being added as user of this API ] Signed-off-by: Kerem Karabay <kekrby@gmail.com> Signed-off-by: Aditya Garg <gargaditya08@live.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2024-08-02uprobes: make uprobe_register() return struct uprobe *Oleg Nesterov
This way uprobe_unregister() and uprobe_apply() can use "struct uprobe *" rather than inode + offset. This simplifies the code and makes it possible to avoid the unnecessary find_uprobe() + put_uprobe() in these functions. TODO: uprobe_unregister() still needs get_uprobe/put_uprobe to ensure that this uprobe can't be freed before up_write(&uprobe->register_rwsem). Co-developed-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20240801132734.GA8803@redhat.com
2024-08-02uprobes: kill uprobe_register_refctr()Oleg Nesterov
It doesn't make any sense to have 2 versions of _register(). Note that trace_uprobe_enable(), the only user of uprobe_register(), doesn't need to check tu->ref_ctr_offset to decide which one should be used, it could safely pass ref_ctr_offset == 0 to uprobe_register_refctr(). Add this argument to uprobe_register(), update the callers, and kill uprobe_register_refctr(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240801132728.GA8800@redhat.com
2024-08-02perf,x86: avoid missing caller address in stack traces captured in uprobeAndrii Nakryiko
When tracing user functions with uprobe functionality, it's common to install the probe (e.g., a BPF program) at the first instruction of the function. This is often going to be the `push %rbp` instruction in the function preamble, which means that within that function the frame pointer hasn't been established yet. This leads to consistently missing the actual caller of the traced function, because perf_callchain_user() only records the current IP (capturing the traced function) and then follows the frame pointer chain (which starts at the caller's frame, containing the address of the caller's caller). So when we have a target_1 -> target_2 -> target_3 call chain and we are tracing an entry to target_3, the captured stack trace will report a target_1 -> target_3 call chain, which is wrong and confusing. This patch proposes an x86-64-specific heuristic to detect the `push %rbp` (`push %ebp` on 32-bit architecture) instruction being traced. Given that the entire kernel implementation of user space stack trace capturing works under the assumption that user space code was compiled with frame pointer register (%rbp/%ebp) preservation, it seems pretty reasonable to use this instruction as a strong indicator that this is the entry to the function. In that case, the return address is still pointed to by %rsp/%esp, so we fetch it and add it to the stack trace before proceeding to unwind the rest using frame pointer-based logic. We also check for `endbr64` (for 64-bit modes) as another common pattern for function entry, as suggested by Josh Poimboeuf. Even if we get this wrong sometimes for uprobes attached not at the function entry, it's OK because the stack trace will still be meaningful overall, just with one extra bogus entry. If we don't detect this, the caller function's entry is guaranteed to be missing from the stack trace, which is worse overall. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240729175223.23914-1-andrii@kernel.org
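The heuristic described above amounts to a byte-pattern check on the probed instruction. A minimal userspace sketch follows; the helper name `looks_like_func_entry()` and its signature are hypothetical, and the kernel's own check works on the uprobe's saved instruction copy rather than a caller-supplied buffer. The byte values are the standard x86 encodings of `push %rbp`/`push %ebp` (0x55) and `endbr64` (f3 0f 1e fa).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: does the probed instruction look like a function entry? */
static bool looks_like_func_entry(const uint8_t *insn, size_t len, bool is_64bit)
{
	static const uint8_t endbr64[] = { 0xf3, 0x0f, 0x1e, 0xfa };

	/* push %rbp (push %ebp in 32-bit code) is the single byte 0x55. */
	if (len >= 1 && insn[0] == 0x55)
		return true;

	/* endbr64 is only checked for 64-bit code, as in the message. */
	if (is_64bit && len >= sizeof(endbr64) &&
	    memcmp(insn, endbr64, sizeof(endbr64)) == 0)
		return true;

	return false;
}
```

When the check succeeds, the return address is still at the top of the user stack, so an unwinder would read it from there and record it before walking the frame pointer chain.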
2024-08-02perf: Support PERF_SAMPLE_READ with inheritBen Gainey
This change allows events to use PERF_SAMPLE_READ with inherit so long as PERF_SAMPLE_TID is also set. This enables sample based profiling of a group of counters over a hierarchy of processes or threads. This is useful, for example, for collecting per-thread counters/metrics, event based sampling of multiple counters as a unit, access to the enabled and running time when using multiplexing and so on. Prior to this, users were restricted to either collecting aggregate statistics for a multi-threaded/-process application (e.g. with "perf stat"), or to sample individual threads, or to profile the entire system (which requires root or CAP_PERFMON, and may produce much more data than is required). Theoretically a tool could poll for or otherwise monitor thread/process creation and construct whatever events the user is interested in using perf_event_open, for each new thread or process, but this is racy, can lead to file-descriptor exhaustion, and ultimately just replicates the behaviour of inherit, but in userspace. This configuration differs from inherit without PERF_SAMPLE_READ in that the accumulated event count, and consequently any sample (such as if triggered by overflow of sample_period) will be on a per-thread rather than on an aggregate basis. The meaning of read_format::value field of both PERF_RECORD_READ and PERF_RECORD_SAMPLE is changed such that if the sampled event uses this new configuration then the values reported will be per-thread rather than the global aggregate value. This is a change from the existing semantics of read_format (where PERF_SAMPLE_READ is used without inherit), but it is necessary to expose the per-thread counter values, and it avoids reinventing a separate "read_format_thread" field that otherwise replicates the same behaviour. This change should not break existing tools, since this configuration was not previously valid and was rejected by the kernel. 
Tools that opt into this new mode will need to account for this when calculating the counter delta for a given sample. Tools that wish to have both the per-thread and aggregate value can perform the global aggregation themselves from the per-thread values. The change to read_format::value does not affect existing valid perf_event_attr configurations, nor does it change the behaviour of calls to "read" on an event descriptor. Both continue to report the aggregate value for the entire thread/process hierarchy. The difference between the results reported by "read" and PERF_RECORD_SAMPLE in this new configuration is justified on the basis that it is not (easily) possible for "read" to target a specific thread (the caller only has the fd for the original parent event). Signed-off-by: Ben Gainey <ben.gainey@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240730084417.7693-3-ben.gainey@arm.com
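As a rough illustration of the configuration this message describes, the sketch below fills in a `perf_event_attr` that combines `inherit` with `PERF_SAMPLE_READ` and `PERF_SAMPLE_TID`, and models the stated rule as a predicate. Both helper names are hypothetical, the event type and sample period are arbitrary example choices, and the predicate is only a restatement of the rule in the message, not the kernel's validation code.

```c
#include <linux/perf_event.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: build an attr using the newly allowed combination. */
static void init_inherited_read_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->sample_period = 1000000;	/* arbitrary example period */
	attr->disabled = 1;
	attr->inherit = 1;		/* follow child threads and processes */
	/* PERF_SAMPLE_TID must accompany PERF_SAMPLE_READ when inherit is set. */
	attr->sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_READ;
}

/* Hypothetical model of the rule stated in the message. */
static bool sample_read_inherit_ok(const struct perf_event_attr *attr)
{
	if (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ))
		return (attr->sample_type & PERF_SAMPLE_TID) != 0;

	return true;
}
```

A tool would pass the filled-in attr to the `perf_event_open` system call; per the message, the counter values in the resulting samples are then per-thread, so deltas need to be computed per thread ID.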
2024-08-02perf: Rename perf_event_context.nr_pending to nr_no_switch_fast.Ben Gainey
nr_pending counts the number of events in the context that have either pending_sigtrap or pending_work set, but it is used to prevent taking the fast path in perf_event_context_sched_out. Rename it to reflect what it is used for, rather than what it counts. This change allows using the field to track other event properties that also require skipping the fast path without possible confusion over the name. Signed-off-by: Ben Gainey <ben.gainey@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240730084417.7693-2-ben.gainey@arm.com
2024-08-02vsock/virtio: add SIOCOUTQ support for all virtio based transportsLuigi Leonardi
Introduce support for virtio_transport_unsent_bytes ioctl for virtio_transport, vhost_vsock and vsock_loopback. For all transports the unsent bytes counter is incremented in virtio_transport_get_credit. In virtio_transport (G2H) and in vhost-vsock (H2G) the counter is decremented when the skbuff is consumed. In vsock_loopback the same skbuff is passed from the transmitter to the receiver, so the counter is decremented before queuing the skbuff to the receiver. Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
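From userspace, the counter described above would be read with the `SIOCOUTQ` ioctl on a connected socket. The helper below is a minimal sketch; the name `query_unsent_bytes()` is hypothetical, and it assumes the value is returned through an `int`, as it is for other socket families that implement `SIOCOUTQ`.

```c
#include <linux/sockios.h>	/* SIOCOUTQ */
#include <sys/ioctl.h>

/*
 * Hypothetical helper: return the number of unsent bytes queued on the
 * socket, or -1 on error (errno is set by ioctl()).
 */
static int query_unsent_bytes(int sock_fd)
{
	int unsent = 0;

	if (ioctl(sock_fd, SIOCOUTQ, &unsent) < 0)
		return -1;

	return unsent;
}
```

A sender could poll this after its final write and wait for the value to reach 0 before closing the socket.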
2024-08-02ata: libata: Print device quirks only onceDamien Le Moal
In ata_dev_print_quirks(), return early if ata_dev_print_info() returns false or if we already printed quirk information. This is to avoid printing a device's quirks multiple times (that is, each time ata_dev_revalidate() is called). To remember if ata_dev_print_quirks() was already executed, define the EH context flag ATA_EHI_DID_PRINT_QUIRKS and set this flag in ata_dev_print_quirks(). Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 58157d607aec ("ata: libata: Print quirks applied to devices") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
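The print-once logic described above can be sketched in userspace as follows. Everything here is a simplified model: `struct eh_info_model`, `print_quirks_once()` and the flag value are hypothetical stand-ins, with `EHI_DID_PRINT_QUIRKS` playing the role of ATA_EHI_DID_PRINT_QUIRKS and the `print_info` argument standing in for the result of ata_dev_print_info().

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for ATA_EHI_DID_PRINT_QUIRKS; the bit value is arbitrary. */
#define EHI_DID_PRINT_QUIRKS (1u << 0)

/* Simplified model of the EH context info that carries the flag. */
struct eh_info_model {
	unsigned int flags;
};

/* Returns true if the quirks were printed, false if printing was skipped. */
static bool print_quirks_once(struct eh_info_model *ehi, bool print_info,
			      const char *quirks)
{
	/* Skip when info printing is off or the quirks were already shown. */
	if (!print_info || (ehi->flags & EHI_DID_PRINT_QUIRKS))
		return false;

	printf("quirks: %s\n", quirks);
	ehi->flags |= EHI_DID_PRINT_QUIRKS;	/* remember for later revalidations */
	return true;
}
```

Calling the function repeatedly with the same context prints the quirks on the first eligible call only, which models the repeated ata_dev_revalidate() case from the message.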