summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2022-12-12Merge tag 'cxl-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxlLinus Torvalds
Pull cxl updates from Dan Williams: "Compute Express Link (CXL) updates for 6.2. While it may seem backwards, the CXL update this time around includes some focus on CXL 1.x enabling where the work to date had been with CXL 2.0 (VH topologies) in mind. First generation CXL can mostly be supported via BIOS, similar to DDR, however it became clear there are use cases for OS native CXL error handling and some CXL 3.0 endpoint features can be deployed on CXL 1.x hosts (Restricted CXL Host (RCH) topologies). So, this update brings RCH topologies into the Linux CXL device model. In support of the ongoing CXL 2.0+ enabling two new core kernel facilities are added. One is the ability for the kernel to flag collisions between userspace access to PCI configuration registers and kernel accesses. This is brought on by the PCIe Data-Object-Exchange (DOE) facility, a hardware mailbox over config-cycles. The other is a cpu_cache_invalidate_memregion() API that maps to wbinvd_on_all_cpus() on x86. To prevent abuse it is disabled in guest VMs and architectures that do not support it yet. The CXL paths that need it, dynamic memory region creation and security commands (erase / unlock), are disabled when it is not present. As for the CXL 2.0+ this cycle the subsystem gains support Persistent Memory Security commands, error handling in response to PCIe AER notifications, and support for the "XOR" host bridge interleave algorithm. Summary: - Add the cpu_cache_invalidate_memregion() API for cache flushing in response to physical memory reconfiguration, or memory-side data invalidation from operations like secure erase or memory-device unlock. - Add a facility for the kernel to warn about collisions between kernel and userspace access to PCI configuration registers - Add support for Restricted CXL Host (RCH) topologies (formerly CXL 1.1) - Add handling and reporting of CXL errors reported via the PCIe AER mechanism - Add support for CXL Persistent Memory Security commands - Add support for the "XOR" algorithm for CXL host bridge interleave - Rework / simplify CXL to NVDIMM interactions - Miscellaneous cleanups and fixes" * tag 'cxl-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (71 commits) cxl/region: Fix memdev reuse check cxl/pci: Remove endian confusion cxl/pci: Add some type-safety to the AER trace points cxl/security: Drop security command ioctl uapi cxl/mbox: Add variable output size validation for internal commands cxl/mbox: Enable cxl_mbox_send_cmd() users to validate output size cxl/security: Fix Get Security State output payload endian handling cxl: update names for interleave ways conversion macros cxl: update names for interleave granularity conversion macros cxl/acpi: Warn about an invalid CHBCR in an existing CHBS entry tools/testing/cxl: Require cache invalidation bypass cxl/acpi: Fail decoder add if CXIMS for HBIG is missing cxl/region: Fix spelling mistake "memergion" -> "memregion" cxl/regs: Fix sparse warning cxl/acpi: Set ACPI's CXL _OSC to indicate RCD mode support tools/testing/cxl: Add an RCH topology cxl/port: Add RCD endpoint port enumeration cxl/mem: Move devm_cxl_add_endpoint() from cxl_core to cxl_mem tools/testing/cxl: Add XOR Math support to cxl_test cxl/acpi: Support CXL XOR Interleave Math (CXIMS) ...
2022-12-12Merge tag 'thermal-6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull thermal control updates from Rafael Wysocki: "These include thermal core fixes to protect thermal device operations against thermal device removal, other thermal core fixes and updates of Intel thermal control drivers. Specifics: - Fix race conditions related to thermal device operations that are not protected against thermal device removal (Guenter Roeck) - Fix error code in __thermal_cooling_device_register() (Dan Carpenter) - Validate new cooling device state (coming from user space) in cur_state_store() and reuse the max_state value from cooling device structure in the sysfs interface (Viresh Kumar) - Fix some possible name leaks in error paths in the thermal control core code (Yang Yingliang) - Detect TCC lock bit set in the intel_tcc_cooling driver and make it refuse to update the TCC offset in that case (Zhang Rui) - Add TCC cooling support for RaptorLake-S (Zhang Rui) - Prevent accidental clearing of HFI status by one of the other drivers using the same status register (Srinivas Pandruvada) - Protect clearing of thermal status bits in Intel thermal control drivers (Srinivas Pandruvada) - Allow the HFI thermal control driver to ACK an HFI event for the previously observed timestamp (Srinivas Pandruvada) - Remove a pointless die_id check from the HFI thermal driver and adjust the definition a data structure used by it (Ricardo Neri)" * tag 'thermal-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: thermal: intel: hfi: Remove a pointless die_id check thermal: core: fix some possible name leaks in error paths thermal: intel: hfi: ACK HFI for the same timestamp thermal: intel: Protect clearing of thermal status bits thermal: intel: Prevent accidental clearing of HFI status thermal/core: Protect thermal device operations against thermal device removal thermal/core: Remove thermal_zone_set_trips() thermal/core: Protect sysfs accesses to thermal operations with thermal zone mutex thermal/core: Protect hwmon accesses to thermal operations with thermal zone mutex thermal/core: Introduce locked version of thermal_zone_device_update thermal/core: Move parameter validation from __thermal_zone_get_temp to thermal_zone_get_temp thermal/core: Ensure that thermal device is registered in thermal_zone_get_temp thermal/core: Delete device under thermal device zone lock thermal/core: Destroy thermal zone device mutex in release function thermal: intel: intel_tcc_cooling: Add TCC cooling support for RaptorLake-S thermal: intel: intel_tcc_cooling: Detect TCC lock bit thermal: intel: hfi: Improve the type of hfi_features::nr_table_pages thermal/core: fix error code in __thermal_cooling_device_register() thermal: sysfs: Reuse cdev->max_state thermal: Validate new state in cur_state_store()
2022-12-12Merge tag 'acpi-6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and PNP updates from Rafael Wysocki: "These include new code (for instance, support for the FFH address space type and support for new firmware data structures in ACPICA), some new quirks (mostly related to backlight handling and I2C enumeration), a number of fixes and a fair amount of cleanups all over. Specifics: - Update the ACPICA code in the kernel to the 20221020 upstream version and fix a couple of issues in it: - Make acpi_ex_load_op() match upstream implementation (Rafael Wysocki) - Add support for loong_arch-specific APICs in MADT (Huacai Chen) - Add support for fixed PCIe wake event (Huacai Chen) - Add EBDA pointer sanity checks (Vit Kabele) - Avoid accessing VGA memory when EBDA < 1KiB (Vit Kabele) - Add CCEL table support to both compiler/disassembler (Kuppuswamy Sathyanarayanan) - Add a couple of new UUIDs to the known UUID list (Bob Moore) - Add support for FFH Opregion special context data (Sudeep Holla) - Improve warning message for "invalid ACPI name" (Bob Moore) - Add support for CXL 3.0 structures (CXIMS & RDPAS) in the CEDT table (Alison Schofield) - Prepare IORT support for revision E.e (Robin Murphy) - Finish support for the CDAT table (Bob Moore) - Fix error code path in acpi_ds_call_control_method() (Rafael Wysocki) - Fix use-after-free in acpi_ut_copy_ipackage_to_ipackage() (Li Zetao) - Update the version of the ACPICA code in the kernel (Bob Moore) - Use ZERO_PAGE(0) instead of empty_zero_page in the ACPI device enumeration code (Giulio Benetti) - Change the return type of the ACPI driver remove callback to void and update its users accordingly (Dawei Li) - Add general support for FFH address space type and implement the low- level part of it for ARM64 (Sudeep Holla) - Fix stale comments in the ACPI tables parsing code and make it print more messages related to MADT (Hanjun Guo, Huacai Chen) - Replace invocations of generic library functions with more kernel- specific counterparts in the ACPI sysfs interface (Christophe JAILLET, Xu Panda) - Print full name paths of ACPI power resource objects during enumeration (Kane Chen) - Eliminate a compiler warning regarding a missing function prototype in the ACPI power management code (Sudeep Holla) - Fix and clean up the ACPI processor driver (Rafael Wysocki, Li Zhong, Colin Ian King, Sudeep Holla) - Add quirk for the HP Pavilion Gaming 15-cx0041ur to the ACPI EC driver (Mia Kanashi) - Add some mew ACPI backlight handling quirks and update some existing ones (Hans de Goede) - Make the ACPI backlight driver prefer the native backlight control over vendor backlight control when possible (Hans de Goede) - Drop unsetting ACPI APEI driver data on remove (Uwe Kleine-König) - Use xchg_release() instead of cmpxchg() for updating new GHES cache slots (Ard Biesheuvel) - Clean up the ACPI APEI code (Sudeep Holla, Christophe JAILLET, Jay Lu) - Add new I2C device enumeration quirks for Medion Lifetab S10346 and Lenovo Yoga Tab 3 Pro (YT3-X90F) (Hans de Goede) - Make the ACPI battery driver notify user space about adding new battery hooks and removing the existing ones (Armin Wolf) - Modify the pfr_update and pfr_telemetry drivers to use ACPI_FREE() for freeing acpi_object structures to help diagnostics (Wang ShaoBo) - Make the ACPI fan driver use sysfs_emit_at() in its sysfs interface code (ye xingchen) - Fix the _FIF package extraction failure handling in the ACPI fan driver (Hanjun Guo) - Fix the PCC mailbox handling error code path (Huisong Li) - Avoid using PCC Opregions if there is no platform interrupt allocated for this purpose (Huisong Li) - Use sysfs_emit() instead of scnprintf() in the ACPI PAD driver and CPPC library (ye xingchen) - Fix some kernel-doc issues in the ACPI GSI processing code (Xiongfeng Wang) - Fix name memory leak in pnp_alloc_dev() (Yang Yingliang) - Do not disable PNP devices on suspend when they cannot be re-enabled on resume (Hans de Goede) - Clean up the ACPI thermal driver a bit (Rafael Wysocki)" * tag 'acpi-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (67 commits) ACPI: x86: Add skip i2c clients quirk for Medion Lifetab S10346 ACPI: APEI: EINJ: Refactor available_error_type_show() ACPI: APEI: EINJ: Fix formatting errors ACPI: processor: perflib: Adjust acpi_processor_notify_smm() return value ACPI: processor: perflib: Rearrange acpi_processor_notify_smm() ACPI: processor: perflib: Rearrange unregistration routine ACPI: processor: perflib: Drop redundant parentheses ACPI: processor: perflib: Adjust white space ACPI: processor: idle: Drop unnecessary statements and parens ACPI: thermal: Adjust critical.flags.valid check ACPI: fan: Convert to use sysfs_emit_at() API ACPICA: Fix use-after-free in acpi_ut_copy_ipackage_to_ipackage() ACPI: battery: Call power_supply_changed() when adding hooks ACPI: use sysfs_emit() instead of scnprintf() ACPI: x86: Add skip i2c clients quirk for Lenovo Yoga Tab 3 Pro (YT3-X90F) ACPI: APEI: Remove a useless include PNP: Do not disable devices on suspend when they cannot be re-enabled on resume ACPI: processor: Silence missing prototype warnings ACPI: processor_idle: Silence missing prototype warnings ACPI: PM: Silence missing prototype warning ...
2022-12-12Merge tag 'pm-6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These include two new drivers (cpufreq driver for Apple SoC CPU P-states and the SCMI Powercap based power capping driver), other new hardware support and driver extensions (Qualcomm cpufreq driver and its DT bindings, TI cpufreq driver, intel_pstate, intel-uncore-freq), a bunch of fixes and cleanups all over and a cpupower utility update including new features related to RAPL support. Specifics: - Fix nasty and hard to debug race condition introduced by mistake in the runtime PM core code and clean up that code somewhat on top of the fix (Rafael Wysocki) - Generalize of_perf_domain_get_sharing_cpumask phandle format (Hector Martin) - Add new cpufreq driver for Apple SoC CPU P-states (Hector Martin) - Update Qualcomm cpufreq driver (Manivannan Sadhasivam, Chen Hui): - CPU clock provider support - Generic cleanups or reorganization - Potential memleak fix - Fix of the return value of cpufreq_driver->get() - Update Qualcomm cpufreq driver's DT bindings (Manivannan Sadhasivam, Rob Herring, Melody Olvera): - Support for CPU clock provider - Missing cache-related properties fixes - Support for QDU1000/QRU1000 - Add support for ti,am625 SoC and enable build of ti-cpufreq for ARCH_K3 (Dave Gerlach, and Vibhore Vardhan) - Use flexible array to simplify memory allocation in the tegra186 cpufreq driver (Christophe JAILLET) - Convert cpufreq statistics code to use sysfs_emit_at() (ye xingchen) - Allow intel_pstate to use no-HWP mode on Sapphire Rapids (Giovanni Gherdovich) - Add missing pci_dev_put() to the amd_freq_sensitivity cpufreq driver (Xiongfeng Wang) - Initialize the kobj_unregister completion before calling kobject_init_and_add() in the cpufreq core code (Yongqiang Liu) - Defer setting boost MSRs in the ACPI cpufreq driver (Stuart Hayes, Nathan Chancellor) - Make intel_pstate accept initial EPP value of 0x80 (Srinivas Pandruvada) - Make read-only array sys_clk_src in the SPEAr cpufreq driver static (Colin Ian King) - Make array speeds in the longhaul cpufreq driver static (Colin Ian King) - Use str_enabled_disabled() helper in the ACPI cpufreq driver (Andy Shevchenko) - Drop a reference to CVS from cpufreq documentation (Conghui Wang) - Improve kernel messages printed by the PSCI cpuidle driver (Ulf Hansson) - Make the DT cpuidle driver return the correct number of parsed idle states, clean it up and clarify a comment in it (Ulf Hansson) - Modify the tasks freezing code to avoid using pr_cont() and refine an error message printed by it (Rafael Wysocki) - Make the hibernation core code complain about memory map mismatches during resume to help diagnostics (Xueqin Luo) - Fix mistake in a kerneldoc comment in the hibernation code (xiongxin) - Reverse the order of performance and enabling operations in the generic power domains code (Abel Vesa) - Power off[on] domains in hibernate .freeze[thaw]_noirq hook of in the generic power domains code (Abel Vesa) - Consolidate genpd_restore_noirq() and genpd_resume_noirq() (Shawn Guo) - Pass generic PM noirq hooks to genpd_finish_suspend() (Shawn Guo) - Drop generic power domain status manipulation during hibernate restore (Shawn Guo) - Fix compiler warnings with make W=1 in the idle_inject power capping driver (Srinivas Pandruvada) - Use kstrtobool() instead of strtobool() in the power capping sysfs interface (Christophe JAILLET) - Add SCMI Powercap based power capping driver (Cristian Marussi) - Add Emerald Rapids support to the intel-uncore-freq driver (Artem Bityutskiy) - Repair slips in kernel-doc comments in the generic notifier code (Lukas Bulwahn) - Fix several DT issues in the OPP library reorganize code around opp-microvolt-<named> DT property (Viresh Kumar) - Allow any of opp-microvolt, opp-microamp, or opp-microwatt properties to be present without the others present (James Calligeros) - Fix clock-latency-ns property in DT example (Serge Semin) - Add a private governor_data for devfreq governors (Kant Fan) - Reorganize devfreq code to use device_match_of_node() and devm_platform_get_and_ioremap_resource() instead of open coding them (ye xingchen, Minghao Chi) - Make cpupower choose base_cpu to display default cpupower details instead of picking CPU 0 (Saket Kumar Bhaskar) - Add Georgian translation to cpupower documentation (Zurab Kargareteli) - Introduce powercap intel-rapl library, powercap-info command, and RAPL monitor into cpupower (Thomas Renninger)" * tag 'pm-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (64 commits) PM: runtime: Adjust white space in the core code cpufreq: Remove CVS version control contents from documentation cpufreq: stats: Convert to use sysfs_emit_at() API cpufreq: ACPI: Only set boost MSRs on supported CPUs PM: sleep: Refine error message in try_to_freeze_tasks() PM: sleep: Avoid using pr_cont() in the tasks freezing code PM: runtime: Relocate rpm_callback() right after __rpm_callback() PM: runtime: Do not call __rpm_callback() from rpm_idle() PM / devfreq: event: use devm_platform_get_and_ioremap_resource() PM / devfreq: event: Use device_match_of_node() PM / devfreq: Use device_match_of_node() powercap: idle_inject: Fix warnings with make W=1 PM: hibernate: Complain about memory map mismatches during resume dt-bindings: cpufreq: cpufreq-qcom-hw: Add QDU1000/QRU1000 cpufreq cpufreq: tegra186: Use flexible array to simplify memory allocation cpupower: rapl monitor - shows the used power consumption in uj for each rapl domain cpupower: Introduce powercap intel-rapl library and powercap-info command cpupower: Add Georgian translation cpufreq: intel_pstate: Add Sapphire Rapids support in no-HWP mode cpufreq: amd_freq_sensitivity: Add missing pci_dev_put() ...
2022-12-12Merge tag 'x86-misc-2022-12-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 updates from Thomas Gleixner: "Updates for miscellaneous x86 areas: - Reserve a new boot loader type for barebox which is usally used on ARM and MIPS, but can also be utilized as EFI payload on x86 to provide watchdog-supervised boot up. - Consolidate the native and compat 32bit signal handling code and split the 64bit version out into a separate source file - Switch the ESPFIX random usage to get_random_long()" * tag 'x86-misc-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/espfix: Use get_random_long() rather than archrandom x86/signal/64: Move 64-bit signal code to its own file x86/signal/32: Merge native and compat 32-bit signal code x86/signal: Add ABI prefixes to frame setup functions x86/signal: Merge get_sigframe() x86: Remove __USER32_DS signal/compat: Remove compat_sigset_t override x86/signal: Remove sigset_t parameter from frame setup functions x86/signal: Remove sig parameter from frame setup functions Documentation/x86/boot: Reserve type_of_loader=13 for barebox
2022-12-12Merge tag 'timers-core-2022-12-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Updates for timers, timekeeping and drivers: Core: - The timer_shutdown[_sync]() infrastructure: Tearing down timers can be tedious when there are circular dependencies to other things which need to be torn down. A prime example is timer and workqueue where the timer schedules work and the work arms the timer. What needs to prevented is that pending work which is drained via destroy_workqueue() does not rearm the previously shutdown timer. Nothing in that shutdown sequence relies on the timer being functional. The conclusion was that the semantics of timer_shutdown_sync() should be: - timer is not enqueued - timer callback is not running - timer cannot be rearmed Preventing the rearming of shutdown timers is done by discarding rearm attempts silently. A warning for the case that a rearm attempt of a shutdown timer is detected would not be really helpful because it's entirely unclear how it should be acted upon. The only way to address such a case is to add 'if (in_shutdown)' conditionals all over the place. This is error prone and in most cases of teardown not required all. - The real fix for the bluetooth HCI teardown based on timer_shutdown_sync(). A larger scale conversion to timer_shutdown_sync() is work in progress. - Consolidation of VDSO time namespace helper functions - Small fixes for timer and timerqueue Drivers: - Prevent integer overflow on the XGene-1 TVAL register which causes an never ending interrupt storm. - The usual set of new device tree bindings - Small fixes and improvements all over the place" * tag 'timers-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits) dt-bindings: timer: renesas,cmt: Add r8a779g0 CMT support dt-bindings: timer: renesas,tmu: Add r8a779g0 support clocksource/drivers/arm_arch_timer: Use kstrtobool() instead of strtobool() clocksource/drivers/timer-ti-dm: Fix missing clk_disable_unprepare in dmtimer_systimer_init_clock() clocksource/drivers/timer-ti-dm: Clear settings on probe and free clocksource/drivers/timer-ti-dm: Make timer_get_irq static clocksource/drivers/timer-ti-dm: Fix warning for omap_timer_match clocksource/drivers/arm_arch_timer: Fix XGene-1 TVAL register math error clocksource/drivers/timer-npcm7xx: Enable timer 1 clock before use dt-bindings: timer: nuvoton,npcm7xx-timer: Allow specifying all clocks dt-bindings: timer: rockchip: Add rockchip,rk3128-timer clockevents: Repair kernel-doc for clockevent_delta2ns() clocksource/drivers/ingenic-ost: Define pm functions properly in platform_driver struct clocksource/drivers/sh_cmt: Access registers according to spec vdso/timens: Refactor copy-pasted find_timens_vvar_page() helper into one copy Bluetooth: hci_qca: Fix the teardown problem for real timers: Update the documentation to reflect on the new timer_shutdown() API timers: Provide timer_shutdown[_sync]() timers: Add shutdown mechanism to the internal functions timers: Split [try_to_]del_timer[_sync]() to prepare for shutdown mode ...
2022-12-12Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-12-11 We've added 74 non-merge commits during the last 11 day(s) which contain a total of 88 files changed, 3362 insertions(+), 789 deletions(-). The main changes are: 1) Decouple prune and jump points handling in the verifier, from Andrii. 2) Do not rely on ALLOW_ERROR_INJECTION for fmod_ret, from Benjamin. Merged from hid tree. 3) Do not zero-extend kfunc return values. Necessary fix for 32-bit archs, from Björn. 4) Don't use rcu_users to refcount in task kfuncs, from David. 5) Three reg_state->id fixes in the verifier, from Eduard. 6) Optimize bpf_mem_alloc by reusing elements from free_by_rcu, from Hou. 7) Refactor dynptr handling in the verifier, from Kumar. 8) Remove the "/sys" mount and umount dance in {open,close}_netns in bpf selftests, from Martin. 9) Enable sleepable support for cgrp local storage, from Yonghong. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (74 commits) selftests/bpf: test case for relaxed prunning of active_lock.id selftests/bpf: Add pruning test case for bpf_spin_lock bpf: use check_ids() for active_lock comparison selftests/bpf: verify states_equal() maintains idmap across all frames bpf: states_equal() must build idmap for all function frames selftests/bpf: test cases for regsafe() bug skipping check_id() bpf: regsafe() must not skip check_ids() docs/bpf: Add documentation for BPF_MAP_TYPE_SK_STORAGE selftests/bpf: Add test for dynptr reinit in user_ringbuf callback bpf: Use memmove for bpf_dynptr_{read,write} bpf: Move PTR_TO_STACK alignment check to process_dynptr_func bpf: Rework check_func_arg_reg_off bpf: Rework process_dynptr_func bpf: Propagate errors from process_* checks in check_func_arg bpf: Refactor ARG_PTR_TO_DYNPTR checks into process_dynptr_func bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true bpf: Reuse freed element in free_by_rcu during allocation selftests/bpf: Bring test_offload.py back to life bpf: Fix comment error in fixup_kfunc_call function bpf: Do not zero-extend kfunc return values ... ==================== Link: https://lore.kernel.org/r/20221212024701.73809-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12Merge tag 'irq-core-2022-12-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates for the interrupt core and driver subsystem: The bulk is the rework of the MSI subsystem to support per device MSI interrupt domains. This solves conceptual problems of the current PCI/MSI design which are in the way of providing support for PCI/MSI[-X] and the upcoming PCI/IMS mechanism on the same device. IMS (Interrupt Message Store] is a new specification which allows device manufactures to provide implementation defined storage for MSI messages (as opposed to PCI/MSI and PCI/MSI-X that has a specified message store which is uniform accross all devices). The PCI/MSI[-X] uniformity allowed us to get away with "global" PCI/MSI domains. IMS not only allows to overcome the size limitations of the MSI-X table, but also gives the device manufacturer the freedom to store the message in arbitrary places, even in host memory which is shared with the device. There have been several attempts to glue this into the current MSI code, but after lengthy discussions it turned out that there is a fundamental design problem in the current PCI/MSI-X implementation. This needs some historical background. When PCI/MSI[-X] support was added around 2003, interrupt management was completely different from what we have today in the actively developed architectures. Interrupt management was completely architecture specific and while there were attempts to create common infrastructure the commonalities were rudimentary and just providing shared data structures and interfaces so that drivers could be written in an architecture agnostic way. The initial PCI/MSI[-X] support obviously plugged into this model which resulted in some basic shared infrastructure in the PCI core code for setting up MSI descriptors, which are a pure software construct for holding data relevant for a particular MSI interrupt, but the actual association to Linux interrupts was completely architecture specific. This model is still supported today to keep museum architectures and notorious stragglers alive. In 2013 Intel tried to add support for hot-pluggable IO/APICs to the kernel, which was creating yet another architecture specific mechanism and resulted in an unholy mess on top of the existing horrors of x86 interrupt handling. The x86 interrupt management code was already an incomprehensible maze of indirections between the CPU vector management, interrupt remapping and the actual IO/APIC and PCI/MSI[-X] implementation. At roughly the same time ARM struggled with the ever growing SoC specific extensions which were glued on top of the architected GIC interrupt controller. This resulted in a fundamental redesign of interrupt management and provided the today prevailing concept of hierarchical interrupt domains. This allowed to disentangle the interactions between x86 vector domain and interrupt remapping and also allowed ARM to handle the zoo of SoC specific interrupt components in a sane way. The concept of hierarchical interrupt domains aims to encapsulate the functionality of particular IP blocks which are involved in interrupt delivery so that they become extensible and pluggable. The X86 encapsulation looks like this: |--- device 1 [Vector]---[Remapping]---[PCI/MSI]--|... |--- device N where the remapping domain is an optional component and in case that it is not available the PCI/MSI[-X] domains have the vector domain as their parent. This reduced the required interaction between the domains pretty much to the initialization phase where it is obviously required to establish the proper parent relation ship in the components of the hierarchy. While in most cases the model is strictly representing the chain of IP blocks and abstracting them so they can be plugged together to form a hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the hardware it's clear that the actual PCI/MSI[-X] interrupt controller is not a global entity, but strict a per PCI device entity. Here we took a short cut on the hierarchical model and went for the easy solution of providing "global" PCI/MSI domains which was possible because the PCI/MSI[-X] handling is uniform across the devices. This also allowed to keep the existing PCI/MSI[-X] infrastructure mostly unchanged which in turn made it simple to keep the existing architecture specific management alive. A similar problem was created in the ARM world with support for IP block specific message storage. Instead of going all the way to stack a IP block specific domain on top of the generic MSI domain this ended in a construct which provides a "global" platform MSI domain which allows overriding the irq_write_msi_msg() callback per allocation. In course of the lengthy discussions we identified other abuse of the MSI infrastructure in wireless drivers, NTB etc. where support for implementation specific message storage was just mindlessly glued into the existing infrastructure. Some of this just works by chance on particular platforms but will fail in hard to diagnose ways when the driver is used on platforms where the underlying MSI interrupt management code does not expect the creative abuse. Another shortcoming of today's PCI/MSI-X support is the inability to allocate or free individual vectors after the initial enablement of MSI-X. This results in an works by chance implementation of VFIO (PCI pass-through) where interrupts on the host side are not set up upfront to avoid resource exhaustion. They are expanded at run-time when the guest actually tries to use them. The way how this is implemented is that the host disables MSI-X and then re-enables it with a larger number of vectors again. That works by chance because most device drivers set up all interrupts before the device actually will utilize them. But that's not universally true because some drivers allocate a large enough number of vectors but do not utilize them until it's actually required, e.g. for acceleration support. But at that point other interrupts of the device might be in active use and the MSI-X disable/enable dance can just result in losing interrupts and therefore hard to diagnose subtle problems. Last but not least the "global" PCI/MSI-X domain approach prevents to utilize PCI/MSI[-X] and PCI/IMS on the same device due to the fact that IMS is not longer providing a uniform storage and configuration model. The solution to this is to implement the missing step and switch from global PCI/MSI domains to per device PCI/MSI domains. The resulting hierarchy then looks like this: |--- [PCI/MSI] device 1 [Vector]---[Remapping]---|... |--- [PCI/MSI] device N which in turn allows to provide support for multiple domains per device: |--- [PCI/MSI] device 1 |--- [PCI/IMS] device 1 [Vector]---[Remapping]---|... |--- [PCI/MSI] device N |--- [PCI/IMS] device N This work converts the MSI and PCI/MSI core and the x86 interrupt domains to the new model, provides new interfaces for post-enable allocation/free of MSI-X interrupts and the base framework for PCI/IMS. PCI/IMS has been verified with the work in progress IDXD driver. There is work in progress to convert ARM over which will replace the platform MSI train-wreck. The cleanup of VFIO, NTB and other creative "solutions" are in the works as well. Drivers: - Updates for the LoongArch interrupt chip drivers - Support for MTK CIRQv2 - The usual small fixes and updates all over the place" * tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (134 commits) irqchip/ti-sci-inta: Fix kernel doc irqchip/gic-v2m: Mark a few functions __init irqchip/gic-v2m: Include arm-gic-common.h irqchip/irq-mvebu-icu: Fix works by chance pointer assignment iommu/amd: Enable PCI/IMS iommu/vt-d: Enable PCI/IMS x86/apic/msi: Enable PCI/IMS PCI/MSI: Provide pci_ims_alloc/free_irq() PCI/MSI: Provide IMS (Interrupt Message Store) support genirq/msi: Provide constants for PCI/IMS support x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X PCI/MSI: Provide prepare_desc() MSI domain op PCI/MSI: Split MSI-X descriptor setup genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN genirq/msi: Provide msi_domain_alloc_irq_at() genirq/msi: Provide msi_domain_ops:: Prepare_desc() genirq/msi: Provide msi_desc:: Msi_data genirq/msi: Provide struct msi_map x86/apic/msi: Remove arch_create_remap_msi_irq_domain() ...
2022-12-12Merge branches 'clk-mediatek', 'clk-trace', 'clk-qcom' and 'clk-microchip' ↵Stephen Boyd
into clk-next - Tracepoints for clk_rate_request structures * clk-mediatek: clk: mediatek: fix dependency of MT7986 ADC clocks clk: mediatek: Change PLL register API for MT8186 clk: mediatek: Add new clock driver to handle FHCTL hardware dt-bindings: clock: mediatek: Add new bindings of MediaTek frequency hopping clk: mediatek: Export PLL operations symbols clk: mediatek: mt8186-topckgen: Add GPU clock mux notifier clk: mediatek: mt8186-mfg: Propagate rate changes to parent clk: mediatek: mt8195-topckgen: Drop flags for main/univpll fixed factors clk: mediatek: mt8192: Drop flags for main/univpll fixed factors clk: mediatek: mt6795-topckgen: Drop flags for main/sys/univpll fixed factors clk: mediatek: mt8173: Drop flags for main/sys/univpll fixed factors clk: mediatek: mt8183: Drop flags for sys/univpll fixed factors clk: mediatek: mt8183: Compress top_divs array entries clk: mediatek: mt8186-topckgen: Drop flags for main/univpll fixed factors clk: mediatek: clk-mtk: Allow specifying flags on mtk_fixed_factor clocks * clk-trace: clk: Add trace events for rate requests clk: Store clk_core for clk_rate_request * clk-qcom: (69 commits) clk: qcom: rpmh: add support for SM6350 rpmh IPA clock clk: qcom: mmcc-msm8974: use parent_hws/_data instead of parent_names clk: qcom: mmcc-msm8974: move clock parent tables down clk: qcom: mmcc-msm8974: use ARRAY_SIZE instead of specifying num_parents clk: qcom: gcc-msm8974: use parent_hws/_data instead of parent_names clk: qcom: gcc-msm8974: move clock parent tables down clk: qcom: gcc-msm8974: use ARRAY_SIZE instead of specifying num_parents dt-bindings: clocks: qcom,mmcc: define clocks/clock-names for MSM8974 dt-bindings: clock: split qcom,gcc-msm8974,-msm8226 to the separate file clk: qcom: gcc-ipq4019: switch to devm_clk_notifier_register clk: qcom: rpmh: remove usage of platform name clk: qcom: rpmh: rename VRM clock data clk: qcom: rpmh: rename ARC clock data clk: qcom: rpmh: support separate symbol name for the RPMH clocks clk: qcom: rpmh: remove platform names from BCM clocks clk: qcom: rpmh: drop all _ao names clk: qcom: rpmh: reuse common duplicate clocks clk: qcom: rpmh: group clock definitions together clk: qcom: rpm: drop the platform from clock definitions clk: qcom: rpm: drop the _clk suffix completely ... * clk-microchip: clk: microchip: enable the MPFS clk driver by default if SOC_MICROCHIP_POLARFIRE clk: microchip: check for null return of devm_kzalloc()
2022-12-12Merge tag 'soc-drivers-6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC driver updates from Arnd Bergmann: "There are few major updates in the SoC specific drivers, mainly the usual reworks and support for variants of the existing SoC. While this remains Arm centric for the most part, the branch now also contains updates to risc-v and loongarch specific code in drivers/soc/. Notable changes include: - Support for the newly added Qualcomm Snapdragon variants (MSM8956, MSM8976, SM6115, SM4250, SM8150, SA8155 and SM8550) in the soc ID, rpmh, rpm, spm and powerdomain drivers. - Documentation for the somewhat controversial qcom,board-id properties that are required for booting a number of machines - A new SoC identification driver for the loongson-2 (loongarch) platform - memory controller updates for stm32, tegra, and renesas. - a new DT binding to better describe LPDDR2/3/4/5 chips in the memory controller subsystem - Updates for Tegra specific drivers across multiple subsystems, improving support for newer SoCs and better identification - Minor fixes for Broadcom, Freescale, Apple, Renesas, Sifive, TI, Mediatek and Marvell SoC drivers" * tag 'soc-drivers-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (137 commits) soc: qcom: socinfo: Add SM6115 / SM4250 SoC IDs to the soc_id table dt-bindings: arm: qcom,ids: Add SoC IDs for SM6115 / SM4250 and variants soc: qcom: socinfo: Add SM8150 and SA8155 SoC IDs to the soc_id table dt-bindings: arm: qcom,ids: Add SoC IDs for SM8150 and SA8155 dt-bindings: soc: qcom: apr: document generic qcom,apr compatible soc: qcom: Select REMAP_MMIO for ICC_BWMON driver soc: qcom: Select REMAP_MMIO for LLCC driver soc: qcom: rpmpd: Add SM4250 support dt-bindings: power: rpmpd: Add SM4250 support dt-bindings: soc: qcom: aoss: Add compatible for SM8550 soc: qcom: llcc: Add configuration data for SM8550 dt-bindings: arm: msm: Add LLCC compatible for SM8550 soc: qcom: llcc: Add v4.1 HW version support soc: qcom: socinfo: Add SM8550 ID soc: qcom: rpmh-rsc: Avoid unnecessary checks on irq-done response soc: qcom: rpmh-rsc: Add support for RSC v3 register offsets soc: qcom: rpmhpd: Add SM8550 power domains dt-bindings: power: rpmpd: Add SM8550 to rpmpd binding soc: qcom: socinfo: Add MSM8956/76 SoC IDs to the soc_id table dt-bindings: arm: qcom,ids: Add SoC IDs for MSM8956 and MSM8976 ...
2022-12-12Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "The highlights this time are support for dynamically enabling and disabling Clang's Shadow Call Stack at boot and a long-awaited optimisation to the way in which we handle the SVE register state on system call entry to avoid taking unnecessary traps from userspace. Summary: ACPI: - Enable FPDT support for boot-time profiling - Fix CPU PMU probing to work better with PREEMPT_RT - Update SMMUv3 MSI DeviceID parsing to latest IORT spec - APMT support for probing Arm CoreSight PMU devices CPU features: - Advertise new SVE instructions (v2.1) - Advertise range prefetch instruction - Advertise CSSC ("Common Short Sequence Compression") scalar instructions, adding things like min, max, abs, popcount - Enable DIT (Data Independent Timing) when running in the kernel - More conversion of system register fields over to the generated header CPU misfeatures: - Workaround for Cortex-A715 erratum #2645198 Dynamic SCS: - Support for dynamic shadow call stacks to allow switching at runtime between Clang's SCS implementation and the CPU's pointer authentication feature when it is supported (complete with scary DWARF parser!) Tracing and debug: - Remove static ftrace in favour of, err, dynamic ftrace! - Seperate 'struct ftrace_regs' from 'struct pt_regs' in core ftrace and existing arch code - Introduce and implement FTRACE_WITH_ARGS on arm64 to replace the old FTRACE_WITH_REGS - Extend 'crashkernel=' parameter with default value and fallback to placement above 4G physical if initial (low) allocation fails SVE: - Optimisation to avoid disabling SVE unconditionally on syscall entry and just zeroing the non-shared state on return instead Exceptions: - Rework of undefined instruction handling to avoid serialisation on global lock (this includes emulation of user accesses to the ID registers) Perf and PMU: - Support for TLP filters in Hisilicon's PCIe PMU device - Support for the DDR PMU present in Amlogic Meson G12 SoCs - Support for the terribly-named "CoreSight PMU" architecture from Arm (and Nvidia's implementation of said architecture) Misc: - Tighten up our boot protocol for systems with memory above 52 bits physical - Const-ify static keys to satisty jump label asm constraints - Trivial FFA driver cleanups in preparation for v1.1 support - Export the kernel_neon_* APIs as GPL symbols - Harden our instruction generation routines against instrumentation - A bunch of robustness improvements to our arch-specific selftests - Minor cleanups and fixes all over (kbuild, kprobes, kfence, PMU, ...)" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (151 commits) arm64: kprobes: Return DBG_HOOK_ERROR if kprobes can not handle a BRK arm64: kprobes: Let arch do_page_fault() fix up page fault in user handler arm64: Prohibit instrumentation on arch_stack_walk() arm64:uprobe fix the uprobe SWBP_INSN in big-endian arm64: alternatives: add __init/__initconst to some functions/variables arm_pmu: Drop redundant armpmu->map_event() in armpmu_event_init() kselftest/arm64: Allow epoll_wait() to return more than one result kselftest/arm64: Don't drain output while spawning children kselftest/arm64: Hold fp-stress children until they're all spawned arm64/sysreg: Remove duplicate definitions from asm/sysreg.h arm64/sysreg: Convert ID_DFR1_EL1 to automatic generation arm64/sysreg: Convert ID_DFR0_EL1 to automatic generation arm64/sysreg: Convert ID_AFR0_EL1 to automatic generation arm64/sysreg: Convert ID_MMFR5_EL1 to automatic generation arm64/sysreg: Convert MVFR2_EL1 to automatic generation arm64/sysreg: Convert MVFR1_EL1 to automatic generation arm64/sysreg: Convert MVFR0_EL1 to automatic generation arm64/sysreg: Convert ID_PFR2_EL1 to automatic generation arm64/sysreg: Convert ID_PFR1_EL1 to automatic generation arm64/sysreg: Convert ID_PFR0_EL1 to automatic generation ...
2022-12-12Merge tag 'hyperv-next-signed-20221208' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Drop unregister syscore from hyperv_cleanup to avoid hang (Gaurav Kohli) - Clean up panic path for Hyper-V framebuffer (Guilherme G. Piccoli) - Allow IRQ remapping to work without x2apic (Nuno Das Neves) - Fix comments (Olaf Hering) - Expand hv_vp_assist_page definition (Saurabh Sengar) - Improvement to page reporting (Shradha Gupta) - Make sure TSC clocksource works when Linux runs as the root partition (Stanislav Kinsburskiy) * tag 'hyperv-next-signed-20221208' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: Remove unregister syscore call from Hyper-V cleanup iommu/hyper-v: Allow hyperv irq remapping without x2apic clocksource: hyper-v: Add TSC page support for root partition clocksource: hyper-v: Use TSC PFN getter to map vvar page clocksource: hyper-v: Introduce TSC PFN getter clocksource: hyper-v: Introduce a pointer to TSC page x86/hyperv: Expand definition of struct hv_vp_assist_page PCI: hv: update comment in x86 specific hv_arch_irq_unmask hv: fix comment typo in vmbus_channel/low_latency drivers: hv, hyperv_fb: Untangle and refactor Hyper-V panic notifiers video: hyperv_fb: Avoid taking busy spinlock on panic path hv_balloon: Add support for configurable order free page reporting mm/page_reporting: Add checks for page_reporting_order param
2022-12-12Merge tag 'tpmdd-next-v6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd Pull tpm updates from Jarkko Sakkinen: "A random collection of TPM fixes and one bug fix for trusted keys" * tag 'tpmdd-next-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd: tpm: st33zp24: remove pointless checks on probe tpm/tpm_crb: Fix error message in __crb_relinquish_locality() tpm/tpm_ftpm_tee: Fix error handling in ftpm_mod_init() tpm: tpm_tis: Add the missed acpi_put_table() to fix memory leak tpm: tpm_crb: Add the missed acpi_put_table() to fix memory leak tpm: acpi: Call acpi_put_table() to fix memory leak tpm: Add flag to use default cancellation policy tpm: tis_i2c: Fix sanity check interrupt enable mask KEYS: trusted: tee: Make registered shm dependency explicit tpm: Avoid function type cast of put_device() tpm: st33zp24: switch to using gpiod API tpm: st33zp24: drop support for platform data
2022-12-12Merge tag 'slab-for-6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - SLOB deprecation and SLUB_TINY The SLOB allocator adds maintenance burden and stands in the way of API improvements [1]. Deprecate it by renaming the config option (to make users notice) to CONFIG_SLOB_DEPRECATED with updated help text. SLUB should be used instead as SLAB will be the next on the removal list. Based on reports from a riscv k210 board with 8MB RAM, add a CONFIG_SLUB_TINY option to minimize SLUB's memory usage at the expense of scalability. This has resolved the k210 regression [2] so in case there are no others (that wouldn't be resolvable by further tweaks to SLUB_TINY) plan is to remove SLOB in a few cycles. Existing defconfigs with CONFIG_SLOB are converted to CONFIG_SLUB_TINY. - kmalloc() slub_debug redzone improvements A series from Feng Tang that builds on the tracking or requested size for kmalloc() allocations (for caches with debugging enabled) added in 6.1, to make redzone checks consider the requested size and not the rounded up one, in order to catch more subtle buffer overruns. Includes new slub_kunit test. - struct slab fields reordering to accomodate larger rcu_head RCU folks would like to grow rcu_head with debugging options, which breaks current struct slab layout's assumptions, so reorganize it to make this possible. - Miscellaneous improvements/fixes: - __alloc_size checking compiler workaround (Kees Cook) - Optimize and cleanup SLUB's sysfs init (Rasmus Villemoes) - Make SLAB compatible with PROVE_RAW_LOCK_NESTING (Jiri Kosina) - Correct SLUB's percpu allocation estimates (Baoquan He) - Re-enableS LUB's run-time failslab sysfs control (Alexander Atanasov) - Make tools/vm/slabinfo more user friendly when not run as root (Rong Tao) - Dead code removal in SLUB (Hyeonggon Yoo) * tag 'slab-for-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (31 commits) mm, slob: rename CONFIG_SLOB to CONFIG_SLOB_DEPRECATED mm, slub: don't aggressively inline with CONFIG_SLUB_TINY mm, slub: remove percpu slabs with CONFIG_SLUB_TINY mm, slub: split out allocations from pre/post hooks mm/slub, kunit: Add a test case for kmalloc redzone check mm/slub, kunit: add SLAB_SKIP_KFENCE flag for cache creation mm, slub: refactor free debug processing mm, slab: ignore SLAB_RECLAIM_ACCOUNT with CONFIG_SLUB_TINY mm, slub: don't create kmalloc-rcl caches with CONFIG_SLUB_TINY mm, slub: lower the default slub_max_order with CONFIG_SLUB_TINY mm, slub: retain no free slabs on partial list with CONFIG_SLUB_TINY mm, slub: disable SYSFS support with CONFIG_SLUB_TINY mm, slub: add CONFIG_SLUB_TINY mm, slab: ignore hardened usercopy parameters when disabled slab: Remove special-casing of const 0 size allocations slab: Clean up SLOB vs kmalloc() definition mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head mm/migrate: make isolate_movable_page() skip slab pages mm/slab: move and adjust kernel-doc for kmem_cache_alloc mm/slub, percpu: correct the calculation of early percpu allocation size ...
2022-12-12Merge tag 'printk-for-6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Add NMI-safe SRCU reader API. It uses atomic_inc() instead of this_cpu_inc() on strong load-store architectures. - Introduce new console_list_lock to synchronize a manipulation of the list of registered consoles and their flags. This is a first step in removing the big-kernel-lock-like behavior of console_lock(). This semaphore still serializes console->write() calbacks against: - each other. It primary prevents potential races between early and proper console drivers using the same device. - suspend()/resume() callbacks and init() operations in some drivers. - various other operations in the tty/vt and framebufer susbsystems. It is likely that console_lock() serializes even operations that are not directly conflicting with the console->write() callbacks here. This is the most complicated big-kernel-lock aspect of the console_lock() that will be hard to untangle. - Introduce new console_srcu lock that is used to safely iterate and access the registered console drivers under SRCU read lock. This is a prerequisite for introducing atomic console drivers and console kthreads. It will reduce the complexity of serialization against normal consoles and console_lock(). Also it should remove the risk of deadlock during critical situations, like Oops or panic, when only atomic consoles are registered. - Check whether the console is registered instead of enabled on many locations. It was a historical leftover. - Cleanly force a preferred console in xenfb code instead of a dirty hack. - A lot of code and comment clean ups and improvements. * tag 'printk-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (47 commits) printk: htmldocs: add missing description tty: serial: sh-sci: use setup() callback for early console printk: relieve console_lock of list synchronization duties tty: serial: kgdboc: use console_list_lock to trap exit tty: serial: kgdboc: synchronize tty_find_polling_driver() and register_console() tty: serial: kgdboc: use console_list_lock for list traversal tty: serial: kgdboc: use srcu console list iterator proc: consoles: use console_list_lock for list iteration tty: tty_io: use console_list_lock for list synchronization printk, xen: fbfront: create/use safe function for forcing preferred netconsole: avoid CON_ENABLED misuse to track registration usb: early: xhci-dbc: use console_is_registered() tty: serial: xilinx_uartps: use console_is_registered() tty: serial: samsung_tty: use console_is_registered() tty: serial: pic32_uart: use console_is_registered() tty: serial: earlycon: use console_is_registered() tty: hvc: use console_is_registered() efi: earlycon: use console_is_registered() tty: nfcon: use console_is_registered() serial_core: replace uart_console_enabled() with uart_console_registered() ...
2022-12-12Merge tag 'locks-v6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "The main change here is to add the new locks_inode_context helper, and convert all of the places that dereference inode->i_flctx directly to use that instead. There is a new helper to indicate whether any locks are held on an inode. This is mostly for Ceph but may be usable elsewhere too. Andi Kleen requested that we print the PID when the LOCK_MAND warning fires, to help track down applications trying to use it. Finally, we added some new warnings to some of the file locking functions that fire when the ->fl_file and filp arguments differ. This helped us find some long-standing bugs in lockd. Patches for those are in Chuck Lever's tree and should be in his v6.2 PR. After that patch, people using NFSv2/v3 locking may see some warnings fire until those go in. Happy Holidays!" * tag 'locks-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: Add process name and pid to locks warning nfsd: use locks_inode_context helper nfs: use locks_inode_context helper lockd: use locks_inode_context helper ksmbd: use locks_inode_context helper cifs: use locks_inode_context helper ceph: use locks_inode_context helper filelock: add a new locks_inode_context accessor function filelock: new helper: vfs_inode_has_locks filelock: WARN_ON_ONCE when ->fl_file and filp don't match
2022-12-12Merge tag 'execve-v6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: "Most are small refactorings and bug fixes, but three things stand out: switching timens (which got reverted before) looks solid now, FOLL_FORCE has been removed (no failures seen yet across several weeks in -next), and some whitespace cleanups (which are long overdue). - Add timens support (when switching mm). This version has survived in -next for the entire cycle (Andrei Vagin) - Various small bug fixes, refactoring, and readability improvements (Bernd Edlinger, Rolf Eike Beer, Bo Liu, Li Zetao Liu Shixin) - Remove FOLL_FORCE for stack setup (Kees Cook) - Whitespace cleanups (Rolf Eike Beer, Kees Cook)" * tag 'execve-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_misc: fix shift-out-of-bounds in check_special_flags binfmt: Fix error return code in load_elf_fdpic_binary() exec: Remove FOLL_FORCE for stack setup binfmt_elf: replace IS_ERR() with IS_ERR_VALUE() binfmt_elf: simplify error handling in load_elf_phdrs() binfmt_elf: fix documented return value for load_elf_phdrs() exec: simplify initial stack size expansion binfmt: Fix whitespace issues exec: Add comments on check_unsafe_exec() fs counting ELF uapi: add spaces before '{' selftests/timens: add a test for vfork+exit fs/exec: switch timens when a task gets a new mm
2022-12-12Merge tag 'seccomp-v6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: - Add missing kerndoc parameter (Randy Dunlap) - Improve seccomp selftest to check CAP_SYS_ADMIN (Gautam Menghani) - Fix allocation leak when cloned thread immediately dies (Kuniyuki Iwashima) * tag 'seccomp-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: seccomp: document the "filter_count" field seccomp: Move copy_seccomp() to no failure path. selftests/seccomp: Check CAP_SYS_ADMIN capability in the test mode_filter_without_nnp
2022-12-12Merge tag 'pstore-v6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: "A small collection of bug fixes, refactorings, and general improvements: - Reporting improvements and return path fixes (Guilherme G. Piccoli, Wang Yufen, Kees Cook) - Clean up kmsg_bytes module parameter usage (Guilherme G. Piccoli) - Add Guilherme to pstore MAINTAINERS entry - Choose friendlier allocation flags (Qiujun Huang, Stephen Boyd)" * tag 'pstore-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore: Avoid kcore oops by vmap()ing with VM_IOREMAP pstore/ram: Fix error return code in ramoops_probe() pstore: Alert on backend write error MAINTAINERS: Update pstore maintainers pstore/ram: Set freed addresses to NULL pstore/ram: Move internal definitions out of kernel-wide include pstore/ram: Move pmsg init earlier pstore/ram: Consolidate kfree() paths efi: pstore: Follow convention for the efi-pstore backend name pstore: Inform unregistered backend names as well pstore: Expose kmsg_bytes as a module parameter pstore: Improve error reporting in case of backend overlap pstore/zone: Use GFP_ATOMIC to allocate zone buffer
2022-12-12Merge tag 'rcu.2022.12.02a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Documentation updates. This is the second in a series from an ongoing review of the RCU documentation. - Miscellaneous fixes. - Introduce a default-off Kconfig option that depends on RCU_NOCB_CPU that, on CPUs mentioned in the nohz_full or rcu_nocbs boot-argument CPU lists, causes call_rcu() to introduce delays. These delays result in significant power savings on nearly idle Android and ChromeOS systems. These savings range from a few percent to more than ten percent. This series also includes several commits that change call_rcu() to a new call_rcu_hurry() function that avoids these delays in a few cases, for example, where timely wakeups are required. Several of these are outside of RCU and thus have acks and reviews from the relevant maintainers. - Create an srcu_read_lock_nmisafe() and an srcu_read_unlock_nmisafe() for architectures that support NMIs, but which do not provide NMI-safe this_cpu_inc(). These NMI-safe SRCU functions are required by the upcoming lockless printk() work by John Ogness et al. - Changes providing minor but important increases in torture test coverage for the new RCU polled-grace-period APIs. - Changes to torturescript that avoid redundant kernel builds, thus providing about a 30% speedup for the torture.sh acceptance test. * tag 'rcu.2022.12.02a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits) net: devinet: Reduce refcount before grace period net: Use call_rcu_hurry() for dst_release() workqueue: Make queue_rcu_work() use call_rcu_hurry() percpu-refcount: Use call_rcu_hurry() for atomic switch scsi/scsi_error: Use call_rcu_hurry() instead of call_rcu() rcu/rcutorture: Use call_rcu_hurry() where needed rcu/rcuscale: Use call_rcu_hurry() for async reader test rcu/sync: Use call_rcu_hurry() instead of call_rcu rcuscale: Add laziness and kfree tests rcu: Shrinker for lazy rcu rcu: Refactor code a bit in rcu_nocb_do_flush_bypass() rcu: Make call_rcu() lazy to save power rcu: Implement lockdep_rcu_enabled for !CONFIG_DEBUG_LOCK_ALLOC srcu: Debug NMI safety even on archs that don't require it srcu: Explain the reason behind the read side critical section on GP start srcu: Warn when NMI-unsafe API is used in NMI arch/s390: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option arch/loongarch: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state() rcu-tasks: Make grace-period-age message human-readable ...
2022-12-12Merge branches 'pm-devfreq' and 'pm-tools'Rafael J. Wysocki
Merge devfreq updates and cpupower utility updates for 6.2-rc1: - Add a private governor_data for devfreq governors (Kant Fan). - Reorganize devfreq code to use device_match_of_node() and devm_platform_get_and_ioremap_resource() instead of open coding them (ye xingchen, Minghao Chi). - Make cpupower choose base_cpu to display default cpupower details instead of picking CPU 0 (Saket Kumar Bhaskar). - Add Georgian translation to cpupower documentation (Zurab Kargareteli). - Introduce powercap intel-rapl library, powercap-info command, and RAPL monitor into cpupower (Thomas Renninger). * pm-devfreq: PM / devfreq: event: use devm_platform_get_and_ioremap_resource() PM / devfreq: event: Use device_match_of_node() PM / devfreq: Use device_match_of_node() PM/devfreq: governor: Add a private governor_data for governor * pm-tools: cpupower: rapl monitor - shows the used power consumption in uj for each rapl domain cpupower: Introduce powercap intel-rapl library and powercap-info command cpupower: Add Georgian translation tools/cpupower: Choose base_cpu to display default cpupower details
2022-12-12Merge branch 'pm-cpufreq'Rafael J. Wysocki
Merge cpufreq changes for 6.2-rc1: - Generalize of_perf_domain_get_sharing_cpumask phandle format (Hector Martin). - Add new cpufreq driver for Apple SoC CPU P-states (Hector Martin). - Update Qualcomm cpufreq driver, including: * CPU clock provider support, * Generic cleanups or reorganization. * Potential memleak fix. * Fix of the return value of cpufreq_driver->get(). (Manivannan Sadhasivam, Chen Hui). - Update Qualcomm cpufreq driver's DT bindings, including: * Support for CPU clock provider. * Missing cache-related properties fixes. * Support for QDU1000/QRU1000. (Manivannan Sadhasivam, Rob Herring, Melody Olvera). - Add support for ti,am625 SoC and enable build of ti-cpufreq for ARCH_K3 (Dave Gerlach, and Vibhore Vardhan). - Use flexible array to simplify memory allocation in the tegra186 cpufreq driver (Christophe JAILLET). - Convert cpufreq statistics code to use sysfs_emit_at() (ye xingchen). - Allow intel_pstate to use no-HWP mode on Sapphire Rapids (Giovanni Gherdovich). - Add missing pci_dev_put() to the amd_freq_sensitivity cpufreq driver (Xiongfeng Wang). - Initialize the kobj_unregister completion before calling kobject_init_and_add() in the cpufreq core code (Yongqiang Liu). - Defer setting boost MSRs in the ACPI cpufreq driver (Stuart Hayes, Nathan Chancellor). - Make intel_pstate accept initial EPP value of 0x80 (Srinivas Pandruvada). - Make read-only array sys_clk_src in the SPEAr cpufreq driver static (Colin Ian King). - Make array speeds in the longhaul cpufreq driver static (Colin Ian King). - Use str_enabled_disabled() helper in the ACPI cpufreq driver (Andy Shevchenko). - Drop a reference to CVS from cpufreq documentation (Conghui Wang). * pm-cpufreq: (30 commits) cpufreq: Remove CVS version control contents from documentation cpufreq: stats: Convert to use sysfs_emit_at() API cpufreq: ACPI: Only set boost MSRs on supported CPUs dt-bindings: cpufreq: cpufreq-qcom-hw: Add QDU1000/QRU1000 cpufreq cpufreq: tegra186: Use flexible array to simplify memory allocation cpufreq: intel_pstate: Add Sapphire Rapids support in no-HWP mode cpufreq: amd_freq_sensitivity: Add missing pci_dev_put() cpufreq: Init completion before kobject_init_and_add() cpufreq: apple-soc: Add new driver to control Apple SoC CPU P-states cpufreq: qcom-hw: Add CPU clock provider support dt-bindings: cpufreq: cpufreq-qcom-hw: Add cpufreq clock provider cpufreq: qcom-hw: Fix the frequency returned by cpufreq_driver->get() cpufreq: ACPI: Remove unused variables 'acpi_cpufreq_online' and 'ret' cpufreq: qcom-hw: Fix memory leak in qcom_cpufreq_hw_read_lut() arm64: dts: ti: k3-am625-sk: Add 1.4GHz OPP cpufreq: ti: Enable ti-cpufreq for ARCH_K3 arm64: dts: ti: k3-am625: Introduce operating-points table cpufreq: dt-platdev: Blacklist ti,am625 SoC cpufreq: ti-cpufreq: Add support for AM625 dt-bindings: cpufreq: qcom: Add missing cache related properties ...
2022-12-12Merge branches 'acpi-pm', 'acpi-processor', 'acpi-ec' and 'acpi-video'Rafael J. Wysocki
Make ACPI power management changes, ACPI processor driver updates, ACPI EC driver quirk and ACPI backlight driver updates for 6.2-rc1: - Print full name paths of ACPI power resources objects during enumeration (Kane Chen). - Eliminate a compiler warning regarding a missing function prototype in the ACPI power management code (Sudeep Holla). - Fix and clean up the ACPI processor driver (Rafael Wysocki, Li Zhong, Colin Ian King, Sudeep Holla). - Add quirk for the HP Pavilion Gaming 15-cx0041ur to the ACPI EC driver (Mia Kanashi). - Add some mew ACPI backlight handling quirks and update some existing ones (Hans de Goede). - Make the ACPI backlight driver prefer the native backlight control over vendor backlight control when possible (Hans de Goede). * acpi-pm: ACPI: PM: Silence missing prototype warning ACPI: PM: Print full name path while adding power resource * acpi-processor: ACPI: processor: perflib: Adjust acpi_processor_notify_smm() return value ACPI: processor: perflib: Rearrange acpi_processor_notify_smm() ACPI: processor: perflib: Rearrange unregistration routine ACPI: processor: perflib: Drop redundant parentheses ACPI: processor: perflib: Adjust white space ACPI: processor: idle: Drop unnecessary statements and parens ACPI: processor: Silence missing prototype warnings ACPI: processor_idle: Silence missing prototype warnings ACPI: processor: throttling: remove variable count ACPI: processor: idle: Check acpi_fetch_acpi_dev() return value * acpi-ec: ACPI: EC: Add quirk for the HP Pavilion Gaming 15-cx0041ur * acpi-video: ACPI: video: Prefer native over vendor ACPI: video: Simplify __acpi_video_get_backlight_type() ACPI: video: Add force_native quirk for Sony Vaio VPCY11S1E ACPI: video: Add force_vendor quirk for Sony Vaio PCG-FRV35 ACPI: video: Change Sony Vaio VPCEH3U1E quirk to force_native ACPI: video: Change GIGABYTE GB-BXBT-2807 quirk to force_none ACPI: video: Add a few bugtracker links to DMI quirks
2022-12-12Merge branches 'acpi-scan', 'acpi-bus', 'acpi-tables' and 'acpi-sysfs'Rafael J. Wysocki
Merge ACPI changes related to device enumeration, device object managenet, operation region handling, table parsing and sysfs interface: - Use ZERO_PAGE(0) instead of empty_zero_page in the ACPI device enumeration code (Giulio Benetti). - Change the return type of the ACPI driver remove callback to void and update its users accordingly (Dawei Li). - Add general support for FFH address space type and implement the low- level part of it for ARM64 (Sudeep Holla). - Fix stale comments in the ACPI tables parsing code and make it print more messages related to MADT (Hanjun Guo, Huacai Chen). - Replace invocations of generic library functions with more kernel- specific counterparts in the ACPI sysfs interface (Christophe JAILLET, Xu Panda). * acpi-scan: ACPI: scan: substitute empty_zero_page with helper ZERO_PAGE(0) * acpi-bus: ACPI: FFH: Silence missing prototype warnings ACPI: make remove callback of ACPI driver void ACPI: bus: Fix the _OSC capability check for FFH OpRegion arm64: Add architecture specific ACPI FFH Opregion callbacks ACPI: Implement a generic FFH Opregion handler * acpi-tables: ACPI: tables: Fix the stale comments for acpi_locate_initial_tables() ACPI: tables: Print CORE_PIC information when MADT is parsed * acpi-sysfs: ACPI: sysfs: use sysfs_emit() to instead of scnprintf() ACPI: sysfs: Use kstrtobool() instead of strtobool()
2022-12-12Merge tag 'linux-can-next-for-6.2-20221212' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== linux-can-next-for-6.2-20221212 this is a pull request of 39 patches for net-next/master. The first 2 patches are by me fix a warning and coding style in the kvaser_usb driver. Vivek Yadav's patch sorts the includes of the m_can driver. Biju Das contributes 5 patches for the rcar_canfd driver improve the support for different IP core variants. Jean Delvare's patch for the ctucanfd drops the dependency on COMPILE_TEST. Vincent Mailhol's patch sorts the includes of the etas_es58x driver. Haibo Chen's contributes 2 patches that add i.MX93 support to the flexcan driver. Lad Prabhakar's patch updates the dt-bindings documentation of the rcar_canfd driver. Minghao Chi's patch converts the c_can platform driver to devm_platform_get_and_ioremap_resource(). In the next 7 patches Vincent Mailhol adds devlink support to the etas_es58x driver to report firmware, bootloader and hardware version. Xu Panda's patch converts a strncpy() -> strscpy() in the ucan driver. Ye Bin's patch removes a useless parameter from the AF_CAN protocol. The next 2 patches by Vincent Mailhol and remove unneeded or unused pointers to struct usb_interface in device's priv struct in the ucan and gs_usb driver. Vivek Yadav's patch cleans up the usage of the RAM initialization in the m_can driver. A patch by me add support for SO_MARK to the AF_CAN protocol. Geert Uytterhoeven's patch fixes the number of CAN channels in the rcan_canfd bindings documentation. In the last 11 patches Markus Schneider-Pargmann optimizes the register access in the t_can driver and cleans up the tcan glue driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-12USB: core: export usb_cache_string()Vincent Mailhol
usb_cache_string() can also be useful for the drivers so export it. Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/all/20221130174658.29282-4-mailhol.vincent@wanadoo.fr Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-12linux/virtio_net.h: Support USO offload in vnet header.Andrew Melnychenko
Now, it's possible to convert USO vnet packets from/to skb. Added support for GSO_UDP_L4 offload. Signed-off-by: Andrew Melnychenko <andrew@daynix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-11kcov: fix spelling typos in commentsRong Tao
Fix the typo of 'suport' in kcov.h Link: https://lkml.kernel.org/r/tencent_922CA94B789587D79FD154445D035AA19E07@qq.com Signed-off-by: Rong Tao <rongtao@cestc.cn> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11io-mapping: move some code within the include guarded sectionChristophe JAILLET
It is spurious to have some code out-side the include guard in a .h file. Fix it. Link: https://lkml.kernel.org/r/4dbaf427d4300edba6c6bbfaf4d57493b9bec6ee.1669565241.git.christophe.jaillet@wanadoo.fr Fixes: 1fbaf8fc12a0 ("mm: add a io_mapping_map_user helper") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11eventfd: change int to __u64 in eventfd_signal() ifndef CONFIG_EVENTFDZhang Qilong
Commit ee62c6b2dc93 ("eventfd: change int to __u64 in eventfd_signal()") forgot to change int to __u64 in the CONFIG_EVENTFD=n stub function. Link: https://lkml.kernel.org/r/20221124140154.104680-1-zhangqilong3@huawei.com Fixes: ee62c6b2dc93 ("eventfd: change int to __u64 in eventfd_signal()") Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com> Cc: Dylan Yudaken <dylany@fb.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm: fix typo in struct pglist_data code commentWang Yong
change "stat" to "start". Link: https://lkml.kernel.org/r/20221207074011.GA151242@cloud Fixes: c959924b0dc5 ("memory tiering: adjust hot threshold automatically") Signed-off-by: Wang Yong <yongw.kernel@gmail.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm: add nodes= arg to memory.reclaimMina Almasry
The nodes= arg instructs the kernel to only scan the given nodes for proactive reclaim. For example use cases, consider a 2 tier memory system: nodes 0,1 -> top tier nodes 2,3 -> second tier $ echo "1m nodes=0" > memory.reclaim This instructs the kernel to attempt to reclaim 1m memory from node 0. Since node 0 is a top tier node, demotion will be attempted first. This is useful to direct proactive reclaim to specific nodes that are under pressure. $ echo "1m nodes=2,3" > memory.reclaim This instructs the kernel to attempt to reclaim 1m memory in the second tier, since this tier of memory has no demotion targets the memory will be reclaimed. $ echo "1m nodes=0,1" > memory.reclaim Instructs the kernel to reclaim memory from the top tier nodes, which can be desirable according to the userspace policy if there is pressure on the top tiers. Since these nodes have demotion targets, the kernel will attempt demotion first. Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), the proactive reclaim interface memory.reclaim does both reclaim and demotion. Reclaim and demotion incur different latency costs to the jobs in the cgroup. Demoted memory would still be addressable by the userspace at a higher latency, but reclaimed memory would need to incur a pagefault. The 'nodes' arg is useful to allow the userspace to control demotion and reclaim independently according to its policy: if the memory.reclaim is called on a node with demotion targets, it will attempt demotion first; if it is called on a node without demotion targets, it will only attempt reclaim. Link: https://lkml.kernel.org/r/20221202223533.1785418-1-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: zefan li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm: memcg: fix stale protection of reclaim target memcgYosry Ahmed
Patch series "mm: memcg: fix protection of reclaim target memcg", v3. This series fixes a bug in calculating the protection of the reclaim target memcg where we end up using stale effective protection values from the last reclaim operation, instead of completely ignoring the protection of the reclaim target as intended. More detailed explanation and examples in patch 1, which includes the fix. Patches 2 & 3 introduce a selftest case that catches the bug. This patch (of 3): When we are doing memcg reclaim, the intended behavior is that we ignore any protection (memory.min, memory.low) of the target memcg (but not its children). Ever since the patch pointed to by the "Fixes" tag, we actually read a stale value for the target memcg protection when deciding whether to skip the memcg or not because it is protected. If the stale value happens to be high enough, we don't reclaim from the target memcg. Essentially, in some cases we may falsely skip reclaiming from the target memcg of reclaim because we read a stale protection value from last time we reclaimed from it. During reclaim, mem_cgroup_calculate_protection() is used to determine the effective protection (emin and elow) values of a memcg. The protection of the reclaim target is ignored, but we cannot set their effective protection to 0 due to a limitation of the current implementation (see comment in mem_cgroup_protection()). Instead, we leave their effective protection values unchaged, and later ignore it in mem_cgroup_protection(). However, mem_cgroup_protection() is called later in shrink_lruvec()->get_scan_count(), which is after the mem_cgroup_below_{min/low}() checks in shrink_node_memcgs(). As a result, the stale effective protection values of the target memcg may lead us to skip reclaiming from the target memcg entirely, before calling shrink_lruvec(). This can be even worse with recursive protection, where the stale target memcg protection can be higher than its standalone protection. See two examples below (a similar version of example (a) is added to test_memcontrol in a later patch). (a) A simple example with proactive reclaim is as follows. Consider the following hierarchy: ROOT | A | B (memory.min = 10M) Consider the following scenario: - B has memory.current = 10M. - The system undergoes global reclaim (or memcg reclaim in A). - In shrink_node_memcgs(): - mem_cgroup_calculate_protection() calculates the effective min (emin) of B as 10M. - mem_cgroup_below_min() returns true for B, we do not reclaim from B. - Now if we want to reclaim 5M from B using proactive reclaim (memory.reclaim), we should be able to, as the protection of the target memcg should be ignored. - In shrink_node_memcgs(): - mem_cgroup_calculate_protection() immediately returns for B without doing anything, as B is the target memcg, relying on mem_cgroup_protection() to ignore B's stale effective min (still 10M). - mem_cgroup_below_min() reads the stale effective min for B and we skip it instead of ignoring its protection as intended, as we never reach mem_cgroup_protection(). (b) An more complex example with recursive protection is as follows. Consider the following hierarchy with memory_recursiveprot: ROOT | A (memory.min = 50M) | B (memory.min = 10M, memory.high = 40M) Consider the following scenario: - B has memory.current = 35M. - The system undergoes global reclaim (target memcg is NULL). - B will have an effective min of 50M (all of A's unclaimed protection). - B will not be reclaimed from. - Now allocate 10M more memory in B, pushing it above it's high limit. - The system undergoes memcg reclaim from B (target memcg is B). - Like example (a), we do nothing in mem_cgroup_calculate_protection(), then call mem_cgroup_below_min(), which will read the stale effective min for B (50M) and skip it. In this case, it's even worse because we are not just considering B's standalone protection (10M), but we are reading a much higher stale protection (50M) which will cause us to not reclaim from B at all. This is an artifact of commit 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks") which made mem_cgroup_calculate_protection() only change the state without returning any value. Before that commit, we used to return MEMCG_PROT_NONE for the target memcg, which would cause us to skip the mem_cgroup_below_{min/low}() checks. After that commit we do not return anything and we end up checking the min & low effective protections for the target memcg, which are stale. Update mem_cgroup_supports_protection() to also check if we are reclaiming from the target, and rename it to mem_cgroup_unprotected() (now returns true if we should not protect the memcg, much simpler logic). Link: https://lkml.kernel.org/r/20221202031512.1365483-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20221202031512.1365483-2-yosryahmed@google.com Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Chris Down <chris@chrisdown.name> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11fsdax,xfs: port unshare to fsdaxShiyang Ruan
Implement unshare in fsdax mode: copy data from srcmap to iomap. Link: https://lkml.kernel.org/r/1669908753-169-1-git-send-email-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11fsdax: introduce page->share for fsdax in reflink modeShiyang Ruan
Patch series "fsdax,xfs: fix warning messages", v2. Many testcases failed in dax+reflink mode with warning message in dmesg. Such as generic/051,075,127. The warning message is like this: [ 775.509337] ------------[ cut here ]------------ [ 775.509636] WARNING: CPU: 1 PID: 16815 at fs/dax.c:386 dax_insert_entry.cold+0x2e/0x69 [ 775.510151] Modules linked in: auth_rpcgss oid_registry nfsv4 algif_hash af_alg af_packet nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter ip_tables x_tables dax_pmem nd_pmem nd_btt sch_fq_codel configfs xfs libcrc32c fuse [ 775.524288] CPU: 1 PID: 16815 Comm: fsx Kdump: loaded Tainted: G W 6.1.0-rc4+ #164 eb34e4ee4200c7cbbb47de2b1892c5a3e027fd6d [ 775.524904] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.0-3-3 04/01/2014 [ 775.525460] RIP: 0010:dax_insert_entry.cold+0x2e/0x69 [ 775.525797] Code: c7 c7 18 eb e0 81 48 89 4c 24 20 48 89 54 24 10 e8 73 6d ff ff 48 83 7d 18 00 48 8b 54 24 10 48 8b 4c 24 20 0f 84 e3 e9 b9 ff <0f> 0b e9 dc e9 b9 ff 48 c7 c6 a0 20 c3 81 48 c7 c7 f0 ea e0 81 48 [ 775.526708] RSP: 0000:ffffc90001d57b30 EFLAGS: 00010082 [ 775.527042] RAX: 000000000000002a RBX: 0000000000000000 RCX: 0000000000000042 [ 775.527396] RDX: ffffea000a0f6c80 RSI: ffffffff81dfab1b RDI: 00000000ffffffff [ 775.527819] RBP: ffffea000a0f6c40 R08: 0000000000000000 R09: ffffffff820625e0 [ 775.528241] R10: ffffc90001d579d8 R11: ffffffff820d2628 R12: ffff88815fc98320 [ 775.528598] R13: ffffc90001d57c18 R14: 0000000000000000 R15: 0000000000000001 [ 775.528997] FS: 00007f39fc75d740(0000) GS:ffff88817bc80000(0000) knlGS:0000000000000000 [ 775.529474] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 775.529800] CR2: 00007f39fc772040 CR3: 0000000107eb6001 CR4: 00000000003706e0 [ 775.530214] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 775.530592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 775.531002] Call Trace: [ 775.531230] <TASK> [ 775.531444] dax_fault_iter+0x267/0x6c0 [ 775.531719] dax_iomap_pte_fault+0x198/0x3d0 [ 775.532002] __xfs_filemap_fault+0x24a/0x2d0 [xfs aa8d25411432b306d9554da38096f4ebb86bdfe7] [ 775.532603] __do_fault+0x30/0x1e0 [ 775.532903] do_fault+0x314/0x6c0 [ 775.533166] __handle_mm_fault+0x646/0x1250 [ 775.533480] handle_mm_fault+0xc1/0x230 [ 775.533810] do_user_addr_fault+0x1ac/0x610 [ 775.534110] exc_page_fault+0x63/0x140 [ 775.534389] asm_exc_page_fault+0x22/0x30 [ 775.534678] RIP: 0033:0x7f39fc55820a [ 775.534950] Code: 00 01 00 00 00 74 99 83 f9 c0 0f 87 7b fe ff ff c5 fe 6f 4e 20 48 29 fe 48 83 c7 3f 49 8d 0c 10 48 83 e7 c0 48 01 fe 48 29 f9 <f3> a4 c4 c1 7e 7f 00 c4 c1 7e 7f 48 20 c5 f8 77 c3 0f 1f 44 00 00 [ 775.535839] RSP: 002b:00007ffc66a08118 EFLAGS: 00010202 [ 775.536157] RAX: 00007f39fc772001 RBX: 0000000000042001 RCX: 00000000000063c1 [ 775.536537] RDX: 0000000000006400 RSI: 00007f39fac42050 RDI: 00007f39fc772040 [ 775.536919] RBP: 0000000000006400 R08: 00007f39fc772001 R09: 0000000000042000 [ 775.537304] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001 [ 775.537694] R13: 00007f39fc772000 R14: 0000000000006401 R15: 0000000000000003 [ 775.538086] </TASK> [ 775.538333] ---[ end trace 0000000000000000 ]--- This also affects dax+noreflink mode if we run the test after a dax+reflink test. So, the most urgent thing is solving the warning messages. With these fixes, most warning messages in dax_associate_entry() are gone. But honestly, generic/388 will randomly failed with the warning. The case shutdown the xfs when fsstress is running, and do it for many times. I think the reason is that dax pages in use are not able to be invalidated in time when fs is shutdown. The next time dax page to be associated, it still remains the mapping value set last time. I'll keep on solving it. The warning message in dax_writeback_one() can also be fixed because of the dax unshare. This patch (of 8): fsdax page is used not only when CoW, but also mapread. To make the it easily understood, use 'share' to indicate that the dax page is shared by more than one extent. And add helper functions to use it. Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED. [ruansy.fnst@fujitsu.com: rename several functions] Link: https://lkml.kernel.org/r/1669972991-246-1-git-send-email-ruansy.fnst@fujitsu.com [ruansy.fnst@fujitsu.com: v2.2] Link: https://lkml.kernel.org/r/1670381359-53-1-git-send-email-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/1669908538-55-1-git-send-email-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/1669908538-55-2-git-send-email-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm: add folio dtor and order setter functionsSidhartha Kumar
Patch series "convert core hugetlb functions to folios", v5. ============== OVERVIEW =========================== Now that many hugetlb helper functions that deal with hugetlb specific flags[1] and hugetlb cgroups[2] are converted to folios, higher level allocation, prep, and freeing functions within hugetlb can also be converted to operate in folios. Patch 1 of this series implements the wrapper functions around setting the compound destructor and compound order for a folio. Besides the user added in patch 1, patch 2 and patch 9 also use these helper functions. Patches 2-10 convert the higher level hugetlb functions to folios. ============== TESTING =========================== LTP: Ran 10 back to back rounds of the LTP hugetlb test suite. Gigantic Huge Pages: Test allocation and freeing via hugeadm commands: hugeadm --pool-pages-min 1GB:10 hugeadm --pool-pages-min 1GB:0 Demote: Demote 1 1GB hugepages to 512 2MB hugepages echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # 512 cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages # 0 [1] https://lore.kernel.org/lkml/20220922154207.1575343-1-sidhartha.kumar@oracle.com/ [2] https://lore.kernel.org/linux-mm/20221101223059.460937-1-sidhartha.kumar@oracle.com/ This patch (of 10): Add folio equivalents for set_compound_order() and set_compound_page_dtor(). Also remove extra new-lines introduced by mm/hugetlb: convert move_hugetlb_state() to folios and mm/hugetlb_cgroup: convert hugetlb_cgroup_uncharge_page() to folios. [sidhartha.kumar@oracle.com: clarify folio_set_compound_order() zero support] Link: https://lkml.kernel.org/r/20221207223731.32784-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20221129225039.82257-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20221129225039.82257-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Tarun Sahu <tsahu@linux.ibm.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11folio-compat: remove lru_cache_add()Vishal Moola (Oracle)
There are no longer any callers of lru_cache_add(), so remove it. This saves 79 bytes of kernel text. Also cleanup some comments such that they reference the new folio_add_lru() instead. Link: https://lkml.kernel.org/r/20221101175326.13265-6-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11filemap: convert replace_page_cache_page() to replace_page_cache_folio()Vishal Moola (Oracle)
Patch series "Removing the lru_cache_add() wrapper". This patchset replaces all calls of lru_cache_add() with the folio equivalent: folio_add_lru(). This is allows us to get rid of the wrapper The series passes xfstests and the userfaultfd selftests. This patch (of 5): Eliminates 7 calls to compound_head(). Link: https://lkml.kernel.org/r/20221101175326.13265-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20221101175326.13265-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/sparse-vmemmap: generalise vmemmap_populate_hugepages()Feiyang Chen
Generalise vmemmap_populate_hugepages() so ARM64 & X86 & LoongArch can share its implementation. Link: https://lkml.kernel.org/r/20221027125253.3458989-4-chenhuacai@loongson.cn Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Acked-by: Will Deacon <will@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Min Zhou <zhoumin@loongson.cn> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Mathieu-Daudé <philmd@linaro.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Xuefeng Li <lixuefeng@loongson.cn> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11LoongArch: add sparse memory vmemmap supportFeiyang Chen
Add sparse memory vmemmap support for LoongArch. SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise pfn_to_page and page_to_pfn operations. This is the most efficient option when sufficient kernel resources are available. Link: https://lkml.kernel.org/r/20221027125253.3458989-3-chenhuacai@loongson.cn Signed-off-by: Min Zhou <zhoumin@loongson.cn> Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Mathieu-Daudé <philmd@linaro.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Will Deacon <will@kernel.org> Cc: Xuefeng Li <lixuefeng@loongson.cn> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11include/linux/pgtable.h: : remove redundant pte variablezhang songyi
Return value from ptep_get_and_clear_full() directly instead of taking this in another redundant variable. Link: https://lkml.kernel.org/r/202211282107437343474@zte.com.cn Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/gup: remove FOLL_MIGRATIONDavid Hildenbrand
Fortunately, the last user (KSM) is gone, so let's just remove this rather special code from generic GUP handling -- especially because KSM never required the PMD handling as KSM only deals with individual base pages. [akpm@linux-foundation.org: fix merge snafu]Link: https://lkml.kernel.org/r/20221021101141.84170-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/pagewalk: add walk_page_range_vma()David Hildenbrand
Let's add walk_page_range_vma(), which is similar to walk_page_vma(), however, is only interested in a subset of the VMA range. To be used in KSM code to stop using follow_page() next. Link: https://lkml.kernel.org/r/20221021101141.84170-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm: remove VM_FAULT_WRITEDavid Hildenbrand
All users -- GUP and KSM -- are gone, let's just remove it. Link: https://lkml.kernel.org/r/20221021101141.84170-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/pagewalk: don't trigger test_walk() in walk_page_vma()David Hildenbrand
As Peter points out, the caller passes a single VMA and can just do that check itself. And in fact, no existing users rely on test_walk() getting called. So let's just remove it and make the implementation slightly more efficient. Link: https://lkml.kernel.org/r/20221021101141.84170-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-10Merge tag 'mm-hotfixes-stable-2022-12-10-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Nine hotfixes. Six for MM, three for other areas. Four of these patches address post-6.0 issues" * tag 'mm-hotfixes-stable-2022-12-10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: memcg: fix possible use-after-free in memcg_write_event_control() MAINTAINERS: update Muchun Song's email mm/gup: fix gup_pud_range() for dax mmap: fix do_brk_flags() modifying obviously incorrect VMAs mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit tmpfs: fix data loss from failed fallocate kselftests: cgroup: update kmem test precision tolerance mm: do not BUG_ON missing brk mapping, because userspace can unmap it mailmap: update Matti Vaittinen's email address
2022-12-10bpf: states_equal() must build idmap for all function framesEduard Zingerman
verifier.c:states_equal() must maintain register ID mapping across all function frames. Otherwise the following example might be erroneously marked as safe: main: fp[-24] = map_lookup_elem(...) ; frame[0].fp[-24].id == 1 fp[-32] = map_lookup_elem(...) ; frame[0].fp[-32].id == 2 r1 = &fp[-24] r2 = &fp[-32] call foo() r0 = 0 exit foo: 0: r9 = r1 1: r8 = r2 2: r7 = ktime_get_ns() 3: r6 = ktime_get_ns() 4: if (r6 > r7) goto skip_assign 5: r9 = r8 skip_assign: ; <--- checkpoint 6: r9 = *r9 ; (a) frame[1].r9.id == 2 ; (b) frame[1].r9.id == 1 7: if r9 == 0 goto exit: ; mark_ptr_or_null_regs() transfers != 0 info ; for all regs sharing ID: ; (a) r9 != 0 => &frame[0].fp[-32] != 0 ; (b) r9 != 0 => &frame[0].fp[-24] != 0 8: r8 = *r8 ; (a) r8 == &frame[0].fp[-32] ; (b) r8 == &frame[0].fp[-32] 9: r0 = *r8 ; (a) safe ; (b) unsafe exit: 10: exit While processing call to foo() verifier considers the following execution paths: (a) 0-10 (b) 0-4,6-10 (There is also path 0-7,10 but it is not interesting for the issue at hand. (a) is verified first.) Suppose that checkpoint is created at (6) when path (a) is verified, next path (b) is verified and (6) is reached. If states_equal() maintains separate 'idmap' for each frame the mapping at (6) for frame[1] would be empty and regsafe(r9)::check_ids() would add a pair 2->1 and return true, which is an error. If states_equal() maintains single 'idmap' for all frames the mapping at (6) would be { 1->1, 2->2 } and regsafe(r9)::check_ids() would return false when trying to add a pair 2->1. This issue was suggested in the following discussion: https://lore.kernel.org/bpf/CAEf4BzbFB5g4oUfyxk9rHy-PJSLQ3h8q9mV=rVoXfr_JVm8+1Q@mail.gmail.com/ Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20221209135733.28851-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-10NFSD: add delegation reaper to react to low memory conditionDai Ngo
The delegation reaper is called by nfsd memory shrinker's on the 'count' callback. It scans the client list and sends the courtesy CB_RECALL_ANY to the clients that hold delegations. To avoid flooding the clients with CB_RECALL_ANY requests, the delegation reaper sends only one CB_RECALL_ANY request to each client per 5 seconds. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> [ cel: moved definition of RCA4_TYPE_MASK_RDATA_DLG ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-12-10sunrpc: svc: Remove an unused static function svc_ungetu32()Li zeming
The svc_ungetu32 function is not used, you could remove it. Signed-off-by: Li zeming <zeming@nfschina.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-12-09Merge tag 'ipsec-next-2022-12-09' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== ipsec-next 2022-12-09 1) Add xfrm packet offload core API. From Leon Romanovsky. 2) Add xfrm packet offload support for mlx5. From Leon Romanovsky and Raed Salem. 3) Fix a typto in a error message. From Colin Ian King. * tag 'ipsec-next-2022-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: (38 commits) xfrm: Fix spelling mistake "oflload" -> "offload" net/mlx5e: Open mlx5 driver to accept IPsec packet offload net/mlx5e: Handle ESN update events net/mlx5e: Handle hardware IPsec limits events net/mlx5e: Update IPsec soft and hard limits net/mlx5e: Store all XFRM SAs in Xarray net/mlx5e: Provide intermediate pointer to access IPsec struct net/mlx5e: Skip IPsec encryption for TX path without matching policy net/mlx5e: Add statistics for Rx/Tx IPsec offloaded flows net/mlx5e: Improve IPsec flow steering autogroup net/mlx5e: Configure IPsec packet offload flow steering net/mlx5e: Use same coding pattern for Rx and Tx flows net/mlx5e: Add XFRM policy offload logic net/mlx5e: Create IPsec policy offload tables net/mlx5e: Generalize creation of default IPsec miss group and rule net/mlx5e: Group IPsec miss handles into separate struct net/mlx5e: Make clear what IPsec rx_err does net/mlx5e: Flatten the IPsec RX add rule path net/mlx5e: Refactor FTE setup code to be more clear net/mlx5e: Move IPsec flow table creation to separate function ... ==================== Link: https://lore.kernel.org/r/20221209093310.4018731-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>