summaryrefslogtreecommitdiff
path: root/Documentation/admin-guide/kernel-parameters.txt
AgeCommit message (Collapse)Author
2024-05-22Merge tag 'tty-6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty / serial updates from Greg KH: "Here is the big set of tty/serial driver changes for 6.10-rc1. Included in here are: - Usual good set of api cleanups and evolution by Jiri Slaby to make the serial interfaces move out of the 1990's by using kfifos instead of hand-rolling their own logic. - 8250_exar driver updates - max3100 driver updates - sc16is7xx driver updates - exar driver updates - sh-sci driver updates - tty ldisc api addition to help refuse bindings - other smaller serial driver updates All of these have been in linux-next for a while with no reported issues" * tag 'tty-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (113 commits) serial: Clear UPF_DEAD before calling tty_port_register_device_attr_serdev() serial: imx: Raise TX trigger level to 8 serial: 8250_pnp: Simplify "line" related code serial: sh-sci: simplify locking when re-issuing RXDMA fails serial: sh-sci: let timeout timer only run when DMA is scheduled serial: sh-sci: describe locking requirements for invalidating RXDMA serial: sh-sci: protect invalidating RXDMA on shutdown tty: add the option to have a tty reject a new ldisc serial: core: Call device_set_awake_path() for console port dt-bindings: serial: brcm,bcm2835-aux-uart: convert to dtschema tty: serial: uartps: Add support for uartps controller reset arm64: zynqmp: Add resets property for UART nodes dt-bindings: serial: cdns,uart: Add optional reset property serial: 8250_pnp: Switch to DEFINE_SIMPLE_DEV_PM_OPS() serial: 8250_exar: Keep the includes sorted serial: 8250_exar: Make type of bit the same in exar_ee_*_bit() serial: 8250_exar: Use BIT() in exar_ee_read() serial: 8250_exar: Switch to use dev_err_probe() serial: 8250_exar: Return directly from switch-cases serial: 8250_exar: Decrease indentation level ...
2024-05-19Merge tag 'mm-nonmm-stable-2024-05-19-11-56' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-mm updates from Andrew Morton: "Mainly singleton patches, documented in their respective changelogs. Notable series include: - Some maintenance and performance work for ocfs2 in Heming Zhao's series "improve write IO performance when fragmentation is high". - Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes exposed by fstests". - kfifo header rework from Andy Shevchenko in the series "kfifo: Clean up kfifo.h". - GDB script fixes from Florian Rommel in the series "scripts/gdb: Fixes for $lx_current and $lx_per_cpu". - After much discussion, a coding-style update from Barry Song explaining one reason why inline functions are preferred over macros. The series is "codingstyle: avoid unused parameters for a function-like macro"" * tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (62 commits) fs/proc: fix softlockup in __read_vmcore nilfs2: convert BUG_ON() in nilfs_finish_roll_forward() to WARN_ON() scripts: checkpatch: check unused parameters for function-like macro Documentation: coding-style: ask function-like macros to evaluate parameters nilfs2: use __field_struct() for a bitwise field selftests/kcmp: remove unused open mode nilfs2: remove calls to folio_set_error() and folio_clear_error() kernel/watchdog_perf.c: tidy up kerneldoc watchdog: allow nmi watchdog to use raw perf event watchdog: handle comma separated nmi_watchdog command line nilfs2: make superblock data array index computation sparse friendly squashfs: remove calls to set the folio error flag squashfs: convert squashfs_symlink_read_folio to use folio APIs scripts/gdb: fix detection of current CPU in KGDB scripts/gdb: make get_thread_info accept pointers scripts/gdb: fix parameter handling in $lx_per_cpu scripts/gdb: fix failing KGDB detection during probe kfifo: don't use "proxy" headers media: stih-cec: add missing io.h media: rc: add missing io.h ...
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-14Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "The most interesting parts are probably the mm changes from Ryan which optimise the creation of the linear mapping at boot and (separately) implement write-protect support for userfaultfd. Outside of our usual directories, the Kbuild-related changes under scripts/ have been acked by Masahiro whilst the drivers/acpi/ parts have been acked by Rafael and the addition of cpumask_any_and_but() has been acked by Yury. ACPI: - Support for the Firmware ACPI Control Structure (FACS) signature feature which is used to reboot out of hibernation on some systems Kbuild: - Support for building Flat Image Tree (FIT) images, where the kernel Image is compressed alongside a set of devicetree blobs Memory management: - Optimisation of our early page-table manipulation for creation of the linear mapping - Support for userfaultfd write protection, which brings along some nice cleanups to our handling of invalid but present ptes - Extend our use of range TLBI invalidation at EL1 Perf and PMUs: - Ensure that the 'pmu->parent' pointer is correctly initialised by PMU drivers - Avoid allocating 'cpumask_t' types on the stack in some PMU drivers - Fix parsing of the CPU PMU "version" field in assembly code, as it doesn't follow the usual architectural rules - Add best-effort unwinding support for USER_STACKTRACE - Minor driver fixes and cleanups Selftests: - Minor cleanups to the arm64 selftests (missing NULL check, unused variable) Miscellaneous: - Add a command-line alias for disabling 32-bit application support - Add part number for Neoverse-V2 CPUs - Minor fixes and cleanups" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (64 commits) arm64/mm: Fix pud_user_accessible_page() for PGTABLE_LEVELS <= 2 arm64/mm: Add uffd write-protect support arm64/mm: Move PTE_PRESENT_INVALID to overlay PTE_NG arm64/mm: Remove PTE_PROT_NONE bit arm64/mm: generalize PMD_PRESENT_INVALID for all levels arm64: simplify arch_static_branch/_jump function arm64: Add USER_STACKTRACE support arm64: Add the arm64.no32bit_el0 command line option drivers/perf: hisi: hns3: Actually use devm_add_action_or_reset() drivers/perf: hisi: hns3: Fix out-of-bound access when valid event group drivers/perf: hisi_pcie: Fix out-of-bound access when valid event group kselftest: arm64: Add a null pointer check arm64: defer clearing DAIF.D arm64: assembler: update stale comment for disable_step_tsk arm64/sysreg: Update PIE permission encodings kselftest/arm64: Remove unused parameters in abi test perf/arm-spe: Assign parents for event_source device perf/arm-smmuv3: Assign parents for event_source device perf/arm-dsu: Assign parents for event_source device perf/arm-dmc620: Assign parents for event_source device ...
2024-05-14Merge tag 'x86-irq-2024-05-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 interrupt handling updates from Thomas Gleixner: "Add support for posted interrupts on bare metal. Posted interrupts is a virtualization feature which allows to inject interrupts directly into a guest without host interaction. The VT-d interrupt remapping hardware sets the bit which corresponds to the interrupt vector in a vector bitmap which is either used to inject the interrupt directly into the guest via a virtualized APIC or in case that the guest is scheduled out provides a host side notification interrupt which informs the host that an interrupt has been marked pending in the bitmap. This can be utilized on bare metal for scenarios where multiple devices, e.g. NVME storage, raise interrupts with a high frequency. In the default mode these interrupts are handles independently and therefore require a full roundtrip of interrupt entry/exit. Utilizing posted interrupts this roundtrip overhead can be avoided by coalescing these interrupt entries to a single entry for the posted interrupt notification. The notification interrupt then demultiplexes the pending bits in a memory based bitmap and invokes the corresponding device specific handlers. Depending on the usage scenario and device utilization throughput improvements between 10% and 130% have been measured. As this is only relevant for high end servers with multiple device queues per CPU attached and counterproductive for situations where interrupts are arriving at distinct times, the functionality is opt-in via a kernel command line parameter" * tag 'x86-irq-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/irq: Use existing helper for pending vector check iommu/vt-d: Enable posted mode for device MSIs iommu/vt-d: Make posted MSI an opt-in command line option x86/irq: Extend checks for pending vectors to posted interrupts x86/irq: Factor out common code for checking pending interrupts x86/irq: Install posted MSI notification handler x86/irq: Factor out handler invocation from common_interrupt() x86/irq: Set up per host CPU posted interrupt descriptors x86/irq: Reserve a per CPU IDT vector for posted MSIs x86/irq: Add a Kconfig option for posted MSI x86/irq: Remove bitfields in posted interrupt descriptor x86/irq: Unionize PID.PIR for 64bit access w/o casting KVM: VMX: Move posted interrupt descriptor out of VMX code
2024-05-13Merge tag 'sched-core-2024-05-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Add cpufreq pressure feedback for the scheduler - Rework misfit load-balancing wrt affinity restrictions - Clean up and simplify the code around ::overutilized and ::overload access. - Simplify sched_balance_newidle() - Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES handling that changed the output. - Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch() - Reorganize, clean up and unify most of the higher level scheduler balancing function names around the sched_balance_*() prefix - Simplify the balancing flag code (sched_balance_running) - Miscellaneous cleanups & fixes * tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) sched/pelt: Remove shift of thermal clock sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure() thermal/cpufreq: Remove arch_update_thermal_pressure() sched/cpufreq: Take cpufreq feedback into account cpufreq: Add a cpufreq pressure feedback for the scheduler sched/fair: Fix update of rd->sg_overutilized sched/vtime: Do not include <asm/vtime.h> header s390/irq,nmi: Include <asm/vtime.h> header directly s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover sched/vtime: Get rid of generic vtime_task_switch() implementation sched/vtime: Remove confusing arch_vtime_task_switch() declaration sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized() sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded() sched/fair: Rename root_domain::overload to ::overloaded sched/fair: Use helper functions to access root_domain::overload sched/fair: Check root_domain::overload value before update sched/fair: Combine EAS check with root_domain::overutilized access sched/fair: Simplify the continue_balancing logic in sched_balance_newidle() ...
2024-05-13Merge tag 'docs-6.10' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "Another not-too-busy cycle for documentation, including: - Some build-system changes to detect the variable fonts installed by some distributions that can break the PDF build. - Various updates and additions to the Spanish, Chinese, Italian, and Japanese translations. - Update the stable-kernel rules to match modern practice ... and the usual array of corrections, updates, and typo fixes" * tag 'docs-6.10' of git://git.lwn.net/linux: (42 commits) cgroup: Add documentation for missing zswap memory.stat kernel-doc: Added "*" in $type_constants2 to fix 'make htmldocs' warning. docs:core-api: fixed typos and grammar in printk-index page Documentation: tracing: Fix spelling mistakes docs/zh_CN/rust: Update the translation of quick-start to 6.9-rc4 docs/zh_CN/rust: Update the translation of general-information to 6.9-rc4 docs/zh_CN/rust: Update the translation of coding-guidelines to 6.9-rc4 docs/zh_CN/rust: Update the translation of arch-support to 6.9-rc4 docs: stable-kernel-rules: fix typo sent->send docs/zh_CN: remove two inconsistent spaces docs: scripts/check-variable-fonts.sh: Improve commands for detection docs: stable-kernel-rules: create special tag to flag 'no backporting' docs: stable-kernel-rules: explain use of stable@kernel.org (w/o @vger.) docs: stable-kernel-rules: remove code-labels tags and a indention level docs: stable-kernel-rules: call mainline by its name and change example docs: stable-kernel-rules: reduce redundancy docs, kprobes: Add riscv as supported architecture Docs: typos/spelling docs: kernel_include.py: Cope with docutils 0.21 docs: ja_JP/howto: Catch up update in v6.8 ...
2024-05-13Merge tag 'keys-trusted-next-6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd Pull trusted keys updates from Jarkko Sakkinen: "This contains a new key type for the Data Co-Processor (DCP), which is an IP core built into many NXP SoCs such as i.mx6ull" * tag 'keys-trusted-next-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd: docs: trusted-encrypted: add DCP as new trust source docs: document DCP-backed trusted keys kernel params MAINTAINERS: add entry for DCP-based trusted keys KEYS: trusted: Introduce NXP DCP-backed trusted keys KEYS: trusted: improve scalability of trust source config crypto: mxs-dcp: Add support for hardware-bound keys
2024-05-13Merge tag 'rcu.next.v6.10' of https://github.com/urezki/linuxLinus Torvalds
Pull RCU updates from Uladzislau Rezki: - Fix a lockdep complain for lazy-preemptible kernel, remove redundant BH disable for TINY_RCU, remove redundant READ_ONCE() in tree.c, fix false positives KCSAN splat and fix buffer overflow in the print_cpu_stall_info(). - Misc updates related to bpf, tracing and update the MAINTAINERS file. - An improvement of a normal synchronize_rcu() call in terms of latency. It maintains a separate track for sync. users only. This approach bypasses per-cpu nocb-lists thus sync-users do not depend on nocb-list length and how fast regular callbacks are processed. - RCU tasks: switch tasks RCU grace periods to sleep at TASK_IDLE priority, fix some comments, add some diagnostic warning to the exit_tasks_rcu_start() and fix a buffer overflow in the show_rcu_tasks_trace_gp_kthread(). - RCU torture: Increase memory to guest OS, fix a Tasks Rude RCU testing, some updates for TREE09, dump mode information to debug GP kthread state, remove redundant READ_ONCE(), fix some comments about RCU_TORTURE_PIPE_LEN and pipe_count, remove some redundant pointer initialization, fix a hung splat task by when the rcutorture tests start to exit, fix invalid context warning, add '--do-kvfree' parameter to torture test and use slow register unregister callbacks only for rcutype test. * tag 'rcu.next.v6.10' of https://github.com/urezki/linux: (48 commits) rcutorture: Use rcu_gp_slow_register/unregister() only for rcutype test torture: Scale --do-kvfree test time rcutorture: Fix invalid context warning when enable srcu barrier testing rcutorture: Make stall-tasks directly exit when rcutorture tests end rcutorture: Removing redundant function pointer initialization rcutorture: Make rcutorture support print rcu-tasks gp state rcutorture: Use the gp_kthread_dbg operation specified by cur_ops rcutorture: Re-use value stored to ->rtort_pipe_count instead of re-reading rcutorture: Fix rcu_torture_one_read() pipe_count overflow comment rcutorture: Remove extraneous rcu_torture_pipe_update_one() READ_ONCE() rcu: Allocate WQ with WQ_MEM_RECLAIM bit set rcu: Support direct wake-up of synchronize_rcu() users rcu: Add a trace event for synchronize_rcu_normal() rcu: Reduce synchronize_rcu() latency rcu: Fix buffer overflow in print_cpu_stall_info() rcu: Mollify sparse with RCU guard rcu-tasks: Fix show_rcu_tasks_trace_gp_kthread buffer overflow rcu-tasks: Fix the comments for tasks_rcu_exit_srcu_stall_timer rcu-tasks: Replace exit_tasks_rcu_start() initialization with WARN_ON_ONCE() rcu: Remove redundant CONFIG_PROVE_RCU #if condition ...
2024-05-13Merge tag 's390-6.10-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Alexander Gordeev: - Store AP Query Configuration Information in a static buffer - Rework the AP initialization and add missing cleanups to the error path - Swap IRQ and AP bus/device registration to avoid race conditions - Export prot_virt_guest symbol - Introduce AP configuration changes notifier interface to facilitate modularization of the AP bus - Add CONFIG_AP kernel configuration option to allow modularization of the AP bus - Rework CONFIG_ZCRYPT_DEBUG kernel configuration option description and dependency and rename it to CONFIG_AP_DEBUG - Convert sprintf() and snprintf() to sysfs_emit() in CIO code - Adjust indentation of RELOCS command build step - Make crypto performance counters upward compatible - Convert make_page_secure() and gmap_make_secure() to use folio - Rework channel-utilization-block (CUB) handling in preparation of introducing additional CUBs - Use attribute groups to simplify registration, removal and extension of measurement-related channel-path sysfs attributes - Add a per-channel-path binary "ext_measurement" sysfs attribute that provides access to extended channel-path measurement data - Export measurement data for all channel-measurement-groups (CMG), not only for a specific ones. This enables support of new CMG data formats in userspace without the need for kernel changes - Add a per-channel-path sysfs attribute "speed_bps" that provides the operating speed in bits per second or 0 if the operating speed is not available - The CIO tracepoint subchannel-type field "st" is incorrectly set to the value of subchannel-enabled SCHIB "ena" field. Fix that - Do not forcefully limit vmemmap starting address to MAX_PHYSMEM_BITS - Consider the maximum physical address available to a DCSS segment (512GB) when memory layout is set up - Simplify the virtual memory layout setup by reducing the size of identity mapping vs vmemmap overlap - Swap vmalloc and Lowcore/Real Memory Copy areas in virtual memory. This will allow to place the kernel image next to kernel modules - Move everyting KASLR related from <asm/setup.h> to <asm/page.h> - Put virtual memory layout information into a structure to improve code generation - Currently __kaslr_offset is the kernel offset in both physical and virtual memory spaces. Uncouple these offsets to allow uncoupling of the addresses spaces - Currently the identity mapping base address is implicit and is always set to zero. Make it explicit by putting into __identity_base persistent boot variable and use it in proper context - Introduce .amode31 section start and end macros AMODE31_START and AMODE31_END - Introduce OS_INFO entries that do not reference any data in memory, but rather provide only values - Store virtual memory layout in OS_INFO. It is read out by makedumpfile, crash and other tools - Store virtual memory layout in VMCORE_INFO. It is read out by crash and other tools when /proc/kcore device is used - Create additional PT_LOAD ELF program header that covers kernel image only, so that vmcore tools could locate kernel text and data when virtual and physical memory spaces are uncoupled - Uncouple physical and virtual address spaces - Map kernel at fixed location when KASLR mode is disabled. The location is defined by CONFIG_KERNEL_IMAGE_BASE kernel configuration value. - Rework deployment of kernel image for both compressed and uncompressed variants as defined by CONFIG_KERNEL_UNCOMPRESSED kernel configuration value - Move .vmlinux.relocs section in front of the compressed kernel. The interim section rescue step is avoided as result - Correct modules thunk offset calculation when branch target is more than 2GB away - Kernel modules contain their own set of expoline thunks. Now that the kernel modules area is less than 4GB away from kernel expoline thunks, make modules use kernel expolines. Also make EXPOLINE_EXTERN the default if the compiler supports it - userfaultfd can insert shared zeropages into processes running VMs, but that is not allowed for s390. Fallback to allocating a fresh zeroed anonymous folio and insert that instead - Re-enable shared zeropages for non-PV and non-skeys KVM guests - Rename hex2bitmap() to ap_hex2bitmap() and export it for external use - Add ap_config sysfs attribute to provide the means for setting or displaying adapters, domains and control domains assigned to a vfio-ap mediated device in a single operation - Make vfio_ap_mdev_link_queue() ignore duplicate link requests - Add write support to ap_config sysfs attribute to allow atomic update a vfio-ap mediated device state - Document ap_config sysfs attribute - Function os_info_old_init() is expected to be called only from a regular kdump kernel. Enable it to be called from a stand-alone dump kernel - Address gcc -Warray-bounds warning and fix array size in struct os_info - s390 does not support SMBIOS, so drop unneeded CONFIG_DMI checks - Use unwinder instead of __builtin_return_address() with ftrace to prevent returning of undefined values - Sections .hash and .gnu.hash are only created when CONFIG_PIE_BUILD kernel is enabled. Drop these for the case CONFIG_PIE_BUILD is disabled - Compile kernel with -fPIC and link with -no-pie to allow kpatch feature always succeed and drop the whole CONFIG_PIE_BUILD option-enabled code - Add missing virt_to_phys() converter for VSIE facility and crypto control blocks * tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits) Revert "s390: Relocate vmlinux ELF data to virtual address space" KVM: s390: vsie: Use virt_to_phys for crypto control block s390: Relocate vmlinux ELF data to virtual address space s390: Compile kernel with -fPIC and link with -no-pie s390: vmlinux.lds.S: Drop .hash and .gnu.hash for !CONFIG_PIE_BUILD s390/ftrace: Use unwinder instead of __builtin_return_address() s390/pci: Drop unneeded reference to CONFIG_DMI s390/os_info: Fix array size in struct os_info s390/os_info: Initialize old os_info in standalone dump kernel docs: Update s390 vfio-ap doc for ap_config sysfs attribute s390/vfio-ap: Add write support to sysfs attr ap_config s390/vfio-ap: Ignore duplicate link requests in vfio_ap_mdev_link_queue s390/vfio-ap: Add sysfs attr, ap_config, to export mdev state s390/ap: Externalize AP bus specific bitmap reading function s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests mm/userfaultfd: Do not place zeropages when zeropages are disallowed s390/expoline: Make modules use kernel expolines s390/nospec: Correct modules thunk offset calculation s390/boot: Do not rescue .vmlinux.relocs section s390/boot: Rework deployment of the kernel image ...
2024-05-09docs: document DCP-backed trusted keys kernel paramsDavid Gstir
Document the kernel parameters trusted.dcp_use_otp_key and trusted.dcp_skip_zk_test for DCP-backed trusted keys. Co-developed-by: Richard Weinberger <richard@nod.at> Signed-off-by: Richard Weinberger <richard@nod.at> Co-developed-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at> Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at> Signed-off-by: David Gstir <david@sigma-star.at> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2024-05-08Merge tag 'pci-v6.9-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull pci fixes from Bjorn Helgaas: - Update kernel-parameters doc to describe "pcie_aspm=off" more accurately (Bjorn Helgaas) - Restore the parent's (not the child's) ASPM state to the parent during resume, which fixes a reboot during resume (Kai-Heng Feng) * tag 'pci-v6.9-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: PCI/ASPM: Restore parent state to parent, child state to child PCI/ASPM: Clarify that pcie_aspm=off means leave ASPM untouched
2024-05-08watchdog: allow nmi watchdog to use raw perf eventSong Liu
NMI watchdog permanently consumes one hardware counters per CPU on the system. For systems that use many hardware counters, this causes more aggressive time multiplexing of perf events. OTOH, some CPUs (mostly Intel) support "ref-cycles" event, which is rarely used. Add kernel cmdline arg nmi_watchdog=rNNN to configure the watchdog to use raw event. For example, on Intel CPUs, we can use "r300" to configure the watchdog to use ref-cycles event. If the raw event does not work, fall back to use "cycles". [akpm@linux-foundation.org: fix kerneldoc] Link: https://lkml.kernel.org/r/20240430060236.1878002-2-song@kernel.org Signed-off-by: Song Liu <song@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-03PCI/ASPM: Clarify that pcie_aspm=off means leave ASPM untouchedBjorn Helgaas
Previously we claimed "pcie_aspm=off" meant that ASPM would be disabled, which is wrong. Correct this to say that with "pcie_aspm=off", Linux doesn't touch any ASPM configuration at all. ASPM may have been enabled by firmware, and that will be left unchanged. See "aspm_support_enabled". Link: https://lore.kernel.org/r/20240429191821.691726-1-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: David E. Box <david.e.box@linux.intel.com>
2024-05-03arm64: Add the arm64.no32bit_el0 command line optionAndrea della Porta
Introducing the field 'el0' to the idreg-override for register ID_AA64PFR0_EL1. This field is also aliased to the new kernel command line option 'arm64.no32bit_el0' as a more recognizable and mnemonic name to disable the execution of 32 bit userspace applications (i.e. avoid Aarch32 execution state in EL0) from kernel command line. Link: https://lore.kernel.org/all/20240207105847.7739-1-andrea.porta@suse.com/ Signed-off-by: Andrea della Porta <andrea.porta@suse.com> Link: https://lore.kernel.org/r/20240429102833.6426-1-andrea.porta@suse.com Signed-off-by: Will Deacon <will@kernel.org>
2024-05-02Docs: typos/spellingRemington Brasga
Fix spelling and grammar in Docs descriptions Signed-off-by: Remington Brasga <rbrasga@uci.edu> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20240429225527.2329-1-rbrasga@uci.edu
2024-04-30iommu/vt-d: Make posted MSI an opt-in command line optionJacob Pan
Add a command line opt-in option for posted MSI if CONFIG_X86_POSTED_MSI=y. Also introduce a helper function for testing if posted MSI is supported on the platform. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-12-jacob.jun.pan@linux.intel.com
2024-04-25mm: init_mlocked_on_free_v3York Jasper Niebuhr
Implements the "init_mlocked_on_free" boot option. When this boot option is enabled, any mlock'ed pages are zeroed on free. If the pages are munlock'ed beforehand, no initialization takes place. This boot option is meant to combat the performance hit of "init_on_free" as reported in commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options"). With "init_mlocked_on_free=1" only relevant data is freed while everything else is left untouched by the kernel. Correspondingly, this patch introduces no performance hit for unmapping non-mlock'ed memory. The unmapping overhead for purely mlocked memory was measured to be approximately 13%. Realistically, most systems mlock only a fraction of the total memory so the real-world system overhead should be close to zero. Optimally, userspace programs clear any key material or other confidential memory before exit and munlock the according memory regions. If a program crashes, userspace key managers fail to do this job. Accordingly, no munlock operations are performed so the data is caught and zeroed by the kernel. Should the program not crash, all memory will ideally be munlocked so no overhead is caused. CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON can be set to enable "init_mlocked_on_free" by default. Link: https://lkml.kernel.org/r/20240329145605.149917-1-yjnworkstation@gmail.com Signed-off-by: York Jasper Niebuhr <yjnworkstation@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: York Jasper Niebuhr <yjnworkstation@gmail.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25cpu: Ignore "mitigations" kernel parameter if CPU_MITIGATIONS=nSean Christopherson
Explicitly disallow enabling mitigations at runtime for kernels that were built with CONFIG_CPU_MITIGATIONS=n, as some architectures may omit code entirely if mitigations are disabled at compile time. E.g. on x86, a large pile of Kconfigs are buried behind CPU_MITIGATIONS, and trying to provide sane behavior for retroactively enabling mitigations is extremely difficult, bordering on impossible. E.g. page table isolation and call depth tracking require build-time support, BHI mitigations will still be off without additional kernel parameters, etc. [ bp: Touchups. ] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240420000556.2645001-3-seanjc@google.com
2024-04-24mm: Update shuffle documentation to match its current stateMaíra Canal
Commit 839195352d82 ("mm/shuffle: remove dynamic reconfiguration") removed the dynamic reconfiguration capabilities from the shuffle page allocator. This means that, now, we don't have any perspective of an "autodetection of memory-side-cache" that triggers the enablement of the shuffle page allocator. Therefore, let the documentation reflect that the only way to enable the shuffle page allocator is by setting `page_alloc.shuffle=1`. Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20240422142007.1062231-1-mcanal@igalia.com
2024-04-24sched/pelt: Remove shift of thermal clockVincent Guittot
The optional shift of the clock used by thermal/hw load avg has been introduced to handle case where the signal was not always a high frequency hw signal. Now that cpufreq provides a signal for firmware and SW pressure, we can remove this exception and always keep this PELT signal aligned with other signals. Mark sysctl_sched_migration_cost boot parameter as deprecated Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20240326091616.3696851-6-vincent.guittot@linaro.org
2024-04-23Merge 6.9-rc5 into tty-nextGreg Kroah-Hartman
We want the tty fixes in here as well, and it resolves a merge conflict in: drivers/tty/serial/serial_core.c as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-17s390: Map kernel at fixed location when KASLR is disabledAlexander Gordeev
Since kernel virtual and physical address spaces are uncoupled the kernel is mapped at the top of the virtual address space in case KASLR is disabled. That does not pose any issue with regard to the kernel booting and operation, but makes it difficult to use a generated vmlinux with some debugging tools (e.g. gdb), because the exact location of the kernel image in virtual memory is unknown. Make that location known and introduce CONFIG_KERNEL_IMAGE_BASE configuration option. A custom CONFIG_KERNEL_IMAGE_BASE value that would break the virtual memory layout leads to a build error. The kernel image size is defined by KERNEL_IMAGE_SIZE macro and set to 512 MB, by analogy with x86. Suggested-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-15rcu: Reduce synchronize_rcu() latencyUladzislau Rezki (Sony)
A call to a synchronize_rcu() can be optimized from a latency point of view. Workloads which depend on this can benefit of it. The delay of wakeme_after_rcu() callback, which unblocks a waiter, depends on several factors: - how fast a process of offloading is started. Combination of: - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU; - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY; - other. - when started, invoking path is interrupted due to: - time limit; - need_resched(); - if limit is reached. - where in a nocb list it is located; - how fast previous callbacks completed; Example: 1. On our embedded devices i can easily trigger the scenario when it is a last in the list out of ~3600 callbacks: <snip> <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28 ... <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=.... <snip> 2. We use cpuset/cgroup to classify tasks and assign them into different cgroups. For example "backgrond" group which binds tasks only to little CPUs or "foreground" which makes use of all CPUs. Tasks can be migrated between groups by a request if an acceleration is needed. See below an example how "surfaceflinger" task gets migrated. Initially it is located in the "system-background" cgroup which allows to run only on little cores. In order to speed it up it can be temporary moved into "foreground" cgroup which allows to use big/all CPUs: cgroup_attach_task(): -> cgroup_migrate_execute() -> cpuset_can_attach() -> percpu_down_write() -> rcu_sync_enter() -> synchronize_rcu() -> now move tasks to the new cgroup. -> cgroup_migrate_finish() <snip> rcuop/1-29 [000] ..... 7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt PERFD-SERVER-1855 [000] d..1. 7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger TimerDispatch-2768 [002] d..5. 7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4 <snip> "Boosting a task" depends on synchronize_rcu() latency: - first trace shows a completion of synchronize_rcu(); - second shows attaching a task to a new group; - last shows a final step when migration occurs. 3. To address this drawback, maintain a separate track that consists of synchronize_rcu() callers only. After completion of a grace period users are deferred to a dedicated worker to process requests. 4. This patch reduces the latency of synchronize_rcu() approximately by ~30-40% on synthetic tests. The real test case, camera launch time, shows(time is in milliseconds): 1-run 542 vs 489 improvement 9% 2-run 540 vs 466 improvement 13% 3-run 518 vs 468 improvement 9% 4-run 531 vs 457 improvement 13% 5-run 548 vs 475 improvement 13% 6-run 509 vs 484 improvement 4% Synthetic test(no "noise" from other callbacks): Hardware: x86_64 64 CPUs, 64GB of memory Linux-6.6 - 10K tasks(simultaneous); - each task does(1000 loops) synchronize_rcu(); kfree(p); default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users; patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users. Running 60K gives approximately same results on my setup. Please note it is without any interaction with another type of callbacks, otherwise it will impact a lot a default case. 5. By default it is disabled. To enable this perform one of the below sequence: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1" Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Co-developed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-12x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=autoJosh Poimboeuf
Unlike most other mitigations' "auto" options, spectre_bhi=auto only mitigates newer systems, which is confusing and not particularly useful. Remove it. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/412e9dc87971b622bbbaf64740ebc1f140bff343.1712813475.git.jpoimboe@kernel.org
2024-04-11x86/bugs: Clarify that syscall hardening isn't a BHI mitigationJosh Poimboeuf
While syscall hardening helps prevent some BHI attacks, there's still other low-hanging fruit remaining. Don't classify it as a mitigation and make it clear that the system may still be vulnerable if it doesn't have a HW or SW mitigation enabled. Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/b5951dae3fdee7f1520d5136a27be3bdfe95f88b.1712813475.git.jpoimboe@kernel.org
2024-04-11x86/bugs: Fix BHI documentationJosh Poimboeuf
Fix up some inaccuracies in the BHI documentation. Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/8c84f7451bfe0dd08543c6082a383f390d4aa7e2.1712813475.git.jpoimboe@kernel.org
2024-04-09Documentation: kernel-parameters: Add DEVNAME:0.0 format for serial portsTony Lindgren
Document the console option for DEVNAME:0.0 style addressing for serial ports. Suggested-by: Sebastian Reichel <sre@kernel.org> Signed-off-by: Tony Lindgren <tony@atomide.com> Reviewed-by: Dhruva Gole <d-gole@ti.com> Link: https://lore.kernel.org/r/20240327110021.59793-8-tony@atomide.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-08Merge tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull x86 mitigations from Thomas Gleixner: "Mitigations for the native BHI hardware vulnerabilty: Branch History Injection (BHI) attacks may allow a malicious application to influence indirect branch prediction in kernel by poisoning the branch history. eIBRS isolates indirect branch targets in ring0. The BHB can still influence the choice of indirect branch predictor entry, and although branch predictor entries are isolated between modes when eIBRS is enabled, the BHB itself is not isolated between modes. Add mitigations against it either with the help of microcode or with software sequences for the affected CPUs" [ This also ends up enabling the full mitigation by default despite the system call hardening, because apparently there are other indirect calls that are still sufficiently reachable, and the 'auto' case just isn't hardened enough. We'll have some more inevitable tweaking in the future - Linus ] * tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM: x86: Add BHI_NO x86/bhi: Mitigate KVM by default x86/bhi: Add BHI mitigation knob x86/bhi: Enumerate Branch History Injection (BHI) bug x86/bhi: Define SPEC_CTRL_BHI_DIS_S x86/bhi: Add support for clearing branch history at syscall entry x86/syscall: Don't force use of indirect calls for system calls x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file
2024-04-08x86/bhi: Mitigate KVM by defaultPawan Gupta
BHI mitigation mode spectre_bhi=auto does not deploy the software mitigation by default. In a cloud environment, it is a likely scenario where userspace is trusted but the guests are not trusted. Deploying system wide mitigation in such cases is not desirable. Update the auto mode to unconditionally mitigate against malicious guests. Deploy the software sequence at VMexit in auto mode also, when hardware mitigation is not available. Unlike the force =on mode, software sequence is not deployed at syscalls in auto mode. Suggested-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08x86/bhi: Add BHI mitigation knobPawan Gupta
Branch history clearing software sequences and hardware control BHI_DIS_S were defined to mitigate Branch History Injection (BHI). Add cmdline spectre_bhi={on|off|auto} to control BHI mitigation: auto - Deploy the hardware mitigation BHI_DIS_S, if available. on - Deploy the hardware mitigation BHI_DIS_S, if available, otherwise deploy the software sequence at syscall entry and VMexit. off - Turn off BHI mitigation. The default is auto mode which does not deploy the software sequence mitigation. This is because of the hardening done in the syscall dispatch path, which is the likely target of BHI. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-03-29tracing: Fix documentation on tp_printk cmdline optionVitaly Chikunov
kernel-parameters.txt incorrectly states that workings of kernel.tracepoint_printk sysctl depends on "tracepoint_printk kernel cmdline option", this is a bit misleading for new users since the actual cmdline option name is tp_printk. Fixes: 0daa2302968c ("tracing: Add tp_printk cmdline to have tracepoints go to printk()") Signed-off-by: Vitaly Chikunov <vt@altlinux.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20240323231704.1217926-1-vt@altlinux.org
2024-03-18Merge tag 'trace-v6.9-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: "Main user visible change: - User events can now have "multi formats" The current user events have a single format. If another event is created with a different format, it will fail to be created. That is, once an event name is used, it cannot be used again with a different format. This can cause issues if a library is using an event and updates its format. An application using the older format will prevent an application using the new library from registering its event. A task could also DOS another application if it knows the event names, and it creates events with different formats. The multi-format event is in a different name space from the single format. Both the event name and its format are the unique identifier. This will allow two different applications to use the same user event name but with different payloads. - Added support to have ftrace_dump_on_oops dump out instances and not just the main top level tracing buffer. Other changes: - Add eventfs_root_inode Only the root inode has a dentry that is static (never goes away) and stores it upon creation. There's no reason that the thousands of other eventfs inodes should have a pointer that never gets set in its descriptor. Create a eventfs_root_inode desciptor that has a eventfs_inode descriptor and a dentry pointer, and only the root inode will use this. - Added WARN_ON()s in eventfs There's some conditionals remaining in eventfs that should never be hit, but instead of removing them, add WARN_ON() around them to make sure that they are never hit. - Have saved_cmdlines allocation also include the map_cmdline_to_pid array The saved_cmdlines structure allocates a large amount of data to hold its mappings. Within it, it has three arrays. Two are already apart of it: map_pid_to_cmdline[] and saved_cmdlines[]. More memory can be saved by also including the map_cmdline_to_pid[] array as well. - Restructure __string() and __assign_str() macros used in TRACE_EVENT() Dynamic strings in TRACE_EVENT() are declared with: __string(name, source) And assigned with: __assign_str(name, source) In the tracepoint callback of the event, the __string() is used to get the size needed to allocate on the ring buffer and __assign_str() is used to copy the string into the ring buffer. There's a helper structure that is created in the TRACE_EVENT() macro logic that will hold the string length and its position in the ring buffer which is created by __string(). There are several trace events that have a function to create the string to save. This function is executed twice. Once for __string() and again for __assign_str(). There's no reason for this. The helper structure could also save the string it used in __string() and simply copy that into __assign_str() (it also already has its length). By using the structure to store the source string for the assignment, it means that the second argument to __assign_str() is no longer needed. It will be removed in the next merge window, but for now add a warning if the source string given to __string() is different than the source string given to __assign_str(), as the source to __assign_str() isn't even used and will be going away. - Added checks to make sure that the source of __string() is also the source of __assign_str() so that it can be safely removed in the next merge window. Included fixes that the above check found. - Other minor clean ups and fixes" * tag 'trace-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits) tracing: Add __string_src() helper to help compilers not to get confused tracing: Use strcmp() in __assign_str() WARN_ON() check tracepoints: Use WARN() and not WARN_ON() for warnings tracing: Use div64_u64() instead of do_div() tracing: Support to dump instance traces by ftrace_dump_on_oops tracing: Remove second parameter to __assign_rel_str() tracing: Add warning if string in __assign_str() does not match __string() tracing: Add __string_len() example tracing: Remove __assign_str_len() ftrace: Fix most kernel-doc warnings tracing: Decrement the snapshot if the snapshot trigger fails to register tracing: Fix snapshot counter going between two tracers that use it tracing: Use EVENT_NULL_STR macro instead of open coding "(null)" tracing: Use ? : shortcut in trace macros tracing: Do not calculate strlen() twice for __string() fields tracing: Rework __assign_str() and __string() to not duplicate getting the string cxl/trace: Properly initialize cxl_poison region name net: hns3: tracing: fix hclgevf trace event strings drm/i915: Add missing ; to __assign_str() macros in tracepoint code NFSD: Fix nfsd_clid_class use of __string_len() macro ...
2024-03-18tracing: Support to dump instance traces by ftrace_dump_on_oopsHuang Yiwei
Currently ftrace only dumps the global trace buffer on an OOPs. For debugging a production usecase, instance trace will be helpful to check specific problems since global trace buffer may be used for other purposes. This patch extend the ftrace_dump_on_oops parameter to dump a specific or multiple trace instances: - ftrace_dump_on_oops=0: as before -- don't dump - ftrace_dump_on_oops[=1]: as before -- dump the global trace buffer on all CPUs - ftrace_dump_on_oops=2 or =orig_cpu: as before -- dump the global trace buffer on CPU that triggered the oops - ftrace_dump_on_oops=<instance_name>: new behavior -- dump the tracing instance matching <instance_name> - ftrace_dump_on_oops[=2/orig_cpu],<instance1_name>[=2/orig_cpu], <instrance2_name>[=2/orig_cpu]: new behavior -- dump the global trace buffer and multiple instance buffer on all CPUs, or only dump on CPU that triggered the oops if =2 or =orig_cpu is given Also, the sysctl node can handle the input accordingly. Link: https://lore.kernel.org/linux-trace-kernel/20240223083126.1817731-1-quic_hyiwei@quicinc.com Cc: Ross Zwisler <zwisler@google.com> Cc: <mhiramat@kernel.org> Cc: <mark.rutland@arm.com> Cc: <mcgrof@kernel.org> Cc: <keescook@chromium.org> Cc: <j.granados@samsung.com> Cc: <mathieu.desnoyers@efficios.com> Cc: <corbet@lwn.net> Signed-off-by: Huang Yiwei <quic_hyiwei@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-14Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. * tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) nilfs2: prevent kernel bug at submit_bh_wbc() nilfs2: fix failure to detect DAT corruption in btree and direct mappings ocfs2: enable ocfs2_listxattr for special files ocfs2: remove SLAB_MEM_SPREAD flag usage assoc_array: fix the return value in assoc_array_insert_mid_shortcut() buildid: use kmap_local_page() watchdog/core: remove sysctl handlers from public header nilfs2: use div64_ul() instead of do_div() mul_u64_u64_div_u64: increase precision by conditionally swapping a and b kexec: copy only happens before uchunk goes to zero get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig get_signal: don't abuse ksig->info.si_signo and ksig->sig const_structs.checkpatch: add device_type Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>" dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace() list: leverage list_is_head() for list_entry_is_head() nilfs2: MAINTAINERS: drop unreachable project mirror site smp: make __smp_processor_id() 0-argument macro fat: fix uninitialized field in nostale filehandles ...
2024-03-13Merge tag 'drm-next-2024-03-13' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds
Pull drm updates from Dave Airlie: "Highlights are usual, more AMD IP blocks for future hw, i915/xe changes, Displayport tunnelling support for i915, msm YUV over DP changes, new tests for ttm, but its mostly a lot of stuff all over the place from lots of people. core: - EDID cleanups - scheduler error handling fixes - managed: add drmm_release_action() with tests - add ratelimited drm debug print - DPCD PSR early transport macro - DP tunneling and bandwidth allocation helpers - remove built-in edids - dp: Avoid AUX transfers on powered-down displays - dp: Add VSC SDP helpers cross drivers: - use new drm print helpers - switch to ->read_edid callback - gem: add stats for shared buffers plus updates to amdgpu, i915, xe syncobj: - fixes to waiting and sleeping ttm: - add tests - fix errno codes - simply busy-placement handling - fix page decryption media: - tc358743: fix v4l device registration video: - move all kernel parameters for video behind CONFIG_VIDEO sound: - remove <drm/drm_edid.h> include from header ci: - add tests for msm - fix apq8016 runner efifb: - use copy of global screen_info state vesafb: - use copy of global screen_info state simplefb: - fix logging bridge: - ite-6505: fix DP link-training bug - samsung-dsim: fix error checking in probe - samsung-dsim: add bsh-smm-s2/pro boards - tc358767: fix regmap usage - imx: add i.MX8MP HDMI PVI plus DT bindings - imx: add i.MX8MP HDMI TX plus DT bindings - sii902x: fix probing and unregistration - tc358767: limit pixel PLL input range - switch to new drm_bridge_read_edid() interface panel: - ltk050h3146w: error-handling fixes - panel-edp: support delay between power-on and enable; use put_sync in unprepare; support Mediatek MT8173 Chromebooks, BOE NV116WHM-N49 V8.0, BOE NV122WUM-N41, CSO MNC207QS1-1 plus DT bindings - panel-lvds: support EDT ETML0700Z9NDHA plus DT bindings - panel-novatek: FRIDA FRD400B25025-A-CTK plus DT bindings - add BOE TH101MB31IG002-28A plus DT bindings - add EDT ETML1010G3DRA plus DT bindings - add Novatek NT36672E LCD DSI plus DT bindings - nt36523: support 120Hz timings, fix includes - simple: fix display timings on RK32FN48H - visionox-vtdr6130: fix initialization - add Powkiddy RGB10MAX3 plus DT bindings - st7703: support panel rotation plus DT bindings - add Himax HX83112A plus DT bindings - ltk500hd1829: add support for ltk101b4029w and admatec 9904370 - simple: add BOE BP082WX1-100 8.2" panel plus DT bindungs panel-orientation-quirks: - GPD Win Mini amdgpu: - Validate DMABuf imports in compute VMs - Add RAS ACA framework - PSP 13 fixes - Misc code cleanups - Replay fixes - Atom interpretor PS, WS bounds checking - DML2 fixes - Audio fixes - DCN 3.5 Z state fixes - Remove deprecated ida_simple usage - UBSAN fixes - RAS fixes - Enable seq64 infrastructure - DC color block enablement - Documentation updates - DC documentation updates - DMCUB updates - ATHUB 4.1 support - LSDMA 7.0 support - JPEG DPG support - IH 7.0 support - HDP 7.0 support - VCN 5.0 support - SMU 13.0.6 updates - NBIO 7.11 updates - SDMA 6.1 updates - MMHUB 3.3 updates - DCN 3.5.1 support - NBIF 6.3.1 support - VPE 6.1.1 support amdkfd: - Validate DMABuf imports in compute VMs - SVM fixes - Trap handler updates and enhancements - Fix cache size reporting - Relocate the trap handler radeon: - Atom interpretor PS, WS bounds checking - Misc code cleanups xe: - new query for GuC submission version - Remove unused persistent exec_queues - Add vram frequency sysfs attributes - Add the flag XE_VM_BIND_FLAG_DUMPABLE - Drop pre-production workarounds - Drop kunit tests for unsupported platforms - Start pumbling SR-IOV support with memory based interrupts for VF - Allow to map BO in GGTT with PAT index corresponding to XE_CACHE_UC to work with memory based interrupts - Add GuC Doorbells Manager as prep work SR-IOV - Implement additional workarounds for xe2 and MTL - Program a few registers according to perfomance guide spec for Xe2 - Fix remaining 32b build issues and enable it back - Fix build with CONFIG_DEBUG_FS=n - Fix warnings from GuC ABI headers - Introduce Relay Communication for SR-IOV for VF <-> GuC <-> PF - Release mmap mappings on rpm suspend - Disable mid-thread preemption when not properly supported by hardware - Fix xe_exec by reserving extra fence slot for CPU bind - Fix xe_exec with full long running exec queue - Canonicalize addresses where needed for Xe2 and add to devcoredum - Toggle USM support for Xe2 - Only allow 1 ufence per exec / bind IOCTL - Add GuC firmware loading for Lunar Lake - Add XE_VMA_PTE_64K VMA flag i915: - Add more ADL-N PCI IDs - Enable fastboot also on older platforms - Early transport for panel replay and PSR - New ARL PCI IDs - DP TPS4 PHY test pattern support - Unify and improve VSC SDP for PSR and non-PSR cases - Refactor memory regions and improve debug logging - Rework global state serialization - Remove unused CDCLK divider fields - Unify HDCP connector logging format - Use display instead of graphics version in display code - Move VBT and opregion debugfs next to the implementation - Abstract opregion interface, use opaque type - MTL fixes - HPD handling fixes - Add GuC submission interface version query - Atomically invalidate userptr on mmu-notifier - Update handling of MMIO triggered reports - Don't make assumptions about intel_wakeref_t type - Extend driver code of Xe_LPG to Xe_LPG+ - Add flex arrays to struct i915_syncmap - Allow for very slow HuC loading - DP tunneling and bandwidth allocation support msm: - Correct bindings for MSM8976 and SM8650 platforms - Start migration of MDP5 platforms to DPU driver - X1E80100 MDSS support - DPU: - Improve DSC allocation, fixing several important corner cases - Add support for SDM630/SDM660 platforms - Simplify dpu_encoder_phys_ops - Apply fixes targeting DSC support with a single DSC encoder - Apply fixes for HCTL_EN timing configuration - X1E80100 support - Add support for YUV420 over DP - GPU: - fix sc7180 UBWC config - fix a7xx LLC config - new gpu support: a305B, a750, a702 - machine support: SM7150 (different power levels than other a618) - a7xx devcoredump support habanalabs: - configure IRQ affinity according to NUMA node - move HBM MMU page tables inside the HBM - improve device reset - check extended PCIe errors ivpu: - updates to firmware API - refactor BO allocation imx: - use devm_ functions during init hisilicon: - fix EDID includes mgag200: - improve ioremap usage - convert to struct drm_edid - Work around PCI write bursts nouveau: - disp: use kmemdup() - fix EDID includes - documentation fixes qaic: - fixes to BO handling - make use of DRM managed release - fix order of remove operations rockchip: - analogix_dp: get encoder port from DT - inno_hdmi: support HDMI for RK3128 - lvds: error-handling fixes ssd130x: - support SSD133x plus DT bindings tegra: - fix error handling tilcdc: - make use of DRM managed release v3d: - show memory stats in debugfs - Support display MMU page size vc4: - fix error handling in plane prepare_fb - fix framebuffer test in plane helpers virtio: - add venus capset defines vkms: - fix OOB access when programming the LUT - Kconfig improvements vmwgfx: - unmap surface before changing plane state - fix memory leak in error handling - documentation fixes - list command SVGA_3D_CMD_DEFINE_GB_SURFACE_V4 as invalid - fix null-pointer deref in execbuf - refactor display-mode probing - fix fencing for creating cursor MOBs - fix cursor-memory lifetime xlnx: - fix live video input for ZynqMP DPSUB lima: - fix memory leak loongson: - fail if no VRAM present meson: - switch to new drm_bridge_read_edid() interface renesas: - add RZ/G2L DU support plus DT bindings mxsfb: - Use managed mode config sun4i: - HDMI: updates to atomic mode setting mediatek: - Add display driver for MT8188 VDOSYS1 - DSI driver cleanups - Filter modes according to hardware capability - Fix a null pointer crash in mtk_drm_crtc_finish_page_flip etnaviv: - enhancements for NPU and MRT support" * tag 'drm-next-2024-03-13' of https://gitlab.freedesktop.org/drm/kernel: (1420 commits) drm/amd/display: Removed redundant @ symbol to fix kernel-doc warnings in -next repo drm/amd/pm: wait for completion of the EnableGfxImu message drm/amdgpu/soc21: add mode2 asic reset for SMU IP v14.0.1 drm/amdgpu: add smu 14.0.1 support drm/amdgpu: add VPE 6.1.1 discovery support drm/amdgpu/vpe: add VPE 6.1.1 support drm/amdgpu/vpe: don't emit cond exec command under collaborate mode drm/amdgpu/vpe: add collaborate mode support for VPE drm/amdgpu/vpe: add PRED_EXE and COLLAB_SYNC OPCODE drm/amdgpu/vpe: add multi instance VPE support drm/amdgpu/discovery: add nbif v6_3_1 ip block drm/amdgpu: Add nbif v6_3_1 ip block support drm/amdgpu: Add pcie v6_1_0 ip headers (v5) drm/amdgpu: Add nbif v6_3_1 ip headers (v5) arch/powerpc: Remove <linux/fb.h> from backlight code macintosh/via-pmu-backlight: Include <linux/backlight.h> fbdev/chipsfb: Include <linux/backlight.h> drm/etnaviv: Restore some id values drm/amdkfd: make kfd_class constant drm/amdgpu: add ring timeout information in devcoredump ...
2024-03-13Merge tag 'pm-6.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "From the functional perspective, the most significant change here is the addition of support for Energy Models that can be updated dynamically at run time. There is also the addition of LZ4 compression support for hibernation, the new preferred core support in amd-pstate, new platforms support in the Intel RAPL driver, new model-specific EPP handling in intel_pstate and more. Apart from that, the cpufreq default transition delay is reduced from 10 ms to 2 ms (along with some related adjustments), the system suspend statistics code undergoes a significant rework and there is a usual bunch of fixes and code cleanups all over. Specifics: - Allow the Energy Model to be updated dynamically (Lukasz Luba) - Add support for LZ4 compression algorithm to the hibernation image creation and loading code (Nikhil V) - Fix and clean up system suspend statistics collection (Rafael Wysocki) - Simplify device suspend and resume handling in the power management core code (Rafael Wysocki) - Fix PCI hibernation support description (Yiwei Lin) - Make hibernation take set_memory_ro() return values into account as appropriate (Christophe Leroy) - Set mem_sleep_current during kernel command line setup to avoid an ordering issue with handling it (Maulik Shah) - Fix wake IRQs handling when pm_runtime_force_suspend() is used as a driver's system suspend callback (Qingliang Li) - Simplify pm_runtime_get_if_active() usage and add a replacement for pm_runtime_put_autosuspend() (Sakari Ailus) - Add a tracepoint for runtime_status changes tracking (Vilas Bhat) - Fix section title markdown in the runtime PM documentation (Yiwei Lin) - Enable preferred core support in the amd-pstate cpufreq driver (Meng Li) - Fix min_perf assignment in amd_pstate_adjust_perf() and make the min/max limit perf values in amd-pstate always stay within the (highest perf, lowest perf) range (Tor Vic, Meng Li) - Allow intel_pstate to assign model-specific values to strings used in the EPP sysfs interface and make it do so on Meteor Lake (Srinivas Pandruvada) - Drop long-unused cpudata::prev_cummulative_iowait from the intel_pstate cpufreq driver (Jiri Slaby) - Prevent scaling_cur_freq from exceeding scaling_max_freq when the latter is an inefficient frequency (Shivnandan Kumar) - Change default transition delay in cpufreq to 2ms (Qais Yousef) - Remove references to 10ms minimum sampling rate from comments in the cpufreq code (Pierre Gondois) - Honour transition_latency over transition_delay_us in cpufreq (Qais Yousef) - Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar) - General enhancements / cleanups to ARM cpufreq drivers (tianyu2, Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia Belova) - Update cpufreq-dt-platdev to block/approve devices (Richard Acayan) - Make the SCMI cpufreq driver get a transition delay value from firmware (Pierre Gondois) - Prevent the haltpoll cpuidle governor from shrinking guest poll_limit_ns below grow_start (Parshuram Sangle) - Avoid potential overflow in integer multiplication when computing cpuidle state parameters (C Cheng) - Adjust MWAIT hint target C-state computation in the ACPI cpuidle driver and in intel_idle to return a correct value for C0 (He Rongguang) - Address multiple issues in the TPMI RAPL driver and add support for new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui) - Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel Lezcano) - Fix kernel-doc for dtpm_create_hierarchy() (Yang Li) - Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth Norway Ananda) - Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil) - Fix a couple of warnings in the OPP core code related to W=1 builds (Viresh Kumar) - Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh Kumar) - Extend dev_pm_opp_data with turbo support (Sibi Sankar) - dt-bindings: drop maxItems from inner items (David Heidelberg)" * tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (95 commits) dt-bindings: opp: drop maxItems from inner items OPP: debugfs: Fix warning around icc_get_name() OPP: debugfs: Fix warning with W=1 builds cpufreq: Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h OPP: Extend dev_pm_opp_data with turbo support Fix cpupower-frequency-info.1 man page typo cpufreq: scmi: Set transition_delay_us firmware: arm_scmi: Populate fast channel rate_limit firmware: arm_scmi: Populate perf commands rate_limit cpuidle: ACPI/intel: fix MWAIT hint target C-state computation PM: sleep: wakeirq: fix wake irq warning in system suspend powercap: dtpm: Fix kernel-doc for dtpm_create_hierarchy() function cpufreq: Don't unregister cpufreq cooling on CPU hotplug PM: suspend: Set mem_sleep_current during kernel command line setup cpufreq: Honour transition_latency over transition_delay_us cpufreq: Limit resolving a frequency to policy min/max Documentation: PM: Fix runtime_pm.rst markdown syntax cpufreq: amd-pstate: adjust min/max limit perf cpufreq: Remove references to 10ms min sampling rate cpufreq: intel_pstate: Update default EPPs for Meteor Lake ...
2024-03-12Merge tag 'slab-for-6.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - Freelist loading optimization (Chengming Zhou) When the per-cpu slab is depleted and a new one loaded from the cpu partial list, optimize the loading to avoid an irq enable/disable cycle. This results in a 3.5% performance improvement on the "perf bench sched messaging" test. - Kernel boot parameters cleanup after SLAB removal (Xiongwei Song) Due to two different main slab implementations we've had boot parameters prefixed either slab_ and slub_ with some later becoming an alias as both implementations gained the same functionality (i.e. slab_nomerge vs slub_nomerge). In order to eventually get rid of the implementation-specific names, the canonical and documented parameters are now all prefixed slab_ and the slub_ variants become deprecated but still working aliases. - SLAB_ kmem_cache creation flags cleanup (Vlastimil Babka) The flags had hardcoded #define values which became tedious and error-prone when adding new ones. Assign the values via an enum that takes care of providing unique bit numbers. Also deprecate SLAB_MEM_SPREAD which was only used by SLAB, so it's a no-op since SLAB removal. Assign it an explicit zero value. The removals of the flag usage are handled independently in the respective subsystems, with a final removal of any leftover usage planned for the next release. - Misc cleanups and fixes (Chengming Zhou, Xiaolei Wang, Zheng Yejian) Includes removal of unused code or function parameters and a fix of a memleak. * tag 'slab-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slab: remove PARTIAL_NODE slab_state mm, slab: remove memcg_from_slab_obj() mm, slab: remove the corner case of inc_slabs_node() mm/slab: Fix a kmemleak in kmem_cache_destroy() mm, slab, kasan: replace kasan_never_merge() with SLAB_NO_MERGE mm, slab: use an enum to define SLAB_ cache creation flags mm, slab: deprecate SLAB_MEM_SPREAD flag mm, slab: fix the comment of cpu partial list mm, slab: remove unused object_size parameter in kmem_cache_flags() mm/slub: remove parameter 'flags' in create_kmalloc_caches() mm/slub: remove unused parameter in next_freelist_entry() mm/slub: remove full list manipulation for non-debug slab mm/slub: directly load freelist from cpu partial slab in the likely case mm/slub: make the description of slab_min_objects helpful in doc mm/slub: replace slub_$params with slab_$params in slub.rst mm/slub: unify all sl[au]b parameters with "slab_$param" Documentation: kernel-parameters: remove noaliencache
2024-03-12Merge tag 'docs-6.9' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "A moderatly busy cycle for development this time around. - Some cleanup of the main index page for easier navigation - Rework some of the other top-level pages for better readability and, with luck, fewer merge conflicts in the future. - Submit-checklist improvements, hopefully the first of many. - New Italian translations - A fair number of kernel-doc fixes and improvements. We have also dropped the recommendation to use an old version of Sphinx. - A new document from Thorsten on bisection ... and lots of fixes and updates" * tag 'docs-6.9' of git://git.lwn.net/linux: (54 commits) docs: verify/bisect: fixes, finetuning, and support for Arch docs: Makefile: Add dependency to $(YNL_INDEX) for targets other than htmldocs docs: Move ja_JP/howto.rst to ja_JP/process/howto.rst docs: submit-checklist: use subheadings docs: submit-checklist: structure by category docs: new text on bisecting which also covers bug validation docs: drop the version constraints for sphinx and dependencies docs: kerneldoc-preamble.sty: Remove code for Sphinx <2.4 docs: Restore "smart quotes" for quotes docs/zh_CN: accurate translation of "function" docs: Include simplified link titles in main index docs: Correct formatting of title in admin-guide/index.rst docs: kernel_feat.py: fix build error for missing files MAINTAINERS: Set the field name for subsystem profile section kasan: Add documentation for CONFIG_KASAN_EXTRA_INFO Fixed case issue with 'fault-injection' in documentation kernel-doc: handle #if in enums as well Documentation: update mailing list addresses doc: kerneldoc.py: fix indentation scripts/kernel-doc: simplify signature printing ...
2024-03-12Merge tag 'rfds-for-linus-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RFDS mitigation from Dave Hansen: "RFDS is a CPU vulnerability that may allow a malicious userspace to infer stale register values from kernel space. Kernel registers can have all kinds of secrets in them so the mitigation is basically to wait until the kernel is about to return to userspace and has user values in the registers. At that point there is little chance of kernel secrets ending up in the registers and the microarchitectural state can be cleared. This leverages some recent robustness fixes for the existing MDS vulnerability. Both MDS and RFDS use the VERW instruction for mitigation" * tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests x86/rfds: Mitigate Register File Data Sampling (RFDS) Documentation/hw-vuln: Add documentation for RFDS x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
2024-03-11Merge tag 'x86-core-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: - The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code - Propagate more RIP-relative addressing in assembly code, to generate slightly better code - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options - Rework the x86 idle code to cure RCU violations and to clean up the logic - Clean up the vDSO Makefile logic - Misc cleanups and fixes * tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/idle: Select idle routine only once x86/idle: Let prefer_mwait_c1_over_halt() return bool x86/idle: Cleanup idle_setup() x86/idle: Clean up idle selection x86/idle: Sanitize X86_BUG_AMD_E400 handling sched/idle: Conditionally handle tick broadcast in default_idle_call() x86: Increase brk randomness entropy for 64-bit systems x86/vdso: Move vDSO to mmap region x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o x86/retpoline: Ensure default return thunk isn't used at runtime x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32 x86/vdso: Use $(addprefix ) instead of $(foreach ) x86/vdso: Simplify obj-y addition x86/vdso: Consolidate targets and clean-files x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS ...
2024-03-11Merge tag 'x86_misc_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Borislav Petkov: - Fix a wrong check in the function reporting whether a CPU executes (or not) a NMI handler - Ratelimit unknown NMIs messages in order to not potentially slow down the machine - Other fixlets * tag 'x86_misc_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/nmi: Fix the inverse "in NMI handler" check Documentation/maintainer-tip: Add C++ tail comments exception Documentation/maintainer-tip: Add Closes tag x86/nmi: Rate limit unknown NMI messages Documentation/kernel-parameters: Add spec_rstack_overflow to mitigations=off
2024-03-11Merge tag 'x86_sev_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Add the x86 part of the SEV-SNP host support. This will allow the kernel to be used as a KVM hypervisor capable of running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP is the ultimate goal of the AMD confidential computing side, providing the most comprehensive confidential computing environment up to date. This is the x86 part and there is a KVM part which did not get ready in time for the merge window so latter will be forthcoming in the next cycle. - Rework the early code's position-dependent SEV variable references in order to allow building the kernel with clang and -fPIE/-fPIC and -mcmodel=kernel - The usual set of fixes, cleanups and improvements all over the place * tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) x86/sev: Disable KMSAN for memory encryption TUs x86/sev: Dump SEV_STATUS crypto: ccp - Have it depend on AMD_IOMMU iommu/amd: Fix failure return from snp_lookup_rmpentry() x86/sev: Fix position dependent variable references in startup code crypto: ccp: Make snp_range_list static x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT Documentation: virt: Fix up pre-formatted text block for SEV ioctls crypto: ccp: Add the SNP_SET_CONFIG command crypto: ccp: Add the SNP_COMMIT command crypto: ccp: Add the SNP_PLATFORM_STATUS command x86/cpufeatures: Enable/unmask SEV-SNP CPU feature KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown crypto: ccp: Handle legacy SEV commands when SNP is enabled crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled crypto: ccp: Handle the legacy TMR allocation when SNP is enabled x86/sev: Introduce an SNP leaked pages list crypto: ccp: Provide an API to issue SEV and SNP commands ...
2024-03-11Merge tag 'x86-fred-2024-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FRED support from Thomas Gleixner: "Support for x86 Fast Return and Event Delivery (FRED). FRED is a replacement for IDT event delivery on x86 and addresses most of the technical nightmares which IDT exposes: 1) Exception cause registers like CR2 need to be manually preserved in nested exception scenarios. 2) Hardware interrupt stack switching is suboptimal for nested exceptions as the interrupt stack mechanism rewinds the stack on each entry which requires a massive effort in the low level entry of #NMI code to handle this. 3) No hardware distinction between entry from kernel or from user which makes establishing kernel context more complex than it needs to be especially for unconditionally nestable exceptions like NMI. 4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a problem when the perf NMI takes a fault when collecting a stack trace. 5) Partial restore of ESP when returning to a 16-bit segment 6) Limitation of the vector space which can cause vector exhaustion on large systems. 7) Inability to differentiate NMI sources FRED addresses these shortcomings by: 1) An extended exception stack frame which the CPU uses to save exception cause registers. This ensures that the meta information for each exception is preserved on stack and avoids the extra complexity of preserving it in software. 2) Hardware interrupt stack switching is non-rewinding if a nested exception uses the currently interrupt stack. 3) The entry points for kernel and user context are separate and GS BASE handling which is required to establish kernel context for per CPU variable access is done in hardware. 4) NMIs are now nesting protected. They are only reenabled on the return from NMI. 5) FRED guarantees full restore of ESP 6) FRED does not put a limitation on the vector space by design because it uses a central entry points for kernel and user space and the CPUstores the entry type (exception, trap, interrupt, syscall) on the entry stack along with the vector number. The entry code has to demultiplex this information, but this removes the vector space restriction. The first hardware implementations will still have the current restricted vector space because lifting this limitation requires further changes to the local APIC. 7) FRED stores the vector number and meta information on stack which allows having more than one NMI vector in future hardware when the required local APIC changes are in place. The series implements the initial FRED support by: - Reworking the existing entry and IDT handling infrastructure to accomodate for the alternative entry mechanism. - Expanding the stack frame to accomodate for the extra 16 bytes FRED requires to store context and meta information - Providing FRED specific C entry points for events which have information pushed to the extended stack frame, e.g. #PF and #DB. - Providing FRED specific C entry points for #NMI and #MCE - Implementing the FRED specific ASM entry points and the C code to demultiplex the events - Providing detection and initialization mechanisms and the necessary tweaks in context switching, GS BASE handling etc. The FRED integration aims for maximum code reuse vs the existing IDT implementation to the extent possible and the deviation in hot paths like context switching are handled with alternatives to minimalize the impact. The low level entry and exit paths are seperate due to the extended stack frame and the hardware based GS BASE swichting and therefore have no impact on IDT based systems. It has been extensively tested on existing systems and on the FRED simulation and as of now there are no outstanding problems" * tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/fred: Fix init_task thread stack pointer initialization MAINTAINERS: Add a maintainer entry for FRED x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly x86/fred: Invoke FRED initialization code to enable FRED x86/fred: Add FRED initialization functions x86/syscall: Split IDT syscall setup code into idt_syscall_init() KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled x86/traps: Add sysvec_install() to install a system interrupt handler x86/fred: FRED entry/exit and dispatch code x86/fred: Add a machine check entry stub for FRED x86/fred: Add a NMI entry stub for FRED x86/fred: Add a debug fault entry stub for FRED x86/idtentry: Incorporate definitions/declarations of the FRED entries x86/fred: Make exc_page_fault() work for FRED x86/fred: Allow single-step trap and NMI when starting a new task x86/fred: No ESPFIX needed when FRED is enabled ...
2024-03-11Merge tag 'x86-apic-2024-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 APIC updates from Thomas Gleixner: "Rework of APIC enumeration and topology evaluation. The current implementation has a couple of shortcomings: - It fails to handle hybrid systems correctly. - The APIC registration code which handles CPU number assignents is in the middle of the APIC code and detached from the topology evaluation. - The various mechanisms which enumerate APICs, ACPI, MPPARSE and guest specific ones, tweak global variables as they see fit or in case of XENPV just hack around the generic mechanisms completely. - The CPUID topology evaluation code is sprinkled all over the vendor code and reevaluates global variables on every hotplug operation. - There is no way to analyze topology on the boot CPU before bringing up the APs. This causes problems for infrastructure like PERF which needs to size certain aspects upfront or could be simplified if that would be possible. - The APIC admission and CPU number association logic is incomprehensible and overly complex and needs to be kept around after boot instead of completing this right after the APIC enumeration. This update addresses these shortcomings with the following changes: - Rework the CPUID evaluation code so it is common for all vendors and provides information about the APIC ID segments in a uniform way independent of the number of segments (Thread, Core, Module, ..., Die, Package) so that this information can be computed instead of rewriting global variables of dubious value over and over. - A few cleanups and simplifcations of the APIC, IO/APIC and related interfaces to prepare for the topology evaluation changes. - Seperation of the parser stages so the early evaluation which tries to find the APIC address can be seperately overridden from the late evaluation which enumerates and registers the local APIC as further preparation for sanitizing the topology evaluation. - A new registration and admission logic which - encapsulates the inner workings so that parsers and guest logic cannot longer fiddle in it - uses the APIC ID segments to build topology bitmaps at registration time - provides a sane admission logic - allows to detect the crash kernel case, where CPU0 does not run on the real BSP, automatically. This is required to prevent sending INIT/SIPI sequences to the real BSP which would reset the whole machine. This was so far handled by a tedious command line parameter, which does not even work in nested crash scenarios. - Associates CPU number after the enumeration completed and prevents the late registration of APICs, which was somehow tolerated before. - Converting all parsers and guest enumeration mechanisms over to the new interfaces. This allows to get rid of all global variable tweaking from the parsers and enumeration mechanisms and sanitizes the XEN[PV] handling so it can use CPUID evaluation for the first time. - Mopping up existing sins by taking the information from the APIC ID segment bitmaps. This evaluates hybrid systems correctly on the boot CPU and allows for cleanups and fixes in the related drivers, e.g. PERF. The series has been extensively tested and the minimal late fallout due to a broken ACPI/MADT table has been addressed by tightening the admission logic further" * tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits) x86/topology: Ignore non-present APIC IDs in a present package x86/apic: Build the x86 topology enumeration functions on UP APIC builds too smp: Provide 'setup_max_cpus' definition on UP too smp: Avoid 'setup_max_cpus' namespace collision/shadowing x86/bugs: Use fixed addressing for VERW operand x86/cpu/topology: Get rid of cpuinfo::x86_max_cores x86/cpu/topology: Provide __num_[cores|threads]_per_package x86/cpu/topology: Rename topology_max_die_per_package() x86/cpu/topology: Rename smp_num_siblings x86/cpu/topology: Retrieve cores per package from topology bitmaps x86/cpu/topology: Use topology logical mapping mechanism x86/cpu/topology: Provide logical pkg/die mapping x86/cpu/topology: Simplify cpu_mark_primary_thread() x86/cpu/topology: Mop up primary thread mask handling x86/cpu/topology: Use topology bitmaps for sizing x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT x86/xen/smp_pv: Count number of vCPUs early x86/cpu/topology: Assign hotpluggable CPUIDs during init x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug x86/topology: Add a mechanism to track topology via APIC IDs ...
2024-03-11Merge tag 'timers-core-2024-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "A large set of updates and features for timers and timekeeping: - The hierarchical timer pull model When timer wheel timers are armed they are placed into the timer wheel of a CPU which is likely to be busy at the time of expiry. This is done to avoid wakeups on potentially idle CPUs. This is wrong in several aspects: 1) The heuristics to select the target CPU are wrong by definition as the chance to get the prediction right is close to zero. 2) Due to #1 it is possible that timers are accumulated on a single target CPU 3) The required computation in the enqueue path is just overhead for dubious value especially under the consideration that the vast majority of timer wheel timers are either canceled or rearmed before they expire. The timer pull model avoids the above by removing the target computation on enqueue and queueing timers always on the CPU on which they get armed. This is achieved by having separate wheels for CPU pinned timers and global timers which do not care about where they expire. As long as a CPU is busy it handles both the pinned and the global timers which are queued on the CPU local timer wheels. When a CPU goes idle it evaluates its own timer wheels: - If the first expiring timer is a pinned timer, then the global timers can be ignored as the CPU will wake up before they expire. - If the first expiring timer is a global timer, then the expiry time is propagated into the timer pull hierarchy and the CPU makes sure to wake up for the first pinned timer. The timer pull hierarchy organizes CPUs in groups of eight at the lowest level and at the next levels groups of eight groups up to the point where no further aggregation of groups is required, i.e. the number of levels is log8(NR_CPUS). The magic number of eight has been established by experimention, but can be adjusted if needed. In each group one busy CPU acts as the migrator. It's only one CPU to avoid lock contention on remote timer wheels. The migrator CPU checks in its own timer wheel handling whether there are other CPUs in the group which have gone idle and have global timers to expire. If there are global timers to expire, the migrator locks the remote CPU timer wheel and handles the expiry. Depending on the group level in the hierarchy this handling can require to walk the hierarchy downwards to the CPU level. Special care is taken when the last CPU goes idle. At this point the CPU is the systemwide migrator at the top of the hierarchy and it therefore cannot delegate to the hierarchy. It needs to arm its own timer device to expire either at the first expiring timer in the hierarchy or at the first CPU local timer, which ever expires first. This completely removes the overhead from the enqueue path, which is e.g. for networking a true hotpath and trades it for a slightly more complex idle path. This has been in development for a couple of years and the final series has been extensively tested by various teams from silicon vendors and ran through extensive CI. There have been slight performance improvements observed on network centric workloads and an Intel team confirmed that this allows them to power down a die completely on a mult-die socket for the first time in a mostly idle scenario. There is only one outstanding ~1.5% regression on a specific overloaded netperf test which is currently investigated, but the rest is either positive or neutral performance wise and positive on the power management side. - Fixes for the timekeeping interpolation code for cross-timestamps: cross-timestamps are used for PTP to get snapshots from hardware timers and interpolated them back to clock MONOTONIC. The changes address a few corner cases in the interpolation code which got the math and logic wrong. - Simplifcation of the clocksource watchdog retry logic to automatically adjust to handle larger systems correctly instead of having more incomprehensible command line parameters. - Treewide consolidation of the VDSO data structures. - The usual small improvements and cleanups all over the place" * tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits) timer/migration: Fix quick check reporting late expiry tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n vdso/datapage: Quick fix - use asm/page-def.h for ARM64 timers: Assert no next dyntick timer look-up while CPU is offline tick: Assume timekeeping is correctly handed over upon last offline idle call tick: Shut down low-res tick from dying CPU tick: Split nohz and highres features from nohz_mode tick: Move individual bit features to debuggable mask accesses tick: Move got_idle_tick away from common flags tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING tick: Move tick cancellation up to CPUHP_AP_TICK_DYING tick: Start centralizing tick related CPU hotplug operations tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick() tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick() tick: Use IS_ENABLED() whenever possible tick/sched: Remove useless oneshot ifdeffery tick/nohz: Remove duplicate between lowres and highres handlers tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer() hrtimer: Select housekeeping CPU during migration ...
2024-03-11x86/rfds: Mitigate Register File Data Sampling (RFDS)Pawan Gupta
RFDS is a CPU vulnerability that may allow userspace to infer kernel stale data previously used in floating point registers, vector registers and integer registers. RFDS only affects certain Intel Atom processors. Intel released a microcode update that uses VERW instruction to clear the affected CPU buffers. Unlike MDS, none of the affected cores support SMT. Add RFDS bug infrastructure and enable the VERW based mitigation by default, that clears the affected buffers just before exiting to userspace. Also add sysfs reporting and cmdline parameter "reg_file_data_sampling" to control the mitigation. For details see: Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-03-11Merge tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: "This cycle, a lot of workqueue changes including some that are significant and invasive. - During v6.6 cycle, unbound workqueues were updated so that they are more topology aware and flexible, which among other things improved workqueue behavior on modern multi-L3 CPUs. In the process, commit 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") switched unbound workqueues to use per-CPU frontend pool_workqueues as a part of increasing front-back mapping flexibility. An unwelcome side effect of this change was that this made max concurrency enforcement per-CPU blowing up the maximum number of allowed concurrent executions. I incorrectly assumed that this wouldn't cause practical problems as most unbound workqueue users are self-regulate max concurrency; however, there definitely are which don't (e.g. on IO paths) and the drastic increase in the allowed max concurrency led to noticeable perf regressions in some use cases. This is now addressed by separating out max concurrency enforcement to a separate struct - wq_node_nr_active - which makes @max_active consistently mean system-wide max concurrency regardless of the number of CPUs or (finally) NUMA nodes. This is a rather invasive and, in places, a bit clunky; however, the clunkiness rises from the the inherent requirement to handle the disagreement between the execution locality domain and max concurrency enforcement domain on some modern machines. See commit 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues") for more details. - BH workqueue support is added. They are similar to per-CPU workqueues but execute work items in the softirq context. This is expected to replace tasklet. However, currently, it's missing the ability to disable and enable work items which is needed to convert many tasklet users. To avoid crowding this merge window too much, this will be included in the next merge window. A separate pull request will be sent for the couple conversion patches that are currently pending. - Waiman plugged a long-standing hole in workqueue CPU isolation where ordered workqueues didn't follow wq_unbound_cpumask updates. Ordered workqueues now follow the same rules as other unbound workqueues. - More CPU isolation improvements: Juri fixed another deficit in workqueue isolation where unbound rescuers don't respect wq_unbound_cpumask. Leonardo fixed delayed_work timers firing on isolated CPUs. - Other misc changes" * tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (54 commits) workqueue: Drain BH work items on hot-unplugged CPUs workqueue: Introduce from_work() helper for cleaner callback declarations workqueue: Control intensive warning threshold through cmdline workqueue: Make @flags handling consistent across set_work_data() and friends workqueue: Remove clear_work_data() workqueue: Factor out work_grab_pending() from __cancel_work_sync() workqueue: Clean up enum work_bits and related constants workqueue: Introduce work_cancel_flags workqueue: Use variable name irq_flags for saving local irq flags workqueue: Reorganize flush and cancel[_sync] functions workqueue: Rename __cancel_work_timer() to __cancel_timer_sync() workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held() workqueue: Cosmetic changes workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK workqueue: Fix queue_work_on() with BH workqueues async: Use a dedicated unbound workqueue with raised min_active workqueue: Implement workqueue_set_min_active() workqueue: Fix kernel-doc comment of unplug_oldest_pwq() workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask kernel/workqueue: Let rescuers follow unbound wq cpumask changes ...
2024-03-11Merge branch 'pm-cpufreq'Rafael J. Wysocki
Merge cpufreq changes for 6.9-rc1: - Enable preferred core support in the amd-pstate cpufreq driver (Meng Li). - Fix min_perf assignment in amd_pstate_adjust_perf() and make the min/max limit perf values in amd-pstate always stay within the (highest perf, lowest perf) range (Tor Vic, Meng Li). - Change default transition delay in cpufreq to 2ms (Qais Yousef). - Drop long-unused cpudata::prev_cummulative_iowait from the intel_pstate cpufreq driver (Jiri Slaby). - Allow intel_pstate to assign model-specific values to strings used in the EPP sysfs interface and make it do so on Meteor Lake (Srinivas Pandruvada). - Remove references to 10ms minimum sampling rate from comments in the cpufreq code (Pierre Gondois). - Prevent scaling_cur_freq from exceeding scaling_max_freq when the latter is an inefficient frequency (Shivnandan Kumar). - Honour transition_latency over transition_delay_us in cpufreq (Qais Yousef). - Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar). - General enhancements / cleanups to ARM cpufreq drivers (tianyu2, Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia Belova). - Update cpufreq-dt-platdev to block/approve devices (Richard Acayan). - Make the SCMI cpufreq driver get a transition delay value from firmware (Pierre Gondois). * pm-cpufreq: (28 commits) cpufreq: scmi: Set transition_delay_us firmware: arm_scmi: Populate fast channel rate_limit firmware: arm_scmi: Populate perf commands rate_limit cpufreq: Don't unregister cpufreq cooling on CPU hotplug cpufreq: Honour transition_latency over transition_delay_us cpufreq: Limit resolving a frequency to policy min/max cpufreq: amd-pstate: adjust min/max limit perf cpufreq: Remove references to 10ms min sampling rate cpufreq: intel_pstate: Update default EPPs for Meteor Lake cpufreq: intel_pstate: Allow model specific EPPs cpufreq: qcom-hw: add CONFIG_COMMON_CLK dependency cpufreq: dt-platdev: block SDM670 in cpufreq-dt-platdev cpufreq: intel_pstate: remove cpudata::prev_cummulative_iowait cpufreq: Change default transition delay to 2ms cpufreq: amd-pstate: Fix min_perf assignment in amd_pstate_adjust_perf() Documentation: PM: amd-pstate: Fix section title underline Documentation: introduce amd-pstate preferrd core mode kernel command line options Documentation: amd-pstate: introduce amd-pstate preferred core cpufreq: amd-pstate: Update amd-pstate preferred core ranking dynamically ACPI: cpufreq: Add highest perf change notification ...
2024-02-26Merge branches 'rcu-doc.2024.02.14a', 'rcu-nocb.2024.02.14a', ↵Boqun Feng
'rcu-exp.2024.02.14a', 'rcu-tasks.2024.02.26a' and 'rcu-misc.2024.02.14a' into rcu.2024.02.26a