Age | Commit message (Collapse) | Author |
|
KVM PMU change for 6.7:
- Handle NMI/SMI requests after PMU/PMI requests so that a PMI=>NMI doesn't
require redoing the entire run loop due to the NMI not being detected until
the final kvm_vcpu_exit_request() check before entering the guest.
|
|
KVM x86 Xen changes for 6.7:
- Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.
- Use the fast path directly from the timer callback when delivering Xen timer
events. Avoid the problematic races with using the fast path by ensuring
the hrtimer isn't running when (re)starting the timer or saving the timer
information (for userspace).
- Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag.
|
|
KVM x86 MMU changes for 6.7:
- Clean up code that deals with honoring guest MTRRs when the VM has
non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.
- Zap EPT entries when non-coherent DMA assignment stops/start to prevent
using stale entries with the wrong memtype.
- Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y, as
there's zero reason to ignore guest PAT if the effective MTRR memtype is WB.
This will also allow for future optimizations of handling guest MTRR updates
for VMs with non-coherent DMA and the quirk enabled.
- Harden the fast page fault path to guard against encountering an invalid
root when walking SPTEs.
|
|
KVM x86 misc changes for 6.7:
- Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without
forcing more common use cases to eat the extra memory overhead.
- Add IBPB and SBPB virtualization support.
- Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
creating the original vCPU would cause KVM to try to synchronize the vCPU's
TSC and thus clobber the correct TSC being set by userspace.
- Compute guest wall clock using a single TSC read to avoid generating an
inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.
- "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
about a "Firmware Bug" if the bit isn't set for select F/M/S combos.
- Don't apply side effects to Hyper-V's synthetic timer on writes from
userspace to fix an issue where the auto-enable behavior can trigger
spurious interrupts, i.e. do auto-enabling only for guest writes.
- Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
without PML enabled.
- Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.
- Use octal notation for file permissions through KVM x86.
- Fix a handful of typo fixes and warts.
|
|
KVM x86 APIC changes for 6.7:
- Purge VMX's posted interrupt descriptor *before* loading APIC state when
handling KVM_SET_LAPIC. Purging the PID after loading APIC state results in
lost APIC timer IRQs as the APIC timer can be armed as part of loading APIC
state, i.e. can immediately pend an IRQ if the expiry is in the past.
- Clear the ICR.BUSY bit when handling trap-like x2APIC writes. This avoids a
WARN, due to KVM expecting the BUSY bit to be cleared when sending IPIs.
|
|
Pull rust updates from Miguel Ojeda:
"A small one compared to the previous one in terms of features. In
terms of lines, as usual, the 'alloc' version upgrade accounts for
most of them.
Toolchain and infrastructure:
- Upgrade to Rust 1.73.0
This time around, due to how the kernel and Rust schedules have
aligned, there are two upgrades in fact. They contain the fixes for
a few issues we reported to the Rust project.
In addition, a few cleanups indicated by the upgraded compiler or
possible thanks to it. For instance, the compiler now detects
redundant explicit links.
- A couple changes to the Rust 'Makefile' so that it can be used with
toybox tools, allowing Rust to be used in the Android kernel build.
x86:
- Enable IBT if enabled in C
Documentation:
- Add "The Rust experiment" section to the Rust index page
MAINTAINERS:
- Add Maintainer Entry Profile field ('P:').
- Update our 'W:' field to point to the webpage we have been building
this year"
* tag 'rust-6.7' of https://github.com/Rust-for-Linux/linux:
docs: rust: add "The Rust experiment" section
x86: Enable IBT in Rust if enabled in C
rust: Use grep -Ev rather than relying on GNU grep
rust: Use awk instead of recent xargs
rust: upgrade to Rust 1.73.0
rust: print: use explicit link in documentation
rust: task: remove redundant explicit link
rust: kernel: remove `#[allow(clippy::new_ret_no_self)]`
MAINTAINERS: add Maintainer Entry Profile field for Rust
MAINTAINERS: update Rust webpage
rust: upgrade to Rust 1.72.1
rust: arc: add explicit `drop()` around `Box::from_raw()`
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"One of the more voluminous set of changes is for adding the new
__counted_by annotation[1] to gain run-time bounds checking of
dynamically sized arrays with UBSan.
- Add LKDTM test for stuck CPUs (Mark Rutland)
- Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
- Refactor more 1-element arrays into flexible arrays (Gustavo A. R.
Silva)
- Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem
Shaikh)
- Convert group_info.usage to refcount_t (Elena Reshetova)
- Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
- Add Kconfig fragment for basic hardening options (Kees Cook, Lukas
Bulwahn)
- Fix randstruct GCC plugin performance mode to stay in groups (Kees
Cook)
- Fix strtomem() compile-time check for small sources (Kees Cook)"
* tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (56 commits)
hwmon: (acpi_power_meter) replace open-coded kmemdup_nul
reset: Annotate struct reset_control_array with __counted_by
kexec: Annotate struct crash_mem with __counted_by
virtio_console: Annotate struct port_buffer with __counted_by
ima: Add __counted_by for struct modsig and use struct_size()
MAINTAINERS: Include stackleak paths in hardening entry
string: Adjust strtomem() logic to allow for smaller sources
hardening: x86: drop reference to removed config AMD_IOMMU_V2
randstruct: Fix gcc-plugin performance mode to stay in group
mailbox: zynqmp: Annotate struct zynqmp_ipi_pdata with __counted_by
drivers: thermal: tsens: Annotate struct tsens_priv with __counted_by
irqchip/imx-intmux: Annotate struct intmux_data with __counted_by
KVM: Annotate struct kvm_irq_routing_table with __counted_by
virt: acrn: Annotate struct vm_memory_region_batch with __counted_by
hwmon: Annotate struct gsc_hwmon_platform_data with __counted_by
sparc: Annotate struct cpuinfo_tree with __counted_by
isdn: kcapi: replace deprecated strncpy with strscpy_pad
isdn: replace deprecated strncpy with strscpy
NFS/flexfiles: Annotate struct nfs4_ff_layout_segment with __counted_by
nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull RCU updates from Frederic Weisbecker:
- RCU torture, locktorture and generic torture infrastructure updates
that include various fixes, cleanups and consolidations.
Among the user visible things, ftrace dumps can now be found into
their own file, and module parameters get better documented and
reported on dumps.
- Generic and misc fixes all over the place. Some highlights:
* Hotplug handling has seen some light cleanups and comments
* An RCU barrier can now be triggered through sysfs to serialize
memory stress testing and avoid OOM
* Object information is now dumped in case of invalid callback
invocation
* Also various SRCU issues, too hard to trigger to deserve urgent
pull requests, have been fixed
- RCU documentation updates
- RCU reference scalability test minor fixes and doc improvements.
- RCU tasks minor fixes
- Stall detection updates. Introduce RCU CPU Stall notifiers that
allows a subsystem to provide informations to help debugging. Also
cure some false positive stalls.
* tag 'rcu-next-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (56 commits)
srcu: Only accelerate on enqueue time
locktorture: Check the correct variable for allocation failure
srcu: Fix callbacks acceleration mishandling
rcu: Comment why callbacks migration can't wait for CPUHP_RCUTREE_PREP
rcu: Standardize explicit CPU-hotplug calls
rcu: Conditionally build CPU-hotplug teardown callbacks
rcu: Remove references to rcu_migrate_callbacks() from diagrams
rcu: Assume rcu_report_dead() is always called locally
rcu: Assume IRQS disabled from rcu_report_dead()
rcu: Use rcu_segcblist_segempty() instead of open coding it
rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects
srcu: Fix srcu_struct node grpmask overflow on 64-bit systems
torture: Convert parse-console.sh to mktemp
rcutorture: Traverse possible cpu to set maxcpu in rcu_nocb_toggle()
rcutorture: Replace schedule_timeout*() 1-jiffy waits with HZ/20
torture: Add kvm.sh --debug-info argument
locktorture: Rename readers_bind/writers_bind to bind_readers/bind_writers
doc: Catch-up update for locktorture module parameters
locktorture: Add call_rcu_chains module parameter
locktorture: Add new module parameters to lock_torture_print_module_parms()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Thomas Gleixner:
- Limit the hardcoded topology quirk for Hygon CPUs to those which have
a model ID less than 4.
The newer models have the topology CPUID leaf 0xB correctly
implemented and are not affected.
- Make SMT control more robust against enumeration failures
SMT control was added to allow controlling SMT at boottime or
runtime. The primary purpose was to provide a simple mechanism to
disable SMT in the light of speculation attack vectors.
It turned out that the code is sensible to enumeration failures and
worked only by chance for XEN/PV. XEN/PV has no real APIC enumeration
which means the primary thread mask is not set up correctly. By
chance a XEN/PV boot ends up with smp_num_siblings == 2, which makes
the hotplug control stay at its default value "enabled". So the mask
is never evaluated.
The ongoing rework of the topology evaluation caused XEN/PV to end up
with smp_num_siblings == 1, which sets the SMT control to "not
supported" and the empty primary thread mask causes the hotplug core
to deny the bringup of the APS.
Make the decision logic more robust and take 'not supported' and 'not
implemented' into account for the decision whether a CPU should be
booted or not.
- Fake primary thread mask for XEN/PV
Pretend that all XEN/PV vCPUs are primary threads, which makes the
usage of the primary thread mask valid on XEN/PV. That is consistent
with because all of the topology information on XEN/PV is fake or
even non-existent.
- Encapsulate topology information in cpuinfo_x86
Move the randomly scattered topology data into a separate data
structure for readability and as a preparatory step for the topology
evaluation overhaul.
- Consolidate APIC ID data type to u32
It's fixed width hardware data and not randomly u16, int, unsigned
long or whatever developers decided to use.
- Cure the abuse of cpuinfo for persisting logical IDs.
Per CPU cpuinfo is used to persist the logical package and die IDs.
That's really not the right place simply because cpuinfo is subject
to be reinitialized when a CPU goes through an offline/online cycle.
Use separate per CPU data for the persisting to enable the further
topology management rework. It will be removed once the new topology
management is in place.
- Provide a debug interface for inspecting topology information
Useful in general and extremly helpful for validating the topology
management rework in terms of correctness or "bug" compatibility.
* tag 'x86-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/apic, x86/hyperv: Use u32 in hv_snp_boot_ap() too
x86/cpu: Provide debug interface
x86/cpu/topology: Cure the abuse of cpuinfo for persisting logical ids
x86/apic: Use u32 for wakeup_secondary_cpu[_64]()
x86/apic: Use u32 for [gs]et_apic_id()
x86/apic: Use u32 for phys_pkg_id()
x86/apic: Use u32 for cpu_present_to_apicid()
x86/apic: Use u32 for check_apicid_used()
x86/apic: Use u32 for APIC IDs in global data
x86/apic: Use BAD_APICID consistently
x86/cpu: Move cpu_l[l2]c_id into topology info
x86/cpu: Move logical package and die IDs into topology info
x86/cpu: Remove pointless evaluation of x86_coreid_bits
x86/cpu: Move cu_id into topology info
x86/cpu: Move cpu_core_id into topology info
hwmon: (fam15h_power) Use topology_core_id()
scsi: lpfc: Use topology_core_id()
x86/cpu: Move cpu_die_id into topology info
x86/cpu: Move phys_proc_id into topology info
x86/cpu: Encapsulate topology information in cpuinfo_x86
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
- Make the quirk for non-maskable MSI interrupts in the affinity setter
functional again.
It was broken by a MSI core code update, which restructured the code
in a way that the quirk flag was not longer set correctly.
Trying to restore the core logic caused a deeper inspection and it
turned out that the extra quirk flag is not required at all because
it's the inverse of the reservation mode bit, which only can be set
when the MSI interrupt is maskable.
So the trivial fix is to use the reservation mode check in the
affinity setter function and remove almost 40 lines of code related
to the no-mask quirk flag.
- Cure a Kconfig dependency issue which causes compile failures by
correcting the conditionals in the affected header files.
- Clean up coding style in the UV APIC driver.
* tag 'x86-apic-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic/msi: Fix misconfigured non-maskable MSI quirk
x86/msi: Fix compile error caused by CONFIG_GENERIC_MSI_IRQ=y && !CONFIG_X86_LOCAL_APIC
x86/platform/uv/apic: Clean up inconsistent indenting
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build update from Ingo Molnar:
"Enable CONFIG_DEBUG_ENTRY=y in the x86 defconfigs"
* tag 'x86-build-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/defconfig: Enable CONFIG_DEBUG_ENTRY=y
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm handling updates from Ingo Molnar:
- Add new NX-stack self-test
- Improve NUMA partial-CFMWS handling
- Fix #VC handler bugs resulting in SEV-SNP boot failures
- Drop the 4MB memory size restriction on minimal NUMA nodes
- Reorganize headers a bit, in preparation to header dependency
reduction efforts
- Misc cleanups & fixes
* tag 'x86-mm-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size
selftests/x86/lam: Zero out buffer for readlink()
x86/sev: Drop unneeded #include
x86/sev: Move sev_setup_arch() to mem_encrypt.c
x86/tdx: Replace deprecated strncpy() with strtomem_pad()
selftests/x86/mm: Add new test that userspace stack is in fact NX
x86/sev: Make boot_ghcb_page[] static
x86/boot: Move x86_cache_alignment initialization to correct spot
x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach
x86/sev-es: Allow copy_from_kernel_nofault() in earlier boot
x86_64: Show CR4.PSE on auxiliaries like on BSP
x86/iommu/docs: Update AMD IOMMU specification document URL
x86/sev/docs: Update document URL in amd-memory-encryption.rst
x86/mm: Move arch_memory_failure() and arch_is_platform_page() definitions from <asm/processor.h> to <asm/pgtable.h>
ACPI/NUMA: Apply SRAT proximity domain to entire CFMWS window
x86/numa: Introduce numa_fill_memblks()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 irq fix from Ingo Molnar:
"Fix out-of-order NMI nesting checks resulting in false positive
warnings"
* tag 'x86-irq-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Fix out-of-order NMI nesting checks & false positive warning
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry updates from Ingo Molnar:
- Make IA32_EMULATION boot time configurable with
the new ia32_emulation=<bool> boot option
- Clean up fast syscall return validation code: convert
it to C and refactor the code
- As part of this, optimize the canonical RIP test code
* tag 'x86-entry-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/32: Clean up syscall fast exit tests
x86/entry/64: Use TASK_SIZE_MAX for canonical RIP test
x86/entry/64: Convert SYSRET validation tests to C
x86/entry/32: Remove SEP test for SYSEXIT
x86/entry/32: Convert do_fast_syscall_32() to bool return type
x86/entry/compat: Combine return value test from syscall handler
x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
x86: Make IA32_EMULATION boot time configurable
x86/entry: Make IA32 syscalls' availability depend on ia32_enabled()
x86/elf: Make loading of 32bit processes depend on ia32_enabled()
x86/entry: Compile entry_SYSCALL32_ignore() unconditionally
x86/entry: Rename ignore_sysret()
x86: Introduce ia32_enabled()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 assembly code updates from Ingo Molnar:
- Micro-optimize the x86 bitops code
- Define target-specific {raw,this}_cpu_try_cmpxchg{64,128}() to
improve code generation
- Define and use raw_cpu_try_cmpxchg() preempt_count_set()
- Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op
- Remove the unused __sw_hweight64() implementation on x86-32
- Misc fixes and cleanups
* tag 'x86-asm-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/lib: Address kernel-doc warnings
x86/entry: Fix typos in comments
x86/entry: Remove unused argument %rsi passed to exc_nmi()
x86/bitops: Remove unused __sw_hweight64() assembly implementation on x86-32
x86/percpu: Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op
x86/percpu: Use raw_cpu_try_cmpxchg() in preempt_count_set()
x86/percpu: Define raw_cpu_try_cmpxchg and this_cpu_try_cmpxchg()
x86/percpu: Define {raw,this}_cpu_try_cmpxchg{64,128}
x86/asm/bitops: Use __builtin_clz{l|ll} to evaluate constant expressions
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
- Rework PE header generation, primarily to generate a modern, 4k
aligned kernel image view with narrower W^X permissions.
- Further refine init-lifetime annotations
- Misc cleanups & fixes
* tag 'x86-boot-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/boot: efistub: Assign global boot_params variable
x86/boot: Rename conflicting 'boot_params' pointer to 'boot_params_ptr'
x86/head/64: Move the __head definition to <asm/init.h>
x86/head/64: Add missing __head annotation to startup_64_load_idt()
x86/head/64: Mark 'startup_gdt[]' and 'startup_gdt_descr' as __initdata
x86/boot: Harmonize the style of array-type parameter for fixup_pointer() calls
x86/boot: Fix incorrect startup_gdt_descr.size
x86/boot: Compile boot code with -std=gnu11 too
x86/boot: Increase section and file alignment to 4k/512
x86/boot: Split off PE/COFF .data section
x86/boot: Drop PE/COFF .reloc section
x86/boot: Construct PE/COFF .text section from assembler
x86/boot: Derive file size from _edata symbol
x86/boot: Define setup size in linker script
x86/boot: Set EFI handover offset directly in header asm
x86/boot: Grab kernel_info offset from zoffset header directly
x86/boot: Drop references to startup_64
x86/boot: Drop redundant code setting the root device
x86/boot: Omit compression buffer from PE/COFF image memory footprint
x86/boot: Remove the 'bugger off' message
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 header file cleanup from Ingo Molnar:
"Replace <asm/export.h> uses with <linux/export.h> and then remove
<asm/export.h>"
* tag 'x86-headers-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/headers: Remove <asm/export.h>
x86/headers: Replace #include <asm/export.h> with #include <linux/export.h>
x86/headers: Remove unnecessary #include <asm/export.h>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance event updates from Ingo Molnar:
- Add AMD Unified Memory Controller (UMC) events introduced with Zen 4
- Simplify & clean up the uncore management code
- Fall back from RDPMC to RDMSR on certain uncore PMUs
- Improve per-package and cstate event reading
- Extend the Intel ref-cycles event to GP counters
- Fix Intel MTL event constraints
- Improve the Intel hybrid CPU handling code
- Micro-optimize the RAPL code
- Optimize perf_cgroup_switch()
- Improve large AUX area error handling
- Misc fixes and cleanups
* tag 'perf-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
perf/x86/amd/uncore: Pass through error code for initialization failures, instead of -ENODEV
perf/x86/amd/uncore: Fix uninitialized return value in amd_uncore_init()
x86/cpu: Fix the AMD Fam 17h, Fam 19h, Zen2 and Zen4 MSR enumerations
perf: Optimize perf_cgroup_switch()
perf/x86/amd/uncore: Add memory controller support
perf/x86/amd/uncore: Add group exclusivity
perf/x86/amd/uncore: Use rdmsr if rdpmc is unavailable
perf/x86/amd/uncore: Move discovery and registration
perf/x86/amd/uncore: Refactor uncore management
perf/core: Allow reading package events from perf_event_read_local
perf/x86/cstate: Allow reading the package statistics from local CPU
perf/x86/intel/pt: Fix kernel-doc comments
perf/x86/rapl: Annotate 'struct rapl_pmus' with __counted_by
perf/core: Rename perf_proc_update_handler() -> perf_event_max_sample_rate_handler(), for readability
perf/x86/rapl: Fix "Using plain integer as NULL pointer" Sparse warning
perf/x86/rapl: Use local64_try_cmpxchg in rapl_event_update()
perf/x86/rapl: Stop doing cpu_relax() in the local64_cmpxchg() loop in rapl_event_update()
perf/core: Bail out early if the request AUX area is out of bound
perf/x86/intel: Extend the ref-cycles event to GP counters
perf/x86/intel: Fix broken fixed event constraints extension
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
"Misc fixes and cleanups:
- Fix potential MAX_NAME_LEN limit related build failures
- Fix scripts/faddr2line symbol filtering bug
- Fix scripts/faddr2line on LLVM=1
- Fix scripts/faddr2line to accept readelf output with mapping
symbols
- Minor cleanups"
* tag 'objtool-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
scripts/faddr2line: Skip over mapping symbols in output from readelf
scripts/faddr2line: Use LLVM addr2line and readelf if LLVM=1
scripts/faddr2line: Don't filter out non-function symbols from readelf
objtool: Remove max symbol name length limitation
objtool: Propagate early errors
objtool: Use 'the fallthrough' pseudo-keyword
x86/speculation, objtool: Use absolute relocations for annotations
x86/unwind/orc: Remove redundant initialization of 'mid' pointer in __orc_find()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduler (SCHED_OTHER) improvements:
- Remove the old and now unused SIS_PROP code & option
- Scan cluster before LLC in the wake-up path
- Use candidate prev/recent_used CPU if scanning failed for cluster
wakeup
NUMA scheduling improvements:
- Improve the VMA access-PID code to better skip/scan VMAs
- Extend tracing to cover VMA-skipping decisions
- Improve/fix the recently introduced sched_numa_find_nth_cpu() code
- Generalize numa_map_to_online_node()
Energy scheduling improvements:
- Remove the EM_MAX_COMPLEXITY limit
- Add tracepoints to track energy computation
- Make the behavior of the 'sched_energy_aware' sysctl more
consistent
- Consolidate and clean up access to a CPU's max compute capacity
- Fix uclamp code corner cases
RT scheduling improvements:
- Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
- Drive the ->rto_mask with rt_rq->pushable_tasks updates
Scheduler scalability improvements:
- Rate-limit updates to tg->load_avg
- On x86 disable IBRS when CPU is offline to improve single-threaded
performance
- Micro-optimize in_task() and in_interrupt()
- Micro-optimize the PSI code
- Avoid updating PSI triggers and ->rtpoll_total when there are no
state changes
Core scheduler infrastructure improvements:
- Use saved_state to reduce some spurious freezer wakeups
- Bring in a handful of fast-headers improvements to scheduler
headers
- Make the scheduler UAPI headers more widely usable by user-space
- Simplify the control flow of scheduler syscalls by using lock
guards
- Fix sched_setaffinity() vs. CPU hotplug race
Scheduler debuggability improvements:
- Disallow writing invalid values to sched_rt_period_us
- Fix a race in the rq-clock debugging code triggering warnings
- Fix a warning in the bandwidth distribution code
- Micro-optimize in_atomic_preempt_off() checks
- Enforce that the tasklist_lock is held in for_each_thread()
- Print the TGID in sched_show_task()
- Remove the /proc/sys/kernel/sched_child_runs_first sysctl
... and misc cleanups & fixes"
* tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
sched/fair: Remove SIS_PROP
sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
sched/fair: Scan cluster before scanning LLC in wake-up path
sched: Add cpus_share_resources API
sched/core: Fix RQCF_ACT_SKIP leak
sched/fair: Remove unused 'curr' argument from pick_next_entity()
sched/nohz: Update comments about NEWILB_KICK
sched/fair: Remove duplicate #include
sched/psi: Update poll => rtpoll in relevant comments
sched: Make PELT acronym definition searchable
sched: Fix stop_one_cpu_nowait() vs hotplug
sched/psi: Bail out early from irq time accounting
sched/topology: Rename 'DIE' domain to 'PKG'
sched/psi: Delete the 'update_total' function parameter from update_triggers()
sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
sched/headers: Remove comment referring to rq::cpu_load, since this has been removed
sched/numa: Complete scanning of inactive VMAs when there is no alternative
sched/numa: Complete scanning of partial VMAs regardless of PID activity
sched/numa: Move up the access pid reset logic
sched/numa: Trace decisions related to skipping VMAs
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Info Molnar:
"Futex improvements:
- Add the 'futex2' syscall ABI, which is an attempt to get away from
the multiplex syscall and adds a little room for extentions, while
lifting some limitations.
- Fix futex PI recursive rt_mutex waiter state bug
- Fix inter-process shared futexes on no-MMU systems
- Use folios instead of pages
Micro-optimizations of locking primitives:
- Improve arch_spin_value_unlocked() on asm-generic ticket spinlock
architectures, to improve lockref code generation
- Improve the x86-32 lockref_get_not_zero() main loop by adding
build-time CMPXCHG8B support detection for the relevant lockref
code, and by better interfacing the CMPXCHG8B assembly code with
the compiler
- Introduce arch_sync_try_cmpxchg() on x86 to improve
sync_try_cmpxchg() code generation. Convert some sync_cmpxchg()
users to sync_try_cmpxchg().
- Micro-optimize rcuref_put_slowpath()
Locking debuggability improvements:
- Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well
- Enforce atomicity of sched_submit_work(), which is de-facto atomic
but was un-enforced previously.
- Extend <linux/cleanup.h>'s no_free_ptr() with __must_check
semantics
- Fix ww_mutex self-tests
- Clean up const-propagation in <linux/seqlock.h> and simplify the
API-instantiation macros a bit
RT locking improvements:
- Provide the rt_mutex_*_schedule() primitives/helpers and use them
in the rtmutex code to avoid recursion vs. rtlock on the PI state.
- Add nested blocking lockdep asserts to rt_mutex_lock(),
rtlock_lock() and rwbase_read_lock()
.. plus misc fixes & cleanups"
* tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
futex: Don't include process MM in futex key on no-MMU
locking/seqlock: Fix grammar in comment
alpha: Fix up new futex syscall numbers
locking/seqlock: Propagate 'const' pointers within read-only methods, remove forced type casts
locking/lockdep: Fix string sizing bug that triggers a format-truncation compiler-warning
locking/seqlock: Change __seqprop() to return the function pointer
locking/seqlock: Simplify SEQCOUNT_LOCKNAME()
locking/atomics: Use atomic_try_cmpxchg_release() to micro-optimize rcuref_put_slowpath()
locking/atomic, xen: Use sync_try_cmpxchg() instead of sync_cmpxchg()
locking/atomic/x86: Introduce arch_sync_try_cmpxchg()
locking/atomic: Add generic support for sync_try_cmpxchg() and its fallback
locking/seqlock: Fix typo in comment
futex/requeue: Remove unnecessary ‘NULL’ initialization from futex_proxy_trylock_atomic()
locking/local, arch: Rewrite local_add_unless() as a static inline function
locking/debug: Fix debugfs API return value checks to use IS_ERR()
locking/ww_mutex/test: Make sure we bail out instead of livelock
locking/ww_mutex/test: Fix potential workqueue corruption
locking/ww_mutex/test: Use prng instead of rng to avoid hangs at bootup
futex: Add sys_futex_requeue()
futex: Add flags2 argument to futex_requeue()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu fixlet from Borislav Petkov:
- kernel-doc fix
* tag 'x86_fpu_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu/xstate: Address kernel-doc warning
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Borislav Petkov:
- Make sure PCI function 4 IDs of AMD family 0x19, models 0x60-0x7f are
actually used in the amd_nb.c enumeration
- Add support for extracting NUMA information from devicetree for
Hyper-V usages
- Add PCI device IDs for the new AMD MI300 AI accelerators
- Annotate an array in struct uv_rtc_timer_head with the new
__counted_by attribute
- Rework UV's NMI action parameter handling
* tag 'x86_platform_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/amd_nb: Use Family 19h Models 60h-7Fh Function 4 IDs
x86/numa: Add Devicetree support
x86/of: Move the x86_flattree_get_config() call out of x86_dtb_init()
x86/amd_nb: Add AMD Family MI300 PCI IDs
x86/platform/uv: Annotate struct uv_rtc_timer_head with __counted_by
x86/platform/uv: Rework NMI "action" modparam handling
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:
- Make sure the "svm" feature flag is cleared from /proc/cpuinfo when
virtualization support is disabled in the BIOS on AMD and Hygon
platforms
- A minor cleanup
* tag 'x86_cpu_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Remove redundant 'break' statement
x86/cpu: Clear SVM feature if disabled by BIOS
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- Add support for non-contiguous capacity bitmasks being added to
Intel's CAT implementation
- Other improvements to resctrl code: better configuration,
simplifications, debugging support, fixes
* tag 'x86_cache_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Display RMID of resource group
x86/resctrl: Add support for the files of MON groups only
x86/resctrl: Display CLOSID for resource group
x86/resctrl: Introduce "-o debug" mount option
x86/resctrl: Move default group file creation to mount
x86/resctrl: Unwind properly from rdt_enable_ctx()
x86/resctrl: Rename rftype flags for consistency
x86/resctrl: Simplify rftype flag definitions
x86/resctrl: Add multiple tasks to the resctrl group at once
Documentation/x86: Document resctrl's new sparse_masks
x86/resctrl: Add sparse_masks file in info
x86/resctrl: Enable non-contiguous CBMs in Intel CAT
x86/resctrl: Rename arch_has_sparse_bitmaps
x86/resctrl: Fix remaining kernel-doc warnings
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 hw mitigation updates from Borislav Petkov:
- A bunch of improvements, cleanups and fixlets to the SRSO mitigation
machinery and other, general cleanups to the hw mitigations code, by
Josh Poimboeuf
- Improve the return thunk detection by objtool as it is absolutely
important that the default return thunk is not used after returns
have been patched. Future work to detect and report this better is
pending
- Other misc cleanups and fixes
* tag 'x86_bugs_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/retpoline: Document some thunk handling aspects
x86/retpoline: Make sure there are no unconverted return thunks due to KCSAN
x86/callthunks: Delete unused "struct thunk_desc"
x86/vdso: Run objtool on vdso32-setup.o
objtool: Fix return thunk patching in retpolines
x86/srso: Remove unnecessary semicolon
x86/pti: Fix kernel warnings for pti= and nopti cmdline options
x86/calldepth: Rename __x86_return_skl() to call_depth_return_thunk()
x86/nospec: Refactor UNTRAIN_RET[_*]
x86/rethunk: Use SYM_CODE_START[_LOCAL]_NOALIGN macros
x86/srso: Disentangle rethunk-dependent options
x86/srso: Move retbleed IBPB check into existing 'has_microcode' code block
x86/bugs: Remove default case for fully switched enums
x86/srso: Remove 'pred_cmd' label
x86/srso: Unexport untraining functions
x86/srso: Improve i-cache locality for alias mitigation
x86/srso: Fix unret validation dependencies
x86/srso: Fix vulnerability reporting for missing microcode
x86/srso: Print mitigation for retbleed IBPB case
x86/srso: Print actual mitigation if requested mitigation isn't possible
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Specify what error addresses reported on AMD are actually usable
memory error addresses for further decoding
* tag 'ras_core_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Cleanup mce_usable_address()
x86/mce: Define amd_mce_usable_address()
x86/MCE/AMD: Split amd_mce_is_memory_error()
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull iov_iter updates from Christian Brauner:
"This contain's David's iov_iter cleanup work to convert the iov_iter
iteration macros to inline functions:
- Remove last_offset from iov_iter as it was only used by ITER_PIPE
- Add a __user tag on copy_mc_to_user()'s dst argument on x86 to
match that on powerpc and get rid of a sparse warning
- Convert iter->user_backed to user_backed_iter() in the sound PCM
driver
- Convert iter->user_backed to user_backed_iter() in a couple of
infiniband drivers
- Renumber the type enum so that the ITER_* constants match the order
in iterate_and_advance*()
- Since the preceding patch puts UBUF and IOVEC at 0 and 1, change
user_backed_iter() to just use the type value and get rid of the
extra flag
- Convert the iov_iter iteration macros to always-inline functions to
make the code easier to follow. It uses function pointers, but they
get optimised away
- Move the check for ->copy_mc to _copy_from_iter() and
copy_page_from_iter_atomic() rather than in memcpy_from_iter_mc()
where it gets repeated for every segment. Instead, we check once
and invoke a side function that can use iterate_bvec() rather than
iterate_and_advance() and supply a different step function
- Move the copy-and-csum code to net/ where it can be in proximity
with the code that uses it
- Fold memcpy_and_csum() in to its two users
- Move csum_and_copy_from_iter_full() out of line and merge in
csum_and_copy_from_iter() since the former is the only caller of
the latter
- Move hash_and_copy_to_iter() to net/ where it can be with its only
caller"
* tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
iov_iter, net: Move hash_and_copy_to_iter() to net/
iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
iov_iter, net: Fold in csum_and_memcpy()
iov_iter, net: Move csum_and_copy_to/from_iter() to net/
iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()
iov_iter: Convert iterate*() to inline funcs
iov_iter: Derive user-backedness from the iterator type
iov_iter: Renumber ITER_* constants
infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC
sound: Fix snd_pcm_readv()/writev() to use iov access functions
iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user()
iov_iter: Remove last_offset from iov_iter as it was for ITER_PIPE
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix a possible CPU hotplug deadlock bug caused by the new TSC
synchronization code
- Fix a legacy PIC discovery bug that results in device troubles on
affected systems, such as non-working keybards, etc
- Add a new Intel CPU model number to <asm/intel-family.h>
* tag 'x86-urgent-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Defer marking TSC unstable to a worker
x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility
x86/cpu: Add model number for Intel Arrow Lake mobile processor
|
|
The '%.ko' rule in arch/*/Makefile.postlink does nothing but call the
'true' command.
Remove the unneeded code.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nicolas Schier <n.schier@avm.de>
|
|
Currently, there is no standard implementation for vdso_install,
leading to various issues:
1. Code duplication
Many architectures duplicate similar code just for copying files
to the install destination.
Some architectures (arm, sparc, x86) create build-id symlinks,
introducing more code duplication.
2. Unintended updates of in-tree build artifacts
The vdso_install rule depends on the vdso files to install.
It may update in-tree build artifacts. This can be problematic,
as explained in commit 19514fc665ff ("arm, kbuild: make
"make install" not depend on vmlinux").
3. Broken code in some architectures
Makefile code is often copied from one architecture to another
without proper adaptation.
'make vdso_install' for parisc does not work.
'make vdso_install' for s390 installs vdso64, but not vdso32.
To address these problems, this commit introduces a generic vdso_install
rule.
Architectures that support vdso_install need to define vdso-install-y
in arch/*/Makefile. vdso-install-y lists the files to install.
For example, arch/x86/Makefile looks like this:
vdso-install-$(CONFIG_X86_64) += arch/x86/entry/vdso/vdso64.so.dbg
vdso-install-$(CONFIG_X86_X32_ABI) += arch/x86/entry/vdso/vdsox32.so.dbg
vdso-install-$(CONFIG_X86_32) += arch/x86/entry/vdso/vdso32.so.dbg
vdso-install-$(CONFIG_IA32_EMULATION) += arch/x86/entry/vdso/vdso32.so.dbg
These files will be installed to $(MODLIB)/vdso/ with the .dbg suffix,
if exists, stripped away.
vdso-install-y can optionally take the second field after the colon
separator. This is needed because some architectures install a vdso
file as a different base name.
The following is a snippet from arch/arm64/Makefile.
vdso-install-$(CONFIG_COMPAT_VDSO) += arch/arm64/kernel/vdso32/vdso.so.dbg:vdso32.so
This will rename vdso.so.dbg to vdso32.so during installation. If such
architectures change their implementation so that the base names match,
this workaround will go away.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Sven Schnelle <svens@linux.ibm.com> # s390
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
Reviewed-by: Guo Ren <guoren@kernel.org>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
|
|
Service NMI and SMI requests after PMI requests in vcpu_enter_guest() so
that KVM does not need to cancel and redo the VM-Enter if the guest
configures its PMIs to be delivered as NMIs (likely) or SMIs (unlikely).
Because APIC emulation "injects" NMIs via KVM_REQ_NMI, handling PMI
requests after NMI requests (the likely case) means KVM won't detect the
pending NMI request until the final check for outstanding requests.
Detecting requests at the final stage is costly as KVM has already loaded
guest state, potentially queued events for injection, disabled IRQs,
dropped SRCU, etc., most of which needs to be unwound.
Note that changing the order of request processing doesn't change the end
result, as KVM's final check for outstanding requests prevents entering
the guest until all requests are serviced. I.e. KVM will ultimately
coalesce events (or not) regardless of the ordering.
Using SPEC2017 benchmark programs running along with Intel vtune in a VM
demonstrates that the following code change reduces 800~1500 canceled
VM-Enters per second.
Some glory details:
Probe the invocation to vmx_cancel_injection():
$ perf probe -a vmx_cancel_injection
$ perf stat -a -e probe:vmx_cancel_injection -I 10000 # per 10 seconds
Partial results when SPEC2017 with Intel vtune are running in the VM:
On kernel without the change:
10.010018010 14254 probe:vmx_cancel_injection
20.037646388 15207 probe:vmx_cancel_injection
30.078739816 15261 probe:vmx_cancel_injection
40.114033258 15085 probe:vmx_cancel_injection
50.149297460 15112 probe:vmx_cancel_injection
60.185103088 15104 probe:vmx_cancel_injection
On kernel with the change:
10.003595390 40 probe:vmx_cancel_injection
20.017855682 31 probe:vmx_cancel_injection
30.028355883 34 probe:vmx_cancel_injection
40.038686298 31 probe:vmx_cancel_injection
50.048795162 20 probe:vmx_cancel_injection
60.069057747 19 probe:vmx_cancel_injection
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231002040839.2630027-1-mizhang@google.com
[sean: hoist PMU/PMI above SMI too, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Tetsuo reported the following lockdep splat when the TSC synchronization
fails during CPU hotplug:
tsc: Marking TSC unstable due to check_tsc_sync_source failed
WARNING: inconsistent lock state
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
ffffffff8cfa1c78 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0x23/0x5a0
{IN-HARDIRQ-W} state was registered at:
_raw_spin_lock_irqsave+0x3f/0x60
clocksource_mark_unstable+0x1b/0x90
mark_tsc_unstable+0x41/0x50
check_tsc_sync_source+0x14f/0x180
sysvec_call_function_single+0x69/0x90
Possible unsafe locking scenario:
lock(watchdog_lock);
<Interrupt>
lock(watchdog_lock);
stack backtrace:
_raw_spin_lock+0x30/0x40
clocksource_watchdog+0x23/0x5a0
run_timer_softirq+0x2a/0x50
sysvec_apic_timer_interrupt+0x6e/0x90
The reason is the recent conversion of the TSC synchronization function
during CPU hotplug on the control CPU to a SMP function call. In case
that the synchronization with the upcoming CPU fails, the TSC has to be
marked unstable via clocksource_mark_unstable().
clocksource_mark_unstable() acquires 'watchdog_lock', but that lock is
taken with interrupts enabled in the watchdog timer callback to minimize
interrupt disabled time. That's obviously a possible deadlock scenario,
Before that change the synchronization function was invoked in thread
context so this could not happen.
As it is not crucical whether the unstable marking happens slightly
delayed, defer the call to a worker thread which avoids the lock context
problem.
Fixes: 9d349d47f0e3 ("x86/smpboot: Make TSC synchronization function call based")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87zg064ceg.ffs@tglx
|
|
David and a few others reported that on certain newer systems some legacy
interrupts fail to work correctly.
Debugging revealed that the BIOS of these systems leaves the legacy PIC in
uninitialized state which makes the PIC detection fail and the kernel
switches to a dummy implementation.
Unfortunately this fallback causes quite some code to fail as it depends on
checks for the number of legacy PIC interrupts or the availability of the
real PIC.
In theory there is no reason to use the PIC on any modern system when
IO/APIC is available, but the dependencies on the related checks cannot be
resolved trivially and on short notice. This needs lots of analysis and
rework.
The PIC detection has been added to avoid quirky checks and force selection
of the dummy implementation all over the place, especially in VM guest
scenarios. So it's not an option to revert the relevant commit as that
would break a lot of other scenarios.
One solution would be to try to initialize the PIC on detection fail and
retry the detection, but that puts the burden on everything which does not
have a PIC.
Fortunately the ACPI/MADT table header has a flag field, which advertises
in bit 0 that the system is PCAT compatible, which means it has a legacy
8259 PIC.
Evaluate that bit and if set avoid the detection routine and keep the real
PIC installed, which then gets initialized (for nothing) and makes the rest
of the code with all the dependencies work again.
Fixes: e179f6914152 ("x86, irq, pic: Probe for legacy PIC and set legacy_pic appropriately")
Reported-by: David Lazar <dlazar@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: David Lazar <dlazar@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Cc: stable@vger.kernel.org
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218003
Link: https://lore.kernel.org/r/875y2u5s8g.ffs@tglx
|
|
For "reasons" Intel has code-named this CPU with a "_H" suffix.
[ dhansen: As usual, apply this and send it upstream quickly to
make it easier for anyone who is doing work that
consumes this. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231025202513.12358-1-tony.luck%40intel.com
|
|
The branch counters logging (A.K.A LBR event logging) introduces a
per-counter indication of precise event occurrences in LBRs. It can
provide a means to attribute exposed retirement latency to combinations
of events across a block of instructions. It also provides a means of
attributing Timed LBR latencies to events.
The feature is first introduced on SRF/GRR. It is an enhancement of the
ARCH LBR. It adds new fields in the LBR_INFO MSRs to log the occurrences
of events on the GP counters. The information is displayed by the order
of counters.
The design proposed in this patch requires that the events which are
logged must be in a group with the event that has LBR. If there are
more than one LBR group, the counters logging information only from the
current group (overflowed) are stored for the perf tool, otherwise the
perf tool cannot know which and when other groups are scheduled
especially when multiplexing is triggered. The user can ensure it uses
the maximum number of counters that support LBR info (4 by now) by
making the group large enough.
The HW only logs events by the order of counters. The order may be
different from the order of enabling which the perf tool can understand.
When parsing the information of each branch entry, convert the counter
order to the enabled order, and store the enabled order in the extension
space.
Unconditionally reset LBRs for an LBR event group when it's deleted. The
logged counter information is only valid for the current LBR group. If
another LBR group is scheduled later, the information from the stale
LBRs would be otherwise wrongly interpreted.
Add a sanity check in intel_pmu_hw_config(). Disable the feature if other
counter filters (inv, cmask, edge, in_tx) are set or LBR call stack mode
is enabled. (For the LBR call stack mode, we cannot simply flush the
LBR, since it will break the call stack. Also, there is no obvious usage
with the call stack mode for now.)
Only applying the PERF_SAMPLE_BRANCH_COUNTERS doesn't require any branch
stack setup.
Expose the maximum number of supported counters and the width of the
counters into the sysfs. The perf tool can use the information to parse
the logged counters in each branch.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-5-kan.liang@linux.intel.com
|
|
Some attrs and is_visible implementations are rather far away from one
another which makes the whole thing hard to interpret.
There are only two attribute groups which have both .attrs and
.is_visible, group_default and group_caps_lbr. Move them together.
No functional changes.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-4-kan.liang@linux.intel.com
|
|
Add a helper function to check call stack sample type.
The later patch will invoke the function in several places.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-3-kan.liang@linux.intel.com
|
|
Currently, branch_sample_type !=0 is used to check whether a branch
stack setup is required. But it doesn't check the sample type,
unnecessary branch stack setup may be done for a counting event. E.g.,
perf record -e "{branch-instructions,branch-misses}:S" -j any
Also, the event only with the new PERF_SAMPLE_BRANCH_COUNTERS branch
sample type may not require a branch stack setup either.
Add a new flag NEEDS_BRANCH_STACK to indicate whether the event requires
a branch stack setup. Replace the needs_branch_stack() by checking the
new flag.
The counting event check is implemented here. The later patch will take
the new PERF_SAMPLE_BRANCH_COUNTERS into account.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-2-kan.liang@linux.intel.com
|
|
Currently, the additional information of a branch entry is stored in a
u64 space. With more and more information added, the space is running
out. For example, the information of occurrences of events will be added
for each branch.
Two places were suggested to append the counters.
https://lore.kernel.org/lkml/20230802215814.GH231007@hirez.programming.kicks-ass.net/
One place is right after the flags of each branch entry. It changes the
existing struct perf_branch_entry. The later ARCH specific
implementation has to be really careful to consistently pick
the right struct.
The other place is right after the entire struct perf_branch_stack.
The disadvantage is that the pointer of the extra space has to be
recorded. The common interface perf_sample_save_brstack() has to be
updated.
The latter is much straightforward, and should be easily understood and
maintained. It is implemented in the patch.
Add a new branch sample type, PERF_SAMPLE_BRANCH_COUNTERS, to indicate
the event which is recorded in the branch info.
The "u64 counters" may store the occurrences of several events. The
information regarding the number of events/counters and the width of
each counter should be exposed via sysfs as a reference for the perf
tool. Define the branch_counter_nr and branch_counter_width ABI here.
The support will be implemented later in the Intel-specific patch.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-1-kan.liang@linux.intel.com
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
net/mac80211/rx.c
91535613b609 ("wifi: mac80211: don't drop all unprotected public action frames")
6c02fab72429 ("wifi: mac80211: split ieee80211_drop_unencrypted_mgmt() return value")
Adjacent changes:
drivers/net/ethernet/apm/xgene/xgene_enet_main.c
61471264c018 ("net: ethernet: apm: Convert to platform remove callback returning void")
d2ca43f30611 ("net: xgene: Fix unused xgene_enet_of_match warning for !CONFIG_OF")
net/vmw_vsock/virtio_transport.c
64c99d2d6ada ("vsock/virtio: support to send non-linear skb")
53b08c498515 ("vsock/virtio: initialize the_virtio_vsock before using VQs")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
commit ef8dd01538ea ("genirq/msi: Make interrupt allocation less
convoluted"), reworked the code so that the x86 specific quirk for affinity
setting of non-maskable PCI/MSI interrupts is not longer activated if
necessary.
This could be solved by restoring the original logic in the core MSI code,
but after a deeper analysis it turned out that the quirk flag is not
required at all.
The quirk is only required when the PCI/MSI device cannot mask the MSI
interrupts, which in turn also prevents reservation mode from being enabled
for the affected interrupt.
This allows ot remove the NOMASK quirk bit completely as msi_set_affinity()
can instead check whether reservation mode is enabled for the interrupt,
which gives exactly the same answer.
Even in the momentary non-existing case that the reservation mode would be
not set for a maskable MSI interrupt this would not cause any harm as it
just would cause msi_set_affinity() to go needlessly through the
functionaly equivalent slow path, which works perfectly fine with maskable
interrupts as well.
Rework msi_set_affinity() to query the reservation mode and remove all
NOMASK quirk logic from the core code.
[ tglx: Massaged changelog ]
Fixes: ef8dd01538ea ("genirq/msi: Make interrupt allocation less convoluted")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20231026032036.2462428-1-den@valinux.co.jp
|
|
A bug was recently found via CONFIG_DEBUG_ENTRY=y, and the x86
tree kinda is the main source of changes to the x86 entry code,
so enable this debug option by default in our defconfigs.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
|
|
In general users, don't have the necessary information to determine
whether late loading of a new microcode version is safe and does not
modify anything which the currently running kernel uses already, e.g.
removal of CPUID bits or behavioural changes of MSRs.
To address this issue, Intel has added a "minimum required version"
field to a previously reserved field in the microcode header. Microcode
updates should only be applied if the current microcode version is equal
to, or greater than this minimum required version.
Thomas made some suggestions on how meta-data in the microcode file could
provide Linux with information to decide if the new microcode is suitable
candidate for late loading. But even the "simpler" option requires a lot of
metadata and corresponding kernel code to parse it, so the final suggestion
was to add the 'minimum required version' field in the header.
When microcode changes visible features, microcode will set the minimum
required version to its own revision which prevents late loading.
Old microcode blobs have the minimum revision field always set to 0, which
indicates that there is no information and the kernel considers it
unsafe.
This is a pure OS software mechanism. The hardware/firmware ignores this
header field.
For early loading there is no restriction because OS visible features
are enumerated after the early load and therefore a change has no
effect.
The check is always enabled, but by default not enforced. It can be
enforced via Kconfig or kernel command line.
If enforced, the kernel refuses to late load microcode with a minimum
required version field which is zero or when the currently loaded
microcode revision is smaller than the minimum required revision.
If not enforced the load happens independent of the revision check to
stay compatible with the existing behaviour, but it influences the
decision whether the kernel is tainted or not. If the check signals that
the late load is safe, then the kernel is not tainted.
Early loading is not affected by this.
[ tglx: Massaged changelog and fixed up the implementation ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.776467264@linutronix.de
|
|
Applying microcode late can be fatal for the running kernel when the
update changes functionality which is in use already in a non-compatible
way, e.g. by removing a CPUID bit.
There is no way for admins which do not have access to the vendors deep
technical support to decide whether late loading of such a microcode is
safe or not.
Intel has added a new field to the microcode header which tells the
minimal microcode revision which is required to be active in the CPU in
order to be safe.
Provide infrastructure for handling this in the core code and a command
line switch which allows to enforce it.
If the update is considered safe the kernel is not tainted and the annoying
warning message not emitted. If it's enforced and the currently loaded
microcode revision is not safe for late loading then the load is aborted.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211724.079611170@linutronix.de
|
|
Offline CPUs need to be parked in a safe loop when microcode update is
in progress on the primary CPU. Currently, offline CPUs are parked in
mwait_play_dead(), and for Intel CPUs, its not a safe instruction,
because the MWAIT instruction can be patched in the new microcode update
that can cause instability.
- Add a new microcode state 'UCODE_OFFLINE' to report status on per-CPU
basis.
- Force NMI on the offline CPUs.
Wake up offline CPUs while the update is in progress and then return
them back to mwait_play_dead() after microcode update is complete.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.660850472@linutronix.de
|
|
When SMT siblings are soft-offlined and parked in one of the play_dead()
variants they still react on NMI, which is problematic on affected Intel
CPUs. The default play_dead() variant uses MWAIT on modern CPUs, which is
not guaranteed to be safe when updated concurrently.
Right now late loading is prevented when not all SMT siblings are online,
but as they still react on NMI, it is possible to bring them out of their
park position into a trivial rendezvous handler.
Provide a function which allows to do that. I does sanity checks whether
the target is in the cpus_booted_once_mask and whether the APIC driver
supports it.
Mark X2APIC and XAPIC as capable, but exclude 32bit and the UV and NUMACHIP
variants as that needs feedback from the relevant experts.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.603100036@linutronix.de
|
|
The wait for control loop in which the siblings are waiting for the
microcode update on the primary thread must be protected against
instrumentation as instrumentation can end up in #INT3, #DB or #PF,
which then returns with IRET. That IRET reenables NMI which is the
opposite of what the NMI rendezvous is trying to achieve.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.545969323@linutronix.de
|
|
stop_machine() does not prevent the spin-waiting sibling from handling
an NMI, which is obviously violating the whole concept of rendezvous.
Implement a static branch right in the beginning of the NMI handler
which is nopped out except when enabled by the late loading mechanism.
The late loader enables the static branch before stop_machine() is
invoked. Each CPU has an nmi_enable in its control structure which
indicates whether the CPU should go into the update routine.
This is required to bridge the gap between enabling the branch and
actually being at the point where it is required to enter the loader
wait loop.
Each CPU which arrives in the stopper thread function sets that flag and
issues a self NMI right after that. If the NMI function sees the flag
clear, it returns. If it's set it clears the flag and enters the
rendezvous.
This is safe against a real NMI which hits in between setting the flag
and sending the NMI to itself. The real NMI will be swallowed by the
microcode update and the self NMI will then let stuff continue.
Otherwise this would end up with a spurious NMI.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.489900814@linutronix.de
|
|
with a new handler which just separates the control flow of primary and
secondary CPUs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.433704135@linutronix.de
|