Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"Nothing major this time around.
Apart from the usual perf/PMU updates, some page table cleanups, the
notable features are average CPU frequency based on the AMUv1
counters, CONFIG_HOTPLUG_SMT and MOPS instructions (memcpy/memset) in
the uaccess routines.
Perf and PMUs:
- Support for the 'Rainier' CPU PMU from Arm
- Preparatory driver changes and cleanups that pave the way for BRBE
support
- Support for partial virtualisation of the Apple-M1 PMU
- Support for the second event filter in Arm CSPMU designs
- Minor fixes and cleanups (CMN and DWC PMUs)
- Enable EL2 requirements for FEAT_PMUv3p9
Power, CPU topology:
- Support for AMUv1-based average CPU frequency
- Run-time SMT control wired up for arm64 (CONFIG_HOTPLUG_SMT). It
adds a generic topology_is_primary_thread() function overridden by
x86 and powerpc
New(ish) features:
- MOPS (memcpy/memset) support for the uaccess routines
Security/confidential compute:
- Fix the DMA address for devices used in Realms with Arm CCA. The
CCA architecture uses the address bit to differentiate between
shared and private addresses
- Spectre-BHB: assume CPUs Linux doesn't know about vulnerable by
default
Memory management clean-ups:
- Drop the P*D_TABLE_BIT definition in preparation for 128-bit PTEs
- Some minor page table accessor clean-ups
- PIE/POE (permission indirection/overlay) helpers clean-up
Kselftests:
- MTE: skip hugetlb tests if MTE is not supported on such mappings
and user correct naming for sync/async tag checking modes
Miscellaneous:
- Add a PKEY_UNRESTRICTED definition as 0 to uapi (toolchain people
request)
- Sysreg updates for new register fields
- CPU type info for some Qualcomm Kryo cores"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits)
arm64: mm: Don't use %pK through printk
perf/arm_cspmu: Fix missing io.h include
arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists
arm64: cputype: Add MIDR_CORTEX_A76AE
arm64: errata: Add KRYO 2XX/3XX/4XX silver cores to Spectre BHB safe list
arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB
arm64: errata: Add QCOM_KRYO_4XX_GOLD to the spectre_bhb_k24_list
arm64/sysreg: Enforce whole word match for open/close tokens
arm64/sysreg: Fix unbalanced closing block
arm64: Kconfig: Enable HOTPLUG_SMT
arm64: topology: Support SMT control on ACPI based system
arch_topology: Support SMT control for OF based system
cpu/SMT: Provide a default topology_is_primary_thread()
arm64/mm: Define PTDESC_ORDER
perf/arm_cspmu: Add PMEVFILT2R support
perf/arm_cspmu: Generalise event filtering
perf/arm_cspmu: Move register definitons to header
arm64/kernel: Always use level 2 or higher for early mappings
arm64/mm: Drop PXD_TABLE_BIT
arm64/mm: Check pmd_table() in pmd_trans_huge()
...
|
|
The struct thermal_zone_device is already declared on line 32, so the
duplicate declaration has been removed.
Fixes: b1ae92dcfa8e ("thermal: core: Make struct thermal_zone_device definition internal")
Signed-off-by: xueqin Luo <luoxueqin@kylinos.cn>
Link: https://lore.kernel.org/r/20250206081436.51785-1-luoxueqin@kylinos.cn
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
* for-next/smt-control:
: Support SMT control on arm64
arm64: Kconfig: Enable HOTPLUG_SMT
arm64: topology: Support SMT control on ACPI based system
arch_topology: Support SMT control for OF based system
cpu/SMT: Provide a default topology_is_primary_thread()
|
|
'for-next/sysreg', 'for-next/misc', 'for-next/pgtable-cleanups', 'for-next/kselftest', 'for-next/uaccess-mops', 'for-next/pie-poe-cleanup', 'for-next/cputype-kryo', 'for-next/cca-dma-address', 'for-next/drop-pxd_table_bit' and 'for-next/spectre-bhb-assume-vulnerable', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
perf/arm_cspmu: Fix missing io.h include
perf/arm_cspmu: Add PMEVFILT2R support
perf/arm_cspmu: Generalise event filtering
perf/arm_cspmu: Move register definitons to header
drivers/perf: apple_m1: Support host/guest event filtering
drivers/perf: apple_m1: Refactor event select/filter configuration
perf/dwc_pcie: fix duplicate pci_dev devices
perf/dwc_pcie: fix some unreleased resources
perf/arm-cmn: Minor event type housekeeping
perf: arm_pmu: Move PMUv3-specific data
perf: apple_m1: Don't disable counter in m1_pmu_enable_event()
perf: arm_v7_pmu: Don't disable counter in (armv7|krait_|scorpion_)pmu_enable_event()
perf: arm_v7_pmu: Drop obvious comments for enabling/disabling counters and interrupts
perf: arm_pmuv3: Don't disable counter in armv8pmu_enable_event()
perf: arm_pmu: Don't disable counter in armpmu_add()
perf: arm_pmuv3: Call kvm_vcpu_pmu_resync_el0() before enabling counters
perf: arm_pmuv3: Add support for ARM Rainier PMU
* for-next/amuv1-avg-freq:
: Add support for AArch64 AMUv1-based average freq
arm64: Utilize for_each_cpu_wrap for reference lookup
arm64: Update AMU-based freq scale factor on entering idle
arm64: Provide an AMU-based version of arch_freq_get_on_cpu
cpufreq: Introduce an optional cpuinfo_avg_freq sysfs entry
cpufreq: Allow arch_freq_get_on_cpu to return an error
arch_topology: init capacity_freq_ref to 0
* for-next/pkey_unrestricted:
: mm/pkey: Add PKEY_UNRESTRICTED macro
selftest/powerpc/mm/pkey: fix build-break introduced by commit 00894c3fc917
selftests/powerpc: Use PKEY_UNRESTRICTED macro
selftests/mm: Use PKEY_UNRESTRICTED macro
mm/pkey: Add PKEY_UNRESTRICTED macro
* for-next/sysreg:
: arm64 sysreg updates
arm64/sysreg: Enforce whole word match for open/close tokens
arm64/sysreg: Fix unbalanced closing block
arm64/sysreg: Add register fields for HFGWTR2_EL2
arm64/sysreg: Add register fields for HFGRTR2_EL2
arm64/sysreg: Add register fields for HFGITR2_EL2
arm64/sysreg: Add register fields for HDFGWTR2_EL2
arm64/sysreg: Add register fields for HDFGRTR2_EL2
arm64/sysreg: Update register fields for ID_AA64MMFR0_EL1
* for-next/misc:
: Miscellaneous arm64 patches
arm64: mm: Don't use %pK through printk
arm64/fpsimd: Remove unused declaration fpsimd_kvm_prepare()
* for-next/pgtable-cleanups:
: arm64 pgtable accessors cleanup
arm64/mm: Define PTDESC_ORDER
arm64/kernel: Always use level 2 or higher for early mappings
arm64/hugetlb: Consistently use pud_sect_supported()
arm64/mm: Convert __pte_to_phys() and __phys_to_pte_val() as functions
* for-next/kselftest:
: arm64 kselftest updates
kselftest/arm64: mte: Skip the hugetlb tests if MTE not supported on such mappings
kselftest/arm64: mte: Use the correct naming for tag check modes in check_hugetlb_options.c
* for-next/uaccess-mops:
: Implement the uaccess memory copy/set using MOPS instructions
arm64: lib: Use MOPS for usercopy routines
arm64: mm: Handle PAN faults on uaccess CPY* instructions
arm64: extable: Add fixup handling for uaccess CPY* instructions
* for-next/pie-poe-cleanup:
: PIE/POE helpers cleanup
arm64/sysreg: Move POR_EL0_INIT to asm/por.h
arm64/sysreg: Rename POE_RXW to POE_RWX
arm64/sysreg: Improve PIR/POR helpers
* for-next/cputype-kryo:
: Add cputype info for some Qualcomm Kryo cores
arm64: cputype: Add comments about Qualcomm Kryo 5XX and 6XX cores
arm64: cputype: Add QCOM_CPU_PART_KRYO_3XX_GOLD
* for-next/cca-dma-address:
: Fix DMA address for devices used in realms with Arm CCA
arm64: realm: Use aliased addresses for device DMA to shared buffers
dma: Introduce generic dma_addr_*crypted helpers
dma: Fix encryption bit clearing for dma_to_phys
* for-next/drop-pxd_table_bit:
: Drop the arm64 PXD_TABLE_BIT (clean-up in preparation for 128-bit PTEs)
arm64/mm: Drop PXD_TABLE_BIT
arm64/mm: Check pmd_table() in pmd_trans_huge()
arm64/mm: Check PUD_TYPE_TABLE in pud_bad()
arm64/mm: Check PXD_TYPE_TABLE in [p4d|pgd]_bad()
arm64/mm: Clear PXX_TYPE_MASK and set PXD_TYPE_SECT in [pmd|pud]_mkhuge()
arm64/mm: Clear PXX_TYPE_MASK in mk_[pmd|pud]_sect_prot()
arm64/ptdump: Test PMD_TYPE_MASK for block mapping
KVM: arm64: ptdump: Test PMD_TYPE_MASK for block mapping
* for-next/spectre-bhb-assume-vulnerable:
: Rework Spectre BHB mitigations to not assume "safe"
arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists
arm64: cputype: Add MIDR_CORTEX_A76AE
arm64: errata: Add KRYO 2XX/3XX/4XX silver cores to Spectre BHB safe list
arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB
arm64: errata: Add QCOM_KRYO_4XX_GOLD to the spectre_bhb_k24_list
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull VDSO infrastructure updates from Thomas Gleixner:
- Consolidate the VDSO storage
The VDSO data storage and data layout has been largely architecture
specific for historical reasons. That increases the maintenance
effort and causes inconsistencies over and over.
There is no real technical reason for architecture specific layouts
and implementations. The architecture specific details can easily be
integrated into a generic layout, which also reduces the amount of
duplicated code for managing the mappings.
Convert all architectures over to a unified layout and common mapping
infrastructure. This splits the VDSO data layout into subsystem
specific blocks, timekeeping, random and architecture parts, which
provides a better structure and allows to improve and update the
functionalities without conflict and interaction.
- Rework the timekeeping data storage
The current implementation is designed for exposing system
timekeeping accessors, which was good enough at the time when it was
designed.
PTP and Time Sensitive Networking (TSN) change that as there are
requirements to expose independent PTP clocks, which are not related
to system timekeeping.
Replace the monolithic data storage by a structured layout, which
allows to add support for independent PTP clocks on top while reusing
both the data structures and the time accessor implementations.
* tag 'timers-vdso-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
sparc/vdso: Always reject undefined references during linking
x86/vdso: Always reject undefined references during linking
vdso: Rework struct vdso_time_data and introduce struct vdso_clock
vdso: Move architecture related data before basetime data
powerpc/vdso: Prepare introduction of struct vdso_clock
arm64/vdso: Prepare introduction of struct vdso_clock
x86/vdso: Prepare introduction of struct vdso_clock
time/namespace: Prepare introduction of struct vdso_clock
vdso/namespace: Rename timens_setup_vdso_data() to reflect new vdso_clock struct
vdso/vsyscall: Prepare introduction of struct vdso_clock
vdso/gettimeofday: Prepare helper functions for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_coarse_timens() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_coarse() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_hres_timens() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_hres() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare introduction of struct vdso_clock
vdso/helpers: Prepare introduction of struct vdso_clock
vdso/datapage: Define vdso_clock to prepare for multiple PTP clocks
vdso: Make vdso_time_data cacheline aligned
arm64: Make asm/cache.h compatible with vDSO
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the
upcoming Rust integration and is obviously a silly implementation to
begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and the hrtimer::function becomes a private member"
* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
wifi: rt2x00: Switch to use hrtimer_update_function()
io_uring: Use helper function hrtimer_update_function()
serial: xilinx_uartps: Use helper function hrtimer_update_function()
ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
RDMA: Switch to use hrtimer_setup()
virtio: mem: Switch to use hrtimer_setup()
drm/vmwgfx: Switch to use hrtimer_setup()
drm/xe/oa: Switch to use hrtimer_setup()
drm/vkms: Switch to use hrtimer_setup()
drm/msm: Switch to use hrtimer_setup()
drm/i915/request: Switch to use hrtimer_setup()
drm/i915/uncore: Switch to use hrtimer_setup()
drm/i915/pmu: Switch to use hrtimer_setup()
drm/i915/perf: Switch to use hrtimer_setup()
drm/i915/gvt: Switch to use hrtimer_setup()
drm/i915/huc: Switch to use hrtimer_setup()
drm/amdgpu: Switch to use hrtimer_setup()
stm class: heartbeat: Switch to use hrtimer_setup()
i2c: Switch to use hrtimer_setup()
iio: Switch to use hrtimer_setup()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
- Fix a memory ordering issue in posix-timers
Posix-timer lookup is lockless and reevaluates the timer validity
under the timer lock, but the update which validates the timer is not
protected by the timer lock. That allows the store to be reordered
against the initialization stores, so that the lookup side can
observe a partially initialized timer. That's mostly a theoretical
problem, but incorrect nevertheless.
- Fix a long standing inconsistency of the coarse time getters
The coarse time getters read the base time of the current update
cycle without reading the actual hardware clock. NTP frequency
adjustment can set the base time backwards. The fine grained
interfaces compensate this by reading the clock and applying the new
conversion factor, but the coarse grained time getters use the base
time directly. That allows the user to observe time going backwards.
Cure it by always forwarding base time, when NTP changes the
frequency with an immediate step.
- Rework of posix-timer hashing
The posix-timer hash is not scalable and due to the CRIU timer
restore mechanism prone to massive contention on the global hash
bucket lock.
Replace the global hash lock with a fine grained per bucket locking
scheme to address that.
- Rework the proc/$PID/timers interface.
/proc/$PID/timers is provided for CRIU to be able to restore a timer.
The printout happens with sighand lock held and interrupts disabled.
That's not required as this can be done with RCU protection as well.
- Provide a sane mechanism for CRIU to restore a timer ID
CRIU restores timers by creating and deleting them until the kernel
internal per process ID counter reached the requested ID. That's
horribly slow for sparse timer IDs.
Provide a prctl() which allows CRIU to restore a timer with a given
ID. When enabled the ID pointer is used as input pointer to read the
requested ID from user space. When disabled, the normal allocation
scheme (next ID) is active as before. This is backwards compatible
for both kernel and user space.
- Make hrtimer_update_function() less expensive.
The sanity checks are valuable, but expensive for high frequency
usage in io/uring. Make the debug checks conditional and enable them
only when lockdep is enabled.
- Small updates, cleanups and improvements
* tag 'timers-core-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
selftests/timers: Improve skew_consistency by testing with other clockids
timekeeping: Fix possible inconsistencies in _COARSE clockids
posix-timers: Drop redundant memset() invocation
selftests/timers/posix-timers: Add a test for exact allocation mode
posix-timers: Provide a mechanism to allocate a given timer ID
posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held
posix-timers: Make per process list RCU safe
posix-timers: Avoid false cacheline sharing
posix-timers: Switch to jhash32()
posix-timers: Improve hash table performance
posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_t
posix-timers: Make lock_timer() use guard()
posix-timers: Rework timer removal
posix-timers: Simplify lock/unlock_timer()
posix-timers: Use guards in a few places
posix-timers: Remove SLAB_PANIC from kmem cache
posix-timers: Remove a few paranoid warnings
posix-timers: Cleanup includes
posix-timers: Add cond_resched() to posix_timer_add() search loop
posix-timers: Initialise timer before adding it to the hash table
...
|
|
netdev netlink is the only reader of netdev_{,rx_}queue->napi,
and it already holds netdev->lock. Switch protection of
the writes to netdev->lock to "ops protected".
The expectation will be now that accessing queue->napi
will require netdev->lock for "ops locked" drivers, and
rtnl_lock for all other drivers.
Current "ops locked" drivers don't require any changes.
gve and netdevsim use _locked() helpers right next to
netif_queue_set_napi() so they must be holding the instance
lock. iavf doesn't call it. bnxt is a bit messy but all paths
seem locked.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250324224537.248800-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Drivers which opt into instance lock protection of ops should
only call set_real_num_*_queues() under the instance lock.
This means that queue counts are double protected (writes
are under both rtnl_lock and instance lock, readers under
either).
Some readers may still be under the rtnl_lock, however, so for
now we need double protection of writers.
OTOH queue API paths are only under the protection of the instance
lock, so we need to validate that the instance is actually locking
ops, otherwise the input checks we do against queue count are racy.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250324224537.248800-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Try to define some terminology for which fields are protected
by which lock and how. Some fields are protected by both rtnl_lock
and instance lock which is hard to talk about without having
a "key phrase" to refer to a particular protection scheme.
"ops protected" fields are defined later in the series, one by one.
Add ASSERT_RTNL() to netdev_ops_assert_locked() for drivers
not other instance protection of ops. Hopefully it's not too
confusion that netdev_lock_ops() does not match the lock which
netdev_ops_assert_locked() will assert, exactly. The noun "ops"
is in a different place in the name, so I think it's acceptable...
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250324224537.248800-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since commit a953be53ce40 ("net-sysfs: add support for device-specific
rx queue sysfs attributes"), so for at least a decade now it is safe
to call net_rx_queue_update_kobjects() when SYSFS=n. That function
does its own ifdef-inery and will return 0. Remove the unnecessary
stub for netif_set_real_num_rx_queues().
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250324224537.248800-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq driver updates from Thomas Gleixner:
- Support for hard indices on RISC-V. The hart index identifies a hart
(core) within a specific interrupt domain in RISC-V's Priviledged
Architecture.
- Rework of the RISC-V MSI driver
This moves the driver over to the generic MSI library and solves the
affinity problem of unmaskable PCI/MSI controllers. Unmaskable
PCI/MSI controllers are prone to lose interrupts when the MSI message
is updated to change the affinity because the message write consists
of three 32-bit subsequent writes, which update address and data. As
these writes are non-atomic versus the device raising an interrupt,
the device can observe a half written update and issue an interrupt
on the wrong vector. This is mitiated by a carefully orchestrated
step by step update and the observation of an eventually pending
interrupt on the CPU which issues the update. The algorithm follows
the well established method of the X86 MSI driver.
- A new driver for the RISC-V Sophgo SG2042 MSI controller
- Overhaul of the Renesas RZQ2L driver
Simplification of the probe function by using devm_*() mechanisms,
which avoid the endless list of error prone gotos in the failure
paths.
- Expand the Renesas RZV2H driver to support RZ/G3E SoCs
- A workaround for Rockchip 3568002 erratum in the GIC-V3 driver to
ensure that the addressing is limited to the lower 32-bit of the
physical address space.
- Add support for the Allwinner AS23 NMI controller
- Expand the IMX irqsteer driver to handle up to 960 input interrupts
- The usual small updates, cleanups and device tree changes
* tag 'irq-drivers-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
irqchip/imx-irqsteer: Support up to 960 input interrupts
irqchip/sunxi-nmi: Support Allwinner A523 NMI controller
dt-bindings: irq: sun7i-nmi: Document the Allwinner A523 NMI controller
irqchip/davinci-cp-intc: Remove public header
irqchip/renesas-rzv2h: Add RZ/G3E support
irqchip/renesas-rzv2h: Update macros ICU_TSSR_TSSEL_{MASK,PREP}
irqchip/renesas-rzv2h: Update TSSR_TIEN macro
irqchip/renesas-rzv2h: Add field_width to struct rzv2h_hw_info
irqchip/renesas-rzv2h: Add max_tssel to struct rzv2h_hw_info
irqchip/renesas-rzv2h: Add struct rzv2h_hw_info with t_offs variable
irqchip/renesas-rzv2h: Use devm_pm_runtime_enable()
irqchip/renesas-rzv2h: Use devm_reset_control_get_exclusive_deasserted()
irqchip/renesas-rzv2h: Simplify rzv2h_icu_init()
irqchip/renesas-rzv2h: Drop irqchip from struct rzv2h_icu_priv
irqchip/renesas-rzv2h: Fix wrong variable usage in rzv2h_tint_set_type()
dt-bindings: interrupt-controller: renesas,rzv2h-icu: Document RZ/G3E SoC
riscv: sophgo: dts: Add msi controller for SG2042
irqchip: Add the Sophgo SG2042 MSI interrupt controller
dt-bindings: interrupt-controller: Add Sophgo SG2042 MSI
arm64: dts: rockchip: rk356x: Move PCIe MSI to use GIC ITS instead of MBI
...
|
|
Add SOF_TIMESTAMPING_TX_COMPLETION, for requesting a software timestamp
when hardware reports a packet completed.
Completion tstamp is useful for Bluetooth, as hardware timestamps do not
exist in the HCI specification except for ISO packets, and the hardware
has a queue where packets may wait. In this case the software SND
timestamp only reflects the kernel-side part of the total latency
(usually small) and queue length (usually 0 unless HW buffers
congested), whereas the completion report time is more informative of
the true latency.
It may also be useful in other cases where HW TX timestamps cannot be
obtained and user wants to estimate an upper bound to when the TX
probably happened.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI irq updates from Thomas Gleixner:
- Switch the MSI descriptor locking to guards
- Replace the broken PCI/TPH implementation, which lacks any form of
serialization against concurrent modifications with a properly
serialized mechanism in the PCI/MSI core code
- Replace the MSI descriptor abuse in the SCSI/UFS Qualcom driver with
dedicated driver internal storage
* tag 'irq-msi-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/msi: Rename msi_[un]lock_descs()
scsi: ufs: qcom: Remove the MSI descriptor abuse
PCI/TPH: Replace the broken MSI-X control word update
PCI/MSI: Provide a sane mechanism for TPH
PCI: hv: Switch MSI descriptor locking to guard()
PCI/MSI: Switch to MSI descriptor locking to guard()
NTB/msi: Switch MSI descriptor locking to lock guard()
soc: ti: ti_sci_inta_msi: Switch MSI descriptor locking to guard()
genirq/msi: Use lock guards for MSI descriptor locking
cleanup: Provide retain_ptr()
genirq/msi: Make a few functions static
|
|
Revert "udp_tunnel: use static call for GRO hooks when possible"
This reverts commit 311b36574ceaccfa3f91b74054a09cd4bb877702.
Revert "udp_tunnel: create a fastpath GRO lookup."
This reverts commit 8d4880db378350f8ed8969feea13bdc164564fc1.
There are multiple small issues with the series. In the interest
of unblocking the merge window let's opt for a revert.
Link: https://lore.kernel.org/cover.1742557254.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"A small set of core changes for the interrupt subsystem:
- Expose the MSI message in the existing debug filesystem dump.
That's useful for validation and debugging.
- Small cleanups"
* tag 'irq-core-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Make a few functions static
irqdomain: Remove extern from function declarations
genirq/msi: Expose MSI message data in debugfs
|
|
Some MACs require the PHY receive clock to be running to complete setup
actions. This may fail if the PHY has negotiated EEE, the MAC supports
receive clock stop, and the link has entered LPI state. Provide a pair
of APIs that MAC drivers can use to temporarily block the PHY disabling
the receive clock.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tvO6k-008Vjt-MZ@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When the system is suspended, the PHY may be placed in low-power mode
by setting the BMCR 0.11 Power down bit. IEEE 802.3 states that the
behaviour of the PHY in this state is implementation specific, and
the PHY is not required to meet the RX_CLK and TX_CLK requirements.
Essentially, this means that a PHY may stop the clocks that it is
generating while in power down state.
However, MACs exist which require the clocks from the PHY to be running
in order to properly resume. phylink_prepare_resume() provides them
with a way to clear the Power down bit early.
Note, however, that IEEE 802.3 gives PHYs up to 500ms grace before the
transmit and receive clocks meet the requirements after clearing the
power down bit.
Add a resume preparation function, which will ensure that the receive
clock from the PHY is appropriately configured while resuming.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tvO6V-008Vjb-AP@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This adds pasid_at|de]tach_ioas ops for attaching hwpt to pasid of a
device and the helpers for it. For now, only vfio-pci supports pasid
attach/detach.
Link: https://patch.msgid.link/r/20250321180143.8468-3-yi.l.liu@intel.com
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
There is no helpers for user to check if a given ID is allocated or not,
neither a helper to loop all the allocated IDs in an IDA and do something
for cleanup. With the two needs, a helper to get the lowest allocated ID
of a range and two variants based on it.
Caller can check if a given ID is allocated or not by:
bool ida_exists(struct ida *ida, unsigned int id)
Caller can iterate all allocated IDs by:
int id;
while ((id = ida_find_first(&pasid_ida)) >= 0) {
//anything to do with the allocated ID
ida_free(pasid_ida, pasid);
}
Link: https://patch.msgid.link/r/20250321180143.8468-2-yi.l.liu@intel.com
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
This extends the below APIs to support PASID. Device drivers to manage pasid
attach/replace/detach.
int iommufd_device_attach(struct iommufd_device *idev,
ioasid_t pasid, u32 *pt_id);
int iommufd_device_replace(struct iommufd_device *idev,
ioasid_t pasid, u32 *pt_id);
void iommufd_device_detach(struct iommufd_device *idev,
ioasid_t pasid);
The pasid operations share underlying attach/replace/detach infrastructure
with the device operations, but still have some different implications:
- no reserved region per pasid otherwise SVA architecture is already
broken (CPU address space doesn't count device reserved regions);
- accordingly no sw_msi trick;
Cache coherency enforcement is still applied to pasid operations since
it is about memory accesses post page table walking (no matter the walk
is per RID or per PASID).
Link: https://patch.msgid.link/r/20250321171940.7213-12-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
There are only two sw_msi implementations in the entire system, thus it's
not very necessary to have an sw_msi pointer.
Instead, check domain->cookie_type to call the two sw_msi implementations
directly from the core code.
Link: https://patch.msgid.link/r/7ded87c871afcbaac665b71354de0a335087bf0f.1742871535.git.nicolinc@nvidia.com
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
When DMA/MSI cookies were made first-class citizens back in commit
46983fcd67ac ("iommu: Pull IOVA cookie management into the core"), there
was no real need to further expose the two different cookie types.
However, now that IOMMUFD wants to add a third type of MSI-mapping
cookie, we do have a nicely compelling reason to properly dismabiguate
things at the domain level beyond just vaguely guessing from the domain
type.
Meanwhile, we also effectively have another "cookie" in the form of the
anonymous union for other user data, which isn't much better in terms of
being vague and unenforced. The fact is that all these cookie types are
mutually exclusive, in the sense that combining them makes zero sense
and/or would be catastrophic (iommu_set_fault_handler() on an SVA
domain, anyone?) - the only combination which *might* be reasonable is
perhaps a fault handler and an MSI cookie, but nobody's doing that at
the moment, so let's rule it out as well for the sake of being clear and
robust. To that end, we pull DMA and MSI cookies apart a little more,
mostly to clear up the ambiguity at domain teardown, then for clarity
(and to save a little space), move them into the union, whose ownership
we can then properly describe and enforce entirely unambiguously.
[nicolinc: rebase on latest tree; use prefix IOMMU_COOKIE_; merge unions
in iommu_domain; add IOMMU_COOKIE_IOMMUFD for iommufd_hwpt]
Link: https://patch.msgid.link/r/1ace9076c95204bbe193ee77499d395f15f44b23.1742871535.git.nicolinc@nvidia.com
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-03-18 (ice, idpf)
For ice:
Przemek modifies string declarations to resolve compile issues on gcc 7.5.
Karol adds padding to initial programming of GLTSYN_TIME* registers to
ensure it will occur in the future to prevent hardware issues.
Jesse Brandeburg turns off driver RDMA capability when the corresponding
kernel config is not enabled to aid in preventing resource exhaustion.
Jan adjusts type declaration to properly catch error conditions and
prevent truncation of values. He also adds bounds checking to prevent
overflow in ice_vc_cfg_q_quanta().
Lukasz adds checking and error reporting for invalid values in
ice_vc_cfg_q_bw().
Mateusz adds check for valid size for ice_vc_fdir_parse_raw().
For idpf:
Emil adds check, and handling, on failure to register netdev.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
idpf: check error for register_netdev() on init
ice: fix using untrusted value of pkt_len in ice_vc_fdir_parse_raw()
ice: fix input validation for virtchnl BW
ice: validate queue quanta parameters to prevent OOB access
ice: stop truncating queue ids when checking
virtchnl: make proto and filter action count unsigned
ice: fix reservation of resources for RDMA when disabled
ice: ensure periodic output start time is in the future
ice: health.c: fix compilation on gcc 7.5
====================
Link: https://patch.msgid.link/20250318200511.2958251-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-mergewindow
i2c-host updates for v6.15
Refactoring and cleanups
- octeon, cadence, i801, pasemi, mlxbf, bcm-iproc: general
refactorings
- octeon: remove 10-bit address support
Improvements
- amd-asf: improved error handling
- designware: use guard(mutex)
- amd-asf, designware: update naming to follow latest specs
- cadence: fix cleanup path in probe
- i801: use MMIO and I/O mapping helpers to access registers
- pxa: handle error after clk_prepare_enable
New features
- added i2c_10bit_addr_*_from_msg() and updated multiple drivers
- omap: added multiplexer state handling
- qcom-geni: update frequency configuration
- qup: introduce DMA usage policy
New hardware support
- exynos: add support for Samsung exynos7870
- k1: add support for spacemit k1 (new driver)
- imx: add support for i.mx94 lpi2c
- rk3x: add support for rk3562
Multiplexers
- ltc4306, reg: fix assignment in platform_driver structure
|
|
ENTRY and ENDPROC were deprecated years ago and replaced with
SYM_FUNC_{START,END}. Fix up a few outdated references in the objtool
documentation and comments. Also fix a few typos.
Suggested-by: Brendan Jackman <jackmanb@google.com>
Suggested-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/5eb7e06e1a0e87aaeda8d583ab060e7638a6ea8e.1742852846.git.jpoimboe@kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core x86 updates from Ingo Molnar:
"x86 CPU features support:
- Generate the <asm/cpufeaturemasks.h> header based on build config
(H. Peter Anvin, Xin Li)
- x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
- Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
- Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan
Jackman)
- Utilize CPU-type for CPU matching (Pawan Gupta)
- Warn about unmet CPU feature dependencies (Sohil Mehta)
- Prepare for new Intel Family numbers (Sohil Mehta)
Percpu code:
- Standardize & reorganize the x86 percpu layout and related cleanups
(Brian Gerst)
- Convert the stackprotector canary to a regular percpu variable
(Brian Gerst)
- Add a percpu subsection for cache hot data (Brian Gerst)
- Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
- Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)
MM:
- Add support for broadcast TLB invalidation using AMD's INVLPGB
instruction (Rik van Riel)
- Rework ROX cache to avoid writable copy (Mike Rapoport)
- PAT: restore large ROX pages after fragmentation (Kirill A.
Shutemov, Mike Rapoport)
- Make memremap(MEMREMAP_WB) map memory as encrypted by default
(Kirill A. Shutemov)
- Robustify page table initialization (Kirill A. Shutemov)
- Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
- Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
(Matthew Wilcox)
KASLR:
- x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI
BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir
Singh)
CPU bugs:
- Implement FineIBT-BHI mitigation (Peter Zijlstra)
- speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
- speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan
Gupta)
- RFDS: Exclude P-only parts from the RFDS affected list (Pawan
Gupta)
System calls:
- Break up entry/common.c (Brian Gerst)
- Move sysctls into arch/x86 (Joel Granados)
Intel LAM support updates: (Maciej Wieczor-Retman)
- selftests/lam: Move cpu_has_la57() to use cpuinfo flag
- selftests/lam: Skip test if LAM is disabled
- selftests/lam: Test get_user() LAM pointer handling
AMD SMN access updates:
- Add SMN offsets to exclusive region access (Mario Limonciello)
- Add support for debugfs access to SMN registers (Mario Limonciello)
- Have HSMP use SMN through AMD_NODE (Yazen Ghannam)
Power management updates: (Patryk Wlazlyn)
- Allow calling mwait_play_dead with an arbitrary hint
- ACPI/processor_idle: Add FFH state handling
- intel_idle: Provide the default enter_dead() handler
- Eliminate mwait_play_dead_cpuid_hint()
Build system:
- Raise the minimum GCC version to 8.1 (Brian Gerst)
- Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor)
Kconfig: (Arnd Bergmann)
- Add cmpxchg8b support back to Geode CPUs
- Drop 32-bit "bigsmp" machine support
- Rework CONFIG_GENERIC_CPU compiler flags
- Drop configuration options for early 64-bit CPUs
- Remove CONFIG_HIGHMEM64G support
- Drop CONFIG_SWIOTLB for PAE
- Drop support for CONFIG_HIGHPTE
- Document CONFIG_X86_INTEL_MID as 64-bit-only
- Remove old STA2x11 support
- Only allow CONFIG_EISA for 32-bit
Headers:
- Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI
headers (Thomas Huth)
Assembly code & machine code patching:
- x86/alternatives: Simplify alternative_call() interface (Josh
Poimboeuf)
- x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
- KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
- x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
- x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
- x86/kexec: Merge x86_32 and x86_64 code using macros from
<asm/asm.h> (Uros Bizjak)
- Use named operands in inline asm (Uros Bizjak)
- Improve performance by using asm_inline() for atomic locking
instructions (Uros Bizjak)
Earlyprintk:
- Harden early_serial (Peter Zijlstra)
NMI handler:
- Add an emergency handler in nmi_desc & use it in
nmi_shootdown_cpus() (Waiman Long)
Miscellaneous fixes and cleanups:
- by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem
Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan
Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar,
Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej
Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter
Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly
Kuznetsov, Xin Li, liuye"
* tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits)
zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault
x86/asm: Make asm export of __ref_stack_chk_guard unconditional
x86/mm: Only do broadcast flush from reclaim if pages were unmapped
perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones
perf/x86/intel, x86/cpu: Simplify Intel PMU initialization
x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers
x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers
x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions
x86/asm: Use asm_inline() instead of asm() in clwb()
x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h>
x86/hweight: Use asm_inline() instead of asm()
x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm()
x86/hweight: Use named operands in inline asm()
x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP
x86/head/64: Avoid Clang < 17 stack protector in startup code
x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro
x86/cpu/intel: Limit the non-architectural constant_tsc model checks
x86/mm/pat: Replace Intel x86_model checks with VFM ones
x86/cpu/intel: Fix fast string initialization for extended Families
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
"Core:
- Move perf_event sysctls into kernel/events/ (Joel Granados)
- Use POLLHUP for pinned events in error (Namhyung Kim)
- Avoid the read if the count is already updated (Peter Zijlstra)
- Allow the EPOLLRDNORM flag for poll (Tao Chen)
- locking/percpu-rwsem: Add guard support [ NOTE: this got
(mis-)merged into the perf tree due to related work ] (Peter
Zijlstra)
perf_pmu_unregister() related improvements: (Peter Zijlstra)
- Simplify the perf_event_alloc() error path
- Simplify the perf_pmu_register() error path
- Simplify perf_pmu_register()
- Simplify perf_init_event()
- Simplify perf_event_alloc()
- Merge struct pmu::pmu_disable_count into struct
perf_cpu_pmu_context::pmu_disable_count
- Add this_cpc() helper
- Introduce perf_free_addr_filters()
- Robustify perf_event_free_bpf_prog()
- Simplify the perf_mmap() control flow
- Further simplify perf_mmap()
- Remove retry loop from perf_mmap()
- Lift event->mmap_mutex in perf_mmap()
- Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes
- Fix perf_mmap() failure path
Uprobes:
- Harden x86 uretprobe syscall trampoline check (Jiri Olsa)
- Remove redundant spinlock in uprobe_deny_signal() (Liao Chang)
- Remove the spinlock within handle_singlestep() (Liao Chang)
x86 Intel PMU enhancements:
- Support PEBS counters snapshotting (Kan Liang)
- Fix intel_pmu_read_event() (Kan Liang)
- Extend per event callchain limit to branch stack (Kan Liang)
- Fix system-wide LBR profiling (Kan Liang)
- Allocate bts_ctx only if necessary (Li RongQing)
- Apply static call for drain_pebs (Peter Zijlstra)
x86 AMD PMU enhancements: (Ravi Bangoria)
- Remove pointless sample period check
- Fix ->config to sample period calculation for OP PMU
- Fix perf_ibs_op.cnt_mask for CurCnt
- Don't allow freq mode event creation through ->config interface
- Add PMU specific minimum period
- Add ->check_period() callback
- Ceil sample_period to min_period
- Add support for OP Load Latency Filtering
- Update DTLB/PageSize decode logic
Hardware breakpoints:
- Return EOPNOTSUPP for unsupported breakpoint type (Saket Kumar
Bhaskar)
Hardlockup detector improvements: (Li Huafei)
- perf_event memory leak
- Warn if watchdog_ev is leaked
Fixes and cleanups:
- Misc fixes and cleanups (Andy Shevchenko, Kan Liang, Peter
Zijlstra, Ravi Bangoria, Thorsten Blum, XieLudan)"
* tag 'perf-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
perf: Fix __percpu annotation
perf: Clean up pmu specific data
perf/x86: Remove swap_task_ctx()
perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode
perf: Supply task information to sched_task()
perf: attach/detach PMU specific data
locking/percpu-rwsem: Add guard support
perf: Save PMU specific data in task_struct
perf: Extend per event callchain limit to branch stack
perf/ring_buffer: Allow the EPOLLRDNORM flag for poll
perf/core: Use POLLHUP for pinned events in error
perf/core: Use sysfs_emit() instead of scnprintf()
perf/core: Remove optional 'size' arguments from strscpy() calls
perf/x86/intel/bts: Check if bts_ctx is allocated when calling BTS functions
uprobes/x86: Harden uretprobe syscall trampoline check
watchdog/hardlockup/perf: Warn if watchdog_ev is leaked
watchdog/hardlockup/perf: Fix perf_event memory leak
perf/x86: Annotate struct bts_buffer::buf with __counted_by()
perf/core: Clean up perf_try_init_event()
perf/core: Fix perf_mmap() failure path
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Core & fair scheduler changes:
- Cancel the slice protection of the idle entity (Zihan Zhou)
- Reduce the default slice to avoid tasks getting an extra tick
(Zihan Zhou)
- Force propagating min_slice of cfs_rq when {en,de}queue tasks
(Tianchen Ding)
- Refactor can_migrate_task() to elimate looping (I Hsin Cheng)
- Add unlikey branch hints to several system calls (Colin Ian King)
- Optimize current_clr_polling() on certain architectures (Yujun
Dong)
Deadline scheduler: (Juri Lelli)
- Remove redundant dl_clear_root_domain call
- Move dl_rebuild_rd_accounting to cpuset.h
Uclamp:
- Use the uclamp_is_used() helper instead of open-coding it (Xuewen
Yan)
- Optimize sched_uclamp_used static key enabling (Xuewen Yan)
Scheduler topology support: (Juri Lelli)
- Ignore special tasks when rebuilding domains
- Add wrappers for sched_domains_mutex
- Generalize unique visiting of root domains
- Rebuild root domain accounting after every update
- Remove partition_and_rebuild_sched_domains
- Stop exposing partition_sched_domains_locked
RSEQ: (Michael Jeanson)
- Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
- Fix segfault on registration when rseq_cs is non-zero
- selftests: Add rseq syscall errors test
- selftests: Ensure the rseq ABI TLS is actually 1024 bytes
Membarriers:
- Fix redundant load of membarrier_state (Nysal Jan K.A.)
Scheduler debugging:
- Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
- Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)
Fixes and cleanups:
- Always save/restore x86 TSC sched_clock() on suspend/resume
(Guilherme G. Piccoli)
- Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian
Andrzej Siewior)"
* tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling()
sched/debug: Remove CONFIG_SCHED_DEBUG
sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files
sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation
sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
sched/debug: Make 'const_debug' tunables unconditional __read_mostly
sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
rseq/selftests: Fix namespace collision with rseq UAPI header
include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h
sched/topology: Stop exposing partition_sched_domains_locked
cgroup/cpuset: Remove partition_and_rebuild_sched_domains
sched/topology: Remove redundant dl_clear_root_domain call
sched/deadline: Rebuild root domain accounting after every update
sched/deadline: Generalize unique visiting of root domains
sched/topology: Wrappers for sched_domains_mutex
sched/deadline: Ignore special tasks when rebuilding domains
tracing: Use preempt_model_str()
xtensa: Rely on generic printing of preemption model
x86: Rely on generic printing of preemption model
s390: Rely on generic printing of preemption model
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Locking primitives:
- Micro-optimize percpu_{,try_}cmpxchg{64,128}_op() and
{,try_}cmpxchg{64,128} on x86 (Uros Bizjak)
- mutexes: extend debug checks in mutex_lock() (Yunhui Cui)
- Misc cleanups (Uros Bizjak)
Lockdep:
- Fix might_fault() lockdep check of current->mm->mmap_lock (Peter
Zijlstra)
- Don't disable interrupts on RT in disable_irq_nosync_lockdep.*()
(Sebastian Andrzej Siewior)
- Disable KASAN instrumentation of lockdep.c (Waiman Long)
- Add kasan_check_byte() check in lock_acquire() (Waiman Long)
- Misc cleanups (Sebastian Andrzej Siewior)
Rust runtime integration:
- Use Pin for all LockClassKey usages (Mitchell Levy)
- sync: Add accessor for the lock behind a given guard (Alice Ryhl)
- sync: condvar: Add wait_interruptible_freezable() (Alice Ryhl)
- sync: lock: Add an example for Guard:: Lock_ref() (Boqun Feng)
Split-lock detection feature (x86):
- Fix warning mode with disabled mitigation mode (Maksim Davydov)
Locking events:
- Add locking events for rtmutex slow paths (Waiman Long)
- Add locking events for lockdep (Waiman Long)"
* tag 'locking-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Remove disable_irq_lockdep()
lockdep: Don't disable interrupts on RT in disable_irq_nosync_lockdep.*()
rust: lockdep: Use Pin for all LockClassKey usages
rust: sync: condvar: Add wait_interruptible_freezable()
rust: sync: lock: Add an example for Guard:: Lock_ref()
rust: sync: Add accessor for the lock behind a given guard
locking/lockdep: Add kasan_check_byte() check in lock_acquire()
locking/lockdep: Disable KASAN instrumentation of lockdep.c
locking/lock_events: Add locking events for lockdep
locking/lock_events: Add locking events for rtmutex slow paths
x86/split_lock: Fix the delayed detection logic
lockdep/mm: Fix might_fault() lockdep check of current->mm->mmap_lock
x86/locking: Remove semicolon from "lock" prefix
locking/mutex: Add MUTEX_WARN_ON() into fast path
x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations
x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Boqun Feng:
"Documentation:
- Add broken-timing possibility to stallwarn.rst
- Improve discussion of this_cpu_ptr(), add raw_cpu_ptr()
- Document self-propagating callbacks
- Point call_srcu() to call_rcu() for detailed memory ordering
- Add CONFIG_RCU_LAZY delays to call_rcu() kernel-doc header
- Clarify RCU_LAZY and RCU_LAZY_DEFAULT_OFF help text
- Remove references to old grace-period-wait primitives
srcu:
- Introduce srcu_read_{un,}lock_fast(), which is similar to
srcu_read_{un,}lock_lite(): avoid smp_mb()s in lock and unlock
at the cost of calling synchronize_rcu() in synchronize_srcu()
Moreover, by returning the percpu offset of the counter at
srcu_read_lock_fast() time, srcu_read_unlock_fast() can avoid
extra pointer dereferencing, which makes it faster than
srcu_read_{un,}lock_lite()
srcu_read_{un,}lock_fast() are intended to replace
rcu_read_{un,}lock_trace() if possible
RCU torture:
- Add get_torture_init_jiffies() to return the start time of the test
- Add a test_boost_holdoff module parameter to allow delaying
boosting tests when building rcutorture as built-in
- Add grace period sequence number logging at the beginning and end
of failure/close-call results
- Switch to hexadecimal for the expedited grace period sequence
number in the rcu_exp_grace_period trace point
- Make cur_ops->format_gp_seqs take buffer length
- Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool
- Complain when invalid SRCU reader_flavor is specified
- Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing, which forces SRCU
uses atomics even when percpu ops are NMI safe, and use the Kconfig
for SRCU lockdep testing
Misc:
- Split rcu_report_exp_cpu_mult() mask parameter and use for tracing
- Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes()
- Fix get_state_synchronize_rcu_full() GP-start detection
- Move RCU Tasks self-tests to core_initcall()
- Print segment lengths in show_rcu_nocb_gp_state()
- Make RCU watch ct_kernel_exit_state() warning
- Flush console log from kernel_power_off()
- rcutorture: Allow a negative value for nfakewriters
- rcu: Update TREE05.boot to test normal synchronize_rcu()
- rcu: Use _full() API to debug synchronize_rcu()
Make RCU handle PREEMPT_LAZY better:
- Fix header guard for rcu_all_qs()
- rcu: Rename PREEMPT_AUTO to PREEMPT_LAZY
- Update __cond_resched comment about RCU quiescent states
- Handle unstable rdp in rcu_read_unlock_strict()
- Handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y
- osnoise: Provide quiescent states
- Adjust rcutorture with possible PREEMPT_RCU=n && PREEMPT_COUNT=y
combination
- Limit PREEMPT_RCU configurations
- Make rcutorture senario TREE07 and senario TREE10 use
PREEMPT_LAZY=y"
* tag 'rcu-next-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (59 commits)
rcutorture: Make scenario TREE07 build CONFIG_PREEMPT_LAZY=y
rcutorture: Make scenario TREE10 build CONFIG_PREEMPT_LAZY=y
rcu: limit PREEMPT_RCU configurations
rcutorture: Update ->extendables check for lazy preemption
rcutorture: Update rcutorture_one_extend_check() for lazy preemption
osnoise: provide quiescent states
rcu: Use _full() API to debug synchronize_rcu()
rcu: Update TREE05.boot to test normal synchronize_rcu()
rcutorture: Allow a negative value for nfakewriters
Flush console log from kernel_power_off()
context_tracking: Make RCU watch ct_kernel_exit_state() warning
rcu/nocb: Print segment lengths in show_rcu_nocb_gp_state()
rcu-tasks: Move RCU Tasks self-tests to core_initcall()
rcu: Fix get_state_synchronize_rcu_full() GP-start detection
torture: Make SRCU lockdep testing use srcu_read_lock_nmisafe()
srcu: Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing
rcutorture: Complain when invalid SRCU reader_flavor is specified
rcutorture: Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool
rcutorture: Make cur_ops->format_gp_seqs take buffer length
rcutorture: Add ftrace-compatible timestamp to GP# failure/close-call output
...
|
|
Pull bitmap updates from Yury Norov:
- cpumask_next_wrap() rework (me)
- GENMASK() simplification (I Hsin)
- rust bindings for cpumasks (Viresh and me)
- scattered cleanups (Andy, Tamir, Vincent, Ignacio and Joel)
* tag 'bitmap-for-6.15' of https://github.com/norov/linux: (22 commits)
cpumask: align text in comment
riscv: fix test_and_{set,clear}_bit ordering documentation
treewide: fix typo 'unsigned __init128' -> 'unsigned __int128'
MAINTAINERS: add rust bindings entry for bitmap API
rust: Add cpumask helpers
uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)"
cpumask: drop cpumask_next_wrap_old()
PCI: hv: Switch hv_compose_multi_msi_req_get_cpu() to using cpumask_next_wrap()
scsi: lpfc: rework lpfc_next_{online,present}_cpu()
scsi: lpfc: switch lpfc_irq_rebalance() to using cpumask_next_wrap()
s390: switch stop_machine_yield() to using cpumask_next_wrap()
padata: switch padata_find_next() to using cpumask_next_wrap()
cpumask: use cpumask_next_wrap() where appropriate
cpumask: re-introduce cpumask_next{,_and}_wrap()
cpumask: deprecate cpumask_next_wrap()
powerpc/xmon: simplify xmon_batch_next_cpu()
ibmvnic: simplify ibmvnic_set_queue_affinity()
virtio_net: simplify virtnet_set_affinity()
objpool: rework objpool_pop()
cpumask: add for_each_{possible,online}_cpu_wrap
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- Add mechanism to count and report internal events. This significantly
improves visibility on subtle corner conditions.
- The default idle CPU selection logic is revamped and improved in
multiple ways including being made topology aware.
- sched_ext was disabling ttwu_queue for simplicity, which can be
costly when hardware topology is more complex. Implement
SCX_OPS_ALLOWED_QUEUED_WAKEUP so that BPF schedulers can selectively
enable ttwu_queue.
- tools/sched_ext updates to improve compatibility among others.
- Other misc updates and fixes.
- sched_ext/for-6.14-fixes were pulled a few times to receive
prerequisite fixes and resolve conflicts.
* tag 'sched_ext-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (42 commits)
sched_ext: idle: Refactor scx_select_cpu_dfl()
sched_ext: idle: Honor idle flags in the built-in idle selection policy
sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()
sched_ext: Add trace point to track sched_ext core events
sched_ext: Change the event type from u64 to s64
sched_ext: Documentation: add task lifecycle summary
tools/sched_ext: Provide a compatible helper for scx_bpf_events()
selftests/sched_ext: Add NUMA-aware scheduler test
tools/sched_ext: Provide consistent access to scx flags
sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior
sched_ext: idle: Introduce scx_bpf_nr_node_ids()
sched_ext: idle: Introduce node-aware idle cpu kfunc helpers
sched_ext: idle: Per-node idle cpumasks
sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE
sched_ext: idle: Make idle static keys private
sched/topology: Introduce for_each_node_numadist() iterator
mm/numa: Introduce nearest_node_nodemask()
nodemask: numa: reorganize inclusion path
nodemask: add nodes_copy()
tools/sched_ext: Sync with scx repo
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- Add deprecation info messages to cgroup1-only features
- rstat updates including a bug fix and breaking up a critical section
to reduce interrupt latency impact
- Other misc and doc updates
* tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: Cleanup flushing functions and locking
cgroup/rstat: avoid disabling irqs for O(num_cpu)
mm: Fix a build breakage in memcontrol-v1.c
blk-cgroup: Simplify policy files registration
cgroup: Update file naming comment
cgroup: Add deprecation message to legacy freezer controller
mm: Add transformation message for per-memcg swappiness
RFC cgroup/cpuset-v1: Add deprecation messages to sched_relax_domain_level
cgroup/cpuset-v1: Add deprecation messages to memory_migrate
cgroup/cpuset-v1: Add deprecation messages to mem_exclusive and mem_hardwall
cgroup: Print message when /proc/cgroups is read on v2-only system
cgroup/blkio: Add deprecation messages to reset_stats
cgroup/cpuset-v1: Add deprecation messages to memory_spread_page and memory_spread_slab
cgroup/cpuset-v1: Add deprecation messages to sched_load_balance and memory_pressure_enabled
cgroup, docs: Be explicit about independence of RT_GROUP_SCHED and non-cpu controllers
cgroup/rstat: Fix forceidle time in cpu.stat
cgroup/misc: Remove unused misc_cg_res_total_usage
cgroup/cpuset: Move procfs cpuset attribute under cgroup-v1.c
cgroup: update comment about dropping cgroup kn refs
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- Move the TINY_RCU kvfree_rcu() implementation from RCU to SLAB
subsystem and cleanup its integration (Vlastimil Babka)
Following the move of the TREE_RCU batching kvfree_rcu()
implementation in 6.14, move also the simpler TINY_RCU variant.
Refactor the #ifdef guards so that the simple implementation is also
used with SLUB_TINY.
Remove the need for RCU to recognize fake callback function pointers
(__is_kvfree_rcu_offset()) when handling call_rcu() by implementing a
callback that calculates the object's address from the embedded
rcu_head address without knowing its offset.
- Improve kmalloc cache randomization in kvmalloc (GONG Ruiqi)
Due to an extra layer of function call, all kvmalloc() allocations
used the same set of random caches. Thanks to moving the kvmalloc()
implementation to slub.c, this is improved and randomization now
works for kvmalloc.
- Various improvements to debugging, testing and other cleanups (Hyesoo
Yu, Lilith Gkini, Uladzislau Rezki, Matthew Wilcox, Kevin Brodsky, Ye
Bin)
* tag 'slab-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slub: Handle freelist cycle in on_freelist()
mm/slab: call kmalloc_noprof() unconditionally in kmalloc_array_noprof()
slab: Mark large folios for debugging purposes
kunit, slub: Add test_kfree_rcu_wq_destroy use case
mm, slab: cleanup slab_bug() parameters
mm: slub: call WARN() when detecting a slab corruption
mm: slub: Print the broken data before restoring them
slab: Achieve better kmalloc caches randomization in kvmalloc
slab: Adjust placement of __kvmalloc_node_noprof
mm/slab: simplify SLAB_* flag handling
slab: don't batch kvfree_rcu() with SLUB_TINY
rcu, slab: use a regular callback function for kvfree_rcu
rcu: remove trace_rcu_kvfree_callback
slab, rcu: move TINY_RCU variant of kvfree_rcu() to SLAB
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
- avoid the lock trip seccomp_filter_release in common case (Mateusz
Guzik)
- remove unused 'sd' argument through-out (Oleg Nesterov)
- selftests/seccomp: Add hard-coded __NR_uretprobe for x86_64
* tag 'seccomp-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
seccomp: avoid the lock trip seccomp_filter_release in common case
seccomp: remove the 'sd' argument from __seccomp_filter()
seccomp: remove the 'sd' argument from __secure_computing()
seccomp: fix the __secure_computing() stub for !HAVE_ARCH_SECCOMP_FILTER
seccomp/mips: change syscall_trace_enter() to use secure_computing()
selftests/seccomp: Add hard-coded __NR_uretprobe for x86_64
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"As usual, it's scattered changes all over. Patches touching things
outside of our traditional areas in the tree have been Acked by
maintainers or were trivial changes:
- loadpin: remove unsupported MODULE_COMPRESS_NONE (Arulpandiyan
Vadivel)
- samples/check-exec: Fix script name (Mickaël Salaün)
- yama: remove needless locking in yama_task_prctl() (Oleg Nesterov)
- lib/string_choices: Sort by function name (R Sundar)
- hardening: Allow default HARDENED_USERCOPY to be set at compile
time (Mel Gorman)
- uaccess: Split out compile-time checks into ucopysize.h
- kbuild: clang: Support building UM with SUBARCH=i386
- x86: Enable i386 FORTIFY_SOURCE on Clang 16+
- ubsan/overflow: Rework integer overflow sanitizer option
- Add missing __nonstring annotations for callers of
memtostr*()/strtomem*()
- Add __must_be_noncstr() and have memtostr*()/strtomem*() check for
it
- Introduce __nonstring_array for silencing future GCC 15 warnings"
* tag 'hardening-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits)
compiler_types: Introduce __nonstring_array
hardening: Enable i386 FORTIFY_SOURCE on Clang 16+
x86/build: Remove -ffreestanding on i386 with GCC
ubsan/overflow: Enable ignorelist parsing and add type filter
ubsan/overflow: Enable pattern exclusions
ubsan/overflow: Rework integer overflow sanitizer option to turn on everything
samples/check-exec: Fix script name
yama: don't abuse rcu_read_lock/get_task_struct in yama_task_prctl()
kbuild: clang: Support building UM with SUBARCH=i386
loadpin: remove MODULE_COMPRESS_NONE as it is no longer supported
lib/string_choices: Rearrange functions in sorted order
string.h: Validate memtostr*()/strtomem*() arguments more carefully
compiler.h: Introduce __must_be_noncstr()
nilfs2: Mark on-disk strings as nonstring
uapi: stddef.h: Introduce __kernel_nonstring
x86/tdx: Mark message.bytes as nonstring
string: kunit: Mark nonstring test strings as __nonstring
scsi: qla2xxx: Mark device strings as nonstring
scsi: mpt3sas: Mark device strings as nonstring
scsi: mpi3mr: Mark device strings as nonstring
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
- elf: Define and use note name macros (Akihiko Odaki)
- elf: add remaining SHF_ flag macros (Timur Tabi)
- binfmt: Remove loader from linux_binprm struct (Yonatan Goldschmidt)
- binfmt_elf_fdpic: fix variable set but not used warning (sunliming)
* tag 'execve-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_elf_fdpic: fix variable set but not used warning
elf: add remaining SHF_ flag macros
binfmt: Remove loader from linux_binprm struct
crash: Remove KEXEC_CORE_NOTE_NAME
s390/crash: Use note name macros
crash: Use note name macros
powerpc/crash: Use note name macros
binfmt_elf: Use note name macros
elf: Define note name macros
|
|
Monitors describing complex systems, such as the scheduler, can easily
grow to the point where they are just hard to understand because of the
many possible state transitions.
Often it is possible to break such descriptions into smaller monitors,
sharing some or all events. Enabling those smaller monitors concurrently
is, in fact, testing the system as if we had one single larger monitor.
Splitting models into multiple specification is not only easier to
understand, but gives some more clues when we see errors.
Add the possibility to create container monitors, whose only purpose is
to host other nested monitors. Enabling a container monitor enables all
nested ones, but it's still possible to enable nested monitors
independently.
Add the sched monitor as first container, for now empty.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/20250305140406.350227-3-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add the following tracepoints:
* sched_entry(bool preempt, ip)
Called while entering __schedule
* sched_exit(bool is_switch, ip)
Called while exiting __schedule
* sched_set_state(task, curr_state, state)
Called when a task changes its state (to and from running)
These tracepoints are useful to describe the Linux task model and are
adapted from the patches by Daniel Bristot de Oliveira
(https://bristot.me/linux-task-model/).
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/20250305140406.350227-2-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Currently network taps unbound to any interface are linked in the
global ptype_all list, affecting the performance in all the network
namespaces.
Add per netns ptypes chains, so that in the mentioned case only
the netns owning the packet socket(s) is affected.
While at that drop the global ptype_all list: no in kernel user
registers a tap on "any" type without specifying either the target
device or the target namespace (and IMHO doing that would not make
any sense).
Note that this adds a conditional in the fast path (to check for
per netns ptype_specific list) and increases the dataset size by
a cacheline (owing the per netns lists).
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumaze@google.com>
Link: https://patch.msgid.link/ae405f98875ee87f8150c460ad162de7e466f8a7.1742494826.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull tasklist_lock optimizations from Christian Brauner:
"According to the performance testbots this brings a 23% performance
increase when creating new processes:
- Reduce tasklist_lock hold time on exit:
- Perform add_device_randomness() without tasklist_lock
- Perform free_pid() calls outside of tasklist_lock
- Drop irq disablement around pidmap_lock
- Add some tasklist_lock asserts
- Call flush_sigqueue() lockless by changing release_task()
- Don't pointlessly clear TIF_SIGPENDING in __exit_signal() ->
clear_tsk_thread_flag()"
* tag 'kernel-6.15-rc1.tasklist_lock' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pid: drop irq disablement around pidmap_lock
pid: perform free_pid() calls outside of tasklist_lock
pid: sprinkle tasklist_lock asserts
exit: hoist get_pid() in release_task() outside of tasklist_lock
exit: perform add_device_randomness() without tasklist_lock
exit: kill the pointless __exit_signal()->clear_tsk_thread_flag(TIF_SIGPENDING)
exit: change the release_task() paths to call flush_sigqueue() lockless
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs file handling updates from Christian Brauner:
"This contains performance improvements for struct file's new refcount
mechanism and various other performance work:
- The stock kernel transitioning the file to no refs held penalizes
the caller with an extra atomic to block any increments. For cases
where the file is highly likely to be going away this is easily
avoidable.
Add file_ref_put_close() to better handle the common case where
closing a file descriptor also operates on the last reference and
build fput_close_sync() and fput_close() on top of it. This brings
about 1% performance improvement by eliding one atomic in the
common case.
- Predict no error in close() since the vast majority of the time
system call returns 0.
- Reduce the work done in fdget_pos() by predicting that the file was
found and by explicitly comparing the reference count to one and
ignoring the dead zone"
* tag 'vfs-6.15-rc1.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: reduce work in fdget_pos()
fs: use fput_close() in path_openat()
fs: use fput_close() in filp_close()
fs: use fput_close_sync() in close()
file: add fput and file_ref_put routines optimized for use when closing a fd
fs: predict no error in close()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs initramfs updates from Christian Brauner:
"This adds basic kunit test coverage for initramfs unpacking and cleans
up some buffer handling issues and inefficiencies"
* tag 'vfs-6.15-rc1.initramfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
MAINTAINERS: append initramfs files to the VFS section
initramfs: avoid static buffer for error message
initramfs: fix hardlink hash leak without TRAILER
initramfs: reuse name_len for dir mtime tracking
initramfs: allocate heap buffers together
initramfs: avoid memcpy for hex header fields
vsprintf: add simple_strntoul
initramfs_test: kunit tests for initramfs unpacking
init: add initramfs_internal.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs ceph updates from Christian Brauner:
"This contains the work to remove access to page->index from ceph
and fixes the test failure observed for ceph with generic/421 by
refactoring ceph_writepages_start()"
* tag 'vfs-6.15-rc1.ceph' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fscrypt: Change fscrypt_encrypt_pagecache_blocks() to take a folio
ceph: Fix error handling in fill_readdir_cache()
fs: Remove page_mkwrite_check_truncate()
ceph: Pass a folio to ceph_allocate_page_array()
ceph: Convert ceph_move_dirty_page_in_page_array() to move_dirty_folio_in_page_array()
ceph: Remove uses of page from ceph_process_folio_batch()
ceph: Convert ceph_check_page_before_write() to use a folio
ceph: Convert writepage_nounlock() to write_folio_nounlock()
ceph: Convert ceph_readdir_cache_control to store a folio
ceph: Convert ceph_find_incompatible() to take a folio
ceph: Use a folio in ceph_page_mkwrite()
ceph: Remove ceph_writepage()
ceph: fix generic/421 test failure
ceph: introduce ceph_submit_write() method
ceph: introduce ceph_process_folio_batch() method
ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs pagesize updates from Christian Brauner:
"This enables block sizes greater than the page size for block devices.
With this we can start supporting block devices with logical block
sizes larger than 4k.
It also allows to lift the device cache sector size support to 64k.
This allows filesystems which can use larger sector sizes up to 64k to
ensure that the filesystem will not generate writes that are smaller
than the specified sector size"
* tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
bdev: use bdev_io_min() for statx block size
block/bdev: lift block size restrictions to 64k
block/bdev: enable large folio support for large logical block sizes
fs/buffer fs/mpage: remove large folio restriction
fs/mpage: use blocks_per_folio instead of blocks_per_page
fs/mpage: avoid negative shift for large blocksize
fs/buffer: remove batching from async read
fs/buffer: simplify block_read_full_folio() with bh_offset()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount namespace updates from Christian Brauner:
"This expands the ability of anonymous mount namespaces:
- Creating detached mounts from detached mounts
Currently, detached mounts can only be created from attached
mounts. This limitaton prevents various use-cases. For example, the
ability to mount a subdirectory without ever having to make the
whole filesystem visible first.
The current permission modelis:
(1) Check that the caller is privileged over the owning user
namespace of it's current mount namespace.
(2) Check that the caller is located in the mount namespace of the
mount it wants to create a detached copy of.
While it is not strictly necessary to do it this way it is
consistently applied in the new mount api. This model will also be
used when allowing the creation of detached mount from another
detached mount.
The (1) requirement can simply be met by performing the same check
as for the non-detached case, i.e., verify that the caller is
privileged over its current mount namespace.
To meet the (2) requirement it must be possible to infer the origin
mount namespace that the anonymous mount namespace of the detached
mount was created from.
The origin mount namespace of an anonymous mount is the mount
namespace that the mounts that were copied into the anonymous mount
namespace originate from.
In order to check the origin mount namespace of an anonymous mount
namespace the sequence number of the original mount namespace is
recorded in the anonymous mount namespace.
With this in place it is possible to perform an equivalent check
(2') to (2). The origin mount namespace of the anonymous mount
namespace must be the same as the caller's mount namespace. To
establish this the sequence number of the caller's mount namespace
and the origin sequence number of the anonymous mount namespace are
compared.
The caller is always located in a non-anonymous mount namespace
since anonymous mount namespaces cannot be setns()ed into. The
caller's mount namespace will thus always have a valid sequence
number.
The owning namespace of any mount namespace, anonymous or
non-anonymous, can never change. A mount attached to a
non-anonymous mount namespace can never change mount namespace.
If the sequence number of the non-anonymous mount namespace and the
origin sequence number of the anonymous mount namespace match, the
owning namespaces must match as well.
Hence, the capability check on the owning namespace of the caller's
mount namespace ensures that the caller has the ability to copy the
mount tree.
- Allow mount detached mounts on detached mounts
Currently, detached mounts can only be mounted onto attached
mounts. This limitation makes it impossible to assemble a new
private rootfs and move it into place. Instead, a detached tree
must be created, attached, then mounted open and then either moved
or detached again. Lift this restriction.
In order to allow mounting detached mounts onto other detached
mounts the same permission model used for creating detached mounts
from detached mounts can be used (cf. above).
Allowing to mount detached mounts onto detached mounts leaves three
cases to consider:
(1) The source mount is an attached mount and the target mount is
a detached mount. This would be equivalent to moving a mount
between different mount namespaces. A caller could move an
attached mount to a detached mount. The detached mount can now
be freely attached to any mount namespace. This changes the
current delegatioh model significantly for no good reason. So
this will fail.
(2) Anonymous mount namespaces are always attached fully, i.e., it
is not possible to only attach a subtree of an anoymous mount
namespace. This simplifies the implementation and reasoning.
Consequently, if the anonymous mount namespace of the source
detached mount and the target detached mount are the identical
the mount request will fail.
(3) The source mount's anonymous mount namespace is different from
the target mount's anonymous mount namespace.
In this case the source anonymous mount namespace of the
source mount tree must be freed after its mounts have been
moved to the target anonymous mount namespace. The source
anonymous mount namespace must be empty afterwards.
By allowing to mount detached mounts onto detached mounts a caller
may do the following:
fd_tree1 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE)
fd_tree2 = open_tree(-EBADF, "/tmp", OPEN_TREE_CLONE)
fd_tree1 and fd_tree2 refer to two different detached mount trees
that belong to two different anonymous mount namespace.
It is important to note that fd_tree1 and fd_tree2 both refer to
the root of their respective anonymous mount namespaces.
By allowing to mount detached mounts onto detached mounts the
caller may now do:
move_mount(fd_tree1, "", fd_tree2, "",
MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH)
This will cause the detached mount referred to by fd_tree1 to be
mounted on top of the detached mount referred to by fd_tree2.
Thus, the detached mount fd_tree1 is moved from its separate
anonymous mount namespace into fd_tree2's anonymous mount
namespace.
It also means that while fd_tree2 continues to refer to the root of
its respective anonymous mount namespace fd_tree1 doesn't anymore.
This has the consequence that only fd_tree2 can be moved to another
anonymous or non-anonymous mount namespace. Moving fd_tree1 will
now fail as fd_tree1 doesn't refer to the root of an anoymous mount
namespace anymore.
Now fd_tree1 and fd_tree2 refer to separate detached mount trees
referring to the same anonymous mount namespace.
This is conceptually fine. The new mount api does allow for this to
happen already via:
mount -t tmpfs tmpfs /mnt
mkdir -p /mnt/A
mount -t tmpfs tmpfs /mnt/A
fd_tree3 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE | AT_RECURSIVE)
fd_tree4 = open_tree(-EBADF, "/mnt/A", 0)
Both fd_tree3 and fd_tree4 refer to two different detached mount
trees but both detached mount trees refer to the same anonymous
mount namespace. An as with fd_tree1 and fd_tree2, only fd_tree3
may be moved another mount namespace as fd_tree3 refers to the root
of the anonymous mount namespace just while fd_tree4 doesn't.
However, there's an important difference between the
fd_tree3/fd_tree4 and the fd_tree1/fd_tree2 example.
Closing fd_tree4 and releasing the respective struct file will have
no further effect on fd_tree3's detached mount tree.
However, closing fd_tree3 will cause the mount tree and the
respective anonymous mount namespace to be destroyed causing the
detached mount tree of fd_tree4 to be invalid for further mounting.
By allowing to mount detached mounts on detached mounts as in the
fd_tree1/fd_tree2 example both struct files will affect each other.
Both fd_tree1 and fd_tree2 refer to struct files that have
FMODE_NEED_UNMOUNT set.
To handle this we use the fact that @fd_tree1 will have a parent
mount once it has been attached to @fd_tree2.
When dissolve_on_fput() is called the mount that has been passed in
will refer to the root of the anonymous mount namespace. If it
doesn't it would mean that mounts are leaked. So before allowing to
mount detached mounts onto detached mounts this would be a bug.
Now that detached mounts can be mounted onto detached mounts it
just means that the mount has been attached to another anonymous
mount namespace and thus dissolve_on_fput() must not unmount the
mount tree or free the anonymous mount namespace as the file
referring to the root of the namespace hasn't been closed yet.
If it had been closed yet it would be obvious because the mount
namespace would be NULL, i.e., the @fd_tree1 would have already
been unmounted. If @fd_tree1 hasn't been unmounted yet and has a
parent mount it is safe to skip any cleanup as closing @fd_tree2
will take care of all cleanup operations.
- Allow mount propagation for detached mount trees
In commit ee2e3f50629f ("mount: fix mounting of detached mounts
onto targets that reside on shared mounts") I fixed a bug where
propagating the source mount tree of an anonymous mount namespace
into a target mount tree of a non-anonymous mount namespace could
be used to trigger an integer overflow in the non-anonymous mount
namespace causing any new mounts to fail.
The cause of this was that the propagation algorithm was unable to
recognize mounts from the source mount tree that were already
propagated into the target mount tree and then reappeared as
propagation targets when walking the destination propagation mount
tree.
When fixing this I disabled mount propagation into anonymous mount
namespaces. Make it possible for anonymous mount namespace to
receive mount propagation events correctly. This is now also a
correctness issue now that we allow mounting detached mount trees
onto detached mount trees.
Mark the source anonymous mount namespace with MNTNS_PROPAGATING
indicating that all mounts belonging to this mount namespace are
currently in the process of being propagated and make the
propagation algorithm discard those if they appear as propagation
targets"
* tag 'vfs-6.15-rc1.mount.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits)
selftests: test subdirectory mounting
selftests: add test for detached mount tree propagation
fs: namespace: fix uninitialized variable use
mount: handle mount propagation for detached mount trees
fs: allow creating detached mounts from fsmount() file descriptors
selftests: seventh test for mounting detached mounts onto detached mounts
selftests: sixth test for mounting detached mounts onto detached mounts
selftests: fifth test for mounting detached mounts onto detached mounts
selftests: fourth test for mounting detached mounts onto detached mounts
selftests: third test for mounting detached mounts onto detached mounts
selftests: second test for mounting detached mounts onto detached mounts
selftests: first test for mounting detached mounts onto detached mounts
fs: mount detached mounts onto detached mounts
fs: support getname_maybe_null() in move_mount()
selftests: create detached mounts from detached mounts
fs: create detached mounts from detached mounts
fs: add may_copy_tree()
fs: add fastpath for dissolve_on_fput()
fs: add assert for move_mount()
fs: add mnt_ns_empty() helper
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs sysv removal from Christian Brauner:
"This removes the sysv filesystem. We've discussed this various times.
It's time to try"
* tag 'vfs-6.15-rc1.sysv' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
sysv: Remove the filesystem
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs async dir updates from Christian Brauner:
"This contains cleanups that fell out of the work from async directory
handling:
- Change kern_path_locked() and user_path_locked_at() to never return
a negative dentry. This simplifies the usability of these helpers
in various places
- Drop d_exact_alias() from the remaining place in NFS where it is
still used. This also allows us to drop the d_exact_alias() helper
completely
- Drop an unnecessary call to fh_update() from nfsd_create_locked()
- Change i_op->mkdir() to return a struct dentry
Change vfs_mkdir() to return a dentry provided by the filesystems
which is hashed and positive. This allows us to reduce the number
of cases where the resulting dentry is not positive to very few
cases. The code in these places becomes simpler and easier to
understand.
- Repack DENTRY_* and LOOKUP_* flags"
* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: fix inline emphasis warning
VFS: Change vfs_mkdir() to return the dentry.
nfs: change mkdir inode_operation to return alternate dentry if needed.
fuse: return correct dentry for ->mkdir
ceph: return the correct dentry on mkdir
hostfs: store inode in dentry after mkdir if possible.
Change inode_operations.mkdir to return struct dentry *
nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
nfs/vfs: discard d_exact_alias()
VFS: add common error checks to lookup_one_qstr_excl()
VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
VFS: repack LOOKUP_ bit flags.
VFS: repack DENTRY_ flags.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs iomap updates from Christian Brauner:
- Allow the filesystem to submit the writeback bios.
- Allow the filsystem to track completions on a per-bio bases
instead of the entire I/O.
- Change writeback_ops so that ->submit_bio can be done by the
filesystem.
- A new ANON_WRITE flag for writes that don't have a block number
assigned to them at the iomap level leaving the filesystem to do
that work in the submission handler.
- Incremental iterator advance
The folio_batch support for zero range where the filesystem provides
a batch of folios to process that might not be logically continguous
requires more flexibility than the current offset based iteration
currently offers.
Update all iomap operations to advance the iterator within the
operation and thus remove the need to advance from the core iomap
iterator.
- Make buffered writes work with RWF_DONTCACHE
If RWF_DONTCACHE is set for a write, mark the folios being written as
uncached. On writeback completion the pages will be dropped.
- Introduce infrastructure for large atomic writes
This will eventually be used by xfs and ext4.
* tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
iomap: rework IOMAP atomic flags
iomap: comment on atomic write checks in iomap_dio_bio_iter()
iomap: inline iomap_dio_bio_opflags()
iomap: fix inline data on buffered read
iomap: Lift blocksize restriction on atomic writes
iomap: Support SW-based atomic writes
iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
xfs: flag as supporting FOP_DONTCACHE
iomap: make buffered writes work with RWF_DONTCACHE
iomap: introduce a full map advance helper
iomap: rename iomap_iter processed field to status
iomap: remove unnecessary advance from iomap_iter()
dax: advance the iomap_iter on pte and pmd faults
dax: advance the iomap_iter on dedupe range
dax: advance the iomap_iter on unshare range
dax: advance the iomap_iter on zero range
dax: push advance down into dax_iomap_iter() for read and write
dax: advance the iomap_iter in the read/write path
iomap: convert misc simple ops to incremental advance
iomap: advance the iter on direct I/O
...
|