summaryrefslogtreecommitdiff
path: root/Documentation/admin-guide
AgeCommit message (Collapse)Author
2024-11-25Merge tag 'mm-nonmm-stable-2024-11-24-02-05' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - The series "resource: A couple of cleanups" from Andy Shevchenko performs some cleanups in the resource management code - The series "Improve the copy of task comm" from Yafang Shao addresses possible race-induced overflows in the management of task_struct.comm[] - The series "Remove unnecessary header includes from {tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a small fix to the list_sort library code and to its selftest - The series "Enhance min heap API with non-inline functions and optimizations" also from Kuan-Wei Chiu optimizes and cleans up the min_heap library code - The series "nilfs2: Finish folio conversion" from Ryusuke Konishi finishes off nilfs2's folioification - The series "add detect count for hung tasks" from Lance Yang adds more userspace visibility into the hung-task detector's activity - Apart from that, singelton patches in many places - please see the individual changelogs for details * tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits) gdb: lx-symbols: do not error out on monolithic build kernel/reboot: replace sprintf() with sysfs_emit() lib: util_macros_kunit: add kunit test for util_macros.h util_macros.h: fix/rework find_closest() macros Improve consistency of '#error' directive messages ocfs2: fix uninitialized value in ocfs2_file_read_iter() hung_task: add docs for hung_task_detect_count hung_task: add detect count for hung tasks dma-buf: use atomic64_inc_return() in dma_buf_getfile() fs/proc/kcore.c: fix coccinelle reported ERROR instances resource: avoid unnecessary resource tree walking in __region_intersects() ocfs2: remove unused errmsg function and table ocfs2: cluster: fix a typo lib/scatterlist: use sg_phys() helper checkpatch: always parse orig_commit in fixes tag nilfs2: convert metadata aops from writepage to writepages nilfs2: convert nilfs_recovery_copy_block() to take a folio nilfs2: convert nilfs_page_count_clean_buffers() to take a folio nilfs2: remove nilfs_writepage nilfs2: convert checkpoint file to be folio-based ...
2024-11-23Merge tag 'mm-stable-2024-11-18-19-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "zram: optimal post-processing target selection" from Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. * tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits) cma: enforce non-zero pageblock_order during cma_init_reserved_mem() mm/kfence: add a new kunit test test_use_after_free_read_nofault() zram: fix NULL pointer in comp_algorithm_show() memcg/hugetlb: add hugeTLB counters to memcg vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount zram: ZRAM_DEF_COMP should depend on ZRAM MAINTAINERS/MEMORY MANAGEMENT: add document files for mm Docs/mm/damon: recommend academic papers to read and/or cite mm: define general function pXd_init() kmemleak: iommu/iova: fix transient kmemleak false positive mm/list_lru: simplify the list_lru walk callback function mm/list_lru: split the lock to per-cgroup scope mm/list_lru: simplify reparenting and initial allocation mm/list_lru: code clean up for reparenting mm/list_lru: don't export list_lru_add mm/list_lru: don't pass unnecessary key parameters kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols ...
2024-11-22docs: Add debugging guide for the media subsystemSebastian Fricke
Provide a guide for developers on how to debug code with a focus on the media subsystem. This document aims to provide a rough overview over the possibilities and a rational to help choosing the right tool for the given circumstances. Signed-off-by: Sebastian Fricke <sebastian.fricke@collabora.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20241028-media_docs_improve_v3-v3-2-edf5c5b3746f@collabora.com
2024-11-21Merge tag 'net-next-6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "The most significant set of changes is the per netns RTNL. The new behavior is disabled by default, regression risk should be contained. Notably the new config knob PTP_1588_CLOCK_VMCLOCK will inherit its default value from PTP_1588_CLOCK_KVM, as the first is intended to be a more reliable replacement for the latter. Core: - Started a very large, in-progress, effort to make the RTNL lock scope per network-namespace, thus reducing the lock contention significantly in the containerized use-case, comprising: - RCU-ified some relevant slices of the FIB control path - introduce basic per netns locking helpers - namespacified the IPv4 address hash table - remove rtnl_register{,_module}() in favour of rtnl_register_many() - refactor rtnl_{new,del,set}link() moving as much validation as possible out of RTNL lock - convert all phonet doit() and dumpit() handlers to RCU - convert IPv4 addresses manipulation to per-netns RTNL - convert virtual interface creation to per-netns RTNL the per-netns lock infrastructure is guarded by the CONFIG_DEBUG_NET_SMALL_RTNL knob, disabled by default ad interim. - Introduce NAPI suspension, to efficiently switching between busy polling (NAPI processing suspended) and normal processing. - Migrate the IPv4 routing input, output and control path from direct ToS usage to DSCP macros. This is a work in progress to make ECN handling consistent and reliable. - Add drop reasons support to the IPv4 rotue input path, allowing better introspection in case of packets drop. - Make FIB seqnum lockless, dropping RTNL protection for read access. - Make inet{,v6} addresses hashing less predicable. - Allow providing timestamp OPT_ID via cmsg, to correlate TX packets and timestamps Things we sprinkled into general kernel code: - Add small file operations for debugfs, to reduce the struct ops size. - Refactoring and optimization for the implementation of page_frag API, This is a preparatory work to consolidate the page_frag implementation. Netfilter: - Optimize set element transactions to reduce memory consumption - Extended netlink error reporting for attribute parser failure. - Make legacy xtables configs user selectable, giving users the option to configure iptables without enabling any other config. - Address a lot of false-positive RCU issues, pointed by recent CI improvements. BPF: - Put xsk sockets on a struct diet and add various cleanups. Overall, this helps to bump performance by 12% for some workloads. - Extend BPF selftests to increase coverage of XDP features in combination with BPF cpumap. - Optimize and homogenize bpf_csum_diff helper for all archs and also add a batch of new BPF selftests for it. - Extend netkit with an option to delegate skb->{mark,priority} scrubbing to its BPF program. - Make the bpf_get_netns_cookie() helper available also to tc(x) BPF programs. Protocols: - Introduces 4-tuple hash for connected udp sockets, speeding-up significantly connected sockets lookup. - Add a fastpath for some TCP timers that usually expires after close, the socket lock contention. - Add inbound and outbound xfrm state caches to speed up state lookups. - Avoid sending MPTCP advertisements on stale subflows, reducing risks on loosing them. - Make neighbours table flushing more scalable, maintaining per device neigh lists. Driver API: - Introduce a unified interface to configure transmission H/W shaping, and expose it to user-space via generic-netlink. - Add support for per-NAPI config via netlink. This makes napi configuration persistent across queues removal and re-creation. Requires driver updates, currently supported drivers are: nVidia/Mellanox mlx4 and mlx5, Broadcom brcm and Intel ice. - Add ethtool support for writing SFP / PHY firmware blocks. - Track RSS context allocation from ethtool core. - Implement support for mirroring to DSA CPU port, via TC mirror offload. - Consolidate FDB updates notification, to avoid duplicates on device-specific entries. - Expose DPLL clock quality level to the user-space. - Support master-slave PHY config via device tree. Tests and tooling: - forwarding: introduce deferred commands, to simplify the cleanup phase Drivers: - Updated several drivers - Amazon vNic, Google vNic, Microsoft vNic, Intel e1000e and Broadcom Tigon3 - to use netdev-genl to link the IRQs and queues to NAPI IDs, allowing busy polling and better introspection. - Ethernet high-speed NICs: - nVidia/Mellanox: - mlx5: - a large refactor to implement support for cross E-Switch scheduling - refactor H/W conter management to let it scale better - H/W GRO cleanups - Intel (100G, ice):: - add support for ethtool reset - implement support for per TX queue H/W shaping - AMD/Solarflare: - implement per device queue stats support - Broadcom (bnxt): - improve wildcard l4proto on IPv4/IPv6 ntuple rules - Marvell Octeon: - Add representor support for each Resource Virtualization Unit (RVU) device. - Hisilicon: - add support for the BMC Gigabit Ethernet - IBM (EMAC): - driver cleanup and modernization - Cisco (VIC): - raise the queues number limit to 256 - Ethernet virtual: - Google vNIC: - implement page pool support - macsec: - inherit lower device's features and TSO limits when offloading - virtio_net: - enable premapped mode by default - support for XDP socket(AF_XDP) zerocopy TX - wireguard: - set the TSO max size to be GSO_MAX_SIZE, to aggregate larger packets. - Ethernet NICs embedded and virtual: - Broadcom ASP: - enable software timestamping - Freescale: - add enetc4 PF driver - MediaTek: Airoha SoC: - implement BQL support - RealTek r8169: - enable TSO by default on r8168/r8125 - implement extended ethtool stats - Renesas AVB: - enable TX checksum offload - Synopsys (stmmac): - support header splitting for vlan tagged packets - move common code for DWMAC4 and DWXGMAC into a separate FPE module. - add dwmac driver support for T-HEAD TH1520 SoC - Synopsys (xpcs): - driver refactor and cleanup - TI: - icssg_prueth: add VLAN offload support - Xilinx emaclite: - add clock support - Ethernet switches: - Microchip: - implement support for the lan969x Ethernet switch family - add LAN9646 switch support to KSZ DSA driver - Ethernet PHYs: - Marvel: 88q2x: enable auto negotiation - Microchip: add support for LAN865X Rev B1 and LAN867X Rev C1/C2 - PTP: - Add support for the Amazon virtual clock device - Add PtP driver for s390 clocks - WiFi: - mac80211 - EHT 1024 aggregation size for transmissions - new operation to indicate that a new interface is to be added - support radio separation of multi-band devices - move wireless extension spy implementation to libiw - Broadcom: - brcmfmac: optional LPO clock support - Microchip: - add support for Atmel WILC3000 - Qualcomm (ath12k): - firmware coredump collection support - add debugfs support for a multitude of statistics - Qualcomm (ath5k): - Arcadyan ARV45XX AR2417 & Gigaset SX76[23] AR241[34]A support - Realtek: - rtw88: 8821au and 8812au USB adapters support - rtw89: add thermal protection - rtw89: fine tune BT-coexsitence to improve user experience - rtw89: firmware secure boot for WiFi 6 chip - Bluetooth - add Qualcomm WCN785x support for ids Foxconn 0xe0fc/0xe0f3 and 0x13d3:0x3623 - add Realtek RTL8852BE support for id Foxconn 0xe123 - add MediaTek MT7920 support for wireless module ids - btintel_pcie: add handshake between driver and firmware - btintel_pcie: add recovery mechanism - btnxpuart: add GPIO support to power save feature" * tag 'net-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1475 commits) mm: page_frag: fix a compile error when kernel is not compiled Documentation: tipc: fix formatting issue in tipc.rst selftests: nic_performance: Add selftest for performance of NIC driver selftests: nic_link_layer: Add selftest case for speed and duplex states selftests: nic_link_layer: Add link layer selftest for NIC driver bnxt_en: Add FW trace coredump segments to the coredump bnxt_en: Add a new ethtool -W dump flag bnxt_en: Add 2 parameters to bnxt_fill_coredump_seg_hdr() bnxt_en: Add functions to copy host context memory bnxt_en: Do not free FW log context memory bnxt_en: Manage the FW trace context memory bnxt_en: Allocate backing store memory for FW trace logs bnxt_en: Add a 'force' parameter to bnxt_free_ctx_mem() bnxt_en: Refactor bnxt_free_ctx_mem() bnxt_en: Add mem_valid bit to struct bnxt_ctx_mem_type bnxt_en: Update firmware interface spec to 1.10.3.85 selftests/bpf: Add some tests with sockmap SK_PASS bpf: fix recursive lock when verdict program return SK_PASS wireguard: device: support big tcp GSO wireguard: selftests: load nf_conntrack if not present ...
2024-11-20Merge tag 'media/v6.13-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media updates from Mauro Carvalho Chehab: - removal of the old omap4iss media driver - mantis: remove orphan mantis_core.h - add support for Raspberypi CFE - uvc driver got a co-maintainer - main media tree moved to git://linuxtv.org/media.git - lots of driver cleanups, updates and fixes * tag 'media/v6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (233 commits) docs: media: update location of the media patches MAINTAINERS: update location of media main tree media: MAINTAINERS: Add Hans de Goede as USB VIDEO CLASS co-maintainer media: platform: samsung: s5p-jpeg: Remove deadcode media: qcom: camss: Add MSM8953 resources media: dt-bindings: Add qcom,msm8953-camss media: qcom: camss: implement pm domain ops for VFE v4.1 media: platform: exynos4-is: Fix an OF node reference leak in fimc_md_is_isp_available media: adv7180: Also check for "adi,force-bt656-4" media: dt-bindings: adv7180: Document 'adi,force-bt656-4' media: mgb4: Fix inconsistent input/output alignment in loopback mode media: replace obsolete hans.verkuil@cisco.com alias Documentation: media: improve V4L2_CID_MIN_BUFFERS_FOR_*, doc media: vicodec: add V4L2_CID_MIN_BUFFERS_FOR_* controls media: atomisp: Add check for rgby_data memory allocation failure media: atomisp: remove redundant re-checking of err media: atomisp: Fix spelling errors reported by codespell media: atomisp: Remove License information boilerplate media: atomisp: Fix typos in comment media: atomisp: hmm_bo: Fix spelling errors in hmm_bo.h ...
2024-11-20Merge tag 'docs-6.13' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "Another moderately busy cycle in docsland: - Work on Chinese translations has picked up again. Happily, they are maintaining the existing translations and not just adding new ones. - Some maintenance of the Japanese and Italian translations as well. - The removal of the venerable "dontdiff" file. It has long outlived its usefulness and contained entries ("parse.*") that would actively mask actual source change. - The addition of enforcement information to the code-of-conduct documentation. Along with some build-system fixes and a lot of typo and language fixes" * tag 'docs-6.13' of git://git.lwn.net/linux: (52 commits) Documentation/CoC: spell out enforcement for unacceptable behaviors docs: fix typos and whitespace in Documentation/process/backporting.rst docs/zh_CN: fix one sentence in llvm.rst docs: bug-bisect: add a note about bisecting -next docs/zh_CN: add the translation of kbuild/llvm.rst Documentation: Fix incorrect paths/magic in magic numbers rst Documentation/maintainer-tip: Fix typos Documentation: Improve crash_kexec_post_notifiers description Docs/zh_CN: Translate physical_memory.rst to Simplified Chinese Documentation: admin: reorganize kernel-parameters intro docs/zh_CN: update the translation of process/programming-language.rst docs/zh_CN: update the translation of mm/page_owner.rst docs/zh_CN: update the translation of mm/page_table_check.rst docs/zh_CN: update the translation of mm/overcommit-accounting.rst docs/zh_CN: update the translation of mm/admon/faq.rst docs/zh_CN: update the translation of mm/active_mm.rst docs/zh_CN: update the translation of mm/hmm.rst docs: remove Documentation/dontdiff docs/zh_CN: Add a entry in Chinese glossary Docs/zh_CN: Fix the pfn calculation error in page_tables.rst ...
2024-11-19Merge tag 'rcu.release.v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Frederic Weisbecker: "SRCU: - Introduction of the new SRCU-lite flavour with a new pair of srcu_read_[un]lock_lite() APIs. In practice the read side using this flavour becomes lighter by removing a full memory barrier on LOCK and a full memory barrier on UNLOCK. This comes at the expense of a higher latency write side with two (in the best case of a snaphot of unused read-sides) or more RCU grace periods on the update side which now assumes by itself the whole full ordering guarantee against the LOCK/UNLOCK counters on both indexes, along with the accesses performed inside. Uretprobes is a known potential user. Note this doesn't replace the default normal flavour of SRCU which still behaves the same as usual. - Add testing of SRCU-lite through rcutorture and rcuscale - Various cleanups on the way. Fixes: - Allow short-circuiting RCU-TASKS-RUDE grace periods on architectures that have sane noinstr boundaries forbidding tracing on low-level idle and kernel entry code. RCU-TASKS is enough on such configurations because it involves an RCU grace period that waits for all idle tasks to either schedule out voluntarily or enter into RCU unwatched noinstr code. - Allow and test start_poll_synchronize_rcu() with IRQs disabled. - Mention rcuog kthreads in relevant documentation and Kconfig help - Various fixes and consolidations rcutorture: - Add --no-affinity on tools to leave the affinity setting of guests up to the user. - Add guest_os_delay parameter to rcuscale for better warm-up control. - Fix and improve some rcuscale error handling. - Various cleanups and fixes stall: - Remove dead code - Stop dumping tasks if a stalled grace period eventually ended midway as that only produces confusing output. - Optimize detection of stalling CPUs and avoid useless node locking otherwise. NOCB: - Fix rcu_barrier() hang due to a race against callbacks deoffloading. This is not yet used, except by rcutorture, and waits for its promised cpusets interface. - Remove leftover function declaration" * tag 'rcu.release.v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (42 commits) rcuscale: Remove redundant WARN_ON_ONCE() splat rcuscale: Do a proper cleanup if kfree_scale_init() fails srcu: Unconditionally record srcu_read_lock_lite() in ->srcu_reader_flavor srcu: Check for srcu_read_lock_lite() across all CPUs srcu: Remove smp_mb() from srcu_read_unlock_lite() rcutorture: Avoid printing cpu=-1 for no-fault RCU boost failure rcuscale: Add guest_os_delay module parameter refscale: Correct affinity check torture: Add --no-affinity parameter to kvm.sh rcu/nocb: Fix missed RCU barrier on deoffloading rcu/kvfree: Fix data-race in __mod_timer / kvfree_call_rcu rcu/srcutiny: don't return before reenabling preemption rcu-tasks: Remove open-coded one-byte cmpxchg() emulation doc: Remove kernel-parameters.txt entry for rcutorture.read_exit rcutorture: Test start-poll primitives with interrupts disabled rcu: Permit start_poll_synchronize_rcu*() with interrupts disabled rcu: Allow short-circuiting of synchronize_rcu_tasks_rude() doc: Add rcuog kthreads to kernel-per-CPU-kthreads.rst rcu: Add rcuog kthreads to RCU_NOCB_CPU help text rcu: Use the BITS_PER_LONG macro ...
2024-11-18Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - Support for running Linux in a protected VM under the Arm Confidential Compute Architecture (CCA) - Guarded Control Stack user-space support. Current patches follow the x86 ABI of implicitly creating a shadow stack on clone(). Subsequent patches (already on the list) will add support for clone3() allowing finer-grained control of the shadow stack size and placement from libc - AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are getting close with the upcoming dpISA support) - Other arch features: - In-kernel use of the memcpy instructions, FEAT_MOPS (previously only exposed to user; uaccess support not merged yet) - MTE: hugetlbfs support and the corresponding kselftests - Optimise CRC32 using the PMULL instructions - Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG - Optimise the kernel TLB flushing to use the range operations - POE/pkey (permission overlays): further cleanups after bringing the signal handler in line with the x86 behaviour for 6.12 - arm64 perf updates: - Support for the NXP i.MX91 PMU in the existing IMX driver - Support for Ampere SoCs in the Designware PCIe PMU driver - Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC - Support for Samsung's 'Mongoose' CPU PMU - Support for PMUv3.9 finer-grained userspace counter access control - Switch back to platform_driver::remove() now that it returns 'void' - Add some missing events for the CXL PMU driver - Miscellaneous arm64 fixes/cleanups: - Page table accessors cleanup: type updates, drop unused macros, reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity check addresses before runtime P4D/PUD folding - Command line override for ID_AA64MMFR0_EL1.ECV (advertising the FEAT_ECV for the generic timers) allowing Linux to boot with firmware deployments that don't set SCTLR_EL3.ECVEn - ACPI/arm64: tighten the check for the array of platform timer structures and adjust the error handling procedure in gtdt_parse_timer_block() - Optimise the cache flush for the uprobes xol slot (skip if no change) and other uprobes/kprobes cleanups - Fix the context switching of tpidrro_el0 when kpti is enabled - Dynamic shadow call stack fixes - Sysreg updates - Various arm64 kselftest improvements * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (168 commits) arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled kselftest/arm64: Try harder to generate different keys during PAC tests kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all() arm64/ptrace: Clarify documentation of VL configuration via ptrace kselftest/arm64: Corrupt P0 in the irritator when testing SSVE acpi/arm64: remove unnecessary cast arm64/mm: Change protval as 'pteval_t' in map_range() kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c kselftest/arm64: Add FPMR coverage to fp-ptrace kselftest/arm64: Expand the set of ZA writes fp-ptrace does kselftets/arm64: Use flag bits for features in fp-ptrace assembler code kselftest/arm64: Enable build of PAC tests with LLVM=1 kselftest/arm64: Check that SVCR is 0 in signal handlers selftests/mm: Fix unused function warning for aarch64_write_signal_pkey() kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests kselftest/arm64: Fix build with stricter assemblers arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux() arm64/scs: Deal with 64-bit relative offsets in FDE frames ...
2024-11-18Merge tag 'vfs-6.13.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Fixup and improve NLM and kNFSD file lock callbacks Last year both GFS2 and OCFS2 had some work done to make their locking more robust when exported over NFS. Unfortunately, part of that work caused both NLM (for NFS v3 exports) and kNFSD (for NFSv4.1+ exports) to no longer send lock notifications to clients This in itself is not a huge problem because most NFS clients will still poll the server in order to acquire a conflicted lock It's important for NLM and kNFSD that they do not block their kernel threads inside filesystem's file_lock implementations because that can produce deadlocks. We used to make sure of this by only trusting that posix_lock_file() can correctly handle blocking lock calls asynchronously, so the lock managers would only setup their file_lock requests for async callbacks if the filesystem did not define its own lock() file operation However, when GFS2 and OCFS2 grew the capability to correctly handle blocking lock requests asynchronously, they started signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check for also trusting posix_lock_file() was inadvertently dropped, so now most filesystems no longer produce lock notifications when exported over NFS Fix this by using an fop_flag which greatly simplifies the problem and grooms the way for future uses by both filesystems and lock managers alike - Add a sysctl to delete the dentry when a file is removed instead of making it a negative dentry Commit 681ce8623567 ("vfs: Delete the associated dentry when deleting a file") introduced an unconditional deletion of the associated dentry when a file is removed. However, this led to performance regressions in specific benchmarks, such as ilebench.sum_operations/s, prompting a revert in commit 4a4be1ad3a6e ("Revert "vfs: Delete the associated dentry when deleting a file""). This reintroduces the concept conditionally through a sysctl - Expand the statmount() system call: * Report the filesystem subtype in a new fs_subtype field to e.g., report fuse filesystem subtypes * Report the superblock source in a new sb_source field * Add a new way to return filesystem specific mount options in an option array that returns filesystem specific mount options separated by zero bytes and unescaped. This allows caller's to retrieve filesystem specific mount options and immediately pass them to e.g., fsconfig() without having to unescape or split them * Report security (LSM) specific mount options in a separate security option array. We don't lump them together with filesystem specific mount options as security mount options are generic and most users aren't interested in them The format is the same as for the filesystem specific mount option array - Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command - Optimize acl_permission_check() to avoid costly {g,u}id ownership checks if possible - Use smp_mb__after_spinlock() to avoid full smp_mb() in evict() - Add synchronous wakeup support for ep_poll_callback. Currently, epoll only uses wake_up() to wake up task. But sometimes there are epoll users which want to use the synchronous wakeup flag to give a hint to the scheduler, e.g., the Android binder driver. So add a wake_up_sync() define, and use wake_up_sync() when sync is true in ep_poll_callback() Fixes: - Fix kernel documentation for inode_insert5() and iget5_locked() - Annotate racy epoll check on file->f_ep - Make F_DUPFD_QUERY associative - Avoid filename buffer overrun in initramfs - Don't let statmount() return empty strings - Add a cond_resched() to dump_user_range() to avoid hogging the CPU - Don't query the device logical blocksize multiple times for hfsplus - Make filemap_read() check that the offset is positive or zero Cleanups: - Various typo fixes - Cleanup wbc_attach_fdatawrite_inode() - Add __releases annotation to wbc_attach_and_unlock_inode() - Add hugetlbfs tracepoints - Fix various vfs kernel doc parameters - Remove obsolete TODO comment from io_cancel() - Convert wbc_account_cgroup_owner() to take a folio - Fix comments for BANDWITH_INTERVAL and wb_domain_writeout_add() - Reorder struct posix_acl to save 8 bytes - Annotate struct posix_acl with __counted_by() - Replace one-element array with flexible array member in freevxfs - Use idiomatic atomic64_inc_return() in alloc_mnt_ns()" * tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) statmount: retrieve security mount options vfs: make evict() use smp_mb__after_spinlock instead of smp_mb statmount: add flag to retrieve unescaped options fs: add the ability for statmount() to report the sb_source writeback: wbc_attach_fdatawrite_inode out of line writeback: add a __releases annoation to wbc_attach_and_unlock_inode fs: add the ability for statmount() to report the fs_subtype fs: don't let statmount return empty strings fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel() hfsplus: don't query the device logical block size multiple times freevxfs: Replace one-element array with flexible array member fs: optimize acl_permission_check() initramfs: avoid filename buffer overrun fs/writeback: convert wbc_account_cgroup_owner to take a folio acl: Annotate struct posix_acl with __counted_by() acl: Realign struct posix_acl to save 8 bytes epoll: Add synchronous wakeup support for ep_poll_callback coredump: add cond_resched() to dump_user_range mm/page-writeback.c: Fix comment of wb_domain_writeout_add() mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL ...
2024-11-18docs: media: update location of the media patchesMauro Carvalho Chehab
Due to recent changes on the way we're maintaining media, the location of the main tree was updated. Change docs accordingly. Cc: stable@vger.kernel.org Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Reviewed-by: Hans Verkuil <hverkuil@xs4all.nl>
2024-11-15Merge branches 'rcu/fixes', 'rcu/nocb', 'rcu/torture', 'rcu/stall' and ↵Frederic Weisbecker
'rcu/srcu' into rcu/dev
2024-11-14memcg/hugetlb: add hugeTLB counters to memcgJoshua Hahn
This patch introduces a new counter to memory.stat that tracks hugeTLB usage, only if hugeTLB accounting is done to memory.current. This feature is enabled the same way hugeTLB accounting is enabled, via the memory_hugetlb_accounting mount flag for cgroupsv2. 1. Why is this patch necessary? Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds hugeTLB usage to memory.current. However, the metric is not reported in memory.stat. Given that users often interpret memory.stat as a breakdown of the value reported in memory.current, the disparity between the two reports can be confusing. This patch solves this problem by including the metric in memory.stat as well, but only if it is also reported in memory.current (it would also be confusing if the value was reported in memory.stat, but not in memory.current) Aside from the consistency between the two files, we also see benefits in observability. Userspace might be interested in the hugeTLB footprint of cgroups for many reasons. For instance, system admins might want to verify that hugeTLB usage is distributed as expected across tasks: i.e. memory-intensive tasks are using more hugeTLB pages than tasks that don't consume a lot of memory, or are seen to fault frequently. Note that this is separate from wanting to inspect the distribution for limiting purposes (in which case, hugeTLB controller makes more sense). 2. We already have a hugeTLB controller. Why not use that? It is true that hugeTLB tracks the exact value that we want. In fact, by enabling the hugeTLB controller, we get all of the observability benefits that I mentioned above, and users can check the total hugeTLB usage, verify if it is distributed as expected, etc. With this said, there are 2 problems: (a) They are still not reported in memory.stat, which means the disparity between the memcg reports are still there. (b) We cannot reasonably expect users to enable the hugeTLB controller just for the sake of hugeTLB usage reporting, especially since they don't have any use for hugeTLB usage enforcing [2]. 3. Implementation Details: In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio regardless of whether memcg accounts hugetlb. mem_cgroup_commit_charge which is called from alloc_hugetlb_folio will set memcg for the folio only if the CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is used, so lruvec_stat_mod_folio accounts per-memcg hugetlb counters only if the feature is enabled. Regardless of whether memcg accounts for hugetlb, the newly added global counter is updated and shown in /proc/vmstat. The global counter is added because vmstats is the preferred framework for cgroup stats. It makes stat items consistent between global and cgroups. It also provides a per-node breakdown, which is useful. Because it does not use cgroup-specific hooks, we also keep generic MM code separate from memcg code. [1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/ [2] Of course, we can't make a new patch for every feature that can be duplicated. However, since the existing solution of enabling the hugeTLB controller is an imperfect solution that still leaves a discrepancy between memory.stat and memory.curent, I think that it is reasonable to isolate the feature in this case. Link: https://lkml.kernel.org/r/20241101204402.1885383-1-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Suggested-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.12-rc8). Conflicts: tools/testing/selftests/net/.gitignore 252e01e68241 ("selftests: net: add netlink-dumps to .gitignore") be43a6b23829 ("selftests: ncdevmem: Move ncdevmem under drivers/net/hw") https://lore.kernel.org/all/20241113122359.1b95180a@canb.auug.org.au/ drivers/net/phy/phylink.c 671154f174e0 ("net: phylink: ensure PHY momentary link-fails are handled") 7530ea26c810 ("net: phylink: remove "using_mac_select_pcs"") Adjacent changes: drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c 5b366eae7193 ("stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines") e96321fad3ad ("net: ethernet: Switch back to struct platform_driver::remove()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14Merge branches 'for-next/gcs', 'for-next/probes', 'for-next/asm-offsets', ↵Catalin Marinas
'for-next/tlb', 'for-next/misc', 'for-next/mte', 'for-next/sysreg', 'for-next/stacktrace', 'for-next/hwcap3', 'for-next/kselftest', 'for-next/crc32', 'for-next/guest-cca', 'for-next/haft' and 'for-next/scs', remote-tracking branch 'arm64/for-next/perf' into for-next/core * arm64/for-next/perf: perf: Switch back to struct platform_driver::remove() perf: arm_pmuv3: Add support for Samsung Mongoose PMU dt-bindings: arm: pmu: Add Samsung Mongoose core compatible perf/dwc_pcie: Fix typos in event names perf/dwc_pcie: Add support for Ampere SoCs ARM: pmuv3: Add missing write_pmuacr() perf/marvell: Marvell PEM performance monitor support perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control perf/dwc_pcie: Convert the events with mixed case to lowercase perf/cxlpmu: Support missing events in 3.1 spec perf: imx_perf: add support for i.MX91 platform dt-bindings: perf: fsl-imx-ddr: Add i.MX91 compatible drivers perf: remove unused field pmu_node * for-next/gcs: (42 commits) : arm64 Guarded Control Stack user-space support kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c arm64/gcs: Fix outdated ptrace documentation kselftest/arm64: Ensure stable names for GCS stress test results kselftest/arm64: Validate that GCS push and write permissions work kselftest/arm64: Enable GCS for the FP stress tests kselftest/arm64: Add a GCS stress test kselftest/arm64: Add GCS signal tests kselftest/arm64: Add test coverage for GCS mode locking kselftest/arm64: Add a GCS test program built with the system libc kselftest/arm64: Add very basic GCS test program kselftest/arm64: Always run signals tests with GCS enabled kselftest/arm64: Allow signals tests to specify an expected si_code kselftest/arm64: Add framework support for GCS to signal handling tests kselftest/arm64: Add GCS as a detected feature in the signal tests kselftest/arm64: Verify the GCS hwcap arm64: Add Kconfig for Guarded Control Stack (GCS) arm64/ptrace: Expose GCS via ptrace and core files arm64/signal: Expose GCS state in signal frames arm64/signal: Set up and restore the GCS context for signal handlers arm64/mm: Implement map_shadow_stack() ... * for-next/probes: : Various arm64 uprobes/kprobes cleanups arm64: insn: Simulate nop instruction for better uprobe performance arm64: probes: Remove probe_opcode_t arm64: probes: Cleanup kprobes endianness conversions arm64: probes: Move kprobes-specific fields arm64: probes: Fix uprobes for big-endian kernels arm64: probes: Fix simulate_ldr*_literal() arm64: probes: Remove broken LDR (literal) uprobe support * for-next/asm-offsets: : arm64 asm-offsets.c cleanup (remove unused offsets) arm64: asm-offsets: remove PREEMPT_DISABLE_OFFSET arm64: asm-offsets: remove DMA_{TO,FROM}_DEVICE arm64: asm-offsets: remove VM_EXEC and PAGE_SZ arm64: asm-offsets: remove MM_CONTEXT_ID arm64: asm-offsets: remove COMPAT_{RT_,SIGFRAME_REGS_OFFSET arm64: asm-offsets: remove VMA_VM_* arm64: asm-offsets: remove TSK_ACTIVE_MM * for-next/tlb: : TLB flushing optimisations arm64: optimize flush tlb kernel range arm64: tlbflush: add __flush_tlb_range_limit_excess() * for-next/misc: : Miscellaneous patches arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled arm64/ptrace: Clarify documentation of VL configuration via ptrace acpi/arm64: remove unnecessary cast arm64/mm: Change protval as 'pteval_t' in map_range() arm64: uprobes: Optimize cache flushes for xol slot acpi/arm64: Adjust error handling procedure in gtdt_parse_timer_block() arm64: fix .data.rel.ro size assertion when CONFIG_LTO_CLANG arm64/ptdump: Test both PTE_TABLE_BIT and PTE_VALID for block mappings arm64/mm: Sanity check PTE address before runtime P4D/PUD folding arm64/mm: Drop setting PTE_TYPE_PAGE in pte_mkcont() ACPI: GTDT: Tighten the check for the array of platform timer structures arm64/fpsimd: Fix a typo arm64: Expose ID_AA64ISAR1_EL1.XS to sanitised feature consumers arm64: Return early when break handler is found on linked-list arm64/mm: Re-organize arch_make_huge_pte() arm64/mm: Drop _PROT_SECT_DEFAULT arm64: Add command-line override for ID_AA64MMFR0_EL1.ECV arm64: head: Drop SWAPPER_TABLE_SHIFT arm64: cpufeature: add POE to cpucap_is_possible() arm64/mm: Change pgattr_change_is_safe() arguments as pteval_t * for-next/mte: : Various MTE improvements selftests: arm64: add hugetlb mte tests hugetlb: arm64: add mte support * for-next/sysreg: : arm64 sysreg updates arm64/sysreg: Update ID_AA64MMFR1_EL1 to DDI0601 2024-09 * for-next/stacktrace: : arm64 stacktrace improvements arm64: preserve pt_regs::stackframe during exec*() arm64: stacktrace: unwind exception boundaries arm64: stacktrace: split unwind_consume_stack() arm64: stacktrace: report recovered PCs arm64: stacktrace: report source of unwind data arm64: stacktrace: move dump_backtrace() to kunwind_stack_walk() arm64: use a common struct frame_record arm64: pt_regs: swap 'unused' and 'pmr' fields arm64: pt_regs: rename "pmr_save" -> "pmr" arm64: pt_regs: remove stale big-endian layout arm64: pt_regs: assert pt_regs is a multiple of 16 bytes * for-next/hwcap3: : Add AT_HWCAP3 support for arm64 (also wire up AT_HWCAP4) arm64: Support AT_HWCAP3 binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4 * for-next/kselftest: (30 commits) : arm64 kselftest fixes/cleanups kselftest/arm64: Try harder to generate different keys during PAC tests kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all() kselftest/arm64: Corrupt P0 in the irritator when testing SSVE kselftest/arm64: Add FPMR coverage to fp-ptrace kselftest/arm64: Expand the set of ZA writes fp-ptrace does kselftets/arm64: Use flag bits for features in fp-ptrace assembler code kselftest/arm64: Enable build of PAC tests with LLVM=1 kselftest/arm64: Check that SVCR is 0 in signal handlers kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests kselftest/arm64: Fix build with stricter assemblers kselftest/arm64: Test signal handler state modification in fp-stress kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test kselftest/arm64: Implement irritators for ZA and ZT kselftest/arm64: Remove unused ADRs from irritator handlers kselftest/arm64: Correct misleading comments on fp-stress irritators kselftest/arm64: Poll less often while waiting for fp-stress children kselftest/arm64: Increase frequency of signal delivery in fp-stress kselftest/arm64: Fix encoding for SVE B16B16 test ... * for-next/crc32: : Optimise CRC32 using PMULL instructions arm64/crc32: Implement 4-way interleave using PMULL arm64/crc32: Reorganize bit/byte ordering macros arm64/lib: Handle CRC-32 alternative in C code * for-next/guest-cca: : Support for running Linux as a guest in Arm CCA arm64: Document Arm Confidential Compute virt: arm-cca-guest: TSM_REPORT support for realms arm64: Enable memory encrypt for Realms arm64: mm: Avoid TLBI when marking pages as valid arm64: Enforce bounce buffers for realm DMA efi: arm64: Map Device with Prot Shared arm64: rsi: Map unprotected MMIO as decrypted arm64: rsi: Add support for checking whether an MMIO is protected arm64: realm: Query IPA size from the RMM arm64: Detect if in a realm and set RIPAS RAM arm64: rsi: Add RSI definitions * for-next/haft: : Support for arm64 FEAT_HAFT arm64: pgtable: Warn unexpected pmdp_test_and_clear_young() arm64: Enable ARCH_HAS_NONLEAF_PMD_YOUNG arm64: Add support for FEAT_HAFT arm64: setup: name 'tcr2' register arm64/sysreg: Update ID_AA64MMFR1_EL1 register * for-next/scs: : Dynamic shadow call stack fixes arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux() arm64/scs: Deal with 64-bit relative offsets in FDE frames arm64/scs: Fix handling of DWARF augmentation data in CIE/FDE frames
2024-11-13Merge tag 'tpmdd-next-6.12-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd Pull tpm fixes from Jarkko Sakkinen: "Two bug fixes for TPM bus encryption (the remaining reported issues in the feature)" * tag 'tpmdd-next-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd: tpm: Disable TPM on tpm2_create_primary() failure tpm: Opt-in in disable PCR integrity protection
2024-11-13tpm: Opt-in in disable PCR integrity protectionJarkko Sakkinen
The initial HMAC session feature added TPM bus encryption and/or integrity protection to various in-kernel TPM operations. This can cause performance bottlenecks with IMA, as it heavily utilizes PCR extend operations. In order to mitigate this performance issue, introduce a kernel command-line parameter to the TPM driver for disabling the integrity protection for PCR extend operations (i.e. TPM2_PCR_Extend). Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Link: https://lore.kernel.org/linux-integrity/20241015193916.59964-1-zohar@linux.ibm.com/ Fixes: 6519fea6fd37 ("tpm: add hmac checks to tpm2_pcr_extend()") Tested-by: Mimi Zohar <zohar@linux.ibm.com> Co-developed-by: Roberto Sassu <roberto.sassu@huawei.com> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Co-developed-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2024-11-12doc: Remove kernel-parameters.txt entry for rcutorture.read_exitPaul E. McKenney
There is only ever the one read-exit task, and there is no module parameter named rcutorture.read_exit, so remove the bogus documentation. Instead, use rcutorture.read_exit_burst to enable/disable read-exit race testing. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <bpf@vger.kernel.org> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12doc: Add rcuog kthreads to kernel-per-CPU-kthreads.rstPaul E. McKenney
This commit adds the rcuog kthreads to the list of callback-offloading kthreads that can be affinitied away from worker CPUs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12docs: bug-bisect: add a note about bisecting -nextThorsten Leemhuis
Explicitly mention how to bisect -next, as nothing in the kernel tree currently explains that bisects between -next versions won't work well and it's better to bisect between mainline and -next. Co-developed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/ec19d5fc503ff7db3d4c4ff9e97fff24cc78f72a.1730808651.git.linux@leemhuis.info
2024-11-12rcutorture: Add srcu_read_lock_lite() support to rcutorture.reader_flavorPaul E. McKenney
This commit causes bit 0x4 of rcutorture.reader_flavor to select the new srcu_read_lock_lite() and srcu_read_unlock_lite() functions. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <bpf@vger.kernel.org> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12rcutorture: Add reader_flavor parameter for SRCU readersPaul E. McKenney
This commit adds an rcutorture.reader_flavor parameter whose bits correspond to reader flavors. For example, SRCU's readers are 0x1 for normal and 0x2 for NMI-safe. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <bpf@vger.kernel.org> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12net: Implement fault injection forcing skb reallocationBreno Leitao
Introduce a fault injection mechanism to force skb reallocation. The primary goal is to catch bugs related to pointer invalidation after potential skb reallocation. The fault injection mechanism aims to identify scenarios where callers retain pointers to various headers in the skb but fail to reload these pointers after calling a function that may reallocate the data. This type of bug can lead to memory corruption or crashes if the old, now-invalid pointers are used. By forcing reallocation through fault injection, we can stress-test code paths and ensure proper pointer management after potential skb reallocations. Add a hook for fault injection in the following functions: * pskb_trim_rcsum() * pskb_may_pull_reason() * pskb_trim() As the other fault injection mechanism, protect it under a debug Kconfig called CONFIG_FAIL_SKB_REALLOC. This patch was *heavily* inspired by Jakub's proposal from: https://lore.kernel.org/all/20240719174140.47a868e6@kernel.org/ CC: Akinobu Mita <akinobu.mita@gmail.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20241107-fault_v6-v6-1-1b82cb6ecacd@debian.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-11hung_task: add docs for hung_task_detect_countLance Yang
This commit introduces documentation for hung_task_detect_count in kernel.rst. Link: https://lkml.kernel.org/r/20241027120747.42833-3-ioworker0@gmail.com Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com> Signed-off-by: Lance Yang <ioworker0@gmail.com> Cc: Bang Li <libang.li@antgroup.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Cun <cunhuang@tencent.com> Cc: Joel Granados <j.granados@samsung.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Siddle <jsiddle@redhat.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Yongliang Gao <leonylgao@tencent.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: shmem: override mTHP shmem default with a kernel parameterMaíra Canal
Add the ``thp_shmem=`` kernel command line to allow specifying the default policy of each supported shmem hugepage size. The kernel parameter accepts the following format: thp_shmem=<size>[KMG],<size>[KMG]:<policy>;<size>[KMG]-<size>[KMG]:<policy> For example, thp_shmem=16K-64K:always;128K,512K:inherit;256K:advise;1M-2M:never;4M-8M:within_size Some GPUs may benefit from using huge pages. Since DRM GEM uses shmem to allocate anonymous pageable memory, it's essential to control the huge page allocation policy for the internal shmem mount. This control can be achieved through the ``transparent_hugepage_shmem=`` parameter. Beyond just setting the allocation policy, it's crucial to have granular control over the size of huge pages that can be allocated. The GPU may support only specific huge page sizes, and allocating pages larger/smaller than those sizes would be ineffective. Link: https://lkml.kernel.org/r/20241101165719.1074234-6-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: shmem: control THP support through the kernel command lineMaíra Canal
Patch series "mm: add more kernel parameters to control mTHP", v5. This series introduces four patches related to the kernel parameters controlling mTHP and a fifth patch replacing `strcpy()` for `strscpy()` in the file `mm/huge_memory.c`. The first patch is a straightforward documentation update, correcting the format of the kernel parameter ``thp_anon=``. The second, third, and fourth patches focus on controlling THP support for shmem via the kernel command line. The second patch introduces a parameter to control the global default huge page allocation policy for the internal shmem mount. The third patch moves a piece of code to a shared header to ease the implementation of the fourth patch. Finally, the fourth patch implements a parameter similar to ``thp_anon=``, but for shmem. The goal of these changes is to simplify the configuration of systems that rely on mTHP support for shmem. For instance, a platform with a GPU that benefits from huge pages may want to enable huge pages for shmem. Having these kernel parameters streamlines the configuration process and ensures consistency across setups. This patch (of 4): Add a new kernel command line to control the hugepage allocation policy for the internal shmem mount, ``transparent_hugepage_shmem``. The parameter is similar to ``transparent_hugepage`` and has the following format: transparent_hugepage_shmem=<policy> where ``<policy>`` is one of the seven valid policies available for shmem. Configuring the default huge page allocation policy for the internal shmem mount can be beneficial for DRM GPU drivers. Just as CPU architectures, GPUs can also take advantage of huge pages, but this is possible only if DRM GEM objects are backed by huge pages. Since GEM uses shmem to allocate anonymous pageable memory, having control over the default huge page allocation policy allows for the exploration of huge pages use on GPUs that rely on GEM objects backed by shmem. Link: https://lkml.kernel.org/r/20241101165719.1074234-2-mcanal@igalia.com Link: https://lkml.kernel.org/r/20241101165719.1074234-4-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: dri-devel@lists.freedesktop.org Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kernel-dev@igalia.com Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11Merge tag 'v6.12-rc7' into __tmp-hansg-linux-tags_media_atomisp_6_13_1Mauro Carvalho Chehab
Linux 6.12-rc7 * tag 'v6.12-rc7': (1909 commits) Linux 6.12-rc7 filemap: Fix bounds checking in filemap_read() i2c: designware: do not hold SCL low when I2C_DYNAMIC_TAR_UPDATE is not set mailmap: add entry for Thorsten Blum ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove() signal: restore the override_rlimit logic fs/proc: fix compile warning about variable 'vmcore_mmap_ops' ucounts: fix counter leak in inc_rlimit_get_ucounts() selftests: hugetlb_dio: check for initial conditions to skip in the start mm: fix docs for the kernel parameter ``thp_anon=`` mm/damon/core: avoid overflow in damon_feed_loop_next_input() mm/damon/core: handle zero schemes apply interval mm/damon/core: handle zero {aggregation,ops_update} intervals mm/mlock: set the correct prev on failure objpool: fix to make percpu slot allocation more robust mm/page_alloc: keep track of free highatomic bcachefs: Fix UAF in __promote_alloc() error path bcachefs: Change OPT_STR max to be 1 less than the size of choices array bcachefs: btree_cache.freeable list fixes bcachefs: check the invalid parameter for perf test ...
2024-11-11mm: add per-order mTHP swpin countersBarry Song
This helps profile the sizes of folios being swapped in. Currently, only mTHP swap-out is being counted. The new interface can be found at: /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats swpin For example, cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin 12809 cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin 4763 [v-songbaohua@oppo.com: add a blank line in doc] Link: https://lkml.kernel.org/r/20241030233423.80759-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20241026082423.26298-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: swap: count successful large folio zswap stores in hugepage zswpout statsKanchana P Sridhar
Added a new MTHP_STAT_ZSWPOUT entry to the sysfs transparent_hugepage stats so that successful large folio zswap stores can be accounted under the per-order sysfs "zswpout" stats: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout Other non-zswap swap device swap-out events will be counted under the existing sysfs "swpout" stats: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout Also, added documentation for the newly added sysfs per-order hugepage "zswpout" stats. The documentation clarifies that only non-zswap swapouts will be accounted in the existing "swpout" stats. Link: https://lkml.kernel.org/r/20241001053222.6944-8-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11Merge branch 'mm-hotfixes-stable' into mm-stableAndrew Morton
Pick up e7ac4daeed91 ("mm: count zeromap read and set for swapout and swapin") in order to move mm: define obj_cgroup_get() if CONFIG_MEMCG is not defined mm: zswap: modify zswap_compress() to accept a page instead of a folio mm: zswap: rename zswap_pool_get() to zswap_pool_tryget() mm: zswap: modify zswap_stored_pages to be atomic_long_t mm: zswap: support large folios in zswap_store() mm: swap: count successful large folio zswap stores in hugepage zswpout stats mm: zswap: zswap_store_page() will initialize entry after adding to xarray. mm: add per-order mTHP swpin counters from mm-unstable into mm-stable.
2024-11-11mm: count zeromap read and set for swapout and swapinBarry Song
When the proportion of folios from the zeromap is small, missing their accounting may not significantly impact profiling. However, it's easy to construct a scenario where this becomes an issue—for example, allocating 1 GB of memory, writing zeros from userspace, followed by MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out and swap-in counts seem to vanish into a black hole, potentially causing semantic ambiguity. On the other hand, Usama reported that zero-filled pages can exceed 10% in workloads utilizing zswap, while Hailong noted that some app in Android have more than 6% zero-filled pages. Before commit 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap"), both zswap and zRAM implemented similar optimizations, leading to these optimized-out pages being counted in either zswap or zRAM counters (with pswpin/pswpout also increasing for zRAM). With zeromap functioning prior to both zswap and zRAM, userspace will no longer detect these swap-out and swap-in actions. We have three ways to address this: 1. Introduce a dedicated counter specifically for the zeromap. 2. Use pswpin/pswpout accounting, treating the zero map as a standard backend. This approach aligns with zRAM's current handling of same-page fills at the device level. However, it would mean losing the optimized-out page counters previously available in zRAM and would not align with systems using zswap. Additionally, as noted by Nhat Pham, pswpin/pswpout counters apply only to I/O done directly to the backend device. 3. Count zeromap pages under zswap, aligning with system behavior when zswap is enabled. However, this would not be consistent with zRAM, nor would it align with systems lacking both zswap and zRAM. Given the complications with options 2 and 3, this patch selects option 1. We can find these counters from /proc/vmstat (counters for the whole system) and memcg's memory.stat (counters for the interested memcg). For example: $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat swpin_zero 1648 swpout_zero 33536 $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat swpin_zero 3905 swpout_zero 3985 This patch does not address any specific zeromap bug, but the missing swpout and swpin counts for zero-filled pages can be highly confusing and may mislead user-space agents that rely on changes in these counters as indicators. Therefore, we add a Fixes tag to encourage the inclusion of this counter in any kernel versions with zeromap. Many thanks to Kanchana for the contribution of changing count_objcg_event() to count_objcg_events() to support large folios[1], which has now been incorporated into this patch. [1] https://lkml.kernel.org/r/20241001053222.6944-5-kanchana.p.sridhar@intel.com Link: https://lkml.kernel.org/r/20241107011246.59137-1-21cnbao@gmail.com Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") Co-developed-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Hailong Liu <hailong.liu@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Andi Kleen <ak@linux.intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07mm: fix docs for the kernel parameter ``thp_anon=``Maíra Canal
If we add ``thp_anon=32,64K:always`` to the kernel command line, we will see the following error: [ 0.000000] huge_memory: thp_anon=32,64K:always: error parsing string, ignoring setting This happens because the correct format isn't ``thp_anon=<size>,<size>[KMG]:<state>```, as [KMG] must follow each number to especify its unit. So, the correct format is ``thp_anon=<size>[KMG],<size>[KMG]:<state>```. Therefore, adjust the documentation to reflect the correct format of the parameter ``thp_anon=``. Link: https://lkml.kernel.org/r/20241101165719.1074234-3-mcanal@igalia.com Fixes: dd4d30d1cdbe ("mm: override mTHP "enabled" defaults at kernel cmdline") Signed-off-by: Maíra Canal <mcanal@igalia.com> Acked-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06memcg-v1: fully deprecate move_charge_at_immigrateShakeel Butt
Patch series "memcg-v1: fully deprecate charge moving". The memcg v1's charge moving feature has been deprecated for almost 2 years and the kernel warns if someone try to use it. This warning has been backported to all stable kernel and there have not been any report of the warning or the request to support this feature anymore. Let's proceed to fully deprecate this feature. This patch (of 6): Proceed with the complete deprecation of memcg v1's charge moving feature. The deprecation warning has been in the kernel for almost two years and has been ported to all stable kernel since. Now is the time to fully deprecate this feature. Link: https://lkml.kernel.org/r/20241025012304.2473312-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20241025012304.2473312-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05zram: permit only one post-processing operation at a timeSergey Senozhatsky
Both recompress and writeback soon will unlock slots during processing, which makes things too complex wrt possible race-conditions. We still want to clear PP_SLOT in slot_free, because this is how we figure out that slot that was selected for post-processing has been released under us and when we start post-processing we check if slot still has PP_SLOT set. At the same time, theoretically, we can have something like this: CPU0 CPU1 recompress scan slots set PP_SLOT unlock slot slot_free clear PP_SLOT allocate PP_SLOT writeback scan slots set PP_SLOT unlock slot select PP-slot test PP_SLOT So recompress will not detect that slot has been re-used and re-selected for concurrent writeback post-processing. Make sure that we only permit on post-processing operation at a time. So now recompress and writeback post-processing don't race against each other, we only need to handle slot re-use (slot_free and write), which is handled individually by each pp operation. Having recompress and writeback competing for the same slots is not exactly good anyway (can't imagine anyone doing that). Link: https://lkml.kernel.org/r/20240917021020.883356-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-04Documentation: Improve crash_kexec_post_notifiers descriptionGuilherme G. Piccoli
The crash_kexec_post_notifiers description could be improved a bit, by clarifying its upsides (yes, there are some!) and be more descriptive about the downsides, specially mentioning code that enables the option unconditionally, like Hyper-V[0], PowerPC (fadump)[1] and more recently, AMD SEV-SNP[2]. [0] Commit a11589563e96 ("x86/Hyper-V: Report crash register data or kmsg before running crash kernel"). [1] Commit 06e629c25daa ("powerpc/fadump: Fix inaccurate CPU state info in vmcore generated with panic"). [2] Commit 8ef979584ea8 ("crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump"). Reviewed-by: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20241027204159.985163-1-gpiccoli@igalia.com
2024-11-04Documentation: admin: reorganize kernel-parameters introRandy Dunlap
Reorganize the introduction to the kernel-parameters file to place related paragraphs together: - move module info together and near the beginning - add a Special Handling section for dashes, underscores, double quotes, cpu lists, and metric (KMG) suffixes. Expand the KMG suffixes to include TPE as well. - add a Kernel Build Options section Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20241029180320.412629-1-rdunlap@infradead.org
2024-10-29SLUB: Add support for per object memory policiesChristoph Lameter
The old SLAB allocator used to support memory policies on a per allocation bases. In SLUB the memory policies are applied on a per page frame / folio bases. Doing so avoids having to check memory policies in critical code paths for kmalloc and friends. This worked on general well on Intel/AMD/PowerPC because the interconnect technology is mature and can minimize the latencies through intelligent caching even if a small object is not placed optimally. However, on ARM we have an emergence of new NUMA interconnect technology based more on embedded devices. Caching of remote content can currently be ineffective using the standard building blocks / mesh available on that platform. Such architectures benefit if each slab object is individually placed according to memory policies and other restrictions. This patch adds another kernel parameter slab_strict_numa If that is set then a static branch is activated that will cause the hotpaths of the allocator to evaluate the current memory allocation policy. Each object will be properly placed by paying the price of extra processing and SLUB will no longer defer to the page allocator to apply memory policies at the folio level. This patch improves performance of memcached running on Ampere Altra 2P system (ARM Neoverse N1 processor) by 3.6% due to accurate placement of small kernel objects. Tested-by: Huang Shijie <shijie@os.amperecomputing.com> Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-10-28perf/marvell: Marvell PEM performance monitor supportGowthami Thiagarajan
PCI Express Interface PMU includes various performance counters to monitor the data that is transmitted over the PCIe link. The counters track various inbound and outbound transactions which includes separate counters for posted/non-posted/completion TLPs. Also, inbound and outbound memory read requests along with their latencies can also be monitored. Address Translation Services(ATS)events such as ATS Translation, ATS Page Request, ATS Invalidation along with their corresponding latencies are also supported. The performance counters are 64 bits wide. For instance, perf stat -e ib_tlp_pr <workload> tracks the inbound posted TLPs for the workload. Co-developed-by: Linu Cherian <lcherian@marvell.com> Signed-off-by: Linu Cherian <lcherian@marvell.com> Signed-off-by: Gowthami Thiagarajan <gthiagarajan@marvell.com> Link: https://lore.kernel.org/r/20241028055309.17893-1-gthiagarajan@marvell.com Signed-off-by: Will Deacon <will@kernel.org>
2024-10-28fs/writeback: convert wbc_account_cgroup_owner to take a folioPankaj Raghav
Most of the callers of wbc_account_cgroup_owner() are converting a folio to page before calling the function. wbc_account_cgroup_owner() is converting the page back to a folio to call mem_cgroup_css_from_folio(). Convert wbc_account_cgroup_owner() to take a folio instead of a page, and convert all callers to pass a folio directly except f2fs. Convert the page to folio for all the callers from f2fs as they were the only callers calling wbc_account_cgroup_owner() with a page. As f2fs is already in the process of converting to folios, these call sites might also soon be calling wbc_account_cgroup_owner() with a folio directly in the future. No functional changes. Only compile tested. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com Acked-by: David Sterba <dsterba@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-25fuse: enable dynamic configuration of fuse max pages limit (FUSE_MAX_MAX_PAGES)Joanne Koong
Introduce the capability to dynamically configure the max pages limit (FUSE_MAX_MAX_PAGES) through a sysctl. This allows system administrators to dynamically set the maximum number of pages that can be used for servicing requests in fuse. Previously, this is gated by FUSE_MAX_MAX_PAGES which is statically set to 256 pages. One result of this is that the buffer size for a write request is limited to 1 MiB on a 4k-page system. The default value for this sysctl is the original limit (256 pages). $ sysctl -a | grep max_pages_limit fs.fuse.max_pages_limit = 256 $ sysctl -n fs.fuse.max_pages_limit 256 $ echo 1024 | sudo tee /proc/sys/fs/fuse/max_pages_limit 1024 $ sysctl -n fs.fuse.max_pages_limit 1024 $ echo 65536 | sudo tee /proc/sys/fs/fuse/max_pages_limit tee: /proc/sys/fs/fuse/max_pages_limit: Invalid argument $ echo 0 | sudo tee /proc/sys/fs/fuse/max_pages_limit tee: /proc/sys/fs/fuse/max_pages_limit: Invalid argument $ echo 65535 | sudo tee /proc/sys/fs/fuse/max_pages_limit 65535 $ sysctl -n fs.fuse.max_pages_limit 65535 Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-10-22vfs: Add a sysctl for automated deletion of dentryYafang Shao
Commit 681ce8623567 ("vfs: Delete the associated dentry when deleting a file") introduced an unconditional deletion of the associated dentry when a file is removed. However, this led to performance regressions in specific benchmarks, such as ilebench.sum_operations/s [0], prompting a revert in commit 4a4be1ad3a6e ("Revert "vfs: Delete the associated dentry when deleting a file""). This patch seeks to reintroduce the concept conditionally, where the associated dentry is deleted only when the user explicitly opts for it during file removal. A new sysctl fs.automated_deletion_of_dentry is added for this purpose. Its default value is set to 0. There are practical use cases for this proactive dentry reclamation. Besides the Elasticsearch use case mentioned in commit 681ce8623567, additional examples have surfaced in our production environment. For instance, in video rendering services that continuously generate temporary files, upload them to persistent storage servers, and then delete them, a large number of negative dentries—serving no useful purpose—accumulate. Users in such cases would benefit from proactively reclaiming these negative dentries. Link: https://lore.kernel.org/linux-fsdevel/202405291318.4dfbb352-oliver.sang@intel.com [0] Link: https://lore.kernel.org/all/20240912-programm-umgibt-a1145fa73bb6@brauner/ Suggested-by: Christian Brauner <brauner@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20240929122831.92515-1-laoar.shao@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-21cpufreq: docs: Reflect latency changes in docsChristian Loehle
There were two changes related to transition latency recently. Namely commit e13aa799c2a6 ("cpufreq: Change default transition delay to 2ms") and commit 37c6dccd6837 ("cpufreq: Remove LATENCY_MULTIPLIER"). Both changed the defaults / maximums so let the documentation reflect that. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/46853b6e-bad5-4ace-9b23-ff157f234ae3@arm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-10-17ipe: allow secondary and platform keyrings to install/update policiesLuca Boccassi
The current policy management makes it impossible to use IPE in a general purpose distribution. In such cases the users are not building the kernel, the distribution is, and access to the private key included in the trusted keyring is, for obvious reason, not available. This means that users have no way to enable IPE, since there will be no built-in generic policy, and no access to the key to sign updates validated by the trusted keyring. Just as we do for dm-verity, kernel modules and more, allow the secondary and platform keyrings to also validate policies. This allows users enrolling their own keys in UEFI db or MOK to also sign policies, and enroll them. This makes it sensible to enable IPE in general purpose distributions, as it becomes usable by any user wishing to do so. Keys in these keyrings can already load kernels and kernel modules, so there is no security downgrade. Add a kconfig each, like dm-verity does, but default to enabled if the dependencies are available. Signed-off-by: Luca Boccassi <bluca@debian.org> Reviewed-by: Serge Hallyn <serge@hallyn.com> [FW: fixed some style issues] Signed-off-by: Fan Wu <wufan@kernel.org>
2024-10-17ipe: also reject policy updates with the same versionLuca Boccassi
Currently IPE accepts an update that has the same version as the policy being updated, but it doesn't make it a no-op nor it checks that the old and new policyes are the same. So it is possible to change the content of a policy, without changing its version. This is very confusing from userspace when managing policies. Instead change the update logic to reject updates that have the same version with ESTALE, as that is much clearer and intuitive behaviour. Signed-off-by: Luca Boccassi <bluca@debian.org> Reviewed-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Fan Wu <wufan@kernel.org>
2024-10-16media: admin-guide: Document the Raspberry Pi CFE (rp1-cfe)Tomi Valkeinen
Add documentation for rp1-cfe driver. Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2024-10-14Documentation/tracing: Mention that RESET_ATTACK_MITIGATION can clear memorySteven Rostedt
At the 2024 Linux Plumbers Conference, I was talking with Hans de Goede about the persistent buffer to display traces from previous boots. He mentioned that UEFI can clear memory. In my own tests I have not seen this. He later informed me that it requires the config option: CONFIG_RESET_ATTACK_MITIGATION It appears that setting this will allow the memory to be cleared on boot up, which will definitely clear out the trace of the previous boot. Add this information under the trace_instance in kernel-parameters.txt to let people know that this can cause issues. Link: https://lore.kernel.org/all/20170825155019.6740-2-ard.biesheuvel@linaro.org/ Reported-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20241007131653.35837081@gandalf.local.home
2024-10-10soundwire: intel_auxdevice: add kernel parameter for mclk dividerPierre-Louis Bossart
Add a kernel parameter to work-around discrepancies between hardware and platform firmware, it's not unusual to see e.g. 38.4MHz listed in _DSD properties as the SoundWire clock source, but the hardware may be based on a 19.2 MHz mclk source. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com> Link: https://lore.kernel.org/r/20241004021850.9758-1-yung-chuan.liao@linux.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2024-10-08media: staging: drop omap4issHans Verkuil
The omap4 camera driver has seen no progress since forever, and now OMAP4 support has also been dropped from u-boot (1). So it is time to retire this driver. (1): https://lists.denx.de/pipermail/u-boot/2024-July/558846.html Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
2024-10-04arm64/idreg: Add overrride for GCSMark Brown
Hook up an override for GCS, allowing it to be disabled from the command line by specifying arm64.nogcs in case there are problems. Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-17-222b78d87eee@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-02PCI: Add TLP Processing Hints (TPH) supportWei Huang
Add support for PCIe TLP Processing Hints (TPH) support (see PCIe r6.2, sec 6.17). Add TPH register definitions in pci_regs.h, including the TPH Requester capability register, TPH Requester control register, TPH Completer capability, and the ST fields of MSI-X entry. Introduce pcie_enable_tph() and pcie_disable_tph(), enabling drivers to toggle TPH support and configure specific ST mode as needed. Also add a new kernel parameter, "pci=notph", allowing users to disable TPH support across the entire system. Link: https://lore.kernel.org/r/20241002165954.128085-2-wei.huang2@amd.com Co-developed-by: Jing Liu <jing2.liu@intel.com> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Co-developed-by: Eric Van Tassell <Eric.VanTassell@amd.com> Signed-off-by: Jing Liu <jing2.liu@intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Eric Van Tassell <Eric.VanTassell@amd.com> Signed-off-by: Wei Huang <wei.huang2@amd.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com> Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Lukas Wunner <lukas@wunner.de>
2024-09-28Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull x86 kvm updates from Paolo Bonzini: "x86: - KVM currently invalidates the entirety of the page tables, not just those for the memslot being touched, when a memslot is moved or deleted. This does not traditionally have particularly noticeable overhead, but Intel's TDX will require the guest to re-accept private pages if they are dropped from the secure EPT, which is a non starter. Actually, the only reason why this is not already being done is a bug which was never fully investigated and caused VM instability with assigned GeForce GPUs, so allow userspace to opt into the new behavior. - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon) - Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work) - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value at the ICR offset - Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit - Fix a WARN caused by RSM entering a nested guest from SMM with invalid guest state, by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the nested guest) - Overhaul the "unprotect and retry" logic to more precisely identify cases where retrying is actually helpful, and to harden all retry paths against putting the guest into an infinite retry loop - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in the shadow MMU - Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for adding multi generation LRU support in KVM - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled, i.e. when the CPU has already flushed the RSB - Trace the per-CPU host save area as a VMCB pointer to improve readability and cleanup the retrieval of the SEV-ES host save area - Remove unnecessary accounting of temporary nested VMCB related allocations - Set FINAL/PAGE in the page fault error code for EPT violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2 - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures - Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible) - Minor SGX fix and a cleanup - Misc cleanups Generic: - Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed - Enable virtualization when KVM is loaded, not right before the first VM is created Together with the previous change, this simplifies a lot the logic of the callbacks, because their very existence implies virtualization is enabled - Fix a bug that results in KVM prematurely exiting to userspace for coalesced MMIO/PIO in many cases, clean up the related code, and add a testcase - Fix a bug in kvm_clear_guest() where it would trigger a buffer overflow _if_ the gpa+len crosses a page boundary, which thankfully is guaranteed to not happen in the current code base. Add WARNs in more helpers that read/write guest memory to detect similar bugs Selftests: - Fix a goof that caused some Hyper-V tests to be skipped when run on bare metal, i.e. NOT in a VM - Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES guest - Explicitly include one-off assets in .gitignore. Past Sean was completely wrong about not being able to detect missing .gitignore entries - Verify userspace single-stepping works when KVM happens to handle a VM-Exit in its fastpath - Misc cleanups" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) Documentation: KVM: fix warning in "make htmldocs" s390: Enable KVM_S390_UCONTROL config in debug_defconfig selftests: kvm: s390: Add VM run test case KVM: SVM: let alternatives handle the cases when RSB filling is required KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range() KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps() KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() KVM: x86: Update retry protection fields when forcing retry on emulation failure KVM: x86: Apply retry protection to "unprotect on failure" path KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn ...