Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "hung_task: extend blocking task stacktrace dump to semaphore" from
Lance Yang enhances the hung task detector.
The detector presently dumps the blocking tasks's stack when it is
blocked on a mutex. Lance's series extends this to semaphores
- "nilfs2: improve sanity checks in dirty state propagation" from
Wentao Liang addresses a couple of minor flaws in nilfs2
- "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn
fixes a couple of issues in the gdb scripts
- "Support kdump with LUKS encryption by reusing LUKS volume keys" from
Coiby Xu addresses a usability problem with kdump.
When the dump device is LUKS-encrypted, the kdump kernel may not have
the keys to the encrypted filesystem. A full writeup of this is in
the series [0/N] cover letter
- "sysfs: add counters for lockups and stalls" from Max Kellermann adds
/sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and
/sys/kernel/rcu_stall_count
- "fork: Page operation cleanups in the fork code" from Pasha Tatashin
implements a number of code cleanups in fork.c
- "scripts/gdb/symbols: determine KASLR offset on s390 during early
boot" from Ilya Leoshkevich fixes some s390 issues in the gdb
scripts
* tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits)
llist: make llist_add_batch() a static inline
delayacct: remove redundant code and adjust indentation
squashfs: add optional full compressed block caching
crash_dump, nvme: select CONFIGFS_FS as built-in
scripts/gdb/symbols: determine KASLR offset on s390 during early boot
scripts/gdb/symbols: factor out pagination_off()
scripts/gdb/symbols: factor out get_vmlinux()
kernel/panic.c: format kernel-doc comments
mailmap: update and consolidate Casey Connolly's name and email
nilfs2: remove wbc->for_reclaim handling
fork: define a local GFP_VMAP_STACK
fork: check charging success before zeroing stack
fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code
fork: clean-up ifdef logic around stack allocation
kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
x86/crash: make the page that stores the dm crypt keys inaccessible
x86/crash: pass dm crypt keys to kdump kernel
Revert "x86/mm: Remove unused __set_memory_prot()"
crash_dump: retrieve dm crypt keys in kdump kernel
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architecture must implement to provide
this.
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.
- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invokation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleaups and a small efficiency improvement to
this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscalls arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value.
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.
This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecesary work, but this is fixed up where
it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.
Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.
stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramaticelly
reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.
- ""Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible. because presently, reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" argument to the "swappiness=" argument
for memory.reclaim MGLRU's lru_gen.
This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range. By skipping
ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned a single NUMA mode.
Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().
The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.
This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.
* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull automount updates from Al Viro:
"Automount wart removal
A bunch of odd boilerplate gone from instances - the reason for
those was the need to protect the yet-to-be-attched mount from
mark_mounts_for_expiry() deciding to take it out.
But that's easy to detect and take care of in mark_mounts_for_expiry()
itself; no need to have every instance simulate mount being busy by
grabbing an extra reference to it, with finish_automount() undoing
that once it attaches that mount.
Should've done it that way from the very beginning... This is a
flagday change, thankfully there are very few instances.
vfs_submount() is gone - its sole remaining user (trace_automount)
had been switched to saner primitives"
* tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kill vfs_submount()
saner calling conventions for ->d_automount()
|
|
Pull UFS updates from Al Viro:
"The bulk of this is Eric's conversion of UFS to new mount API, with a
bit of cleanups from me. I hoped to get stricter sanity checks on
superblock flags into that pile, but... next cycle, hopefully"
* tag 'pull-ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ufs: convert ufs to the new mount API
ufs: reject multiple conflicting -o ufstype=... on mount
ufs: split ->s_mount_opt - don't mix flavour and on-error
|
|
Pull mount propagation fix from Al Viro:
"6.15 allowed mount propagation to destinations in detached trees;
unfortunately, that breaks existing userland, so the old behaviour
needs to be restored.
It's not exactly a revert - the original behaviour had a bug, where
existence of detached tree might disrupt propagation between locations
not in detached trees. Thankfully, userland did not depend upon that
bug, so we want to keep the fix"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Don't propagate mounts into detached trees
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, Matthew converted most of page operations to using
folio. Beyond the work, we've applied some performance tunings such as
GC and linear lookup, in addition to enhancing fault injection and
sanity checks.
Enhancements:
- large number of folio conversions
- add a control to turn on/off the linear lookup for performance
- tune GC logics for zoned block device
- improve fault injection and sanity checks
Bug fixes:
- handle error cases of memory donation
- fix to correct check conditions in f2fs_cross_rename
- fix to skip f2fs_balance_fs() if checkpoint is disabled
- don't over-report free space or inodes in statvfs
- prevent the current section from being selected as a victim during GC
- fix to calculate first_zoned_segno correctly
- fix to avoid inconsistence between SIT and SSA for zoned block device
As usual, there are several debugging patches and clean-ups as well"
* tag 'f2fs-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (195 commits)
f2fs: fix to correct check conditions in f2fs_cross_rename
f2fs: use d_inode(dentry) cleanup dentry->d_inode
f2fs: fix to skip f2fs_balance_fs() if checkpoint is disabled
f2fs: clean up to check bi_status w/ BLK_STS_OK
f2fs: introduce is_{meta,node}_folio
f2fs: add ckpt_valid_blocks to the section entry
f2fs: add a method for calculating the remaining blocks in the current segment in LFS mode.
f2fs: introduce FAULT_VMALLOC
f2fs: use vmalloc instead of kvmalloc in .init_{,de}compress_ctx
f2fs: add f2fs_bug_on() in f2fs_quota_read()
f2fs: add f2fs_bug_on() to detect potential bug
f2fs: remove unused sbi argument from checksum functions
f2fs: fix 32-bits hexademical number in fault injection doc
f2fs: don't over-report free space or inodes in statvfs
f2fs: return bool from __write_node_folio
f2fs: simplify return value handling in f2fs_fsync_node_pages
f2fs: always unlock the page in f2fs_write_single_data_page
f2fs: remove wbc->for_reclaim handling
f2fs: return bool from __f2fs_write_meta_folio
f2fs: fix to return correct error number in f2fs_sync_node_pages()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 and isofs updates from Jan Kara:
- isofs fix of handling of particularly formatted Rock Ridge timestamps
- Add deprecation notice about support of DAX in ext2 filesystem driver
* tag 'fs_for_v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: Deprecate DAX
isofs: fix Y2038 and Y2156 issues in Rock Ridge TF entry
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"Two fanotify cleanups and support for watching namespace-owned
filesystems by namespace admins (most useful for being able to watch
for new mounts / unmounts happening within a user namespace)"
* tag 'fsnotify_for_v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: support watching filesystems and mounts inside userns
fanotify: remove redundant permission checks
fanotify: Drop use of flex array in fanotify_fh
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core updates from Greg KH:
"Here are the driver core / kernfs changes for 6.16-rc1.
Not a huge number of changes this development cycle, here's the
summary of what is included in here:
- kernfs locking tweaks, pushing some global locks down into a per-fs
image lock
- rust driver core and pci device bindings added for new features.
- sysfs const work for bin_attributes.
The final churn of switching away from and removing the
transitional struct members, "read_new", "write_new" and
"bin_attrs_new" will come after the merge window to avoid
unnecesary merge conflicts.
- auxbus device creation helpers added
- fauxbus fix for creating sysfs files after the probe completed
properly
- other tiny updates for driver core things.
All of these have been in linux-next for over a week with no reported
issues"
* tag 'driver-core-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
kernfs: Relax constraint in draining guard
Documentation: embargoed-hardware-issues.rst: Remove myself
drivers: hv: fix up const issue with vmbus_chan_bin_attrs
firmware_loader: use SHA-256 library API instead of crypto_shash API
docs: debugfs: do not recommend debugfs_remove_recursive
PM: wakeup: Do not expose 4 device wakeup source APIs
kernfs: switch global kernfs_rename_lock to per-fs lock
kernfs: switch global kernfs_idr_lock to per-fs lock
driver core: auxiliary bus: Fix IS_ERR() vs NULL mixup in __devm_auxiliary_device_create()
sysfs: constify attribute_group::bin_attrs
sysfs: constify bin_attribute argument of bin_attribute::read/write()
software node: Correct a OOB check in software_node_get_reference_args()
devres: simplify devm_kstrdup() using devm_kmemdup()
platform: replace magic number with macro PLATFORM_DEVID_NONE
component: do not try to unbind unbound components
driver core: auxiliary bus: add device creation helpers
driver core: faux: Add sysfs groups after probing
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Implement the Device Memory TCP transmit path, allowing zero-copy
data transmission on top of TCP from e.g. GPU memory to the wire.
- Move all the IPv6 routing tables management outside the RTNL scope,
under its own lock and RCU. The route control path is now 3x times
faster.
- Convert queue related netlink ops to instance lock, reducing again
the scope of the RTNL lock. This improves the control plane
scalability.
- Refactor the software crc32c implementation, removing unneeded
abstraction layers and improving significantly the related
micro-benchmarks.
- Optimize the GRO engine for UDP-tunneled traffic, for a 10%
performance improvement in related stream tests.
- Cover more per-CPU storage with local nested BH locking; this is a
prep work to remove the current per-CPU lock in local_bh_disable()
on PREMPT_RT.
- Introduce and use nlmsg_payload helper, combining buffer bounds
verification with accessing payload carried by netlink messages.
Netfilter:
- Rewrite the procfs conntrack table implementation, improving
considerably the dump performance. A lot of user-space tools still
use this interface.
- Implement support for wildcard netdevice in netdev basechain and
flowtables.
- Integrate conntrack information into nft trace infrastructure.
- Export set count and backend name to userspace, for better
introspection.
BPF:
- BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
programs and can be controlled in similar way to traditional qdiscs
using the "tc qdisc" command.
- Refactor the UDP socket iterator, addressing long standing issues
WRT duplicate hits or missed sockets.
Protocols:
- Improve TCP receive buffer auto-tuning and increase the default
upper bound for the receive buffer; overall this improves the
single flow maximum thoughput on 200Gbs link by over 60%.
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport
security for connections to the AFS fileserver and VL server.
- Improve TCP multipath routing, so that the sources address always
matches the nexthop device.
- Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
and thus preventing DoS caused by passing around problematic FDs.
- Retire DCCP socket. DCCP only receives updates for bugs, and major
distros disable it by default. Its removal allows for better
organisation of TCP fields to reduce the number of cache lines hit
in the fast path.
- Extend TCP drop-reason support to cover PAWS checks.
Driver API:
- Reorganize PTP ioctl flag support to require an explicit opt-in for
the drivers, avoiding the problem of drivers not rejecting new
unsupported flags.
- Converted several device drivers to timestamping APIs.
- Introduce per-PHY ethtool dump helpers, improving the support for
dump operations targeting PHYs.
Tests and tooling:
- Add support for classic netlink in user space C codegen, so that
ynl-c can now read, create and modify links, routes addresses and
qdisc layer configuration.
- Add ynl sub-types for binary attributes, allowing ynl-c to output
known struct instead of raw binary data, clarifying the classic
netlink output.
- Extend MPTCP selftests to improve the code-coverage.
- Add tests for XDP tail adjustment in AF_XDP.
New hardware / drivers:
- OpenVPN virtual driver: offload OpenVPN data channels processing to
the kernel-space, increasing the data transfer throughput WRT the
user-space implementation.
- Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
- Broadcom asp-v3.0 ethernet driver.
- AMD Renoir ethernet device.
- ReakTek MT9888 2.5G ethernet PHY driver.
- Aeonsemi 10G C45 PHYs driver.
Drivers:
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- refactor the steering table handling to significantly
reduce the amount of memory used
- add support for complex matches in H/W flow steering
- improve flow streeing error handling
- convert to netdev instance locking
- Intel (100G, ice, igb, ixgbe, idpf):
- ice: add switchdev support for LLDP traffic over VF
- ixgbe: add firmware manipulation and regions devlink support
- igb: introduce support for frame transmission premption
- igb: adds persistent NAPI configuration
- idpf: introduce RDMA support
- idpf: add initial PTP support
- Meta (fbnic):
- extend hardware stats coverage
- add devlink dev flash support
- Broadcom (bnxt):
- add support for RX-side device memory TCP
- Wangxun (txgbe):
- implement support for udp tunnel offload
- complete PTP and SRIOV support for AML 25G/10G devices
- Ethernet NICs embedded and virtual:
- Google (gve):
- add device memory TCP TX support
- Amazon (ena):
- support persistent per-NAPI config
- Airoha:
- add H/W support for L2 traffic offload
- add per flow stats for flow offloading
- RealTek (rtl8211): add support for WoL magic packet
- Synopsys (stmmac):
- dwmac-socfpga 1000BaseX support
- add Loongson-2K3000 support
- introduce support for hardware-accelerated VLAN stripping
- Broadcom (bcmgenet):
- expose more H/W stats
- Freescale (enetc, dpaa2-eth):
- enetc: add MAC filter, VLAN filter RSS and loopback support
- dpaa2-eth: convert to H/W timestamping APIs
- vxlan: convert FDB table to rhashtable, for better scalabilty
- veth: apply qdisc backpressure on full ring to reduce TX drops
- Ethernet switches:
- Microchip (kzZ88x3): add ETS scheduler support
- Ethernet PHYs:
- RealTek (rtl8211):
- add support for WoL magic packet
- add support for PHY LEDs
- CAN:
- Adds RZ/G3E CANFD support to the rcar_canfd driver.
- Preparatory work for CAN-XL support.
- Add self-tests framework with support for CAN physical interfaces.
- WiFi:
- mac80211:
- scan improvements with multi-link operation (MLO)
- Qualcomm (ath12k):
- enable AHB support for IPQ5332
- add monitor interface support to QCN9274
- add multi-link operation support to WCN7850
- add 802.11d scan offload support to WCN7850
- monitor mode for WCN7850, better 6 GHz regulatory
- Qualcomm (ath11k):
- restore hibernation support
- MediaTek (mt76):
- WiFi-7 improvements
- implement support for mt7990
- Intel (iwlwifi):
- enhanced multi-link single-radio (EMLSR) support on 5 GHz links
- rework device configuration
- RealTek (rtw88):
- improve throughput for RTL8814AU
- RealTek (rtw89):
- add multi-link operation support
- STA/P2P concurrency improvements
- support different SAR configs by antenna
- Bluetooth:
- introduce HCI Driver protocol
- btintel_pcie: do not generate coredump for diagnostic events
- btusb: add HCI Drv commands for configuring altsetting
- btusb: add RTL8851BE device 0x0bda:0xb850
- btusb: add new VID/PID 13d3/3584 for MT7922
- btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
- btnxpuart: implement host-wakeup feature"
* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits)
selftests/bpf: Fix bpf selftest build warning
selftests: netfilter: Fix skip of wildcard interface test
net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames
net: openvswitch: Fix the dead loop of MPLS parse
calipso: Don't call calipso functions for AF_INET sk.
selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem
net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback
octeontx2-pf: QOS: Perform cache sync on send queue teardown
net: mana: Add support for Multi Vports on Bare metal
net: devmem: ncdevmem: remove unused variable
net: devmem: ksft: upgrade rx test to send 1K data
net: devmem: ksft: add 5 tuple FS support
net: devmem: ksft: add exit_wait to make rx test pass
net: devmem: ksft: add ipv4 support
net: devmem: preserve sockc_err
page_pool: fix ugly page_pool formatting
net: devmem: move list_add to net_devmem_bind_dmabuf.
selftests: netfilter: nft_queue.sh: include file transfer duration in log message
net: phy: mscc: Fix memory leak when using one step timestamping
...
|
|
Pull jfs updates from David Kleikamp:
"A few small fixes for jfs"
* tag 'jfs-6.16' of github.com:kleikamp/linux-shaggy:
jfs: fix array-index-out-of-bounds read in add_missing_indices
jfs: Fix null-ptr-deref in jfs_ioc_trim
jfs: validate AG parameters in dbMount() to prevent crashes
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This fixes delays when shutting down SCTP connections, and updates dlm
Kconfig for SCTP"
* tag 'dlm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: drop SCTP Kconfig dependency
dlm: reject SCTP configuration if not enabled
dlm: use SHUT_RDWR for SCTP shutdown
dlm: mask sk_shutdown value
|
|
Pull nfsd updates from Chuck Lever:
"The marquee feature for this release is that the limit on the maximum
rsize and wsize has been raised to 4MB. The default remains at 1MB,
but risk-seeking administrators now have the ability to try larger I/O
sizes with NFS clients that support them. Eventually the default
setting will be increased when we have confidence that this change
will not have negative impact.
With v6.16, NFSD now has its own debugfs file system where we can add
experimental features and make them available outside of our
development community without impacting production deployments. The
first experimental setting added is one that makes all NFS READ
operations use vfs_iter_read() instead of the NFSD splice actor. The
plan is to eventually retire the splice actor, as that will enable a
number of new capabilities such as the use of struct bio_vec from the
top to the bottom of the NFSD stack.
Jeff Layton contributed a number of observability improvements. The
use of dprintk() in a number of high-traffic code paths has been
replaced with static trace points.
This release sees the continuation of efforts to harden the NFSv4.2
COPY operation. Soon, the restriction on async COPY operations can be
lifted.
Many thanks to the contributors, reviewers, testers, and bug reporters
who participated during the v6.16 development cycle"
* tag 'nfsd-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (60 commits)
xdrgen: Fix code generated for counted arrays
SUNRPC: Bump the maximum payload size for the server
NFSD: Add a "default" block size
NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro
NFSD: Remove NFSD_BUFSIZE
sunrpc: Remove the RPCSVC_MAXPAGES macro
svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages
svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages
sunrpc: Adjust size of socket's receive page array dynamically
SUNRPC: Remove svc_rqst :: rq_vec
SUNRPC: Remove svc_fill_write_vector()
NFSD: Use rqstp->rq_bvec in nfsd_iter_write()
SUNRPC: Export xdr_buf_to_bvec()
NFSD: De-duplicate the svc_fill_write_vector() call sites
NFSD: Use rqstp->rq_bvec in nfsd_iter_read()
sunrpc: Replace the rq_bvec array with dynamically-allocated memory
sunrpc: Replace the rq_pages array with dynamically-allocated memory
sunrpc: Remove backchannel check in svc_init_buffer()
sunrpc: Add a helper to derive maxpages from sv_max_mesg
svcrdma: Reduce the number of rdma_rw contexts per-QP
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"New ext4 features and performance improvements:
- Fast commit performance improvements
- Multi-fsblock atomic write support for bigalloc file systems
- Large folio support for regular files
This last can result in really stupendous performance for the right
workloads. For example, see [1] where the Kernel Test Robot reported
over 37% improvement on a large sequential I/O workload.
There are also the usual bug fixes and cleanups. Of note are cleanups
of the extent status tree to fix potential races that could result in
the extent status tree getting corrupted under heavy simultaneous
allocation and deallocation to a single file"
Link: https://lore.kernel.org/all/202505161418.ec0d753f-lkp@intel.com/ [1]
* tag 'ext4_for_linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits)
ext4: Add a WARN_ON_ONCE for querying LAST_IN_LEAF instead
ext4: Simplify flags in ext4_map_query_blocks()
ext4: Rename and document EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER
ext4: Simplify last in leaf check in ext4_map_query_blocks
ext4: Unwritten to written conversion requires EXT4_EX_NOCACHE
ext4: only dirty folios when data journaling regular files
ext4: Add atomic block write documentation
ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc
ext4: Add multi-fsblock atomic write support with bigalloc
ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
ext4: Make ext4_meta_trans_blocks() non-static for later use
ext4: Check if inode uses extents in ext4_inode_can_atomic_write()
ext4: Document an edge case for overwrites
jbd2: remove journal_t argument from jbd2_superblock_csum()
jbd2: remove journal_t argument from jbd2_chksum()
ext4: remove sb argument from ext4_superblock_csum()
ext4: remove sbi argument from ext4_chksum()
ext4: enable large folio for regular file
ext4: make online defragmentation support large folios
ext4: make the writeback path support large folios
...
|
|
https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs updates from Konstantin Komarov:
"Added:
- missing direct_IO in ntfs_aops_cmpr
- handling of hdr_first_de() return value
Fixed:
- handling of InitializeFileRecordSegment operation.
Removed:
- ability to change compression on mounted volume
- redundant NULL check"
* tag 'ntfs3_for_6.16' of https://github.com/Paragon-Software-Group/linux-ntfs3:
fs/ntfs3: remove ability to change compression on mounted volume
fs/ntfs3: Fix handling of InitializeFileRecordSegment
fs/ntfs3: Add missing direct_IO in ntfs_aops_cmpr
fs/ntfs3: handle hdr_first_de() return value
fs/ntfs3: Drop redundant NULL check
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs update from Mike Marshall:
"Convert to use the new mount API.
Code from Eric Sandeen at redhat that converts orangefs over to the
new mount API"
* tag 'for-linus-6.16-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: Convert to use the new mount API
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat updates from Namjae Jeon:
- Fix xfstests generic/482 test failure
- Fix double free in delayed_free
* tag 'exfat-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: do not clear volume dirty flag during sync
exfat: fix double free in delayed_free
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fixup to the xarray conversion sent in the main 6.16 batch. It was
not included because it would cause rebase/refresh of like 80 patches,
right before sending the early pull request last week.
It's fixing a bug when zoned mode is enabled on btrfs so it's not
affecting most people"
* tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
|
|
Should be "old_dir" here.
Fixes: 5c57132eaf52 ("f2fs: support project quota")
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
no logic changes.
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Syzbot reports a f2fs bug as below:
INFO: task syz-executor328:5856 blocked for more than 144 seconds.
Not tainted 6.15.0-rc6-syzkaller-00208-g3c21441eeffc #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor328 state:D stack:24392 pid:5856 tgid:5832 ppid:5826 task_flags:0x400040 flags:0x00004006
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5382 [inline]
__schedule+0x168f/0x4c70 kernel/sched/core.c:6767
__schedule_loop kernel/sched/core.c:6845 [inline]
schedule+0x165/0x360 kernel/sched/core.c:6860
io_schedule+0x81/0xe0 kernel/sched/core.c:7742
f2fs_balance_fs+0x4b4/0x780 fs/f2fs/segment.c:444
f2fs_map_blocks+0x3af1/0x43b0 fs/f2fs/data.c:1791
f2fs_expand_inode_data+0x653/0xaf0 fs/f2fs/file.c:1872
f2fs_fallocate+0x4f5/0x990 fs/f2fs/file.c:1975
vfs_fallocate+0x6a0/0x830 fs/open.c:338
ioctl_preallocate fs/ioctl.c:290 [inline]
file_ioctl fs/ioctl.c:-1 [inline]
do_vfs_ioctl+0x1b8f/0x1eb0 fs/ioctl.c:885
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x82/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The root cause is after commit 84b5bb8bf0f6 ("f2fs: modify
f2fs_is_checkpoint_ready logic to allow more data to be written with the
CP disable"), we will get chance to allow f2fs_is_checkpoint_ready() to
return true once below conditions are all true:
1. checkpoint is disabled
2. there are not enough free segments
3. there are enough free blocks
Then it will cause f2fs_balance_fs() to trigger foreground GC.
void f2fs_balance_fs(struct f2fs_sb_info *sbi, bool need)
...
if (!f2fs_is_checkpoint_ready(sbi))
return;
And the testcase mounts f2fs image w/ gc_merge,checkpoint=disable, so deadloop
will happen through below race condition:
- f2fs_do_shutdown - vfs_fallocate - gc_thread_func
- file_start_write
- __sb_start_write(SB_FREEZE_WRITE)
- f2fs_fallocate
- f2fs_expand_inode_data
- f2fs_map_blocks
- f2fs_balance_fs
- prepare_to_wait
- wake_up(gc_wait_queue_head)
- io_schedule
- bdev_freeze
- freeze_super
- sb->s_writers.frozen = SB_FREEZE_WRITE;
- sb_wait_write(sb, SB_FREEZE_WRITE);
- if (sbi->sb->s_writers.frozen >= SB_FREEZE_WRITE) continue;
: cause deadloop
This patch fix to add check condition in f2fs_balance_fs(), so that if
checkpoint is disabled, we will just skip trigger foreground GC to
avoid such deadloop issue.
Meanwhile let's remove f2fs_is_checkpoint_ready() check condition in
f2fs_balance_fs(), since it's redundant, due to the main logic in the
function is to check:
a) whether checkpoint is disabled
b) there is enough free segments
f2fs_balance_fs() still has all logics after f2fs_is_checkpoint_ready()'s
removal.
Reported-by: syzbot+aa5bb5f6860e08a60450@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/682d743a.a00a0220.29bc26.0289.GAE@google.com
Fixes: 84b5bb8bf0f6 ("f2fs: modify f2fs_is_checkpoint_ready logic to allow more data to be written with the CP disable")
Cc: Qi Han <hanqi@vivo.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Check bi_status w/ BLK_STS_OK instead of 0 for cleanup.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Just cleanup, no changes.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
when performing buffered writes in a large section,
overhead is incurred due to the iteration through
ckpt_valid_blocks within the section.
when SEGS_PER_SEC is 128, this overhead accounts for 20% within
the f2fs_write_single_data_page routine.
as the size of the section increases, the overhead also grows.
to handle this problem ckpt_valid_blocks is
added within the section entries.
Test
insmod null_blk.ko nr_devices=1 completion_nsec=1 submit_queues=8
hw_queue_depth=64 max_sectors=512 bs=4096 memory_backed=1
make_f2fs /dev/block/nullb0
make_f2fs -s 128 /dev/block/nullb0
fio --bs=512k --size=1536M --rw=write --name=1
--filename=/mnt/test_dir/seq_write
--ioengine=io_uring --iodepth=64 --end_fsync=1
before
SEGS_PER_SEC 1
2556MiB/s
SEGS_PER_SEC 128
2145MiB/s
after
SEGS_PER_SEC 1
2556MiB/s
SEGS_PER_SEC 128
2556MiB/s
Signed-off-by: yohan.joung <yohan.joung@sk.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
segment in LFS mode.
In LFS mode, the previous segment cannot use invalid blocks,
so the remaining blocks from the next_blkoff of the current segment
to the end of the section are calculated.
Signed-off-by: yohan.joung <yohan.joung@sk.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The commit 93e72b3c612adcaca1 ("squashfs: migrate from ll_rw_block usage
to BIO") removed caching of compressed blocks in SquashFS, causing fio
performance regression in workloads with repeated file reads. Without
caching, every read triggers disk I/O, severely impacting performance in
tools like fio.
This patch introduces a new CONFIG_SQUASHFS_COMP_CACHE_FULL Kconfig option
to enable caching of all compressed blocks, restoring performance to
pre-BIO migration levels. When enabled, all pages in a BIO are cached in
the page cache, reducing disk I/O for repeated reads. The fio test
results with this patch confirm the performance restoration:
For example, fio tests (iodepth=1, numjobs=1,
ioengine=psync) show a notable performance restoration:
Disable CONFIG_SQUASHFS_COMP_CACHE_FULL:
IOPS=815, BW=102MiB/s (107MB/s)(6113MiB/60001msec)
Enable CONFIG_SQUASHFS_COMP_CACHE_FULL:
IOPS=2223, BW=278MiB/s (291MB/s)(16.3GiB/59999msec)
The tradeoff is increased memory usage due to caching all compressed
blocks. The CONFIG_SQUASHFS_COMP_CACHE_FULL option allows users to enable
this feature selectively, balancing performance and memory usage for
workloads with frequent repeated reads.
Link: https://lkml.kernel.org/r/20250521072559.2389-1-chanho.min@lge.com
Signed-off-by: Chanho Min <chanho.min@lge.com>
Reviewed-by Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Configfs can be configured as a loadable module, which causes a link-time
failure for dm-crypt crash dump support:
crash_dump_dm_crypt.c:(.text+0x3a4): undefined reference to `config_item_init_type_name'
aarch64-linux-ld: kernel/crash_dump_dm_crypt.o: in function `configfs_dmcrypt_keys_init':
crash_dump_dm_crypt.c:(.init.text+0x90): undefined reference to `config_group_init'
aarch64-linux-ld: crash_dump_dm_crypt.c:(.init.text+0xb4): undefined reference to `configfs_register_subsystem'
aarch64-linux-ld: crash_dump_dm_crypt.c:(.init.text+0xd8): undefined reference to `configfs_unregister_subsystem'
This could be avoided with a dependency on CONFIGFS_FS=y, but the
dependency has an additional problem of causing Kconfig dependency loops
since most other uses select the symbol.
Using a simple 'select CONFIGFS_FS' here in turn fails with
CONFIG_DM_CRYPT=m, because that still only causes configfs to be a
loadable module.
The only version I found that fixes this reliably uses an additional
Kconfig symbol to ensure the 'select' actually turns on configfs as
builtin, with two additional changes to avoid dependency loops with nvme
and sysfs.
There is no compile-time dependency between configfs and sysfs, so
selecting configfs from a driver with sysfs disabled does not cause link
failures, only the default /sys/kernel/config mount point will not be
created.
Link: https://lkml.kernel.org/r/20250521160359.2132363-1-arnd@kernel.org
Fixes: 6b23858fd63b ("crash_dump: make dm crypt keys persist for the kdump kernel")
Fixes: 1fb470408497 ("nvme-loop: add configfs dependency")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Breno Leitao <leitao@debian.org>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce a new fault type FAULT_VMALLOC to simulate no memory error in
f2fs_vmalloc().
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
.init_{,de}compress_ctx uses kvmalloc() to alloc memory, it will try
to allocate physically continuous page first, it may cause more memory
allocation pressure, let's use vmalloc instead to mitigate it.
[Test]
cd /data/local/tmp
touch file
f2fs_io setflags compression file
f2fs_io getflags file
for i in $(seq 1 10); do sync; echo 3 > /proc/sys/vm/drop_caches;\
time f2fs_io write 512 0 4096 zero osync file; truncate -s 0 file;\
done
[Result]
Before After Delta
21.243 21.694 -2.12%
For compression, we recommend to use ioctl to compress file data in
background for workaround.
For decompression, only zstd will be affected.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
mapping_read_folio_gfp() will return a folio, it should always be
uptodate, let's check folio uptodate status to detect any potenial
bug.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Add f2fs_bug_on() to check whether memory preallocation will fail or
not after radix_tree_preload(GFP_NOFS | __GFP_NOFAIL).
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Since __f2fs_crc32() now calls crc32() directly, it no longer uses its
sbi argument. Remove that, and simplify its callers accordingly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
"Carve out the resctrl filesystem-related code into fs/resctrl/ so that
multiple architectures can share the fs API for manipulating their
respective hw resource control implementation.
This is the second step in the work towards sharing the resctrl
filesystem interface, the next one being plugging ARM's MPAM into the
aforementioned fs API"
* tag 'x86_cache_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
MAINTAINERS: Add reviewers for fs/resctrl
x86,fs/resctrl: Move the resctrl filesystem code to live in /fs/resctrl
x86/resctrl: Always initialise rid field in rdt_resources_all[]
x86/resctrl: Relax some asm #includes
x86/resctrl: Prefer alloc(sizeof(*foo)) idiom in rdt_init_fs_context()
x86/resctrl: Squelch whitespace anomalies in resctrl core code
x86/resctrl: Move pseudo lock prototypes to include/linux/resctrl.h
x86/resctrl: Fix types in resctrl_arch_mon_ctx_{alloc,free}() stubs
x86/resctrl: Move enum resctrl_event_id to resctrl.h
x86/resctrl: Move the filesystem bits to headers visible to fs/resctrl
fs/resctrl: Add boiler plate for external resctrl code
x86/resctrl: Add 'resctrl' to the title of the resctrl documentation
x86/resctrl: Split trace.h
x86/resctrl: Expand the width of domid by replacing mon_data_bits
x86/resctrl: Add end-marker to the resctrl_event_id enum
x86/resctrl: Move is_mba_sc() out of core.c
x86/resctrl: Drop __init/__exit on assorted symbols
x86/resctrl: Resctrl_exit() teardown resctrl but leave the mount point
x86/resctrl: Check all domains are offline in resctrl_exit()
x86/resctrl: Rename resctrl_sched_in() to begin with "resctrl_arch_"
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"Another set of timer API cleanups:
- Convert init_timer*(), try_to_del_timer_sync() and
destroy_timer_on_stack() over to the canonical timer_*()
namespace convention.
There is another large conversion pending, which has not been included
because it would have caused a gazillion of merge conflicts in next.
The conversion scripts will be run towards the end of the merge window
and a pull request sent once all conflict dependencies have been
merged"
* tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack()
treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try()
timers: Rename init_timers() as timers_init()
timers: Rename NEXT_TIMER_MAX_DELTA as TIMER_NEXT_MAX_DELTA
timers: Rename __init_timer_on_stack() as __timer_init_on_stack()
timers: Rename __init_timer() as __timer_init()
timers: Rename init_timer_on_stack_key() as timer_init_key_on_stack()
timers: Rename init_timer_key() as timer_init_key()
|
|
In the zoned mode there's a bug in the extent buffer tree conversion to
xarray. The reference for eb is dropped and code continues but the
references get dropped by releasing the batch.
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/linux-btrfs/202505191521.435b97ac-lkp@intel.com/
Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
All versions up to 6.14 did not propagate mount events into detached
tree. Shortly after 6.14 a merge of vfs-6.15-rc1.mount.namespace
(130e696aa68b) has changed that.
Unfortunately, that has caused userland regressions (reported in
https://lore.kernel.org/all/CAOYeF9WQhFDe+BGW=Dp5fK8oRy5AgZ6zokVyTj1Wp4EUiYgt4w@mail.gmail.com/)
Straight revert wouldn't be an option - in particular, the variant in 6.14
had a bug that got fixed in d1ddc6f1d9f0 ("fix IS_MNT_PROPAGATING uses")
and we don't want to bring the bug back.
This is a modification of manual revert posted by Christian, with changes
needed to avoid reintroducing the breakage in scenario described in
d1ddc6f1d9f0.
Cc: stable@vger.kernel.org
Reported-by: Allison Karlitskaya <lis@redhat.com>
Tested-by: Allison Karlitskaya <lis@redhat.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Fix memcpy_sglist to handle partially overlapping SG lists
- Use memcpy_sglist to replace null skcipher
- Rename CRYPTO_TESTS to CRYPTO_BENCHMARK
- Flip CRYPTO_MANAGER_DISABLE_TEST into CRYPTO_SELFTESTS
- Hide CRYPTO_MANAGER
- Add delayed freeing of driver crypto_alg structures
Compression:
- Allocate large buffers on first use instead of initialisation in scomp
- Drop destination linearisation buffer in scomp
- Move scomp stream allocation into acomp
- Add acomp scatter-gather walker
- Remove request chaining
- Add optional async request allocation
Hashing:
- Remove request chaining
- Add optional async request allocation
- Move partial block handling into API
- Add ahash support to hmac
- Fix shash documentation to disallow usage in hard IRQs
Algorithms:
- Remove unnecessary SIMD fallback code on x86 and arm/arm64
- Drop avx10_256 xts(aes)/ctr(aes) on x86
- Improve avx-512 optimisations for xts(aes)
- Move chacha arch implementations into lib/crypto
- Move poly1305 into lib/crypto and drop unused Crypto API algorithm
- Disable powerpc/poly1305 as it has no SIMD fallback
- Move sha256 arch implementations into lib/crypto
- Convert deflate to acomp
- Set block size correctly in cbcmac
Drivers:
- Do not use sg_dma_len before mapping in sun8i-ss
- Fix warm-reboot failure by making shutdown do more work in qat
- Add locking in zynqmp-sha
- Remove cavium/zip
- Add support for PCI device 0x17D8 to ccp
- Add qat_6xxx support in qat
- Add support for RK3576 in rockchip-rng
- Add support for i.MX8QM in caam
Others:
- Fix irq_fpu_usable/kernel_fpu_begin inconsistency during CPU bring-up
- Add new SEV/SNP platform shutdown API in ccp"
* tag 'v6.16-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (382 commits)
x86/fpu: Fix irq_fpu_usable() to return false during CPU onlining
crypto: qat - add missing header inclusion
crypto: api - Redo lookup on EEXIST
Revert "crypto: testmgr - Add hash export format testing"
crypto: marvell/cesa - Do not chain submitted requests
crypto: powerpc/poly1305 - add depends on BROKEN for now
Revert "crypto: powerpc/poly1305 - Add SIMD fallback"
crypto: ccp - Add missing tee info reg for teev2
crypto: ccp - Add missing bootloader info reg for pspv5
crypto: sun8i-ce - move fallback ahash_request to the end of the struct
crypto: octeontx2 - Use dynamic allocated memory region for lmtst
crypto: octeontx2 - Initialize cptlfs device info once
crypto: xts - Only add ecb if it is not already there
crypto: lrw - Only add ecb if it is not already there
crypto: testmgr - Add hash export format testing
crypto: testmgr - Use ahash for generic tfm
crypto: hmac - Add ahash support
crypto: testmgr - Ignore EEXIST on shash allocation
crypto: algapi - Add driver template support to crypto_inst_setname
crypto: shash - Set reqsize in shash_alg
...
|
|
Pull fscrypt update from Eric Biggers:
"Add support for 'hardware-wrapped inline encryption keys' to fscrypt.
When enabled on supported platforms, this feature protects file
contents keys from certain attacks, such as cold boot attacks.
This feature uses the block layer support for wrapped keys which was
merged in 6.15. Wrapped key support has existed out-of-tree in Android
for a long time, and it's finally ready for upstream now that there is
a platform on which it works end-to-end with upstream.
Specifically, it works on the Qualcomm SM8650 HDK, using the Qualcomm
ICE (Inline Crypto Engine) and HWKM (Hardware Key Manager). The
corresponding driver support is included in the SCSI tree for 6.16.
Validation for this feature includes two new tests that were already
merged into xfstests (generic/368 and generic/369)"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fscrypt: add support for hardware-wrapped keys
|
|
Pull xfs updates from Carlos Maiolino:
- Atomic writes for XFS
- Remove experimental warnings for pNFS, scrub and parent pointers
* tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
xfs: add inode to zone caching for data placement
xfs: free the item in xfs_mru_cache_insert on failure
xfs: remove the EXPERIMENTAL warning for pNFS
xfs: remove some EXPERIMENTAL warnings
xfs: Remove deprecated xfs_bufd sysctl parameters
xfs: stop using set_blocksize
xfs: allow sysadmins to specify a maximum atomic write limit at mount time
xfs: update atomic write limits
xfs: add xfs_calc_atomic_write_unit_max()
xfs: add xfs_file_dio_write_atomic()
xfs: commit CoW-based atomic writes atomically
xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
xfs: add xfs_atomic_write_cow_iomap_begin()
xfs: refine atomic write size check in xfs_file_write_iter()
xfs: refactor xfs_reflink_end_cow_extent()
xfs: allow block allocator to take an alignment hint
xfs: ignore HW which cannot atomic write a single block
xfs: add helpers to compute transaction reservation for finishing intent items
xfs: add helpers to compute log item overhead
xfs: separate out setting buftarg atomic writes limits
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, Intel QAT hardware accelerators are supported to
improve DEFLATE decompression performance. I've tested it with the
enwik9 dataset of 1 MiB pclusters on our Intel Sapphire Rapids
bare-metal server and a PL0 ESSD, and the sequential read performance
even surpasses LZ4 software decompression on this setup.
In addition, a `fsoffset` mount option is introduced for file-backed
mounts to specify the filesystem offset in order to adapt customized
container formats.
And other improvements and minor cleanups. Summary:
- Add a `fsoffset` mount option to specify the filesystem offset
- Support Intel QAT accelerators to boost up the DEFLATE algorithm
- Initialize per-CPU workers and CPU hotplug hooks lazily to avoid
unnecessary overhead when EROFS is not mounted
- Fix file handle encoding for 64-bit NIDs
- Minor cleanups"
* tag 'erofs-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: support DEFLATE decompression by using Intel QAT
erofs: clean up erofs_{init,exit}_sysfs()
erofs: add 'fsoffset' mount option to specify filesystem offset
erofs: lazily initialize per-CPU workers and CPU hotplug hooks
erofs: refine readahead tracepoint
erofs: avoid using multiple devices with different type
erofs: fix file handle encoding for 64-bit NIDs
|
|
Pull bcachefs updates from Kent Overstreet:
- Poisoned extents can now be moved: this lets us handle bitrotted data
without deleting it. For now, reading from poisoned extents only
returns -EIO: in the future we'll have an API for specifying "read
this data even if there were bitflips".
- Incompatible features may now be enabled at runtime, via
"opts/version_upgrade" in sysfs. Toggle it to incompatible, and then
toggle it back - option changes via the sysfs interface are
persistent.
- Various changes to support deployable disk images:
- RO mounts now use less memory
- Images may be stripped of alloc info, particularly useful for
slimming them down if they will primarily be mounted RO. Alloc
info will be automatically regenerated on first RW mount, and
this is quite fast
- Filesystem images generated with 'bcachefs image' will be
automatically resized the first time they're mounted on a larger
device
The images 'bcachefs image' generates with compression enabled have
been comparable in size to those generated by squashfs and erofs -
but you get a full RW capable filesystem
- Major error message improvements for btree node reads, data reads,
and elsewhere. We now build up a single error message that lists all
the errors encountered, actions taken to repair, and success/failure
of the IO. This extends to other error paths that may kick off other
actions, e.g. scheduling recovery passes: actions we took because of
an error are included in that error message, with
grouping/indentation so we can see what caused what.
- New option, 'rebalance_on_ac_only'. Does exactly what the name
suggests, quite handy with background compression.
- Repair/self healing:
- We can now kick off recovery passes and run them in the
background if we detect errors. Currently, this is just used by
code that walks backpointers. We now also check for missing
backpointers at runtime and run check_extents_to_backpointers if
required. The messy 6.14 upgrade left missing backpointers for
some users, and this will correct that automatically instead of
requiring a manual fsck - some users noticed this as copygc
spinning and not making progress.
In the future, as more recovery passes come online, we'll be able
to repair and recover from nearly anything - except for
unreadable btree nodes, and that's why you're using replication,
of course - without shutting down the filesystem.
- There's a new recovery pass, for checking the rebalance_work
btree, which tracks extents that rebalance will process later.
- Hardening:
- Close the last known hole in btree iterator/btree locking
assertions: path->should_be_locked paths must stay locked until
the end of the transaction. This shook out a few bugs, including
a performance issue that was causing unnecessary path_upgrade
transaction restarts.
- Performance:
- Faster snapshot deletion: this is an incompatible feature, as it
requires new sentinal values, for safety. Snapshot deletion no
longer has to do a full metadata scan, it now just scans the
inodes btree: if an extent/dirent/xattr is present for a given
snapshot ID, we already require that an inode be present with
that same snapshot ID.
If/when users hit scalability limits again (ridiculously huge
filesystems with lots of inodes, and many sparse snapshots), let
me know - the next step will be to add an index from snapshot ID
-> inode number, which won't be too hard.
- Faster device removal: the "scan for pointers to this device" no
longer does a full metadata scan, instead it walks backpointers.
Like fast snapshot deletion this is another incompat feature: it
also requires a new sentinal value, because we don't want to
reuse these device IDs until after a fsck.
- We're now coalescing redundant accounting updates prior to
transaction commit, taking some pressure off the journal. Shortly
we'll also be doing multiple extent updates in a transaction in
the main write path, which combined with the previous should
drastically cut down on the amount of metadata updates we have to
journal.
- Stack usage improvements: All allocator state has been moved off the
stack
- Debug improvements:
- enumerated refcounts: The debug code previously used for
filesystem write refs is now a small library, and used for other
heavily used refcounts. Different users of a refcount are
enumerated, making it much easier to debug refcount issues.
- Async object debugging: There's a new kconfig option that makes
various async objects (different types of bios, data updates,
write ops, etc.) visible in debugfs, and it should be fast enough
to leave on in production.
- Various sets of assertions no longer require
CONFIG_BCACHEFS_DEBUG, instead they're controlled by module
parameters and static keys, meaning users won't need to compile
custom kernels as often to help debug issues.
- bch2_trans_kmalloc() calls can be tracked (there's a new kconfig
option). With it on you can check the btree_transaction_stats in
debugfs to see the bch2_trans_kmalloc() calls a transaction did
when it used the most memory.
* tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs: (218 commits)
bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE
bcachefs: Fix btree_iter_next_node() for new locking asserts
bcachefs: Ensure we don't use a blacklisted journal seq
bcachefs: Small check_fix_ptr fixes
bcachefs: Fix opts.recovery_pass_last
bcachefs: Fix allocate -> self healing path
bcachefs: Fix endianness in casefold check/repair
bcachefs: Path must be locked if trans->locked && should_be_locked
bcachefs: Simplify bch2_path_put()
bcachefs: Plumb btree_trans for more locking asserts
bcachefs: Clear trans->locked before unlock
bcachefs: Clear should_be_locked before unlock in key_cache_drop()
bcachefs: bch2_path_get() reuses paths if upgrade_fails & !should_be_locked
bcachefs: Give out new path if upgrade fails
bcachefs: Fix btree_path_get_locks when not doing trans restart
bcachefs: btree_node_locked_type_nowrite()
bcachefs: Kill bch2_path_put_nokeep()
bcachefs: bch2_journal_write_checksum()
bcachefs: Reduce stack usage in data_update_index_update()
bcachefs: bch2_trans_log_str()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Fix the long-standing warnings in inode_to_wb() when CONFIG_LOCKDEP
is enabled: gfs2 doesn't support cgroup writeback and so inode->i_wb
will never change. This is the counterpart of commit 9e888998ea4d
("writeback: fix false warning in inode_to_wb()")
- Fix a hang introduced by commit 8d391972ae2d ("gfs2: Remove
__gfs2_writepage()"): prevent gfs2_logd from creating transactions
for jdata pages while trying to flush the log
- Fix a race between gfs2_create_inode() and gfs2_evict_inode() by
deallocating partially created inodes on the gfs2_create_inode()
error path
- Fix a bug in the journal head lookup code that could cause mount to
fail after successful recovery
- Various smaller fixes and cleanups from various people
* tag 'gfs2-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (23 commits)
gfs2: No more gfs2_find_jhead caching
gfs2: Get rid of duplicate log head lookup
gfs2: Simplify clean_journal
gfs2: Simplify gfs2_log_pointers_init
gfs2: Move gfs2_log_pointers_init
gfs2: Minor comments fix
gfs2: Don't start unnecessary transactions during log flush
gfs2: Move gfs2_trans_add_databufs
gfs2: Rename jdata_dirty_folio to gfs2_jdata_dirty_folio
gfs2: avoid inefficient use of crc32_le_shift()
gfs2: Do not call iomap_zero_range beyond eof
gfs: don't check for AOP_WRITEPAGE_ACTIVATE in gfs2_write_jdata_batch
gfs2: Fix usage of bio->bi_status in gfs2_end_log_write
gfs2: deallocate inodes in gfs2_create_inode
gfs2: Move GIF_ALLOC_FAILED check out of gfs2_ea_dealloc
gfs2: Move gfs2_dinode_dealloc
gfs2: Don't reread inodes unnecessarily
gfs2: gfs2_create_inode error handling fix
gfs2: Remove unnecessary NULL check before free_percpu()
gfs2: check sb_min_blocksize return value
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/a.hindborg/linux
Pull configfs updates from Andreas Hindborg:
- Allow creation of rw files with custom permissions. This allows
drivers to better protect secrets written through configfs
- Fix a bug where an error condition did not cause an early return
while populating attributes
- Report ENOMEM rather than EFAULT when kvasprintf() fails in
config_item_set_name()
- Add a Rust API for configfs. This allows Rust drivers to use configfs
through a memory safe interface
* tag 'configfs-for-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/a.hindborg/linux:
MAINTAINERS: add configfs Rust abstractions
rust: configfs: add a sample demonstrating configfs usage
rust: configfs: introduce rust support for configfs
configfs: Correct error value returned by API config_item_set_name()
configfs: Do not override creating attribute file failure in populate_attrs()
configfs: Delete semicolon from macro type_print() definition
configfs: Add CONFIGFS_ATTR_PERM helper
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Apart from numerous cleanups, there are some performance improvements
and one minor mount option update. There's one more radix-tree
conversion (one remaining), and continued work towards enabling large
folios (almost finished).
Performance:
- extent buffer conversion to xarray gains throughput and runtime
improvements on metadata heavy operations doing writeback (sample
test shows +50% throughput, -33% runtime)
- extent io tree cleanups lead to performance improvements by
avoiding unnecessary searches or repeated searches
- more efficient extent unpinning when committing transaction
(estimated run time improvement 3-5%)
User visible changes:
- remove standalone mount option 'nologreplay', deprecated in 5.9,
replacement is 'rescue=nologreplay'
- in scrub, update reporting, add back device stats message after
detected errors (accidentally removed during recent refactoring)
Core:
- convert extent buffer radix tree to xarray
- in subpage mode, move block perfect compression out of experimental
build
- in zoned mode, introduce sub block groups to allow managing special
block groups, like the one for relocation or tree-log, to handle
some corner cases of ENOSPC
- in scrub, simplify bitmaps for block tracking status
- continued preparations for large folios:
- remove assertions for folio order 0
- add support where missing: compression, buffered write, defrag,
hole punching, subpage, send
- fix fsync of files with no hard links not persisting deletion
- reject tree blocks which are not nodesize aligned, a precaution
from 4.9 times
- move transaction abort calls closer to the error sites
- remove usage of some struct bio_vec internals
- simplifications in extent map
- extent IO cleanups and optimizations
- error handling improvements
- enhanced ASSERT() macro with optional format strings
- cleanups:
- remove unused code
- naming unifications, dropped __, added prefix
- merge similar functions
- use common helpers for various data structures"
* tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (198 commits)
btrfs: move misplaced comment of btrfs_path::keep_locks
btrfs: remove standalone "nologreplay" mount option
btrfs: use a single variable to track return value at btrfs_page_mkwrite()
btrfs: don't return VM_FAULT_SIGBUS on failure to set delalloc for mmap write
btrfs: simplify early error checking in btrfs_page_mkwrite()
btrfs: pass true to btrfs_delalloc_release_space() at btrfs_page_mkwrite()
btrfs: fix wrong start offset for delalloc space release during mmap write
btrfs: fix harmless race getting delayed ref head count when running delayed refs
btrfs: log error codes during failures when writing super blocks
btrfs: simplify error return logic when getting folio at prepare_one_folio()
btrfs: return real error from __filemap_get_folio() calls
btrfs: remove superfluous return value check at btrfs_dio_iomap_begin()
btrfs: fix invalid data space release when truncating block in NOCOW mode
btrfs: update Kconfig option descriptions
btrfs: update list of features built under experimental config
btrfs: send: remove btrfs_debug() calls
btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent()
btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes()
btrfs: fold error checks when allocating ordered extent and update comments
btrfs: check we grabbed inode reference when allocating an ordered extent
...
|
|
Pull block updates from Jens Axboe:
- ublk updates:
- Add support for updating the size of a ublk instance
- Zero-copy improvements
- Auto-registering of buffers for zero-copy
- Series simplifying and improving GET_DATA and request lookup
- Series adding quiesce support
- Lots of selftests additions
- Various cleanups
- NVMe updates via Christoph:
- add per-node DMA pools and use them for PRP/SGL allocations
(Caleb Sander Mateos, Keith Busch)
- nvme-fcloop refcounting fixes (Daniel Wagner)
- support delayed removal of the multipath node and optionally
support the multipath node for private namespaces (Nilay Shroff)
- support shared CQs in the PCI endpoint target code (Wilfred
Mallawa)
- support admin-queue only authentication (Hannes Reinecke)
- use the crc32c library instead of the crypto API (Eric Biggers)
- misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
- MD updates via Yu:
- Fix that normal IO can be starved by sync IO, found by mkfs on
newly created large raid5, with some clean up patches for bdev
inflight counters
- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues
pending
- Remove block layer bouncing support, which in turn means we can
remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep
warnings we've been seeing, after enabling lockdep support for queue
freezing/unfreezeing
- Add support for block write streams via FDP (flexible data placement)
on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of
duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes
* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
selftests: ublk: add test for UBLK_F_QUIESCE
ublk: add feature UBLK_F_QUIESCE
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
traceevent/block: Add REQ_ATOMIC flag to block trace events
ublk: run auto buf unregisgering in same io_ring_ctx with registering
io_uring: add helper io_uring_cmd_ctx_handle()
ublk: remove io argument from ublk_auto_buf_reg_fallback()
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
selftests: ublk: support UBLK_F_AUTO_BUF_REG
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
ublk: prepare for supporting to register request buffer automatically
ublk: convert to refcount_t
selftests: ublk: make IO & device removal test more stressful
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
nvme: introduce multipath_always_on module param
nvme-multipath: introduce delayed removal of the multipath head node
nvme-pci: derive and better document max segments limits
nvme-pci: use struct_size for allocation struct nvme_dev
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
- More fallout and preparatory work associated with the folio batch
prototype posted a while back.
Mainly this just cleans up some of the helpers and pushes some
pos/len trimming further down in the write begin path.
- Add missing flag descriptions to the iomap documentation
* tag 'vfs-6.16-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: rework iomap_write_begin() to return folio offset and length
iomap: push non-large folio check into get folio path
iomap: helper to trim pos/bytes to within folio
iomap: drop pos param from __iomap_[get|put]_folio()
iomap: drop unnecessary pos param from iomap_write_[begin|end]
iomap: resample iter->pos after iomap_write_begin() calls
iomap: trace: Add missing flags to [IOMAP_|IOMAP_F_]FLAGS_STRINGS
Documentation: iomap: Add missing flags description
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull coredump updates from Christian Brauner:
"This adds support for sending coredumps over an AF_UNIX socket. It
also makes (implicit) use of the new SO_PEERPIDFD ability to hand out
pidfds for reaped peer tasks
The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a saf way to
handle them instead of relying on super privileged coredumping helpers
This will also be significantly more lightweight since the kernel
doens't have to do a fork()+exec() for each crashing process to spawn
a usermodehelper. Instead the kernel just connects to the AF_UNIX
socket and userspace can process it concurrently however it sees fit.
Support for userspace is incoming starting with systemd-coredump
There's more work coming in that direction next cycle. The rest below
goes into some details and background
Coredumping currently supports two modes:
(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
spawned as a child of the system_unbound_wq or kthreadd
For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries
The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional
parameters pass information about the task that is generating the
coredump to the binary that processes the coredump
In the example the core_pattern shown causes the kernel to spawn
systemd-coredump as a usermode helper. There's various conceptual
consequences of this (non-exhaustive list):
- systemd-coredump is spawned with file descriptor number 0 (stdin)
connected to the read-end of the pipe. All other file descriptors
are closed. That specifically includes 1 (stdout) and 2 (stderr).
This has already caused bugs because userspace assumed that this
cannot happen (Whether or not this is a sane assumption is
irrelevant)
- systemd-coredump will be spawned as a child of system_unbound_wq.
So it is not a child of any userspace process and specifically not
a child of PID 1. It cannot be waited upon and is in a weird hybrid
upcall which are difficult for userspace to control correctly
- systemd-coredump is spawned with full kernel privileges. This
necessitates all kinds of weird privilege dropping excercises in
userspace to make this safe
- A new usermode helper has to be spawned for each crashing process
This adds a new mode:
(3) Dumping into an AF_UNIX socket
Userspace can set /proc/sys/kernel/core_pattern to:
@/path/to/coredump.socket
The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps
The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket:
- The coredump server uses SO_PEERPIDFD to get a stable handle on the
connected crashing task. The retrieved pidfd will provide a stable
reference even if the crashing task gets SIGKILLed while generating
the coredump. That is a huge attack vector right now
- By setting core_pipe_limit non-zero userspace can guarantee that
the crashing task cannot be reaped behind it's back and thus
process all necessary information in /proc/<pid>. The SO_PEERPIDFD
can be used to detect whether /proc/<pid> still refers to the same
process
The core_pipe_limit isn't used to rate-limit connections to the
socket. This can simply be done via AF_UNIX socket directly
- The pidfd for the crashing task will contain information how the
task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
PIDFD_INFO_COREDUMP which can be used to retreive the coredump
information
If the coredump gets a new coredump client connection the kernel
guarantees that PIDFD_INFO_COREDUMP information is available.
Currently the following information is provided in the new
@coredump_mask extension to struct pidfd_info:
* PIDFD_COREDUMPED is raised if the task did actually coredump
* PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping
(e.g., undumpable)
* PIDFD_COREDUMP_USER is raised if this is a regular coredump and
doesn't need special care by the coredump server
* PIDFD_COREDUMP_ROOT is raised if the generated coredump should
be treated as sensitive and the coredump server should restrict
access to the generated coredump to sufficiently privileged
users"
* tag 'vfs-6.16-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
mips, net: ensure that SOCK_COREDUMP is defined
selftests/coredump: add tests for AF_UNIX coredumps
selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
coredump: validate socket name as it is written
coredump: show supported coredump modes
pidfs, coredump: add PIDFD_INFO_COREDUMP
coredump: add coredump socket
coredump: reflow dump helpers a little
coredump: massage do_coredump()
coredump: massage format_corename()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
"Features:
- Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
socket option
SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for
the process that called connect() or listen(). This is heavily used
to safely authenticate clients in userspace avoiding security bugs
due to pid recycling races (dbus, polkit, systemd, etc.)
SO_PEERPIDFD currently doesn't support handing out pidfds if the
sk->sk_peer_pid thread-group leader has already been reaped. In
this case it currently returns EINVAL. Userspace still wants to get
a pidfd for a reaped process to have a stable handle it can pass
on. This is especially useful now that it is possible to retrieve
exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s
PIDFD_INFO_EXIT flag
Another summary has been provided by David Rheinsberg:
> A pidfd can outlive the task it refers to, and thus user-space
> must already be prepared that the task underlying a pidfd is
> gone at the time they get their hands on the pidfd. For
> instance, resolving the pidfd to a PID via the fdinfo must be
> prepared to read `-1`.
>
> Despite user-space knowing that a pidfd might be stale, several
> kernel APIs currently add another layer that checks for this. In
> particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was
> already reaped, but returns a stale pidfd if the task is reaped
> immediately after the respective alive-check.
>
> This has the unfortunate effect that user-space now has two ways
> to check for the exact same scenario: A syscall might return
> EINVAL/ESRCH/... *or* the pidfd might be stale, even though
> there is no particular reason to distinguish both cases. This
> also propagates through user-space APIs, which pass on pidfds.
> They must be prepared to pass on `-1` *or* the pidfd, because
> there is no guaranteed way to get a stale pidfd from the kernel.
>
> Userspace must already deal with a pidfd referring to a reaped
> task as the task may exit and get reaped at any time will there
> are still many pidfds referring to it
In order to allow handing out reaped pidfd SO_PEERPIDFD needs to
ensure that PIDFD_INFO_EXIT information is available whenever a
pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi
promises that reaped pidfds are only handed out if it is guaranteed
that the caller sees the exit information:
TEST_F(pidfd_info, success_reaped)
{
struct pidfd_info info = {
.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
};
/*
* Process has already been reaped and PIDFD_INFO_EXIT been set.
* Verify that we can retrieve the exit status of the process.
*/
ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
ASSERT_TRUE(WIFEXITED(info.exit_code));
ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
}
To hand out pidfds for reaped processes we thus allocate a pidfs
entry for the relevant sk->sk_peer_pid at the time the
sk->sk_peer_pid is stashed and drop it when the socket is
destroyed. This guarantees that exit information will always be
recorded for the sk->sk_peer_pid task and we can hand out pidfds
for reaped processes
- Hand a pidfd to the coredump usermode helper process
Give userspace a way to instruct the kernel to install a pidfd for
the crashing process into the process started as a usermode helper.
There's still tricky race-windows that cannot be easily or
sometimes not closed at all by userspace. There's various ways like
looking at the start time of a process to make sure that the
usermode helper process is started after the crashing process but
it's all very very brittle and fraught with peril
The crashed-but-not-reaped process can be killed by userspace
before coredump processing programs like systemd-coredump have had
time to manually open a PIDFD from the PID the kernel provides
them, which means they can be tricked into reading from an
arbitrary process, and they run with full privileges as they are
usermode helper processes
Even if that specific race-window wouldn't exist it's still the
safest and cleanest way to let the kernel provide the pidfd
directly instead of requiring userspace to do it manually. In
parallel with this commit we already have systemd adding support
for this in [1]
When the usermode helper process is forked we install a pidfd file
descriptor three into the usermode helper's file descriptor table
so it's available to the exec'd program
Since usermode helpers are either children of the system_unbound_wq
workqueue or kthreadd we know that the file descriptor table is
empty and can thus always use three as the file descriptor number
Note, that we'll install a pidfd for the thread-group leader even
if a subthread is calling do_coredump(). We know that task linkage
hasn't been removed yet and even if this @current isn't the actual
thread-group leader we know that the thread-group leader cannot be
reaped until
@current has exited
- Allow telling when a task has not been found from finding the wrong
task when creating a pidfd
We currently report EINVAL whenever a struct pid has no tasked
attached anymore thereby conflating two concepts:
(1) The task has already been reaped
(2) The caller requested a pidfd for a thread-group leader but the
pid actually references a struct pid that isn't used as a
thread-group leader
This is causing issues for non-threaded workloads as in where they
expect ESRCH to be reported, not EINVAL
So allow userspace to reliably distinguish between (1) and (2)
- Make it possible to detect when a pidfs entry would outlive the
struct pid it pinned
- Add a range of new selftests
Cleanups:
- Remove unneeded NULL check from pidfd_prepare() for passed struct
pid
- Avoid pointless reference count bump during release_task()
Fixes:
- Various fixes to the pidfd and coredump selftests
- Fix error handling for replace_fd() when spawning coredump usermode
helper"
* tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: detect refcount bugs
coredump: hand a pidfd to the usermode coredump helper
coredump: fix error handling for replace_fd()
pidfs: move O_RDWR into pidfs_alloc_file()
selftests: coredump: Raise timeout to 2 minutes
selftests: coredump: Fix test failure for slow machines
selftests: coredump: Properly initialize pointer
net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
pidfs: get rid of __pidfd_prepare()
net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
pidfs: register pid in pidfs
net, pidfd: report EINVAL for ESRCH
release_task: kill the no longer needed get/put_pid(thread_pid)
pidfs: ensure consistent ENOENT/ESRCH reporting
exit: move wake_up_all() pidfd waiters into __unhash_process()
selftest/pidfd: add test for thread-group leader pidfd open for thread
pidfd: improve uapi when task isn't found
pidfd: remove unneeded NULL check from pidfd_prepare()
selftests/pidfd: adapt to recent changes
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
"This contains minor mount updates for this cycle:
- mnt->mnt_devname can never be NULL so simplify the code handling
that case
- Add a comment about concurrent changes during statmount() and
listmount()
- Update the STATMOUNT_SUPPORTED macro
- Convert mount flags to an enum"
* tag 'vfs-6.16-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
statmount: update STATMOUNT_SUPPORTED macro
fs: convert mount flags to enum
->mnt_devname is never NULL
mount: add a comment about concurrent changes with statmount()/listmount()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs freezing updates from Christian Brauner:
"This contains various filesystem freezing related work for this cycle:
- Allow the power subsystem to support filesystem freeze for suspend
and hibernate.
Now all the pieces are in place to actually allow the power
subsystem to freeze/thaw filesystems during suspend/resume.
Filesystems are only frozen and thawed if the power subsystem does
actually own the freeze.
If the filesystem is already frozen by the time we've frozen all
userspace processes we don't care to freeze it again. That's
userspace's job once the process resumes. We only actually freeze
filesystems if we absolutely have to and we ignore other failures
to freeze.
We could bubble up errors and fail suspend/resume if the error
isn't EBUSY (aka it's already frozen) but I don't think that this
is worth it. Filesystem freezing during suspend/resume is
best-effort. If the user has 500 ext4 filesystems mounted and 4
fail to freeze for whatever reason then we simply skip them.
What we have now is already a big improvement and let's see how we
fare with it before making our lives even harder (and uglier) than
we have to.
- Allow efivars to support freeze and thaw
Allow efivarfs to partake to resync variable state during system
hibernation and suspend. Add freeze/thaw support.
This is a pretty straightforward implementation. We simply add
regular freeze/thaw support for both userspace and the kernel.
efivars is the first pseudofilesystem that adds support for
filesystem freezing and thawing.
The simplicity comes from the fact that we simply always resync
variable state after efivarfs has been frozen. It doesn't matter
whether that's because of suspend, userspace initiated freeze or
hibernation. Efivars is simple enough that it doesn't matter that
we walk all dentries. There are no directories and there aren't
insane amounts of entries and both freeze/thaw are already
heavy-handed operations. If userspace initiated a freeze/thaw cycle
they would need CAP_SYS_ADMIN in the initial user namespace (as
that's where efivarfs is mounted) so it can't be triggered by
random userspace. IOW, we really really don't care"
* tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
f2fs: fix freezing filesystem during resize
kernfs: add warning about implementing freeze/thaw
efivarfs: support freeze/thaw
power: freeze filesystems during suspend/resume
libfs: export find_next_child()
super: add filesystem freezing helpers for suspend and hibernate
gfs2: pass through holder from the VFS for freeze/thaw
super: use common iterator (Part 2)
super: use a common iterator (Part 1)
super: skip dying superblocks early
super: simplify user_get_super()
super: remove pointless s_root checks
fs: allow all writers to be frozen
locking/percpu-rwsem: add freezable alternative to down_read
|