Age | Commit message (Collapse) | Author |
|
I had missed a semantic conflict between commit d389a4a81155 ("mm: Add
folio flag manipulation functions") from the folio tree, and commit
eac96c3efdb5 ("mm: filemap: check if THP has hwpoisoned subpage for PMD
page fault") that added a new set of page flags.
My build tests had too many options enabled, which hid this issue. But
if you didn't have MEMORY_FAILURE or TRANSPARENT_HUGEPAGE enabled, you'd
end up with build errors like this:
include/linux/page-flags.h:806:29: error: macro "PAGEFLAG_FALSE" requires 2 arguments, but only 1 given
806 | PAGEFLAG_FALSE(HasHWPoisoned)
| ^
due to the missing lowercase name used for folio function naming.
Fixes: 49f8275c7d92 ("Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:
- Cleanup of extable fixup handling to be more robust, which in turn
allows to make the FPU exception fixups more robust as well.
- Change the return code for signal frame related failures from
explicit error codes to a boolean fail/success as that's all what the
calling code evaluates.
- A large refactoring of the FPU code to prepare for adding AMX
support:
- Distangle the public header maze and remove especially the
misnomed kitchen sink internal.h which is despite it's name
included all over the place.
- Add a proper abstraction for the register buffer storage (struct
fpstate) which allows to dynamically size the buffer at runtime
by flipping the pointer to the buffer container from the default
container which is embedded in task_struct::tread::fpu to a
dynamically allocated container with a larger register buffer.
- Convert the code over to the new fpstate mechanism.
- Consolidate the KVM FPU handling by moving the FPU related code
into the FPU core which removes the number of exports and avoids
adding even more export when AMX has to be supported in KVM.
This also removes duplicated code which was of course
unnecessary different and incomplete in the KVM copy.
- Simplify the KVM FPU buffer handling by utilizing the new
fpstate container and just switching the buffer pointer from the
user space buffer to the KVM guest buffer when entering
vcpu_run() and flipping it back when leaving the function. This
cuts the memory requirements of a vCPU for FPU buffers in half
and avoids pointless memory copy operations.
This also solves the so far unresolved problem of adding AMX
support because the current FPU buffer handling of KVM inflicted
a circular dependency between adding AMX support to the core and
to KVM. With the new scheme of switching fpstate AMX support can
be added to the core code without affecting KVM.
- Replace various variables with proper data structures so the
extra information required for adding dynamically enabled FPU
features (AMX) can be added in one place
- Add AMX (Advanced Matrix eXtensions) support (finally):
AMX is a large XSTATE component which is going to be available with
Saphire Rapids XEON CPUs. The feature comes with an extra MSR
(MSR_XFD) which allows to trap the (first) use of an AMX related
instruction, which has two benefits:
1) It allows the kernel to control access to the feature
2) It allows the kernel to dynamically allocate the large register
state buffer instead of burdening every task with the the extra
8K or larger state storage.
It would have been great to gain this kind of control already with
AVX512.
The support comes with the following infrastructure components:
1) arch_prctl() to
- read the supported features (equivalent to XGETBV(0))
- read the permitted features for a task
- request permission for a dynamically enabled feature
Permission is granted per process, inherited on fork() and
cleared on exec(). The permission policy of the kernel is
restricted to sigaltstack size validation, but the syscall
obviously allows further restrictions via seccomp etc.
2) A stronger sigaltstack size validation for sys_sigaltstack(2)
which takes granted permissions and the potentially resulting
larger signal frame into account. This mechanism can also be used
to enforce factual sigaltstack validation independent of dynamic
features to help with finding potential victims of the 2K
sigaltstack size constant which is broken since AVX512 support
was added.
3) Exception handling for #NM traps to catch first use of a extended
feature via a new cause MSR. If the exception was caused by the
use of such a feature, the handler checks permission for that
feature. If permission has not been granted, the handler sends a
SIGILL like the #UD handler would do if the feature would have
been disabled in XCR0. If permission has been granted, then a new
fpstate which fits the larger buffer requirement is allocated.
In the unlikely case that this allocation fails, the handler
sends SIGSEGV to the task. That's not elegant, but unavoidable as
the other discussed options of preallocation or full per task
permissions come with their own set of horrors for kernel and/or
userspace. So this is the lesser of the evils and SIGSEGV caused
by unexpected memory allocation failures is not a fundamentally
new concept either.
When allocation succeeds, the fpstate properties are filled in to
reflect the extended feature set and the resulting sizes, the
fpu::fpstate pointer is updated accordingly and the trap is
disarmed for this task permanently.
4) Enumeration and size calculations
5) Trap switching via MSR_XFD
The XFD (eXtended Feature Disable) MSR is context switched with
the same life time rules as the FPU register state itself. The
mechanism is keyed off with a static key which is default
disabled so !AMX equipped CPUs have zero overhead. On AMX enabled
CPUs the overhead is limited by comparing the tasks XFD value
with a per CPU shadow variable to avoid redundant MSR writes. In
case of switching from a AMX using task to a non AMX using task
or vice versa, the extra MSR write is obviously inevitable.
All other places which need to be aware of the variable feature
sets and resulting variable sizes are not affected at all because
they retrieve the information (feature set, sizes) unconditonally
from the fpstate properties.
6) Enable the new AMX states
Note, this is relatively new code despite the fact that AMX support
is in the works for more than a year now.
The big refactoring of the FPU code, which allowed to do a proper
integration has been started exactly 3 weeks ago. Refactoring of the
existing FPU code and of the original AMX patches took a week and has
been subject to extensive review and testing. The only fallout which
has not been caught in review and testing right away was restricted
to AMX enabled systems, which is completely irrelevant for anyone
outside Intel and their early access program. There might be dragons
lurking as usual, but so far the fine grained refactoring has held up
and eventual yet undetected fallout is bisectable and should be
easily addressable before the 5.16 release. Famous last words...
Many thanks to Chang Bae and Dave Hansen for working hard on this and
also to the various test teams at Intel who reserved extra capacity
to follow the rapid development of this closely which provides the
confidence level required to offer this rather large update for
inclusion into 5.16-rc1
* tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
Documentation/x86: Add documentation for using dynamic XSTATE features
x86/fpu: Include vmalloc.h for vzalloc()
selftests/x86/amx: Add context switch test
selftests/x86/amx: Add test cases for AMX state management
x86/fpu/amx: Enable the AMX feature in 64-bit mode
x86/fpu: Add XFD handling for dynamic states
x86/fpu: Calculate the default sizes independently
x86/fpu/amx: Define AMX state components and have it used for boot-time checks
x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers
x86/fpu/xstate: Add fpstate_realloc()/free()
x86/fpu/xstate: Add XFD #NM handler
x86/fpu: Update XFD state where required
x86/fpu: Add sanity checks for XFD
x86/fpu: Add XFD state to fpstate
x86/msr-index: Add MSRs for XFD
x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
x86/fpu: Reset permission and fpstate on exec()
x86/fpu: Prepare fpu_clone() for dynamically enabled features
x86/fpu/signal: Prepare for variable sigframe length
x86/signal: Use fpu::__state_user_size for sigalt stack validation
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- Revert the printk format based wchan() symbol resolution as it can
leak the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset
and __sched_setscheduler() introduced a new lock dependency which is
now triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
sched/fair: Cleanup newidle_balance
sched/fair: Remove sysctl_sched_migration_cost condition
sched/fair: Wait before decaying max_newidle_lb_cost
sched/fair: Skip update_blocked_averages if we are defering load balance
sched/fair: Account update_blocked_averages in newidle_balance cost
x86: Fix __get_wchan() for !STACKTRACE
sched,x86: Fix L2 cache mask
sched/core: Remove rq_relock()
sched: Improve wake_up_all_idle_cpus() take #2
irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
sched: Add cluster scheduler level for x86
sched: Add cluster scheduler level in core and related Kconfig for ARM64
topology: Represent clusters of CPUs within a die
sched: Disable -Wunused-but-set-variable
sched: Add wrapper for get_wchan() to keep task blocked
x86: Fix get_wchan() to support the ORC unwinder
proc: Use task_is_running() for wchan in /proc/$pid/stat
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Time, timers and timekeeping updates:
- No core updates
- No new clocksource/event driver
- A large rework of the ARM architected timer driver to prepare for
the support of the upcoming ARMv8.6 support
- Fix Kconfig options for Exynos MCT, Samsung PWM and TI DM timers
- Address a namespace collison in the ARC sp804 timer driver"
* tag 'timers-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/timer-ti-dm: Select TIMER_OF
clocksource/drivers/exynosy: Depend on sub-architecture for Exynos MCT and Samsung PWM
clocksource/drivers/arch_arm_timer: Move workaround synchronisation around
clocksource/drivers/arm_arch_timer: Fix masking for high freq counters
clocksource/drivers/arm_arch_timer: Drop unnecessary ISB on CVAL programming
clocksource/drivers/arm_arch_timer: Remove any trace of the TVAL programming interface
clocksource/drivers/arm_arch_timer: Work around broken CVAL implementations
clocksource/drivers/arm_arch_timer: Advertise 56bit timer to the core code
clocksource/drivers/arm_arch_timer: Move MMIO timer programming over to CVAL
clocksource/drivers/arm_arch_timer: Fix MMIO base address vs callback ordering issue
clocksource/drivers/arm_arch_timer: Move drop _tval from erratum function names
clocksource/drivers/arm_arch_timer: Move system register timer programming over to CVAL
clocksource/drivers/arm_arch_timer: Extend write side of timer register accessors to u64
clocksource/drivers/arm_arch_timer: Drop CNT*_TVAL read accessors
clocksource/arm_arch_timer: Add build-time guards for unhandled register accesses
clocksource/drivers/arc_timer: Eliminate redefined macro error
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Thomas Gleixner:
- Improve retpoline code patching by separating it from alternatives
which reduces memory footprint and allows to do better optimizations
in the actual runtime patching.
- Add proper retpoline support for x86/BPF
- Address noinstr warnings in x86/kvm, lockdep and paravirtualization
code
- Add support to handle pv_opsindirect calls in the noinstr analysis
- Classify symbols upfront and cache the result to avoid redundant
str*cmp() invocations.
- Add a CFI hash to reduce memory consumption which also reduces
runtime on a allyesconfig by ~50%
- Adjust XEN code to make objtool handling more robust and as a side
effect to prevent text fragmentation due to placement of the
hypercall page.
* tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
bpf,x86: Respect X86_FEATURE_RETPOLINE*
bpf,x86: Simplify computing label offsets
x86,bugs: Unconditionally allow spectre_v2=retpoline,amd
x86/alternative: Add debug prints to apply_retpolines()
x86/alternative: Try inline spectre_v2=retpoline,amd
x86/alternative: Handle Jcc __x86_indirect_thunk_\reg
x86/alternative: Implement .retpoline_sites support
x86/retpoline: Create a retpoline thunk array
x86/retpoline: Move the retpoline thunk declarations to nospec-branch.h
x86/asm: Fixup odd GEN-for-each-reg.h usage
x86/asm: Fix register order
x86/retpoline: Remove unused replacement symbols
objtool,x86: Replace alternatives with .retpoline_sites
objtool: Shrink struct instruction
objtool: Explicitly avoid self modifying code in .altinstr_replacement
objtool: Classify symbols
objtool: Support pv_opsindirect calls for noinstr
x86/xen: Rework the xen_{cpu,irq,mmu}_opsarrays
x86/xen: Mark xen_force_evtchn_callback() noinstr
x86/xen: Make irq_disable() noinstr
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Thomas Gleixner:
- Move futex code into kernel/futex/ and split up the kitchen sink into
seperate files to make integration of sys_futex_waitv() simpler.
- Add a new sys_futex_waitv() syscall which allows to wait on multiple
futexes.
The main use case is emulating Windows' WaitForMultipleObjects which
allows Wine to improve the performance of Windows Games. Also native
Linux games can benefit from this interface as this is a common wait
pattern for this kind of applications.
- Add context to ww_mutex_trylock() to provide a path for i915 to
rework their eviction code step by step without making lockdep upset
until the final steps of rework are completed. It's also useful for
regulator and TTM to avoid dropping locks in the non contended path.
- Lockdep and might_sleep() cleanups and improvements
- A few improvements for the RT substitutions.
- The usual small improvements and cleanups.
* tag 'locking-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
locking: Remove spin_lock_flags() etc
locking/rwsem: Fix comments about reader optimistic lock stealing conditions
locking: Remove rcu_read_{,un}lock() for preempt_{dis,en}able()
locking/rwsem: Disable preemption for spinning region
docs: futex: Fix kernel-doc references
futex: Fix PREEMPT_RT build
futex2: Documentation: Document sys_futex_waitv() uAPI
selftests: futex: Test sys_futex_waitv() wouldblock
selftests: futex: Test sys_futex_waitv() timeout
selftests: futex: Add sys_futex_waitv() test
futex,arm: Wire up sys_futex_waitv()
futex,x86: Wire up sys_futex_waitv()
futex: Implement sys_futex_waitv()
futex: Simplify double_lock_hb()
futex: Split out wait/wake
futex: Split out requeue
futex: Rename mark_wake_futex()
futex: Rename: match_futex()
futex: Rename: hb_waiter_{inc,dec,pending}()
futex: Split out PI futex
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
"Core:
- Allow ftrace to instrument parts of the perf core code
- Add a new mem_hops field to perf_mem_data_src which allows to
represent intra-node/package or inter-node/off-package details to
prepare for next generation systems which have more hieararchy
within the node/pacakge level.
Tools:
- Update for the new mem_hops field in perf_mem_data_src
Arch:
- A set of constraints fixes for the Intel uncore PMU
- The usual set of small fixes and improvements for x86 and PPC"
* tag 'perf-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix ICL/SPR INST_RETIRED.PREC_DIST encodings
powerpc/perf: Fix data source encodings for L2.1 and L3.1 accesses
tools/perf: Add mem_hops field in perf_mem_data_src structure
perf: Add mem_hops field in perf_mem_data_src structure
perf: Add comment about current state of PERF_MEM_LVL_* namespace and remove an extra line
perf/core: Allow ftrace for functions in kernel/event/core.c
perf/x86: Add new event for AUX output counter index
perf/x86: Add compiler barrier after updating BTS
perf/x86/intel/uncore: Fix Intel SPR M3UPI event constraints
perf/x86/intel/uncore: Fix Intel SPR M2PCIE event constraints
perf/x86/intel/uncore: Fix Intel SPR IIO event constraints
perf/x86/intel/uncore: Fix Intel SPR CHA event constraints
perf/x86/intel/uncore: Fix Intel ICX IIO event constraints
perf/x86/intel/uncore: Fix invalid unit check
perf/x86/intel/uncore: Support extra IMC channel on Ice Lake server
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for the interrupt subsystem:
Core changes:
- Prevent a potential deadlock when initial priority is assigned to a
newly created interrupt thread. A recent change to plug a race
between cpuset and __sched_setscheduler() introduced a new lock
dependency which is now triggered. Break the lock dependency chain
by moving the priority assignment to the thread function.
- A couple of small updates to make the irq core RT safe.
- Confine the irq_cpu_online/offline() API to the only left unfixable
user Cavium Octeon so that it does not grow new usage.
- A small documentation update
Driver changes:
- A large cross architecture rework to move irq_enter/exit() into the
architecture code to make addressing the NOHZ_FULL/RCU issues
simpler.
- The obligatory new irq chip driver for Microchip EIC
- Modularize a few irq chip drivers
- Expand usage of devm_*() helpers throughout the driver code
- The usual small fixes and improvements all over the place"
* tag 'irq-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
h8300: Fix linux/irqchip.h include mess
dt-bindings: irqchip: renesas-irqc: Document r8a774e1 bindings
MIPS: irq: Avoid an unused-variable error
genirq: Hide irq_cpu_{on,off}line() behind a deprecated option
irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
MIPS: loongson64: Drop call to irq_cpu_offline()
irq: remove handle_domain_{irq,nmi}()
irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: riscv: perform irqentry in entry code
irq: openrisc: perform irqentry in entry code
irq: csky: perform irqentry in entry code
irq: arm64: perform irqentry in entry code
irq: arm: perform irqentry in entry code
irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: add generic_handle_arch_irq()
irq: unexport handle_irq_desc()
irq: simplify handle_domain_{irq,nmi}()
irq: mips: simplify do_domain_IRQ()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"The updates this time are more under the hood and enhancing existing
features (subpage with compression and zoned namespaces).
Performance related:
- misc small inode logging improvements (+3% throughput, -11% latency
on sample dbench workload)
- more efficient directory logging: bulk item insertion, less tree
searches and locking
- speed up bulk insertion of items into a b-tree, which is used when
logging directories, when running delayed items for directories
(fsync and transaction commits) and when running the slow path
(full sync) of an fsync (bulk creation run time -4%, deletion -12%)
Core:
- continued subpage support
- make defragmentation work
- make compression write work
- zoned mode
- support ZNS (zoned namespaces), zone capacity is number of
usable blocks in each zone
- add dedicated block group (zoned) for relocation, to prevent
out of order writes in some cases
- greedy block group reclaim, pick the ones with least usable
space first
- preparatory work for send protocol updates
- error handling improvements
- cleanups and refactoring
Fixes:
- lockdep warnings
- in show_devname callback, on seeding device
- device delete on loop device due to conversions to workqueues
- fix deadlock between chunk allocation and chunk btree modifications
- fix tracking of missing device count and status"
* tag 'for-5.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (140 commits)
btrfs: remove root argument from check_item_in_log()
btrfs: remove root argument from add_link()
btrfs: remove root argument from btrfs_unlink_inode()
btrfs: remove root argument from drop_one_dir_item()
btrfs: clear MISSING device status bit in btrfs_close_one_device
btrfs: call btrfs_check_rw_degradable only if there is a missing device
btrfs: send: prepare for v2 protocol
btrfs: fix comment about sector sizes supported in 64K systems
btrfs: update device path inode time instead of bd_inode
fs: export an inode_update_time helper
btrfs: fix deadlock when defragging transparent huge pages
btrfs: sysfs: convert scnprintf and snprintf to sysfs_emit
btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZE
btrfs: update comments for chunk allocation -ENOSPC cases
btrfs: fix deadlock between chunk allocation and chunk btree modifications
btrfs: zoned: use greedy gc for auto reclaim
btrfs: check-integrity: stop storing the block device name in btrfsic_dev_state
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
btrfs: add a btrfs_get_dev_args_from_path helper
btrfs: handle device lookup with btrfs_dev_lookup_args
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"There are some new features available for this cycle. Firstly, EROFS
LZMA algorithm support, specifically called MicroLZMA, is available as
an option for embedded devices, LiveCDs and/or as the secondary
auxiliary compression algorithm besides the primary algorithm in one
file.
In order to better support the LZMA fixed-sized output compression,
especially for 4KiB pcluster size (which has lowest memory pressure
thus useful for memory-sensitive scenarios), Lasse introduced a new
LZMA header/container format called MicroLZMA to minimize the original
LZMA1 header (for example, we don't need to waste 4-byte dictionary
size and another 8-byte uncompressed size, which can be calculated by
fs directly, for each pcluster) and enable EROFS fixed-sized output
compression.
Note that MicroLZMA can also be later used by other things in addition
to EROFS too where wasting minimal amount of space for headers is
important and it can be only compiled by enabling XZ_DEC_MICROLZMA.
MicroLZMA has been supported by the latest upstream XZ embedded [1] &
XZ utils [2], apply the latest related XZ embedded upstream patches by
the XZ author Lasse here.
Secondly, multiple device is also supported in this cycle, which is
designed for multi-layer container images. By working together with
inter-layer data deduplication and compression, we can achieve the
next high-performance container image solution. Our team will announce
the new Nydus container image service [3] implementation with new RAFS
v6 (EROFS-compatible) format in Open Source Summit 2021 China [4]
soon.
Besides, the secondary compression head support and readmore
decompression strategy are also included in this cycle. There are also
some minor bugfixes and cleanups, as always.
Summary:
- support multiple devices for multi-layer container images;
- support the secondary compression head;
- support readmore decompression strategy;
- support new LZMA algorithm (specifically called MicroLZMA);
- some bugfixes & cleanups"
* tag 'erofs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: don't trigger WARN() when decompression fails
erofs: get rid of ->lru usage
erofs: lzma compression support
erofs: rename some generic methods in decompressor
lib/xz, lib/decompress_unxz.c: Fix spelling in comments
lib/xz: Add MicroLZMA decoder
lib/xz: Move s->lzma.len = 0 initialization to lzma_reset()
lib/xz: Validate the value before assigning it to an enum variable
lib/xz: Avoid overlapping memcpy() with invalid input with in-place decompression
erofs: introduce readmore decompression strategy
erofs: introduce the secondary compression head
erofs: get compression algorithms directly on mapping
erofs: add multiple device support
erofs: decouple basic mount options from fs_context
erofs: remove the fast path of per-CPU buffer decompression
|
|
Pull fscrypt updates from Eric Biggers:
"Some cleanups for fs/crypto/:
- Allow 256-bit master keys with AES-256-XTS
- Improve documentation and comments
- Remove unneeded field fscrypt_operations::max_namelen"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: improve a few comments
fscrypt: allow 256-bit master keys with AES-256-XTS
fscrypt: improve documentation for inline encryption
fscrypt: clean up comments in bio.c
fscrypt: remove fscrypt_operations::max_namelen
|
|
Pull block inode sync updates from Jens Axboe:
"This contains improvements to how bdev inode syncing is handled,
unifying the API"
* tag 'for-5.16/inode-sync-2021-10-29' of git://git.kernel.dk/linux-block:
block: simplify the block device syncing code
ntfs3: use sync_blockdev_nowait
fat: use sync_blockdev_nowait
btrfs: use sync_blockdev
xen-blkback: use sync_blockdev
block: remove __sync_blockdev
fs: remove __sync_filesystem
|
|
Pull kiocb->ki_complete() cleanup from Jens Axboe:
"This removes the res2 argument from kiocb->ki_complete().
Only the USB gadget code used it, everybody else passes 0. The USB
guys checked the user gadget code they could find, and everybody just
uses res as expected for the async interface"
* tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block:
fs: get rid of the res2 iocb->ki_complete argument
usb: remove res2 argument from gadget code completions
|
|
git://git.kernel.dk/linux-block
Pull QUEUE_FLAG_SCSI_PASSTHROUGH removal from Jens Axboe:
"This contains a series leading to the removal of the
QUEUE_FLAG_SCSI_PASSTHROUGH queue flag"
* tag 'for-5.16/passthrough-flag-2021-10-29' of git://git.kernel.dk/linux-block:
block: remove blk_{get,put}_request
block: remove QUEUE_FLAG_SCSI_PASSTHROUGH
block: remove the initialize_rq_fn blk_mq_ops method
scsi: add a scsi_alloc_request helper
bsg-lib: initialize the bsg_job in bsg_transport_sg_io_fn
nfsd/blocklayout: use ->get_unique_id instead of sending SCSI commands
sd: implement ->get_unique_id
block: add a ->get_unique_id method
|
|
Pull CDROM updates from Jens Axboe:
"On behalf of Phillip, here are the CDROM updates for the 5.16-rc1
merge window:
- Add ioctl for improved media change detection (Lukas)
- Reformat some documentation (Phillip)
- Redundant variable removal (luo)"
* tag 'for-5.16/cdrom-2021-10-29' of git://git.kernel.dk/linux-block:
cdrom: Remove redundant variable and its assignment
cdrom: docs: reformat table in Documentation/userspace-api/ioctl/cdrom.rst
drivers/cdrom: improved ioctl for media change detection
|
|
Pull SCSI multi-actuator support from Jens Axboe:
"This adds SCSI support for the recently merged block multi-actuator
support. Since this was sitting on top of the block tree, the SCSI
side asked me to queue it up."
* tag 'for-5.16/scsi-ma-2021-10-29' of git://git.kernel.dk/linux-block:
doc: Fix typo in request queue sysfs documentation
doc: document sysfs queue/independent_access_ranges attributes
libata: support concurrent positioning ranges log
scsi: sd: add concurrent positioning ranges support
|
|
Pull bdev size cleanups from Jens Axboe:
"Clean up the bdev size handling with new bdev_nr_bytes() helper"
* tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-block: (34 commits)
partitions/ibm: use bdev_nr_sectors instead of open coding it
partitions/efi: use bdev_nr_bytes instead of open coding it
block/ioctl: use bdev_nr_sectors and bdev_nr_bytes
block: cache inode size in bdev
udf: use sb_bdev_nr_blocks
reiserfs: use sb_bdev_nr_blocks
ntfs: use sb_bdev_nr_blocks
jfs: use sb_bdev_nr_blocks
ext4: use sb_bdev_nr_blocks
block: add a sb_bdev_nr_blocks helper
block: use bdev_nr_bytes instead of open coding it in blkdev_fallocate
squashfs: use bdev_nr_bytes instead of open coding it
reiserfs: use bdev_nr_bytes instead of open coding it
pstore/blk: use bdev_nr_bytes instead of open coding it
ntfs3: use bdev_nr_bytes instead of open coding it
nilfs2: use bdev_nr_bytes instead of open coding it
nfs/blocklayout: use bdev_nr_bytes instead of open coding it
jfs: use bdev_nr_bytes instead of open coding it
hfsplus: use bdev_nr_sectors instead of open coding it
hfs: use bdev_nr_sectors instead of open coding it
...
|
|
Pull io_uring updates from Jens Axboe:
"Light on new features - basically just the hybrid mode support.
Outside of that it's just fixes, cleanups, and performance
improvements.
In detail:
- Add ring related information to the fdinfo output (Hao)
- Hybrid async mode (Hao)
- Support for batched issue on block (me)
- sqe error trace improvement (me)
- IOPOLL efficiency improvements (Pavel)
- submit state cleanups and improvements (Pavel)
- Completion side improvements (Pavel)
- Drain improvements (Pavel)
- Buffer selection cleanups (Pavel)
- Fixed file node improvements (Pavel)
- io-wq setup cancelation fix (Pavel)
- Various other performance improvements and cleanups (Pavel)
- Misc fixes (Arnd, Bixuan, Changcheng, Hao, me, Noah)"
* tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-block: (97 commits)
io-wq: remove worker to owner tw dependency
io_uring: harder fdinfo sq/cq ring iterating
io_uring: don't assign write hint in the read path
io_uring: clusterise ki_flags access in rw_prep
io_uring: kill unused param from io_file_supports_nowait
io_uring: clean up timeout async_data allocation
io_uring: don't try io-wq polling if not supported
io_uring: check if opcode needs poll first on arming
io_uring: clean iowq submit work cancellation
io_uring: clean io_wq_submit_work()'s main loop
io-wq: use helper for worker refcounting
io_uring: implement async hybrid mode for pollable requests
io_uring: Use ERR_CAST() instead of ERR_PTR(PTR_ERR())
io_uring: split logic of force_nonblock
io_uring: warning about unused-but-set parameter
io_uring: inform block layer of how many requests we are submitting
io_uring: simplify io_file_supports_nowait()
io_uring: combine REQ_F_NOWAIT_{READ,WRITE} flags
io_uring: arm poll for non-nowait files
fs/io_uring: Prioritise checking faster conditions first in io_write
...
|
|
Pull block driver updates from Jens Axboe:
- paride driver cleanups (Christoph)
- Remove cryptoloop support (Christoph)
- null_blk poll support (me)
- Now that add_disk() supports proper error handling, add it to various
drivers (Luis)
- Make ataflop actually work again (Michael)
- s390 dasd fixes (Stefan, Heiko)
- nbd fixes (Yu, Ye)
- Remove redundant wq flush in mtip32xx (Christophe)
- NVMe updates
- fix a multipath partition scanning deadlock (Hannes Reinecke)
- generate uevent once a multipath namespace is operational again
(Hannes Reinecke)
- support unique discovery controller NQNs (Hannes Reinecke)
- fix use-after-free when a port is removed (Israel Rukshin)
- clear shadow doorbell memory on resets (Keith Busch)
- use struct_size (Len Baker)
- add error handling support for add_disk (Luis Chamberlain)
- limit the maximal queue size for RDMA controllers (Max Gurtovoy)
- use a few more symbolic names (Max Gurtovoy)
- fix error code in nvme_rdma_setup_ctrl (Max Gurtovoy)
- add support for ->map_queues on FC (Saurav Kashyap)
- support the current discovery subsystem entry (Hannes Reinecke)
- use flex_array_size and struct_size (Len Baker)
- bcache fixes (Christoph, Coly, Chao, Lin, Qing)
- MD updates (Christoph, Guoqing, Xiao)
- Misc fixes (Dan, Ding, Jiapeng, Shin'ichiro, Ye)
* tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-block: (117 commits)
null_blk: Fix handling of submit_queues and poll_queues attributes
block: ataflop: Fix warning comparing pointer to 0
bcache: replace snprintf in show functions with sysfs_emit
bcache: move uapi header bcache.h to bcache code directory
nvmet: use flex_array_size and struct_size
nvmet: register discovery subsystem as 'current'
nvmet: switch check for subsystem type
nvme: add new discovery log page entry definitions
block: ataflop: more blk-mq refactoring fixes
block: remove support for cryptoloop and the xor transfer
mtd: add add_disk() error handling
rnbd: add error handling support for add_disk()
um/drivers/ubd_kern: add error handling support for add_disk()
m68k/emu/nfblock: add error handling support for add_disk()
xen-blkfront: add error handling support for add_disk()
bcache: add error handling support for add_disk()
dm: add add_disk() error handling
block: aoe: fixup coccinelle warnings
nvmet: use struct_size over open coded arithmetic
nvme: drop scan_lock and always kick requeue list when removing namespaces
...
|
|
Pull block updates from Jens Axboe:
- mq-deadline accounting improvements (Bart)
- blk-wbt timer fix (Andrea)
- Untangle the block layer includes (Christoph)
- Rework the poll support to be bio based, which will enable adding
support for polling for bio based drivers (Christoph)
- Block layer core support for multi-actuator drives (Damien)
- blk-crypto improvements (Eric)
- Batched tag allocation support (me)
- Request completion batching support (me)
- Plugging improvements (me)
- Shared tag set improvements (John)
- Concurrent queue quiesce support (Ming)
- Cache bdev in ->private_data for block devices (Pavel)
- bdev dio improvements (Pavel)
- Block device invalidation and block size improvements (Xie)
- Various cleanups, fixes, and improvements (Christoph, Jackie,
Masahira, Tejun, Yu, Pavel, Zheng, me)
* tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
blk-mq-debugfs: Show active requests per queue for shared tags
block: improve readability of blk_mq_end_request_batch()
virtio-blk: Use blk_validate_block_size() to validate block size
loop: Use blk_validate_block_size() to validate block size
nbd: Use blk_validate_block_size() to validate block size
block: Add a helper to validate the block size
block: re-flow blk_mq_rq_ctx_init()
block: prefetch request to be initialized
block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
block: add rq_flags to struct blk_mq_alloc_data
block: add async version of bio_set_polled
block: kill DIO_MULTI_BIO
block: kill unused polling bits in __blkdev_direct_IO()
block: avoid extra iter advance with async iocb
block: Add independent access ranges support
blk-mq: don't issue request directly in case that current is to be blocked
sbitmap: silence data race warning
blk-cgroup: synchronize blkg creation against policy deactivation
block: refactor bio_iov_bvec_set()
block: add single bio async direct IO helper
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking updates from Jeff Layton:
"Most of this is just follow-on cleanup work of documentation and
comments from the mandatory locking removal in v5.15.
The only real functional change is that LOCK_MAND flock() support is
also being removed, as it has basically been non-functional since the
v2.5 days"
* tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: remove leftover comments from mandatory locking removal
locks: remove changelog comments
docs: fs: locks.rst: update comment about mandatory file locking
Documentation: remove reference to now removed mandatory-locking doc
locks: remove LOCK_MAND flock lock support
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull tpm updates from Jarkko Sakkinen:
"Only bug fixes"
* tag 'tpmdd-next-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
tpm_tis_spi: Add missing SPI ID
tpm: fix Atmel TPM crash caused by too frequent queries
tpm: Check for integer overflow in tpm2_map_response_body()
tpm: tis: Kconfig: Add helper dependency on COMPILE_TEST
|
|
Pull memory folios from Matthew Wilcox:
"Add memory folios, a new type to represent either order-0 pages or the
head page of a compound page. This should be enough infrastructure to
support filesystems converting from pages to folios.
The point of all this churn is to allow filesystems and the page cache
to manage memory in larger chunks than PAGE_SIZE. The original plan
was to use compound pages like THP does, but I ran into problems with
some functions expecting only a head page while others expect the
precise page containing a particular byte.
The folio type allows a function to declare that it's expecting only a
head page. Almost incidentally, this allows us to remove various calls
to VM_BUG_ON(PageTail(page)) and compound_head().
This converts just parts of the core MM and the page cache. For 5.17,
we intend to convert various filesystems (XFS and AFS are ready; other
filesystems may make it) and also convert more of the MM and page
cache to folios. For 5.18, multi-page folios should be ready.
The multi-page folios offer some improvement to some workloads. The
80% win is real, but appears to be an artificial benchmark (postgres
startup, which isn't a serious workload). Real workloads (eg building
the kernel, running postgres in a steady state, etc) seem to benefit
between 0-10%. I haven't heard of any performance losses as a result
of this series. Nobody has done any serious performance tuning; I
imagine that tweaking the readahead algorithm could provide some more
interesting wins. There are also other places where we could choose to
create large folios and currently do not, such as writes that are
larger than PAGE_SIZE.
I'd like to thank all my reviewers who've offered review/ack tags:
Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes
Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil
Babka, William Kucharski, Yu Zhao and Zi Yan.
I'd also like to thank those who gave feedback I incorporated but
haven't offered up review tags for this part of the series: Nick
Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard,
Hugh Dickins, and probably a few others who I forget"
* tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits)
mm/writeback: Add folio_write_one
mm/filemap: Add FGP_STABLE
mm/filemap: Add filemap_get_folio
mm/filemap: Convert mapping_get_entry to return a folio
mm/filemap: Add filemap_add_folio()
mm/filemap: Add filemap_alloc_folio
mm/page_alloc: Add folio allocation functions
mm/lru: Add folio_add_lru()
mm/lru: Convert __pagevec_lru_add_fn to take a folio
mm: Add folio_evictable()
mm/workingset: Convert workingset_refault() to take a folio
mm/filemap: Add readahead_folio()
mm/filemap: Add folio_mkwrite_check_truncate()
mm/filemap: Add i_blocks_per_folio()
mm/writeback: Add folio_redirty_for_writepage()
mm/writeback: Add folio_account_redirty()
mm/writeback: Add folio_clear_dirty_for_io()
mm/writeback: Add folio_cancel_dirty()
mm/writeback: Add folio_account_cleaned()
mm/writeback: Add filemap_dirty_folio()
...
|
|
Decay max_newidle_lb_cost only when it has not been updated for a while
and ensure to not decay a recently changed value.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-4-vincent.guittot@linaro.org
|
|
parisc, ia64 and powerpc32 are the only remaining architectures that
provide custom arch_{spin,read,write}_lock_flags() functions, which are
meant to re-enable interrupts while waiting for a spinlock.
However, none of these can actually run into this codepath, because
it is only called on architectures without CONFIG_GENERIC_LOCKBREAK,
or when CONFIG_DEBUG_LOCK_ALLOC is set without CONFIG_LOCKDEP, and none
of those combinations are possible on the three architectures.
Going back in the git history, it appears that arch/mn10300 may have
been able to run into this code path, but there is a good chance that
it never worked. On the architectures that still exist, it was
already impossible to hit back in 2008 after the introduction of
CONFIG_GENERIC_LOCKBREAK, and possibly earlier.
As this is all dead code, just remove it and the helper functions built
around it. For arch/ia64, the inline asm could be cleaned up, but
it seems safer to leave it untouched.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Link: https://lore.kernel.org/r/20211022120058.1031690-1-arnd@kernel.org
|
|
These are now pointless wrappers around blk_mq_{alloc,free}_request,
so remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211025070517.1548584-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The header file include/uapi/linux/bcache.h is not really a user space
API heaer. This file defines the ondisk format of bcache internal meta
data but no one includes it from user space, bcache-tools has its own
copy of this header with minor modification.
Therefore, this patch moves include/uapi/linux/bcache.h to bcache code
directory as drivers/md/bcache/bcache_ondisk.h.
Suggested-by: Arnd Bergmann <arnd@kernel.org>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20211029060930.119923-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is preparatory work for send protocol update to version 2 and
higher.
We have many pending protocol update requests but still don't have the
basic protocol rev in place, the first thing that must happen is to do
the actual versioning support.
The protocol version is u32 and is a new member in the send ioctl
struct. Validity of the version field is backed by a new flag bit. Old
kernels would fail when a higher version is requested. Version protocol
0 will pick the highest supported version, BTRFS_SEND_STREAM_VERSION,
that's also exported in sysfs.
The version is still unchanged and will be increased once we have new
incompatible commands or stream updates.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When handling shmem page fault the THP with corrupted subpage could be
PMD mapped if certain conditions are satisfied. But kernel is supposed
to send SIGBUS when trying to map hwpoisoned page.
There are two paths which may do PMD map: fault around and regular
fault.
Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
codepaths") the thing was even worse in fault around path. The THP
could be PMD mapped as long as the VMA fits regardless what subpage is
accessed and corrupted. After this commit as long as head page is not
corrupted the THP could be PMD mapped.
In the regular fault path the THP could be PMD mapped as long as the
corrupted page is not accessed and the VMA fits.
This loophole could be fixed by iterating every subpage to check if any
of them is hwpoisoned or not, but it is somewhat costly in page fault
path.
So introduce a new page flag called HasHWPoisoned on the first tail
page. It indicates the THP has hwpoisoned subpage(s). It is set if any
subpage of THP is found hwpoisoned by memory failure and after the
refcount is bumped successfully, then cleared when the THP is freed or
split.
The soft offline path doesn't need this since soft offline handler just
marks a subpage hwpoisoned when the subpage is migrated successfully.
But shmem THP didn't get split then migrated at all.
Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
* irq/misc-5.16:
: .
: Misc irqchip fixes for 5.16:
: - MAINTAINERS update for the ARM VIC DT binding
: - Allow drivers using the IRQCHIP_PLATFORM_DRIVER_BEGIN/END
: infrastructure to use COMPILE_TEST without CONFIG_OF
: - DT updates
: - Detangle h8300 linux/irqchip.h inclusion
: .
h8300: Fix linux/irqchip.h include mess
dt-bindings: irqchip: renesas-irqc: Document r8a774e1 bindings
irqchip: Fix compile-testing without CONFIG_OF
MAINTAINERS: update arm,vic.yaml reference
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from WiFi (mac80211), and BPF.
Current release - regressions:
- skb_expand_head: adjust skb->truesize to fix socket memory
accounting
- mptcp: fix corrupt receiver key in MPC + data + checksum
Previous releases - regressions:
- multicast: calculate csum of looped-back and forwarded packets
- cgroup: fix memory leak caused by missing cgroup_bpf_offline
- cfg80211: fix management registrations locking, prevent list
corruption
- cfg80211: correct false positive in bridge/4addr mode check
- tcp_bpf: fix race in the tcp_bpf_send_verdict resulting in reusing
previous verdict
Previous releases - always broken:
- sctp: enhancements for the verification tag, prevent attackers from
killing SCTP sessions
- tipc: fix size validations for the MSG_CRYPTO type
- mac80211: mesh: fix HE operation element length check, prevent out
of bound access
- tls: fix sign of socket errors, prevent positive error codes being
reported from read()/write()
- cfg80211: scan: extend RCU protection in
cfg80211_add_nontrans_list()
- implement ->sock_is_readable() for UDP and AF_UNIX, fix poll() for
sockets in a BPF sockmap
- bpf: fix potential race in tail call compatibility check resulting
in two operations which would make the map incompatible succeeding
- bpf: prevent increasing bpf_jit_limit above max
- bpf: fix error usage of map_fd and fdget() in generic batch update
- phy: ethtool: lock the phy for consistency of results
- prevent infinite while loop in skb_tx_hash() when Tx races with
driver reconfiguring the queue <> traffic class mapping
- usbnet: fixes for bad HW conjured by syzbot
- xen: stop tx queues during live migration, prevent UAF
- net-sysfs: initialize uid and gid before calling
net_ns_get_ownership
- mlxsw: prevent Rx stalls under memory pressure"
* tag 'net-5.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (67 commits)
Revert "net: hns3: fix pause config problem after autoneg disabled"
mptcp: fix corrupt receiver key in MPC + data + checksum
riscv, bpf: Fix potential NULL dereference
octeontx2-af: Fix possible null pointer dereference.
octeontx2-af: Display all enabled PF VF rsrc_alloc entries.
octeontx2-af: Check whether ipolicers exists
net: ethernet: microchip: lan743x: Fix skb allocation failure
net/tls: Fix flipped sign in async_wait.err assignment
net/tls: Fix flipped sign in tls_err_abort() calls
net/smc: Correct spelling mistake to TCPF_SYN_RECV
net/smc: Fix smc_link->llc_testlink_time overflow
nfp: bpf: relax prog rejection for mtu check through max_pkt_offset
vmxnet3: do not stop tx queues after netif_device_detach()
r8169: Add device 10ec:8162 to driver r8169
ptp: Document the PTP_CLK_MAGIC ioctl number
usbnet: fix error return code in usbnet_probe()
net: hns3: adjust string spaces of some parameters of tx bd info in debugfs
net: hns3: expand buffer len for some debugfs command
net: hns3: add more string spaces for dumping packets number of queue info in debugfs
net: hns3: fix data endian problem of some functions of debugfs
...
|
|
using packetdrill it's possible to observe that the receiver key contains
random values when clients transmit MP_CAPABLE with data and checksum (as
specified in RFC8684 §3.1). Fix the layout of mptcp_out_options, to avoid
using the skb extension copy when writing the MP_CAPABLE sub-option.
Fixes: d7b269083786 ("mptcp: shrink mptcp_out_options struct")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/233
Reported-by: Poorva Sonparote <psonparo@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Link: https://lore.kernel.org/r/20211027203855.264600-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
sk->sk_err appears to expect a positive value, a convention that ktls
doesn't always follow and that leads to memory corruption in other code.
For instance,
[kworker]
tls_encrypt_done(..., err=<negative error from crypto request>)
tls_err_abort(.., err)
sk->sk_err = err;
[task]
splice_from_pipe_feed
...
tls_sw_do_sendpage
if (sk->sk_err) {
ret = -sk->sk_err; // ret is positive
splice_from_pipe_feed (continued)
ret = actor(...) // ret is still positive and interpreted as bytes
// written, resulting in underflow of buf->len and
// sd->len, leading to huge buf->offset and bogus
// addresses computed in later calls to actor()
Fix all tls_err_abort() callers to pass a negative error code
consistently and centralize the error-prone sign flip there, throwing in
a warning to catch future misuse and uninlining the function so it
really does only warn once.
Cc: stable@vger.kernel.org
Fixes: c46234ebb4d1e ("tls: RX path for ktls")
Reported-by: syzbot+b187b77c8474f9648fae@syzkaller.appspotmail.com
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
* irq/irq_cpu_offline:
: .
: Make irq_cpu_{on,off}line() deprecated kernel API, and only
: enable it for some obscure Cavium platform after having
: moved all the other users away from it.
:
: Next step, drop the platform itself.
: .
genirq: Hide irq_cpu_{on,off}line() behind a deprecated option
irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
MIPS: loongson64: Drop call to irq_cpu_offline()
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* irq/remove-handle-domain-irq-20211026:
: Large rework of the architecture entry code from Mark Rutland.
: From the cover letter:
:
: <quote>
: The handle_domain_{irq,nmi}() functions were oringally intended as a
: convenience, but recent rework to entry code across the kernel tree has
: demonstrated that they cause more pain than they're worth and prevent
: architectures from being able to write robust entry code.
:
: This series reworks the irq code to remove them, handling the necessary
: entry work consistently in entry code (be it architectural or generic).
: </quote>
MIPS: irq: Avoid an unused-variable error
irq: remove handle_domain_{irq,nmi}()
irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: riscv: perform irqentry in entry code
irq: openrisc: perform irqentry in entry code
irq: csky: perform irqentry in entry code
irq: arm64: perform irqentry in entry code
irq: arm: perform irqentry in entry code
irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: add generic_handle_arch_irq()
irq: unexport handle_irq_desc()
irq: simplify handle_domain_{irq,nmi}()
irq: mips: simplify do_domain_IRQ()
irq: mips: stop (ab)using handle_domain_irq()
irq: mips: simplify bcm6345_l1_irq_handle()
irq: mips: avoid nested irq_enter()
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
There are some duplicated codes to validate the block
size in block drivers. This limitation actually comes
from block layer, so this patch tries to add a new block
layer helper for that.
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Link: https://lore.kernel.org/r/20211026144015.188-2-xieyongji@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Two fixes:
* bridge vs. 4-addr mode check was wrong
* management frame registrations locking was
wrong, causing list corruption/crashes
====================
Link: https://lore.kernel.org/r/20211027143756.91711-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Nobody cares about iov iterators state if we return -EIOCBQUEUED, so as
the we now have __blkdev_direct_IO_async(), which gets pages only once,
we can skip expensive iov_iter_advance(). It's around 1-2% of all CPU
spent.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a6158edfbfa2ae3bc24aed29a72f035df18fad2f.1635337135.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
TP8014 adds a new SUBTYPE value and a new field EFLAGS for the
discovery log page entry.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Add support to discover if an ATA device supports the Concurrent
Positioning Ranges data log (address 0x47), indicating that the device
is capable of seeking to multiple different locations in parallel using
multiple actuators serving different LBA ranges.
Also add support to translate the concurrent positioning ranges log
into its equivalent Concurrent Positioning Ranges VPD page B9h in
libata-scsi.c.
The format of the Concurrent Positioning Ranges Log is defined in ACS-5
r9.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211027022223.183838-4-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The Concurrent Positioning Ranges VPD page (for SCSI) and data log page
(for ATA) contain parameters describing the set of contiguous LBAs that
can be served independently by a single LUN multi-actuator hard-disk.
Similarly, a logically defined block device composed of multiple disks
can in some cases execute requests directed at different sector ranges
in parallel. A dm-linear device aggregating 2 block devices together is
an example.
This patch implements support for exposing a block device independent
access ranges to the user through sysfs to allow optimizing device
accesses to increase performance.
To describe the set of independent sector ranges of a device (actuators
of a multi-actuator HDDs or table entries of a dm-linear device),
The type struct blk_independent_access_ranges is introduced. This
structure describes the sector ranges using an array of
struct blk_independent_access_range structures. This range structure
defines the start sector and number of sectors of the access range.
The ranges in the array cannot overlap and must contain all sectors
within the device capacity.
The function disk_set_independent_access_ranges() allows a device
driver to signal to the block layer that a device has multiple
independent access ranges. In this case, a struct
blk_independent_access_ranges is attached to the device request queue
by the function disk_set_independent_access_ranges(). The function
disk_alloc_independent_access_ranges() is provided for drivers to
allocate this structure.
struct blk_independent_access_ranges contains kobjects (struct kobject)
to expose to the user through sysfs the set of independent access ranges
supported by a device. When the device is initialized, sysfs
registration of the ranges information is done from blk_register_queue()
using the block layer internal function
disk_register_independent_access_ranges(). If a driver calls
disk_set_independent_access_ranges() for a registered queue, e.g. when a
device is revalidated, disk_set_independent_access_ranges() will execute
disk_register_independent_access_ranges() to update the sysfs attribute
files. The sysfs file structure created starts from the
independent_access_ranges sub-directory and contains the start sector
and number of sectors of each range, with the information for each range
grouped in numbered sub-directories.
E.g. for a dual actuator HDD, the user sees:
$ tree /sys/block/sdk/queue/independent_access_ranges/
/sys/block/sdk/queue/independent_access_ranges/
|-- 0
| |-- nr_sectors
| `-- sector
`-- 1
|-- nr_sectors
`-- sector
For a regular device with a single access range, the
independent_access_ranges sysfs directory does not exist.
Device revalidation may lead to changes to this structure and to the
attribute values. When manipulated, the queue sysfs_lock and
sysfs_dir_lock mutexes are held for atomicity, similarly to how the
blk-mq and elevator sysfs queue sub-directories are protected.
The code related to the management of independent access ranges is
added in the new file block/blk-ia-ranges.c.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211027022223.183838-2-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Daniel Borkmann says:
====================
pull-request: bpf 2021-10-26
We've added 12 non-merge commits during the last 7 day(s) which contain
a total of 23 files changed, 118 insertions(+), 98 deletions(-).
The main changes are:
1) Fix potential race window in BPF tail call compatibility check, from Toke Høiland-Jørgensen.
2) Fix memory leak in cgroup fs due to missing cgroup_bpf_offline(), from Quanyang Wang.
3) Fix file descriptor reference counting in generic_map_update_batch(), from Xu Kuohai.
4) Fix bpf_jit_limit knob to the max supported limit by the arch's JIT, from Lorenz Bauer.
5) Fix BPF sockmap ->poll callbacks for UDP and AF_UNIX sockets, from Cong Wang and Yucong Sun.
6) Fix BPF sockmap concurrency issue in TCP on non-blocking sendmsg calls, from Liu Jian.
7) Fix build failure of INODE_STORAGE and TASK_STORAGE maps on !CONFIG_NET, from Tejun Heo.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix potential race in tail call compatibility check
bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET
selftests/bpf: Use recv_timeout() instead of retries
net: Implement ->sock_is_readable() for UDP and AF_UNIX
skmsg: Extract and reuse sk_msg_is_readable()
net: Rename ->stream_memory_read to ->sock_is_readable
tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function
cgroup: Fix memory leak caused by missing cgroup_bpf_offline
bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch()
bpf: Prevent increasing bpf_jit_limit above max
bpf: Define bpf_jit_alloc_exec_limit for arm64 JIT
bpf: Define bpf_jit_alloc_exec_limit for riscv JIT
====================
Link: https://lore.kernel.org/r/20211026201920.11296-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Lorenzo noticed that the code testing for program type compatibility of
tail call maps is potentially racy in that two threads could encounter a
map with an unset type simultaneously and both return true even though they
are inserting incompatible programs.
The race window is quite small, but artificially enlarging it by adding a
usleep_range() inside the check in bpf_prog_array_compatible() makes it
trivial to trigger from userspace with a program that does, essentially:
map_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY, 4, 4, 2, 0);
pid = fork();
if (pid) {
key = 0;
value = xdp_fd;
} else {
key = 1;
value = tc_fd;
}
err = bpf_map_update_elem(map_fd, &key, &value, 0);
While the race window is small, it has potentially serious ramifications in
that triggering it would allow a BPF program to tail call to a program of a
different type. So let's get rid of it by protecting the update with a
spinlock. The commit in the Fixes tag is the last commit that touches the
code in question.
v2:
- Use a spinlock instead of an atomic variable and cmpxchg() (Alexei)
v3:
- Put lock and the members it protects into an embedded 'owner' struct (Daniel)
Fixes: 3324b584b6f6 ("ebpf: misc core cleanup")
Reported-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211026110019.363464-1-toke@redhat.com
|
|
bpf_types.h has BPF_MAP_TYPE_INODE_STORAGE and BPF_MAP_TYPE_TASK_STORAGE
declared inside #ifdef CONFIG_NET although they are built regardless of
CONFIG_NET. So, when CONFIG_BPF_SYSCALL && !CONFIG_NET, they are built
without the declarations leading to spurious build failures and not
registered to bpf_map_types making them unavailable.
Fix it by moving the BPF_MAP_TYPE for the two map types outside of
CONFIG_NET.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: a10787e6d58c ("bpf: Enable task local storage for tracing programs")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/YXG1cuuSJDqHQfRY@slm.duckdns.org
|
|
tcp_bpf_sock_is_readable() is pretty much generic,
we can extract it and reuse it for non-TCP sockets.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211008203306.37525-3-xiyou.wangcong@gmail.com
|
|
The proto ops ->stream_memory_read() is currently only used
by TCP to check whether psock queue is empty or not. We need
to rename it before reusing it for non-TCP protocols, and
adjust the exsiting users accordingly.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211008203306.37525-2-xiyou.wangcong@gmail.com
|
|
If you already have an inode and need to update the time on the inode
there is no way to do this properly. Export this helper to allow file
systems to update time on the inode so the appropriate handler is
called, either ->update_time or generic_update_time.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During a testing of an user-space application which transmits UDP
multicast datagrams and utilizes multicast routing to send the UDP
datagrams out of defined network interfaces, I've found a multicast
router does not fill-in UDP checksum into locally produced, looped-back
and forwarded UDP datagrams, if an original output NIC the datagrams
are sent to has UDP TX checksum offload enabled.
The datagrams are sent malformed out of the NIC the datagrams have been
forwarded to.
It is because:
1. If TX checksum offload is enabled on the output NIC, UDP checksum
is not calculated by kernel and is not filled into skb data.
2. dev_loopback_xmit(), which is called solely by
ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY
unconditionally.
3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except
CHECKSUM_COMPLETE"), the ip_summed value is preserved during
forwarding.
4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during
a packet egress.
The minimum fix in dev_loopback_xmit():
1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the
case when the original output NIC has TX checksum offload enabled.
The effects are:
a) If the forwarding destination interface supports TX checksum
offloading, the NIC driver is responsible to fill-in the
checksum.
b) If the forwarding destination interface does NOT support TX
checksum offloading, checksums are filled-in by kernel before
skb is submitted to the NIC driver.
c) For local delivery, checksum validation is skipped as in the
case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary().
2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. It
means, for CHECKSUM_NONE, the behavior is unmodified and is there
to skip a looped-back packet local delivery checksum validation.
Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
irq_cpu_{on,off}line() are now only used by the Octeon platform.
Make their use conditional on this plaform being enabled, and
otherwise hidden away.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Link: https://lore.kernel.org/r/20211021170414.3341522-4-maz@kernel.org
|
|
Now that entry code handles IRQ entry (including setting the IRQ regs)
before calling irqchip code, irqchip code can safely call
generic_handle_domain_irq(), and there's no functional reason for it to
call handle_domain_irq().
Let's cement this split of responsibility and remove handle_domain_irq()
entirely, updating irqchip drivers to call generic_handle_domain_irq().
For consistency, handle_domain_nmi() is similarly removed and replaced
with a generic_handle_domain_nmi() function which also does not perform
any entry logic.
Previously handle_domain_{irq,nmi}() had a WARN_ON() which would fire
when they were called in an inappropriate context. So that we can
identify similar issues going forward, similar WARN_ON_ONCE() logic is
added to the generic_handle_*() functions, and comments are updated for
clarity and consistency.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
|