Age | Commit message (Collapse) | Author |
|
The -F/--field option is to customize the list of fields to output:
$ perf lock report -F contended,wait_max -k avg_wait
Name contended max wait (ns) avg wait (ns)
slock-AF_INET6 1 23543 23543
&lruvec->lru_lock 5 18317 11254
slock-AF_INET6 1 10379 10379
rcu_node_1 1 2104 2104
&dentry->d_lockr... 1 1844 1844
&dentry->d_lockr... 1 1672 1672
&newf->file_lock 15 2279 1025
&dentry->d_lockr... 1 792 792
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220323230259.288494-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
And use it to print output for each key field. No functional change
intended and the output should be identical.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220323230259.288494-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The perf lock command has nothing to symbolize and lock names come
from the tracepoint. Moreover, kernel symbols are available even the
--synth=no option is given.
This will reduce the startup time by avoiding unnecessary synthesis.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220323230259.288494-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Instead of returning an error value, which ends up being the return
value for the read() system call, it is more elegant to simply return
the error as a string value.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Do not cast the struct xprt to a sock_xprt unless we know it is a UDP or
TCP transport. Otherwise the call to lock the mutex will scribble over
whatever structure is actually there. This has been seen to cause hard
system lockups when the underlying transport was RDMA.
Fixes: b49ea673e119 ("SUNRPC: lock against ->sock changing during sysfs read")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Merge yet more updates from Andrew Morton:
"This is the material which was staged after willystuff in linux-next.
Subsystems affected by this patch series: mm (debug, selftests,
pagecache, thp, rmap, migration, kasan, hugetlb, pagemap, madvise),
and selftests"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (113 commits)
selftests: kselftest framework: provide "finished" helper
mm: madvise: MADV_DONTNEED_LOCKED
mm: fix race between MADV_FREE reclaim and blkdev direct IO read
mm: generalize ARCH_HAS_FILTER_PGPROT
mm: unmap_mapping_range_tree() with i_mmap_rwsem shared
mm: warn on deleting redirtied only if accounted
mm/huge_memory: remove stale locking logic from __split_huge_pmd()
mm/huge_memory: remove stale page_trans_huge_mapcount()
mm/swapfile: remove stale reuse_swap_page()
mm/khugepaged: remove reuse_swap_page() usage
mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
mm: streamline COW logic in do_swap_page()
mm: slightly clarify KSM logic in do_swap_page()
mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
mm: optimize do_wp_page() for exclusive pages in the swapcache
mm/huge_memory: make is_transparent_hugepage() static
userfaultfd/selftests: enable hugetlb remap and remove event testing
selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test
mm: enable MADV_DONTNEED for hugetlb mappings
kasan: disable LOCKDEP when printing reports
...
|
|
Change the default error code returned by acpi_cppc_processor_probe()
from -EFAULT (which is completely inadequate) to -ENODATA and change
the debug messages printed by it to contain more information and be
more consistent.
While at it, format some white space to follow the coding style.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
|
|
If the NumEntries field in the _CPC return package is less than 2, do
not attempt to access the "Revision" element of that package, because
it may not be present then.
Fixes: 337aadff8e45 ("ACPI: Introduce CPU performance controls using CPPC")
BugLink: https://lore.kernel.org/lkml/20220322143534.GC32582@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
|
|
Without fix:
[ 0.005429] ACPI: LAPIC_ADDR_OVR (address[(____ptrval____)])
With fix:
[ 0.005429] ACPI: LAPIC_ADDR_OVR (address[0x800fee00000])
Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Palmer Dabbelt:
- Support for Sv57-based virtual memory.
- Various improvements for the MicroChip PolarFire SOC and the
associated Icicle dev board, which should allow upstream kernels to
boot without any additional modifications.
- An improved memmove() implementation.
- Support for the new Ssconfpmf and SBI PMU extensions, which allows
for a much more useful perf implementation on RISC-V systems.
- Support for restartable sequences.
* tag 'riscv-for-linus-5.18-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (36 commits)
rseq/selftests: Add support for RISC-V
RISC-V: Add support for restartable sequence
MAINTAINERS: Add entry for RISC-V PMU drivers
Documentation: riscv: Remove the old documentation
RISC-V: Add sscofpmf extension support
RISC-V: Add perf platform driver based on SBI PMU extension
RISC-V: Add RISC-V SBI PMU extension definitions
RISC-V: Add a simple platform driver for RISC-V legacy perf
RISC-V: Add a perf core library for pmu drivers
RISC-V: Add CSR encodings for all HPMCOUNTERS
RISC-V: Remove the current perf implementation
RISC-V: Improve /proc/cpuinfo output for ISA extensions
RISC-V: Do no continue isa string parsing without correct XLEN
RISC-V: Implement multi-letter ISA extension probing framework
RISC-V: Extract multi-letter extension names from "riscv, isa"
RISC-V: Minimal parser for "riscv, isa" strings
RISC-V: Correctly print supported extensions
riscv: Fixed misaligned memory access. Fixed pointer comparison.
MAINTAINERS: update riscv/microchip entry
riscv: dts: microchip: add new peripherals to icicle kit device tree
...
|
|
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.
To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean [1].
This removes the need to use a found variable and simply checking if
the variable was set, can determine if the break/goto was hit.
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Raise minimum supported machine generation to z10, which comes with
various cleanups and code simplifications (usercopy/spectre
mitigation/etc).
- Rework extables and get rid of anonymous out-of-line fixups.
- Page table helpers cleanup. Add set_pXd()/set_pte() helper functions.
Covert pte_val()/pXd_val() macros to functions.
- Optimize kretprobe handling by avoiding extra kprobe on
__kretprobe_trampoline.
- Add support for CEX8 crypto cards.
- Allow to trigger AP bus rescan via writing to /sys/bus/ap/scans.
- Add CONFIG_EXPOLINE_EXTERN option to build the kernel without COMDAT
group sections which simplifies kpatch support.
- Always use the packed stack layout and extend kernel unwinder tests.
- Add sanity checks for ftrace code patching.
- Add s390dbf debug log for the vfio_ap device driver.
- Various virtual vs physical address confusion fixes.
- Various small fixes and improvements all over the code.
* tag 's390-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (69 commits)
s390/test_unwind: add kretprobe tests
s390/kprobes: Avoid additional kprobe in kretprobe handling
s390: convert ".insn" encoding to instruction names
s390: assume stckf is always present
s390/nospec: move to single register thunks
s390: raise minimum supported machine generation to z10
s390/uaccess: Add copy_from/to_user_key functions
s390/nospec: align and size extern thunks
s390/nospec: add an option to use thunk-extern
s390/nospec: generate single register thunks if possible
s390/pci: make zpci_set_irq()/zpci_clear_irq() static
s390: remove unused expoline to BC instructions
s390/irq: use assignment instead of cast
s390/traps: get rid of magic cast for per code
s390/traps: get rid of magic cast for program interruption code
s390/signal: fix typo in comments
s390/asm-offsets: remove unused defines
s390/test_unwind: avoid build warning with W=1
s390: remove .fixup section
s390/bpf: encode register within extable entry
...
|
|
Pull Xtensa updates from Max Filippov:
- remove dependency on the compiler's libgcc
- allow selection of internal kernel ABI via Kconfig
- enable compiler plugins support for gcc-12 or newer
- various minor cleanups and fixes
* tag 'xtensa-20220325' of https://github.com/jcmvbkbc/linux-xtensa:
xtensa: define update_mmu_tlb function
xtensa: fix xtensa_wsr always writing 0
xtensa: enable plugin support
xtensa: clean up kernel exit assembly code
xtensa: rearrange NMI exit path
xtensa: merge stack alignment definitions
xtensa: fix DTC warning unit_address_format
xtensa: fix stop_machine_cpuslocked call in patch_text
xtensa: make secondary reset vector support conditional
xtensa: add kernel ABI selection to Kconfig
xtensa: don't link with libgcc
xtensa: add helpers for division, remainder and shifts
xtensa: add missing XCHAL_HAVE_WINDOWED check
xtensa: use XCHAL_NUM_AREGS as pt_regs::areg size
xtensa: rename PT_SIZE to PT_KERNEL_SIZE
xtensa: Remove unused early_read_config_byte() et al declarations
xtensa: use strscpy to copy strings
net: xtensa: use strscpy to copy strings
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Livepatch support for 32-bit is probably the standout new feature,
otherwise mostly just lots of bits and pieces all over the board.
There's a series of commits cleaning up function descriptor handling,
which touches a few other arches as well as LKDTM. It has acks from
Arnd, Kees and Helge.
Summary:
- Enforce kernel RO, and implement STRICT_MODULE_RWX for 603.
- Add support for livepatch to 32-bit.
- Implement CONFIG_DYNAMIC_FTRACE_WITH_ARGS.
- Merge vdso64 and vdso32 into a single directory.
- Fix build errors with newer binutils.
- Add support for UADDR64 relocations, which are emitted by some
toolchains. This allows powerpc to build with the latest lld.
- Fix (another) potential userspace r13 corruption in transactional
memory handling.
- Cleanups of function descriptor handling & related fixes to LKDTM.
Thanks to Abdul Haleem, Alexey Kardashevskiy, Anders Roxell, Aneesh
Kumar K.V, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Bhaskar
Chowdhury, Cédric Le Goater, Chen Jingwen, Christophe JAILLET,
Christophe Leroy, Corentin Labbe, Daniel Axtens, Daniel Henrique
Barboza, David Dai, Fabiano Rosas, Ganesh Goudar, Guo Zhengkui, Hangyu
Hua, Haren Myneni, Hari Bathini, Igor Zhbanov, Jakob Koschel, Jason
Wang, Jeremy Kerr, Joachim Wiberg, Jordan Niethe, Julia Lawall, Kajol
Jain, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mamatha Inamdar,
Maxime Bizon, Maxim Kiselev, Maxim Kochetkov, Michal Suchanek,
Nageswara R Sastry, Nathan Lynch, Naveen N. Rao, Nicholas Piggin,
Nour-eddine Taleb, Paul Menzel, Ping Fang, Pratik R. Sampat, Randy
Dunlap, Ritesh Harjani, Rohan McLure, Russell Currey, Sachin Sant,
Segher Boessenkool, Shivaprasad G Bhat, Sourabh Jain, Thierry Reding,
Tobias Waldekranz, Tyrel Datwyler, Vaibhav Jain, Vladimir Oltean,
Wedson Almeida Filho, and YueHaibing"
* tag 'powerpc-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits)
powerpc/pseries: Fix use after free in remove_phb_dynamic()
powerpc/time: improve decrementer clockevent processing
powerpc/time: Fix KVM host re-arming a timer beyond decrementer range
powerpc/tm: Fix more userspace r13 corruption
powerpc/xive: fix return value of __setup handler
powerpc/64: Add UADDR64 relocation support
powerpc: 8xx: fix a return value error in mpc8xx_pic_init
powerpc/ps3: remove unneeded semicolons
powerpc/64: Force inlining of prevent_user_access() and set_kuap()
powerpc/bitops: Force inlining of fls()
powerpc: declare unmodified attribute_group usages const
powerpc/spufs: Fix build warning when CONFIG_PROC_FS=n
powerpc/secvar: fix refcount leak in format_show()
powerpc/64e: Tie PPC_BOOK3E_64 to PPC_FSL_BOOK3E
powerpc: Move C prototypes out of asm-prototypes.h
powerpc/kexec: Declare kexec_paca static
powerpc/smp: Declare current_set static
powerpc: Cleanup asm-prototypes.c
powerpc/ftrace: Use STK_GOT in ftrace_mprofile.S
powerpc/ftrace: Regroup PPC64 specific operations in ftrace_mprofile.S
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS updates from Thomas Bogendoerfer:
- added support for QCN550x (ath79)
- enabled KCSAN
- removed TX39XX support
- various cleanups and fixes
* tag 'mips_5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (31 commits)
MIPS: Fix build error for loongson64 and sgi-ip27
MIPS: ingenic: correct unit node address
MIPS: Fix wrong comments in asm/prom.h
MIPS: Remove redundant definitions of device_tree_init()
MIPS: Remove redundant check in device_tree_init()
MIPS: pgalloc: fix memory leak caused by pgd_free()
MIPS: RB532: fix return value of __setup handler
MIPS: Only use current_stack_pointer on GCC
MIPS: boot/compressed: Use array reference for image bounds
mips: cdmm: Fix refcount leak in mips_cdmm_phys_base
mips: remove reference to "newer Loongson-3"
mips: Always permit to build u-boot images
MIPS: Sanitise Cavium switch cases in TLB handler synthesizers
DEC: Limit PMAX memory probing to R3k systems
mips: DEC: honor CONFIG_MIPS_FP_SUPPORT=n
MIPS: fix fortify panic when copying asm exception handlers
mips: ralink: fix a refcount leak in ill_acc_of_setup()
mips: Implement "current_stack_pointer"
MIPS: Remove TX39XX support
MIPS: Modernize READ_IMPLIES_EXEC
...
|
|
To use set_discharge helper function, active_discharge_on need to be
added.
Signed-off-by: ChiYuan Huang <cy_huang@richtek.com>
Link: https://lore.kernel.org/r/1648171577-9663-3-git-send-email-u0084500@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
To use set/get_bypass helper function, bypass mask need to be specified.
Signed-off-by: ChiYuan Huang <cy_huang@richtek.com>
Link: https://lore.kernel.org/r/1648171577-9663-2-git-send-email-u0084500@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
If SND_SOC_SOF_PCI_DEV is n, bulding fails:
sound/soc/sof/intel/pci-tng.o:(.data+0x1c0): undefined reference to `sof_pci_probe'
sound/soc/sof/intel/pci-tng.o:(.data+0x1c8): undefined reference to `sof_pci_remove'
sound/soc/sof/intel/pci-tng.o:(.data+0x1e0): undefined reference to `sof_pci_shutdown'
sound/soc/sof/intel/pci-tng.o:(.data+0x290): undefined reference to `sof_pci_pm'
Make SND_SOC_SOF_MERRIFIELD select SND_SOC_SOF_PCI_DEV to fix this.
Fixes: 8d4ba1be3d22 ("ASoC: SOF: pci: split PCI into different drivers")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Acked-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Link: https://lore.kernel.org/r/20220323092501.145879-1-zhengbin13@huawei.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
We have duplicated definitions for various SMB3 PDUs in
fs/ksmbd and fs/cifs. Some had already been moved to
fs/smbfs_common/smb2pdu.h
Move definitions for
- error response
- query info and various related protocol flags
- various lease handling flags and the create lease context
to smbfs_common/smb2pdu.h to reduce code duplication
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
In the Windows spec for VM Generation ID, the originally specified CID
is longer than allowed by the ACPI spec. Hyper-V has added "VMGENCTR" as
a second valid CID that is conformant, while retaining the original CID
for compatibility with Windows guests.
Add this new CID to the list recognized by the driver.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The comment about get_random_{u32,u64}() not invoking reseeding got
added in an unrelated commit, that then was recently reverted by
0313bc278dac ("Revert "random: block in /dev/urandom""). So this adds
that little comment snippet back, and improves the wording a bit too.
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
If CONFIG_RANDOM_TRUST_CPU is set, the RNG initializes using RDRAND.
But, the user can disable (or enable) this behavior by setting
`random.trust_cpu=0/1` on the kernel command line. This allows system
builders to do reasonable things while avoiding howls from tinfoil
hatters. (Or vice versa.)
CONFIG_RANDOM_TRUST_BOOTLOADER is basically the same thing, but regards
the seed passed via EFI or device tree, which might come from RDRAND or
a TPM or somewhere else. In order to allow distros to more easily enable
this while avoiding those same howls (or vice versa), this commit adds
the corresponding `random.trust_bootloader=0/1` toggle.
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Graham Christensen <graham@grahamc.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Link: https://github.com/NixOS/nixpkgs/pull/165355
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
At boot time, EFI calls add_bootloader_randomness(), which in turn calls
add_hwgenerator_randomness(). Currently add_hwgenerator_randomness()
feeds the first 64 bytes of randomness to the "fast init"
non-crypto-grade phase. But if add_hwgenerator_randomness() gets called
with more than POOL_MIN_BITS of entropy, there's no point in passing it
off to the "fast init" stage, since that's enough entropy to bootstrap
the real RNG. The "fast init" stage is just there to provide _something_
in the case where we don't have enough entropy to properly bootstrap the
RNG. But if we do have enough entropy to bootstrap the RNG, the current
logic doesn't serve a purpose. So, in the case where we're passed
greater than or equal to POOL_MIN_BITS of entropy, this commit makes us
skip the "fast init" phase.
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
When part of the user buffer passed to generic_perform_write() or
iomap_file_buffered_write() cannot be faulted in for reading, the entire
write currently fails. The correct behavior would be to write all the
data that can be written, up to the point of failure.
Commit a6294593e8a1 ("iov_iter: Turn iov_iter_fault_in_readable into
fault_in_iov_iter_readable") gave us the information needed, so fix the
page prefaulting in generic_perform_write() and iomap_write_iter() to
only bail out when no pages could be faulted in.
We already factor in that pages that are faulted in may no longer be
resident by the time they are accessed. Paging out pages has the same
effect as not faulting in those pages in the first place, so the code
can already deal with that.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
io_put_kbuf_comp() should only be called while holding
->completion_lock, however there is no such assumption in io_clean_op()
and thus it can corrupt ->io_buffer_comp. Take the lock there, and
workaround the only user of io_clean_op() calling it with locks. Not
the prettiest solution, but it's easier to refactor it for-next.
Fixes: cc3cec8367cba ("io_uring: speedup provided buffer handling")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/743e2130b73ec6d48c4c5dd15db896c433431e6d.1648212967.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_req_complete_failed() doesn't require callers to hold ->uring_lock,
use IO_URING_F_UNLOCKED version of io_put_kbuf(). The only affected
place is the fail path of io_apoll_task_func(). Also add a lockdep
annotation to catch such bugs in the future.
Fixes: 3b2b78a8eb7cc ("io_uring: extend provided buf return to fails")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ccf602dbf8df3b6a8552a262d8ee0a13a086fbc7.1648212967.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move a misplaced comment about req->creds and add a line with
assumptions about req->link.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e51d1e6b1f3708c2d4127b4e371f9daa4c5f859.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When polling sockets for accept, use EPOLLEXCLUSIVE. This is helpful
when multiple accept SQEs are submitted.
For O_NONBLOCK sockets multiple queued SQEs would previously have all
completed at once, but most with -EAGAIN as the result. Now only one
wakes up and completes.
For sockets without O_NONBLOCK there is no user facing change, but
internally the extra requests would previously be queued onto a worker
thread as they would wake up with no connection waiting, and be
punted. Now they do not wake up unnecessarily.
Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220325093755.4123343-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This drivers allows to communicate with a RTC PTA handled by OP-TEE [1].
This PTA allows to query RTC information, set/get time and set/get
offset depending on the supported features.
[1] https://github.com/OP-TEE/optee_os/pull/5179
Signed-off-by: Clément Léger <clement.leger@bootlin.com>
Reviewed-by: Jens Wiklander <jens.wiklander@linaro.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20220308133505.471601-1-clement.leger@bootlin.com
|
|
Having !allow_set_time is equivalent to non-implemented set_time
function, which is normally represented with -ENODEV error in RTC
subsystem.
Today we are returning -EACCES error code, which is not considered
by RTC clients as a 'non implemented' feature, and which causes NTP
to retry hw clk sync (update_rtc) indefinitely.
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/1645090578-20734-1-git-send-email-loic.poulain@linaro.org
|
|
Attach the interrupt as a wake-irq to the device, so that:
- A corresponding wakeup source is created (and reported in e.g
/sys/kernel/debug/wakeup_sources).
- The power subsystem take cares of arming/disarming
irq-wake automatically on suspend/resume.
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/1645025082-6138-1-git-send-email-loic.poulain@linaro.org
|
|
This solves:
>> drivers/clk/sunxi-ng/ccu-sun6i-rtc.c:334:5: warning: no previous prototype for 'sun6i_rtc_ccu_probe' [-Wmissing-prototypes]
334 | int sun6i_rtc_ccu_probe(struct device *dev, void __iomem *reg)
| ^~~~~~~~~~~~~~~~~~~
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Link: https://lore.kernel.org/r/20220320210905.6606-1-alexandre.belloni@bootlin.com
|
|
Due to being a perl generated asm file, it got missed by the mass
convertion script.
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_init_x86_64()+0x3a: missing int3 after ret
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_x86_64()+0xf2: missing int3 after ret
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_emit_x86_64()+0x37: missing int3 after ret
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: __poly1305_block()+0x6d: missing int3 after ret
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: __poly1305_init_avx()+0x1e8: missing int3 after ret
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx()+0x18a: missing int3 after ret
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx()+0xaf8: missing int3 after ret
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_emit_avx()+0x99: missing int3 after ret
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx2()+0x18a: missing int3 after ret
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx2()+0x776: missing int3 after ret
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx512()+0x18a: missing int3 after ret
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx512()+0x796: missing int3 after ret
arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx512()+0x10bd: missing int3 after ret
Fixes: f94909ceb1ed ("x86: Prepare asm files for straight-line-speculation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
The chacha_Nblock_xor_avx512vl() functions all have their own,
identical, .LdoneN label, however in one particular spot {2,4} jump to
the 8 version instead of their own. Resulting in:
arch/x86/crypto/chacha-x86_64.o: warning: objtool: chacha_2block_xor_avx512vl() falls through to next function chacha_8block_xor_avx512vl()
arch/x86/crypto/chacha-x86_64.o: warning: objtool: chacha_4block_xor_avx512vl() falls through to next function chacha_8block_xor_avx512vl()
Make each function consistently use its own done label.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
pm_runtime_get_sync() will increment pm usage counter even it
failed. Forgetting to call pm_runtime_put_noidle will result
in reference leak in stm32_crc_remove, so we should fix it.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
The algorithm __cbc-aes-neonbs requires a fallback so we need
to select the config options for them or otherwise it will fail
to register on boot-up.
Fixes: 00b99ad2bac2 ("crypto: arm/aes-neonbs - Use generic cbc...")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- IOMMU Core changes:
- Removal of aux domain related code as it is basically dead and
will be replaced by iommu-fd framework
- Split of iommu_ops to carry domain-specific call-backs separatly
- Cleanup to remove useless ops->capable implementations
- Improve 32-bit free space estimate in iova allocator
- Intel VT-d updates:
- Various cleanups of the driver
- Support for ATS of SoC-integrated devices listed in ACPI/SATC
table
- ARM SMMU updates:
- Fix SMMUv3 soft lockup during continuous stream of events
- Fix error path for Qualcomm SMMU probe()
- Rework SMMU IRQ setup to prepare the ground for PMU support
- Minor cleanups and refactoring
- AMD IOMMU driver:
- Some minor cleanups and error-handling fixes
- Rockchip IOMMU driver:
- Use standard driver registration
- MSM IOMMU driver:
- Minor cleanup and change to standard driver registration
- Mediatek IOMMU driver:
- Fixes for IOTLB flushing logic
* tag 'iommu-updates-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (47 commits)
iommu/amd: Improve amd_iommu_v2_exit()
iommu/amd: Remove unused struct fault.devid
iommu/amd: Clean up function declarations
iommu/amd: Call memunmap in error path
iommu/arm-smmu: Account for PMU interrupts
iommu/vt-d: Enable ATS for the devices in SATC table
iommu/vt-d: Remove unused function intel_svm_capable()
iommu/vt-d: Add missing "__init" for rmrr_sanity_check()
iommu/vt-d: Move intel_iommu_ops to header file
iommu/vt-d: Fix indentation of goto labels
iommu/vt-d: Remove unnecessary prototypes
iommu/vt-d: Remove unnecessary includes
iommu/vt-d: Remove DEFER_DEVICE_DOMAIN_INFO
iommu/vt-d: Remove domain and devinfo mempool
iommu/vt-d: Remove iova_cache_get/put()
iommu/vt-d: Remove finding domain in dmar_insert_one_dev_info()
iommu/vt-d: Remove intel_iommu::domains
iommu/mediatek: Always tlb_flush_all when each PM resume
iommu/mediatek: Add tlb_lock in tlb_flush_all
iommu/mediatek: Remove the power status checking in tlb flush all
...
|
|
Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (qla2xxx, pm8001,
libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates
and bug fixes.
The high blast radius core update is the removal of write same, which
affects block and several non-SCSI devices. The other big change,
which is more local, is the removal of the SCSI pointer"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits)
scsi: scsi_ioctl: Drop needless assignment in sg_io()
scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn()
scsi: lpfc: Copyright updates for 14.2.0.0 patches
scsi: lpfc: Update lpfc version to 14.2.0.0
scsi: lpfc: SLI path split: Refactor BSG paths
scsi: lpfc: SLI path split: Refactor Abort paths
scsi: lpfc: SLI path split: Refactor SCSI paths
scsi: lpfc: SLI path split: Refactor CT paths
scsi: lpfc: SLI path split: Refactor misc ELS paths
scsi: lpfc: SLI path split: Refactor VMID paths
scsi: lpfc: SLI path split: Refactor FDISC paths
scsi: lpfc: SLI path split: Refactor LS_RJT paths
scsi: lpfc: SLI path split: Refactor LS_ACC paths
scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths
scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths
scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path
scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe
scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4
scsi: lpfc: SLI path split: Refactor lpfc_iocbq
scsi: lpfc: Use kcalloc()
...
|
|
Consumer examples in the bindings of resource providers are trivial,
useless and duplication of code. Remove the example code for consumer
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Link: https://lore.kernel.org/r/20220316130858.93455-2-krzysztof.kozlowski@canonical.com
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Significant refactoring and fixing of how DM core does bio-based IO
accounting with focus on fixing wildly inaccurate IO stats for
dm-crypt (and other DM targets that defer bio submission in their own
workqueues). End result is proper IO accounting, made possible by
targets being updated to use the new dm_submit_bio_remap() interface.
- Add hipri bio polling support (REQ_POLLED) to bio-based DM.
- Reduce dm_io and dm_target_io structs so that a single dm_io (which
contains dm_target_io and first clone bio) weighs in at 256 bytes.
For reference the bio struct is 128 bytes.
- Various other small cleanups, fixes or improvements in DM core and
targets.
- Update MAINTAINERS with my kernel.org email address to allow
distinction between my "upstream" and "Red" Hats.
* tag 'for-5.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (46 commits)
dm: consolidate spinlocks in dm_io struct
dm: reduce size of dm_io and dm_target_io structs
dm: switch dm_target_io booleans over to proper flags
dm: switch dm_io booleans over to proper flags
dm: update email address in MAINTAINERS
dm: return void from __send_empty_flush
dm: factor out dm_io_complete
dm cache: use dm_submit_bio_remap
dm: simplify dm_sumbit_bio_remap interface
dm thin: use dm_submit_bio_remap
dm: add WARN_ON_ONCE to dm_submit_bio_remap
dm: support bio polling
block: add ->poll_bio to block_device_operations
dm mpath: use DMINFO instead of printk with KERN_INFO
dm: stop using bdevname
dm-zoned: remove the ->name field in struct dmz_dev
dm: remove unnecessary local variables in __bind
dm: requeue IO if mapping table not yet available
dm io: remove stale comment block for dm_io()
dm thin metadata: remove unused dm_thin_remove_block and __remove
...
|
|
'clocks' in the example is not parsable with the 0 phandle value
because the number of #clock-cells is unknown in the previous entry.
Solve this by adding the clock provider node. Only 'cpg_clocks' is
needed as the examples are built with fixups which can be used to
identify phandles.
This is in preparation to support schema validation on .dtb files.
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220301190400.1644150-1-robh@kernel.org
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
Pull rdma updates from Jason Gunthorpe:
- Minor bug fixes in mlx5, mthca, pvrdma, rtrs, mlx4, hfi1, hns
- Minor cleanups: coding style, useless includes and documentation
- Reorganize how multicast processing works in rxe
- Replace a red/black tree with xarray in rxe which improves performance
- DSCP support and HW address handle re-use in irdma
- Simplify the mailbox command handling in hns
- Simplify iser now that FMR is eliminated
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (93 commits)
RDMA/nldev: Prevent underflow in nldev_stat_set_counter_dynamic_doit()
IB/iser: Fix error flow in case of registration failure
IB/iser: Generalize map/unmap dma tasks
IB/iser: Use iser_fr_desc as registration context
IB/iser: Remove iser_reg_data_sg helper function
RDMA/rxe: Use standard names for ref counting
RDMA/rxe: Replace red-black trees by xarrays
RDMA/rxe: Shorten pool names in rxe_pool.c
RDMA/rxe: Move max_elem into rxe_type_info
RDMA/rxe: Replace obj by elem in declaration
RDMA/rxe: Delete _locked() APIs for pool objects
RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC
RDMA/rxe: Replace mr by rkey in responder resources
RDMA/rxe: Fix ref error in rxe_av.c
RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT
RDMA/irdma: Add support for address handle re-use
RDMA/qib: Fix typos in comments
RDMA/mlx5: Fix memory leak in error flow for subscribe event routine
Revert "RDMA/core: Fix ib_qp_usecnt_dec() called when error"
RDMA/rxe: Remove useless argument for update_state()
...
|
|
Instead of having each time that wants to use ksft_exit() have to figure
out the internals of kselftest.h, add the helper ksft_finished() that
makes sure the passes, xfails, and skips are equal to the test plan count.
Link: https://lkml.kernel.org/r/20220201013717.2464392-1-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
and MCL_ONFAULT allowing to mlock without populating, there are valid use
cases for depopulating locked ranges as well.
Users mlock memory to protect secrets. There are allocators for secure
buffers that want in-use memory generally mlocked, but cleared and
invalidated memory to give up the physical pages. This could be done with
explicit munlock -> mlock calls on free -> alloc of course, but that adds
two unnecessary syscalls, heavy mmap_sem write locks, vma splits and
re-merges - only to get rid of the backing pages.
Users also mlockall(MCL_ONFAULT) to suppress sustained paging, but are
okay with on-demand initial population. It seems valid to selectively
free some memory during the lifetime of such a process, without having to
mess with its overall policy.
Why add a separate flag? Isn't this a pretty niche usecase?
- MADV_DONTNEED has been bailing on locked vmas forever. It's at least
conceivable that someone, somewhere is relying on mlock to protect
data from perhaps broader invalidation calls. Changing this behavior
now could lead to quiet data corruption.
- It also clarifies expectations around MADV_FREE and maybe
MADV_REMOVE. It avoids the situation where one quietly behaves
different than the others. MADV_FREE_LOCKED can be added later.
- The combination of mlock() and madvise() in the first place is
probably niche. But where it happens, I'd say that dropping pages
from a locked region once they don't contain secrets or won't page
anymore is much saner than relying on mlock to protect memory from
speculative or errant invalidation calls. It's just that we can't
change the default behavior because of the two previous points.
Given that, an explicit new flag seems to make the most sense.
[hannes@cmpxchg.org: fix mips build]
Link: https://lkml.kernel.org/r/20220304171912.305060-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Problem:
=======
Userspace might read the zero-page instead of actual data from a direct IO
read on a block device if the buffers have been called madvise(MADV_FREE)
on earlier (this is discussed below) due to a race between page reclaim on
MADV_FREE and blkdev direct IO read.
- Race condition:
==============
During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
if the page is not dirty, then discards its rmap PTE(s) (vs. remap back
if the page is dirty).
However, after try_to_unmap_one() returns to shrink_page_list(), it might
keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
_one_ page reference, from the isolation for page reclaim).
Well, blkdev_direct_IO() gets references for all pages, and on READ
operations it only sets them dirty _later_.
So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
IO read from block devices, and page reclaim happens during
__blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
returns, but BEFORE the pages are set dirty, the situation happens.
The direct IO read eventually completes. Now, when userspace reads the
buffers, the PTE is no longer there and the page fault handler
do_anonymous_page() services that with the zero-page, NOT the data!
A synthetic reproducer is provided.
- Page faults:
===========
If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
happen, because that faults-in all pages as writeable, so
do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
direct IO. The userspace reads don't fault as the PTE is there (thus
zero-page is not used/setup).
But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
is no longer there; the subsequent page faults can't help:
The data-read from the block device probably won't generate faults due to
DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
different virtual addresses (not user-mapped addresses) because `struct
bio_vec` stores `struct page` to figure addresses out (which are different
from user-mapped addresses) for the read.
Thus userspace reads (to user-mapped addresses) still fault, then
do_anonymous_page() gets another `struct page` that would address/ map to
other memory than the `struct page` used by `struct bio_vec` for the read.
(The original `struct page` is not available, since it wasn't freed, as
page_ref_freeze() failed due to more page refs. And even if it were
available, its data cannot be trusted anymore.)
Solution:
========
One solution is to check for the expected page reference count in
try_to_unmap_one().
There should be one reference from the isolation (that is also checked in
shrink_page_list() with page_ref_freeze()) plus one or more references
from page mapping(s) (put in discard: label). Further references mean
that rmap/PTE cannot be unmapped/nuked.
(Note: there might be more than one reference from mapping due to
fork()/clone() without CLONE_VM, which use the same `struct page` for
references, until the copy-on-write page gets copied.)
So, additional page references (e.g., from direct IO read) now prevent the
rmap/PTE from being unmapped/dropped; similarly to the page is not freed
per shrink_page_list()/page_ref_freeze()).
- Races and Barriers:
==================
The new check in try_to_unmap_one() should be safe in races with
bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
done under the PTE lock.
The fast path doesn't take the lock, but it checks if the PTE has changed
and if so, it drops the reference and leaves the page for the slow path
(which does take that lock).
The fast path requires synchronization w/ full memory barrier: it writes
the page reference count first then it reads the PTE later, while
try_to_unmap() writes PTE first then it reads page refcount.
And a second barrier is needed, as the page dirty flag should not be read
before the page reference count (as in __remove_mapping()). (This can be
a load memory barrier only; no writes are involved.)
Call stack/comments:
- try_to_unmap_one()
- page_vma_mapped_walk()
- map_pte() # see pte_offset_map_lock():
pte_offset_map()
spin_lock()
- ptep_get_and_clear() # write PTE
- smp_mb() # (new barrier) GUP fast path
- page_ref_count() # (new check) read refcount
- page_vma_mapped_walk_done() # see pte_unmap_unlock():
pte_unmap()
spin_unlock()
- bio_iov_iter_get_pages()
- __bio_iov_iter_get_pages()
- iov_iter_get_pages()
- get_user_pages_fast()
- internal_get_user_pages_fast()
# fast path
- lockless_pages_from_mm()
- gup_{pgd,p4d,pud,pmd,pte}_range()
ptep = pte_offset_map() # not _lock()
pte = ptep_get_lockless(ptep)
page = pte_page(pte)
try_grab_compound_head(page) # inc refcount
# (RMW/barrier
# on success)
if (pte_val(pte) != pte_val(*ptep)) # read PTE
put_compound_head(page) # dec refcount
# go slow path
# slow path
- __gup_longterm_unlocked()
- get_user_pages_unlocked()
- __get_user_pages_locked()
- __get_user_pages()
- follow_{page,p4d,pud,pmd}_mask()
- follow_page_pte()
ptep = pte_offset_map_lock()
pte = *ptep
page = vm_normal_page(pte)
try_grab_page(page) # inc refcount
pte_unmap_unlock()
- Huge Pages:
==========
Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
(aka lazyfree) pages are PageAnon() && !PageSwapBacked()
(madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
thus should reach shrink_page_list() -> split_huge_page_to_list() before
try_to_unmap[_one](), so it deals with normal pages only.
(And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens,
which should not or be rare, the page refcount should be greater than
mapcount: the head page is referenced by tail pages. That also prevents
checking the head `page` then incorrectly call page_remove_rmap(subpage)
for a tail page, that isn't even in the shrink_page_list()'s page_list (an
effect of split huge pmd/pmvw), as it might happen today in this unlikely
scenario.)
MADV_FREE'd buffers:
===================
So, back to the "if MADV_FREE pages are used as buffers" note. The case
is arguable, and subject to multiple interpretations.
The madvise(2) manual page on the MADV_FREE advice value says:
1) 'After a successful MADV_FREE ... data will be lost when
the kernel frees the pages.'
2) 'the free operation will be canceled if the caller writes
into the page' / 'subsequent writes ... will succeed and
then [the] kernel cannot free those dirtied pages'
3) 'If there is no subsequent write, the kernel can free the
pages at any time.'
Thoughts, questions, considerations... respectively:
1) Since the kernel didn't actually free the page (page_ref_freeze()
failed), should the data not have been lost? (on userspace read.)
2) Should writes performed by the direct IO read be able to cancel
the free operation?
- Should the direct IO read be considered as 'the caller' too,
as it's been requested by 'the caller'?
- Should the bio technique to dirty pages on return to userspace
(bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
be considered in another/special way here?
3) Should an upcoming write from a previously requested direct IO
read be considered as a subsequent write, so the kernel should
not free the pages? (as it's known at the time of page reclaim.)
And lastly:
Technically, the last point would seem a reasonable consideration and
balance, as the madvise(2) manual page apparently (and fairly) seem to
assume that 'writes' are memory access from the userspace process (not
explicitly considering writes from the kernel or its corner cases; again,
fairly).. plus the kernel fix implementation for the corner case of the
largely 'non-atomic write' encompassed by a direct IO read operation, is
relatively simple; and it helps.
Reproducer:
==========
@ test.c (simplified, but works)
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
int main() {
int fd, i;
char *buf;
fd = open(DEV, O_RDONLY | O_DIRECT);
buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
buf[i] = 1; // init to non-zero
madvise(buf, BUF_SIZE, MADV_FREE);
read(fd, buf, BUF_SIZE);
for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
printf("%p: 0x%x\n", &buf[i], buf[i]);
return 0;
}
@ block/fops.c (formerly fs/block_dev.c)
+#include <linux/swap.h>
...
... __blkdev_direct_IO[_simple](...)
{
...
+ if (!strcmp(current->comm, "good"))
+ shrink_all_memory(ULONG_MAX);
+
ret = bio_iov_iter_get_pages(...);
+
+ if (!strcmp(current->comm, "bad"))
+ shrink_all_memory(ULONG_MAX);
...
}
@ shell
# NUM_PAGES=4
# PAGE_SIZE=$(getconf PAGE_SIZE)
# yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
# DEV=$(losetup -f --show test.img)
# gcc -DDEV=\"$DEV\" \
-DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
-DPAGE_SIZE=${PAGE_SIZE} \
test.c -o test
# od -tx1 $DEV
0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
*
0040000
# mv test good
# ./good
0x7f7c10418000: 0x79
0x7f7c10419000: 0x79
0x7f7c1041a000: 0x79
0x7f7c1041b000: 0x79
# mv good bad
# ./bad
0x7fa1b8050000: 0x0
0x7fa1b8051000: 0x0
0x7fa1b8052000: 0x0
0x7fa1b8053000: 0x0
Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap
do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].
- v5.17-rc3:
# for i in {1..1000}; do ./good; done \
| cut -d: -f2 | sort | uniq -c
4000 0x79
# mv good bad
# for i in {1..1000}; do ./bad; done \
| cut -d: -f2 | sort | uniq -c
4000 0x0
# free | grep Swap
Swap: 0 0 0
- v4.5:
# for i in {1..1000}; do ./good; done \
| cut -d: -f2 | sort | uniq -c
4000 0x79
# mv good bad
# for i in {1..1000}; do ./bad; done \
| cut -d: -f2 | sort | uniq -c
2702 0x0
1298 0x79
# swapoff -av
swapoff /swap
# for i in {1..1000}; do ./bad; done \
| cut -d: -f2 | sort | uniq -c
4000 0x79
Ceph/TCMalloc:
=============
For documentation purposes, the use case driving the analysis/fix is Ceph
on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
release unused memory to the system from the mmap'ed page heap (might be
committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
-> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
TCMalloc_SystemCommit() -> do nothing.
Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
just 'disappeared' on Ceph on later Ubuntu releases but is still present
in the kernel, and can be hit by other use cases.
The observed issue seems to be the old Ceph bug #22464 [1], where checksum
mismatches are observed (and instrumentation with buffer dumps shows
zero-pages read from mmap'ed/MADV_FREE'd page ranges).
The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
mostly worked around with a retry mechanism, but other parts of Ceph could
still hit that (rocksdb). Anyway, it's less likely to be hit again as
TCMalloc switched out of MADV_FREE by default.
(Some kernel versions/reports from the Ceph bug, and relation with
the MADV_FREE introduction/changes; TCMalloc versions not checked.)
- 4.4 good
- 4.5 (madv_free: introduction)
- 4.9 bad
- 4.10 good? maybe a swapless system
- 4.12 (madv_free: no longer free instantly on swapless systems)
- 4.13 bad
[1] https://tracker.ceph.com/issues/22464
Thanks:
======
Several people contributed to analysis/discussions/tests/reproducers in
the first stages when drilling down on ceph/tcmalloc/linux kernel:
- Dan Hill
- Dan Streetman
- Dongdong Tao
- Gavin Guo
- Gerald Yang
- Heitor Alves de Siqueira
- Ioanna Alifieraki
- Jay Vosburgh
- Matthew Ruffell
- Ponnuvel Palaniyappan
Reviews, suggestions, corrections, comments:
- Minchan Kim
- Yu Zhao
- Huang, Ying
- John Hubbard
- Christoph Hellwig
[mfo@canonical.com: v4]
Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com
Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Hill <daniel.hill@canonical.com>
Cc: Dan Streetman <dan.streetman@canonical.com>
Cc: Dongdong Tao <dongdong.tao@canonical.com>
Cc: Gavin Guo <gavin.guo@canonical.com>
Cc: Gerald Yang <gerald.yang@canonical.com>
Cc: Heitor Alves de Siqueira <halves@canonical.com>
Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
Cc: <stable@vger.kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
ARCH_HAS_FILTER_PGPROT config has duplicate definitions on platforms that
subscribe it. Instead make it a generic config option which can be
selected on applicable platforms when required.
Link: https://lkml.kernel.org/r/1643004823-16441-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Revert 48ec833b7851 ("Revert "mm/memory.c: share the i_mmap_rwsem"") to
reinstate c8475d144abb ("mm/memory.c: share the i_mmap_rwsem"): the
unmap_mapping_range family of functions do the unmapping of user pages
(ultimately via zap_page_range_single) without modifying the interval tree
itself, and unmapping races are necessarily guarded by page table lock,
thus the i_mmap_rwsem should be shared in unmap_mapping_pages() and
unmap_mapping_folio().
Commit 48ec833b7851 was intended as a short-term measure, allowing the
other shared lock changes into 3.19 final, before investigating three
trinity crashes, one of which had been bisected to commit c8475d144ab:
[1] https://lkml.org/lkml/2014/11/14/342
https://lore.kernel.org/lkml/5466142C.60100@oracle.com/
[2] https://lkml.org/lkml/2014/12/22/213
https://lore.kernel.org/lkml/549832E2.8060609@oracle.com/
[3] https://lkml.org/lkml/2014/12/9/741
https://lore.kernel.org/lkml/5487ACC5.1010002@oracle.com/
Two of those were Bad page states: free_pages_prepare() found PG_mlocked
still set - almost certain to have been fixed by 4.4 commit b87537d9e2fe
("mm: rmap use pte lock not mmap_sem to set PageMlocked"). The NULL deref
on rwsem in [2]: unclear, only happened once, not bisected to c8475d144ab.
No change to the i_mmap_lock_write() around __unmap_hugepage_range_final()
in unmap_single_vma(): IIRC that's a special usage, helping to serialize
hugetlbfs page table sharing, not to be dabbled with lightly. No change
to other uses of i_mmap_lock_write() by hugetlbfs.
I am not aware of any significant gains from the concurrency allowed by
this commit: it is submitted more to resolve an ancient misunderstanding.
Link: https://lkml.kernel.org/r/e4a5e356-6c87-47b2-3ce8-c2a95ae84e20@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
filemap_unaccount_folio() has a WARN_ON_ONCE(folio_test_dirty(folio)). It
is good to warn of late dirtying on a persistent filesystem, but late
dirtying on tmpfs can only lose data which is expected to be thrown away;
and it's a pity if that warning comes ONCE on tmpfs, then hides others
which really matter. Make it conditional on mapping_cap_writeback().
Cleanup: then folio_account_cleaned() no longer needs to check that for
itself, and so no longer needs to know the mapping.
Link: https://lkml.kernel.org/r/b5a1106c-7226-a5c6-ad41-ad4832cae1f@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Let's remove the stale logic that was required for reuse_swap_page().
[akpm@linux-foundation.org: simplification, per Yang Shi]
Link: https://lkml.kernel.org/r/20220131162940.210846-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|