linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2022-03-25	perf lock: Add -F/--field option to control output	Namhyung Kim
	The -F/--field option is to customize the list of fields to output: $ perf lock report -F contended,wait_max -k avg_wait Name contended max wait (ns) avg wait (ns) slock-AF_INET6 1 23543 23543 &lruvec->lru_lock 5 18317 11254 slock-AF_INET6 1 10379 10379 rcu_node_1 1 2104 2104 &dentry->d_lockr... 1 1844 1844 &dentry->d_lockr... 1 1672 1672 &newf->file_lock 15 2279 1025 &dentry->d_lockr... 1 792 792 Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20220323230259.288494-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-03-25	perf lock: Extend struct lock_key to have print function	Namhyung Kim
	And use it to print output for each key field. No functional change intended and the output should be identical. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20220323230259.288494-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-03-25	perf lock: Add --synth=no option for record	Namhyung Kim
	The perf lock command has nothing to symbolize and lock names come from the tracepoint. Moreover, kernel symbols are available even the --synth=no option is given. This will reduce the startup time by avoiding unnecessary synthesis. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20220323230259.288494-1-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-03-25	Documentation: amd-pstate: grammar and sentence structure updates	Jan Engelhardt
	Signed-off-by: Jan Engelhardt <jengelh@inai.de> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-03-25	SUNRPC: Don't return error values in sysfs read of closed files	Trond Myklebust
	Instead of returning an error value, which ends up being the return value for the read() system call, it is more elegant to simply return the error as a string value. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-25	SUNRPC: Do not dereference non-socket transports in sysfs	Trond Myklebust
	Do not cast the struct xprt to a sock_xprt unless we know it is a UDP or TCP transport. Otherwise the call to lock the mutex will scribble over whatever structure is actually there. This has been seen to cause hard system lockups when the underlying transport was RDMA. Fixes: b49ea673e119 ("SUNRPC: lock against ->sock changing during sysfs read") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-25	Merge branch 'akpm' (patches from Andrew)	Linus Torvalds
	Merge yet more updates from Andrew Morton: "This is the material which was staged after willystuff in linux-next. Subsystems affected by this patch series: mm (debug, selftests, pagecache, thp, rmap, migration, kasan, hugetlb, pagemap, madvise), and selftests" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (113 commits) selftests: kselftest framework: provide "finished" helper mm: madvise: MADV_DONTNEED_LOCKED mm: fix race between MADV_FREE reclaim and blkdev direct IO read mm: generalize ARCH_HAS_FILTER_PGPROT mm: unmap_mapping_range_tree() with i_mmap_rwsem shared mm: warn on deleting redirtied only if accounted mm/huge_memory: remove stale locking logic from __split_huge_pmd() mm/huge_memory: remove stale page_trans_huge_mapcount() mm/swapfile: remove stale reuse_swap_page() mm/khugepaged: remove reuse_swap_page() usage mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() mm: streamline COW logic in do_swap_page() mm: slightly clarify KSM logic in do_swap_page() mm: optimize do_wp_page() for fresh pages in local LRU pagevecs mm: optimize do_wp_page() for exclusive pages in the swapcache mm/huge_memory: make is_transparent_hugepage() static userfaultfd/selftests: enable hugetlb remap and remove event testing selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test mm: enable MADV_DONTNEED for hugetlb mappings kasan: disable LOCKDEP when printing reports ...
2022-03-25	ACPI: CPPC: Change default error code and clean up debug messages in probe	Rafael J. Wysocki
	Change the default error code returned by acpi_cppc_processor_probe() from -EFAULT (which is completely inadequate) to -ENODATA and change the debug messages printed by it to contain more information and be more consistent. While at it, format some white space to follow the coding style. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Huang Rui <ray.huang@amd.com>
2022-03-25	ACPI: CPPC: Avoid out of bounds access when parsing _CPC data	Rafael J. Wysocki
	If the NumEntries field in the _CPC return package is less than 2, do not attempt to access the "Revision" element of that package, because it may not be present then. Fixes: 337aadff8e45 ("ACPI: Introduce CPU performance controls using CPPC") BugLink: https://lore.kernel.org/lkml/20220322143534.GC32582@xsang-OptiPlex-9020/ Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Huang Rui <ray.huang@amd.com>
2022-03-25	ACPI: tables: Make LAPIC_ADDR_OVR address readable in message	Vasant Hegde
	Without fix: [ 0.005429] ACPI: LAPIC_ADDR_OVR (address[(____ptrval____)]) With fix: [ 0.005429] ACPI: LAPIC_ADDR_OVR (address[0x800fee00000]) Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com> [ rjw: Subject edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-03-25	Merge tag 'riscv-for-linus-5.18-mw0' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for Sv57-based virtual memory. - Various improvements for the MicroChip PolarFire SOC and the associated Icicle dev board, which should allow upstream kernels to boot without any additional modifications. - An improved memmove() implementation. - Support for the new Ssconfpmf and SBI PMU extensions, which allows for a much more useful perf implementation on RISC-V systems. - Support for restartable sequences. * tag 'riscv-for-linus-5.18-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (36 commits) rseq/selftests: Add support for RISC-V RISC-V: Add support for restartable sequence MAINTAINERS: Add entry for RISC-V PMU drivers Documentation: riscv: Remove the old documentation RISC-V: Add sscofpmf extension support RISC-V: Add perf platform driver based on SBI PMU extension RISC-V: Add RISC-V SBI PMU extension definitions RISC-V: Add a simple platform driver for RISC-V legacy perf RISC-V: Add a perf core library for pmu drivers RISC-V: Add CSR encodings for all HPMCOUNTERS RISC-V: Remove the current perf implementation RISC-V: Improve /proc/cpuinfo output for ISA extensions RISC-V: Do no continue isa string parsing without correct XLEN RISC-V: Implement multi-letter ISA extension probing framework RISC-V: Extract multi-letter extension names from "riscv, isa" RISC-V: Minimal parser for "riscv, isa" strings RISC-V: Correctly print supported extensions riscv: Fixed misaligned memory access. Fixed pointer comparison. MAINTAINERS: update riscv/microchip entry riscv: dts: microchip: add new peripherals to icicle kit device tree ...
2022-03-25	ACPI: IPMI: replace usage of found with dedicated list iterator variable	Jakob Koschel
	To move the list iterator variable into the list_for_each_entry_() macro in the future it should be avoided to use the list iterator variable after the loop body. To never* use the list iterator variable after the loop it was concluded to use a separate iterator variable instead of a found boolean [1]. This removes the need to use a found variable and simply checking if the variable was set, can determine if the break/goto was hit. Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-03-25	Merge tag 's390-5.18-1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Vasily Gorbik: - Raise minimum supported machine generation to z10, which comes with various cleanups and code simplifications (usercopy/spectre mitigation/etc). - Rework extables and get rid of anonymous out-of-line fixups. - Page table helpers cleanup. Add set_pXd()/set_pte() helper functions. Covert pte_val()/pXd_val() macros to functions. - Optimize kretprobe handling by avoiding extra kprobe on __kretprobe_trampoline. - Add support for CEX8 crypto cards. - Allow to trigger AP bus rescan via writing to /sys/bus/ap/scans. - Add CONFIG_EXPOLINE_EXTERN option to build the kernel without COMDAT group sections which simplifies kpatch support. - Always use the packed stack layout and extend kernel unwinder tests. - Add sanity checks for ftrace code patching. - Add s390dbf debug log for the vfio_ap device driver. - Various virtual vs physical address confusion fixes. - Various small fixes and improvements all over the code. * tag 's390-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (69 commits) s390/test_unwind: add kretprobe tests s390/kprobes: Avoid additional kprobe in kretprobe handling s390: convert ".insn" encoding to instruction names s390: assume stckf is always present s390/nospec: move to single register thunks s390: raise minimum supported machine generation to z10 s390/uaccess: Add copy_from/to_user_key functions s390/nospec: align and size extern thunks s390/nospec: add an option to use thunk-extern s390/nospec: generate single register thunks if possible s390/pci: make zpci_set_irq()/zpci_clear_irq() static s390: remove unused expoline to BC instructions s390/irq: use assignment instead of cast s390/traps: get rid of magic cast for per code s390/traps: get rid of magic cast for program interruption code s390/signal: fix typo in comments s390/asm-offsets: remove unused defines s390/test_unwind: avoid build warning with W=1 s390: remove .fixup section s390/bpf: encode register within extable entry ...
2022-03-25	Merge tag 'xtensa-20220325' of https://github.com/jcmvbkbc/linux-xtensa	Linus Torvalds
	Pull Xtensa updates from Max Filippov: - remove dependency on the compiler's libgcc - allow selection of internal kernel ABI via Kconfig - enable compiler plugins support for gcc-12 or newer - various minor cleanups and fixes * tag 'xtensa-20220325' of https://github.com/jcmvbkbc/linux-xtensa: xtensa: define update_mmu_tlb function xtensa: fix xtensa_wsr always writing 0 xtensa: enable plugin support xtensa: clean up kernel exit assembly code xtensa: rearrange NMI exit path xtensa: merge stack alignment definitions xtensa: fix DTC warning unit_address_format xtensa: fix stop_machine_cpuslocked call in patch_text xtensa: make secondary reset vector support conditional xtensa: add kernel ABI selection to Kconfig xtensa: don't link with libgcc xtensa: add helpers for division, remainder and shifts xtensa: add missing XCHAL_HAVE_WINDOWED check xtensa: use XCHAL_NUM_AREGS as pt_regs::areg size xtensa: rename PT_SIZE to PT_KERNEL_SIZE xtensa: Remove unused early_read_config_byte() et al declarations xtensa: use strscpy to copy strings net: xtensa: use strscpy to copy strings
2022-03-25	Merge tag 'powerpc-5.18-1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Livepatch support for 32-bit is probably the standout new feature, otherwise mostly just lots of bits and pieces all over the board. There's a series of commits cleaning up function descriptor handling, which touches a few other arches as well as LKDTM. It has acks from Arnd, Kees and Helge. Summary: - Enforce kernel RO, and implement STRICT_MODULE_RWX for 603. - Add support for livepatch to 32-bit. - Implement CONFIG_DYNAMIC_FTRACE_WITH_ARGS. - Merge vdso64 and vdso32 into a single directory. - Fix build errors with newer binutils. - Add support for UADDR64 relocations, which are emitted by some toolchains. This allows powerpc to build with the latest lld. - Fix (another) potential userspace r13 corruption in transactional memory handling. - Cleanups of function descriptor handling & related fixes to LKDTM. Thanks to Abdul Haleem, Alexey Kardashevskiy, Anders Roxell, Aneesh Kumar K.V, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Bhaskar Chowdhury, Cédric Le Goater, Chen Jingwen, Christophe JAILLET, Christophe Leroy, Corentin Labbe, Daniel Axtens, Daniel Henrique Barboza, David Dai, Fabiano Rosas, Ganesh Goudar, Guo Zhengkui, Hangyu Hua, Haren Myneni, Hari Bathini, Igor Zhbanov, Jakob Koschel, Jason Wang, Jeremy Kerr, Joachim Wiberg, Jordan Niethe, Julia Lawall, Kajol Jain, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mamatha Inamdar, Maxime Bizon, Maxim Kiselev, Maxim Kochetkov, Michal Suchanek, Nageswara R Sastry, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Nour-eddine Taleb, Paul Menzel, Ping Fang, Pratik R. Sampat, Randy Dunlap, Ritesh Harjani, Rohan McLure, Russell Currey, Sachin Sant, Segher Boessenkool, Shivaprasad G Bhat, Sourabh Jain, Thierry Reding, Tobias Waldekranz, Tyrel Datwyler, Vaibhav Jain, Vladimir Oltean, Wedson Almeida Filho, and YueHaibing" * tag 'powerpc-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits) powerpc/pseries: Fix use after free in remove_phb_dynamic() powerpc/time: improve decrementer clockevent processing powerpc/time: Fix KVM host re-arming a timer beyond decrementer range powerpc/tm: Fix more userspace r13 corruption powerpc/xive: fix return value of __setup handler powerpc/64: Add UADDR64 relocation support powerpc: 8xx: fix a return value error in mpc8xx_pic_init powerpc/ps3: remove unneeded semicolons powerpc/64: Force inlining of prevent_user_access() and set_kuap() powerpc/bitops: Force inlining of fls() powerpc: declare unmodified attribute_group usages const powerpc/spufs: Fix build warning when CONFIG_PROC_FS=n powerpc/secvar: fix refcount leak in format_show() powerpc/64e: Tie PPC_BOOK3E_64 to PPC_FSL_BOOK3E powerpc: Move C prototypes out of asm-prototypes.h powerpc/kexec: Declare kexec_paca static powerpc/smp: Declare current_set static powerpc: Cleanup asm-prototypes.c powerpc/ftrace: Use STK_GOT in ftrace_mprofile.S powerpc/ftrace: Regroup PPC64 specific operations in ftrace_mprofile.S ...
2022-03-25	Merge tag 'mips_5.18' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS updates from Thomas Bogendoerfer: - added support for QCN550x (ath79) - enabled KCSAN - removed TX39XX support - various cleanups and fixes * tag 'mips_5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (31 commits) MIPS: Fix build error for loongson64 and sgi-ip27 MIPS: ingenic: correct unit node address MIPS: Fix wrong comments in asm/prom.h MIPS: Remove redundant definitions of device_tree_init() MIPS: Remove redundant check in device_tree_init() MIPS: pgalloc: fix memory leak caused by pgd_free() MIPS: RB532: fix return value of __setup handler MIPS: Only use current_stack_pointer on GCC MIPS: boot/compressed: Use array reference for image bounds mips: cdmm: Fix refcount leak in mips_cdmm_phys_base mips: remove reference to "newer Loongson-3" mips: Always permit to build u-boot images MIPS: Sanitise Cavium switch cases in TLB handler synthesizers DEC: Limit PMAX memory probing to R3k systems mips: DEC: honor CONFIG_MIPS_FP_SUPPORT=n MIPS: fix fortify panic when copying asm exception handlers mips: ralink: fix a refcount leak in ill_acc_of_setup() mips: Implement "current_stack_pointer" MIPS: Remove TX39XX support MIPS: Modernize READ_IMPLIES_EXEC ...
2022-03-25	regulator: rt4831: Add active_discharge_on to fix discharge API	ChiYuan Huang
	To use set_discharge helper function, active_discharge_on need to be added. Signed-off-by: ChiYuan Huang <cy_huang@richtek.com> Link: https://lore.kernel.org/r/1648171577-9663-3-git-send-email-u0084500@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
2022-03-25	regulator: rt4831: Add bypass mask to fix set_bypass API work	ChiYuan Huang
	To use set/get_bypass helper function, bypass mask need to be specified. Signed-off-by: ChiYuan Huang <cy_huang@richtek.com> Link: https://lore.kernel.org/r/1648171577-9663-2-git-send-email-u0084500@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
2022-03-25	ASoC: SOF: Intel: Fix build error without SND_SOC_SOF_PCI_DEV	Zheng Bin
	If SND_SOC_SOF_PCI_DEV is n, bulding fails: sound/soc/sof/intel/pci-tng.o:(.data+0x1c0): undefined reference to `sof_pci_probe' sound/soc/sof/intel/pci-tng.o:(.data+0x1c8): undefined reference to `sof_pci_remove' sound/soc/sof/intel/pci-tng.o:(.data+0x1e0): undefined reference to `sof_pci_shutdown' sound/soc/sof/intel/pci-tng.o:(.data+0x290): undefined reference to `sof_pci_pm' Make SND_SOC_SOF_MERRIFIELD select SND_SOC_SOF_PCI_DEV to fix this. Fixes: 8d4ba1be3d22 ("ASoC: SOF: pci: split PCI into different drivers") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Acked-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/20220323092501.145879-1-zhengbin13@huawei.com Signed-off-by: Mark Brown <broonie@kernel.org>
2022-03-25	[smb3] move more common protocol header definitions to smbfs_common	Steve French
	We have duplicated definitions for various SMB3 PDUs in fs/ksmbd and fs/cifs. Some had already been moved to fs/smbfs_common/smb2pdu.h Move definitions for - error response - query info and various related protocol flags - various lease handling flags and the create lease context to smbfs_common/smb2pdu.h to reduce code duplication Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-03-25	virt: vmgenid: recognize new CID added by Hyper-V	Michael Kelley
	In the Windows spec for VM Generation ID, the originally specified CID is longer than allowed by the ACPI spec. Hyper-V has added "VMGENCTR" as a second valid CID that is conformant, while retaining the original CID for compatibility with Windows guests. Add this new CID to the list recognized by the driver. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-03-25	random: re-add removed comment about get_random_{u32,u64} reseeding	Jason A. Donenfeld
	The comment about get_random_{u32,u64}() not invoking reseeding got added in an unrelated commit, that then was recently reverted by 0313bc278dac ("Revert "random: block in /dev/urandom""). So this adds that little comment snippet back, and improves the wording a bit too. Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-03-25	random: treat bootloader trust toggle the same way as cpu trust toggle	Jason A. Donenfeld
	If CONFIG_RANDOM_TRUST_CPU is set, the RNG initializes using RDRAND. But, the user can disable (or enable) this behavior by setting `random.trust_cpu=0/1` on the kernel command line. This allows system builders to do reasonable things while avoiding howls from tinfoil hatters. (Or vice versa.) CONFIG_RANDOM_TRUST_BOOTLOADER is basically the same thing, but regards the seed passed via EFI or device tree, which might come from RDRAND or a TPM or somewhere else. In order to allow distros to more easily enable this while avoiding those same howls (or vice versa), this commit adds the corresponding `random.trust_bootloader=0/1` toggle. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Graham Christensen <graham@grahamc.com> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Link: https://github.com/NixOS/nixpkgs/pull/165355 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-03-25	random: skip fast_init if hwrng provides large chunk of entropy	Jason A. Donenfeld
	At boot time, EFI calls add_bootloader_randomness(), which in turn calls add_hwgenerator_randomness(). Currently add_hwgenerator_randomness() feeds the first 64 bytes of randomness to the "fast init" non-crypto-grade phase. But if add_hwgenerator_randomness() gets called with more than POOL_MIN_BITS of entropy, there's no point in passing it off to the "fast init" stage, since that's enough entropy to bootstrap the real RNG. The "fast init" stage is just there to provide _something_ in the case where we don't have enough entropy to properly bootstrap the RNG. But if we do have enough entropy to bootstrap the RNG, the current logic doesn't serve a purpose. So, in the case where we're passed greater than or equal to POOL_MIN_BITS of entropy, this commit makes us skip the "fast init" phase. Cc: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-03-25	fs/iomap: Fix buffered write page prefaulting	Andreas Gruenbacher
	When part of the user buffer passed to generic_perform_write() or iomap_file_buffered_write() cannot be faulted in for reading, the entire write currently fails. The correct behavior would be to write all the data that can be written, up to the point of failure. Commit a6294593e8a1 ("iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable") gave us the information needed, so fix the page prefaulting in generic_perform_write() and iomap_write_iter() to only bail out when no pages could be faulted in. We already factor in that pages that are faulted in may no longer be resident by the time they are accessed. Paging out pages has the same effect as not faulting in those pages in the first place, so the code can already deal with that. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-25	io_uring: fix put_kbuf without proper locking	Pavel Begunkov
	io_put_kbuf_comp() should only be called while holding ->completion_lock, however there is no such assumption in io_clean_op() and thus it can corrupt ->io_buffer_comp. Take the lock there, and workaround the only user of io_clean_op() calling it with locks. Not the prettiest solution, but it's easier to refactor it for-next. Fixes: cc3cec8367cba ("io_uring: speedup provided buffer handling") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/743e2130b73ec6d48c4c5dd15db896c433431e6d.1648212967.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25	io_uring: fix invalid flags for io_put_kbuf()	Pavel Begunkov
	io_req_complete_failed() doesn't require callers to hold ->uring_lock, use IO_URING_F_UNLOCKED version of io_put_kbuf(). The only affected place is the fail path of io_apoll_task_func(). Also add a lockdep annotation to catch such bugs in the future. Fixes: 3b2b78a8eb7cc ("io_uring: extend provided buf return to fails") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ccf602dbf8df3b6a8552a262d8ee0a13a086fbc7.1648212967.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25	io_uring: improve req fields comments	Pavel Begunkov
	Move a misplaced comment about req->creds and add a line with assumptions about req->link. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1e51d1e6b1f3708c2d4127b4e371f9daa4c5f859.1648209006.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25	io_uring: enable EPOLLEXCLUSIVE for accept poll	Dylan Yudaken
	When polling sockets for accept, use EPOLLEXCLUSIVE. This is helpful when multiple accept SQEs are submitted. For O_NONBLOCK sockets multiple queued SQEs would previously have all completed at once, but most with -EAGAIN as the result. Now only one wakes up and completes. For sockets without O_NONBLOCK there is no user facing change, but internally the extra requests would previously be queued onto a worker thread as they would wake up with no connection waiting, and be punted. Now they do not wake up unnecessarily. Co-developed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220325093755.4123343-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25	rtc: optee: add RTC driver for OP-TEE RTC PTA	Clément Léger
	This drivers allows to communicate with a RTC PTA handled by OP-TEE [1]. This PTA allows to query RTC information, set/get time and set/get offset depending on the supported features. [1] https://github.com/OP-TEE/optee_os/pull/5179 Signed-off-by: Clément Léger <clement.leger@bootlin.com> Reviewed-by: Jens Wiklander <jens.wiklander@linaro.org> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20220308133505.471601-1-clement.leger@bootlin.com
2022-03-25	rtc: pm8xxx: Return -ENODEV if set_time disallowed	Loic Poulain
	Having !allow_set_time is equivalent to non-implemented set_time function, which is normally represented with -ENODEV error in RTC subsystem. Today we are returning -EACCES error code, which is not considered by RTC clients as a 'non implemented' feature, and which causes NTP to retry hw clk sync (update_rtc) indefinitely. Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/1645090578-20734-1-git-send-email-loic.poulain@linaro.org
2022-03-25	rtc: pm8xxx: Attach wake irq to device	Loic Poulain
	Attach the interrupt as a wake-irq to the device, so that: - A corresponding wakeup source is created (and reported in e.g /sys/kernel/debug/wakeup_sources). - The power subsystem take cares of arming/disarming irq-wake automatically on suspend/resume. Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/1645025082-6138-1-git-send-email-loic.poulain@linaro.org
2022-03-25	clk: sunxi-ng: sun6i-rtc: include clk/sunxi-ng.h	Alexandre Belloni
	This solves: >> drivers/clk/sunxi-ng/ccu-sun6i-rtc.c:334:5: warning: no previous prototype for 'sun6i_rtc_ccu_probe' [-Wmissing-prototypes] 334 \| int sun6i_rtc_ccu_probe(struct device dev, void __iomem reg) \| ^~~~~~~~~~~~~~~~~~~ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com> Link: https://lore.kernel.org/r/20220320210905.6606-1-alexandre.belloni@bootlin.com
2022-03-25	crypto: x86/poly1305 - Fixup SLS	Peter Zijlstra
	Due to being a perl generated asm file, it got missed by the mass convertion script. arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_init_x86_64()+0x3a: missing int3 after ret arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_x86_64()+0xf2: missing int3 after ret arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_emit_x86_64()+0x37: missing int3 after ret arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: __poly1305_block()+0x6d: missing int3 after ret arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: __poly1305_init_avx()+0x1e8: missing int3 after ret arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx()+0x18a: missing int3 after ret arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx()+0xaf8: missing int3 after ret arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_emit_avx()+0x99: missing int3 after ret arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx2()+0x18a: missing int3 after ret arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx2()+0x776: missing int3 after ret arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx512()+0x18a: missing int3 after ret arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx512()+0x796: missing int3 after ret arch/x86/crypto/poly1305-x86_64-cryptogams.o: warning: objtool: poly1305_blocks_avx512()+0x10bd: missing int3 after ret Fixes: f94909ceb1ed ("x86: Prepare asm files for straight-line-speculation") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-03-25	crypto: x86/chacha20 - Avoid spurious jumps to other functions	Peter Zijlstra
	The chacha_Nblock_xor_avx512vl() functions all have their own, identical, .LdoneN label, however in one particular spot {2,4} jump to the 8 version instead of their own. Resulting in: arch/x86/crypto/chacha-x86_64.o: warning: objtool: chacha_2block_xor_avx512vl() falls through to next function chacha_8block_xor_avx512vl() arch/x86/crypto/chacha-x86_64.o: warning: objtool: chacha_4block_xor_avx512vl() falls through to next function chacha_8block_xor_avx512vl() Make each function consistently use its own done label. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Martin Willi <martin@strongswan.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-03-25	crypto: stm32 - fix reference leak in stm32_crc_remove	Zheng Yongjun
	pm_runtime_get_sync() will increment pm usage counter even it failed. Forgetting to call pm_runtime_put_noidle will result in reference leak in stm32_crc_remove, so we should fix it. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-03-25	crypto: arm/aes-neonbs-cbc - Select generic cbc and aes	Herbert Xu
	The algorithm __cbc-aes-neonbs requires a fallback so we need to select the config options for them or otherwise it will fail to register on boot-up. Fixes: 00b99ad2bac2 ("crypto: arm/aes-neonbs - Use generic cbc...") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-03-24	Merge tag 'iommu-updates-v5.18' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu updates from Joerg Roedel: - IOMMU Core changes: - Removal of aux domain related code as it is basically dead and will be replaced by iommu-fd framework - Split of iommu_ops to carry domain-specific call-backs separatly - Cleanup to remove useless ops->capable implementations - Improve 32-bit free space estimate in iova allocator - Intel VT-d updates: - Various cleanups of the driver - Support for ATS of SoC-integrated devices listed in ACPI/SATC table - ARM SMMU updates: - Fix SMMUv3 soft lockup during continuous stream of events - Fix error path for Qualcomm SMMU probe() - Rework SMMU IRQ setup to prepare the ground for PMU support - Minor cleanups and refactoring - AMD IOMMU driver: - Some minor cleanups and error-handling fixes - Rockchip IOMMU driver: - Use standard driver registration - MSM IOMMU driver: - Minor cleanup and change to standard driver registration - Mediatek IOMMU driver: - Fixes for IOTLB flushing logic * tag 'iommu-updates-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (47 commits) iommu/amd: Improve amd_iommu_v2_exit() iommu/amd: Remove unused struct fault.devid iommu/amd: Clean up function declarations iommu/amd: Call memunmap in error path iommu/arm-smmu: Account for PMU interrupts iommu/vt-d: Enable ATS for the devices in SATC table iommu/vt-d: Remove unused function intel_svm_capable() iommu/vt-d: Add missing "__init" for rmrr_sanity_check() iommu/vt-d: Move intel_iommu_ops to header file iommu/vt-d: Fix indentation of goto labels iommu/vt-d: Remove unnecessary prototypes iommu/vt-d: Remove unnecessary includes iommu/vt-d: Remove DEFER_DEVICE_DOMAIN_INFO iommu/vt-d: Remove domain and devinfo mempool iommu/vt-d: Remove iova_cache_get/put() iommu/vt-d: Remove finding domain in dmar_insert_one_dev_info() iommu/vt-d: Remove intel_iommu::domains iommu/mediatek: Always tlb_flush_all when each PM resume iommu/mediatek: Add tlb_lock in tlb_flush_all iommu/mediatek: Remove the power status checking in tlb flush all ...
2022-03-24	Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi	Linus Torvalds
	Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (qla2xxx, pm8001, libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates and bug fixes. The high blast radius core update is the removal of write same, which affects block and several non-SCSI devices. The other big change, which is more local, is the removal of the SCSI pointer" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits) scsi: scsi_ioctl: Drop needless assignment in sg_io() scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn() scsi: lpfc: Copyright updates for 14.2.0.0 patches scsi: lpfc: Update lpfc version to 14.2.0.0 scsi: lpfc: SLI path split: Refactor BSG paths scsi: lpfc: SLI path split: Refactor Abort paths scsi: lpfc: SLI path split: Refactor SCSI paths scsi: lpfc: SLI path split: Refactor CT paths scsi: lpfc: SLI path split: Refactor misc ELS paths scsi: lpfc: SLI path split: Refactor VMID paths scsi: lpfc: SLI path split: Refactor FDISC paths scsi: lpfc: SLI path split: Refactor LS_RJT paths scsi: lpfc: SLI path split: Refactor LS_ACC paths scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4 scsi: lpfc: SLI path split: Refactor lpfc_iocbq scsi: lpfc: Use kcalloc() ...
2022-03-24	dt-bindings: clock: drop useless consumer example	Krzysztof Kozlowski
	Consumer examples in the bindings of resource providers are trivial, useless and duplication of code. Remove the example code for consumer Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Link: https://lore.kernel.org/r/20220316130858.93455-2-krzysztof.kozlowski@canonical.com Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2022-03-24	Merge tag 'for-5.18/dm-changes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Significant refactoring and fixing of how DM core does bio-based IO accounting with focus on fixing wildly inaccurate IO stats for dm-crypt (and other DM targets that defer bio submission in their own workqueues). End result is proper IO accounting, made possible by targets being updated to use the new dm_submit_bio_remap() interface. - Add hipri bio polling support (REQ_POLLED) to bio-based DM. - Reduce dm_io and dm_target_io structs so that a single dm_io (which contains dm_target_io and first clone bio) weighs in at 256 bytes. For reference the bio struct is 128 bytes. - Various other small cleanups, fixes or improvements in DM core and targets. - Update MAINTAINERS with my kernel.org email address to allow distinction between my "upstream" and "Red" Hats. * tag 'for-5.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (46 commits) dm: consolidate spinlocks in dm_io struct dm: reduce size of dm_io and dm_target_io structs dm: switch dm_target_io booleans over to proper flags dm: switch dm_io booleans over to proper flags dm: update email address in MAINTAINERS dm: return void from __send_empty_flush dm: factor out dm_io_complete dm cache: use dm_submit_bio_remap dm: simplify dm_sumbit_bio_remap interface dm thin: use dm_submit_bio_remap dm: add WARN_ON_ONCE to dm_submit_bio_remap dm: support bio polling block: add ->poll_bio to block_device_operations dm mpath: use DMINFO instead of printk with KERN_INFO dm: stop using bdevname dm-zoned: remove the ->name field in struct dmz_dev dm: remove unnecessary local variables in __bind dm: requeue IO if mapping table not yet available dm io: remove stale comment block for dm_io() dm thin metadata: remove unused dm_thin_remove_block and __remove ...
2022-03-24	dt-bindings: clock: renesas: Make example 'clocks' parsable	Rob Herring
	'clocks' in the example is not parsable with the 0 phandle value because the number of #clock-cells is unknown in the previous entry. Solve this by adding the clock provider node. Only 'cpg_clocks' is needed as the examples are built with fixups which can be used to identify phandles. This is in preparation to support schema validation on .dtb files. Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20220301190400.1644150-1-robh@kernel.org Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2022-03-24	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma	Linus Torvalds
	Pull rdma updates from Jason Gunthorpe: - Minor bug fixes in mlx5, mthca, pvrdma, rtrs, mlx4, hfi1, hns - Minor cleanups: coding style, useless includes and documentation - Reorganize how multicast processing works in rxe - Replace a red/black tree with xarray in rxe which improves performance - DSCP support and HW address handle re-use in irdma - Simplify the mailbox command handling in hns - Simplify iser now that FMR is eliminated * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (93 commits) RDMA/nldev: Prevent underflow in nldev_stat_set_counter_dynamic_doit() IB/iser: Fix error flow in case of registration failure IB/iser: Generalize map/unmap dma tasks IB/iser: Use iser_fr_desc as registration context IB/iser: Remove iser_reg_data_sg helper function RDMA/rxe: Use standard names for ref counting RDMA/rxe: Replace red-black trees by xarrays RDMA/rxe: Shorten pool names in rxe_pool.c RDMA/rxe: Move max_elem into rxe_type_info RDMA/rxe: Replace obj by elem in declaration RDMA/rxe: Delete _locked() APIs for pool objects RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC RDMA/rxe: Replace mr by rkey in responder resources RDMA/rxe: Fix ref error in rxe_av.c RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT RDMA/irdma: Add support for address handle re-use RDMA/qib: Fix typos in comments RDMA/mlx5: Fix memory leak in error flow for subscribe event routine Revert "RDMA/core: Fix ib_qp_usecnt_dec() called when error" RDMA/rxe: Remove useless argument for update_state() ...
2022-03-24	selftests: kselftest framework: provide "finished" helper	Kees Cook
	Instead of having each time that wants to use ksft_exit() have to figure out the internals of kselftest.h, add the helper ksft_finished() that makes sure the passes, xfails, and skips are equal to the test plan count. Link: https://lkml.kernel.org/r/20220201013717.2464392-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24	mm: madvise: MADV_DONTNEED_LOCKED	Johannes Weiner
	MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT and MCL_ONFAULT allowing to mlock without populating, there are valid use cases for depopulating locked ranges as well. Users mlock memory to protect secrets. There are allocators for secure buffers that want in-use memory generally mlocked, but cleared and invalidated memory to give up the physical pages. This could be done with explicit munlock -> mlock calls on free -> alloc of course, but that adds two unnecessary syscalls, heavy mmap_sem write locks, vma splits and re-merges - only to get rid of the backing pages. Users also mlockall(MCL_ONFAULT) to suppress sustained paging, but are okay with on-demand initial population. It seems valid to selectively free some memory during the lifetime of such a process, without having to mess with its overall policy. Why add a separate flag? Isn't this a pretty niche usecase? - MADV_DONTNEED has been bailing on locked vmas forever. It's at least conceivable that someone, somewhere is relying on mlock to protect data from perhaps broader invalidation calls. Changing this behavior now could lead to quiet data corruption. - It also clarifies expectations around MADV_FREE and maybe MADV_REMOVE. It avoids the situation where one quietly behaves different than the others. MADV_FREE_LOCKED can be added later. - The combination of mlock() and madvise() in the first place is probably niche. But where it happens, I'd say that dropping pages from a locked region once they don't contain secrets or won't page anymore is much saner than relying on mlock to protect memory from speculative or errant invalidation calls. It's just that we can't change the default behavior because of the two previous points. Given that, an explicit new flag seems to make the most sense. [hannes@cmpxchg.org: fix mips build] Link: https://lkml.kernel.org/r/20220304171912.305060-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24	mm: fix race between MADV_FREE reclaim and blkdev direct IO read	Mauricio Faria de Oliveira
	Problem: ======= Userspace might read the zero-page instead of actual data from a direct IO read on a block device if the buffers have been called madvise(MADV_FREE) on earlier (this is discussed below) due to a race between page reclaim on MADV_FREE and blkdev direct IO read. - Race condition: ============== During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks if the page is not dirty, then discards its rmap PTE(s) (vs. remap back if the page is dirty). However, after try_to_unmap_one() returns to shrink_page_list(), it might keep the page _anyway_ if page_ref_freeze() fails (it expects exactly _one_ page reference, from the isolation for page reclaim). Well, blkdev_direct_IO() gets references for all pages, and on READ operations it only sets them dirty _later_. So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct IO read from block devices, and page reclaim happens during __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() returns, but BEFORE the pages are set dirty, the situation happens. The direct IO read eventually completes. Now, when userspace reads the buffers, the PTE is no longer there and the page fault handler do_anonymous_page() services that with the zero-page, NOT the data! A synthetic reproducer is provided. - Page faults: =========== If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't happen, because that faults-in all pages as writeable, so do_anonymous_page() sets up a new page/rmap/PTE, and that is used by direct IO. The userspace reads don't fault as the PTE is there (thus zero-page is not used/setup). But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE is no longer there; the subsequent page faults can't help: The data-read from the block device probably won't generate faults due to DMA (no MMU) but even in the case it wouldn't use DMA, that happens on different virtual addresses (not user-mapped addresses) because `struct bio_vec` stores `struct page` to figure addresses out (which are different from user-mapped addresses) for the read. Thus userspace reads (to user-mapped addresses) still fault, then do_anonymous_page() gets another `struct page` that would address/ map to other memory than the `struct page` used by `struct bio_vec` for the read. (The original `struct page` is not available, since it wasn't freed, as page_ref_freeze() failed due to more page refs. And even if it were available, its data cannot be trusted anymore.) Solution: ======== One solution is to check for the expected page reference count in try_to_unmap_one(). There should be one reference from the isolation (that is also checked in shrink_page_list() with page_ref_freeze()) plus one or more references from page mapping(s) (put in discard: label). Further references mean that rmap/PTE cannot be unmapped/nuked. (Note: there might be more than one reference from mapping due to fork()/clone() without CLONE_VM, which use the same `struct page` for references, until the copy-on-write page gets copied.) So, additional page references (e.g., from direct IO read) now prevent the rmap/PTE from being unmapped/dropped; similarly to the page is not freed per shrink_page_list()/page_ref_freeze()). - Races and Barriers: ================== The new check in try_to_unmap_one() should be safe in races with bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's done under the PTE lock. The fast path doesn't take the lock, but it checks if the PTE has changed and if so, it drops the reference and leaves the page for the slow path (which does take that lock). The fast path requires synchronization w/ full memory barrier: it writes the page reference count first then it reads the PTE later, while try_to_unmap() writes PTE first then it reads page refcount. And a second barrier is needed, as the page dirty flag should not be read before the page reference count (as in __remove_mapping()). (This can be a load memory barrier only; no writes are involved.) Call stack/comments: - try_to_unmap_one() - page_vma_mapped_walk() - map_pte() # see pte_offset_map_lock(): pte_offset_map() spin_lock() - ptep_get_and_clear() # write PTE - smp_mb() # (new barrier) GUP fast path - page_ref_count() # (new check) read refcount - page_vma_mapped_walk_done() # see pte_unmap_unlock(): pte_unmap() spin_unlock() - bio_iov_iter_get_pages() - __bio_iov_iter_get_pages() - iov_iter_get_pages() - get_user_pages_fast() - internal_get_user_pages_fast() # fast path - lockless_pages_from_mm() - gup_{pgd,p4d,pud,pmd,pte}_range() ptep = pte_offset_map() # not _lock() pte = ptep_get_lockless(ptep) page = pte_page(pte) try_grab_compound_head(page) # inc refcount # (RMW/barrier # on success) if (pte_val(pte) != pte_val(ptep)) # read PTE put_compound_head(page) # dec refcount # go slow path # slow path - __gup_longterm_unlocked() - get_user_pages_unlocked() - __get_user_pages_locked() - __get_user_pages() - follow_{page,p4d,pud,pmd}_mask() - follow_page_pte() ptep = pte_offset_map_lock() pte = ptep page = vm_normal_page(pte) try_grab_page(page) # inc refcount pte_unmap_unlock() - Huge Pages: ========== Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) thus should reach shrink_page_list() -> split_huge_page_to_list() before try_to_unmap[_one](), so it deals with normal pages only. (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens, which should not or be rare, the page refcount should be greater than mapcount: the head page is referenced by tail pages. That also prevents checking the head `page` then incorrectly call page_remove_rmap(subpage) for a tail page, that isn't even in the shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), as it might happen today in this unlikely scenario.) MADV_FREE'd buffers: =================== So, back to the "if MADV_FREE pages are used as buffers" note. The case is arguable, and subject to multiple interpretations. The madvise(2) manual page on the MADV_FREE advice value says: 1) 'After a successful MADV_FREE ... data will be lost when the kernel frees the pages.' 2) 'the free operation will be canceled if the caller writes into the page' / 'subsequent writes ... will succeed and then [the] kernel cannot free those dirtied pages' 3) 'If there is no subsequent write, the kernel can free the pages at any time.' Thoughts, questions, considerations... respectively: 1) Since the kernel didn't actually free the page (page_ref_freeze() failed), should the data not have been lost? (on userspace read.) 2) Should writes performed by the direct IO read be able to cancel the free operation? - Should the direct IO read be considered as 'the caller' too, as it's been requested by 'the caller'? - Should the bio technique to dirty pages on return to userspace (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) be considered in another/special way here? 3) Should an upcoming write from a previously requested direct IO read be considered as a subsequent write, so the kernel should not free the pages? (as it's known at the time of page reclaim.) And lastly: Technically, the last point would seem a reasonable consideration and balance, as the madvise(2) manual page apparently (and fairly) seem to assume that 'writes' are memory access from the userspace process (not explicitly considering writes from the kernel or its corner cases; again, fairly).. plus the kernel fix implementation for the corner case of the largely 'non-atomic write' encompassed by a direct IO read operation, is relatively simple; and it helps. Reproducer: ========== @ test.c (simplified, but works) #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main() { int fd, i; char buf; fd = open(DEV, O_RDONLY \| O_DIRECT); buf = mmap(NULL, BUF_SIZE, PROT_READ \| PROT_WRITE, MAP_PRIVATE \| MAP_ANONYMOUS, -1, 0); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) buf[i] = 1; // init to non-zero madvise(buf, BUF_SIZE, MADV_FREE); read(fd, buf, BUF_SIZE); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) printf("%p: 0x%x\n", &buf[i], buf[i]); return 0; } @ block/fops.c (formerly fs/block_dev.c) +#include <linux/swap.h> ... ... __blkdev_direct_IO[_simple](...) { ... + if (!strcmp(current->comm, "good")) + shrink_all_memory(ULONG_MAX); + ret = bio_iov_iter_get_pages(...); + + if (!strcmp(current->comm, "bad")) + shrink_all_memory(ULONG_MAX); ... } @ shell # NUM_PAGES=4 # PAGE_SIZE=$(getconf PAGE_SIZE) # yes \| dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES} # DEV=$(losetup -f --show test.img) # gcc -DDEV=\"$DEV\" \ -DBUF_SIZE=$((PAGE_SIZE NUM_PAGES)) \ -DPAGE_SIZE=${PAGE_SIZE} \ test.c -o test # od -tx1 $DEV 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a * 0040000 # mv test good # ./good 0x7f7c10418000: 0x79 0x7f7c10419000: 0x79 0x7f7c1041a000: 0x79 0x7f7c1041b000: 0x79 # mv good bad # ./bad 0x7fa1b8050000: 0x0 0x7fa1b8051000: 0x0 0x7fa1b8052000: 0x0 0x7fa1b8053000: 0x0 Note: the issue is consistent on v5.17-rc3, but it's intermittent with the support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c]. - v5.17-rc3: # for i in {1..1000}; do ./good; done \ \| cut -d: -f2 \| sort \| uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ \| cut -d: -f2 \| sort \| uniq -c 4000 0x0 # free \| grep Swap Swap: 0 0 0 - v4.5: # for i in {1..1000}; do ./good; done \ \| cut -d: -f2 \| sort \| uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ \| cut -d: -f2 \| sort \| uniq -c 2702 0x0 1298 0x79 # swapoff -av swapoff /swap # for i in {1..1000}; do ./bad; done \ \| cut -d: -f2 \| sort \| uniq -c 4000 0x79 Ceph/TCMalloc: ============= For documentation purposes, the use case driving the analysis/fix is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to release unused memory to the system from the mmap'ed page heap (might be committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. Note: TCMalloc switched back to MADV_DONTNEED a few commits after the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue just 'disappeared' on Ceph on later Ubuntu releases but is still present in the kernel, and can be hit by other use cases. The observed issue seems to be the old Ceph bug #22464 [1], where checksum mismatches are observed (and instrumentation with buffer dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). The issue in Ceph was reasonably deemed a kernel bug (comment #50) and mostly worked around with a retry mechanism, but other parts of Ceph could still hit that (rocksdb). Anyway, it's less likely to be hit again as TCMalloc switched out of MADV_FREE by default. (Some kernel versions/reports from the Ceph bug, and relation with the MADV_FREE introduction/changes; TCMalloc versions not checked.) - 4.4 good - 4.5 (madv_free: introduction) - 4.9 bad - 4.10 good? maybe a swapless system - 4.12 (madv_free: no longer free instantly on swapless systems) - 4.13 bad [1] https://tracker.ceph.com/issues/22464 Thanks: ====== Several people contributed to analysis/discussions/tests/reproducers in the first stages when drilling down on ceph/tcmalloc/linux kernel: - Dan Hill - Dan Streetman - Dongdong Tao - Gavin Guo - Gerald Yang - Heitor Alves de Siqueira - Ioanna Alifieraki - Jay Vosburgh - Matthew Ruffell - Ponnuvel Palaniyappan Reviews, suggestions, corrections, comments: - Minchan Kim - Yu Zhao - Huang, Ying - John Hubbard - Christoph Hellwig [mfo@canonical.com: v4] Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages") Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Hill <daniel.hill@canonical.com> Cc: Dan Streetman <dan.streetman@canonical.com> Cc: Dongdong Tao <dongdong.tao@canonical.com> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Gerald Yang <gerald.yang@canonical.com> Cc: Heitor Alves de Siqueira <halves@canonical.com> Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com> Cc: <stable@vger.kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24	mm: generalize ARCH_HAS_FILTER_PGPROT	Anshuman Khandual
	ARCH_HAS_FILTER_PGPROT config has duplicate definitions on platforms that subscribe it. Instead make it a generic config option which can be selected on applicable platforms when required. Link: https://lkml.kernel.org/r/1643004823-16441-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24	mm: unmap_mapping_range_tree() with i_mmap_rwsem shared	Hugh Dickins
	Revert 48ec833b7851 ("Revert "mm/memory.c: share the i_mmap_rwsem"") to reinstate c8475d144abb ("mm/memory.c: share the i_mmap_rwsem"): the unmap_mapping_range family of functions do the unmapping of user pages (ultimately via zap_page_range_single) without modifying the interval tree itself, and unmapping races are necessarily guarded by page table lock, thus the i_mmap_rwsem should be shared in unmap_mapping_pages() and unmap_mapping_folio(). Commit 48ec833b7851 was intended as a short-term measure, allowing the other shared lock changes into 3.19 final, before investigating three trinity crashes, one of which had been bisected to commit c8475d144ab: [1] https://lkml.org/lkml/2014/11/14/342 https://lore.kernel.org/lkml/5466142C.60100@oracle.com/ [2] https://lkml.org/lkml/2014/12/22/213 https://lore.kernel.org/lkml/549832E2.8060609@oracle.com/ [3] https://lkml.org/lkml/2014/12/9/741 https://lore.kernel.org/lkml/5487ACC5.1010002@oracle.com/ Two of those were Bad page states: free_pages_prepare() found PG_mlocked still set - almost certain to have been fixed by 4.4 commit b87537d9e2fe ("mm: rmap use pte lock not mmap_sem to set PageMlocked"). The NULL deref on rwsem in [2]: unclear, only happened once, not bisected to c8475d144ab. No change to the i_mmap_lock_write() around __unmap_hugepage_range_final() in unmap_single_vma(): IIRC that's a special usage, helping to serialize hugetlbfs page table sharing, not to be dabbled with lightly. No change to other uses of i_mmap_lock_write() by hugetlbfs. I am not aware of any significant gains from the concurrency allowed by this commit: it is submitted more to resolve an ancient misunderstanding. Link: https://lkml.kernel.org/r/e4a5e356-6c87-47b2-3ce8-c2a95ae84e20@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Sasha Levin <sashal@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24	mm: warn on deleting redirtied only if accounted	Hugh Dickins
	filemap_unaccount_folio() has a WARN_ON_ONCE(folio_test_dirty(folio)). It is good to warn of late dirtying on a persistent filesystem, but late dirtying on tmpfs can only lose data which is expected to be thrown away; and it's a pity if that warning comes ONCE on tmpfs, then hides others which really matter. Make it conditional on mapping_cap_writeback(). Cleanup: then folio_account_cleaned() no longer needs to check that for itself, and so no longer needs to know the mapping. Link: https://lkml.kernel.org/r/b5a1106c-7226-a5c6-ad41-ad4832cae1f@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jan Kara <jack@suse.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24	mm/huge_memory: remove stale locking logic from __split_huge_pmd()	David Hildenbrand
	Let's remove the stale logic that was required for reuse_swap_page(). [akpm@linux-foundation.org: simplification, per Yang Shi] Link: https://lkml.kernel.org/r/20220131162940.210846-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Rientjes <rientjes@google.com> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liang Zhang <zhangliang5@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>