summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-06-25powerpc/pseries/dlpar: use rtas_get_sensor()Nathan Lynch
Instead of making bare calls to get-sensor-state, use rtas_get_sensor(), which correctly handles busy and extended delay statuses. Fixes: ab519a011caa ("powerpc/pseries: Kernel DLPAR Infrastructure") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210504025329.1713878-1-nathanl@linux.ibm.com
2021-06-25powerpc/rtas-rtc: remove unused constantNathan Lynch
RTAS_CLOCK_BUSY is unused, remove it. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210503175811.1528208-1-nathanl@linux.ibm.com
2021-06-25powerpc/papr_scm: trivial: fix typo in a commentKajol Jain
There is a spelling mistake "byes" -> "bytes" in a comment of function drc_pmem_query_stats(). Fix that typo. Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210418074003.6651-1-kjain@linux.ibm.com
2021-06-25powerpc: Fix is_kvm_guest() / kvm_para_available()Michael Ellerman
Commit a21d1becaa3f ("powerpc: Reintroduce is_kvm_guest() as a fast-path check") added is_kvm_guest() and changed kvm_para_available() to use it. is_kvm_guest() checks a static key, kvm_guest, and that static key is set in check_kvm_guest(). The problem is check_kvm_guest() is only called on pseries, and even then only in some configurations. That means is_kvm_guest() always returns false on all non-pseries and some pseries depending on configuration. That's a bug. For PR KVM guests this is noticable because they no longer do live patching of themselves, which can be detected by the omission of a message in dmesg such as: KVM: Live patching for a fast VM worked To fix it make check_kvm_guest() an initcall, to ensure it's always called at boot. It needs to be core so that it runs before kvm_guest_init() which is postcore. To be an initcall it needs to return int, where 0 means success, so update that. We still call it manually in pSeries_smp_probe(), because that runs before init calls are run. Fixes: a21d1becaa3f ("powerpc: Reintroduce is_kvm_guest() as a fast-path check") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210623130514.2543232-1-mpe@ellerman.id.au
2021-06-25powerpc/64s: Make prom_init require RELOCATABLEMichael Ellerman
When we boot from open firmware (OF) using PPC_OF_BOOT_TRAMPOLINE, aka. prom_init, we run parts of the kernel at an address other than the link address. That happens because OF loads the kernel above zero (OF is at zero) and we run prom_init before copying the kernel down to zero. Currently that works even for non-relocatable kernels, because we do various fixups to the prom_init code to make it run where it's loaded. However those fixups are not sufficient if the kernel becomes large enough. In that case prom_init()'s final call to __start() can end up generating a plt branch: bl c000000002000018 <00000078.plt_branch.__start> That results in the kernel jumping to the linked address of __start, 0xc000000000000000, when really it needs to jump to the 0xc000000000000000 + the runtime address because the kernel is still running at the load address. We could do further shenanigans to handle that, see Jordan's patch for example: https://lore.kernel.org/linuxppc-dev/20210421021721.1539289-1-jniethe5@gmail.com However it is much simpler to just require a kernel with prom_init() to be built relocatable. The result works in all configurations without further work, and requires less code. This should have no effect on most people, as our defconfigs and essentially all distro configs already have RELOCATABLE enabled. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210623130454.2542945-1-mpe@ellerman.id.au
2021-06-25selftests/powerpc: Use req_max_processed_len from sysfs NX capabilitiesHaren Myneni
On PowerVM, the hypervisor defines the maximum buffer length for each NX request and the kernel exported this value via sysfs. This patch reads this value if the sysfs entry is available and is used to limit the request length. Signed-off-by: Haren Myneni <haren@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/ed908341b1eb7ca0183c028a4ed4a0cf48bfe0f6.camel@linux.ibm.com
2021-06-25powerpc/bpf: Use bctrl for making function callsNaveen N. Rao
blrl corrupts the link stack. Instead use bctrl when making function calls from BPF programs. Reported-by: Anton Blanchard <anton@ozlabs.org> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210609090024.1446800-1-naveen.n.rao@linux.vnet.ibm.com
2021-06-25powerpc/xmon: Add support for running a command on all cpus in xmonNaveen N. Rao
It is sometimes desirable to run a command on all cpus in xmon. A typical scenario is to obtain the backtrace from all cpus in xmon if there is a soft lockup. Add rudimentary support for the same. The command to be run on all cpus should be prefixed with 'c#'. As an example, 'c#t' will run 't' command and produce a backtrace on all cpus in xmon. Since many xmon commands are not sensible for running in this manner, we only allow a predefined list of commands -- 'r', 'S' and 't' for now. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210601074801.617363-1-naveen.n.rao@linux.vnet.ibm.com
2021-06-25powerpc/configs: Enable STACK_TRACER and FTRACE_SYSCALLS in some of the configsNaveen N. Rao
Both these config options are generally enabled in distro kernels. Enable the same in a few powerpc64 configs to get better coverage and testing. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210524120227.3333208-1-naveen.n.rao@linux.vnet.ibm.com
2021-06-25powerpc/kprobes: Warn if instruction patching failedNaveen N. Rao
When arming and disarming probes, we currently assume that instruction patching can never fail, and don't have a mechanism to surface errors. Add a warning in case instruction patching ever fails. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/18d7b1309f938c08ce07738100932b551bdd3a52.1621416666.git.naveen.n.rao@linux.vnet.ibm.com
2021-06-25powerpc/kprobes: Roll IS_RFI() macro into IS_RFID()Naveen N. Rao
In kprobes and xmon, we should exclude both 32-bit and 64-bit variants of mtmsr and rfi instructions from being stepped. Have IS_RFID() also detect a rfi instruction similar to IS_MTMSRD(). Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/eee32e1b75dae85d471c89b4c0a123ad4b0aabf8.1621416666.git.naveen.n.rao@linux.vnet.ibm.com
2021-06-25powerpc/papr_scm: Add support for reporting dirty-shutdown-countVaibhav Jain
Persistent memory devices like NVDIMMs can loose cached writes in case something prevents flush on power-fail. Such situations are termed as dirty shutdown and are exposed to applications as last-shutdown-state (LSS) flag and a dirty-shutdown-counter(DSC) as described at [1]. The latter being useful in conditions where multiple applications want to detect a dirty shutdown event without racing with one another. PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a dirty-shutdown-state. This patch further adds support for DSC via the "ibm,persistence-failed-count" device tree property of an NVDIMM. This property is a monotonic increasing 64-bit counter thats an indication of number of times an NVDIMM has encountered a dirty-shutdown event causing persistence loss. Since this value is not expected to change after system-boot hence papr_scm reads & caches its value during NVDIMM probe and exposes it as a PAPR sysfs attributed named 'dirty_shutdown' to match the name of similarly named NFIT sysfs attribute. Also this value is available to libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health' has been extended to add a new member called 'dimm_dsc' presence of which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag. References: [1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210624080621.252038-1-vaibhav@linux.ibm.com
2021-06-25powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailableVaibhav Jain
In case performance stats for an nvdimm are not available, reading the 'perf_stats' sysfs file returns an -ENOENT error. A better approach is to make the 'perf_stats' file entirely invisible to indicate that performance stats for an nvdimm are unavailable. So this patch updates 'papr_nd_attribute_group' to add a 'is_visible' callback implemented as newly introduced 'papr_nd_attribute_visible()' that returns an appropriate mode in case performance stats aren't supported in a given nvdimm. Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is available when 'papr_nd_attribute_visible()' is called during nvdimm initialization. Even though 'perf_stats' attribute is available since v5.9, there are no known user-space tools/scripts that are dependent on presence of its sysfs file. Hence I dont expect any user-space breakage with this patch. Fixes: 2d02bf835e57 ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP") Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210513092349.285021-1-vaibhav@linux.ibm.com
2021-06-25powerpc/kprobes: Fix Oops by passing ppc_inst as a pointer to emulate_step() ↵Naveen N. Rao
on ppc32 Trying to use a kprobe on ppc32 results in the below splat: BUG: Unable to handle kernel data access on read at 0x7c0802a6 Faulting instruction address: 0xc002e9f0 Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=4K PowerPC 44x Platform Modules linked in: CPU: 0 PID: 89 Comm: sh Not tainted 5.13.0-rc1-01824-g3a81c0495fdb #7 NIP: c002e9f0 LR: c0011858 CTR: 00008a47 REGS: c292fd50 TRAP: 0300 Not tainted (5.13.0-rc1-01824-g3a81c0495fdb) MSR: 00009000 <EE,ME> CR: 24002002 XER: 20000000 DEAR: 7c0802a6 ESR: 00000000 <snip> NIP [c002e9f0] emulate_step+0x28/0x324 LR [c0011858] optinsn_slot+0x128/0x10000 Call Trace: opt_pre_handler+0x7c/0xb4 (unreliable) optinsn_slot+0x128/0x10000 ret_from_syscall+0x0/0x28 The offending instruction is: 81 24 00 00 lwz r9,0(r4) Here, we are trying to load the second argument to emulate_step(): struct ppc_inst, which is the instruction to be emulated. On ppc64, structures are passed in registers when passed by value. However, per the ppc32 ABI, structures are always passed to functions as pointers. This isn't being adhered to when setting up the call to emulate_step() in the optprobe trampoline. Fix the same. Fixes: eacf4c0202654a ("powerpc: Enable OPTPROBES on PPC32") Cc: stable@vger.kernel.org Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/5bdc8cbc9a95d0779e27c9ddbf42b40f51f883c0.1624425798.git.christophe.leroy@csgroup.eu
2021-06-24mailmap: add Marek's other e-mail address and identity without diacriticsMarek Behún
Some of my commits were sent with identities Marek Behun <marek.behun@nic.cz> Marek Behún <marek.behun@nic.cz> while the correct one is Marek Behún <kabel@kernel.org> Put this into mailmap so that git shortlog prints all my commits under one identity. Link: https://lkml.kernel.org/r/20210616113624.19351-2-kabel@kernel.org Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24MAINTAINERS: fix Marek's identity againMarek Behún
Fix my name to use diacritics, since MAINTAINERS supports it. Fix my e-mail address in MAINTAINERS' marvell10g PHY driver description, I accidentally put my other e-mail address here. Link: https://lkml.kernel.org/r/20210616113624.19351-1-kabel@kernel.org Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/page_alloc: do bulk array bounds check after checking populated elementsMel Gorman
Dan Carpenter reported the following The patch 0f87d9d30f21: "mm/page_alloc: add an array-based interface to the bulk page allocator" from Apr 29, 2021, leads to the following static checker warning: mm/page_alloc.c:5338 __alloc_pages_bulk() warn: potentially one past the end of array 'page_array[nr_populated]' The problem can occur if an array is passed in that is fully populated. That potentially ends up allocating a single page and storing it past the end of the array. This patch returns 0 if the array is fully populated. Link: https://lkml.kernel.org/r/20210618125102.GU30378@techsingularity.net Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator") Signed-off-by: Mel Gorman <mgorman@techsinguliarity.net> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing arrayRasmus Villemoes
In the event that somebody would call this with an already fully populated page_array, the last loop iteration would do an access beyond the end of page_array. It's of course extremely unlikely that would ever be done, but this triggers my internal static analyzer. Also, if it really is not supposed to be invoked this way (i.e., with no NULL entries in page_array), the nr_populated<nr_pages check could simply be removed instead. Link: https://lkml.kernel.org/r/20210507064504.1712559-1-linux@rasmusvillemoes.dk Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator") Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/hwpoison: do not lock page again when me_huge_page() successfully recoversNaoya Horiguchi
Currently me_huge_page() temporary unlocks page to perform some actions then locks it again later. My testcase (which calls hard-offline on some tail page in a hugetlb, then accesses the address of the hugetlb range) showed that page allocation code detects this page lock on buddy page and printed out "BUG: Bad page state" message. check_new_page_bad() does not consider a page with __PG_HWPOISON as bad page, so this flag works as kind of filter, but this filtering doesn't work in this case because the "bad page" is not the actual hwpoisoned page. So stop locking page again. Actions to be taken depend on the page type of the error, so page unlocking should be done in ->action() callbacks. So let's make it assumed and change all existing callbacks that way. Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm,hwpoison: return -EHWPOISON to denote that the page has already been poisonedAili Yao
When memory_failure() is called with MF_ACTION_REQUIRED on the page that has already been hwpoisoned, memory_failure() could fail to send SIGBUS to the affected process, which results in infinite loop of MCEs. Currently memory_failure() returns 0 if it's called for already hwpoisoned page, then the caller, kill_me_maybe(), could return without sending SIGBUS to current process. An action required MCE is raised when the current process accesses to the broken memory, so no SIGBUS means that the current process continues to run and access to the error page again soon, so running into MCE loop. This issue can arise for example in the following scenarios: - Two or more threads access to the poisoned page concurrently. If local MCE is enabled, MCE handler independently handles the MCE events. So there's a race among MCE events, and the second or latter threads fall into the situation in question. - If there was a precedent memory error event and memory_failure() for the event failed to unmap the error page for some reason, the subsequent memory access to the error page triggers the MCE loop situation. To fix the issue, make memory_failure() return an error code when the error page has already been hwpoisoned. This allows memory error handler to control how it sends signals to userspace. And make sure that any process touching a hwpoisoned page should get a SIGBUS even in "already hwpoisoned" path of memory_failure() as is done in page fault path. Link: https://lkml.kernel.org/r/20210521030156.2612074-3-nao.horiguchi@gmail.com Signed-off-by: Aili Yao <yaoaili@kingsoft.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jue Wang <juew@google.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/memory-failure: use a mutex to avoid memory_failure() racesTony Luck
Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5. I wrote this patchset to materialize what I think is the current allowable solution mentioned by the previous discussion [1]. I simply borrowed Tony's mutex patch and Aili's return code patch, then I queued another one to find error virtual address in the best effort manner. I know that this is not a perfect solution, but should work for some typical case. [1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/ This patch (of 2): There can be races when multiple CPUs consume poison from the same page. The first into memory_failure() atomically sets the HWPoison page flag and begins hunting for tasks that map this page. Eventually it invalidates those mappings and may send a SIGBUS to the affected tasks. But while all that work is going on, other CPUs see a "success" return code from memory_failure() and so they believe the error has been handled and continue executing. Fix by wrapping most of the internal parts of memory_failure() in a mutex. [akpm@linux-foundation.org: make mf_mutex local to memory_failure()] Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jue Wang <juew@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm, futex: fix shared futex pgoff on shmem huge pageHugh Dickins
If more than one futex is placed on a shmem huge page, it can happen that waking the second wakes the first instead, and leaves the second waiting: the key's shared.pgoff is wrong. When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when generating futex_key"), the only shared huge pages came from hugetlbfs, and the code added to deal with its exceptional page->index was put into hugetlb source. Then that was missed when 4.8 added shmem huge pages. page_to_pgoff() is what others use for this nowadays: except that, as currently written, it gives the right answer on hugetlbfs head, but nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific hugetlb_basepage_index() on PageHuge tails as well as on head. Yes, it's unconventional to declare hugetlb_basepage_index() there in pagemap.h, rather than in hugetlb.h; but I do not expect anything but page_to_pgoff() ever to need it. [akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope] Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Reported-by: Neel Natu <neelnatu@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Zhang Yi <wetpzy@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24kthread: prevent deadlock when kthread_mod_delayed_work() races with ↵Petr Mladek
kthread_cancel_delayed_work_sync() The system might hang with the following backtrace: schedule+0x80/0x100 schedule_timeout+0x48/0x138 wait_for_common+0xa4/0x134 wait_for_completion+0x1c/0x2c kthread_flush_work+0x114/0x1cc kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144 kthread_cancel_delayed_work_sync+0x18/0x2c xxxx_pm_notify+0xb0/0xd8 blocking_notifier_call_chain_robust+0x80/0x194 pm_notifier_call_chain_robust+0x28/0x4c suspend_prepare+0x40/0x260 enter_state+0x80/0x3f4 pm_suspend+0x60/0xdc state_store+0x108/0x144 kobj_attr_store+0x38/0x88 sysfs_kf_write+0x64/0xc0 kernfs_fop_write_iter+0x108/0x1d0 vfs_write+0x2f4/0x368 ksys_write+0x7c/0xec It is caused by the following race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync(): CPU0 CPU1 Context: Thread A Context: Thread B kthread_mod_delayed_work() spin_lock() __kthread_cancel_work() spin_unlock() del_timer_sync() kthread_cancel_delayed_work_sync() spin_lock() __kthread_cancel_work() spin_unlock() del_timer_sync() spin_lock() work->canceling++ spin_unlock spin_lock() queue_delayed_work() // dwork is put into the worker->delayed_work_list spin_unlock() kthread_flush_work() // flush_work is put at the tail of the dwork wait_for_completion() Context: IRQ kthread_delayed_work_timer_fn() spin_lock() list_del_init(&work->node); spin_unlock() BANG: flush_work is not longer linked and will never get proceed. The problem is that kthread_mod_delayed_work() checks work->canceling flag before canceling the timer. A simple solution is to (re)check work->canceling after __kthread_cancel_work(). But then it is not clear what should be returned when __kthread_cancel_work() removed the work from the queue (list) and it can't queue it again with the new @delay. The return value might be used for reference counting. The caller has to know whether a new work has been queued or an existing one was replaced. The proper solution is that kthread_mod_delayed_work() will remove the work from the queue (list) _only_ when work->canceling is not set. The flag must be checked after the timer is stopped and the remaining operations can be done under worker->lock. Note that kthread_mod_delayed_work() could remove the timer and then bail out. It is fine. The other canceling caller needs to cancel the timer as well. The important thing is that the queue (list) manipulation is done atomically under worker->lock. Link: https://lkml.kernel.org/r/20210610133051.15337-3-pmladek@suse.com Fixes: 9a6b06c8d9a220860468a ("kthread: allow to modify delayed kthread work") Signed-off-by: Petr Mladek <pmladek@suse.com> Reported-by: Martin Liu <liumartin@google.com> Cc: <jenhaochen@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24kthread_worker: split code for canceling the delayed work timerPetr Mladek
Patch series "kthread_worker: Fix race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync()". This patchset fixes the race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync() including proper return value handling. This patch (of 2): Simple code refactoring as a preparation step for fixing a race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync(). It does not modify the existing behavior. Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: <jenhaochen@google.com> Cc: Martin Liu <liumartin@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/vmalloc: unbreak kasan vmalloc supportDaniel Axtens
In commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"), __vmalloc_node_range was changed such that __get_vm_area_node was no longer called with the requested/real size of the vmalloc allocation, but rather with a rounded-up size. This means that __get_vm_area_node called kasan_unpoision_vmalloc() with a rounded up size rather than the real size. This led to it allowing access to too much memory and so missing vmalloc OOBs and failing the kasan kunit tests. Pass the real size and the desired shift into __get_vm_area_node. This allows it to round up the size for the underlying allocators while still unpoisioning the correct quantity of shadow memory. Adjust the other call-sites to pass in PAGE_SHIFT for the shift value. Link: https://lkml.kernel.org/r/20210617081330.98629-1-dja@axtens.net Link: https://bugzilla.kernel.org/show_bug.cgi?id=213335 Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Daniel Axtens <dja@axtens.net> Tested-by: David Gow <davidgow@google.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Andrey Konovalov <andreyknvl@gmail.com> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24KVM: s390: prepare for hugepage vmallocClaudio Imbrenda
The Create Secure Configuration Ultravisor Call does not support using large pages for the virtual memory area. This is a hardware limitation. This patch replaces the vzalloc call with an almost equivalent call to the newly introduced vmalloc_no_huge function, which guarantees that only small pages will be used for the backing. The new call will not clear the allocated memory, but that has never been an actual requirement. Link: https://lkml.kernel.org/r/20210614132357.10202-3-imbrenda@linux.ibm.com Fixes: 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/vmalloc: add vmalloc_no_hugeClaudio Imbrenda
Patch series "mm: add vmalloc_no_huge and use it", v4. Add vmalloc_no_huge() and export it, so modules can allocate memory with small pages. Use the newly added vmalloc_no_huge() in KVM on s390 to get around a hardware limitation. This patch (of 2): Commit 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") added support for hugepage vmalloc mappings, it also added the flag VM_NO_HUGE_VMAP for __vmalloc_node_range to request the allocation to be performed with 0-order non-huge pages. This flag is not accessible when calling vmalloc, the only option is to call directly __vmalloc_node_range, which is not exported. This means that a module can't vmalloc memory with small pages. Case in point: KVM on s390x needs to vmalloc a large area, and it needs to be mapped with non-huge pages, because of a hardware limitation. This patch adds the function vmalloc_no_huge, which works like vmalloc, but it is guaranteed to always back the mapping using small pages. This new function is exported, therefore it is usable by modules. [akpm@linux-foundation.org: whitespace fixes, per Christoph] Link: https://lkml.kernel.org/r/20210614132357.10202-1-imbrenda@linux.ibm.com Link: https://lkml.kernel.org/r/20210614132357.10202-2-imbrenda@linux.ibm.com Fixes: 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24nilfs2: fix memory leak in nilfs_sysfs_delete_device_groupPavel Skripkin
My local syzbot instance hit memory leak in nilfs2. The problem was in missing kobject_put() in nilfs_sysfs_delete_device_group(). kobject_del() does not call kobject_cleanup() for passed kobject and it leads to leaking duped kobject name if kobject_put() was not called. Fail log: BUG: memory leak unreferenced object 0xffff8880596171e0 (size 8): comm "syz-executor379", pid 8381, jiffies 4294980258 (age 21.100s) hex dump (first 8 bytes): 6c 6f 6f 70 30 00 00 00 loop0... backtrace: kstrdup+0x36/0x70 mm/util.c:60 kstrdup_const+0x53/0x80 mm/util.c:83 kvasprintf_const+0x108/0x190 lib/kasprintf.c:48 kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289 kobject_add_varg lib/kobject.c:384 [inline] kobject_init_and_add+0xc9/0x160 lib/kobject.c:473 nilfs_sysfs_create_device_group+0x150/0x800 fs/nilfs2/sysfs.c:999 init_nilfs+0xe26/0x12b0 fs/nilfs2/the_nilfs.c:637 Link: https://lkml.kernel.org/r/20210612140559.20022-1-paskripkin@gmail.com Fixes: da7141fb78db ("nilfs2: add /sys/fs/nilfs2/<device> group") Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()Hugh Dickins
Aha! Shouldn't that quick scan over pte_none()s make sure that it holds ptlock in the PVMW_SYNC case? That too might have been responsible for BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though I've never seen any. Link: https://lkml.kernel.org/r/1bdf384c-8137-a149-2a1e-475a4791c3c@google.com Link: https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/ Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Wang Yugui <wangyugui@e16-tech.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/thp: fix page_vma_mapped_walk() if THP mapped by ptesHugh Dickins
Running certain tests with a DEBUG_VM kernel would crash within hours, on the total_mapcount BUG() in split_huge_page_to_list(), while trying to free up some memory by punching a hole in a shmem huge page: split's try_to_unmap() was unable to find all the mappings of the page (which, on a !DEBUG_VM kernel, would then keep the huge page pinned in memory). Crash dumps showed two tail pages of a shmem huge page remained mapped by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end of a long unmapped range; and no page table had yet been allocated for the head of the huge page to be mapped into. Although designed to handle these odd misaligned huge-page-mapped-by-pte cases, page_vma_mapped_walk() falls short by returning false prematurely when !pmd_present or !pud_present or !p4d_present or !pgd_present: there are cases when a huge page may span the boundary, with ptes present in the next. Restructure page_vma_mapped_walk() as a loop to continue in these cases, while keeping its layout much as before. Add a step_forward() helper to advance pvmw->address across those boundaries: originally I tried to use mm's standard p?d_addr_end() macros, but hit the same crash 512 times less often: because of the way redundant levels are folded together, but folded differently in different configurations, it was just too difficult to use them correctly; and step_forward() is simpler anyway. Link: https://lkml.kernel.org/r/fedb8632-1798-de42-f39e-873551d5bc81@google.com Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): get vma_address_end() earlierHugh Dickins
page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the start, rather than later at next_pte. It's a little unnecessary overhead on the first call, but makes for a simpler loop in the following commit. Link: https://lkml.kernel.org/r/4542b34d-862f-7cb4-bb22-e0df6ce830a2@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): use goto instead of while (1)Hugh Dickins
page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte, and use "goto this_pte", in place of the "while (1)" loop at the end. Link: https://lkml.kernel.org/r/a52b234a-851-3616-2525-f42736e8934@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): add a level of indentationHugh Dickins
page_vma_mapped_walk() cleanup: add a level of indentation to much of the body, making no functional change in this commit, but reducing the later diff when this is all converted to a loop. [hughd@google.com: : page_vma_mapped_walk(): add a level of indentation fix] Link: https://lkml.kernel.org/r/7f817555-3ce1-c785-e438-87d8efdcaf26@google.com Link: https://lkml.kernel.org/r/efde211-f3e2-fe54-977-ef481419e7f3@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): crossing page table boundaryHugh Dickins
page_vma_mapped_walk() cleanup: adjust the test for crossing page table boundary - I believe pvmw->address is always page-aligned, but nothing else here assumed that; and remember to reset pvmw->pte to NULL after unmapping the page table, though I never saw any bug from that. Link: https://lkml.kernel.org/r/799b3f9c-2a9e-dfef-5d89-26e9f76fd97@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION blockHugh Dickins
page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to follow the same "return not_found, return not_found, return true" pattern as the block above it (note: returning not_found there is never premature, since existence or prior existence of huge pmd guarantees good alignment). Link: https://lkml.kernel.org/r/378c8650-1488-2edf-9647-32a53cf2e21@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): use pmde for *pvmw->pmdHugh Dickins
page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then use it in subsequent tests, instead of repeatedly dereferencing pointer. Link: https://lkml.kernel.org/r/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): settle PageHuge on entryHugh Dickins
page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case out of the way at the start, so no need to worry about it later. Link: https://lkml.kernel.org/r/e31a483c-6d73-a6bb-26c5-43c3b880a2@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): use page for pvmw->pageHugh Dickins
Patch series "mm: page_vma_mapped_walk() cleanup and THP fixes". I've marked all of these for stable: many are merely cleanups, but I think they are much better before the main fix than after. This patch (of 11): page_vma_mapped_walk() cleanup: sometimes the local copy of pvwm->page was used, sometimes pvmw->page itself: use the local copy "page" throughout. Link: https://lkml.kernel.org/r/589b358c-febc-c88e-d4c2-7834b37fa7bf@google.com Link: https://lkml.kernel.org/r/88e67645-f467-c279-bf5e-af4b5c6b13eb@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24ata: rb532_cf: remove redundant codesgushengxian
The codes "dev_err(&pdev->dev, "no IRQ resource found\n");" is redundant because platform_get_irq() already prints an error. Signed-off-by: gushengxian <gushengxian@yulong.com> Link: https://lore.kernel.org/r/20210622115507.359017-1-13145886936@163.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24blk: Fix lock inversion between ioc lock and bfqd lockJan Kara
Lockdep complains about lock inversion between ioc->lock and bfqd->lock: bfqd -> ioc: put_io_context+0x33/0x90 -> ioc->lock grabbed blk_mq_free_request+0x51/0x140 blk_put_request+0xe/0x10 blk_attempt_req_merge+0x1d/0x30 elv_attempt_insert_merge+0x56/0xa0 blk_mq_sched_try_insert_merge+0x4b/0x60 bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed blk_mq_sched_insert_requests+0xd6/0x2b0 blk_mq_flush_plug_list+0x154/0x280 blk_finish_plug+0x40/0x60 ext4_writepages+0x696/0x1320 do_writepages+0x1c/0x80 __filemap_fdatawrite_range+0xd7/0x120 sync_file_range+0xac/0xf0 ioc->bfqd: bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed put_io_context_active+0x78/0xb0 -> ioc->lock grabbed exit_io_context+0x48/0x50 do_exit+0x7e9/0xdd0 do_group_exit+0x54/0xc0 To avoid this inversion we change blk_mq_sched_try_insert_merge() to not free the merged request but rather leave that upto the caller similarly to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure to free all the merged requests after dropping bfqd->lock. Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24bfq: Remove merged request already in bfq_requests_merged()Jan Kara
Currently, bfq does very little in bfq_requests_merged() and handles all the request cleanup in bfq_finish_requeue_request() called from blk_mq_free_request(). That is currently safe only because blk_mq_free_request() is called shortly after bfq_requests_merged() while bfqd->lock is still held. However to fix a lock inversion between bfqd->lock and ioc->lock, we need to call blk_mq_free_request() after dropping bfqd->lock. That would mean that already merged request could be seen by other processes inside bfq queues and possibly dispatched to the device which is wrong. So move cleanup of the request from bfq_finish_requeue_request() to bfq_requests_merged(). Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210623093634.27879-2-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24net: bcmgenet: Add mdio-bcm-unimac soft dependencyJian-Hong Pan
The Broadcom UniMAC MDIO bus from mdio-bcm-unimac module comes too late. So, GENET cannot find the ethernet PHY on UniMAC MDIO bus. This leads GENET fail to attach the PHY as following log: bcmgenet fd580000.ethernet: GENET 5.0 EPHY: 0x0000 ... could not attach to PHY bcmgenet fd580000.ethernet eth0: failed to connect to PHY uart-pl011 fe201000.serial: no DMA platform data libphy: bcmgenet MII bus: probed ... unimac-mdio unimac-mdio.-19: Broadcom UniMAC MDIO bus It is not just coming too late, there is also no way for the module loader to figure out the dependency between GENET and its MDIO bus driver unless we provide this MODULE_SOFTDEP hint. This patch adds the soft dependency to load mdio-bcm-unimac module before genet module to fix this issue. Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=213485 Fixes: 9a4e79697009 ("net: bcmgenet: utilize generic Broadcom UniMAC MDIO controller driver") Signed-off-by: Jian-Hong Pan <jhp@endlessos.org> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24ipv6: delete useless dst check in ip6_dst_lookup_tailzhang kai
parameter dst always points to null. Signed-off-by: zhang kai <zhangkaiheb@126.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24net: dsa: sja1105: fix NULL pointer dereference in sja1105_reload_cbs()Vladimir Oltean
priv->cbs is an array of priv->info->num_cbs_shapers elements of type struct sja1105_cbs_entry which only get allocated if CONFIG_NET_SCH_CBS is enabled. However, sja1105_reload_cbs() is called from sja1105_static_config_reload() which in turn is called for any of the items in sja1105_reset_reasons, therefore during the normal runtime of the driver and not just from a code path which can be triggered by the tc-cbs offload. The sja1105_reload_cbs() function does not contain a check whether the priv->cbs array is NULL or not, it just assumes it isn't and proceeds to iterate through the credit-based shaper elements. This leads to a NULL pointer dereference. The solution is to return success if the priv->cbs array has not been allocated, since sja1105_reload_cbs() has nothing to do. Fixes: 4d7525085a9b ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24PCI: tegra194: Fix host initialization during resumeVidya Sagar
Commit 275e88b06a27 ("PCI: tegra: Fix host link initialization") broke host initialization during resume as it misses out calling the API dw_pcie_setup_rc() which is required for host and MSI initialization. Link: https://lore.kernel.org/r/20210504172157.29712-1-vidyas@nvidia.com Fixes: 275e88b06a27 ("PCI: tegra: Fix host link initialization") Tested-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Vidya Sagar <vidyas@nvidia.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-06-24PCI: tegra: Add missing MODULE_DEVICE_TABLEZou Wei
Add missing MODULE_DEVICE_TABLE definition so we generate correct modalias for automatic loading of this driver when it is built as a module. Link: https://lore.kernel.org/r/1620792422-16535-1-git-send-email-zou_wei@huawei.com Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zou Wei <zou_wei@huawei.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Vidya Sagar <vidyas@nvidia.com> Acked-by: Thierry Reding <treding@nvidia.com>
2021-06-24mlxsw: core_env: Avoid unnecessary memcpy()sIdo Schimmel
Simply get a pointer to the data in the register payload instead of copying it to a temporary buffer. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24Merge tag 'ieee802154-for-davem-2021-06-24' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan Stefan Schmidt says: ==================== pull-request: ieee802154 for net 2021-06-24 An update from ieee802154 for your *net* tree. This time we only have fixes for ieee802154 hwsim driver. Sparked from some syzcaller reports We got a potential crash fix from Eric Dumazet and two memory leak fixes from Dongliang Mu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24gve: Fix warnings reported for DQO patchsetBailey Forrest
https://patchwork.kernel.org/project/netdevbpf/list/?series=506637&state=* - Remove unused variable - Use correct integer type for string formatting. - Remove `inline` in C files Fixes: 9c1a59a2f4bc ("gve: DQO: Add ring allocation and initialization") Fixes: a57e5de476be ("gve: DQO: Add TX path") Signed-off-by: Bailey Forrest <bcf@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24e1000e: Check the PCIm stateSasha Neftin
Complete to commit def4ec6dce393e ("e1000e: PCIm function state support") Check the PCIm state only on CSME systems. There is no point to do this check on non CSME systems. This patch fixes a generation a false-positive warning: "Error in exiting dmoff" Fixes: def4ec6dce39 ("e1000e: PCIm function state support") Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>