summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-05-08kselftests: dmabuf-heaps: Fix confused return value on expected error testingJohn Stultz
When I added the expected error testing, I forgot I need to set the return to zero when we successfully see an error. Without this change we only end up testing a single heap before the test quits. Cc: Shuah Khan <shuah@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Benjamin Gaignard <benjamin.gaignard@linaro.org> Cc: Brian Starkey <brian.starkey@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: "Andrew F. Davis" <afd@ti.com> Cc: linux-kselftest@vger.kernel.org Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2020-05-08fork: prevent accidental access to clone3 featuresChristian Brauner
Jan reported an issue where an interaction between sign-extending clone's flag argument on ppc64le and the new CLONE_INTO_CGROUP feature causes clone() to consistently fail with EBADF. The whole story is a little longer. The legacy clone() syscall is odd in a bunch of ways and here two things interact. First, legacy clone's flag argument is word-size dependent, i.e. it's an unsigned long whereas most system calls with flag arguments use int or unsigned int. Second, legacy clone() ignores unknown and deprecated flags. The two of them taken together means that users on 64bit systems can pass garbage for the upper 32bit of the clone() syscall since forever and things would just work fine. Just try this on a 64bit kernel prior to v5.7-rc1 where this will succeed and on v5.7-rc1 where this will fail with EBADF: int main(int argc, char *argv[]) { pid_t pid; /* Note that legacy clone() has different argument ordering on * different architectures so this won't work everywhere. * * Only set the upper 32 bits. */ pid = syscall(__NR_clone, 0xffffffff00000000 | SIGCHLD, NULL, NULL, NULL, NULL); if (pid < 0) exit(EXIT_FAILURE); if (pid == 0) exit(EXIT_SUCCESS); if (wait(NULL) != pid) exit(EXIT_FAILURE); exit(EXIT_SUCCESS); } Since legacy clone() couldn't be extended this was not a problem so far and nobody really noticed or cared since nothing in the kernel ever bothered to look at the upper 32 bits. But once we introduced clone3() and expanded the flag argument in struct clone_args to 64 bit we opened this can of worms. With the first flag-based extension to clone3() making use of the upper 32 bits of the flag argument we've effectively made it possible for the legacy clone() syscall to reach clone3() only flags. The sign extension scenario is just the odd corner-case that we needed to figure this out. The reason we just realized this now and not already when we introduced CLONE_CLEAR_SIGHAND was that CLONE_INTO_CGROUP assumes that a valid cgroup file descriptor has been given. So the sign extension (or the user accidently passing garbage for the upper 32 bits) caused the CLONE_INTO_CGROUP bit to be raised and the kernel to error out when it didn't find a valid cgroup file descriptor. Let's fix this by always capping the upper 32 bits for all codepaths that are not aware of clone3() features. This ensures that we can't reach clone3() only features by accident via legacy clone as with the sign extension case and also that legacy clone() works exactly like before, i.e. ignoring any unknown flags. This solution risks no regressions and is also pretty clean. Fixes: 7f192e3cd316 ("fork: add clone3") Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups") Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: libc-alpha@sourceware.org Cc: stable@vger.kernel.org # 5.3+ Link: https://sourceware.org/pipermail/libc-alpha/2020-May/113596.html Link: https://lore.kernel.org/r/20200507103214.77218-1-christian.brauner@ubuntu.com
2020-05-08iommu/virtio: Reverse arguments to list_addJulia Lawall
Elsewhere in the file, there is a list_for_each_entry with &vdev->resv_regions as the second argument, suggesting that &vdev->resv_regions is the list head. So exchange the arguments on the list_add call to put the list head in the second argument. Fixes: 2a5a31487445 ("iommu/virtio: Add probe request") Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Link: https://lore.kernel.org/r/1588704467-13431-1-git-send-email-Julia.Lawall@inria.fr Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-05-08gfs2: More gfs2_find_jhead fixesAndreas Gruenbacher
It turns out that when extending an existing bio, gfs2_find_jhead fails to check if the block number is consecutive, which leads to incorrect reads for fragmented journals. In addition, limit the maximum bio size to an arbitrary value of 2 megabytes: since commit 07173c3ec276 ("block: enable multipage bvecs"), if we just keep adding pages until bio_add_page fails, bios will grow much larger than useful, which pins more memory than necessary with barely any additional performance gains. Fixes: f4686c26ecc3 ("gfs2: read journal in large chunks") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-05-08gfs2: Another gfs2_walk_metadata fixAndreas Gruenbacher
Make sure we don't walk past the end of the metadata in gfs2_walk_metadata: the inode holds fewer pointers than indirect blocks. Slightly clean up gfs2_iomap_get. Fixes: a27a0c9b6a20 ("gfs2: gfs2_walk_metadata fix") Cc: stable@vger.kernel.org # v5.3+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-05-08gfs2: Fix use-after-free in gfs2_logd after withdrawBob Peterson
When the gfs2_logd daemon withdrew, the withdraw sequence called into make_fs_ro() to make the file system read-only. That caused the journal descriptors to be freed. However, those journal descriptors were used by gfs2_logd's call to gfs2_ail_flush_reqd(). This caused a use-after free and NULL pointer dereference. This patch changes function gfs2_logd() so that it stops all logd work until the thread is told to stop. Once a withdraw is done, it only does an interruptible sleep. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08gfs2: Fix BUG during unmount after file system withdrawBob Peterson
Before this patch, when the logd daemon was forced to withdraw, it would try to request its journal be recovered by another cluster node. However, in single-user cases with lock_nolock, there are no other nodes to recover the journal. Function signal_our_withdraw() was recognizing the lock_nolock situation, but not until after it had evicted its journal inode. Since the journal descriptor that points to the inode was never removed from the master list, when the unmount occurred, it did another iput on the evicted inode, which resulted in a BUG_ON(inode->i_state & I_CLEAR). This patch moves the check for this situation earlier in function signal_our_withdraw(), which avoids the extra iput, so the unmount may happen normally. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08gfs2: Fix error exit in do_xmoteBob Peterson
Before this patch, if an error was detected from glock function go_sync by function do_xmote, it would return. But the function had temporarily unlocked the gl_lockref spin_lock, and it never re-locked it. When the caller of do_xmote tried to unlock it again, it was already unlocked, which resulted in a corrupted spin_lock value. This patch makes sure the gl_lockref spin_lock is re-locked after it is unlocked. Thanks to Wu Bo <wubo40@huawei.com> for reporting this problem. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08KVM: SVM: Disable AVIC before setting V_IRQSuravee Suthikulpanit
The commit 64b5bd270426 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1") introduced a WARN_ON, which checks if AVIC is enabled when trying to set V_IRQ in the VMCB for enabling irq window. The following warning is triggered because the requesting vcpu (to deactivate AVIC) does not get to process APICv update request for itself until the next #vmexit. WARNING: CPU: 0 PID: 118232 at arch/x86/kvm/svm/svm.c:1372 enable_irq_window+0x6a/0xa0 [kvm_amd] RIP: 0010:enable_irq_window+0x6a/0xa0 [kvm_amd] Call Trace: kvm_arch_vcpu_ioctl_run+0x6e3/0x1b50 [kvm] ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm] ? _copy_to_user+0x26/0x30 ? kvm_vm_ioctl+0xb3e/0xd90 [kvm] ? set_next_entity+0x78/0xc0 kvm_vcpu_ioctl+0x236/0x610 [kvm] ksys_ioctl+0x8a/0xc0 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x58/0x210 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes by sending APICV update request to all other vcpus, and immediately update APIC for itself. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Link: https://lkml.org/lkml/2020/5/2/167 Fixes: 64b5bd270426 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1") Message-Id: <1588818939-54264-1-git-send-email-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08KVM: Introduce kvm_make_all_cpus_request_except()Suravee Suthikulpanit
This allows making request to all other vcpus except the one specified in the parameter. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <1588771076-73790-2-git-send-email-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08KVM: VMX: pass correct DR6 for GD userspace exitPaolo Bonzini
When KVM_EXIT_DEBUG is raised for the disabled-breakpoints case (DR7.GD), DR6 was incorrectly copied from the value in the VM. Instead, DR6.BD should be set in order to catch this case. On AMD this does not need any special code because the processor triggers a #DB exception that is intercepted. However, the testcase would fail without the previous patch because both DR6.BS and DR6.BD would be set. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6Paolo Bonzini
There are two issues with KVM_EXIT_DEBUG on AMD, whose root cause is the different handling of DR6 on intercepted #DB exceptions on Intel and AMD. On Intel, #DB exceptions transmit the DR6 value via the exit qualification field of the VMCS, and the exit qualification only contains the description of the precise event that caused a vmexit. On AMD, instead the DR6 field of the VMCB is filled in as if the #DB exception was to be injected into the guest. This has two effects when guest debugging is in use: * the guest DR6 is clobbered * the kvm_run->debug.arch.dr6 field can accumulate more debug events, rather than just the last one that happened (the testcase in the next patch covers this issue). This patch fixes both issues by emulating, so to speak, the Intel behavior on AMD processors. The important observation is that (after the previous patches) the VMCB value of DR6 is only ever observable from the guest is KVM_DEBUGREG_WONT_EXIT is set. Therefore we can actually set vmcb->save.dr6 to any value we want as long as KVM_DEBUGREG_WONT_EXIT is clear, which it will be if guest debugging is enabled. Therefore it is possible to enter the guest with an all-zero DR6, reconstruct the #DB payload from the DR6 we get at exit time, and let kvm_deliver_exception_payload move the newly set bits into vcpu->arch.dr6. Some extra bits may be included in the payload if KVM_DEBUGREG_WONT_EXIT is set, but this is harmless. This may not be the most optimized way to deal with this, but it is simple and, being confined within SVM code, it gets rid of the set_dr6 callback and kvm_update_dr6. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08KVM: SVM: keep DR6 synchronized with vcpu->arch.dr6Paolo Bonzini
kvm_x86_ops.set_dr6 is only ever called with vcpu->arch.dr6 as the second argument. Ensure that the VMCB value is synchronized to vcpu->arch.dr6 on #DB (both "normal" and nested) and nested vmentry, so that the current value of DR6 is always available in vcpu->arch.dr6. The get_dr6 callback can just access vcpu->arch.dr6 and becomes redundant. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08mmc: block: Fix request completion in the CQE timeout pathAdrian Hunter
First, it should be noted that the CQE timeout (60 seconds) is substantial so a CQE request that times out is really stuck, and the race between timeout and completion is extremely unlikely. Nevertheless this patch fixes an issue with it. Commit ad73d6feadbd7b ("mmc: complete requests from ->timeout") preserved the existing functionality, to complete the request. However that had only been necessary because the block layer timeout handler had been marking the request to prevent it from being completed normally. That restriction was removed at the same time, the result being that a request that has gone will have been completed anyway. That is, the completion was unnecessary. At the time, the unnecessary completion was harmless because the block layer would ignore it, although that changed in kernel v5.0. Note for stable, this patch will not apply cleanly without patch "mmc: core: Fix recursive locking issue in CQE recovery path" Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Fixes: ad73d6feadbd7b ("mmc: complete requests from ->timeout") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200508062227.23144-1-adrian.hunter@intel.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2020-05-08mmc: core: Fix recursive locking issue in CQE recovery pathSarthak Garg
Consider the following stack trace -001|raw_spin_lock_irqsave -002|mmc_blk_cqe_complete_rq -003|__blk_mq_complete_request(inline) -003|blk_mq_complete_request(rq) -004|mmc_cqe_timed_out(inline) -004|mmc_mq_timed_out mmc_mq_timed_out acquires the queue_lock for the first time. The mmc_blk_cqe_complete_rq function also tries to acquire the same queue lock resulting in recursive locking where the task is spinning for the same lock which it has already acquired leading to watchdog bark. Fix this issue with the lock only for the required critical section. Cc: <stable@vger.kernel.org> Fixes: 1e8e55b67030 ("mmc: block: Add CQE support") Suggested-by: Sahitya Tummala <stummala@codeaurora.org> Signed-off-by: Sarthak Garg <sartgarg@codeaurora.org> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/1588868135-31783-1-git-send-email-vbadigan@codeaurora.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2020-05-08mmc: core: Check request type before completing the requestVeerabhadrarao Badiganti
In the request completion path with CQE, request type is being checked after the request is getting completed. This is resulting in returning the wrong request type and leading to the IO hang issue. ASYNC request type is getting returned for DCMD type requests. Because of this mismatch, mq->cqe_busy flag is never getting cleared and the driver is not invoking blk_mq_hw_run_queue. So requests are not getting dispatched to the LLD from the block layer. All these eventually leading to IO hang issues. So, get the request type before completing the request. Cc: <stable@vger.kernel.org> Fixes: 1e8e55b67030 ("mmc: block: Add CQE support") Signed-off-by: Veerabhadrarao Badiganti <vbadigan@codeaurora.org> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/1588775643-18037-2-git-send-email-vbadigan@codeaurora.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2020-05-08Merge tag 'drm-misc-fixes-2020-05-07' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes A few minor fixes for an ordering issue in virtio, an (old) gcc warning in sun4i, a probe issue in ingenic-drm and a regression in the HDCP support. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <maxime@cerno.tech> Link: https://patchwork.freedesktop.org/patch/msgid/20200507160130.id64niqgf5wsha4u@gilmour.lan
2020-05-08Merge tag 'amd-drm-fixes-5.7-2020-05-06' of ↵Dave Airlie
git://people.freedesktop.org/~agd5f/linux into drm-fixes amd-drm-fixes-5.7-2020-05-06: amdgpu: - Runtime PM fixes - DC fix for PPC - Misc DC fixes Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexdeucher@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200506212257.3893-1-alexander.deucher@amd.com
2020-05-07Merge branch 'for-v5.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem fix from James Morris: "Fix the default value of fs_context_parse_param hook" * 'for-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: security: Fix the default value of fs_context_parse_param hook
2020-05-07mm: limit boost_watermark on small zonesHenry Willard
Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") adds a boost_watermark() function which increases the min watermark in a zone by at least pageblock_nr_pages or the number of pages in a page block. On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or 512M. It does this regardless of the number of managed pages managed in the zone or the likelihood of success. This can put the zone immediately under water in terms of allocating pages from the zone, and can cause a small machine to fail immediately due to OoM. Unlike set_recommended_min_free_kbytes(), which substantially increases min_free_kbytes and is tied to THP, boost_watermark() can be called even if THP is not active. The problem is most likely to appear on architectures such as Arm64 where pageblock_nr_pages is very large. It is desirable to run the kdump capture kernel in as small a space as possible to avoid wasting memory. In some architectures, such as Arm64, there are restrictions on where the capture kernel can run, and therefore, the space available. A capture kernel running in 768M can fail due to OoM immediately after boost_watermark() sets the min in zone DMA32, where most of the memory is, to 512M. It fails even though there is over 500M of free memory. With boost_watermark() suppressed, the capture kernel can run successfully in 448M. This patch limits boost_watermark() to boosting a zone's min watermark only when there are enough pages that the boost will produce positive results. In this case that is estimated to be four times as many pages as pageblock_nr_pages. Mel said: : There is no harm in marking it stable. Clearly it does not happen very : often but it's not impossible. 32-bit x86 is a lot less common now : which would previously have been vulnerable to triggering this easily. : ppc64 has a larger base page size but typically only has one zone. : arm64 is likely the most vulnerable, particularly when CMA is : configured with a small movable zone. Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Signed-off-by: Henry Willard <henry.willard@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07ubsan: disable UBSAN_ALIGNMENT under COMPILE_TESTKees Cook
The documentation for UBSAN_ALIGNMENT already mentions that it should not be used on all*config builds (and for efficient-unaligned-access architectures), so just refactor the Kconfig to correctly implement this so randconfigs will stop creating insane images that freak out objtool under CONFIG_UBSAN_TRAP (due to the false positives producing functions that never return, etc). Link: http://lkml.kernel.org/r/202005011433.C42EA3E2D@keescook Fixes: 0887a7ebc977 ("ubsan: add trap instrumentation option") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/linux-next/202004231224.D6B3B650@keescook/ Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07mm/vmscan: remove unnecessary argument description of isolate_lru_pages()Qiwu Chen
Since commit a9e7c39fa9fd9 ("mm/vmscan.c: remove 7th argument of isolate_lru_pages()"), the explanation of 'mode' argument has been unnecessary. Let's remove it. Signed-off-by: Qiwu Chen <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200501090346.2894-1-chenqiwu@xiaomi.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07epoll: atomically remove wait entry on wake upRoman Penyaev
This patch does two things: - fixes a lost wakeup introduced by commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") - improves performance for events delivery. The description of the problem is the following: if N (>1) threads are waiting on ep->wq for new events and M (>1) events come, it is quite likely that >1 wakeups hit the same wait queue entry, because there is quite a big window between __add_wait_queue_exclusive() and the following __remove_wait_queue() calls in ep_poll() function. This can lead to lost wakeups, because thread, which was woken up, can handle not all the events in ->rdllist. (in better words the problem is described here: https://lkml.org/lkml/2019/10/7/905) The idea of the current patch is to use init_wait() instead of init_waitqueue_entry(). Internally init_wait() sets autoremove_wake_function as a callback, which removes the wait entry atomically (under the wq locks) from the list, thus the next coming wakeup hits the next wait entry in the wait queue, thus preventing lost wakeups. Problem is very well reproduced by the epoll60 test case [1]. Wait entry removal on wakeup has also performance benefits, because there is no need to take a ep->lock and remove wait entry from the queue after the successful wakeup. Here is the timing output of the epoll60 test case: With explicit wakeup from ep_scan_ready_list() (the state of the code prior 339ddb53d373): real 0m6.970s user 0m49.786s sys 0m0.113s After this patch: real 0m5.220s user 0m36.879s sys 0m0.019s The other testcase is the stress-epoll [2], where one thread consumes all the events and other threads produce many events: With explicit wakeup from ep_scan_ready_list() (the state of the code prior 339ddb53d373): threads events/ms run-time ms 8 5427 1474 16 6163 2596 32 6824 4689 64 7060 9064 128 6991 18309 After this patch: threads events/ms run-time ms 8 5598 1429 16 7073 2262 32 7502 4265 64 7640 8376 128 7634 16767 (number of "events/ms" represents event bandwidth, thus higher is better; number of "run-time ms" represents overall time spent doing the benchmark, thus lower is better) [1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jason Baron <jbaron@akamai.com> Cc: Khazhismel Kumykov <khazhy@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Heiher <r@hev.cc> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07kselftests: introduce new epoll60 testcase for catching lost wakeupsRoman Penyaev
This test case catches lost wake up introduced by commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") The test is simple: we have 10 threads and 10 event fds. Each thread can harvest only 1 event. 1 producer fires all 10 events at once and waits that all 10 events will be observed by 10 threads. In case of lost wakeup epoll_wait() will timeout and 0 will be returned. Test case catches two sort of problems: forgotten wakeup on event, which hits the ->ovflist list, this problem was fixed by: 5a2513239750 ("eventpoll: fix missing wakeup for ovflist in ep_poll_callback") the other problem is when several sequential events hit the same waiting thread, thus other waiters get no wakeups. Problem is fixed in the following patch. Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Khazhismel Kumykov <khazhy@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Heiher <r@hev.cc> Cc: Jason Baron <jbaron@akamai.com> Link: http://lkml.kernel.org/r/20200430130326.1368509-1-rpenyaev@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07percpu: make pcpu_alloc() aware of current gfp contextFilipe Manana
Since 5.7-rc1, on btrfs we have a percpu counter initialization for which we always pass a GFP_KERNEL gfp_t argument (this happens since commit 2992df73268f78 ("btrfs: Implement DREW lock")). That is safe in some contextes but not on others where allowing fs reclaim could lead to a deadlock because we are either holding some btrfs lock needed for a transaction commit or holding a btrfs transaction handle open. Because of that we surround the call to the function that initializes the percpu counter with a NOFS context using memalloc_nofs_save() (this is done at btrfs_init_fs_root()). However it turns out that this is not enough to prevent a possible deadlock because percpu_alloc() determines if it is in an atomic context by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this case) and it is not aware that a NOFS context is set. Because percpu_alloc() thinks it is in a non atomic context it locks the pcpu_alloc_mutex. This can result in a btrfs deadlock when pcpu_balance_workfn() is running, has acquired that mutex and is waiting for reclaim, while the btrfs task that called percpu_counter_init() (and therefore percpu_alloc()) is holding either the btrfs commit_root semaphore or a transaction handle (done fs/btrfs/backref.c: iterate_extent_inodes()), which prevents reclaim from finishing as an attempt to commit the current btrfs transaction will deadlock. Lockdep reports this issue with the following trace: ====================================================== WARNING: possible circular locking dependency detected 5.6.0-rc7-btrfs-next-77 #1 Not tainted ------------------------------------------------------ kswapd0/91 is trying to acquire lock: ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] but task is already holding lock: ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (fs_reclaim){+.+.}: fs_reclaim_acquire.part.0+0x25/0x30 __kmalloc+0x5f/0x3a0 pcpu_create_chunk+0x19/0x230 pcpu_balance_workfn+0x56a/0x680 process_one_work+0x235/0x5f0 worker_thread+0x50/0x3b0 kthread+0x120/0x140 ret_from_fork+0x3a/0x50 -> #3 (pcpu_alloc_mutex){+.+.}: __mutex_lock+0xa9/0xaf0 pcpu_alloc+0x480/0x7c0 __percpu_counter_init+0x50/0xd0 btrfs_drew_lock_init+0x22/0x70 [btrfs] btrfs_get_fs_root+0x29c/0x5c0 [btrfs] resolve_indirect_refs+0x120/0xa30 [btrfs] find_parent_nodes+0x50b/0xf30 [btrfs] btrfs_find_all_leafs+0x60/0xb0 [btrfs] iterate_extent_inodes+0x139/0x2f0 [btrfs] iterate_inodes_from_logical+0xa1/0xe0 [btrfs] btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs] btrfs_ioctl+0x165a/0x3130 [btrfs] ksys_ioctl+0x87/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x5c/0x260 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #2 (&fs_info->commit_root_sem){++++}: down_write+0x38/0x70 btrfs_cache_block_group+0x2ec/0x500 [btrfs] find_free_extent+0xc6a/0x1600 [btrfs] btrfs_reserve_extent+0x9b/0x180 [btrfs] btrfs_alloc_tree_block+0xc1/0x350 [btrfs] alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs] __btrfs_cow_block+0x122/0x5a0 [btrfs] btrfs_cow_block+0x106/0x240 [btrfs] commit_cowonly_roots+0x55/0x310 [btrfs] btrfs_commit_transaction+0x509/0xb20 [btrfs] sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x93/0xc0 exit_to_usermode_loop+0xf9/0x100 do_syscall_64+0x20d/0x260 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #1 (&space_info->groups_sem){++++}: down_read+0x3c/0x140 find_free_extent+0xef6/0x1600 [btrfs] btrfs_reserve_extent+0x9b/0x180 [btrfs] btrfs_alloc_tree_block+0xc1/0x350 [btrfs] alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs] __btrfs_cow_block+0x122/0x5a0 [btrfs] btrfs_cow_block+0x106/0x240 [btrfs] btrfs_search_slot+0x50c/0xd60 [btrfs] btrfs_lookup_inode+0x3a/0xc0 [btrfs] __btrfs_update_delayed_inode+0x90/0x280 [btrfs] __btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs] __btrfs_run_delayed_items+0x8e/0x180 [btrfs] btrfs_commit_transaction+0x31b/0xb20 [btrfs] iterate_supers+0x87/0xf0 ksys_sync+0x60/0xb0 __ia32_sys_sync+0xa/0x10 do_syscall_64+0x5c/0x260 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (&delayed_node->mutex){+.+.}: __lock_acquire+0xef0/0x1c80 lock_acquire+0xa2/0x1d0 __mutex_lock+0xa9/0xaf0 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] btrfs_evict_inode+0x40d/0x560 [btrfs] evict+0xd9/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x124/0x1a0 do_shrink_slab+0x176/0x440 shrink_slab+0x23a/0x2c0 shrink_node+0x188/0x6e0 balance_pgdat+0x31d/0x7f0 kswapd+0x238/0x550 kthread+0x120/0x140 ret_from_fork+0x3a/0x50 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(pcpu_alloc_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/91: #0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 #1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0 #2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50 stack backtrace: CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8f/0xd0 check_noncircular+0x170/0x190 __lock_acquire+0xef0/0x1c80 lock_acquire+0xa2/0x1d0 __mutex_lock+0xa9/0xaf0 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] btrfs_evict_inode+0x40d/0x560 [btrfs] evict+0xd9/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x124/0x1a0 do_shrink_slab+0x176/0x440 shrink_slab+0x23a/0x2c0 shrink_node+0x188/0x6e0 balance_pgdat+0x31d/0x7f0 kswapd+0x238/0x550 kthread+0x120/0x140 ret_from_fork+0x3a/0x50 This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL to percpu_counter_init() in contextes where it is not reclaim safe, however that type of approach is discouraged since memalloc_[nofs|noio]_save() were introduced. Therefore this change makes pcpu_alloc() look up into an existing nofs/noio context before deciding whether it is in an atomic context or not. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Link: http://lkml.kernel.org/r/20200430164356.15543-1-fdmanana@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07mm/slub: fix incorrect interpretation of s->offsetWaiman Long
In a couple of places in the slub memory allocator, the code uses "s->offset" as a check to see if the free pointer is put right after the object. That check is no longer true with commit 3202fa62fb43 ("slub: relocate freelist pointer to middle of object"). As a result, echoing "1" into the validate sysfs file, e.g. of dentry, may cause a bunch of "Freepointer corrupt" error reports like the following to appear with the system in panic afterwards. ============================================================================= BUG dentry(666:pmcd.service) (Tainted: G B): Freepointer corrupt ----------------------------------------------------------------------------- To fix it, use the check "s->offset == s->inuse" in the new helper function freeptr_outside_object() instead. Also add another helper function get_info_end() to return the end of info block (inuse + free pointer if not overlapping with object). Fixes: 3202fa62fb43 ("slub: relocate freelist pointer to middle of object") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Vitaly Nikolenko <vnik@duasynt.com> Cc: Silvio Cesare <silvio.cesare@gmail.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Markus Elfring <Markus.Elfring@web.de> Cc: Changbin Du <changbin.du@gmail.com> Link: http://lkml.kernel.org/r/20200429135328.26976-1-longman@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07scripts/gdb: repair rb_first() and rb_last()Aymeric Agon-Rambosson
The current implementations of the rb_first() and rb_last() gdb functions have a variable that references itself in its instanciation, which causes the function to throw an error if a specific condition on the argument is met. The original author rather intended to reference the argument and made a typo. Referring the argument instead makes the function work as intended. Signed-off-by: Aymeric Agon-Rambosson <aymeric.agon@yandex.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: Douglas Anderson <dianders@chromium.org> Cc: Nikolay Borisov <n.borisov.lkml@gmail.com> Cc: Jackie Liu <liuyun01@kylinos.cn> Cc: Jason Wessel <jason.wessel@windriver.com> Link: http://lkml.kernel.org/r/20200427051029.354840-1-aymeric.agon@yandex.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07eventpoll: fix missing wakeup for ovflist in ep_poll_callbackKhazhismel Kumykov
In the event that we add to ovflist, before commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") we would be woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback. With that wakeup removed, if we add to ovflist here, we may never wake up. Rather than adding back the ep_scan_ready_list wakeup - which was resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback. We noticed that one of our workloads was missing wakeups starting with 339ddb53d373 and upon manual inspection, this wakeup seemed missing to me. With this patch added, we no longer see missing wakeups. I haven't yet tried to make a small reproducer, but the existing kselftests in filesystem/epoll passed for me with this patch. [khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman] Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Roman Penyaev <rpenyaev@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Heiher <r@hev.cc> Cc: Jason Baron <jbaron@akamai.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory()Janakarajan Natarajan
When trying to lock read-only pages, sev_pin_memory() fails because FOLL_WRITE is used as the flag for get_user_pages_fast(). Commit 73b0140bf0fe ("mm/gup: change GUP fast to use flags rather than a write 'bool'") updated the get_user_pages_fast() call sites to use flags, but incorrectly updated the call in sev_pin_memory(). As the original coding of this call was correct, revert the change made by that commit. Fixes: 73b0140bf0fe ("mm/gup: change GUP fast to use flags rather than a write 'bool'") Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Mike Marshall <hubcap@omnibond.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Link: http://lkml.kernel.org/r/20200423152419.87202-1-Janakarajan.Natarajan@amd.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07scripts/decodecode: fix trapping instruction formattingIvan Delalande
If the trapping instruction contains a ':', for a memory access through segment registers for example, the sed substitution will insert the '*' marker in the middle of the instruction instead of the line address: 2b: 65 48 0f c7 0f cmpxchg16b %gs:*(%rdi) <-- trapping instruction I started to think I had forgotten some quirk of the assembly syntax before noticing that it was actually coming from the script. Fix it to add the address marker at the right place for these instructions: 28: 49 8b 06 mov (%r14),%rax 2b:* 65 48 0f c7 0f cmpxchg16b %gs:(%rdi) <-- trapping instruction 30: 0f 94 c0 sete %al Fixes: 18ff44b189e2 ("scripts/decodecode: make faulting insn ptr more robust") Signed-off-by: Ivan Delalande <colona@arista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20200419223653.GA31248@visor Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07kernel/kcov.c: fix typos in kcov_remote_start documentationMaciej Grochowski
Signed-off-by: Maciej Grochowski <maciej.grochowski@pm.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Link: http://lkml.kernel.org/r/20200420030259.31674-1-maciek.grochowski@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()David Hildenbrand
Without CONFIG_PREEMPT, it can happen that we get soft lockups detected, e.g., while booting up. watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 RIP: __pageblock_pfn_to_page+0x134/0x1c0 Call Trace: set_zone_contiguous+0x56/0x70 page_alloc_init_late+0x166/0x176 kernel_init_freeable+0xfa/0x255 kernel_init+0xa/0x106 ret_from_fork+0x35/0x40 The issue becomes visible when having a lot of memory (e.g., 4TB) assigned to a single NUMA node - a system that can easily be created using QEMU. Inside VMs on a hypervisor with quite some memory overcommit, this is fairly easy to trigger. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Shile Zhang <shile.zhang@linux.alibaba.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07mm, memcg: fix error return value of mem_cgroup_css_alloc()Yafang Shao
When I run my memcg testcase which creates lots of memcgs, I found there're unexpected out of memory logs while there're still enough available free memory. The error log is mkdir: cannot create directory 'foo.65533': Cannot allocate memory The reason is when we try to create more than MEM_CGROUP_ID_MAX memcgs, an -ENOMEM errno will be set by mem_cgroup_css_alloc(), but the right errno should be -ENOSPC "No space left on device", which is an appropriate errno for userspace's failed mkdir. As the errno really misled me, we should make it right. After this patch, the error log will be mkdir: cannot create directory 'foo.65533': No space left on device [akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal] [akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal] Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200407063621.GA18914@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1586192163-20099-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()Oleg Nesterov
Commit cc731525f26a ("signal: Remove kernel interal si_code magic") changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no longer works if the sender doesn't have rights to send a signal. Change __do_notify() to use do_send_sig_info() instead of kill_pid_info() to avoid check_kill_permission(). This needs the additional notify.sigev_signo != 0 check, shouldn't we change do_mq_notify() to deny sigev_signo == 0 ? Test-case: #include <signal.h> #include <mqueue.h> #include <unistd.h> #include <sys/wait.h> #include <assert.h> static int notified; static void sigh(int sig) { notified = 1; } int main(void) { signal(SIGIO, sigh); int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL); assert(fd >= 0); struct sigevent se = { .sigev_notify = SIGEV_SIGNAL, .sigev_signo = SIGIO, }; assert(mq_notify(fd, &se) == 0); if (!fork()) { assert(setuid(1) == 0); mq_send(fd, "",1,0); return 0; } wait(NULL); mq_unlink("/mq"); assert(notified); return 0; } [manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere] Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic") Reported-by: Yoji <yoji.fujihar.min@gmail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Markus Elfring <elfring@users.sourceforge.net> Cc: <1vier1@web.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07evm: Fix RCU list related warningsMadhuparna Bhowmik
This patch fixes the following warning and few other instances of traversal of evm_config_xattrnames list: [ 32.848432] ============================= [ 32.848707] WARNING: suspicious RCU usage [ 32.848966] 5.7.0-rc1-00006-ga8d5875ce5f0b #1 Not tainted [ 32.849308] ----------------------------- [ 32.849567] security/integrity/evm/evm_main.c:231 RCU-list traversed in non-reader section!! Since entries are only added to the list and never deleted, use list_for_each_entry_lockless() instead of list_for_each_entry_rcu for traversing the list. Also, add a relevant comment in evm_secfs.c to indicate this fact. Reported-by: kernel test robot <lkp@intel.com> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> (RCU viewpoint) Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2020-05-07ima: Fix return value of ima_write_policy()Roberto Sassu
This patch fixes the return value of ima_write_policy() when a new policy is directly passed to IMA and the current policy requires appraisal of the file containing the policy. Currently, if appraisal is not in ENFORCE mode, ima_write_policy() returns 0 and leads user space applications to an endless loop. Fix this issue by denying the operation regardless of the appraisal mode. Cc: stable@vger.kernel.org # 4.10.x Fixes: 19f8a84713edc ("ima: measure and appraise the IMA policy itself") Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2020-05-07evm: Check also if *tfm is an error pointer in init_desc()Roberto Sassu
This patch avoids a kernel panic due to accessing an error pointer set by crypto_alloc_shash(). It occurs especially when there are many files that require an unsupported algorithm, as it would increase the likelihood of the following race condition: Task A: *tfm = crypto_alloc_shash() <= error pointer Task B: if (*tfm == NULL) <= *tfm is not NULL, use it Task B: rc = crypto_shash_init(desc) <= panic Task A: *tfm = NULL This patch uses the IS_ERR_OR_NULL macro to determine whether or not a new crypto context must be created. Cc: stable@vger.kernel.org Fixes: d46eb3699502b ("evm: crypto hash replaced by shash") Co-developed-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com> Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2020-05-07ima: Set file->f_mode instead of file->f_flags in ima_calc_file_hash()Roberto Sassu
Commit a408e4a86b36 ("ima: open a new file instance if no read permissions") tries to create a new file descriptor to calculate a file digest if the file has not been opened with O_RDONLY flag. However, if a new file descriptor cannot be obtained, it sets the FMODE_READ flag to file->f_flags instead of file->f_mode. This patch fixes this issue by replacing f_flags with f_mode as it was before that commit. Cc: stable@vger.kernel.org # 4.20.x Fixes: a408e4a86b36 ("ima: open a new file instance if no read permissions") Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2020-05-07net: fix a potential recursive NETDEV_FEAT_CHANGECong Wang
syzbot managed to trigger a recursive NETDEV_FEAT_CHANGE event between bonding master and slave. I managed to find a reproducer for this: ip li set bond0 up ifenslave bond0 eth0 brctl addbr br0 ethtool -K eth0 lro off brctl addif br0 bond0 ip li set br0 up When a NETDEV_FEAT_CHANGE event is triggered on a bonding slave, it captures this and calls bond_compute_features() to fixup its master's and other slaves' features. However, when syncing with its lower devices by netdev_sync_lower_features() this event is triggered again on slaves when the LRO feature fails to change, so it goes back and forth recursively until the kernel stack is exhausted. Commit 17b85d29e82c intentionally lets __netdev_update_features() return -1 for such a failure case, so we have to just rely on the existing check inside netdev_sync_lower_features() and skip NETDEV_FEAT_CHANGE event only for this specific failure case. Fixes: fd867d51f889 ("net/core: generic support for disabling netdev features down stack") Reported-by: syzbot+e73ceacfd8560cc8a3ca@syzkaller.appspotmail.com Reported-by: syzbot+c2fb6f9ddcea95ba49b5@syzkaller.appspotmail.com Cc: Jarod Wilson <jarod@redhat.com> Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jann Horn <jannh@google.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07mptcp: set correct vfs info for subflowsPaolo Abeni
When a subflow is created via mptcp_subflow_create_socket(), a new 'struct socket' is allocated, with a new i_ino value. When inspecting TCP sockets via the procfs and or the diag interface, the above ones are not related to the process owning the MPTCP master socket, even if they are a logical part of it ('ss -p' shows an empty process field) Additionally, subflows created by the path manager get the uid/gid from the running workqueue. Subflows are part of the owning MPTCP master socket, let's adjust the vfs info to reflect this. After this patch, 'ss' correctly displays subflows as belonging to the msk socket creator. Fixes: 2303f994b3e1 ("mptcp: Associate MPTCP context with TCP socket") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07net: microchip: encx24j600: add missed kthread_stopChuhong Yuan
This driver calls kthread_run() in probe, but forgets to call kthread_stop() in probe failure and remove. Add the missed kthread_stop() to fix it. Signed-off-by: Chuhong Yuan <hslester96@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07Revert "ipv6: add mtu lock check in __ip6_rt_update_pmtu"Maciej Żenczykowski
This reverts commit 19bda36c4299ce3d7e5bce10bebe01764a655a6d: | ipv6: add mtu lock check in __ip6_rt_update_pmtu | | Prior to this patch, ipv6 didn't do mtu lock check in ip6_update_pmtu. | It leaded to that mtu lock doesn't really work when receiving the pkt | of ICMPV6_PKT_TOOBIG. | | This patch is to add mtu lock check in __ip6_rt_update_pmtu just as ipv4 | did in __ip_rt_update_pmtu. The above reasoning is incorrect. IPv6 *requires* icmp based pmtu to work. There's already a comment to this effect elsewhere in the kernel: $ git grep -p -B1 -A3 'RTAX_MTU lock' net/ipv6/route.c=4813= static int rt6_mtu_change_route(struct fib6_info *f6i, void *p_arg) ... /* In IPv6 pmtu discovery is not optional, so that RTAX_MTU lock cannot disable it. We still use this lock to block changes caused by addrconf/ndisc. */ This reverts to the pre-4.9 behaviour. Cc: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Maciej Żenczykowski <maze@google.com> Fixes: 19bda36c4299 ("ipv6: add mtu lock check in __ip6_rt_update_pmtu") Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07net: bareudp: avoid uninitialized variable warningArnd Bergmann
clang points out that building without IPv6 would lead to returning an uninitialized variable if a packet with family!=AF_INET is passed into bareudp_udp_encap_recv(): drivers/net/bareudp.c:139:6: error: variable 'err' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] if (family == AF_INET) ^~~~~~~~~~~~~~~~~ drivers/net/bareudp.c:146:15: note: uninitialized use occurs here if (unlikely(err)) { ^~~ include/linux/compiler.h:78:42: note: expanded from macro 'unlikely' # define unlikely(x) __builtin_expect(!!(x), 0) ^ drivers/net/bareudp.c:139:2: note: remove the 'if' if its condition is always true if (family == AF_INET) ^~~~~~~~~~~~~~~~~~~~~~ This cannot happen in practice, so change the condition in a way that gcc sees the IPv4 case as unconditionally true here. For consistency, change all the similar constructs in this file the same way, using "if(IS_ENABLED())" instead of #if IS_ENABLED()". Fixes: 571912c69f0e ("net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07Merge tag 'trace-v5.7-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Fix bootconfig causing kernels to fail with CONFIG_BLK_DEV_RAM enabled - Fix allocation leaks in bootconfig tool - Fix a double initialization of a variable - Fix API bootconfig usage from kprobe boot time events - Reject NULL location for kprobes - Fix crash caused by preempt delay module not cleaning up kthread correctly - Add vmalloc_sync_mappings() to prevent x86_64 page faults from recursively faulting from tracing page faults - Fix comment in gpu/trace kerneldoc header - Fix documentation of how to create a trace event class - Make the local tracing_snapshot_instance_cond() function static * tag 'trace-v5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tools/bootconfig: Fix resource leak in apply_xbc() tracing: Make tracing_snapshot_instance_cond() static tracing: Fix doc mistakes in trace sample gpu/trace: Minor comment updates for gpu_mem_total tracepoint tracing: Add a vmalloc_sync_mappings() for safe measure tracing: Wait for preempt irq delay thread to finish tracing/kprobes: Reject new event if loc is NULL tracing/boottime: Fix kprobe event API usage tracing/kprobes: Fix a double initialization typo bootconfig: Fix to remove bootconfig data from initrd while boot
2020-05-07Merge tag 'linux-kselftest-5.7-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kselftest fixes from Shuah Khan: "ftrace test fixes and a fix to kvm Makefile for relocatable native/cross builds and installs" * tag 'linux-kselftest-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests: fix kvm relocatable native/cross builds and installs selftests/ftrace: Make XFAIL green color ftrace/selftest: make unresolved cases cause failure if --fail-unresolved set ftrace/selftests: workaround cgroup RT scheduling issues
2020-05-07io_uring: don't use 'fd' for openat/openat2/statxJens Axboe
We currently make some guesses as when to open this fd, but in reality we have no business (or need) to do so at all. In fact, it makes certain things fail, like O_PATH. Remove the fd lookup from these opcodes, we're just passing the 'fd' to generic helpers anyway. With that, we can also remove the special casing of fd values in io_req_needs_file(), and the 'fd_non_neg' check that we have. And we can ensure that we only read sqe->fd once. This fixes O_PATH usage with openat/openat2, and ditto statx path side oddities. Cc: stable@vger.kernel.org: # v5.6 Reported-by: Max Kellermann <mk@cm4all.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07ALSA: rawmidi: Fix racy buffer resize under concurrent accessesTakashi Iwai
The rawmidi core allows user to resize the runtime buffer via ioctl, and this may lead to UAF when performed during concurrent reads or writes: the read/write functions unlock the runtime lock temporarily during copying form/to user-space, and that's the race window. This patch fixes the hole by introducing a reference counter for the runtime buffer read/write access and returns -EBUSY error when the resize is performed concurrently against read/write. Note that the ref count field is a simple integer instead of refcount_t here, since the all contexts accessing the buffer is basically protected with a spinlock, hence we need no expensive atomic ops. Also, note that this busy check is needed only against read / write functions, and not in receive/transmit callbacks; the race can happen only at the spinlock hole mentioned in the above, while the whole function is protected for receive / transmit callbacks. Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/CAFcO6XMWpUVK_yzzCpp8_XP7+=oUpQvuBeCbMffEDkpe8jWrfg@mail.gmail.com Link: https://lore.kernel.org/r/s5heerw3r5z.wl-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2020-05-07net: hisilicon: Make CONFIG_HNS invisibleGeert Uytterhoeven
The HNS config symbol enables the framework support for the Hisilicon Network Subsystem. It is already selected by all of its users, so there is no reason to make it visible. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07usb: hso: correct debug messageOliver Neukum
If you do not find the OUT endpoint, you should say so, rather than copy the error message for the IN endpoint. Presumably a copy and paste error. Signed-off-by: Oliver Neukum <oneukum@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07tools/bootconfig: Fix resource leak in apply_xbc()Yunfeng Ye
Fix the @data and @fd allocations that are leaked in the error path of apply_xbc(). Link: http://lkml.kernel.org/r/583a49c9-c27a-931d-e6c2-6f63a4b18bea@huawei.com Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>