2020-10-29Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma fixes from Jason Gunthorpe:
 "The good news is people are testing rc1 in the RDMA world - the bad news is testing of the for-next area is not as good as I had hoped, as we really should have caught at least the rdma_connect_locked() issue before now.

 Notable merge window regressions that didn't get caught/fixed in time for rc1:

  - Fix in-kernel users of rxe; they were broken by the rapid fix to undo the uABI breakage in rxe from another patch

  - EFA userspace needs to read the GID table but was broken with the new GID table logic

  - Fix user-triggerable deadlock in mlx5 using devlink reload

  - Fix deadlock in several ULPs using rdma_connect() from the CM handler callbacks

  - Memory leak in qedr"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/qedr: Fix memory leak in iWARP CM
  RDMA: Add rdma_connect_locked()
  RDMA/uverbs: Fix false error in query gid IOCTL
  RDMA/mlx5: Fix devlink deadlock on net namespace deletion
  RDMA/rxe: Fix small problem in network_type patch
2020-10-29r8169: fix issue with forced threading in combination with shared interruptsHeiner Kallweit
As reported by Serge, the IRQF_NO_THREAD flag causes an error if the interrupt is actually shared and the other driver(s) don't have this flag set. This situation can occur if a PCI(e) legacy interrupt is used in combination with forced threading. There's no good way to deal with this properly, therefore we have to remove the IRQF_NO_THREAD flag. To fix the original forced-threading issue, switch to napi_schedule(). Fixes: 424a646e072a ("r8169: fix operation under forced interrupt threading") Link: https://www.spinics.net/lists/netdev/msg694960.html Reported-by: Serge Belyshev <belyshev@depni.sinp.msu.ru> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Serge Belyshev <belyshev@depni.sinp.msu.ru> Link: https://lore.kernel.org/r/b5b53bfe-35ac-3768-85bf-74d1290cf394@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29netem: fix zero division in tabledistAleksandr Nogikh
Currently it is possible to craft a special netlink RTM_NEWQDISC command that results in jitter being equal to 0x80000000. It is enough to set the 32-bit jitter to 0x02000000 (it will later be multiplied by 2^6) or just set the 64-bit jitter via TCA_NETEM_JITTER64. This causes an overflow during the generation of uniformly distributed numbers in tabledist(), which in turn leads to division by zero (sigma != 0, but sigma * 2 is 0). The related fragment of code needs 32-bit division - see commit 9b0ed89 ("netem: remove unnecessary 64 bit modulus") - so switching to 64-bit is not an option. Fix the issue by keeping the value of jitter within the range that tabledist() can handle adequately: [0; INT_MAX]. As a negative standard deviation makes no sense, take the absolute value of the passed value and cap it at INT_MAX. Inside tabledist(), switch to unsigned 32-bit arithmetic in order to prevent overflows. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Aleksandr Nogikh <nogikh@google.com> Reported-by: syzbot+ec762a6342ad0d3c0d8f@syzkaller.appspotmail.com Acked-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/r/20201028170731.1383332-1-aleksandrnogikh@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
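A minimal runnable sketch of the arithmetic (illustrative values; the variable names are not the kernel's):

	#include <stdio.h>
	#include <stdint.h>
	#include <stdlib.h>

	int main(void)
	{
		uint32_t rnd = 12345;
		int64_t jitter = 0x02000000LL << 6;	/* == 0x80000000, as netem builds it */

		/* the bug: in 32-bit arithmetic, 2 * 0x80000000 wraps to 0,
		 * so tabledist()'s "rnd % (2 * sigma)" divides by zero */
		uint32_t sigma = (uint32_t)jitter;
		printf("2 * sigma = %u\n", 2u * sigma);		/* prints 0 */

		/* the fix: take the absolute value, cap at INT_MAX, stay unsigned */
		uint32_t capped = (uint32_t)(llabs(jitter) > INT32_MAX ? INT32_MAX
								       : llabs(jitter));
		printf("rnd %% (2 * capped) = %u\n", rnd % (2u * capped));
		return 0;
	}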
2020-10-29ibmvnic: fix ibmvnic_set_macLijun Pan
Jakub Kicinski brought up a concern in ibmvnic_set_mac(), which does this:

	ether_addr_copy(adapter->mac_addr, addr->sa_data);
	if (adapter->state != VNIC_PROBED)
		rc = __ibmvnic_set_mac(netdev, addr->sa_data);

So if state == VNIC_PROBED, the user can assign an invalid address to adapter->mac_addr, and ibmvnic_set_mac() will still return 0. The fix is to validate the Ethernet address at the beginning of ibmvnic_set_mac(), and to move the ether_addr_copy() into the "adapter->state != VNIC_PROBED" case. Fixes: c26eba03e407 ("ibmvnic: Update reset infrastructure to support tunable parameters") Signed-off-by: Lijun Pan <ljp@linux.ibm.com> Link: https://lore.kernel.org/r/20201027220456.71450-1-ljp@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
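A hedged sketch of the reordered function (driver context assumed; not necessarily the exact upstream hunk):

	static int ibmvnic_set_mac(struct net_device *netdev, void *p)
	{
		struct ibmvnic_adapter *adapter = netdev_priv(netdev);
		struct sockaddr *addr = p;
		int rc = 0;

		/* validate first, so an invalid address can never reach adapter->mac_addr */
		if (!is_valid_ether_addr(addr->sa_data))
			return -EADDRNOTAVAIL;

		/* copy only on the path that actually programs the hardware */
		if (adapter->state != VNIC_PROBED) {
			ether_addr_copy(adapter->mac_addr, addr->sa_data);
			rc = __ibmvnic_set_mac(netdev, addr->sa_data);
		}

		return rc;
	}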
2020-10-29x86/sev-es: Do not support MMIO to/from encrypted memoryJoerg Roedel
MMIO memory is usually not mapped encrypted, so there is no reason to support emulated MMIO when it is mapped encrypted. Prevent a possible hypervisor attack where a RAM page is mapped as an MMIO page in the nested page-table, so that any guest access to it will trigger a #VC exception and leak the data on that page to the hypervisor via the GHCB (like with valid MMIO). On the read side this attack would allow the HV to inject data into the guest. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20201028164659.27002-6-joro@8bytes.org
2020-10-29mptcp: add missing memory scheduling in the rx pathPaolo Abeni
When moving the skbs from the subflow into the msk receive queue, we must schedule there the required amount of memory. Try to borrow the required memory from the subflow, if needed, so that we leverage the existing TCP heuristic. Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/f6143a6193a083574f11b00dbf7b5ad151bc4ff4.1603810630.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29drm/i915: Reject 90/270 degree rotated initial fbsVille Syrjälä
We don't currently handle the initial fb readout correctly for 90/270 degree rotated scanout. Reject it. Cc: stable@vger.kernel.org Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20201020194330.28568-1-ville.syrjala@linux.intel.com Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> (cherry picked from commit a40a8305a732f4ecc2186ac7ca132ba062ed770d) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2020-10-29drm/i915: Restore ILK-M RPS supportVille Syrjälä
Restore RPS for ILK-M. We lost it when an extra HAS_RPS() check appeared in intel_rps_enable(). Unfortunately, this just makes the performance worse on my ILK because intel_ips insists on limiting the GPU freq to the minimum. If we don't do the RPS init then intel_ips will not limit the frequency for whatever reason. Either it can't get at some required information and thus makes wrong decisions, or we mess up some weights/etc. and cause it to make the wrong decisions when RPS init has been done, or the entire thing is just wrong. It would require a bunch of reverse engineering to figure out what's going on. Cc: stable@vger.kernel.org Cc: Chris Wilson <chris@chris-wilson.co.uk> Fixes: 9c878557b1eb ("drm/i915/gt: Use the RPM config register to determine clk frequencies") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20201021131443.25616-1-ville.syrjala@linux.intel.com Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> (cherry picked from commit 2bf06370bcfb0dea5655e9a5ad460c7f7dca7739) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2020-10-29drm/i915/region: fix max size calculationMatthew Auld
We are incorrectly limiting the max allocation size as per the mm max_order, which is effectively the largest power-of-two that we can fit in the region size. However, it's normal to set up the region or allocator with a non-power-of-two size (for example 3G), which we should already handle correctly, except it seems for the early too-big check. v2: make sure we also exercise the I915_BO_ALLOC_CONTIGUOUS path, which is quite different, since for that we are actually limited by the largest power-of-two that we can fit within the region size. (Chris) Fixes: b908be543e44 ("drm/i915: support creating LMEM objects") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: CQ Tang <cq.tang@intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://patchwork.freedesktop.org/patch/msgid/20201021103606.241395-1-matthew.auld@intel.com (cherry picked from commit 83ebef47f8ebe320d5c5673db82f9903a4f40a69) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2020-10-29include: jhash/signal: Fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, explicitly add break statements instead of letting the code fall through to the next case. This patch adds four break statements that, together, fix almost 40,000 warnings when building Linux 5.10-rc1 with Clang 12.0.0 and this[1] change reverted. Notice that in order to enable -Wimplicit-fallthrough for Clang, such change[1] is meant to be reverted at some point. So, this patch helps to move in that direction. Something important to mention is that there is currently a discrepancy between GCC and Clang when dealing with switch fall-through to empty case statements or to cases that only contain a break/continue/return statement[2][3][4]. Now that the -Wimplicit-fallthrough option has been globally enabled[5], any compiler should really warn on missing either a fallthrough annotation or any of the other case-terminating statements (break/continue/return/goto) when falling through to the next case statement. Making exceptions to this introduces variation in case handling which may continue to lead to bugs, misunderstandings, and a general lack of robustness. The point of enabling options like -Wimplicit-fallthrough is to prevent human error and aid developers in spotting bugs before their code is even built/submitted/committed, therefore eliminating classes of bugs. So, in order to really accomplish this, we should, and can, move in the direction of addressing any error-prone scenarios and get rid of the unintentional fallthrough bug-class in the kernel, entirely, even if there is some minor redundancy. Better to have explicit case-ending statements than continue to have exceptions where one must guess as to the right result. The compiler will eliminate any actual redundancy.

[1] commit e2079e93f562c ("kbuild: Do not enable -Wimplicit-fallthrough for clang for now")
[2] https://github.com/ClangBuiltLinux/linux/issues/636
[3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91432
[4] https://godbolt.org/z/xgkvIh
[5] commit a035d552a93b ("Makefile: Globally enable fall-through warning")

Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
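A toy compile-and-run illustration of the pattern being enforced (not one of the four actual kernel hunks):

	#include <stdio.h>

	static void classify(int sig)
	{
		switch (sig) {
		case 17:	/* case body intentionally empty: Clang would flag a
				 * silent fall-through here, GCC would not */
			break;	/* explicit terminator keeps both compilers quiet */
		default:
			printf("handling signal %d\n", sig);
			break;
		}
	}

	int main(void)
	{
		classify(17);
		classify(2);
		return 0;
	}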
2020-10-29Merge tag 'afs-fixes-20201029' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fsLinus Torvalds
Pull AFS fixes from David Howells:

 - Fix copy_file_range() to an afs file now returning EINVAL if the splice_write file op isn't supplied.

 - Fix a deref-before-check in afs_unuse_cell().

 - Fix a use-after-free in afs_xattr_get_acl().

 - Fix afs to not try to clear PG_writeback when laundering a page.

 - Fix afs to take a ref on a page that it sets PG_private on and to drop that ref when clearing PG_private. This is done through recently added helpers.

 - Fix a page leak if write_begin() fails.

 - Fix afs_write_begin() to not alter the dirty region info stored in page->private, but rather do this in afs_write_end() instead, when we know what we actually changed.

 - Fix afs_invalidatepage() to alter the dirty region info on a page when partial page invalidation occurs so that we don't inadvertently include a span of zeros that will get written back if a page gets laundered due to a remote 3rd-party induced invalidation. We mustn't, however, reduce the dirty region if the page has been seen to be mapped (ie. we got called through the page_mkwrite vector) as the page might still be mapped and we might lose data if the file is extended again.

 - Fix the dirty region info to have a lower resolution if the size of the page is too large for this to be encoded (e.g. powerpc32 with 64K pages). Note that this might not be the ideal way to handle this, since it may allow some leakage of undirtied zero bytes to the server's copy in the case of a 3rd-party conflict.

To aid the last two fixes, two additional changes:

 - Wrap the manipulations of the dirty region info stored in page->private into helper functions.

 - Alter the encoding of the dirty region so that the region bounds can be stored with one fewer bit, making a bit available for the indication of mappedness.

* tag 'afs-fixes-20201029' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix dirty-region encoding on ppc32 with 64K pages
  afs: Fix afs_invalidatepage to adjust the dirty region
  afs: Alter dirty range encoding in page->private
  afs: Wrap page->private manipulations in inline functions
  afs: Fix where page->private is set during write
  afs: Fix page leak on afs_write_begin() failure
  afs: Fix to take ref on page when PG_private is set
  afs: Fix afs_launder_page to not clear PG_writeback
  afs: Fix a use after free in afs_xattr_get_acl()
  afs: Fix tracing deref-before-check
  afs: Fix copy_file_range()
2020-10-29x86/head/64: Check SEV encryption before switching to kernel page-tableJoerg Roedel
When SEV is enabled, the kernel requests the C-bit position again from the hypervisor to build its own page-table. Since the hypervisor is an untrusted source, the C-bit position needs to be verified before the kernel page-table is used. Call sev_verify_cbit() before writing the CR3. [ bp: Massage. ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20201028164659.27002-5-joro@8bytes.org
2020-10-29x86/boot/compressed/64: Check SEV encryption in 64-bit boot-pathJoerg Roedel
Check whether the hypervisor reported the correct C-bit when running as an SEV guest. Using a wrong C-bit position could be used to leak sensitive data from the guest to the hypervisor. The check function is in a separate file: arch/x86/kernel/sev_verify_cbit.S so that it can be re-used in the running kernel image. [ bp: Massage. ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20201028164659.27002-4-joro@8bytes.org
2020-10-29tipc: fix memory leak caused by tipc_buf_append()Tung Nguyen
Commit ed42989eab57 ("tipc: fix the skb_unshare() in tipc_buf_append()") replaced skb_unshare() with skb_copy() to intentionally not reduce the data reference counter of the original skb. This is not the correct way to handle the cloned skb because it causes a memory leak in the two following cases:

1/ Sending multicast messages via broadcast link
The original skb list is cloned to the local skb list for local destination. After that, the data reference counter of each skb in the original list has the value of 2. This causes each skb not to be freed after receiving ACK:

	tipc_link_advance_transmq()
	{
		...
		/* release skb */
		__skb_unlink(skb, &l->transmq);
		kfree_skb(skb); <-- memory exists after being freed
	}

2/ Sending multicast messages via replicast link
Similar to the above case, each skb cannot be freed after purging the skb list:

	tipc_mcast_xmit()
	{
		...
		__skb_queue_purge(pkts); <-- memory exists after being freed
	}

This commit fixes the issue by using skb_unshare() instead. Besides, to avoid the use-after-free error reported by KASAN, the pointer to the fragment is set to NULL before calling skb_unshare(), to make sure that the original skb is not freed after freeing the fragment twice in case skb_unshare() returns NULL. Fixes: ed42989eab57 ("tipc: fix the skb_unshare() in tipc_buf_append()") Acked-by: Jon Maloy <jmaloy@redhat.com> Reported-by: Thang Hoang Ngo <thang.h.ngo@dektech.com.au> Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://lore.kernel.org/r/20201027032403.1823-1-tung.q.nguyen@dektech.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
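A hedged sketch of the fragment handling described above (locals assumed from tipc_buf_append(); not necessarily the exact upstream hunk):

	struct sk_buff *frag = *buf;

	*buf = NULL;				/* clear the caller's pointer first ... */
	frag = skb_unshare(frag, GFP_ATOMIC);
	if (unlikely(!frag))			/* ... so a NULL return can't free it twice */
		goto err;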
2020-10-29gtp: fix a use-before-init in gtp_newlink()Masahiro Fujiwara
*_pdp_find() from gtp_encap_recv() would trigger a crash when a peer sends GTP packets while a new GTP device is being created:

	RIP: 0010:gtp1_pdp_find.isra.0+0x68/0x90 [gtp]
	<SNIP>
	Call Trace:
	 <IRQ>
	 gtp_encap_recv+0xc2/0x2e0 [gtp]
	 ? gtp1_pdp_find.isra.0+0x90/0x90 [gtp]
	 udp_queue_rcv_one_skb+0x1fe/0x530
	 udp_queue_rcv_skb+0x40/0x1b0
	 udp_unicast_rcv_skb.isra.0+0x78/0x90
	 __udp4_lib_rcv+0x5af/0xc70
	 udp_rcv+0x1a/0x20
	 ip_protocol_deliver_rcu+0xc5/0x1b0
	 ip_local_deliver_finish+0x48/0x50
	 ip_local_deliver+0xe5/0xf0
	 ? ip_protocol_deliver_rcu+0x1b0/0x1b0

gtp_encap_enable() should be called after gtp_hashtable_new(); otherwise *_pdp_find() will access the uninitialized hash table. Fixes: 1e3a3abd8b28 ("gtp: make GTP sockets in gtp_newlink optional") Signed-off-by: Masahiro Fujiwara <fujiwara.masahiro@gmail.com> Link: https://lore.kernel.org/r/20201027114846.3924-1-fujiwara.masahiro@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
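A hedged sketch of the corrected ordering in gtp_newlink() (error labels and variables assumed from the driver):

	err = gtp_hashtable_new(gtp, hashsize);	/* hash tables must exist ... */
	if (err < 0)
		return err;

	err = gtp_encap_enable(gtp, data);	/* ... before RX can call *_pdp_find() */
	if (err < 0)
		goto out_hashtable;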
2020-10-29Merge tag 'ext4_for_linus_fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds
Pull ext4 fixes from Ted Ts'o:
 "Bug fixes for the new ext4 fast commit feature, plus a fix for the 'data=journal' bug fix. Also use the generic casefolding support which has now landed in fs/libfs.c for 5.10"

* tag 'ext4_for_linus_fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/...
  ext4: use generic casefolding support
  ext4: do not use extent after put_bh
  ext4: use IS_ERR() for error checking of path
  ext4: fix mmap write protection for data=journal mode
  jbd2: fix a kernel-doc markup
  ext4: use s_mount_flags instead of s_mount_state for fast commit state
  ext4: make num of fast commit blocks configurable
  ext4: properly check for dirty state in ext4_inode_datasync_dirty()
  ext4: fix double locking in ext4_fc_commit_dentry_updates()
2020-10-29dma-mapping: fix 32-bit overflow with CONFIG_ARM_LPAE=nGeert Uytterhoeven
On r8a7791/koelsch and shmobile_defconfig, PCIe probing fails with:

	rcar-pcie fe000000.pcie: Adjusted size 0x0 invalid
	rcar-pcie: probe of fe000000.pcie failed with error -22

of_dma_get_range() returns the following map:

	cpu_start 0x40000000 dma_start 0x40000000 size 0x080000000 offset 0
	cpu_start 0x00000000 dma_start 0x00000000 size 0x100000000 offset 0

If CONFIG_ARM_LPAE=n, dma_addr_t is 32-bit. Hence when assigning r->dma_start + r->size to dma_end, this value will be truncated to 32-bit, yielding zero when processing the second table entry. Consequently, both dma_start and dma_end will be zero, leading to a zero size. Fix this by changing the dma_start and dma_end variables from dma_addr_t to u64. Fixes: e0d072782c734d27 ("dma-mapping: introduce DMA range map, supplanting dma_pfn_offset") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Christoph Hellwig <hch@lst.de>
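A minimal runnable sketch of the truncation (the dma_addr_t typedef here is illustrative of the CONFIG_ARM_LPAE=n case):

	#include <stdio.h>
	#include <stdint.h>

	typedef uint32_t dma_addr_t;	/* 32-bit, as with CONFIG_ARM_LPAE=n */

	int main(void)
	{
		uint64_t dma_start = 0x00000000ULL;
		uint64_t size      = 0x100000000ULL;	/* the second map entry */

		dma_addr_t bad_end  = (dma_addr_t)(dma_start + size);	/* wraps to 0 */
		uint64_t   good_end = dma_start + size;			/* 0x100000000 */

		printf("dma_addr_t end: %#x\n", bad_end);
		printf("u64 end:        %#llx\n", (unsigned long long)good_end);
		return 0;
	}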
2020-10-29xfs: set xefi_discard when creating a deferred agfl free log intent itemDarrick J. Wong
Make sure that we actually initialize xefi_discard when we're scheduling a deferred free of an AGFL block. This was (eventually) found by UBSAN while I was banging on realtime rmap problems, but it exists in the upstream codebase. While we're at it, rearrange the structure to reduce the struct size from 64 to 56 bytes. Fixes: fcb762f5de2e ("xfs: add bmapi nodiscard flag") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-29lib/scatterlist: use consistent sg_copy_buffer() return typeDavid Disseldorp
sg_copy_buffer() returns a size_t with the number of bytes copied. Return 0 instead of false if the copy is skipped. Signed-off-by: David Disseldorp <ddiss@suse.de> Reviewed-by: Douglas Gilbert <dgilbert@interlog.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-29Merge tag 'nvme-5.10-2020-10-29' of git://git.infradead.org/nvme into block-5.10Jens Axboe
Pull NVMe fixes from Christoph:
 "nvme updates for 5.10:
  - improve zone revalidation (Keith Busch)
  - gracefully handle zero length messages in nvme-rdma (zhenwei pi)
  - nvme-fc error handling fixes (James Smart)
  - nvmet tracing NULL pointer dereference fix (Chaitanya Kulkarni)"

* tag 'nvme-5.10-2020-10-29' of git://git.infradead.org/nvme:
  nvmet: fix a NULL pointer dereference when tracing the flush command
  nvme-fc: remove nvme_fc_terminate_io()
  nvme-fc: eliminate terminate_io use by nvme_fc_error_recovery
  nvme-fc: remove err_work work item
  nvme-fc: track error_recovery while connecting
  nvme-rdma: handle unexpected nvme completion data length
  nvme: ignore zone validate errors on subsequent scans
2020-10-29tools, bpftool: Remove two unused variables.Ian Rogers
Avoid an unused variable warning. Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201027233646.3434896-2-irogers@google.com
2020-10-29tools, bpftool: Avoid array index warnings.Ian Rogers
The bpf_caps array is shorter without CAP_BPF, avoid out of bounds reads if this isn't defined. Working around this avoids -Wno-array-bounds with clang. Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201027233646.3434896-1-irogers@google.com
2020-10-29xsysace: use platform_get_resource() and platform_get_irq_optional()Andy Shevchenko
Use platform_get_resource() to fetch the memory resource and platform_get_irq_optional() to get the optional IRQ instead of open-coded variants. The IRQ is not supposed to change at runtime, so there is no functional change in ace_fsm_yieldirq(). On the other hand, we now take the first resources instead of the last ones to proceed. I can't imagine how broken the firmware would have to be to have garbage in the first resource slots. But if that is the case, it needs to be documented. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-29xsk: Fix possible memory leak at socket closeMagnus Karlsson
Fix a possible memory leak at xsk socket close that is caused by the refcounting of the umem object being wrong. The reference count of the umem was decremented only after the pool had been freed. Note that if the buffer pool is destroyed, it is important that the umem is destroyed after the pool, otherwise the umem would disappear while the driver is still running. And as the buffer pool needs to be destroyed in a work queue, the umem is also (if its refcount reaches zero) destroyed after the buffer pool in that same work queue. What was missing is that the refcount also needs to be decremented when the pool is not freed and when the pool has not even been created. The first case happens when the refcount of the pool is higher than 1, i.e. it is still being used by some other socket using the same device and queue id. In this case, it is safe to decrement the refcount of the umem outside of the work queue as the umem will never be freed because the refcount of the umem is always greater than or equal to the refcount of the buffer pool. The second case is if the buffer pool has not been created yet, i.e. the socket was closed before it was bound but after the umem was created. In this case, it is safe to destroy the umem outside of the work queue, since there is no pool that can use it by definition. Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem") Reported-by: syzbot+eb71df123dc2be2c1456@syzkaller.appspotmail.com Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/1603801921-2712-1-git-send-email-magnus.karlsson@gmail.com
2020-10-29bpf: Add struct bpf_redir_neigh forward declaration to BPF helper defsAndrii Nakryiko
Forward-declare struct bpf_redir_neigh in bpf_helper_defs.h to avoid compiler warning about unknown structs. Fixes: ba452c9e996d ("bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20201028181204.111241-1-andrii@kernel.org
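As a hedged illustration, the shape of the generated header after such a fix (the helper-ID literal is a placeholder, and the declaration style mimics bpf_helper_defs.h):

	/* an incomplete type is enough: the prototype only uses a pointer to it */
	struct bpf_redir_neigh;

	static long (*bpf_redirect_neigh)(__u32 ifindex,
					  struct bpf_redir_neigh *params,
					  int plen, __u64 flags) = (void *) 183;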
2020-10-29afs: Fix dirty-region encoding on ppc32 with 64K pagesDavid Howells
The dirty region bounds stored in page->private on an afs page are 15 bits on a 32-bit box and can, at most, represent a range of up to 32K within a 32K page with a resolution of 1 byte. This is a problem for powerpc32 with 64K pages enabled. Further, transparent huge pages may get up to 2M, which will be a problem for the afs filesystem on all 32-bit arches in the future. Fix this by decreasing the resolution. For the moment, a 64K page will have a resolution determined from PAGE_SIZE. In the future, the page will need to be passed in to the helper functions so that the page size can be assessed and the resolution determined dynamically. Note that this might not be the ideal way to handle this, since it may allow some leakage of undirtied zero bytes to the server's copy in the case of a 3rd-party conflict. Fixing that would require a separately allocated record and is a more complicated fix. Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2020-10-29afs: Fix afs_invalidatepage to adjust the dirty regionDavid Howells
Fix afs_invalidatepage() to adjust the dirty region recorded in page->private when truncating a page. If the dirty region is entirely removed, then the private data is cleared and the page dirty state is cleared. Without this, if the page is truncated and then expanded again by truncate, zeros from the expanded, but no-longer dirty region may get written back to the server if the page gets laundered due to a conflicting 3rd-party write. It mustn't, however, shorten the dirty region of the page if that page is still mmapped and has been marked dirty by afs_page_mkwrite(), so a flag is stored in page->private to record this. Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-29afs: Alter dirty range encoding in page->privateDavid Howells
Currently, page->private on an afs page is used to store the range of dirtied data within the page, where the range includes the lower bound, but excludes the upper bound (e.g. 0-1 is a range covering a single byte). This, however, requires a superfluous bit for the last-byte bound so that on a 4KiB page, it can say 0-4096 to indicate the whole page, the idea being that having both numbers the same would indicate an empty range. This is unnecessary as the PG_private bit is clear if it's an empty range (as is PG_dirty). Alter the way the dirty range is encoded in page->private such that the upper bound is reduced by 1 (e.g. 0-0 then specifies the same single-byte range mentioned above). Applying this to both bounds frees up two bits, one of which can be used in a future commit. This allows the afs filesystem to be compiled on ppc32 with 64K pages; without this, the following warnings are seen:

	../fs/afs/internal.h: In function 'afs_page_dirty_to':
	../fs/afs/internal.h:881:15: warning: right shift count >= width of type [-Wshift-count-overflow]
	  881 |   return (priv >> __AFS_PAGE_PRIV_SHIFT) & __AFS_PAGE_PRIV_MASK;
	      |                ^~
	../fs/afs/internal.h: In function 'afs_page_dirty':
	../fs/afs/internal.h:886:28: warning: left shift count >= width of type [-Wshift-count-overflow]
	  886 |   return ((unsigned long)to << __AFS_PAGE_PRIV_SHIFT) | from;
	      |                             ^~

Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>
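A rough self-contained sketch of the encoding idea (macro names and the shift are illustrative, not the kernel's exact definitions): storing the inclusive last byte means a fully dirtied 64K page encodes as 0xffff, which fits in 16 bits, where the exclusive bound 0x10000 would not:

	#define PRIV_SHIFT	16			/* assume a 32-bit unsigned long */
	#define PRIV_MASK	((1UL << PRIV_SHIFT) - 1)

	static inline unsigned long dirty_encode(unsigned long from, unsigned long to)
	{
		return ((to - 1) << PRIV_SHIFT) | from;	/* store inclusive upper bound */
	}

	static inline unsigned long dirty_to(unsigned long priv)
	{
		return ((priv >> PRIV_SHIFT) & PRIV_MASK) + 1;	/* back to exclusive */
	}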
2020-10-29afs: Wrap page->private manipulations in inline functionsDavid Howells
The afs filesystem uses page->private to store the dirty range within a page such that in the event of a conflicting 3rd-party write to the server, we write back just the bits that got changed locally. However, there are a couple of problems with this: (1) I need a bit to note if the page might be mapped so that partial invalidation doesn't shrink the range. (2) There aren't necessarily sufficient bits to store the entire range of data altered (say it's a 32-bit system with 64KiB pages or transparent huge pages are in use). So wrap the accesses in inline functions so that future commits can change how this works. Also move them out of the tracing header into the in-directory header. There's not really any need for them to be in the tracing header. Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-29afs: Fix where page->private is set during writeDavid Howells
In afs, page->private is set to indicate the dirty region of a page. This is done in afs_write_begin(), but that can't take account of whether the copy into the page actually worked. Fix this by moving the change of page->private into afs_write_end(). Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-29afs: Fix page leak on afs_write_begin() failureDavid Howells
Fix the leak of the target page in afs_write_begin() when it fails. Fixes: 15b4650e55e0 ("afs: convert to new aops") Signed-off-by: David Howells <dhowells@redhat.com> cc: Nick Piggin <npiggin@gmail.com>
2020-10-29afs: Fix to take ref on page when PG_private is setDavid Howells
Fix afs to take a ref on a page when it sets PG_private on it and to drop the ref when removing the flag. Note that in afs_write_begin(), a lot of the time, PG_private is already set on a page to which we're going to add some data. In such a case, we leave the bit set and mustn't increment the page count. As suggested by Matthew Wilcox, use attach/detach_page_private() where possible. Fixes: 31143d5d515e ("AFS: implement basic file write support") Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
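A hedged sketch of the resulting pattern (priv stands for the encoded dirty-region word; afs specifics elided):

	/* setting: take the page ref and PG_private together, but only once */
	if (PagePrivate(page))
		set_page_private(page, priv);		 /* ref already held, just update */
	else
		attach_page_private(page, (void *)priv); /* takes a ref and sets PG_private */

	/* clearing: detach_page_private() clears PG_private and drops the ref */
	priv = (unsigned long)detach_page_private(page);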
2020-10-29null_blk: Fix locking in zoned modeDamien Le Moal
When the zoned mode is enabled in null_blk, serializing read, write and zone management operations for each zone is necessary to protect device level information for managing zone resources (zone open and closed counters) as well as each zone condition and write pointer position. Commit 35bc10b2eafb ("null_blk: synchronization fix for zoned device") introduced a spinlock to implement this serialization. However, when memory backing is also enabled, GFP_NOIO memory allocations are executed under the spinlock, resulting in might_sleep() warnings. Furthermore, the zone_lock spinlock is locked/unlocked using spin_lock_irq/spin_unlock_irq, similarly to the memory backing code with the nullb->lock spinlock. This nested use of irq locks wrecks the irq enabled/disabled state. Fix all this by introducing a bitmap for per-zone locks, with locking implemented using wait_on_bit_lock_io() and clear_and_wake_up_bit(). This locking mechanism allows keeping a zone locked while executing null_process_cmd(), serializing all operations to the zone while allowing sleeping during memory backing allocation with GFP_NOIO. Device level zone resource management information is protected using a spinlock which is not held while executing null_process_cmd(). Fixes: 35bc10b2eafb ("null_blk: synchronization fix for zoned device") Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
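A hedged sketch of the per-zone bit-lock helpers (assuming dev->zone_locks is a bitmap with one bit per zone, as the description implies):

	/* may sleep, so GFP_NOIO allocations under the "lock" are fine,
	 * and no irq state is wrecked, unlike with spin_lock_irq() */
	static void null_lock_zone(struct nullb_device *dev, unsigned int zno)
	{
		wait_on_bit_lock_io(dev->zone_locks, zno, TASK_UNINTERRUPTIBLE);
	}

	static void null_unlock_zone(struct nullb_device *dev, unsigned int zno)
	{
		clear_and_wake_up_bit(zno, dev->zone_locks);
	}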
2020-10-29null_blk: Fix zone reset all tracingDamien Le Moal
In the case of the REQ_OP_ZONE_RESET_ALL operation, the command sector is ignored and the operation is applied to all sequential zones. For these commands, tracing the effect of the command using the command sector to determine the target zone is thus incorrect. Fix null_zone_mgmt() zone condition tracing in the case of REQ_OP_ZONE_RESET_ALL to apply tracing to all sequential zones that are not already empty. Fixes: 766c3297d7e1 ("null_blk: add trace in null_blk_zoned.c") Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-29nbd: don't update block size after device is startedMing Lei
A mounted NBD device can be resized; one use case is rbd-nbd. Fix the issue by setting up a default block size and then not touching it in nbd_size_update() any more. This kind of usage is aligned with loop, which has the same use case. Cc: stable@vger.kernel.org Fixes: c8a83a6b54d0 ("nbd: Use set_blocksize() to set device blocksize") Reported-by: lining <lining2020x@163.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jan Kara <jack@suse.cz> Tested-by: lining <lining2020x@163.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-29cpufreq: schedutil: Always call driver if CPUFREQ_NEED_UPDATE_LIMITS is setRafael J. Wysocki
Because sugov_update_next_freq() may skip a frequency update even if the need_freq_update flag has been set for the policy at hand, policy limits updates may not take effect as expected. For example, if the intel_pstate driver operates in the passive mode with HWP enabled, it needs to update the HWP min and max limits when the policy min and max limits change, respectively, but that may not happen if the target frequency does not change along with the limit at hand. In particular, if the policy min is changed first, causing the target frequency to be adjusted to it, and the policy max limit is changed later to the same value, the HWP max limit will not be updated to follow it as expected, because the target frequency is still equal to the policy min limit and it will not change until that limit is updated. To address this issue, modify get_next_freq() to let the driver callback run if the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag is set regardless of whether or not the new frequency to set is equal to the previous one. Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled") Reported-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 1c534352f47f cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ... Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: a62f68f5ca53 cpufreq: Introduce cpufreq_driver_test_flags() Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-29cpufreq: Introduce cpufreq_driver_test_flags()Rafael J. Wysocki
Add a helper function to test the flags of the cpufreq driver in use against a given flags mask. In particular, this will be needed to test the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag in the schedutil governor. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
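A sketch of the helper plus the way the schedutil fix above would use it (the schedutil condition is abbreviated, not the exact upstream hunk):

	/* cpufreq core: test the active driver's flags against a mask */
	bool cpufreq_driver_test_flags(u16 flags)
	{
		return !!(cpufreq_driver->flags & flags);
	}

	/* in sugov_update_next_freq(), roughly: skip the update only when the
	 * frequency is unchanged AND the driver doesn't demand limits updates */
	if (sg_policy->next_freq == next_freq &&
	    !cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS))
		return false;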
2020-10-29arm64: Add workaround for Arm Cortex-A77 erratum 1508412Rob Herring
On Cortex-A77 r0p0 and r1p0, a sequence of a non-cacheable or device load and a store exclusive or PAR_EL1 read can cause a deadlock. The workaround requires a DMB SY before and after a PAR_EL1 register read. In addition, it's possible an interrupt (doing a device read) or KVM guest exit could be taken between the DMB and PAR read, so we also need a DMB before returning from interrupt and before returning to a guest. A deadlock is still possible with the workaround as KVM guests must also have the workaround. IOW, a malicious guest can deadlock an affected system. This workaround also depends on a firmware counterpart to enable the h/w to insert DMB SY after load and store exclusive instructions. See the errata document SDEN-1152370 v10 [1] for more information. [1] https://static.docs.arm.com/101992/0010/Arm_Cortex_A77_MP074_Software_Developer_Errata_Notice_v10.pdf Signed-off-by: Rob Herring <robh@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Julien Thierry <julien.thierry.kdev@gmail.com> Cc: kvmarm@lists.cs.columbia.edu Link: https://lore.kernel.org/r/20201028182839.166037-2-robh@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2020-10-29arm64: Add part number for Arm Cortex-A77Rob Herring
Add the MIDR part number info for the Arm Cortex-A77. Signed-off-by: Rob Herring <robh@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20201028182839.166037-1-robh@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2020-10-29x86/boot/compressed/64: Sanity-check CPUID results in the early #VC handlerJoerg Roedel
The early #VC handler which doesn't have a GHCB can only handle CPUID exit codes. It is needed by the early boot code to handle #VC exceptions raised in verify_cpu() and to get the position of the C-bit. But the CPUID information comes from the hypervisor which is untrusted and might return results which trick the guest into the no-SEV boot path with no C-bit set in the page-tables. All data written to memory would then be unencrypted and could leak sensitive data to the hypervisor. Add sanity checks to the early #VC handler to make sure the hypervisor can not pretend that SEV is disabled. [ bp: Massage a bit. ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20201028164659.27002-3-joro@8bytes.org
2020-10-29spi: bcm2835: fix gpio cs level inversionMartin Hundebøll
The work on improving gpio chip-select in spi core, and the following fixes, has caused the bcm2835 spi driver to use wrong levels. Fix this by simply removing level handling in the bcm2835 driver, and let the core do its work. Fixes: 3e5ec1db8bfe ("spi: Fix SPI_CS_HIGH setting when using native and GPIO CS") Cc: <stable@vger.kernel.org> Signed-off-by: Martin Hundebøll <martin@geanix.com> Link: https://lore.kernel.org/r/20201014090230.2706810-1-martin@geanix.com Signed-off-by: Mark Brown <broonie@kernel.org>
2020-10-29ASoC: qcom: lpass-cpu: Fix clock disable failureV Sujith Kumar Reddy
Disable the MI2S bit clock in the PAUSE/STOP/SUSPEND use cases instead of at shutdown time. Achieve this by invoking the clk_disable API from the cpu dai ops trigger instead of the cpu dai ops shutdown. Change the non-atomic API "clk_prepare_enable" to the atomic API "clk_enable" in trigger, as trigger is called from atomic context. Fixes: 7e6799d8f87d ("ASoC: qcom: lpass-cpu: Enable MI2S BCLK and LRCLK together") Signed-off-by: V Sujith Kumar Reddy <vsujithk@codeaurora.org> Signed-off-by: Srinivasa Rao Mandadapu <srivasam@codeaurora.org> Link: https://lore.kernel.org/r/1603098363-9251-1-git-send-email-srivasam@codeaurora.org Signed-off-by: Mark Brown <broonie@kernel.org>
2020-10-29ASoC: qcom: lpass-sc7180: Fix MI2S bitwidth field bit positionsV Sujith Kumar Reddy
Update the SC7180 lpass_variant structure with the proper I2S bitwidth field bit positions: the bitwidth field spans bits 0 to 1, but previously only bit 0 was used. Signed-off-by: V Sujith Kumar Reddy <vsujithk@codeaurora.org> Signed-off-by: Srinivasa Rao Mandadapu <srivasam@codeaurora.org> Link: https://lore.kernel.org/r/1603798474-4897-1-git-send-email-srivasam@codeaurora.org Signed-off-by: Mark Brown <broonie@kernel.org>
2020-10-29HID: i2c-hid: Put ACPI enumerated devices in D3 on shutdownHans de Goede
The i2c-hid driver would quietly fail to probe the i2c-hid sensor-hub with an ACPI device-id of SMO91D0 every other boot. Specifically, the i2c_smbus_read_byte() "Make sure there is something at this address" check would fail every other boot. It seems that the BIOS does not properly reset/power-cycle the device, leaving it in a confused state where it refuses to respond to i2c-xfers. On boots where probing the device failed, the driver-core puts the device in D3 after the probe-failure, which causes the probe to succeed the next boot. Putting the device in D3 from the shutdown-handler fixes the sensors not working every other boot. This has been tested on both a Lenovo Miix 2-10 and a Dell Venue 8 Pro 5830, both of which use an i2c-hid sensor-hub with an ACPI id of SMO91D0. Note that it is safe to call acpi_device_set_power() with a NULL pointer as first argument, so on non-ACPI-enumerated devices this change is a no-op. Cc: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-10-29usb: cdns3: gadget: suspicious implicit sign extensionPeter Chen
The code:

	trb->length = cpu_to_le32(TRB_BURST_LEN(priv_ep->trb_burst_size) |
				  TRB_LEN(length));

may overflow a 32-bit int if priv_ep->trb_burst_size is equal to or larger than 0x80. Below is the Coverity warning:

	sign_extension: Suspicious implicit sign extension: priv_ep->trb_burst_size
	with type u8 (8 bits, unsigned) is promoted in priv_ep->trb_burst_size << 24
	to type int (32 bits, signed), then sign-extended to type unsigned long
	(64 bits, unsigned). If priv_ep->trb_burst_size << 24 is greater than
	0x7FFFFFFF, the upper bits of the result will all be 1.

To fix it, add an explicit cast to unsigned int type for ((p) << 24). Reviewed-by: Jun Li <jun.li@nxp.com> Signed-off-by: Peter Chen <peter.chen@nxp.com>
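A minimal runnable sketch of the promotion (mirrors the kernel expression; note the shift itself already overflows the signed int, which is exactly the suspicious part):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint8_t burst = 0x80;	/* trb_burst_size >= 0x80 triggers it */

		/* (burst << 24) is computed as a signed int (0x80000000 == INT_MIN),
		 * so widening to 64 bits sign-extends the upper half */
		uint64_t widened = (uint64_t)(burst << 24);
		printf("sign-extended: %#llx\n", (unsigned long long)widened);
		/* prints 0xffffffff80000000 */

		/* the fix: cast to unsigned int before any widening happens */
		uint64_t fixed = (uint64_t)(unsigned int)(burst << 24);
		printf("with cast:     %#llx\n", (unsigned long long)fixed);
		/* prints 0x80000000 */
		return 0;
	}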
2020-10-29sched/fair: Check for idle core in wake_affineJulia Lawall
In the case of a thread wakeup, wake_affine determines whether a core will be chosen for the thread on the socket where the thread ran previously or on the socket of the waker. This is done primarily by comparing the load of the core where the thread ran previously (prev) and the load of the waker (this). Commit 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path") changed the load computation from the runnable load to the load average, where the latter includes the load of threads that have already blocked on the core. When a short-running daemon process happens to run on prev, this change raises the situation that prev can appear to have a greater load than this, even when prev is actually idle. When prev and this are on the same socket, the idle prev is detected later, in select_idle_sibling. But if that does not hold, prev is completely ignored, causing the waking thread to move to the socket of the waker. In the case of N mostly active threads on N cores, this triggers other migrations and hurts performance. In contrast, before commit 11f10e5420f6, the load on an idle core was 0, and in the case of a non-idle waker core, the effect of wake_affine was to select prev as the target for searching for a core for the waking thread. To avoid unnecessary migrations, extend wake_affine_idle to check whether the core where the thread previously ran is currently idle, and if so simply return that core as the target; see the sketch after this entry.

[1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable load in wakeup path")

This particularly has an impact when using the ondemand power manager, where kworkers run every 0.004 seconds on all cores, increasing the likelihood that an idle core will be considered to have a load. The following numbers were obtained with the benchmarking tool hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel benchmarks (https://www.nas.nasa.gov/publications/npb.html). The tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz. Active (intel_pstate) and passive (intel_cpufreq) power management were used. Times are in seconds. All experiments use all 160 hardware threads.

	        v5.9/intel-pstate      v5.9+patch/intel-pstate
	bt.C.c  24.725724+-0.962340    23.349608+-1.607214
	lu.C.x  29.105952+-4.804203    25.249052+-5.561617
	sp.C.x  31.220696+-1.831335    30.227760+-2.429792
	ua.C.x  26.606118+-1.767384    25.778367+-1.263850

	        v5.9/ondemand          v5.9+patch/ondemand
	bt.C.c  25.330360+-1.028316    23.544036+-1.020189
	lu.C.x  35.872659+-4.872090    23.719295+-3.883848
	sp.C.x  32.141310+-2.289541    29.125363+-0.872300
	ua.C.x  29.024597+-1.667049    25.728888+-1.539772

On the smaller data sets (A and B) and on the other NAS benchmarks there is no impact on performance. This also has a major impact on the splash2x.volrend benchmark of the parsec benchmark suite, which goes from 1m25 without this patch to 0m45, in active (intel_pstate) mode. Fixes: 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path") Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/1603372550-14680-1-git-send-email-Julia.Lawall@inria.fr
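A hedged sketch of the extended check (abbreviated from kernel/sched/fair.c's wake_affine_idle(); treat as illustrative):

	static int wake_affine_idle(int this_cpu, int prev_cpu, int sync)
	{
		if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
			return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

		if (sync && cpu_rq(this_cpu)->nr_running == 1)
			return this_cpu;

		/* the fix: an idle prev is always a safe, migration-free target */
		if (available_idle_cpu(prev_cpu))
			return prev_cpu;

		return nr_cpumask_bits;
	}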
2020-10-29sched: Remove reliance on STRUCT_ALIGNMENTPeter Zijlstra
Florian reported that all of kernel/sched/ is rebuilt when CONFIG_BLK_DEV_INITRD is changed, which, while not a bug, is unexpected. This is due to us including vmlinux.lds.h. Jakub explained that the problem is that we put the alignment requirement on the type instead of on a variable. Type alignment is a minimum; the compiler is free to pick any larger alignment for a specific instance of the type (e.g. the variable). So force the alignment on all individual variable definitions and remove the undesired dependency on vmlinux.lds.h. Fixes: 85c2ce9104eb ("sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9") Reported-by: Florian Fainelli <f.fainelli@gmail.com> Suggested-by: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
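A runnable sketch of the type-versus-variable distinction (names are illustrative, not the scheduler's):

	#include <stdio.h>
	#include <stdint.h>

	/* type-level alignment is only a minimum; the compiler may align any
	 * given instance more strictly, which breaks assumptions about
	 * back-to-back placement of consecutive definitions */
	struct sclass { void (*enqueue)(void); } __attribute__((aligned(32)));

	/* variable-level alignment: forced per definition, the approach the fix takes */
	static struct sclass a __attribute__((aligned(64)));
	static struct sclass b __attribute__((aligned(64)));

	int main(void)
	{
		printf("type min align: %zu\n", (size_t)__alignof__(struct sclass));
		printf("&a %% 64 = %ju, &b %% 64 = %ju\n",
		       (uintmax_t)((uintptr_t)&a % 64), (uintmax_t)((uintptr_t)&b % 64));
		return 0;
	}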
2020-10-29sched: Reenable interrupts in do_sched_yield()Thomas Gleixner
do_sched_yield() invokes schedule() with interrupts disabled, which is not allowed. This goes back to the pre-git era, to commit a6efb709806c ("[PATCH] irqlock patch 2.5.27-H6") in the history tree. Reenable interrupts and remove the misleading comment which "explains" it. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
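A hedged sketch of the reworked function (helper names from kernel/sched/core.c; treat as illustrative of the fix, not the verbatim hunk):

	static void do_sched_yield(void)
	{
		struct rq_flags rf;
		struct rq *rq;

		rq = this_rq_lock_irq(&rf);

		schedstat_inc(rq->yld_count);
		current->sched_class->yield_task(rq);

		preempt_disable();
		rq_unlock_irq(rq, &rf);		/* interrupts back on before schedule() */
		sched_preempt_enable_no_resched();

		schedule();
	}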
2020-10-29sched: membarrier: document memory ordering scenariosMathieu Desnoyers
Document membarrier ordering scenarios in membarrier.c. Thanks to Alan Stern for refreshing my memory. Now that I have those in mind, it seems appropriate to serialize them to comments for posterity. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201020134715.13909-4-mathieu.desnoyers@efficios.com
2020-10-29sched: membarrier: cover kthread_use_mm (v4)Mathieu Desnoyers
Add comments and a memory barrier to kthread_use_mm and kthread_unuse_mm to allow the effect of membarrier(2) to apply to kthreads accessing user-space memory as well. Given that no prior kthread uses this guarantee and that it only affects kthreads, adding this guarantee does not affect user-space ABI. Refine the check in membarrier_global_expedited to exclude runqueues running the idle thread rather than all kthreads from the IPI cpumask. Now that membarrier_global_expedited can IPI kthreads, the scheduler also needs to update the runqueue's membarrier_state when entering lazy TLB state. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201020134715.13909-3-mathieu.desnoyers@efficios.com