summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-09-23fscrypt: rename DCACHE_ENCRYPTED_NAME to DCACHE_NOKEY_NAMEEric Biggers
Originally we used the term "encrypted name" or "ciphertext name" to mean the encoded filename that is shown when an encrypted directory is listed without its key. But these terms are ambiguous since they also mean the filename stored on-disk. "Encrypted name" is especially ambiguous since it could also be understood to mean "this filename is encrypted on-disk", similar to "encrypted file". So we've started calling these encoded names "no-key names" instead. Therefore, rename DCACHE_ENCRYPTED_NAME to DCACHE_NOKEY_NAME to avoid confusion about what this flag means. Link: https://lore.kernel.org/r/20200924042624.98439-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-23fscrypt: don't call no-key names "ciphertext names"Eric Biggers
Currently we're using the term "ciphertext name" ambiguously because it can mean either the actual ciphertext filename, or the encoded filename that is shown when an encrypted directory is listed without its key. The latter we're now usually calling the "no-key name"; and while it's derived from the ciphertext name, it's not the same thing. To avoid this ambiguity, rename fscrypt_name::is_ciphertext_name to fscrypt_name::is_nokey_name, and update comments that say "ciphertext name" (or "encrypted name") to say "no-key name" instead when warranted. Link: https://lore.kernel.org/r/20200924042624.98439-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22fscrypt: use sha256() instead of open codingEric Biggers
Now that there's a library function that calculates the SHA-256 digest of a buffer in one step, use it instead of sha256_init() + sha256_update() + sha256_final(). Link: https://lore.kernel.org/r/20200917045341.324996-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22fscrypt: make fscrypt_set_test_dummy_encryption() take a 'const char *'Eric Biggers
fscrypt_set_test_dummy_encryption() requires that the optional argument to the test_dummy_encryption mount option be specified as a substring_t. That doesn't work well with filesystems that use the new mount API, since the new way of parsing mount options doesn't use substring_t. Make it take the argument as a 'const char *' instead. Instead of moving the match_strdup() into the callers in ext4 and f2fs, make them just use arg->from directly. Since the pattern is "test_dummy_encryption=%s", the argument will be null-terminated. Acked-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20200917041136.178600-14-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22fscrypt: handle test_dummy_encryption in more logical wayEric Biggers
The behavior of the test_dummy_encryption mount option is that when a new file (or directory or symlink) is created in an unencrypted directory, it's automatically encrypted using a dummy encryption policy. That's it; in particular, the encryption (or lack thereof) of existing files (or directories or symlinks) doesn't change. Unfortunately the implementation of test_dummy_encryption is a bit weird and confusing. When test_dummy_encryption is enabled and a file is being created in an unencrypted directory, we set up an encryption key (->i_crypt_info) for the directory. This isn't actually used to do any encryption, however, since the directory is still unencrypted! Instead, ->i_crypt_info is only used for inheriting the encryption policy. One consequence of this is that the filesystem ends up providing a "dummy context" (policy + nonce) instead of a "dummy policy". In commit ed318a6cc0b6 ("fscrypt: support test_dummy_encryption=v2"), I mistakenly thought this was required. However, actually the nonce only ends up being used to derive a key that is never used. Another consequence of this implementation is that it allows for 'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge case that can be forgotten about. For example, currently FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the dummy encryption policy when the filesystem is mounted with test_dummy_encryption. That seems like the wrong thing to do, since again, the directory itself is not actually encrypted. Therefore, switch to a more logical and maintainable implementation where the dummy encryption policy inheritance is done without setting up keys for unencrypted directories. This involves: - Adding a function fscrypt_policy_to_inherit() which returns the encryption policy to inherit from a directory. This can be a real policy, a dummy policy, or no policy. - Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc. with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc. - Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead of an inode. Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> Acked-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22fscrypt: move fscrypt_prepare_symlink() out-of-lineEric Biggers
In preparation for moving the logic for "get the encryption policy inherited by new files in this directory" to a single place, make fscrypt_prepare_symlink() a regular function rather than an inline function that wraps __fscrypt_prepare_symlink(). This way, the new function fscrypt_policy_to_inherit() won't need to be exported to filesystems. Acked-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20200917041136.178600-12-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22fscrypt: make "#define fscrypt_policy" user-onlyEric Biggers
The fscrypt UAPI header defines fscrypt_policy to fscrypt_policy_v1, for source compatibility with old userspace programs. Internally, the kernel doesn't want that compatibility definition. Instead, fscrypt_private.h #undefs it and re-defines it to a union. That works for now. However, in order to add fscrypt_operations::get_dummy_policy(), we'll need to forward declare 'union fscrypt_policy' in include/linux/fscrypt.h. That would cause build errors because "fscrypt_policy" is used in ioctl numbers. To avoid this, modify the UAPI header to make the fscrypt_policy compatibility definition conditional on !__KERNEL__, and make the ioctls use fscrypt_policy_v1 instead of fscrypt_policy. Note that this doesn't change the actual ioctl numbers. Acked-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20200917041136.178600-11-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22fscrypt: stop pretending that key setup is nofs-safeEric Biggers
fscrypt_get_encryption_info() has never actually been safe to call in a context that needs GFP_NOFS, since it calls crypto_alloc_skcipher(). crypto_alloc_skcipher() isn't GFP_NOFS-safe, even if called under memalloc_nofs_save(). This is because it may load kernel modules, and also because it internally takes crypto_alg_sem. Other tasks can do GFP_KERNEL allocations while holding crypto_alg_sem for write. The use of fscrypt_init_mutex isn't GFP_NOFS-safe either. So, stop pretending that fscrypt_get_encryption_info() is nofs-safe. I.e., when it allocates memory, just use GFP_KERNEL instead of GFP_NOFS. Note, another reason to do this is that GFP_NOFS is deprecated in favor of using memalloc_nofs_save() in the proper places. Acked-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20200917041136.178600-10-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22fscrypt: require that fscrypt_encrypt_symlink() already has keyEric Biggers
Now that all filesystems have been converted to use fscrypt_prepare_new_inode(), the encryption key for new symlink inodes is now already set up whenever we try to encrypt the symlink target. Enforce this rather than try to set up the key again when it may be too late to do so safely. Acked-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20200917041136.178600-9-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22fscrypt: remove fscrypt_inherit_context()Eric Biggers
Now that all filesystems have been converted to use fscrypt_prepare_new_inode() and fscrypt_set_context(), fscrypt_inherit_context() is no longer used. Remove it. Acked-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20200917041136.178600-8-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22fscrypt: adjust logging for in-creation inodesEric Biggers
Now that a fscrypt_info may be set up for inodes that are currently being created and haven't yet had an inode number assigned, avoid logging confusing messages about "inode 0". Acked-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20200917041136.178600-7-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22ubifs: use fscrypt_prepare_new_inode() and fscrypt_set_context()Eric Biggers
Convert ubifs to use the new functions fscrypt_prepare_new_inode() and fscrypt_set_context(). Unlike ext4 and f2fs, this doesn't appear to fix any deadlock bug. But it does shorten the code slightly and get all filesystems using the same helper functions, so that fscrypt_inherit_context() can be removed. It also fixes an incorrect error code where ubifs returned EPERM instead of the expected ENOKEY. Link: https://lore.kernel.org/r/20200917041136.178600-6-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22f2fs: use fscrypt_prepare_new_inode() and fscrypt_set_context()Eric Biggers
Convert f2fs to use the new functions fscrypt_prepare_new_inode() and fscrypt_set_context(). This avoids calling fscrypt_get_encryption_info() from under f2fs_lock_op(), which can deadlock because fscrypt_get_encryption_info() isn't GFP_NOFS-safe. For more details about this problem, see the earlier patch "fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()". This also fixes a f2fs-specific deadlock when the filesystem is mounted with '-o test_dummy_encryption' and a file is created in an unencrypted directory other than the root directory: INFO: task touch:207 blocked for more than 30 seconds. Not tainted 5.9.0-rc4-00099-g729e3d0919844 #2 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:touch state:D stack: 0 pid: 207 ppid: 167 flags:0x00000000 Call Trace: [...] lock_page include/linux/pagemap.h:548 [inline] pagecache_get_page+0x25e/0x310 mm/filemap.c:1682 find_or_create_page include/linux/pagemap.h:348 [inline] grab_cache_page include/linux/pagemap.h:424 [inline] f2fs_grab_cache_page fs/f2fs/f2fs.h:2395 [inline] f2fs_grab_cache_page fs/f2fs/f2fs.h:2373 [inline] __get_node_page.part.0+0x39/0x2d0 fs/f2fs/node.c:1350 __get_node_page fs/f2fs/node.c:35 [inline] f2fs_get_node_page+0x2e/0x60 fs/f2fs/node.c:1399 read_inline_xattr+0x88/0x140 fs/f2fs/xattr.c:288 lookup_all_xattrs+0x1f9/0x2c0 fs/f2fs/xattr.c:344 f2fs_getxattr+0x9b/0x160 fs/f2fs/xattr.c:532 f2fs_get_context+0x1e/0x20 fs/f2fs/super.c:2460 fscrypt_get_encryption_info+0x9b/0x450 fs/crypto/keysetup.c:472 fscrypt_inherit_context+0x2f/0xb0 fs/crypto/policy.c:640 f2fs_init_inode_metadata+0xab/0x340 fs/f2fs/dir.c:540 f2fs_add_inline_entry+0x145/0x390 fs/f2fs/inline.c:621 f2fs_add_dentry+0x31/0x80 fs/f2fs/dir.c:757 f2fs_do_add_link+0xcd/0x130 fs/f2fs/dir.c:798 f2fs_add_link fs/f2fs/f2fs.h:3234 [inline] f2fs_create+0x104/0x290 fs/f2fs/namei.c:344 lookup_open.isra.0+0x2de/0x500 fs/namei.c:3103 open_last_lookups+0xa9/0x340 fs/namei.c:3177 path_openat+0x8f/0x1b0 fs/namei.c:3365 do_filp_open+0x87/0x130 fs/namei.c:3395 do_sys_openat2+0x96/0x150 fs/open.c:1168 [...] That happened because f2fs_add_inline_entry() locks the directory inode's page in order to add the dentry, then f2fs_get_context() tries to lock it recursively in order to read the encryption xattr. This problem is specific to "test_dummy_encryption" because normally the directory's fscrypt_info would be set up prior to f2fs_add_inline_entry() in order to encrypt the new filename. Regardless, the new design fixes this test_dummy_encryption deadlock as well as potential deadlocks with fs reclaim, by setting up any needed fscrypt_info structs prior to taking so many locks. The test_dummy_encryption deadlock was reported by Daniel Rosenberg. Reported-by: Daniel Rosenberg <drosen@google.com> Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> Link: https://lore.kernel.org/r/20200917041136.178600-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22ext4: use fscrypt_prepare_new_inode() and fscrypt_set_context()Eric Biggers
Convert ext4 to use the new functions fscrypt_prepare_new_inode() and fscrypt_set_context(). This avoids calling fscrypt_get_encryption_info() from within a transaction, which can deadlock because fscrypt_get_encryption_info() isn't GFP_NOFS-safe. For more details about this problem, see the earlier patch "fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()". Link: https://lore.kernel.org/r/20200917041136.178600-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22ext4: factor out ext4_xattr_credits_for_new_inode()Eric Biggers
To compute a new inode's xattr credits, we need to know whether the inode will be encrypted or not. When we switch to use the new helper function fscrypt_prepare_new_inode(), we won't find out whether the inode will be encrypted until slightly later than is currently the case. That will require moving the code block that computes the xattr credits. To make this easier and reduce the length of __ext4_new_inode(), move this code block into a new function ext4_xattr_credits_for_new_inode(). Link: https://lore.kernel.org/r/20200917041136.178600-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-22fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()Eric Biggers
fscrypt_get_encryption_info() is intended to be GFP_NOFS-safe. But actually it isn't, since it uses functions like crypto_alloc_skcipher() which aren't GFP_NOFS-safe, even when called under memalloc_nofs_save(). Therefore it can deadlock when called from a context that needs GFP_NOFS, e.g. during an ext4 transaction or between f2fs_lock_op() and f2fs_unlock_op(). This happens when creating a new encrypted file. We can't fix this by just not setting up the key for new inodes right away, since new symlinks need their key to encrypt the symlink target. So we need to set up the new inode's key before starting the transaction. But just calling fscrypt_get_encryption_info() earlier doesn't work, since it assumes the encryption context is already set, and the encryption context can't be set until the transaction. The recently proposed fscrypt support for the ceph filesystem (https://lkml.kernel.org/linux-fscrypt/20200821182813.52570-1-jlayton@kernel.org/T/#u) will have this same ordering problem too, since ceph will need to encrypt new symlinks before setting their encryption context. Finally, f2fs can deadlock when the filesystem is mounted with '-o test_dummy_encryption' and a new file is created in an existing unencrypted directory. Similarly, this is caused by holding too many locks when calling fscrypt_get_encryption_info(). To solve all these problems, add new helper functions: - fscrypt_prepare_new_inode() sets up a new inode's encryption key (fscrypt_info), using the parent directory's encryption policy and a new random nonce. It neither reads nor writes the encryption context. - fscrypt_set_context() persists the encryption context of a new inode, using the information from the fscrypt_info already in memory. This replaces fscrypt_inherit_context(). Temporarily keep fscrypt_inherit_context() around until all filesystems have been converted to use fscrypt_set_context(). Acked-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20200917041136.178600-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-07fscrypt: restrict IV_INO_LBLK_32 to ino_bits <= 32Eric Biggers
When an encryption policy has the IV_INO_LBLK_32 flag set, the IV generation method involves hashing the inode number. This is different from fscrypt's other IV generation methods, where the inode number is either not used at all or is included directly in the IVs. Therefore, in principle IV_INO_LBLK_32 can work with any length inode number. However, currently fscrypt gets the inode number from inode::i_ino, which is 'unsigned long'. So currently the implementation limit is actually 32 bits (like IV_INO_LBLK_64), since longer inode numbers will have been truncated by the VFS on 32-bit platforms. Fix fscrypt_supported_v2_policy() to enforce the correct limit. This doesn't actually matter currently, since only ext4 and f2fs support IV_INO_LBLK_32, and they both only support 32-bit inode numbers. But we might as well fix it in case it matters in the future. Ideally inode::i_ino would instead be made 64-bit, but for now it's not needed. (Note, this limit does *not* prevent filesystems with 64-bit inode numbers from adding fscrypt support, since IV_INO_LBLK_* support is optional and is useful only on certain hardware.) Fixes: e3b1078bedd3 ("fscrypt: add support for IV_INO_LBLK_32 policies") Reported-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20200824203841.1707847-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-07fscrypt: drop unused inode argument from fscrypt_fname_alloc_bufferJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20200810142139.487631-1-jlayton@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-06Linux 5.9-rc4Linus Torvalds
2020-09-06Merge tag 'io_uring-5.9-2020-09-06' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more io_uring fixes from Jens Axboe: "Two followup fixes. One is fixing a regression from this merge window, the other is two commits fixing cancelation of deferred requests. Both have gone through full testing, and both spawned a few new regression test additions to liburing. - Don't play games with const, properly store the output iovec and assign it as needed. - Deferred request cancelation fix (Pavel)" * tag 'io_uring-5.9-2020-09-06' of git://git.kernel.dk/linux-block: io_uring: fix linked deferred ->files cancellation io_uring: fix cancel of deferred reqs with ->files io_uring: fix explicit async read/write mapping for large segments
2020-09-06Merge tag 'iommu-fixes-v5.9-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu fixes from Joerg Roedel: - three Intel VT-d fixes to fix address handling on 32bit, fix a NULL pointer dereference bug and serialize a hardware register access as required by the VT-d spec. - two patches for AMD IOMMU to force AMD GPUs into translation mode when memory encryption is active and disallow using IOMMUv2 functionality. This makes the AMDGPU driver work when memory encryption is active. - two more fixes for AMD IOMMU to fix updating the Interrupt Remapping Table Entries. - MAINTAINERS file update for the Qualcom IOMMU driver. * tag 'iommu-fixes-v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/vt-d: Handle 36bit addressing for x86-32 iommu/amd: Do not use IOMMUv2 functionality when SME is active iommu/amd: Do not force direct mapping when SME is active iommu/amd: Use cmpxchg_double() when updating 128-bit IRTE iommu/amd: Restore IRTE.RemapEn bit after programming IRTE iommu/vt-d: Fix NULL pointer dereference in dev_iommu_priv_set() iommu/vt-d: Serialize IOMMU GCMD register modifications MAINTAINERS: Update QUALCOMM IOMMU after Arm SMMU drivers move
2020-09-06Merge tag 'x86-urgent-2020-09-06' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - more generic entry code ABI fallout - debug register handling bugfixes - fix vmalloc mappings on 32-bit kernels - kprobes instrumentation output fix on 32-bit kernels - fix over-eager WARN_ON_ONCE() on !SMAP hardware - NUMA debugging fix - fix Clang related crash on !RETPOLINE kernels * tag 'x86-urgent-2020-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry: Unbreak 32bit fast syscall x86/debug: Allow a single level of #DB recursion x86/entry: Fix AC assertion tracing/kprobes, x86/ptrace: Fix regs argument order for i386 x86, fakenuma: Fix invalid starting node ID x86/mm/32: Bring back vmalloc faulting on x86_32 x86/cmdline: Disable jump tables for cmdline.c
2020-09-06Merge tag 'for-linus-5.9-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: "A small series for fixing a problem with Xen PVH guests when running as backends (e.g. as dom0). Mapping other guests' memory is now working via ZONE_DEVICE, thus not requiring to abuse the memory hotplug functionality for that purpose" * tag 'for-linus-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: add helpers to allocate unpopulated memory memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC xen/balloon: add header guard
2020-09-05io_uring: fix linked deferred ->files cancellationPavel Begunkov
While looking for ->files in ->defer_list, consider that requests there may actually be links. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-05io_uring: fix cancel of deferred reqs with ->filesPavel Begunkov
While trying to cancel requests with ->files, it also should look for requests in ->defer_list, otherwise it might end up hanging a thread. Cancel all requests in ->defer_list up to the last request there with matching ->files, that's needed to follow drain ordering semantics. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-05Merge tags 'auxdisplay-for-linus-v5.9-rc4', ↵Linus Torvalds
'clang-format-for-linus-v5.9-rc4' and 'compiler-attributes-for-linus-v5.9-rc4' of git://github.com/ojeda/linux Pull misc fixes from Miguel Ojeda: "A trivial patch for auxdisplay: - Replace HTTP links with HTTPS ones (Alexander A. Klimov) The usual clang-format trivial update: - Update with the latest for_each macro list (Miguel Ojeda) And Luc requested me to pick a sparse fix on my queue, so here it goes along with other two trivial Compiler Attributes ones (also from Luc). - sparse: use static inline for __chk_{user,io}_ptr() (Luc Van Oostenryck) - Compiler Attributes: fix comment concerning GCC 4.6 (Luc Van Oostenryck) - Compiler Attributes: remove comment about sparse not supporting __has_attribute (Luc Van Oostenryck)" * tag 'auxdisplay-for-linus-v5.9-rc4' of git://github.com/ojeda/linux: auxdisplay: Replace HTTP links with HTTPS ones * tag 'clang-format-for-linus-v5.9-rc4' of git://github.com/ojeda/linux: clang-format: Update with the latest for_each macro list * tag 'compiler-attributes-for-linus-v5.9-rc4' of git://github.com/ojeda/linux: sparse: use static inline for __chk_{user,io}_ptr() Compiler Attributes: fix comment concerning GCC 4.6 Compiler Attributes: remove comment about sparse not supporting __has_attribute
2020-09-05Merge tag 'arc-5.9-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC fixes from Vineet Gupta: - HSDK-4xd Dev system: perf driver updates for sampling interrupt - HSDK* Dev System: Ethernet broken [Evgeniy Didin] - HIGHMEM broken (2 memory banks) [Mike Rapoport] - show_regs() rewrite once and for all - Other minor fixes * tag 'arc-5.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: ARC: [plat-hsdk]: Switch ethernet phy-mode to rgmii-id arc: fix memory initialization for systems with two memory banks irqchip/eznps: Fix build error for !ARC700 builds ARC: show_regs: fix r12 printing and simplify ARC: HSDK: wireup perf irq ARC: perf: don't bail setup if pct irq missing in device-tree ARC: pgalloc.h: delete a duplicated word + other fixes
2020-09-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "19 patches. Subsystems affected by this patch series: MAINTAINERS, ipc, fork, checkpatch, lib, and mm (memcg, slub, pagemap, madvise, migration, hugetlb)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: include/linux/log2.h: add missing () around n in roundup_pow_of_two() mm/khugepaged.c: fix khugepaged's request size in collapse_file mm/hugetlb: fix a race between hugetlb sysctl handlers mm/hugetlb: try preferred node first when alloc gigantic page from cma mm/migrate: preserve soft dirty in remove_migration_pte() mm/migrate: remove unnecessary is_zone_device_page() check mm/rmap: fixup copying of soft dirty and uffd ptes mm/migrate: fixup setting UFFD_WP flag mm: madvise: fix vma user-after-free checkpatch: fix the usage of capture group ( ... ) fork: adjust sysctl_max_threads definition to match prototype ipc: adjust proc_ipc_sem_dointvec definition to match prototype mm: track page table modifications in __apply_to_page_range() MAINTAINERS: IA64: mark Status as Odd Fixes only MAINTAINERS: add LLVM maintainers MAINTAINERS: update Cavium/Marvell entries mm: slub: fix conversion of freelist_corrupted() mm: memcg: fix memcg reclaim soft lockup memcg: fix use-after-free in uncharge_batch
2020-09-05include/linux/log2.h: add missing () around n in roundup_pow_of_two()Jason Gunthorpe
Otherwise gcc generates warnings if the expression is complicated. Fixes: 312a0c170945 ("[PATCH] LOG2: Alter roundup_pow_of_two() so that it can use a ilog2() on a constant") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/0-v1-8a2697e3c003+41165-log_brackets_jgg@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05mm/khugepaged.c: fix khugepaged's request size in collapse_fileDavid Howells
collapse_file() in khugepaged passes PAGE_SIZE as the number of pages to be read to page_cache_sync_readahead(). The intent was probably to read a single page. Fix it to use the number of pages to the end of the window instead. Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Yang Shi <shy828301@gmail.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Eric Biggers <ebiggers@google.com> Link: https://lkml.kernel.org/r/20200903140844.14194-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05mm/hugetlb: fix a race between hugetlb sysctl handlersMuchun Song
There is a race between the assignment of `table->data` and write value to the pointer of `table->data` in the __do_proc_doulongvec_minmax() on the other thread. CPU0: CPU1: proc_sys_write hugetlb_sysctl_handler proc_sys_call_handler hugetlb_sysctl_handler_common hugetlb_sysctl_handler table->data = &tmp; hugetlb_sysctl_handler_common table->data = &tmp; proc_doulongvec_minmax do_proc_doulongvec_minmax sysctl_head_finish __do_proc_doulongvec_minmax unuse_table i = table->data; *i = val; // corrupt CPU1's stack Fix this by duplicating the `table`, and only update the duplicate of it. And introduce a helper of proc_hugetlb_doulongvec_minmax() to simplify the code. The following oops was seen: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page Code: Bad RIP value. ... Call Trace: ? set_max_huge_pages+0x3da/0x4f0 ? alloc_pool_huge_page+0x150/0x150 ? proc_doulongvec_minmax+0x46/0x60 ? hugetlb_sysctl_handler_common+0x1c7/0x200 ? nr_hugepages_store+0x20/0x20 ? copy_fd_bitmaps+0x170/0x170 ? hugetlb_sysctl_handler+0x1e/0x20 ? proc_sys_call_handler+0x2f1/0x300 ? unregister_sysctl_table+0xb0/0xb0 ? __fd_install+0x78/0x100 ? proc_sys_write+0x14/0x20 ? __vfs_write+0x4d/0x90 ? vfs_write+0xef/0x240 ? ksys_write+0xc0/0x160 ? __ia32_sys_read+0x50/0x50 ? __close_fd+0x129/0x150 ? __x64_sys_write+0x43/0x50 ? do_syscall_64+0x6c/0x200 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/20200828031146.43035-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05mm/hugetlb: try preferred node first when alloc gigantic page from cmaLi Xinhai
Since commit cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma"), the gigantic page would be allocated from node which is not the preferred node, although there are pages available from that node. The reason is that the nid parameter has been ignored in alloc_gigantic_page(). Besides, the __GFP_THISNODE also need be checked if user required to alloc only from the preferred node. After this patch, the preferred node is tried first before other allowed nodes, and don't try to allocate from other nodes if __GFP_THISNODE is specified. If user don't specify the preferred node, the current node will be used as preferred node, which makes sure consistent behavior of allocating gigantic and non-gigantic hugetlb page. Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma") Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Link: https://lkml.kernel.org/r/20200902025016.697260-1-lixinhai.lxh@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05mm/migrate: preserve soft dirty in remove_migration_pte()Ralph Campbell
The code to remove a migration PTE and replace it with a device private PTE was not copying the soft dirty bit from the migration entry. This could lead to page contents not being marked dirty when faulting the page back from device private memory. Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Bharata B Rao <bharata@linux.ibm.com> Link: https://lkml.kernel.org/r/20200831212222.22409-3-rcampbell@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05mm/migrate: remove unnecessary is_zone_device_page() checkRalph Campbell
Patch series "mm/migrate: preserve soft dirty in remove_migration_pte()". I happened to notice this from code inspection after seeing Alistair Popple's patch ("mm/rmap: Fixup copying of soft dirty and uffd ptes"). This patch (of 2): The check for is_zone_device_page() and is_device_private_page() is unnecessary since the latter is sufficient to determine if the page is a device private page. Simplify the code for easier reading. Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Bharata B Rao <bharata@linux.ibm.com> Link: https://lkml.kernel.org/r/20200831212222.22409-1-rcampbell@nvidia.com Link: https://lkml.kernel.org/r/20200831212222.22409-2-rcampbell@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05mm/rmap: fixup copying of soft dirty and uffd ptesAlistair Popple
During memory migration a pte is temporarily replaced with a migration swap pte. Some pte bits from the existing mapping such as the soft-dirty and uffd write-protect bits are preserved by copying these to the temporary migration swap pte. However these bits are not stored at the same location for swap and non-swap ptes. Therefore testing these bits requires using the appropriate helper function for the given pte type. Unfortunately several code locations were found where the wrong helper function is being used to test soft_dirty and uffd_wp bits which leads to them getting incorrectly set or cleared during page-migration. Fix these by using the correct tests based on pte type. Fixes: a5430dda8a3a ("mm/migrate: support un-addressable ZONE_DEVICE page in migration") Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration") Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Alistair Popple <alistair@popple.id.au> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200825064232.10023-2-alistair@popple.id.au Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05mm/migrate: fixup setting UFFD_WP flagAlistair Popple
Commit f45ec5ff16a75 ("userfaultfd: wp: support swap and page migration") introduced support for tracking the uffd wp bit during page migration. However the non-swap PTE variant was used to set the flag for zone device private pages which are a type of swap page. This leads to corruption of the swap offset if the original PTE has the uffd_wp flag set. Fixes: f45ec5ff16a75 ("userfaultfd: wp: support swap and page migration") Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Link: https://lkml.kernel.org/r/20200825064232.10023-1-alistair@popple.id.au Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05mm: madvise: fix vma user-after-freeYang Shi
The syzbot reported the below use-after-free: BUG: KASAN: use-after-free in madvise_willneed mm/madvise.c:293 [inline] BUG: KASAN: use-after-free in madvise_vma mm/madvise.c:942 [inline] BUG: KASAN: use-after-free in do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145 Read of size 8 at addr ffff8880a6163eb0 by task syz-executor.0/9996 CPU: 0 PID: 9996 Comm: syz-executor.0 Not tainted 5.9.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x18f/0x20d lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530 madvise_willneed mm/madvise.c:293 [inline] madvise_vma mm/madvise.c:942 [inline] do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145 do_madvise mm/madvise.c:1169 [inline] __do_sys_madvise mm/madvise.c:1171 [inline] __se_sys_madvise mm/madvise.c:1169 [inline] __x64_sys_madvise+0xd9/0x110 mm/madvise.c:1169 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Allocated by task 9992: kmem_cache_alloc+0x138/0x3a0 mm/slab.c:3482 vm_area_alloc+0x1c/0x110 kernel/fork.c:347 mmap_region+0x8e5/0x1780 mm/mmap.c:1743 do_mmap+0xcf9/0x11d0 mm/mmap.c:1545 vm_mmap_pgoff+0x195/0x200 mm/util.c:506 ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 9992: kmem_cache_free.part.0+0x67/0x1f0 mm/slab.c:3693 remove_vma+0x132/0x170 mm/mmap.c:184 remove_vma_list mm/mmap.c:2613 [inline] __do_munmap+0x743/0x1170 mm/mmap.c:2869 do_munmap mm/mmap.c:2877 [inline] mmap_region+0x257/0x1780 mm/mmap.c:1716 do_mmap+0xcf9/0x11d0 mm/mmap.c:1545 vm_mmap_pgoff+0x195/0x200 mm/util.c:506 ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 It is because vma is accessed after releasing mmap_lock, but someone else acquired the mmap_lock and the vma is gone. Releasing mmap_lock after accessing vma should fix the problem. Fixes: 692fe62433d4c ("mm: Handle MADV_WILLNEED through vfs_fadvise()") Reported-by: syzbot+b90df26038d1d5d85c97@syzkaller.appspotmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> [5.4+] Link: https://lkml.kernel.org/r/20200816141204.162624-1-shy828301@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05checkpatch: fix the usage of capture group ( ... )Mrinal Pandey
The usage of "capture group (...)" in the immediate condition after `&&` results in `$1` being uninitialized. This issues a warning "Use of uninitialized value $1 in regexp compilation at ./scripts/checkpatch.pl line 2638". I noticed this bug while running checkpatch on the set of commits from v5.7 to v5.8-rc1 of the kernel on the commits with a diff content in their commit message. This bug was introduced in the script by commit e518e9a59ec3 ("checkpatch: emit an error when there's a diff in a changelog"). It has been in the script since then. The author intended to store the match made by capture group in variable `$1`. This should have contained the name of the file as `[\w/]+` matched. However, this couldn't be accomplished due to usage of capture group and `$1` in the same regular expression. Fix this by placing the capture group in the condition before `&&`. Thus, `$1` can be initialized to the text that capture group matches thereby setting it to the desired and required value. Fixes: e518e9a59ec3 ("checkpatch: emit an error when there's a diff in a changelog") Signed-off-by: Mrinal Pandey <mrinalmni@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Joe Perches <joe@perches.com> Link: https://lkml.kernel.org/r/20200714032352.f476hanaj2dlmiot@mrinalpandey Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05fork: adjust sysctl_max_threads definition to match prototypeTobias Klauser
Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") changed ctl_table.proc_handler to take a kernel pointer. Adjust the definition of sysctl_max_threads to match its prototype in linux/sysctl.h which fixes the following sparse error/warning: kernel/fork.c:3050:47: warning: incorrect type in argument 3 (different address spaces) kernel/fork.c:3050:47: expected void * kernel/fork.c:3050:47: got void [noderef] __user *buffer kernel/fork.c:3036:5: error: symbol 'sysctl_max_threads' redeclared with different type (incompatible argument 3 (different address spaces)): kernel/fork.c:3036:5: int extern [addressable] [signed] [toplevel] sysctl_max_threads( ... ) kernel/fork.c: note: in included file (through include/linux/key.h, include/linux/cred.h, include/linux/sched/signal.h, include/linux/sched/cputime.h): include/linux/sysctl.h:242:5: note: previously declared as: include/linux/sysctl.h:242:5: int extern [addressable] [signed] [toplevel] sysctl_max_threads( ... ) Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/20200825093647.24263-1-tklauser@distanz.ch Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05ipc: adjust proc_ipc_sem_dointvec definition to match prototypeTobias Klauser
Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") changed ctl_table.proc_handler to take a kernel pointer. Adjust the signature of proc_ipc_sem_dointvec to match ctl_table.proc_handler which fixes the following sparse error/warning: ipc/ipc_sysctl.c:94:47: warning: incorrect type in argument 3 (different address spaces) ipc/ipc_sysctl.c:94:47: expected void *buffer ipc/ipc_sysctl.c:94:47: got void [noderef] __user *buffer ipc/ipc_sysctl.c:194:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces)) ipc/ipc_sysctl.c:194:35: expected int ( [usertype] *proc_handler )( ... ) ipc/ipc_sysctl.c:194:35: got int ( * )( ... ) Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/20200825105846.5193-1-tklauser@distanz.ch Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05mm: track page table modifications in __apply_to_page_range()Joerg Roedel
__apply_to_page_range() is also used to change and/or allocate page-table pages in the vmalloc area of the address space. Make sure these changes get synchronized to other page-tables in the system by calling arch_sync_kernel_mappings() when necessary. The impact appears limited to x86-32, where apply_to_page_range may miss updating the PMD. That leads to explosions in drivers like BUG: unable to handle page fault for address: fe036000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page *pde = 00000000 Oops: 0002 [#1] SMP CPU: 3 PID: 1300 Comm: gem_concurrent_ Not tainted 5.9.0-rc1+ #16 Hardware name: /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015 EIP: __execlists_context_alloc+0x132/0x2d0 [i915] Code: 31 d2 89 f0 e8 2f 55 02 00 89 45 e8 3d 00 f0 ff ff 0f 87 11 01 00 00 8b 4d e8 03 4b 30 b8 5a 5a 5a 5a ba 01 00 00 00 8d 79 04 <c7> 01 5a 5a 5a 5a c7 81 fc 0f 00 00 5a 5a 5a 5a 83 e7 fc 29 f9 81 EAX: 5a5a5a5a EBX: f60ca000 ECX: fe036000 EDX: 00000001 ESI: f43b7340 EDI: fe036004 EBP: f6389cb8 ESP: f6389c9c DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286 CR0: 80050033 CR2: fe036000 CR3: 2d361000 CR4: 001506d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: fffe0ff0 DR7: 00000400 Call Trace: execlists_context_alloc+0x10/0x20 [i915] intel_context_alloc_state+0x3f/0x70 [i915] __intel_context_do_pin+0x117/0x170 [i915] i915_gem_do_execbuffer+0xcc7/0x2500 [i915] i915_gem_execbuffer2_ioctl+0xcd/0x1f0 [i915] drm_ioctl_kernel+0x8f/0xd0 drm_ioctl+0x223/0x3d0 __ia32_sys_ioctl+0x1ab/0x760 __do_fast_syscall_32+0x3f/0x70 do_fast_syscall_32+0x29/0x60 do_SYSENTER_32+0x15/0x20 entry_SYSENTER_32+0x9f/0xf2 EIP: 0xb7f28559 Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76 EAX: ffffffda EBX: 00000005 ECX: c0406469 EDX: bf95556c ESI: b7e68000 EDI: c0406469 EBP: 00000005 ESP: bf9554d8 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000296 Modules linked in: i915 x86_pkg_temp_thermal intel_powerclamp crc32_pclmul crc32c_intel intel_cstate intel_uncore intel_gtt drm_kms_helper intel_pch_thermal video button autofs4 i2c_i801 i2c_smbus fan CR2: 00000000fe036000 It looks like kasan, xen and i915 are vulnerable. Actual impact is "on thinkpad X60 in 5.9-rc1, screen starts blinking after 30-or-so minutes, and machine is unusable" [sfr@canb.auug.org.au: ARCH_PAGE_TABLE_SYNC_MASK needs vmalloc.h] Link: https://lkml.kernel.org/r/20200825172508.16800a4f@canb.auug.org.au [chris@chris-wilson.co.uk: changelog addition] [pavel@ucw.cz: changelog addition] Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified") Fixes: 86cf69f1d893 ("x86/mm/32: implement arch_sync_kernel_mappings()") Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Chris Wilson <chris@chris-wilson.co.uk> [x86-32] Tested-by: Pavel Machek <pavel@ucw.cz> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> [5.8+] Link: https://lkml.kernel.org/r/20200821123746.16904-1-joro@8bytes.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05MAINTAINERS: IA64: mark Status as Odd Fixes onlyRandy Dunlap
IA64 isn't really being maintained, so mark it as Odd Fixes only. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Link: http://lkml.kernel.org/r/7e719139-450f-52c2-59a2-7964a34eda1f@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05MAINTAINERS: add LLVM maintainersNick Desaulniers
Nominate Nathan and myself to be point of contact for clang/LLVM related support, after a poll at the LLVM BoF at Linux Plumbers Conf 2020. While corporate sponsorship is beneficial, its important to not entrust the keys to the nukes with any one entity. Should Nathan and I find ourselves at the same employer, I would gladly step down. Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Sedat Dilek <sedat.dilek@gmail.com> Acked-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Acked-by: Masahiro Yamada <masahiroy@kernel.org> Link: https://lkml.kernel.org/r/20200825143540.2948637-1-ndesaulniers@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05MAINTAINERS: update Cavium/Marvell entriesRobert Richter
I am leaving Marvell and already do not have access to my @marvell.com email address. So switching over to my korg mail address or removing my address there another maintainer is already listed. For the entries there no other maintainer is listed I will keep looking into patches for Cavium systems for a while until someone from Marvell takes it over. Since I might have limited access to hardware and also limited time I changed state to 'Odd Fixes' for those entries. Signed-off-by: Robert Richter <rric@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ganapatrao Kulkarni <gkulkarni@marvell.com> Cc: Sunil Goutham <sgoutham@marvell.com> CC: Borislav Petkov <bp@alien8.de> Cc: Marc Zyngier <maz@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Wolfram Sang <wsa@kernel.org>, Link: https://lkml.kernel.org/r/20200824122050.31164-1-rric@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05mm: slub: fix conversion of freelist_corrupted()Eugeniu Rosca
Commit 52f23478081ae0 ("mm/slub.c: fix corrupted freechain in deactivate_slab()") suffered an update when picked up from LKML [1]. Specifically, relocating 'freelist = NULL' into 'freelist_corrupted()' created a no-op statement. Fix it by sticking to the behavior intended in the original patch [1]. In addition, make freelist_corrupted() immune to passing NULL instead of &freelist. The issue has been spotted via static analysis and code review. [1] https://lore.kernel.org/linux-mm/20200331031450.12182-1-dongli.zhang@oracle.com/ Fixes: 52f23478081ae0 ("mm/slub.c: fix corrupted freechain in deactivate_slab()") Signed-off-by: Eugeniu Rosca <erosca@de.adit-jv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Joe Jin <joe.jin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200824130643.10291-1-erosca@de.adit-jv.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05mm: memcg: fix memcg reclaim soft lockupXunlei Pang
We've met softlockup with "CONFIG_PREEMPT_NONE=y", when the target memcg doesn't have any reclaimable memory. It can be easily reproduced as below: watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204] CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12 Call Trace: shrink_lruvec+0x49f/0x640 shrink_node+0x2a6/0x6f0 do_try_to_free_pages+0xe9/0x3e0 try_to_free_mem_cgroup_pages+0xef/0x1f0 try_charge+0x2c1/0x750 mem_cgroup_charge+0xd7/0x240 __add_to_page_cache_locked+0x2fd/0x370 add_to_page_cache_lru+0x4a/0xc0 pagecache_get_page+0x10b/0x2f0 filemap_fault+0x661/0xad0 ext4_filemap_fault+0x2c/0x40 __do_fault+0x4d/0xf9 handle_mm_fault+0x1080/0x1790 It only happens on our 1-vcpu instances, because there's no chance for oom reaper to run to reclaim the to-be-killed process. Add a cond_resched() at the upper shrink_node_memcgs() to solve this issue, this will mean that we will get a scheduling point for each memcg in the reclaimed hierarchy without any dependency on the reclaimable memory in that memcg thus making it more predictable. Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: http://lkml.kernel.org/r/1598495549-67324-1-git-send-email-xlpang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05memcg: fix use-after-free in uncharge_batchMichal Hocko
syzbot has reported an use-after-free in the uncharge_batch path BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline] BUG: KASAN: use-after-free in atomic64_sub_return include/asm-generic/atomic-instrumented.h:970 [inline] BUG: KASAN: use-after-free in atomic_long_sub_return include/asm-generic/atomic-long.h:113 [inline] BUG: KASAN: use-after-free in page_counter_cancel mm/page_counter.c:54 [inline] BUG: KASAN: use-after-free in page_counter_uncharge+0x3d/0xc0 mm/page_counter.c:155 Write of size 8 at addr ffff8880371c0148 by task syz-executor.0/9304 CPU: 0 PID: 9304 Comm: syz-executor.0 Not tainted 5.8.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1f0/0x31e lib/dump_stack.c:118 print_address_description+0x66/0x620 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report+0x132/0x1d0 mm/kasan/report.c:530 check_memory_region_inline mm/kasan/generic.c:183 [inline] check_memory_region+0x2b5/0x2f0 mm/kasan/generic.c:192 instrument_atomic_write include/linux/instrumented.h:71 [inline] atomic64_sub_return include/asm-generic/atomic-instrumented.h:970 [inline] atomic_long_sub_return include/asm-generic/atomic-long.h:113 [inline] page_counter_cancel mm/page_counter.c:54 [inline] page_counter_uncharge+0x3d/0xc0 mm/page_counter.c:155 uncharge_batch+0x6c/0x350 mm/memcontrol.c:6764 uncharge_page+0x115/0x430 mm/memcontrol.c:6796 uncharge_list mm/memcontrol.c:6835 [inline] mem_cgroup_uncharge_list+0x70/0xe0 mm/memcontrol.c:6877 release_pages+0x13a2/0x1550 mm/swap.c:911 tlb_batch_pages_flush mm/mmu_gather.c:49 [inline] tlb_flush_mmu_free mm/mmu_gather.c:242 [inline] tlb_flush_mmu+0x780/0x910 mm/mmu_gather.c:249 tlb_finish_mmu+0xcb/0x200 mm/mmu_gather.c:328 exit_mmap+0x296/0x550 mm/mmap.c:3185 __mmput+0x113/0x370 kernel/fork.c:1076 exit_mm+0x4cd/0x550 kernel/exit.c:483 do_exit+0x576/0x1f20 kernel/exit.c:793 do_group_exit+0x161/0x2d0 kernel/exit.c:903 get_signal+0x139b/0x1d30 kernel/signal.c:2743 arch_do_signal+0x33/0x610 arch/x86/kernel/signal.c:811 exit_to_user_mode_loop kernel/entry/common.c:135 [inline] exit_to_user_mode_prepare+0x8d/0x1b0 kernel/entry/common.c:166 syscall_exit_to_user_mode+0x5e/0x1a0 kernel/entry/common.c:241 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Commit 1a3e1f40962c ("mm: memcontrol: decouple reference counting from page accounting") reworked the memcg lifetime to be bound the the struct page rather than charges. It also removed the css_put_many from uncharge_batch and that is causing the above splat. uncharge_batch() is supposed to uncharge accumulated charges for all pages freed from the same memcg. The queuing is done by uncharge_page which however drops the memcg reference after it adds charges to the batch. If the current page happens to be the last one holding the reference for its memcg then the memcg is OK to go and the next page to be freed will trigger batched uncharge which needs to access the memcg which is gone already. Fix the issue by taking a reference for the memcg in the current batch. Fixes: 1a3e1f40962c ("mm: memcontrol: decouple reference counting from page accounting") Reported-by: syzbot+b305848212deec86eabe@syzkaller.appspotmail.com Reported-by: syzbot+b5ea6fb6f139c8b9482b@syzkaller.appspotmail.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Hugh Dickins <hughd@google.com> Link: https://lkml.kernel.org/r/20200820090341.GC5033@dhcp22.suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05Merge tag 'xfs-5.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fix from Darrick Wong: "Fix a broken metadata verifier that would incorrectly validate attr fork extents of a realtime file against the realtime volume" * tag 'xfs-5.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix xfs_bmap_validate_extent_raw when checking attr fork of rt files
2020-09-05xfs: don't update mtime on COW faultsMikulas Patocka
When running in a dax mode, if the user maps a page with MAP_PRIVATE and PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime when the user hits a COW fault. This breaks building of the Linux kernel. How to reproduce: 1. extract the Linux kernel tree on dax-mounted xfs filesystem 2. run make clean 3. run make -j12 4. run make -j12 at step 4, make would incorrectly rebuild the whole kernel (although it was already built in step 3). The reason for the breakage is that almost all object files depend on objtool. When we run objtool, it takes COW page fault on its .data section, and these faults will incorrectly update the timestamp of the objtool binary. The updated timestamp causes make to rebuild the whole tree. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05ext2: don't update mtime on COW faultsMikulas Patocka
When running in a dax mode, if the user maps a page with MAP_PRIVATE and PROT_WRITE, the ext2 filesystem would incorrectly update ctime and mtime when the user hits a COW fault. This breaks building of the Linux kernel. How to reproduce: 1. extract the Linux kernel tree on dax-mounted ext2 filesystem 2. run make clean 3. run make -j12 4. run make -j12 at step 4, make would incorrectly rebuild the whole kernel (although it was already built in step 3). The reason for the breakage is that almost all object files depend on objtool. When we run objtool, it takes COW page fault on its .data section, and these faults will incorrectly update the timestamp of the objtool binary. The updated timestamp causes make to rebuild the whole tree. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>