2020-06-03  ext4: mballoc: refactor ext4_mb_good_group()  (Ritesh Harjani)
ext4_mb_good_group()'s definition was changed some time back, and it now even initializes the buddy cache (via ext4_mb_init_group()) if EXT4_MB_GRP_NEED_INIT() is true for a group. Note that ext4_mb_init_group() may sleep and so must not be called with a spinlock held. This is fine as of now because ext4_mb_good_group() is called before loading the buddy bitmap without ext4_lock_group() held, and then called again after loading the bitmap, this time with ext4_lock_group() held. But this whole arrangement is confusing. So this patch factors out ext4_mb_good_group_nolock(), which should be called without holding ext4_lock_group(). In further patches we also hold the spinlock (ext4_lock_group()) while doing any calculations that involve grp->bb_free or grp->bb_fragments. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/d9f7d031a5fbe1c943fae6bf1ff5cdf0604ae722.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
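For illustration, a rough sketch of the locked/unlocked split described above; the names follow the commit message, but the body is an approximation rather than the verbatim ext4 code:

    static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
                                         ext4_group_t group, int cr)
    {
        struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
        int ret;

        /* May sleep: initialize the buddy cache outside any spinlock. */
        if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
            ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
            if (ret)
                return ret;
        }

        /* The cheap suitability checks can then run under the group lock. */
        ext4_lock_group(ac->ac_sb, group);
        ret = ext4_mb_good_group(ac, group, cr);
        ext4_unlock_group(ac->ac_sb, group);
        return ret;
    }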
2020-06-03  ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling  (Ritesh Harjani)
There could be a race in ext4_mb_discard_group_preallocations() where the 1st thread may iterate through the group's bb_prealloc_list, remove all the PAs, and add them to the function's local list head. Now if the 2nd thread comes in to discard the group preallocations, it will see that group->bb_prealloc_list is empty and will return 0. Consider a case where we have few groups (e.g. just group 0); this may even return an -ENOSPC error from ext4_mb_new_blocks() (where we call ext4_mb_discard_group_preallocations()). But that is wrong: the 2nd thread should have waited for the 1st thread to release all the PAs and then retried the allocation, since the 1st thread was going to discard the PAs anyway. The algorithm using this percpu seq counter is as follows:

1. We sample the percpu discard_pa_seq counter before trying block allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate or free blocks, i.e. while marking those blocks as used/free in mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify that bb_prealloc_list is not empty and hence proceed to discard those PAs inside ext4_mb_discard_group_preallocations().

Now, to make sure that the regular fast path of block allocation is not affected, as a small optimization we only sample the percpu seq counter on the local cpu. Only when the block allocation fails and no freed blocks were found do we sample the percpu seq counter across all cpus, using the function ext4_get_discard_pa_seq_sum(). This happens after making sure that all the PAs on grp->bb_prealloc_list got freed, or that the list is empty. One could ask why we don't just check grp->bb_free to see if there are any free blocks to be allocated. Two concerns were discussed:

1. If for some reason the blocks available in the group are not appropriate for the allocation logic (say e.g. EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then the retry logic may result in infinite looping, since grp->bb_free is non-zero.
2. Before preallocation was clubbed with block allocation under the same ext4_lock_group(), there were a lot of races where grp->bb_free could not be reliably relied upon.

Due to the above, this patch uses the discard_pa_seq logic to determine whether we should retry block allocation. Say there are n threads trying block allocation and none of them could allocate or discard any blocks; then all n threads will fail the block allocation and return -ENOSPC (since the seq counters for all of them will match, as no block allocation/discard was done during that interval). Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
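A minimal sketch of the counter plumbing described above (the names mirror the commit message; the details are illustrative):

    static DEFINE_PER_CPU(u64, discard_pa_seq);

    /*
     * Summed across all CPUs only on the slow (about-to-return-ENOSPC)
     * path, so the allocation fast path touches only the local counter.
     */
    static inline u64 ext4_get_discard_pa_seq_sum(void)
    {
        int cpu;
        u64 seq = 0;

        for_each_possible_cpu(cpu)
            seq += per_cpu(discard_pa_seq, cpu);
        return seq;
    }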
2020-06-03  ext4: mballoc: refactor ext4_mb_discard_preallocations()  (Ritesh Harjani)
Implement ext4_mb_discard_preallocations_should_retry(), which we will need in later patches to add more logic, such as checking for a sequence number match, to decide whether to retry block allocation. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/1cfae0098d2aa9afbeb59331401258182868c8f2.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
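The refactored helper presumably starts out as a thin wrapper, roughly like this (a sketch under that assumption, not the exact patch):

    static bool
    ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
                                        struct ext4_allocation_context *ac)
    {
        int freed;

        /* Retry the allocation only if discarding actually freed something. */
        freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
        return freed > 0;
    }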
2020-06-03  ext4: mballoc: add blocks to PA list under same spinlock after allocating blocks  (Ritesh Harjani)
ext4_mb_discard_preallocations() only checks every group's grp->bb_prealloc_list when discarding the group's PAs to free up space after an allocation request fails. Consider the race below:

    Process A                                Process B
    1. allocate blocks
                                             1. Fails block allocation from
                                                ext4_mb_regular_allocator()
       ext4_lock_group()
       allocated blocks more than
       ac_o_ex.fe_len
       ext4_unlock_group()
                                             2. Scans the grp->bb_prealloc_list
                                                (under ext4_lock_group()),
                                                finds nothing and thus
                                                returns -ENOSPC.
    2. Add the additional blocks to PA list
       ext4_lock_group()
       add blocks to grp->bb_prealloc_list
       ext4_unlock_group()

The above race can be avoided if we add those additional blocks to grp->bb_prealloc_list at the same time as the block allocation, while ext4_lock_group() is still held. With this, discard-PA will know whether there are actually any blocks that could be freed from the PA. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/a2217dd782585b42328981832e6d396abaaccb80.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: add casefold flag to EXT4_INODE_* flags  (Eric Biggers)
No one currently needs EXT4_INODE_CASEFOLD, but add it to keep the EXT4_INODE_* definitions in sync with the EXT4_*_FL definitions. Also make it clearer that the casefold flag is only for directories. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20200510215252.87833-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: rework map struct instantiation in ext4_ext_map_blocks()  (Eric Whitney)
The path performing block allocations in ext4_ext_map_blocks() contains code trimming the length of a new extent that is repeated later in the function. This code is both redundant and unnecessary as the exact length of the new extent has already been calculated. Rewrite the instantiation of the map struct in this case to use the available values, avoiding the overhead of unnecessary conversions and improving clarity. Add another map struct instantiation tailored specifically to the separate case for an existing written extent. Remove an old comment that no longer appears applicable to the current code. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20200510155805.18808-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2020-06-03  ext4: make ext_debug() implementation to use pr_debug()  (Ritesh Harjani)
ext_debug() messages can be helpful, provided they can be enabled without recompiling the kernel, and provided we can selectively enable only the prints required for case-by-case debugging. So make the ext_debug() implementation use pr_debug(), and change ext_debug() to be defined with CONFIG_EXT4_DEBUG. The EXT_DEBUG macro now mostly remains for the three functions ext4_ext_show_path/leaf/move() (whose print messages use ext_debug(), which again can be dynamically enabled using pr_debug()). This also changes ext_debug() to take the inode as a parameter, so the inode number is added to all of its messages, and prints additional info like process name/pid, superblock id, etc. It also removes any explicit function names passed to ext_debug(), since ext_debug() on its own prints file, function, and line number. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/d31dc189b0aeda9384fe7665e36da7cd8c61571f.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
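A sketch of what such a pr_debug()-backed macro can look like (assumed shape, not the verbatim ext4 definition):

    #ifdef CONFIG_EXT4_DEBUG
    #define ext_debug(inode, fmt, ...)                                      \
            pr_debug("[%s/%d] EXT4-fs (%s): ino %lu: (%s, %d): %s: " fmt,   \
                     current->comm, task_pid_nr(current),                   \
                     (inode)->i_sb->s_id, (inode)->i_ino,                   \
                     __FILE__, __LINE__, __func__, ##__VA_ARGS__)
    #else
    #define ext_debug(inode, fmt, ...)      no_printk(fmt, ##__VA_ARGS__)
    #endif

With dynamic debug, individual call sites can then be enabled at runtime, e.g. echo 'file extents.c +p' > /sys/kernel/debug/dynamic_debug/control.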
2020-06-03  ext4: mballoc: make mb_debug() implementation to use pr_debug()  (Ritesh Harjani)
mb_debug() messages had only one control level for all types of messages, and if mballoc_debug was enabled then all of those messages were printed. Instead of adding multiple debug levels for mb_debug() messages, use pr_debug(), which gives finer control to print messages at different levels (i.e. at file, function, or line number granularity). Also add process name/pid, superblock id, and other info to mb_debug() messages. This also kills the mballoc_debug module parameter, since it is no longer needed. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/f0c660cbde9e2edbe95c67942ca9ad80dd2231eb.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: replace EXT_DEBUG with __maybe_unused in ext4_ext_handle_unwritten_extents()  (Ritesh Harjani)
Replace EXT_DEBUG with __maybe_unused inside the ext4_ext_handle_unwritten_extents() function. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/ae335b94506cd9db9d2648c1f4dd25a80f9f3ce2.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: improve ext_debug() msg in case of block allocation failure  (Ritesh Harjani)
ext4_map_blocks() prints an ext_debug message early at the start of the function, and we also get an ext_debug message when a block allocation from ext4_ext_map_blocks() succeeds. But there is no ext_debug() message in case of block allocation failure, so add one along with the error code. Also add more info to the ext_debug() message in ext4_ext_map_blocks(), such as how many blocks were allocated vs. how many were requested. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/1610ec2aa932396be00f9d552fe29da473ead176.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: use BIT() macro for BH_** state bits  (Ritesh Harjani)
Simply use the BIT() macro for all BH_* state bits instead of open-coding them. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/57667689f51a3f9dba2fcef7d3425187fa3ba69f.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
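For example, the change amounts to replacing open-coded shifts with the BIT() helper from <linux/bits.h> (illustrative, using one of ext4's BH-based map flags):

    /* before */
    #define EXT4_MAP_NEW    (1 << BH_New)
    /* after */
    #define EXT4_MAP_NEW    BIT(BH_New)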
2020-06-03  ext4: balloc: use task_pid_nr() helper  (Ritesh Harjani)
Use the task_pid_nr() helper instead of current->pid. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/4b58403e15e9c8deb34a1b93deb3fc9cd153ab84.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: mballoc: fix possible NULL ptr & remove BUG_ONs from DOUBLE_CHECK  (Ritesh Harjani)
Make sure to check for e4b->bd_info->bb_bitmap == NULL in mb_cmp_bitmaps() and return if it is NULL, to avoid a possible NULL pointer dereference, similar to how we do this in the other ifdef DOUBLE_CHECK functions. Also remove the BUG_ON() logic for when kmalloc() or ext4_read_block_bitmap() fails: we should simply mark grp->bb_bitmap as NULL if either happens. In fact ext4_read_block_bitmap() may even return an error in case of a resize ioctl, so remove this BUG_ON logic (fstests ext4/032 may trigger it). Link: https://lore.kernel.org/r/9a54f8a696ff17c057cd571be3d15ac3ec1407f1.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: mballoc: refactor code inside DOUBLE_CHECK into separate function  (Ritesh Harjani)
This patch implements mb_group_bb_bitmap_alloc() and mb_group_bb_bitmap_free() functions to remove the #ifdef DOUBLE_CHECK macro and its related code from inside ext4_mb_add_groupinfo()/ext4_mb_release(). There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/8c2095d74b779f0254a19b24982490dc6f07c4f9.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
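Roughly, the refactor gives the call sites a flat alloc/free pair while the #ifdef moves into the helpers. A sketch of the assumed shape, combined with the NULL handling from the BUG_ON-removal commit above (the real helpers live behind DOUBLE_CHECK in mballoc.c):

    #ifdef DOUBLE_CHECK
    static void mb_group_bb_bitmap_alloc(struct super_block *sb,
                                         struct ext4_group_info *grp,
                                         ext4_group_t group)
    {
        struct buffer_head *bh;

        grp->bb_bitmap = kmalloc(sb->s_blocksize, GFP_NOFS);
        if (!grp->bb_bitmap)
            return;

        bh = ext4_read_block_bitmap(sb, group);
        if (IS_ERR_OR_NULL(bh)) {
            kfree(grp->bb_bitmap);
            grp->bb_bitmap = NULL;   /* checked before use, no BUG_ON */
            return;
        }
        memcpy(grp->bb_bitmap, bh->b_data, sb->s_blocksize);
        put_bh(bh);
    }

    static void mb_group_bb_bitmap_free(struct ext4_group_info *grp)
    {
        kfree(grp->bb_bitmap);
    }
    #else
    static inline void mb_group_bb_bitmap_alloc(struct super_block *sb,
                struct ext4_group_info *grp, ext4_group_t group) {}
    static inline void mb_group_bb_bitmap_free(struct ext4_group_info *grp) {}
    #endif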
2020-06-03  ext4: mballoc: make ext4_mb_use_preallocated() return type as bool  (Ritesh Harjani)
Change the return type of ext4_mb_use_preallocated() to bool to better reflect what the function can return. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/7880cb6ef911465beafefcd7e9c3ea214688744b.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: mballoc: simplify error handling in ext4_init_mballoc()  (Ritesh Harjani)
This patch simplifies the error handling logic in ext4_init_mballoc() by collecting all the cleanups in one place at the end of the function. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/8621a7bc68f7107a9ac4292afeb784515333bd25.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
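The consolidation follows the standard kernel goto-cleanup pattern; a sketch along those lines (label names assumed, cache names as in mballoc.c):

    int __init ext4_init_mballoc(void)
    {
        ext4_pspace_cachep = KMEM_CACHE(ext4_prealloc_space,
                                        SLAB_RECLAIM_ACCOUNT);
        if (ext4_pspace_cachep == NULL)
            goto out;

        ext4_ac_cachep = KMEM_CACHE(ext4_allocation_context,
                                    SLAB_RECLAIM_ACCOUNT);
        if (ext4_ac_cachep == NULL)
            goto out_pa_free;

        ext4_free_data_cachep = KMEM_CACHE(ext4_free_data,
                                           SLAB_RECLAIM_ACCOUNT);
        if (ext4_free_data_cachep == NULL)
            goto out_ac_free;

        return 0;

    out_ac_free:
        kmem_cache_destroy(ext4_ac_cachep);
    out_pa_free:
        kmem_cache_destroy(ext4_pspace_cachep);
    out:
        return -ENOMEM;
    }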
2020-06-03  ext4: mballoc: fix few other format specifier in mb_debug()  (Ritesh Harjani)
Fix a few other format specifiers in mb_debug() messages. As such, no other functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/574fa7f833abf2dbf3b53a2fea3195e71f6cdbd8.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: mballoc: correct the mb_debug() format specifier for pa_len var  (Ritesh Harjani)
pa->pa_len is an integer. Fix all of the format specifiers used for pa_len in mb_debug() to %d instead of %u. As such, no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/af4987f643c586f62bcc9961e43f0a67151d5551.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: mballoc: add more mb_debug() msgs  (Ritesh Harjani)
This patch adds some more debugging mb_debug() messages to help improve mballoc code debugging. Other than adding mb_debug() messages in a few more places, there should be no other functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/5fc8e7788b924e211fcfa4a4c1d2f8503511661a.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: mballoc: refactor ext4_mb_show_ac()  (Ritesh Harjani)
This factors out an ext4_mb_show_pa() function to show all of the group's preallocation info, which could be useful in later patches. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/8f07d890b0038dcc935e9c10e6043ec9f3792721.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: mballoc: print bb_free info even when it is 0  (Ritesh Harjani)
Improve the debugging message by printing the bb_free info even when it is 0. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/c894f1d1d30f86ae38f4e3a861949665b6dc61cd.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: avoid ext4_error()'s caused by ENOMEM in the truncate path  (Theodore Ts'o)
We can't fail in the truncate path without requiring an fsck afterwards. Work around this by using a combination of retry loops and the __GFP_NOFAIL flag. From: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Anna Pendleton <pendleton@google.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20200507175028.15061-1-pendleton@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
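One plausible shape of the workaround; the wrapper name below is an assumption for illustration, and the real patch adjusts several extents.c call sites:

    /*
     * On the truncate path an allocation failure cannot be surfaced to
     * the caller without leaving the filesystem inconsistent, so retry
     * transient -ENOMEM failures instead of erroring out.
     */
    static struct buffer_head *
    ext4_read_extent_block_nofail(struct inode *inode, ext4_fsblk_t pblk,
                                  int depth, int flags)
    {
        struct buffer_head *bh;

        do {
            bh = read_extent_tree_block(inode, pblk, depth, flags);
        } while (IS_ERR(bh) && PTR_ERR(bh) == -ENOMEM);
        return bh;
    }

Memory allocations on the same path can similarly pass GFP_NOFS | __GFP_NOFAIL so the allocator itself never fails.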
2020-06-03  ext4: fix race between ext4_sync_parent() and rename()  (Eric Biggers)
'igrab(d_inode(dentry->d_parent))' without holding dentry->d_lock is broken because without d_lock, d_parent can be concurrently changed due to a rename(). Then if the old directory is immediately deleted, the old d_parent's inode can be NULL. That causes a NULL dereference in igrab(). To fix this, use dget_parent() to safely grab a reference to the parent dentry, which pins the inode. This also eliminates the need to use d_find_any_alias() other than for the initial inode, as we no longer throw away the dentry at each step. This is an extremely hard race to hit, but it is possible. Adding a udelay() between the reads of ->d_parent and its ->d_inode makes it reproducible on a no-journal filesystem using the following program:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main()
    {
        if (fork()) {
            for (;;) {
                mkdir("dir1", 0700);
                int fd = open("dir1/file", O_RDWR|O_CREAT|O_SYNC);
                write(fd, "X", 1);
                close(fd);
            }
        } else {
            mkdir("dir2", 0700);
            for (;;) {
                rename("dir1/file", "dir2/file");
                rmdir("dir1");
            }
        }
    }

Fixes: d59729f4e794 ("ext4: fix races in ext4_sync_parent()") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20200506183140.541194-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: fix a typo in a comment  (Christophe JAILLET)
s/extnets/extents/ Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/20200503200647.154701-1-christophe.jaillet@wanadoo.fr Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: clean up ext4_ext_convert_to_initialized() error handling  (Eric Whitney)
If ext4_ext_convert_to_initialized() fails when called within ext4_ext_handle_unwritten_extents(), immediately error out through the exit point at function end. Fix the error handling in the event ext4_ext_convert_to_initialized() returns 0, which it shouldn't do when converting an existing extent. The current code returns the passed in value of allocated (which is likely non-zero) while failing to set m_flags, m_pblk, and m_len. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20200430185320.23001-5-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: clean up GET_BLOCKS_PRE_IO error handling  (Eric Whitney)
If the call to ext4_split_convert_extents() fails in the EXT4_GET_BLOCKS_PRE_IO case within ext4_ext_handle_unwritten_extents(), error out through the exit point at function end rather than jumping through an intermediate point. Fix the error handling in the event ext4_split_convert_extents() returns 0, which it shouldn't do when splitting an existing extent. The current code returns the passed in value of allocated (which is likely non-zero) while failing to set m_flags, m_pblk, and m_len. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20200430185320.23001-4-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: remove redundant GET_BLOCKS_CONVERT code  (Eric Whitney)
Remove the redundant code assigning values to ext4_map_blocks components in ext4_ext_handle_unwritten_extents() for the EXT4_GET_BLOCKS_CONVERT case, using the code at the function exit instead. Clean up and reorder that code to eliminate more redundancy and improve readability. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20200430185320.23001-3-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: remove dead GET_BLOCKS_ZERO code  (Eric Whitney)
There's no call to ext4_map_blocks() in the current ext4 code with a flags argument that combines EXT4_GET_BLOCKS_CONVERT and EXT4_GET_BLOCKS_ZERO. Remove the code that corresponds to this case from ext4_ext_handle_unwritten_extents(). Signed-off-by: Eric Whitney <enwlinux@gmail.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200430185320.23001-2-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: don't ignore return values from ext4_ext_dirty()  (Harshad Shirwadkar)
Don't ignore return values from ext4_ext_dirty(), since the errors indicate valid failures below ext4. In all of the other instances of ext4_ext_dirty() calls, the error return value is handled in some way. This patch makes the remaining couple of places handle ext4_ext_dirty() errors as well. In the case of ext4_split_extent_at(), ignoring the return value is intentional: we are already in an error path and there isn't much we can do if ext4_ext_dirty() returns an error. This patch adds a comment for that case explaining why we ignore the return value. In the longer run, we probably should make sure that errors from other mark_dirty routines are handled as well. Ran gce-xfstests smoke tests and verified that there were no regressions. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200427013438.219117-2-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: handle ext4_mark_inode_dirty errors  (Harshad Shirwadkar)
ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return value may lead ext4 to ignore real failures that would result in corruption or crashes. Harden the ext4_mark_inode_dirty() error paths to fail as soon as possible and return errors to the caller whenever appropriate. One possible scenario where this bug could have an effect: while creating a new inode, its directory entry gets added successfully, but while writing the inode itself, mark_inode_dirty returns an error which is ignored. This results in an inconsistency where the directory entry points to a non-existent inode. Ran gce-xfstests smoke tests and verified that there were no regressions. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
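The hardening pattern is essentially the following (a sketch; the function below is illustrative, not a specific ext4 call site):

    static int example_update_inode(handle_t *handle, struct inode *inode)
    {
        int err;

        inode->i_mtime = inode->i_ctime = current_time(inode);

        /* Propagate the error instead of silently dropping it. */
        err = ext4_mark_inode_dirty(handle, inode);
        if (unlikely(err))
            return err;

        return 0;
    }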
2020-06-03  ext4: fix error pointer dereference  (Jeffle Xu)
Don't pass error pointers to brelse(). commit 7159a986b420 ("ext4: fix some error pointer dereferences") fixed some cases; fix the remaining one. Once ext4_xattr_block_find()->ext4_sb_bread() fails, an error pointer is stored in @bs->bh, which will be passed to brelse() in the cleanup routine of ext4_xattr_set_handle(). This then causes a NULL pointer crash in __brelse():

    BUG: unable to handle kernel NULL pointer dereference at 000000000000005b
    RIP: 0010:__brelse+0x1b/0x50
    Call Trace:
     ext4_xattr_set_handle+0x163/0x5d0
     ext4_xattr_set+0x95/0x110
     __vfs_setxattr+0x6b/0x80
     __vfs_setxattr_noperm+0x68/0x1b0
     vfs_setxattr+0xa0/0xb0
     setxattr+0x12c/0x1a0
     path_setxattr+0x8d/0xc0
     __x64_sys_setxattr+0x27/0x30
     do_syscall_64+0x60/0x250
     entry_SYSCALL_64_after_hwframe+0x49/0xbe

In this case, @bs->bh actually stores -EIO. Fixes: fb265c9cb49e ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases") Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: stable@kernel.org # 2.6.19 Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1587628004-95123-1-git-send-email-jefflexu@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
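The fix presumably follows the usual ERR_PTR discipline, roughly (a sketch; field names follow the commit message):

    bs->bh = ext4_sb_bread(sb, block, 0);
    if (IS_ERR(bs->bh)) {
        error = PTR_ERR(bs->bh);
        bs->bh = NULL;   /* keep the shared cleanup's brelse() safe */
        goto cleanup;
    }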
2020-06-03  ext4: Avoid freeing inodes on dirty list  (Jan Kara)
When we are evicting an inode with journalled data, we may race with transaction commit in the following way:

    CPU0                                      CPU1
    jbd2_journal_commit_transaction()
                                              evict(inode)
                                                inode_io_list_del()
                                                inode_wait_for_writeback()
      process BJ_Forget list
        __jbd2_journal_insert_checkpoint()
        __jbd2_journal_refile_buffer()
          __jbd2_journal_unfile_buffer()
            if (test_clear_buffer_jbddirty(bh))
              mark_buffer_dirty(bh)
                __mark_inode_dirty(inode)
                                                ext4_evict_inode(inode)
                                                  frees the inode

This results in use-after-free issues in the writeback code (or the assertion added in the previous commit triggering). Fix the problem by removing the inode from writeback lists once all the page cache is evicted, so the inode cannot be added to writeback lists again. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200421085445.5731-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  writeback: Export inode_io_list_del()  (Jan Kara)
Ext4 needs to remove an inode from the writeback lists after it is out of the visibility of its journalling machinery (which can still dirty the inode). Export inode_io_list_del() for it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200421085445.5731-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: fix buffer_head refcnt leak when ext4_iget() fails  (Xiyu Yang)
ext4_orphan_get() invokes ext4_read_inode_bitmap(), which returns a reference to the specified buffer_head object in "bitmap_bh" with an increased refcount. When ext4_orphan_get() returns, the local variable "bitmap_bh" becomes invalid, so the refcount should be decreased to keep the refcount balanced. The reference counting issue happens in one exception handling path of ext4_orphan_get(): when ext4_iget() fails, the function forgets to decrease the refcnt increased by ext4_read_inode_bitmap(), causing a refcnt leak. Fix this issue by calling brelse() when ext4_iget() fails. Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/1587618568-13418-1-git-send-email-xiyuyang19@fudan.edu.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
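In code, the missing release is on the ext4_iget() error path, roughly (a sketch; variable names follow the commit message):

    inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
    if (IS_ERR(inode)) {
        err = PTR_ERR(inode);
        /* Drop the reference taken by ext4_read_inode_bitmap(). */
        brelse(bitmap_bh);
        goto error;
    }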
2020-06-03  ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max  (Harshad Shirwadkar)
If eh->eh_max is 0, EXT_MAX_EXTENT/INDEX would evaluate to an unsigned (-1), resulting in illegal memory accesses. Although there is no consistent reproducer, we see that generic/019 sometimes crashes because of this bug. Ran gce-xfstests smoke and verified that there were no regressions. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20200421023959.20879-2-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
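A guarded version of the macro captures the idea (a sketch modeled on the commit description; compare fs/ext4/ext4_extents.h for the real definitions):

    /*
     * With eh_max == 0 the old computation was FIRST + 0 - 1, which the
     * unsigned arithmetic turns into a bogus bound; return 0 instead so
     * callers' loop bounds fail safely.
     */
    #define EXT_MAX_EXTENT(__hdr__)                                 \
            (le16_to_cpu((__hdr__)->eh_max) ?                       \
             (EXT_FIRST_EXTENT((__hdr__)) +                         \
              le16_to_cpu((__hdr__)->eh_max) - 1) : 0)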
2020-06-03  ext4: remove unnecessary comparisons to bool  (Jason Yan)
Fix the following coccicheck warnings:

    fs/ext4/extents_status.c:1057:5-28: WARNING: Comparison to bool
    fs/ext4/inode.c:2314:18-24: WARNING: Comparison to bool

Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200420042918.19459-1-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: translate a few more map flags to strings in tracepoints  (Eric Whitney)
As new ext4_map_blocks() flags have been added, not all of them have gotten flag-bit-to-string translations to make tracepoint output more readable. Fix that, and go one step further by adding a translation for the EXT4_EX_NOCACHE flag as well. The EXT4_EX_FORCE_CACHE flag can never be set in a tracepoint in the current code, so there's no need to bother with a translation for it right now. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200415203140.30349-3-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: remove EXT4_GET_BLOCKS_KEEP_SIZE flag  (Eric Whitney)
The eofblocks code was removed in the 5.7 release by "ext4: remove EOFBLOCKS_FL and associated code" (4337ecd1fe99). The ext4_map_blocks() flag used to trigger it can now be removed as well. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200415203140.30349-2-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  ext4: fix a style issue in fs/ext4/acl.c  (Carlos Guerrero Álvarez)
Fixed an if statement where braces were not needed. Link: https://lore.kernel.org/r/20200416141456.1089-1-carlosteniswarrior@gmail.com Signed-off-by: Carlos Guerrero Álvarez <carlosteniswarrior@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2020-06-03  arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined  (Zong Li)
Extract DEBUG_WX to mm/Kconfig.debug for shared use. Change to use ARCH_HAS_DEBUG_WX instead of DEBUG_WX defined by arch port. Signed-off-by: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/e19709e7576f65e303245fe520cad5f7bae72763.1587455584.git.zong.li@sifive.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined  (Zong Li)
Extract DEBUG_WX to mm/Kconfig.debug for shared use. Change to use ARCH_HAS_DEBUG_WX instead of DEBUG_WX defined by arch port. Signed-off-by: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/430736828d149df3f5b462d291e845ec690e0141.1587455584.git.zong.li@sifive.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  riscv: support DEBUG_WX  (Zong Li)
Support DEBUG_WX to check whether there are mappings with both write and execute permission at the same time. [akpm@linux-foundation.org: replace macros with C] Signed-off-by: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/282e266311bced080bc6f7c255b92f87c1eb65d6.1587455584.git.zong.li@sifive.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: add DEBUG_WX support  (Zong Li)
Patch series "Extract DEBUG_WX to shared use". Some architectures support the DEBUG_WX function; the code is essentially a verbatim copy across them, so extract it to mm/Kconfig.debug for shared use. The PPC and ARM ports don't support the generic page dumper yet, so we only refine the x86 and arm64 ports in this patch series. For the RISC-V port, the DEBUG_WX support depends on other patches which have already been merged:

  - RISC-V page table dumper
  - Support strict kernel memory permissions for security

This patch (of 4): Some architectures support the DEBUG_WX function, and the code is verbatim across them. Extract it to mm/Kconfig.debug for shared use. [akpm@linux-foundation.org: reword text, per Will Deacon & Zong Li] Link: http://lkml.kernel.org/r/20200427194245.oxRJKj3fn%25akpm@linux-foundation.org [zong.li@sifive.com: remove the specific name of arm64] Link: http://lkml.kernel.org/r/3a6a92ecedc54e1d0fc941398e63d504c2cd5611.1589178399.git.zong.li@sifive.com [zong.li@sifive.com: add MMU dependency for DEBUG_WX] Link: http://lkml.kernel.org/r/4a674ac7863ff39ca91847b10e51209771f99416.1589178399.git.zong.li@sifive.com Suggested-by: Palmer Dabbelt <palmer@dabbelt.com> Signed-off-by: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/cover.1587455584.git.zong.li@sifive.com Link: http://lkml.kernel.org/r/23980cd0f0e5d79e24a92169116407c75bcc650d.1587455584.git.zong.li@sifive.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup  (Scott Cheloha)
Searching for a particular memory block by id is an O(n) operation because each memory block's underlying device is kept in an unsorted linked list on the subsystem bus. We can cut the lookup cost to O(log n) if we cache each memory block in an xarray. This time complexity improvement is significant on systems with many memory blocks. For example:

1. A 128GB POWER9 VM with 256MB memblocks has 512 blocks. With this change memory_dev_init() completes ~12ms faster and walk_memory_blocks() completes ~12ms faster.

    Before:
    [    0.005042] memory_dev_init: adding memory blocks
    [    0.021591] memory_dev_init: added memory blocks
    [    0.022699] walk_memory_blocks: walking memory blocks
    [    0.038730] walk_memory_blocks: walked memory blocks 0-511

    After:
    [    0.005057] memory_dev_init: adding memory blocks
    [    0.009415] memory_dev_init: added memory blocks
    [    0.010519] walk_memory_blocks: walking memory blocks
    [    0.014135] walk_memory_blocks: walked memory blocks 0-511

2. A 256GB POWER9 LPAR with 256MB memblocks has 1024 blocks. With this change memory_dev_init() completes ~88ms faster and walk_memory_blocks() completes ~87ms faster.

    Before:
    [    0.252246] memory_dev_init: adding memory blocks
    [    0.395469] memory_dev_init: added memory blocks
    [    0.409413] walk_memory_blocks: walking memory blocks
    [    0.433028] walk_memory_blocks: walked memory blocks 0-511
    [    0.433094] walk_memory_blocks: walking memory blocks
    [    0.500244] walk_memory_blocks: walked memory blocks 131072-131583

    After:
    [    0.245063] memory_dev_init: adding memory blocks
    [    0.299539] memory_dev_init: added memory blocks
    [    0.313609] walk_memory_blocks: walking memory blocks
    [    0.315287] walk_memory_blocks: walked memory blocks 0-511
    [    0.315349] walk_memory_blocks: walking memory blocks
    [    0.316988] walk_memory_blocks: walked memory blocks 131072-131583

3. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks. With this change we complete memory_dev_init() ~37 minutes faster and walk_memory_blocks() at least ~30 minutes faster. The exact timing for walk_memory_blocks() is missing, though I observed that the soft lockups in walk_memory_blocks() disappeared with the change, suggesting that lower bound.

    Before:
    [   13.703907] memory_dev_init: adding blocks
    [ 2287.406099] memory_dev_init: added all blocks
    [ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
    [ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
    [ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
    [ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
    [ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
    [ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
    [ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
    [ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
    [ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
    [ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
    [ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160

    After:
    [   13.696898] memory_dev_init: adding blocks
    [   15.660035] memory_dev_init: added all blocks
    (the walk_memory_blocks traces disappear)

There should be no significant negative impact for machines with few memory blocks. A sparse xarray has a small footprint, and an O(log n) lookup is negligibly slower than an O(n) lookup for only the smallest number of memory blocks.

1. A 16GB x86 machine with 128MB memblocks has 132 blocks. With this change memory_dev_init() completes ~300us faster and walk_memory_blocks() completes no faster or slower. The improvement is pretty close to noise.

    Before:
    [    0.224752] memory_dev_init: adding memory blocks
    [    0.227116] memory_dev_init: added memory blocks
    [    0.227183] walk_memory_blocks: walking memory blocks
    [    0.227183] walk_memory_blocks: walked memory blocks 0-131

    After:
    [    0.224911] memory_dev_init: adding memory blocks
    [    0.226935] memory_dev_init: added memory blocks
    [    0.227089] walk_memory_blocks: walking memory blocks
    [    0.227089] walk_memory_blocks: walked memory blocks 0-131

[david@redhat.com: document the locking] Link: http://lkml.kernel.org/r/bc21eec6-7251-4c91-2f57-9a0671f8d414@redhat.com Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Nathan Lynch <nathanl@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rick Lindsley <ricklind@linux.vnet.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Link: http://lkml.kernel.org/r/20200121231028.13699-1-cheloha@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
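A condensed sketch of the cache (the xarray API calls are real; the surrounding structure is simplified from drivers/base/memory.c):

    static DEFINE_XARRAY(memory_blocks);   /* indexed by memory block id */

    /* O(log n) instead of walking the subsystem bus's linked list. */
    static struct memory_block *find_memory_block_by_id(unsigned long block_id)
    {
        return xa_load(&memory_blocks, block_id);
    }

    static int memory_block_cache_add(struct memory_block *mem)
    {
        /* Returns -EBUSY if the id is already present. */
        return xa_insert(&memory_blocks, mem->dev.id, mem, GFP_KERNEL);
    }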
2020-06-03  mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()  (Anshuman Khandual)
pmd_present() is expected to test positive after pmdp_mknotpresent(), as the PMD entry still points to a valid huge page in memory. pmdp_mknotpresent() implies that the given PMD entry is just invalidated from the MMU's perspective while still holding on to the valid huge page referred to by pmd_page(), thus also clearing the pmd_present() test. This creates the following counter-intuitive situation:

    [pmd_present(pmd_mknotpresent(pmd)) = true]

This renames pmd_mknotpresent() as pmd_mkinvalid(), reflecting the helper's functionality more accurately, while changing the above-mentioned situation as follows. This does not create any functional change.

    [pmd_present(pmd_mkinvalid(pmd)) = true]

This is not applicable for platforms that define their own pmdp_invalidate() via __HAVE_ARCH_PMDP_INVALIDATE. The suggestion for renaming came during a previous discussion here: https://patchwork.kernel.org/patch/11019637/ [anshuman.khandual@arm.com: change pmd_mknotvalid() to pmd_mkinvalid() per Will] Link: http://lkml.kernel.org/r/1587520326-10099-3-git-send-email-anshuman.khandual@arm.com Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Will Deacon <will@kernel.org> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Link: http://lkml.kernel.org/r/1584680057-13753-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  powerpc/mm: drop platform defined pmd_mknotpresent()  (Anshuman Khandual)
Patch series "mm/thp: Rename pmd_mknotpresent() as pmd_mknotvalid()", v2. This series renames pmd_mknotpresent() as pmd_mknotvalid(). Before that, it drops an existing pmd_mknotpresent() definition from the powerpc platform, which was never required, as powerpc defines its pmdp_invalidate() by subscribing __HAVE_ARCH_PMDP_INVALIDATE. This does not create any functional change. This rename was suggested by Catalin during a previous discussion, while we were trying to change the THP helpers on the arm64 platform for migration: https://patchwork.kernel.org/patch/11019637/ This patch (of 2): A platform needs to define pmd_mknotpresent() for the generic pmdp_invalidate() only when __HAVE_ARCH_PMDP_INVALIDATE is not subscribed. Otherwise a platform-specific pmd_mknotpresent() is not required. Hence just drop it. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1587520326-10099-1-git-send-email-anshuman.khandual@arm.com Link: http://lkml.kernel.org/r/1584680057-13753-1-git-send-email-anshuman.khandual@arm.com Link: http://lkml.kernel.org/r/1584680057-13753-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: thp: don't need to drain lru cache when splitting and mlocking THP  (Yang Shi)
Since commit 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival"), a THP no longer stays in the pagevec. So the optimization made by commit d965432234db ("thp: increase split_huge_page() success rate"), which tries to unpin munlocked THPs from the pagevec by draining it, doesn't make sense anymore. Draining the lru cache before isolating a THP in the mlock path is also unnecessary: b676b293fb48 ("mm, thp: fix mapped pages avoiding unevictable list on mlock") added it, and 9a73f61bdb8a ("thp, mlock: do not mlock PTE-mapped file huge pages") accidentally carried it over after the above optimization went in. Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Link: http://lkml.kernel.org/r/1585946493-7531-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs  (Shijie Hu)
In a 32-bit program running on the arm64 architecture, when the address space below the mmap base is completely exhausted, shmat() for huge pages returns ENOMEM, but shmat() for normal pages can still succeed in no-legacy mode. This seems unfair. For normal pages, the calling trace of get_unmapped_area() is:

    => mm->get_unmapped_area()
       if on legacy mode,
         => arch_get_unmapped_area()
            => vm_unmapped_area()
       if on no-legacy mode,
         => arch_get_unmapped_area_topdown()
            => vm_unmapped_area()

For huge pages, the calling trace of get_unmapped_area() is:

    => file->f_op->get_unmapped_area()
       => hugetlb_get_unmapped_area()
          => vm_unmapped_area()

To solve this issue, we only need to make hugetlb_get_unmapped_area() behave the same way as mm->get_unmapped_area(). Add *bottomup() and *topdown() variants for hugetlbfs, and check the current mm->get_unmapped_area() to decide which one to use: if mm->get_unmapped_area is equal to arch_get_unmapped_area_topdown(), hugetlb_get_unmapped_area() calls the topdown routine, otherwise it calls the bottomup routine. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Shijie Hu <hushijie3@huawei.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Will Deacon <will@kernel.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: yangerkun <yangerkun@huawei.com> Cc: ChenGang <cg.chen@huawei.com> Cc: Chen Jie <chenjie6@huawei.com> Link: http://lkml.kernel.org/r/20200518065338.113664-1-hushijie3@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
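A sketch of the dispatch the commit describes (function shapes follow the commit message; size/alignment checks are elided):

    static unsigned long
    hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
                              unsigned long len, unsigned long pgoff,
                              unsigned long flags)
    {
        struct mm_struct *mm = current->mm;

        /* Mirror the policy mm->get_unmapped_area() would apply. */
        if (mm->get_unmapped_area == arch_get_unmapped_area_topdown)
            return hugetlb_get_unmapped_area_topdown(file, addr, len,
                                                     pgoff, flags);
        return hugetlb_get_unmapped_area_bottomup(file, addr, len,
                                                  pgoff, flags);
    }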
2020-06-03  sparc32: register memory occupied by kernel as memblock.memory  (Mike Rapoport)
sparc32 never registered the memory occupied by the kernel image with memblock_add(); it only reserved this memory with memblock_reserve(). With openbios as system firmware, the memory occupied by the kernel is reserved in openbios and removed from mem.available. The prom setup code in the kernel uses mem.available to set up the memory banks, so essentially there is a hole for the memory occupied by the kernel image. Later, in bootmem_init(), this memory is memblock_reserve()d. Up until recently, memmap initialization would call __init_single_page() for the pages in that hole, free_low_memory_core_early() would mark them as reserved, and everything would be Ok. After the change in memmap initialization introduced by the commit "mm: memmap_init: iterate over memblock regions rather that check each PFN", the hole is skipped and the page structs for it are not initialized. And when they are passed from memblock to the page allocator as reserved, the latter gets confused. Simply registering the memory occupied by the kernel with memblock_add() resolves this issue. Tested on qemu-system-sparc with Debian Etch [1] userspace. [1] https://people.debian.org/~aurel32/qemu/sparc/debian_etch_sparc_small.qcow2 Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Guenter Roeck <linux@roeck-us.net> Link: https://lkml.kernel.org/r/20200517000050.GA87467@roeck-us.net Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
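The fix is conceptually tiny (a sketch with illustrative symbol names, not the exact sparc32 code):

    static void __init register_kernel_image(phys_addr_t base, phys_addr_t size)
    {
        memblock_add(base, size);       /* make it part of memblock.memory */
        memblock_reserve(base, size);   /* ...but keep it off the free lists */
    }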
2020-06-03  include/linux/memblock.h: fix minor typo and unclear comment  (chenqiwu)
Fix a minor typo ("usabe" -> "usable") in the current description of the member variable "memory" in struct memblock. Also, the member variable "base" in struct memblock_type is currently described as the physical address of the memory region; describing it as the base address of the region is clearer, since the variable is declared as phys_addr_t. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Link: http://lkml.kernel.org/r/1588846952-32166-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>