summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)Author
2021-02-21Merge tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap updates from Darrick Wong: "The big change in this cycle is some new code to make it possible for XFS to try unaligned directio overwrites without taking locks. If the block is fully written and within EOF (i.e. doesn't require any further fs intervention) then we can let the unlocked write proceed. If not, we fall back to synchronizing direct writes. Summary: - Adjust the final parameter of iomap_dio_rw. - Add a new flag to request that iomap directio writes return EAGAIN if the write is not a pure overwrite within EOF; this will be used to reduce lock contention with unaligned direct writes on XFS. - Amend XFS' directio code to eliminate exclusive locking for unaligned direct writes if the circumstances permit" * tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: reduce exclusive locking on unaligned dio xfs: split the unaligned DIO write code out xfs: improve the reflink_bounce_dio_write tracepoint xfs: simplify the read/write tracepoints xfs: remove the buffered I/O fallback assert xfs: cleanup the read/write helper naming xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware xfs: factor out a xfs_ilock_iocb helper iomap: add a IOMAP_DIO_OVERWRITE_ONLY flag iomap: pass a flags argument to iomap_dio_rw iomap: rename the flags variable in __iomap_dio_rw
2021-02-11ext4: add .kunitconfig fragment to enable ext4-specific testsDaniel Latypov
As of [1], we no longer want EXT4_KUNIT_TESTS and others to `select` their deps. This means it can get harder to get all the right things selected as we gain more tests w/ more deps over time. This patch (and [2]) proposes we store kunitconfig fragments in-tree to represent sets of tests. (N.B. right now we only have one ext4 test). There's still a discussion to be had about how to have a hierarchy of these files (e.g. if one wanted to test all of fs/, not just fs/ext4). But this fragment would likely be a leaf node and isn't blocked on deciding if we want `import` statements and the like. Usage ===== Before [2] (on its way to being merged): $ cp fs/ext4/.kunitconfig .kunit/ $ ./tools/testing/kunit/kunit.py run After [2]: $ ./tools/testing/kunit/kunit.py run --kunitconfig=fs/ext4/.kunitconfig ".kunitconfig" vs "kunitconfig" =============================== See also: commit 14ee5cfd4512 ("kunit: Rename 'kunitconfig' to '.kunitconfig'"). * The bit about .gitignore exluding it by default is now a con, however. * But there are a lot of directories with files that begin with "k" and so this could cause some annoyance w/ tab completion* * This is the name kunit.py expects right now, so some people are used to .kunitconfig over "kunitconfig" [1] https://lore.kernel.org/linux-ext4/20210122110234.2825685-1-geert@linux-m68k.org/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/commit/?h=kunit&id=243180f5924ed27ea417db39feb7f9691777688e * 372/5556 directories isn't too much, but still not a small number: $ find -type f -name 'k*' | xargs dirname | sort -u | wc -l 372 Signed-off-by: Daniel Latypov <dlatypov@google.com> Link: https://lore.kernel.org/r/20210210013206.136227-1-dlatypov@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-11ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting itGeert Uytterhoeven
EXT4_KUNIT_TESTS selects EXT4_FS, thus enabling an optional feature the user may not want to enable. Fix this by making the test depend on EXT4_FS instead. Fixes: 1cbeab1b242d16fd ("ext4: add kunit test for decoding extended timestamps") Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20210122110234.2825685-1-geert@linux-m68k.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-08ext4: reset retry counter when ext4_alloc_file_blocks() makes progressEric Whitney
Change the retry policy in ext4_alloc_file_blocks() to allow for a full retry cycle whenever a portion of an allocation request has been fulfilled. A large allocation request often results in multiple calls to ext4_map_blocks(), each of which is potentially subject to a temporary ENOSPC condition and retry cycle. The current code only allows for a single retry cycle. This patch does not address a known bug or reported complaint. However, it should make block allocation for fallocate and zero range more robust. In addition, simplify the conditional controlling the allocation while loop, where testing len alone is sufficient. Remove the assignment to ret2 in the error path after the call to ext4_map_blocks() since its value isn't subsequently used. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20210113221403.18258-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-07fs-verity: add FS_IOC_READ_VERITY_METADATA ioctlEric Biggers
Add an ioctl FS_IOC_READ_VERITY_METADATA which will allow reading verity metadata from a file that has fs-verity enabled, including: - The Merkle tree - The fsverity_descriptor (not including the signature if present) - The built-in signature, if present This ioctl has similar semantics to pread(). It is passed the type of metadata to read (one of the above three), and a buffer, offset, and size. It returns the number of bytes read or an error. Separate patches will add support for each of the above metadata types. This patch just adds the ioctl itself. This ioctl doesn't make any assumption about where the metadata is stored on-disk. It does assume the metadata is in a stable format, but that's basically already the case: - The Merkle tree and fsverity_descriptor are defined by how fs-verity file digests are computed; see the "File digest computation" section of Documentation/filesystems/fsverity.rst. Technically, the way in which the levels of the tree are ordered relative to each other wasn't previously specified, but it's logical to put the root level first. - The built-in signature is the value passed to FS_IOC_ENABLE_VERITY. This ioctl is useful because it allows writing a server program that takes a verity file and serves it to a client program, such that the client can do its own fs-verity compatible verification of the file. This only makes sense if the client doesn't trust the server and if the server needs to provide the storage for the client. More concretely, there is interest in using this ability in Android to export APK files (which are protected by fs-verity) to "protected VMs". This would use Protected KVM (https://lwn.net/Articles/836693), which provides an isolated execution environment without having to trust the traditional "host". A "guest" VM can boot from a signed image and perform specific tasks in a minimum trusted environment using files that have fs-verity enabled on the host, without trusting the host or requiring that the guest has its own trusted storage. Technically, it would be possible to duplicate the metadata and store it in separate files for serving. However, that would be less efficient and would require extra care in userspace to maintain file consistency. In addition to the above, the ability to read the built-in signatures is useful because it allows a system that is using the in-kernel signature verification to migrate to userspace signature verification. Link: https://lore.kernel.org/r/20210115181819.34732-4-ebiggers@kernel.org Reviewed-by: Victor Hsieh <victorhsieh@google.com> Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-02-04ext4: fix potential htree index checksum corruptionTheodore Ts'o
In the case where we need to do an interior node split, and immediately afterwards, we are unable to allocate a new directory leaf block due to ENOSPC, the directory index checksum's will not be filled in correctly (and indeed, will not be correctly journalled). This looks like a bug that was introduced when we added largedir support. The original code doesn't make any sense (and should have been caught in code review), but it was hidden because most of the time, the index node checksum will be set by do_split(). But if do_split bails out due to ENOSPC, then ext4_handle_dirty_dx_node() won't get called, and so the directory index checksum field will not get set, leading to: EXT4-fs error (device sdb): dx_probe:858: inode #6635543: block 4022: comm nfsd: Directory index failed checksum Google-Bug-Id: 176345532 Fixes: e08ac99fa2a2 ("ext4: add largedir feature") Cc: Artem Blagodarenko <artem.blagodarenko@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-03ext4: factor out htree rep invariant checkVinicius Tinti
This patch moves some debugging code which is used to validate the hash tree node when doing a binary search of an htree node into a separate function, which is disabled by default (since it is only used by developers when they are modifying the htree code paths). In addition to cleaning up the code to make it more maintainable, it silences a Clang compiler warning when -Wunreachable-code-aggressive is enabled. (There is no plan to enable this warning by default, since there it has far too many false positives; nevertheless, this commit reduces one of the many false positives by one.) Signed-off-by: Vinicius Tinti <viniciustinti@gmail.com> Link: https://lore.kernel.org/r/20210202162837.129631-1-viniciustinti@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-03ext4: Change list_for_each* to list_for_each_entry*Daejun Park
In the fast_commit.c, list_for_each* + list_entry can be changed to list_for_each_entry*. It reduces number of variables and lines. Signed-off-by: Daejun Park <daejun7.park@samsung.com> Link: https://lore.kernel.org/r/20210111013726epcms2p4579ae56040d7043db785bf0d0a785dc7@epcms2p4 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-02ext4: don't try to processed freed blocks until mballoc is initializedTheodore Ts'o
If we try to make any changes via the journal between when the journal is initialized, but before the multi-block allocated is initialized, we will end up deferencing a NULL pointer when the journal commit callback function calls ext4_process_freed_data(). The proximate cause of this failure was commit 2d01ddc86606 ("ext4: save error info to sb through journal if available") since file system corruption problems detected before the call to ext4_mb_init() would result in a journal commit before we aborted the mount of the file system.... and we would then trigger the NULL pointer deref. Link: https://lore.kernel.org/r/YAm8qH/0oo2ofSMR@mit.edu Reported-by: Murphy Zhou <jencce.kernel@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-02ext4: use DEFINE_MUTEX() for mutex lockZheng Yongjun
mutex lock can be initialized automatically with DEFINE_MUTEX() rather than explicitly calling mutex_init(). Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Link: https://lore.kernel.org/r/20201224132244.30907-1-zhengyongjun3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-01-27block: use an on-stack bio in blkdev_issue_flushChristoph Hellwig
There is no point in allocating memory for a synchronous flush. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24ext4: support idmapped mountsChristian Brauner
Enable idmapped mounts for ext4. All dedicated helpers we need for this exist. So this basically just means we're passing down the user_namespace argument from the VFS methods to the relevant helpers. Let's create simple example where we idmap an ext4 filesystem: root@f2-vm:~# truncate -s 5G ext4.img root@f2-vm:~# mkfs.ext4 ./ext4.img mke2fs 1.45.5 (07-Jan-2020) Discarding device blocks: done Creating filesystem with 1310720 4k blocks and 327680 inodes Filesystem UUID: 3fd91794-c6ca-4b0f-9964-289a000919cf Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736 Allocating group tables: done Writing inode tables: done Creating journal (16384 blocks): done Writing superblocks and filesystem accounting information: done root@f2-vm:~# losetup -f --show ./ext4.img /dev/loop0 root@f2-vm:~# mount /dev/loop0 /mnt root@f2-vm:~# ls -al /mnt/ total 24 drwxr-xr-x 3 root root 4096 Oct 28 13:34 . drwxr-xr-x 30 root root 4096 Oct 28 13:22 .. drwx------ 2 root root 16384 Oct 28 13:34 lost+found # Let's create an idmapped mount at /idmapped1 where we map uid and gid # 0 to uid and gid 1000 root@f2-vm:/# ./mount-idmapped --map-mount b:0:1000:1 /mnt/ /idmapped1/ root@f2-vm:/# ls -al /idmapped1/ total 24 drwxr-xr-x 3 ubuntu ubuntu 4096 Oct 28 13:34 . drwxr-xr-x 30 root root 4096 Oct 28 13:22 .. drwx------ 2 ubuntu ubuntu 16384 Oct 28 13:34 lost+found # Let's create an idmapped mount at /idmapped2 where we map uid and gid # 0 to uid and gid 2000 root@f2-vm:/# ./mount-idmapped --map-mount b:0:2000:1 /mnt/ /idmapped2/ root@f2-vm:/# ls -al /idmapped2/ total 24 drwxr-xr-x 3 2000 2000 4096 Oct 28 13:34 . drwxr-xr-x 31 root root 4096 Oct 28 13:39 .. drwx------ 2 2000 2000 16384 Oct 28 13:34 lost+found Let's create another example where we idmap the rootfs filesystem without a mapping for uid 0 and gid 0: # Create an idmapped mount of for a full POSIX range of rootfs under # /mnt but without a mapping for uid 0 to reduce attack surface root@f2-vm:/# ./mount-idmapped --map-mount b:1:1:65536 / /mnt/ # Since we don't have a mapping for uid and gid 0 all files owned by # uid and gid 0 should show up as uid and gid 65534: root@f2-vm:/# ls -al /mnt/ total 664 drwxr-xr-x 31 nobody nogroup 4096 Oct 28 13:39 . drwxr-xr-x 31 root root 4096 Oct 28 13:39 .. lrwxrwxrwx 1 nobody nogroup 7 Aug 25 07:44 bin -> usr/bin drwxr-xr-x 4 nobody nogroup 4096 Oct 28 13:17 boot drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:48 dev drwxr-xr-x 81 nobody nogroup 4096 Oct 28 04:00 etc drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 home lrwxrwxrwx 1 nobody nogroup 7 Aug 25 07:44 lib -> usr/lib lrwxrwxrwx 1 nobody nogroup 9 Aug 25 07:44 lib32 -> usr/lib32 lrwxrwxrwx 1 nobody nogroup 9 Aug 25 07:44 lib64 -> usr/lib64 lrwxrwxrwx 1 nobody nogroup 10 Aug 25 07:44 libx32 -> usr/libx32 drwx------ 2 nobody nogroup 16384 Aug 25 07:47 lost+found drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:44 media drwxr-xr-x 31 nobody nogroup 4096 Oct 28 13:39 mnt drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:44 opt drwxr-xr-x 2 nobody nogroup 4096 Apr 15 2020 proc drwx--x--x 6 nobody nogroup 4096 Oct 28 13:34 root drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:46 run lrwxrwxrwx 1 nobody nogroup 8 Aug 25 07:44 sbin -> usr/sbin drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:44 srv drwxr-xr-x 2 nobody nogroup 4096 Apr 15 2020 sys drwxrwxrwt 10 nobody nogroup 4096 Oct 28 13:19 tmp drwxr-xr-x 14 nobody nogroup 4096 Oct 20 13:00 usr drwxr-xr-x 12 nobody nogroup 4096 Aug 25 07:45 var # Since we do have a mapping for uid and gid 1000 all files owned by # uid and gid 1000 should simply show up as uid and gid 1000: root@f2-vm:/# ls -al /mnt/home/ubuntu/ total 40 drwxr-xr-x 3 ubuntu ubuntu 4096 Oct 28 00:43 . drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 .. -rw------- 1 ubuntu ubuntu 2936 Oct 28 12:26 .bash_history -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo Link: https://lore.kernel.org/r/20210121131959.646623-39-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24fs: make helpers idmap mount awareChristian Brauner
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24stat: handle idmapped mountsChristian Brauner
The generic_fillattr() helper fills in the basic attributes associated with an inode. Enable it to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace before we store the uid and gid. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24acl: handle idmapped mountsChristian Brauner
The posix acl permission checking helpers determine whether a caller is privileged over an inode according to the acls associated with the inode. Add helpers that make it possible to handle acls on idmapped mounts. The vfs and the filesystems targeted by this first iteration make use of posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to translate basic posix access and default permissions such as the ACL_USER and ACL_GROUP type according to the initial user namespace (or the superblock's user namespace) to and from the caller's current user namespace. Adapt these two helpers to handle idmapped mounts whereby we either map from or into the mount's user namespace depending on in which direction we're translating. Similarly, cap_convert_nscap() is used by the vfs to translate user namespace and non-user namespace aware filesystem capabilities from the superblock's user namespace to the caller's user namespace. Enable it to handle idmapped mounts by accounting for the mount's user namespace. In addition the fileystems targeted in the first iteration of this patch series make use of the posix_acl_chmod() and, posix_acl_update_mode() helpers. Both helpers perform permission checks on the target inode. Let them handle idmapped mounts. These two helpers are called when posix acls are set by the respective filesystems to handle this case we extend the ->set() method to take an additional user namespace argument to pass the mount's user namespace down. Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24attr: handle idmapped mountsChristian Brauner
When file attributes are changed most filesystems rely on the setattr_prepare(), setattr_copy(), and notify_change() helpers for initialization and permission checking. Let them handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Helpers that perform checks on the ia_uid and ia_gid fields in struct iattr assume that ia_uid and ia_gid are intended values and have already been mapped correctly at the userspace-kernelspace boundary as we already do today. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24inode: make init and permission helpers idmapped mount awareChristian Brauner
The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-23iomap: pass a flags argument to iomap_dio_rwChristoph Hellwig
Pass a set of flags to iomap_dio_rw instead of the boolean wait_for_completion argument. The IOMAP_DIO_FORCE_WAIT flag replaces the wait_for_completion, but only needs to be passed when the iocb isn't synchronous to start with to simplify the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> [djwong: rework xfs_file.c so that we can push iomap changes separately] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-15Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "A number of bug fixes for ext4: - Fix for the new fast_commit feature - Fix some error handling codepaths in whiteout handling and mountpoint sampling - Fix how we write ext4_error information so it goes through the journal when journalling is active, to avoid races that can lead to lost error information, superblock checksum failures, or DIF/DIX features" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: remove expensive flush on fast commit ext4: fix bug for rename with RENAME_WHITEOUT ext4: fix wrong list_splice in ext4_fc_cleanup ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR ext4: don't leak old mountpoint samples ext4: drop ext4_handle_dirty_super() ext4: fix superblock checksum failure when setting password salt ext4: use sbi instead of EXT4_SB(sb) in ext4_update_super() ext4: save error info to sb through journal if available ext4: protect superblock modifications with a buffer lock ext4: drop sync argument of ext4_commit_super() ext4: combine ext4_handle_error() and save_error_info()
2021-01-15ext4: remove expensive flush on fast commitDaejun Park
In the fast commit, it adds REQ_FUA and REQ_PREFLUSH on each fast commit block when barrier is enabled. However, in recovery phase, ext4 compares CRC value in the tail. So it is sufficient to add REQ_FUA and REQ_PREFLUSH on the block that has tail. Signed-off-by: Daejun Park <daejun7.park@samsung.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20210106013242epcms2p5b6b4ed8ca86f29456fdf56aa580e74b4@epcms2p5 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-01-15ext4: fix bug for rename with RENAME_WHITEOUTyangerkun
We got a "deleted inode referenced" warning cross our fsstress test. The bug can be reproduced easily with following steps: cd /dev/shm mkdir test/ fallocate -l 128M img mkfs.ext4 -b 1024 img mount img test/ dd if=/dev/zero of=test/foo bs=1M count=128 mkdir test/dir/ && cd test/dir/ for ((i=0;i<1000;i++)); do touch file$i; done # consume all block cd ~ && renameat2(AT_FDCWD, /dev/shm/test/dir/file1, AT_FDCWD, /dev/shm/test/dir/dst_file, RENAME_WHITEOUT) # ext4_add_entry in ext4_rename will return ENOSPC!! cd /dev/shm/ && umount test/ && mount img test/ && ls -li test/dir/file1 We will get the output: "ls: cannot access 'test/dir/file1': Structure needs cleaning" and the dmesg show: "EXT4-fs error (device loop0): ext4_lookup:1626: inode #2049: comm ls: deleted inode referenced: 139" ext4_rename will create a special inode for whiteout and use this 'ino' to replace the source file's dir entry 'ino'. Once error happens latter(the error above was the ENOSPC return from ext4_add_entry in ext4_rename since all space has been consumed), the cleanup do drop the nlink for whiteout, but forget to restore 'ino' with source file. This will trigger the bug describle as above. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org Fixes: cd808deced43 ("ext4: support RENAME_WHITEOUT") Link: https://lore.kernel.org/r/20210105062857.3566-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-01-15ext4: fix wrong list_splice in ext4_fc_cleanupDaejun Park
After full/fast commit, entries in staging queue are promoted to main queue. In ext4_fs_cleanup function, it splice to staging queue to staging queue. Fixes: aa75f4d3daaeb ("ext4: main fast-commit commit path") Signed-off-by: Daejun Park <daejun7.park@samsung.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201230094851epcms2p6eeead8cc984379b37b2efd21af90fd1a@epcms2p6 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2021-01-15ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERRYi Li
1: ext4_iget/ext4_find_extent never returns NULL, use IS_ERR instead of IS_ERR_OR_NULL to fix this. 2: ext4_fc_replay_inode should set the inode to NULL when IS_ERR. and go to call iput properly. Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Yi Li <yili@winhong.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201230033827.3996064-1-yili@winhong.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2021-01-13ext4: simplify i_state checks in __ext4_update_other_inode_time()Eric Biggers
Since I_DIRTY_TIME and I_DIRTY_INODE are mutually exclusive in i_state, there's no need to check for I_DIRTY_TIME && !I_DIRTY_INODE. Just check for I_DIRTY_TIME. Also introduce a helper function in include/linux/fs.h to do this check. Link: https://lore.kernel.org/r/20210112190253.64307-12-ebiggers@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-13fs: don't call ->dirty_inode for lazytime timestamp updatesEric Biggers
There is no need to call ->dirty_inode for lazytime timestamp updates (i.e. for __mark_inode_dirty(I_DIRTY_TIME)), since by the definition of lazytime, filesystems must ignore these updates. Filesystems only need to care about the updated timestamps when they expire. Therefore, only call ->dirty_inode when I_DIRTY_INODE is set. Based on a patch from Christoph Hellwig: https://lore.kernel.org/r/20200325122825.1086872-4-hch@lst.de Link: https://lore.kernel.org/r/20210112190253.64307-6-ebiggers@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jan Kara <jack@suse.cz>
2020-12-24Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Various bug fixes and cleanups for ext4; no new features this cycle" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (29 commits) ext4: remove unnecessary wbc parameter from ext4_bio_write_page ext4: avoid s_mb_prefetch to be zero in individual scenarios ext4: defer saving error info from atomic context ext4: simplify ext4 error translation ext4: move functions in super.c ext4: make ext4_abort() use __ext4_error() ext4: standardize error message in ext4_protect_reserved_inode() ext4: remove redundant sb checksum recomputation ext4: don't remount read-only with errors=continue on reboot ext4: fix deadlock with fs freezing and EA inodes jbd2: add a helper to find out number of fast commit blocks ext4: make fast_commit.h byte identical with e2fsprogs/fast_commit.h ext4: fix fall-through warnings for Clang ext4: add docs about fast commit idempotence ext4: remove the unused EXT4_CURRENT_REV macro ext4: fix an IS_ERR() vs NULL check ext4: check for invalid block size early when mounting a file system ext4: fix a memory leak of ext4_free_data ext4: delete nonsensical (commented-out) code inside ext4_xattr_block_set() ext4: update ext4_data_block_valid related comments ...
2020-12-22ext4: don't leak old mountpoint samplesTheodore Ts'o
When the first file is opened, ext4 samples the mountpoint of the filesystem in 64 bytes of the super block. It does so using strlcpy(), this means that the remaining bytes in the super block string buffer are untouched. If the mount point before had a longer path than the current one, it can be reconstructed. Consider the case where the fs was mounted to "/media/johnjdeveloper" and later to "/". The super block buffer then contains "/\x00edia/johnjdeveloper". This case was seen in the wild and caused confusion how the name of a developer ands up on the super block of a filesystem used in production... Fix this by using strncpy() instead of strlcpy(). The superblock field is defined to be a fixed-size char array, and it is already marked using __nonstring in fs/ext4/ext4.h. The consumer of the field in e2fsprogs already assumes that in the case of a 64+ byte mount path, that s_last_mounted will not be NUL terminated. Link: https://lore.kernel.org/r/X9ujIOJG/HqMr88R@mit.edu Reported-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-12-22ext4: drop ext4_handle_dirty_super()Jan Kara
The wrapper is now useless since it does what ext4_handle_dirty_metadata() does. Just remove it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-9-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: fix superblock checksum failure when setting password saltJan Kara
When setting password salt in the superblock, we forget to recompute the superblock checksum so it will not match until the next superblock modification which recomputes the checksum. Fix it. CC: Michael Halcrow <mhalcrow@google.com> Reported-by: Andreas Dilger <adilger@dilger.ca> Fixes: 9bd8212f981e ("ext4 crypto: add encryption policy and password salt support") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-8-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: use sbi instead of EXT4_SB(sb) in ext4_update_super()Jan Kara
No behavioral change. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-6-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: save error info to sb through journal if availableJan Kara
If journalling is still working at the moment we get to writing error information to the superblock we cannot write directly to the superblock as such write could race with journalled update of the superblock and cause journal checksum failures, writing inconsistent information to the journal or other problems. We cannot journal the superblock directly from the error handling functions as we are running in uncertain context and could deadlock so just punt journalled superblock update to a workqueue. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: protect superblock modifications with a buffer lockJan Kara
Protect all superblock modifications (including checksum computation) with a superblock buffer lock. That way we are sure computed checksum matches current superblock contents (a mismatch could cause checksum failures in nojournal mode or if an unjournalled superblock update races with a journalled one). Also we avoid modifying superblock contents while it is being written out (which can cause DIF/DIX failures if we are running in nojournal mode). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: drop sync argument of ext4_commit_super()Jan Kara
Everybody passes 1 as sync argument of ext4_commit_super(). Just drop it. Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: remove unnecessary wbc parameter from ext4_bio_write_pageLei Chen
ext4_bio_write_page does not need wbc parameter, since its parameter io contains the io_wbc field. The io::io_wbc is initialized by ext4_io_submit_init which is called in ext4_writepages and ext4_writepage functions prior to ext4_bio_write_page. Therefor, when ext4_bio_write_page is called, wbc info has already been included in io parameter. Signed-off-by: Lei Chen <lennychen@tencent.com> Link: https://lore.kernel.org/r/1607669664-25656-1-git-send-email-lennychen@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: combine ext4_handle_error() and save_error_info()Jan Kara
save_error_info() is always called together with ext4_handle_error(). Combine them into a single call and move unconditional bits out of save_error_info() into ext4_handle_error(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: avoid s_mb_prefetch to be zero in individual scenariosChunguang Xu
Commit cfd732377221 ("ext4: add prefetching for block allocation bitmaps") introduced block bitmap prefetch, and expects to read block bitmaps of flex_bg through an IO. However, it seems to ignore the value range of s_log_groups_per_flex. In the scenario where the value of s_log_groups_per_flex is greater than 27, s_mb_prefetch or s_mb_prefetch_limit will overflow, cause a divide zero exception. In addition, the logic of calculating nr is also flawed, because the size of flexbg is fixed during a single mount, but s_mb_prefetch can be modified, which causes nr to fail to meet the value condition of [1, flexbg_size]. To solve this problem, we need to set the upper limit of s_mb_prefetch. Since we expect to load block bitmaps of a flex_bg through an IO, we can consider determining a reasonable upper limit among the IO limit parameters. After consideration, we chose BLK_MAX_SEGMENT_SIZE. This is a good choice to solve divide zero problem and avoiding performance degradation. [ Some minor code simplifications to make the changes easy to follow -- TYT ] Reported-by: Tosk Robot <tencent_os_robot@tencent.com> Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Samuel Liao <samuelliao@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/1607051143-24508-1-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: defer saving error info from atomic contextJan Kara
When filesystem inconsistency is detected with group locked, we currently try to modify superblock to store error there without blocking. However this can cause superblock checksum failures (or DIF/DIX failure) when the superblock is just being written out. Make error handling code just store error information in ext4_sb_info structure and copy it to on-disk superblock only in ext4_commit_super(). In case of error happening with group locked, we just postpone the superblock flushing to a workqueue. [ Added fixup so that s_first_error_* does not get updated after the file system is remounted. Also added fix for syzbot failure. - Ted ] Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201127113405.26867-8-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Hillf Danton <hdanton@sina.com> Reported-by: syzbot+9043030c040ce1849a60@syzkaller.appspotmail.com
2020-12-17Merge tag 'f2fs-for-5.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've made more work into per-file compression support. For example, F2FS_IOC_GET | SET_COMPRESS_OPTION provides a way to change the algorithm or cluster size per file. F2FS_IOC_COMPRESS | DECOMPRESS_FILE provides a way to compress and decompress the existing normal files manually. There is also a new mount option, compress_mode=fs|user, which can control who compresses the data. Chao also added a checksum feature with a mount option so that we are able to detect any corrupted cluster. In addition, Daniel contributed casefolding with encryption patch, which will be used for Android devices. Summary: Enhancements: - add ioctls and mount option to manage per-file compression feature - support casefolding with encryption - support checksum for compressed cluster - avoid IO starvation by replacing mutex with rwsem - add sysfs, max_io_bytes, to control max bio size Bug fixes: - fix use-after-free issue when compression and fsverity are enabled - fix consistency corruption during fault injection test - fix data offset for lseek - get rid of buffer_head which has 32bits limit in fiemap - fix some bugs in multi-partitions support - fix nat entry count calculation in shrinker - fix some stat information And, we've refactored some logics and fix minor bugs as well" * tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits) f2fs: compress: fix compression chksum f2fs: fix shift-out-of-bounds in sanity_check_raw_super() f2fs: fix race of pending_pages in decompression f2fs: fix to account inline xattr correctly during recovery f2fs: inline: fix wrong inline inode stat f2fs: inline: correct comment in f2fs_recover_inline_data f2fs: don't check PAGE_SIZE again in sanity_check_raw_super() f2fs: convert to F2FS_*_INO macro f2fs: introduce max_io_bytes, a sysfs entry, to limit bio size f2fs: don't allow any writes on readonly mount f2fs: avoid race condition for shrinker count f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE f2fs: add compress_mode mount option f2fs: Remove unnecessary unlikely() f2fs: init dirty_secmap incorrectly f2fs: remove buffer_head which has 32bits limit f2fs: fix wrong block count instead of bytes f2fs: use new conversion functions between blks and bytes f2fs: rename logical_to_blk and blk_to_logical f2fs: fix kbytes written stat for multi-device case ...
2020-12-17ext4: simplify ext4 error translationJan Kara
We convert errno's to ext4 on-disk format error codes in save_error_info(). Add a function and a bit of macro magic to make this simpler. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127113405.26867-7-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: move functions in super.cJan Kara
Just move error info related functions in super.c close to ext4_handle_error(). We'll want to combine save_error_info() with ext4_handle_error() and this makes change more obvious and saves a forward declaration as well. No functional change. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127113405.26867-6-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: make ext4_abort() use __ext4_error()Jan Kara
The only difference between __ext4_abort() and __ext4_error() is that the former one ignores errors=continue mount option. Unify the code to reduce duplication. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127113405.26867-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: standardize error message in ext4_protect_reserved_inode()Jan Kara
We use __ext4_error() when ext4_protect_reserved_inode() finds filesystem corruption. However EXT4_ERROR_INODE_ERR() is perfectly capable of reporting all the needed information. So just use that. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127113405.26867-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: remove redundant sb checksum recomputationJan Kara
Superblock is written out either through ext4_commit_super() or through ext4_handle_dirty_super(). In both cases we recompute the checksum so it is not necessary to recompute it after updating superblock free inodes & blocks counters. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127113405.26867-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: don't remount read-only with errors=continue on rebootJan Kara
ext4_handle_error() with errors=continue mount option can accidentally remount the filesystem read-only when the system is rebooting. Fix that. Fixes: 1dc1097ff60e ("ext4: avoid panic during forced reboot") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20201127113405.26867-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: fix deadlock with fs freezing and EA inodesJan Kara
Xattr code using inodes with large xattr data can end up dropping last inode reference (and thus deleting the inode) from places like ext4_xattr_set_entry(). That function is called with transaction started and so ext4_evict_inode() can deadlock against fs freezing like: CPU1 CPU2 removexattr() freeze_super() vfs_removexattr() ext4_xattr_set() handle = ext4_journal_start() ... ext4_xattr_set_entry() iput(old_ea_inode) ext4_evict_inode(old_ea_inode) sb->s_writers.frozen = SB_FREEZE_FS; sb_wait_write(sb, SB_FREEZE_FS); ext4_freeze() jbd2_journal_lock_updates() -> blocks waiting for all handles to stop sb_start_intwrite() -> blocks as sb is already in SB_FREEZE_FS state Generally it is advisable to delete inodes from a separate transaction as it can consume quite some credits however in this case it would be quite clumsy and furthermore the credits for inode deletion are quite limited and already accounted for. So just tweak ext4_evict_inode() to avoid freeze protection if we have transaction already started and thus it is not really needed anyway. Cc: stable@vger.kernel.org Fixes: dec214d00e0d ("ext4: xattr inode deduplication") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127110649.24730-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: make fast_commit.h byte identical with e2fsprogs/fast_commit.hHarshad Shirwadkar
This patch makes fast_commit.h byte by byte identical with e2fsprogs/fast_commit.h. This will help us ensure that there are no on-disk format inconsistencies between e2fsck and kernel ext4. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201120202232.2240293-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of just letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/03497331f088a938d7a728e7a689bd7953139429.1605896059.git.gustavoars@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: add docs about fast commit idempotenceHarshad Shirwadkar
Fast commit on-disk format is designed such that the replay of these tags can be idempotent. This patch adds documentation in the code in form of comments and in form kernel docs that describes these characteristics. This patch also adds a TODO item needed to ensure kernel fast commit replay idempotence. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201119232822.1860882-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: remove the unused EXT4_CURRENT_REV macroKaixu Xia
There are no callers of the EXT4_CURRENT_REV macro, so remove it. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1605164202-31120-1-git-send-email-kaixuxia@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: fix an IS_ERR() vs NULL checkDan Carpenter
The ext4_find_extent() function never returns NULL, it returns error pointers. Fixes: 44059e503b03 ("ext4: fast commit recovery path") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201023112232.GB282278@mwanda Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org