path: root/fs
Age | Commit message | Author
2020-01-17 | f2fs: fix memleak of kobject | Chao Yu
If kobject_init_and_add() failed, caller needs to invoke kobject_put() to release kobject explicitly. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 | f2fs: fix to add swap extent correctly | Chao Yu
As Youling reported on the mailing list:
https://www.linuxquestions.org/questions/linux-newbie-8/the-file-system-f2fs-is-broken-4175666043/
https://www.linux.org/threads/the-file-system-f2fs-is-broken.26490/

There is a test case that can corrupt an f2fs image:
- dd if=/dev/zero of=/swapfile bs=1M count=4096
- chmod 600 /swapfile
- mkswap /swapfile
- swapon --discard /swapfile

The root cause is that f2fs_swap_activate() intends to return zero to setup_swap_extents() to enable SWP_FS mode (swap file goes through the fs). In this flow, setup_swap_extents() sets up the swap extent with a wrong block address range, which results in discard_swap() erasing incorrect addresses.

Because f2fs_swap_activate() has pinned the swapfile, its data block addresses will not change, so it is safe to let swap handle IO through the raw device. We can therefore get rid of SWP_FS mode and initialize the swap extents inside f2fs_swap_activate(); this way, a later discard_swap() can trim the right address range.

Fixes: 4969c06a0d83 ("f2fs: support swap file w/ DIO")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 | f2fs: run fsck when getting bad inode during GC | Jaegeuk Kim
This is to avoid infinite GC when trying to disable checkpoint. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 | f2fs: support data compression | Chao Yu
This patch tries to support compression in f2fs.

- A new term, cluster, is defined as the basic unit of compression; a file can be divided into multiple clusters logically. One cluster includes 4 << n (n >= 0) logical pages, the compression size is also the cluster size, and each cluster can be compressed or not.

- In the cluster metadata layout, one special flag is used to indicate whether a cluster is a compressed one or a normal one. For a compressed cluster, the following metadata maps the cluster to [1, 4 << n - 1] physical blocks, where f2fs stores data including the compress header and the compressed data.

- In order to eliminate write amplification during overwrite, F2FS only supports compression on write-once files; data can be compressed only when all logical blocks in the file are valid and the cluster compress ratio is lower than the specified threshold.

- To enable compression on a regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext

Compress metadata layout:
                             [Dnode Structure]
             +-----------------------------------------------+
             | cluster 1 | cluster 2 | ......... | cluster N |
             +-----------------------------------------------+
             .           .                       .           .
       .                       .                .                      .
  .         Compressed Cluster       .        .        Normal Cluster            .
+----------+---------+---------+---------+  +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 |  | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+  +---------+---------+---------+---------+
           .                             .
         .                                           .
       .                                                           .
      +-------------+-------------+----------+----------------------------+
      | data length | data chksum | reserved |      compressed data       |
      +-------------+-------------+----------+----------------------------+

Changelog:

20190326:
- fix error handling of read_end_io().
- remove unneeded comments in f2fs_encrypt_one_page().

20190327:
- fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
- don't jump into loop directly to avoid uninitialized variables.
- add TODO tag in error path of f2fs_write_cache_pages().

20190328:
- fix wrong merge condition in f2fs_read_multi_pages().
- check compressed file in f2fs_post_read_required().

20190401
- allow overwrite on non-compressed cluster.
- check cluster meta before writing compressed data.

20190402
- don't preallocate blocks for compressed file.
- add lz4 compress algorithm
- process multiple post read works in one workqueue
  Now f2fs supports processing post read work in multiple workqueue, it shows low performance due to schedule overhead of multiple workqueue executing orderly.

20190921
- compress: support buffered overwrite
  C: compress cluster flag
  V: valid block address
  N: NEW_ADDR

  One cluster contains 4 blocks

  before overwrite    after overwrite
  - VVVV           ->  CVNN
  - CVNN           ->  VVVV
  - CVNN           ->  CVNN
  - CVNN           ->  CVVV
  - CVVV           ->  CVNN
  - CVVV           ->  CVVV

20191029
- add kconfig F2FS_FS_COMPRESSION to isolate compression related codes, add kconfig F2FS_FS_{LZO,LZ4} to cover backend algorithm. note that: will remove lzo backend if Jaegeuk agreed that too.
- update codes according to Eric's comments.

20191101
- apply fixes from Jaegeuk

20191113
- apply fixes from Jaegeuk
- split workqueue for fsverity

20191216
- apply fixes from Jaegeuk

20200117
- fix to avoid NULL pointer dereference

[Jaegeuk Kim]
- add tracepoint for f2fs_{,de}compress_pages()
- fix many bugs and add some compression stats
- fix overwrite/mmap bugs
- address 32bit build error, reported by Geert.
- bug fixes when handling errors and i_compressed_blocks

Reported-by: <noreply@ellerman.id.au>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
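The cluster arithmetic described in the commit message (one cluster covers 4 << n logical pages) can be sketched in a few lines of C. This is only an illustration: the helper names `cluster_pages`, `cluster_index`, and `page_in_cluster` are hypothetical and are not taken from the f2fs sources.

```c
#include <assert.h>
#include <stdint.h>

/* Number of logical pages in one cluster: 4 << n, with n >= 0.
 * Hypothetical helper, not an f2fs function. */
static inline uint64_t cluster_pages(unsigned int n)
{
	assert(n < 60); /* keep the shift well defined for a 64-bit result */
	return UINT64_C(4) << n;
}

/* Index of the cluster that contains logical page `page`. */
static inline uint64_t cluster_index(uint64_t page, unsigned int n)
{
	return page / cluster_pages(n);
}

/* Position of logical page `page` inside its cluster. */
static inline uint64_t page_in_cluster(uint64_t page, unsigned int n)
{
	return page % cluster_pages(n);
}
```

With n = 0 a cluster holds 4 pages, so logical page 9 falls into cluster 2 at position 1.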
2020-01-17 | jbd2: clear JBD2_ABORT flag before journal_reset to update log tail info when load journal | Kai Li
If the journal is dirty when the filesystem is mounted, jbd2 will replay the journal but the journal superblock will not be updated by journal_reset() because JBD2_ABORT flag is still set (it was set in journal_init_common()). This is problematic because when a new transaction is then committed, it will be recorded in block 1 (journal->j_tail was set to 1 in journal_reset()). If unclean shutdown happens again before the journal superblock is updated, the new recorded transaction will not be replayed during the next mount (because of stale sb->s_start and sb->s_sequence values) which can lead to filesystem corruption.

Fixes: 85e0c4e89c1b ("jbd2: if the journal is aborted then don't allow update of the log tail")
Signed-off-by: Kai Li <li.kai4@h3c.com>
Link: https://lore.kernel.org/r/20200111022542.5008-1-li.kai4@h3c.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | ext4: drop ext4_kvmalloc() | Theodore Ts'o
As Jan pointed out[1], as of commit 81378da64de ("jbd2: mark the transaction context with the scope GFP_NOFS context") we use memalloc_nofs_{save,restore}() while a jbd2 handle is active. So ext4_kvmalloc(), which existed so that we could allocate using GFP_NOFS, is no longer necessary. [1] https://lore.kernel.org/r/20200109100007.GC27035@quack2.suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20200116155031.266620-1-tytso@mit.edu Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 | ext4: Add EXT4_IOC_FSGETXATTR/EXT4_IOC_FSSETXATTR to compat_ioctl | Martijn Coenen
These are backed by 'struct fsxattr' which has the same size on all architectures. Signed-off-by: Martijn Coenen <maco@android.com> Link: https://lore.kernel.org/r/20191227134639.35869-1-maco@android.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | ext4: remove unused macro MPAGE_DA_EXTENT_TAIL | Ritesh Harjani
Remove the unused macro MPAGE_DA_EXTENT_TAIL, which is no longer used after commit 4e7ea81d ("ext4: restructure writeback path"). Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200101095137.25656-1-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | ext4: add missing braces in ext4_ext_drop_refs() | Eric Biggers
For clarity, add braces to the loop in ext4_ext_drop_refs(). Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-9-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 | ext4: fix some nonstandard indentation in extents.c | Eric Biggers
Clean up some code that was using 2-character indents. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-8-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 | ext4: remove obsolete comment from ext4_can_extents_be_merged() | Eric Biggers
Support for unwritten extents was added to ext4 a long time ago, so remove a misleading comment that says they're a future feature. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-7-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 | ext4: fix documentation for ext4_ext_try_to_merge() | Eric Biggers
Don't mention the nonexistent return value, and mention both types of merges that are attempted. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-6-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 | ext4: make some functions static in extents.c | Eric Biggers
Make the following functions static since they're only used in extents.c:

    __ext4_ext_dirty()
    ext4_can_extents_be_merged()
    ext4_collapse_range()
    ext4_insert_range()

Also remove the prototype for ext4_ext_writepage_trans_blocks(), as this function is not defined anywhere.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191231180444.46586-5-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 | ext4: remove redundant S_ISREG() checks from ext4_fallocate() | Eric Biggers
ext4_fallocate() is only used in the file_operations for regular files. Also, the VFS only allows fallocate() on regular files and block devices, but block devices always use blkdev_fallocate(). For both of these reasons, S_ISREG() is always true in ext4_fallocate(). Therefore the S_ISREG() checks in ext4_zero_range(), ext4_collapse_range(), ext4_insert_range(), and ext4_punch_hole() are redundant. Remove them. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-4-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 | ext4: clean up len and offset checks in ext4_fallocate() | Eric Biggers
- Fix some comments.

- Consistently access i_size directly rather than using i_size_read(), since in all relevant cases we're under inode_lock().

- Simplify the alignment checks by using the IS_ALIGNED() macro.

- In ext4_insert_range(), do the check against s_maxbytes in a way that is safe against signed overflow. (This doesn't currently matter for ext4 due to ext4's limited max file size, but this is something other filesystems have gotten wrong. We might as well do it safely.)

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191231180444.46586-3-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
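The last item above, a size check that is safe against signed overflow, can be illustrated with a small standalone C sketch. It is not the ext4 code: the function name `insert_exceeds_max` and its parameters are hypothetical, and it only shows the general idea of comparing against `maxbytes - size` instead of computing `size + len`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if growing a file of `size` bytes by `len` bytes would
 * exceed `maxbytes`. Hypothetical helper for illustration only. */
static inline bool insert_exceeds_max(int64_t size, int64_t len,
				      int64_t maxbytes)
{
	assert(size >= 0 && len >= 0 && size <= maxbytes);
	/*
	 * "size + len > maxbytes" may overflow a signed 64-bit value.
	 * "maxbytes - size" cannot, because 0 <= size <= maxbytes.
	 */
	return len > maxbytes - size;
}
```

For example, a call with size = INT64_MAX - 1 and len = INT64_MAX is rejected cleanly, whereas the naive sum would overflow.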
2020-01-17 | ext4: remove ext4_{ind,ext}_calc_metadata_amount() | Eric Biggers
Remove the ext4_ind_calc_metadata_amount() and ext4_ext_calc_metadata_amount() functions, which have been unused since commit 71d4f7d03214 ("ext4: remove metadata reservation checks"). Also remove the i_da_metadata_calc_last_lblock and i_da_metadata_calc_len fields from struct ext4_inode_info, as these were only used by these removed functions. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 | ext4: remove unneeded check for error allocating bio_post_read_ctx | Eric Biggers
Since allocating an object from a mempool never fails when __GFP_DIRECT_RECLAIM (which is included in GFP_NOFS) is set, the check for failure to allocate a bio_post_read_ctx is unnecessary. Remove it. Also remove the redundant assignment to ->bi_private. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231181256.47770-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | ext4: fix deadlock allocating bio_post_read_ctx from mempool | Eric Biggers
Without any form of coordination, any case where multiple allocations from the same mempool are needed at a time to make forward progress can deadlock under memory pressure. This is the case for struct bio_post_read_ctx, as one can be allocated to decrypt a Merkle tree page during fsverity_verify_bio(), which itself is running from a post-read callback for a data bio which has its own struct bio_post_read_ctx. Fix this by freeing the first bio_post_read_ctx before calling fsverity_verify_bio(). This works because verity (if enabled) is always the last post-read step. This deadlock can be reproduced by trying to read from an encrypted verity file after reducing NUM_PREALLOC_POST_READ_CTXS to 1 and patching mempool_alloc() to pretend that pool->alloc() always fails. Note that since NUM_PREALLOC_POST_READ_CTXS is actually 128, to actually hit this bug in practice would require reading from lots of encrypted verity files at the same time. But it's theoretically possible, as N available objects isn't enough to guarantee forward progress when > N/2 threads each need 2 objects at a time. Fixes: 22cfe4b48ccb ("ext4: add fs-verity read support") Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231181222.47684-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | ext4: fix deadlock allocating crypto bounce page from mempool | Eric Biggers
ext4_writepages() on an encrypted file has to encrypt the data, but it can't modify the pagecache pages in-place, so it encrypts the data into bounce pages and writes those instead. All bounce pages are allocated from a mempool using GFP_NOFS. This is not correct use of a mempool, and it can deadlock. This is because GFP_NOFS includes __GFP_DIRECT_RECLAIM, which enables the "never fail" mode for mempool_alloc() where a failed allocation will fall back to waiting for one of the preallocated elements in the pool. But since this mode is used for all a bio's pages and not just the first, it can deadlock waiting for pages already in the bio to be freed. This deadlock can be reproduced by patching mempool_alloc() to pretend that pool->alloc() always fails (so that it always falls back to the preallocations), and then creating an encrypted file of size > 128 KiB. Fix it by only using GFP_NOFS for the first page in the bio. For subsequent pages just use GFP_NOWAIT, and if any of those fail, just submit the bio and start a new one. This will need to be fixed in f2fs too, but that's less straightforward. Fixes: c9af28fdd449 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231181149.47619-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
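The allocation policy described in this fix can be modeled in userspace C. The sketch below is a toy model, not kernel code: `count_bios` and its parameters are hypothetical, and it assumes the worst case used by the reproducer, where regular allocation always fails and only the preallocated pool is available. The first page of each bio is allowed to wait (modeling GFP_NOFS), later pages must not wait (modeling GFP_NOWAIT), and a failed non-waiting allocation submits the current bio, which returns its pages to the pool.

```c
#include <assert.h>

/* Toy model: how many bios are needed to write `npages` pages when
 * bounce pages come only from a pool of `pool_size` elements.
 * Hypothetical helper for illustration only. */
static inline int count_bios(int npages, int pool_size)
{
	int free_pages = pool_size;	/* elements left in the pool */
	int in_bio = 0;			/* pages held by the current bio */
	int bios = 0;			/* bios submitted so far */

	assert(npages >= 0 && pool_size > 0);

	for (int i = 0; i < npages; i++) {
		if (in_bio > 0 && free_pages == 0) {
			/* Non-waiting allocation failed: submit the bio.
			 * Its completion returns the pages to the pool. */
			free_pages += in_bio;
			in_bio = 0;
			bios++;
		}
		/* Either the first page of a bio (which may wait, and the
		 * pool is never empty here) or a successful non-waiting
		 * allocation. */
		free_pages--;
		in_bio++;
	}
	if (in_bio > 0)
		bios++;		/* submit the final, partially filled bio */
	return bios;
}
```

In this model no bio ever waits for pages that it holds itself, so the loop always makes progress; with 10 pages and a pool of 4 it submits 3 bios.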
2020-01-17 | ext4: Delete ext4_kvzalloc() | Naoto Kobayashi
Since we're not using ext4_kvzalloc(), delete this function. Signed-off-by: Naoto Kobayashi <naoto.kobayashi4c@gmail.com> Link: https://lore.kernel.org/r/20191227080523.31808-2-naoto.kobayashi4c@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | ext4: re-enable extent zeroout optimization on encrypted files | Eric Biggers
For encrypted files, commit 36086d43f657 ("ext4 crypto: fix bugs in ext4_encrypted_zeroout()") disabled the optimization where when a write occurs to the middle of an unwritten extent, the head and/or tail of the extent (when they aren't too large) are zeroed out, turned into an initialized extent, and merged with the part being written to. This optimization helps prevent fragmentation of the extent tree. However, disabling this optimization also made fscrypt_zeroout_range() nearly impossible to test, as now it's only reachable via the very rare case in ext4_split_extent_at() where allocating a new extent tree block fails due to ENOSPC. 'gce-xfstests -c ext4/encrypt -g auto' doesn't even hit this at all. It's preferable to avoid really rare cases that are hard to test. That commit also cited data corruption in xfstest generic/127 as a reason to disable the extent zeroout optimization, but that's no longer reproducible anymore. It also cited fscrypt_zeroout_range() having poor performance, but I've written a patch to fix that. Therefore, re-enable the extent zeroout optimization on encrypted files. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191226161114.53606-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | ext4: only use fscrypt_zeroout_range() on regular files | Eric Biggers
fscrypt_zeroout_range() is only for encrypted regular files, not for encrypted directories or symlinks. Fortunately, currently it seems it's never called on non-regular files. But to be safe ext4 should explicitly check S_ISREG() before calling it. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191226161022.53490-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | ext4: allow ZERO_RANGE on encrypted files | Eric Biggers
When ext4 encryption support was first added, ZERO_RANGE was disallowed, supposedly because test failures (e.g. ext4/001) were seen when enabling it, and at the time there wasn't enough time/interest to debug it.

However, there's actually no reason why ZERO_RANGE can't work on encrypted files. And in fact it *does* work now. Whole blocks in the zeroed range are converted to unwritten extents, as usual; encryption makes no difference for that part. Partial blocks are zeroed in the pagecache and then ->writepages() encrypts those blocks as usual. ext4_block_zero_page_range() handles reading and decrypting the block if needed before actually doing the pagecache write.

Also, f2fs has always supported ZERO_RANGE on encrypted files.

As far as I can tell, the reason that ext4/001 was failing in v4.1 was actually because of one of the bugs fixed by commit 36086d43f657 ("ext4 crypto: fix bugs in ext4_encrypted_zeroout()"). The bug made ext4_encrypted_zeroout() always return a positive value, which caused unwritten extents in encrypted files to sometimes not be marked as initialized after being written to. This bug was not actually in ZERO_RANGE; it just happened to trigger during the extents manipulation done in ext4/001 (and probably other tests too).

So, let's enable ZERO_RANGE on encrypted files on ext4.

Tested with:
    gce-xfstests -c ext4/encrypt -g auto
    gce-xfstests -c ext4/encrypt_1k -g auto

Got the same set of test failures both with and without this patch. But with this patch 6 fewer tests are skipped: ext4/001, generic/008, generic/009, generic/033, generic/096, and generic/511.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191226154216.4808-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | ext4: handle decryption error in __ext4_block_zero_page_range() | Eric Biggers
fscrypt_decrypt_pagecache_blocks() can fail, because it uses skcipher_request_alloc(), which uses kmalloc(), which can fail; and also because it calls crypto_skcipher_decrypt(), which can fail depending on the driver that actually implements the crypto. Therefore it's not appropriate to WARN on decryption error in __ext4_block_zero_page_range(). Remove the WARN and just handle the error instead. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191226154105.4704-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | ext4: remove unnecessary selections from EXT3_FS | Eric Biggers
Since EXT3_FS already selects EXT4_FS, there's no reason for it to redundantly select all the selections of EXT4_FS -- notwithstanding the comments that claim otherwise. Remove these redundant selections to avoid confusion. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191226153920.4466-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 | ext4: use true,false for bool variable | zhengbin
Fixes coccicheck warning:

fs/ext4/extents.c:5271:6-12: WARNING: Assignment of 0/1 to bool variable
fs/ext4/extents.c:5287:4-10: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1577241959-138695-1-git-send-email-zhengbin13@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | ext4: uninline ext4_inode_journal_mode() | Eric Biggers
Determining an inode's journaling mode has gotten more complicated over time. Move ext4_inode_journal_mode() from an inline function into ext4_jbd2.c to reduce the compiled code size. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191209233602.117778-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 | ext4: remove unnecessary ifdefs in htree_dirblock_to_tree() | Eric Biggers
The ifdefs for CONFIG_FS_ENCRYPTION in htree_dirblock_to_tree() are unnecessary, as the called functions are already stubbed out when !CONFIG_FS_ENCRYPTION. Remove them. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191209213225.18477-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 | ext4: remove unnecessary assignment in ext4_htree_store_dirent() | Chengguang Xu
We have allocated memory using kzalloc() so don't have to set 0 again in last byte. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Link: https://lore.kernel.org/r/20191206054317.3107-1-cgxu519@mykernel.net Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | ext4: avoid fetching btime in ext4_getattr() unless requested | Theodore Ts'o
Linus observed that, in an allmodconfig build which does a lot of stat(2) calls, ext4_getattr() accounted for a noticeable (1%) amount of CPU time, due to the cache line for i_extra_isize getting pulled in. Since the normal stat system call doesn't return btime, it's a complete waste. So only calculate btime when it is explicitly requested. [ Fixed to check against request_mask instead of query_flags. ] Link: https://lore.kernel.org/r/CAHk-=wivmk_j6KbTX+Er64mLrG8abXZo0M10PNdAnHc8fWXfsQ@mail.gmail.com Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 | Merge tag 'io_uring-5.5-2020-01-16' of git://git.kernel.dk/linux-block | Linus Torvalds
Pull io_uring fixes from Jens Axboe:

- Ensure ->result is always set when IO is retried (Bijan)
- In conjunction with the above, fix a regression in polled IO issue when retried (me/Bijan)
- Don't setup async context for read/write fixed, otherwise we may wrongly map the iovec on retry (me)
- Cancel io-wq work if we fail getting mm reference (me)
- Ensure dependent work is always initialized correctly (me)
- Only allow original task to submit IO, don't allow it from a passed ring fd (me)

* tag 'io_uring-5.5-2020-01-16' of git://git.kernel.dk/linux-block:
  io_uring: only allow submit from owning task
  io_uring: ensure workqueue offload grabs ring mutex for poll list
  io_uring: clear req->result always before issuing a read/write request
  io_uring: be consistent in assigning next work from handler
  io-wq: cancel work if we fail getting a mm reference
  io_uring: don't setup async context for read/write fixed
2020-01-17 | Merge tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds
Pull btrfs fixes from David Sterba:
"A few more fixes that have been in the works during the last two weeks. All have a user visible effect and are stable material:

- scrub: properly update progress after calling cancel ioctl, calling 'resume' would start from the beginning otherwise
- fix subvolume reference removal, after moving out of the original path the reference is not recognized and will lead to transaction abort
- fix reloc root lifetime checks, could lead to crashes when there's subvolume cleaning running in parallel
- fix memory leak when quotas get disabled in the middle of extent accounting
- fix transaction abort in case of balance being started on degraded mount on eg. RAID1"

* tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: check rw_devices, not num_devices for balance
  Btrfs: always copy scrub arguments back to user space
  btrfs: relocation: fix reloc_root lifespan and access
  btrfs: fix memory leak in qgroup accounting
  btrfs: do not delete mismatched root refs
  btrfs: fix invalid removal of root ref
  btrfs: rework arguments of btrfs_unlink_subvol
2020-01-17 | Merge tag 'fuse-fixes-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse | Linus Torvalds
Pull fuse fix from Miklos Szeredi:
"Fix a regression in the last release affecting the ftp module of the gvfs filesystem"

* tag 'fuse-fixes-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix fuse_send_readpages() in the syncronous read case
2020-01-17 | btrfs: check rw_devices, not num_devices for balance | Josef Bacik
The fstest btrfs/154 reports

[ 8675.381709] BTRFS: Transaction aborted (error -28)
[ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
[ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
[ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
[ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
[ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
[ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
[ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
[ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
[ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
[ 8675.419801] Call Trace:
[ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
[ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
[ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs]
[ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs]
[ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0
[ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs]
[ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0
[ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400
[ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0
[ 8675.436618] ? _raw_spin_unlock+0x1f/0x30
[ 8675.438093] ? __handle_mm_fault+0x499/0x740
[ 8675.439619] ? do_vfs_ioctl+0x56e/0x770
[ 8675.441034] do_vfs_ioctl+0x56e/0x770
[ 8675.442411] ksys_ioctl+0x3a/0x70
[ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 8675.445333] __x64_sys_ioctl+0x16/0x20
[ 8675.446705] do_syscall_64+0x50/0x210
[ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left

We now use btrfs_can_overcommit() to see if we can flip a block group read only. Before this would fail because we weren't taking into account the usable un-allocated space for allocating chunks. With my patches we were allowed to do the balance, which is technically correct.

The test is trying to start balance on degraded mount. So now we're trying to allocate a chunk and cannot because we want to allocate a RAID1 chunk, but there's only 1 device that's available for usage. This results in an ENOSPC.

But we shouldn't even be making it this far, we don't have enough devices to restripe. The problem is we're using btrfs_num_devices(), that also includes missing devices. That's not actually what we want, we need to use rw_devices.

The chunk_mutex is not needed here, rw_devices changes only in device add, remove or replace, all are excluded by EXCL_OP mechanism.

Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add stacktrace, update changelog, drop chunk_mutex ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-17 | Btrfs: always copy scrub arguments back to user space | Filipe Manana
If scrub returns an error we are not copying back the scrub arguments structure to user space. This prevents user space from knowing how much progress scrub has made if an error happened; this includes -ECANCELED, which is returned when users ask for scrub to stop. A particular use case, which is used in btrfs-progs, is to resume scrub after it is canceled. In that case it relies on checking the progress from the scrub arguments structure and then uses that progress in a call to resume scrub.

So fix this by always copying the scrub arguments structure to user space, overwriting the value returned to user space with -EFAULT only if copying the structure failed. This lets user space know that either the copy did not happen, and therefore the structure is stale, or it happened partially and the structure is probably invalid or corrupt due to the partial copy.

Reported-by: Graham Cobb <g.btrfs@cobb.uk.net>
Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/
Fixes: 06fe39ab15a6a4 ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Graham Cobb <g.btrfs@cobb.uk.net>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
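The return-value rule described in this message can be written down as a tiny C helper. This is a sketch of the rule only, not the btrfs ioctl code: the name `scrub_ioctl_result` and the boolean `copy_failed` (standing in for a failed copy of the arguments structure to user space) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Decide what the ioctl returns once the copy-back has been attempted.
 * scrub_ret is 0 or a negative errno from the scrub itself.
 * Hypothetical helper for illustration only. */
static inline int scrub_ioctl_result(int scrub_ret, bool copy_failed)
{
	assert(scrub_ret <= 0);
	/* The copy is always attempted. Only a failed copy overrides the
	 * scrub result, because the user-space structure is then stale
	 * or partially written. */
	return copy_failed ? -EFAULT : scrub_ret;
}
```

A canceled scrub whose arguments were copied back successfully therefore still reports -ECANCELED, and user space can read the progress fields to resume.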
2020-01-16 | io_uring: only allow submit from owning task | Jens Axboe
If the credentials or the mm doesn't match, don't allow the task to submit anything on behalf of this ring. The task that owns the ring can pass the file descriptor to another task, but we don't want to allow that task to submit an SQE that then assumes the ring mm and creds if it needs to go async. Cc: stable@vger.kernel.org Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-17 | ubifs: Fix ino_t format warnings in orphan_delete() | Geert Uytterhoeven
On alpha and s390x:

    fs/ubifs/debug.h:158:11: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘ino_t {aka unsigned int}’ [-Wformat=]
    ...
    fs/ubifs/orphan.c:132:3: note: in expansion of macro ‘dbg_gen’
       dbg_gen("deleted twice ino %lu", orph->inum);
    ...
    fs/ubifs/orphan.c:140:3: note: in expansion of macro ‘dbg_gen’
       dbg_gen("delete later ino %lu", orph->inum);

__kernel_ino_t is "unsigned long" on most architectures, but not on alpha and s390x, where it is "unsigned int". Hence when printing an ino_t, it should always be cast to "unsigned long" first.

Fix this by re-adding the recently removed casts.

Fixes: 8009ce956c3d2802 ("ubifs: Don't leak orphans on memory during commit")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
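The cast described above works the same way in userspace, where the width of `ino_t` also varies between platforms. A minimal sketch, with a hypothetical helper `format_ino` that uses `snprintf()` in place of the kernel's `dbg_gen()` macro:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Format an inode number portably: ino_t may be narrower or wider than
 * unsigned long, so cast explicitly to match the %lu conversion.
 * Hypothetical helper for illustration only. */
static inline int format_ino(char *buf, size_t len, ino_t ino)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "deleted twice ino %lu",
			(unsigned long)ino);
}
```

Without the cast, a compiler with -Wformat enabled may warn on any platform where `ino_t` is not `unsigned long`.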
2020-01-16 | ubifs: Fix deadlock in concurrent bulk-read and writepage | Zhihao Cheng

In ubifs, concurrent execution of writepage and bulk read on the same file may cause an ABBA deadlock, for example (see Link for a reproducer):

    Process A (bulk-read starts from page4)     Process B (write page4 back)
      vfs_read                                    wb_workfn or fsync
      ...                                         ...
      generic_file_buffered_read                  write_cache_pages
        ubifs_readpage                              LOCK(page4)
          ubifs_bulk_read                           ubifs_writepage
            LOCK(ui->ui_mutex)                        ubifs_write_inode
              ubifs_do_bulk_read                        LOCK(ui->ui_mutex)
                find_or_create_page(alloc page4)              ↑
                  LOCK(page4)               <--   ABBA deadlock occurs!

In order to keep bulk reads serialized, we can't remove the big lock 'ui->ui_mutex' in ubifs_bulk_read(). Instead, we allow ubifs_do_bulk_read() to fail to lock a page by replacing find_or_create_page(FGP_LOCK) with pagecache_get_page(FGP_LOCK | FGP_NOWAIT).

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: zhangyi (F) <yi.zhang@huawei.com>
Cc: <Stable@vger.kernel.org>
Fixes: 4793e7c5e1c ("UBIFS: add bulk-read facility")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206153
Signed-off-by: Richard Weinberger <richard@nod.at>
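The idea behind the deadlock fix, taking the second lock without waiting and backing off when it is busy, can be sketched with C11 atomics. Everything here is a toy model: `toy_lock`, `toy_bulk_read()` and the policy of stopping at the first busy page are assumptions made for illustration, not the ubifs or page cache implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock built on atomic_flag; stands in for both a mutex and a page lock. */
struct toy_lock {
    atomic_flag held;
};

/* Non-blocking acquire: returns true only if the lock was free. */
bool toy_trylock(struct toy_lock *l)
{
    return !atomic_flag_test_and_set(&l->held);
}

/* Blocking acquire (spins); models taking the per-inode mutex. */
void toy_lock_blocking(struct toy_lock *l)
{
    while (atomic_flag_test_and_set(&l->held))
        ; /* spin until the holder releases the lock */
}

void toy_unlock(struct toy_lock *l)
{
    atomic_flag_clear(&l->held);
}

/*
 * Holds the inode lock (A) and only *tries* each page lock (B). If a page is
 * already locked, for example by a writer that will next want lock A, the
 * bulk read stops instead of waiting, so the A->B / B->A cycle cannot form.
 * Returns the number of pages processed.
 */
int toy_bulk_read(struct toy_lock *inode_mutex, struct toy_lock *pages,
                  int npages)
{
    int done = 0;

    assert(inode_mutex != NULL && pages != NULL);
    toy_lock_blocking(inode_mutex);
    for (int i = 0; i < npages; i++) {
        if (!toy_trylock(&pages[i]))
            break;              /* busy page: give up, do not wait */
        done++;                 /* "read" the page while it is locked */
        toy_unlock(&pages[i]);
    }
    toy_unlock(inode_mutex);
    return done;
}
```

The trade-off is that a bulk read may cover fewer pages than requested; the skipped pages would then have to be read through the ordinary single-page path.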
2020-01-16 | ubifs: Fix wrong memory allocation | Sascha Hauer

In create_default_filesystem(), when we allocate the idx node we must use the idx_node_size we calculated just one line before, not tmp, which contains completely different data.

Fixes: c4de6d7e4319 ("ubifs: Refactor create_default_filesystem()")
Cc: stable@vger.kernel.org # v4.20+
Reported-by: Naga Sureshkumar Relli <nagasure@xilinx.com>
Tested-by: Naga Sureshkumar Relli <nagasure@xilinx.com>
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-01-16 | ubifs: Add support for FS_ENCRYPT_FL | Eric Biggers

Make the FS_IOC_GETFLAGS ioctl on ubifs return the FS_ENCRYPT_FL flag on encrypted files, like ext4 and f2fs do.

Also make this flag be ignored by FS_IOC_SETFLAGS, like ext4 and f2fs do, since it's a recognized flag but is not directly settable.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-01-16 | ubifs: Fix FS_IOC_SETFLAGS unexpectedly clearing encrypt flag | Eric Biggers

UBIFS's implementation of FS_IOC_SETFLAGS fails to preserve existing inode flags that aren't settable by FS_IOC_SETFLAGS, namely the encrypt flag. This causes the encrypt flag to be unexpectedly cleared.

Fix it by preserving existing unsettable flags, like ext4 and f2fs do.

Test case with kvm-xfstests shell:

    FSTYP=ubifs KEYCTL_PROG=keyctl
    . fs/ubifs/config
    . ~/xfstests/common/encrypt
    dev=$(__blkdev_to_ubi_volume /dev/vdc)
    ubiupdatevol -t $dev
    mount $dev /mnt -t ubifs
    k=$(_generate_session_encryption_key)
    mkdir /mnt/edir
    xfs_io -c "set_encpolicy $k" /mnt/edir
    echo contents > /mnt/edir/file
    chattr +i /mnt/edir/file
    chattr -i /mnt/edir/file

With the bug, the following errors occur on the last command:

    [ 18.081559] fscrypt (ubifs, inode 67): Inconsistent encryption context (parent directory: 65)
    chattr: Operation not permitted while reading flags on /mnt/edir/file

Fixes: d475a507457b ("ubifs: Add skeleton for fscrypto")
Cc: <stable@vger.kernel.org> # v4.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
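Preserving unsettable flags comes down to a pair of bit masks. The sketch below is an assumption-laden illustration: the `TOY_FL_*` values and `toy_setflags()` are hypothetical and do not correspond to the real FS_*_FL constants or to the ubifs ioctl code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits, chosen only for this example. */
#define TOY_FL_IMMUTABLE  (UINT32_C(1) << 0)
#define TOY_FL_APPEND     (UINT32_C(1) << 1)
#define TOY_FL_ENCRYPT    (UINT32_C(1) << 2)   /* recognized, not settable */

/* Flags that user space is allowed to change through a SETFLAGS-style call. */
#define TOY_SETTABLE_MASK (TOY_FL_IMMUTABLE | TOY_FL_APPEND)

/*
 * Compute the new inode flags: settable bits come from the request,
 * everything else is carried over unchanged from the current flags.
 */
uint32_t toy_setflags(uint32_t current, uint32_t requested)
{
    /* The encrypt bit must never be part of the settable set. */
    assert((TOY_SETTABLE_MASK & TOY_FL_ENCRYPT) == 0);

    uint32_t kept = current & ~TOY_SETTABLE_MASK;   /* unsettable bits survive */
    uint32_t set  = requested & TOY_SETTABLE_MASK;  /* unsettable bits in the request are ignored */

    return kept | set;
}
```

In the `chattr +i` / `chattr -i` sequence from the test case, this keeps the encrypt bit set across both calls, where overwriting the flags wholesale with the requested value would drop it.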
2020-01-16 | xfs: check log iovec size to make sure it's plausibly a buffer log format | Darrick J. Wong

When log recovery is processing buffer log items, we should check that the incoming iovec actually describes a region of memory large enough to contain the log format and the dirty map.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
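A size check of this kind might look like the following sketch. The structure mirrors the field layout quoted elsewhere in this log, but `toy_buf_log_format` and `toy_check_iovec()` are hypothetical names and the logic is an assumption about what "large enough to contain the log format and the dirty map" means, not the XFS recovery code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a buffer log format header plus dirty map. */
struct toy_buf_log_format {
    uint16_t type, size, flags, len;
    int64_t  blkno;
    uint32_t map_size;       /* number of valid words in data_map */
    uint32_t data_map[17];
};

/*
 * Returns true if a region of iov_len bytes starting at base is large enough
 * to hold the fixed header and the map_size words that the header claims.
 */
bool toy_check_iovec(const void *base, size_t iov_len)
{
    const size_t hdr = offsetof(struct toy_buf_log_format, data_map);
    uint32_t map_size;

    assert(base != NULL);

    /* The fixed part must be present before map_size can be trusted. */
    if (iov_len < hdr)
        return false;

    /* memcpy avoids assuming that base is suitably aligned. */
    memcpy(&map_size,
           (const char *)base + offsetof(struct toy_buf_log_format, map_size),
           sizeof(map_size));

    /* Division instead of multiplication so a huge map_size cannot overflow. */
    return (iov_len - hdr) / sizeof(uint32_t) >= map_size;
}
```

Checking the header length first matters: reading `map_size` from a region shorter than the header would itself be an out-of-bounds access.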
2020-01-16 | xfs: make struct xfs_buf_log_format have a consistent size | Darrick J. Wong

Increase XFS_BLF_DATAMAP_SIZE by 1 to fill in the implied padding at the end of struct xfs_buf_log_format. This makes the size consistent so that we can check it in xfs_ondisk.h, and will be needed once we start logging attribute values.

On amd64 we get the following pahole:

    struct xfs_buf_log_format {
            short unsigned int         blf_type;             /*     0     2 */
            short unsigned int         blf_size;             /*     2     2 */
            short unsigned int         blf_flags;            /*     4     2 */
            short unsigned int         blf_len;              /*     6     2 */
            long long int              blf_blkno;            /*     8     8 */
            unsigned int               blf_map_size;         /*    16     4 */
            unsigned int               blf_data_map[16];     /*    20    64 */
            /* --- cacheline 1 boundary (64 bytes) was 20 bytes ago --- */

            /* size: 88, cachelines: 2, members: 7 */
            /* padding: 4 */
            /* last cacheline: 24 bytes */
    };

But on i386 we get the following:

    struct xfs_buf_log_format {
            short unsigned int         blf_type;             /*     0     2 */
            short unsigned int         blf_size;             /*     2     2 */
            short unsigned int         blf_flags;            /*     4     2 */
            short unsigned int         blf_len;              /*     6     2 */
            long long int              blf_blkno;            /*     8     8 */
            unsigned int               blf_map_size;         /*    16     4 */
            unsigned int               blf_data_map[16];     /*    20    64 */
            /* --- cacheline 1 boundary (64 bytes) was 20 bytes ago --- */

            /* size: 84, cachelines: 2, members: 7 */
            /* last cacheline: 20 bytes */
    };

Notice how the amd64 compiler inserts 4 bytes of padding to the end of the structure to ensure 8-byte alignment. Prior to "xfs: fix memory corruption during remote attr value buffer invalidation" we would try to write to blf_data_map[17], which is harmless on amd64 but really bad on i386.

This shouldn't cause any changes in the ondisk logging formats because the log code writes out the log vectors with the appropriate size for the log item's map_size, and log recovery treats the data_map array as a VLA.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
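The size difference shown by pahole can be reproduced with two toy structures that use fixed-width types. The names below (`toy_blf16`, `toy_blf17`, `toy_tail_padding16()`) are hypothetical; the layout follows the field order quoted above, and the amount of tail padding depends on the ABI's alignment for 64-bit integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as the quoted layout, with a 16-word map. */
struct toy_blf16 {
    uint16_t type, size, flags, len;
    int64_t  blkno;
    uint32_t map_size;
    uint32_t data_map[16];
};

/* Same layout with one extra map word, which occupies the would-be padding. */
struct toy_blf17 {
    uint16_t type, size, flags, len;
    int64_t  blkno;
    uint32_t map_size;
    uint32_t data_map[17];
};

/* The map starts at byte 20 regardless of ABI. */
static_assert(offsetof(struct toy_blf16, data_map) == 20,
              "data_map is expected at offset 20");

/* 20 + 17 * 4 = 88, a multiple of both 4 and 8, so no tail padding is needed. */
static_assert(sizeof(struct toy_blf17) == 88,
              "17-word variant is expected to be 88 bytes");

/*
 * Tail padding of the 16-word variant: 4 bytes where int64_t is 8-byte
 * aligned (for example x86_64), 0 bytes where it is 4-byte aligned
 * (for example i386).
 */
size_t toy_tail_padding16(void)
{
    return sizeof(struct toy_blf16) - (20 + 16 * sizeof(uint32_t));
}
```

Making the array 17 words long turns the ABI-dependent padding into a real member, so the structure size no longer depends on the architecture.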
2020-01-16 | xfs: complain if anyone tries to create a too-large buffer log item | Darrick J. Wong

Complain if someone calls xfs_buf_item_init on a buffer that is larger than the dirty bitmap can handle, or tries to log a region that's past the end of the dirty bitmap.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-16 | xfs: clean up xfs_buf_item_get_format return value | Darrick J. Wong

The only thing that can cause a nonzero return from xfs_buf_item_get_format is a failing kmem_alloc, which can't happen. Get rid of all the unnecessary error handling.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-16 | xfs: streamline xfs_attr3_leaf_inactive | Darrick J. Wong

Now that we know we don't have to take a transaction to stale the incore buffers for a remote value, get rid of the unnecessary memory allocation in the leaf walker and call the rmt_stale function directly. Flatten the loop while we're at it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-16 | xfs: fix memory corruption during remote attr value buffer invalidation | Darrick J. Wong

While running generic/103, I observed what looks like memory corruption and (with slub debugging turned on) a slub redzone warning on i386 when inactivating an inode with a 64k remote attr value.

On a v5 filesystem, maximally sized remote attr values require one block more than 64k worth of space to hold both the remote attribute value header (64 bytes) and the value itself. On a 4k block filesystem this results in a 68k buffer; on a 64k block filesystem, this would be a 128k buffer. Note that even though we'll never use more than 65,600 bytes of this buffer, XFS_MAX_BLOCKSIZE is 64k.

This is a problem because the definition of struct xfs_buf_log_format allows for XFS_MAX_BLOCKSIZE worth of dirty bitmap (64k). On i386 when we invalidate a remote attribute, xfs_trans_binval zeroes all 68k worth of the dirty map, writing right off the end of the log item and corrupting memory. We've gotten away with this on x86_64 for years because the compiler inserts a u32 padding on the end of struct xfs_buf_log_format.

Fortunately for us, remote attribute values are written to disk with xfs_bwrite(), which is to say that they are not logged. Fix the problem by removing all places where we could end up creating a buffer log item for a remote attribute value and leave a note explaining why. Next, replace the open-coded buffer invalidation with a call to the helper we created in the previous patch that does better checking for bad metadata before marking the buffer stale.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
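The overrun is easier to see with the arithmetic written out. The sketch below assumes a dirty bitmap in which each bit covers 128 bytes and each map word holds 32 bits; those two constants are recalled from the XFS log format headers rather than stated in this log, so treat them, and the `TOY_*` names, as assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLF_CHUNK      128u             /* bytes covered by one dirty bit (assumed) */
#define TOY_NBWORD         32u              /* bits per dirty map word (assumed) */
#define TOY_MAX_BLOCKSIZE  (64u * 1024u)    /* 64k, as stated in the commit message */

/* Map words that the structure provides room for: 64k / 128 / 32. */
#define TOY_DATAMAP_WORDS  ((TOY_MAX_BLOCKSIZE / TOY_BLF_CHUNK) / TOY_NBWORD)

static_assert(TOY_DATAMAP_WORDS == 16, "a 64k buffer needs 16 map words");

/* Number of 32-bit map words needed to cover a buffer of buf_bytes bytes. */
size_t toy_map_words(size_t buf_bytes)
{
    size_t bits = (buf_bytes + TOY_BLF_CHUNK - 1) / TOY_BLF_CHUNK;

    return (bits + TOY_NBWORD - 1) / TOY_NBWORD;
}
```

Under these assumptions a 68k buffer needs 17 words, one more than the 16 the array holds. That extra word lands in the 4 bytes of tail padding on x86_64 and past the end of the structure on i386, which matches the behavior the commit message describes.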
2020-01-16 | xfs: refactor remote attr value buffer invalidation | Darrick J. Wong

Hoist the code that invalidates remote extended attribute value buffers into a separate helper function. This prepares us for a memory corruption fix in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-16 | reiserfs: fix handling of -EOPNOTSUPP in reiserfs_for_each_xattr | Jeff Mahoney

Commit 60e4cf67a58 (reiserfs: fix extended attributes on the root directory) introduced a regression: open_xa_root started returning -EOPNOTSUPP, but it was not handled properly in reiserfs_for_each_xattr.

When the reiserfs module is built without CONFIG_REISERFS_FS_XATTR, deleting an inode would result in a warning, and chowning an inode would also result in a warning and then fail to complete.

With CONFIG_REISERFS_FS_XATTR enabled, the xattr root would always be present for read-write operations.

This commit handles -EOPNOTSUPP in the same way -ENODATA is handled.

Fixes: 60e4cf67a582 ("reiserfs: fix extended attributes on the root directory")
CC: stable@vger.kernel.org # Commit 60e4cf67a58 was picked up by stable
Link: https://lore.kernel.org/r/20200115180059.6935-1-jeffm@suse.com
Reported-by: Michael Brunnbauer <brunni@netestate.de>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-01-16 | fuse: fix fuse_send_readpages() in the syncronous read case | Miklos Szeredi

Buffered read in fuse normally goes via:

    -> generic_file_buffered_read()
      -> fuse_readpages()
        -> fuse_send_readpages()
          -> fuse_simple_request() [called since v5.4]

In the case of a read request, fuse_simple_request() will return a non-negative bytecount on success or a negative error value. A positive bytecount was taken to be an error and the PG_error flag set on the page. This resulted in generic_file_buffered_read() falling back to ->readpage(), which would repeat the read request and succeed. Because the repeated read succeeded, the bug was not detected with regression tests or other use cases.

The FTP module in GVFS, however, fails the second read due to the non-seekable nature of FTP downloads.

Fix by checking and ignoring positive return value from fuse_simple_request().

Reported-by: Ondrej Holy <oholy@redhat.com>
Link: https://gitlab.gnome.org/GNOME/gvfs/issues/441
Fixes: 134831e36bbd ("fuse: convert readpages to simple api")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
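The return-value convention at the heart of the fuse bug, negative for an error and non-negative for a byte count, can be sketched as follows. `toy_read_result` and `toy_finish_read()` are hypothetical names; the real code sets a page flag rather than filling in a structure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical summary of how a completed read request should be treated. */
struct toy_read_result {
    bool   page_error;   /* models setting an error flag on the page */
    size_t bytes;        /* bytes read when the request succeeded */
};

/*
 * Interpret the return value of a request helper that yields a negative
 * errno on failure and a non-negative byte count on success. The buggy
 * pattern is "if (ret) error", which also flags every successful read
 * that returned more than zero bytes.
 */
struct toy_read_result toy_finish_read(ssize_t ret)
{
    struct toy_read_result r = { false, 0 };

    if (ret < 0)
        r.page_error = true;      /* only negative values are errors */
    else
        r.bytes = (size_t)ret;    /* zero and positive values are success */

    return r;
}
```

Because the fallback path silently retried and usually succeeded, a check like `ret != 0` could go unnoticed until a backend that cannot repeat a read exposed it.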