summaryrefslogtreecommitdiff
path: root/fs/btrfs/file.c
AgeCommit message (Collapse)Author
2020-01-20btrfs: drop create parameter to btrfs_get_extent()Omar Sandoval
We only pass this as 1 from __extent_writepage_io(). The parameter basically means "pretend I didn't pass in a page". This is silly since we can simply not pass in the page. Get rid of the parameter from btrfs_get_extent(), and since it's used as a get_extent_t callback, remove it from get_extent_t and btree_get_extent(), neither of which need it. While we're here, let's document btrfs_get_extent(). Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: make btrfs_ordered_extent naming consistent with btrfs_file_extent_itemOmar Sandoval
ordered->start, ordered->len, and ordered->disk_len correspond to fi->disk_bytenr, fi->num_bytes, and fi->disk_num_bytes, respectively. It's confusing to translate between the two naming schemes. Since a btrfs_ordered_extent is basically a pending btrfs_file_extent_item, let's make the former use the naming from the latter. Note that I didn't touch the names in tracepoints just in case there are scripts depending on the current naming. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13Btrfs: fix cloning range with a hole when using the NO_HOLES featureFilipe Manana
When using the NO_HOLES feature if we clone a range that contains a hole and a temporary ENOSPC happens while dropping extents from the target inode's range, we can end up failing and aborting the transaction with -EEXIST or with a corrupt file extent item, that has a length greater than it should and overlaps with other extents. For example when cloning the following range from inode A to inode B: Inode A: extent A1 extent A2 [ ----------- ] [ hole, implicit, 4MB length ] [ ------------- ] 0 1MB 5MB 6MB Range to clone: [1MB, 6MB) Inode B: extent B1 extent B2 extent B3 extent B4 [ ---------- ] [ --------- ] [ ---------- ] [ ---------- ] 0 1MB 1MB 2MB 2MB 5MB 5MB 6MB Target range: [1MB, 6MB) (same as source, to make it easier to explain) The following can happen: 1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents(); 2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents() set 'drop_end' to 2MB, meaning it was able to drop only extent B2; 3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB = 1MB; 4) We then attempt to insert a file extent item at inode B with a file offset of 5MB, which is the value of clone_info->file_offset. This fails with error -EEXIST because there's already an extent at that offset (extent B4); 5) We abort the current transaction with -EEXIST and return that error to user space as well. Another example, for extent corruption: Inode A: extent A1 extent A2 [ ----------- ] [ hole, implicit, 10MB length ] [ ------------- ] 0 1MB 11MB 12MB Inode B: extent B1 extent B2 [ ----------- ] [ --------- ] [ ----------------------------- ] 0 1MB 1MB 5MB 5MB 12MB Target range: [1MB, 12MB) (same as source, to make it easier to explain) 1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents(); 2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents() set 'drop_end' to 5MB, meaning it was able to drop only extent B2; 3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB = 4MB; 4) We then insert a file extent item at inode B with a file offset of 11MB which is the value of clone_info->file_offset, and a length of 4MB (the value of 'clone_len'). So we get 2 extents items with ranges that overlap and an extent length of 4MB, larger then the extent A2 from inode A (1MB length); 5) After that we end the transaction, balance the btree dirty pages and then start another or join the previous transaction. It might happen that the transaction which inserted the incorrect extent was committed by another task so we end up with extent corruption if a power failure happens. So fix this by making sure we attempt to insert the extent to clone at the destination inode only if we are past dropping the sub-range that corresponds to a hole. Fixes: 690a5dbfc51315 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: remove extent_map::bdevDavid Sterba
We can now remove the bdev from extent_map. Previous patches made sure that bio_set_dev is correctly in all places and that we don't need to grab it from latest_bdev or pass it around inside the extent map. Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Return offset from find_desired_extentNikolay Borisov
Instead of using an input pointer parameter as the return value and have an int as the return type of find_desired_extent, rework the function to directly return the found offset. Doing that the 'ret' variable in btrfs_llseek_file can be removed. Additional (subjective) benefit is that btrfs' llseek function now resemebles those of the other major filesystems. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Simplify btrfs_file_llseekNikolay Borisov
Handle SEEK_END/SEEK_CUR in a single 'default' case by directly returning from generic_file_llseek. This makes the 'out' label redundant. Finally return directly the vale from vfs_setpos. No semantic changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Speed up btrfs_file_llseekNikolay Borisov
Modifying the file position is done on a per-file basis. This renders holding the inode lock for writing useless and makes the performance of concurrent llseek's abysmal. Fix this by holding the inode for read. This provides protection against concurrent truncates and find_desired_extent already includes proper extent locking for the range which ensures proper locking against concurrent writes. SEEK_CUR and SEEK_END can be done lockessly. The former is synchronized by file::f_lock spinlock. SEEK_END is not synchronized but atomic, but that's OK since there is not guarantee that SEEK_END will always be at the end of the file in the face of tail modifications. This change brings ~82% performance improvement when doing a lot of parallel fseeks. The workload essentially does: for (d=0; d<num_seek_read; d++) { /* offset %= 16777216; */ fseek (f, 256 * d % 16777216, SEEK_SET); fread (buffer, 64, 1, f); } Without patch: num workprocesses = 16 num fseek/fread = 8000000 step = 256 fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 real 0m41.412s user 0m28.777s sys 2m16.510s With patch: num workprocesses = 16 num fseek/fread = 8000000 step = 256 fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 real 0m11.479s user 0m27.629s sys 0m21.040s Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18Btrfs: fix negative subv_writers counter and data space leak after buffered ↵Filipe Manana
write When doing a buffered write it's possible to leave the subv_writers counter of the root, used for synchronization between buffered nocow writers and snapshotting. This happens in an exceptional case like the following: 1) We fail to allocate data space for the write, since there's not enough available data space nor enough unallocated space for allocating a new data block group; 2) Because of that failure, we try to go to NOCOW mode, which succeeds and therefore we set the local variable 'only_release_metadata' to true and set the root's sub_writers counter to 1 through the call to btrfs_start_write_no_snapshotting() made by check_can_nocow(); 3) The call to btrfs_copy_from_user() returns zero, which is very unlikely to happen but not impossible; 4) No pages are copied because btrfs_copy_from_user() returned zero; 5) We call btrfs_end_write_no_snapshotting() which decrements the root's subv_writers counter to 0; 6) We don't set 'only_release_metadata' back to 'false' because we do it only if 'copied', the value returned by btrfs_copy_from_user(), is greater than zero; 7) On the next iteration of the while loop, which processes the same page range, we are now able to allocate data space for the write (we got enough data space released in the meanwhile); 8) After this if we fail at btrfs_delalloc_reserve_metadata(), because now there isn't enough free metadata space, or in some other place further below (prepare_pages(), lock_and_cleanup_extent_if_need(), btrfs_dirty_pages()), we break out of the while loop with 'only_release_metadata' having a value of 'true'; 9) Because 'only_release_metadata' is 'true' we end up decrementing the root's subv_writers counter to -1 (through a call to btrfs_end_write_no_snapshotting()), and we also end up not releasing the data space previously reserved through btrfs_check_data_free_space(). As a consequence the mechanism for synchronizing NOCOW buffered writes with snapshotting gets broken. Fix this by always setting 'only_release_metadata' to false at the start of each iteration. Fixes: 8257b2dc3c1a ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume") Fixes: 7ee9e4405f26 ("Btrfs: check if we can nocow if we don't have data space") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: drop unused parameter is_new from btrfs_igetDavid Sterba
The parameter is now always set to NULL and could be dropped. The last user was get_default_root but that got reworked in 05dbe6837b60 ("Btrfs: unify subvol= and subvolid= mounting") and the parameter became unused. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: simplify inode locking for RWF_NOWAITGoldwyn Rodrigues
This is similar to 942491c9e6d6 ("xfs: fix AIM7 regression"). Apparently our current rwsem code doesn't like doing the trylock, then lock for real scheme. This causes extra contention on the lock and can be measured eg. by AIM7 benchmark. So change our read/write methods to just do the trylock for the RWF_NOWAIT case. Fixes: edf064e7c6fe ("btrfs: nowait aio support") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-17Btrfs: check for the full sync flag while holding the inode lock during fsyncFilipe Manana
We were checking for the full fsync flag in the inode before locking the inode, which is racy, since at that that time it might not be set but after we acquire the inode lock some other task set it. One case where this can happen is on a system low on memory and some concurrent task failed to allocate an extent map and therefore set the full sync flag on the inode, to force the next fsync to work in full mode. A consequence of missing the full fsync flag set is hitting the problems fixed by commit 0c713cbab620 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges"), BUG_ON() when dropping extents from a log tree, hitting assertion failures at tree-log.c:copy_items() or all sorts of weird inconsistencies after replaying a log due to file extents items representing ranges that overlap. So just move the check such that it's done after locking the inode and before starting writeback again. Fixes: 0c713cbab620 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges") CC: stable@vger.kernel.org # 5.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-15btrfs: qgroup: Always free PREALLOC META reserve in ↵Qu Wenruo
btrfs_delalloc_release_extents() [Background] Btrfs qgroup uses two types of reserved space for METADATA space, PERTRANS and PREALLOC. PERTRANS is metadata space reserved for each transaction started by btrfs_start_transaction(). While PREALLOC is for delalloc, where we reserve space before joining a transaction, and finally it will be converted to PERTRANS after the writeback is done. [Inconsistency] However there is inconsistency in how we handle PREALLOC metadata space. The most obvious one is: In btrfs_buffered_write(): btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true); We always free qgroup PREALLOC meta space. While in btrfs_truncate_block(): btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0)); We only free qgroup PREALLOC meta space when something went wrong. [The Correct Behavior] The correct behavior should be the one in btrfs_buffered_write(), we should always free PREALLOC metadata space. The reason is, the btrfs_delalloc_* mechanism works by: - Reserve metadata first, even it's not necessary In btrfs_delalloc_reserve_metadata() - Free the unused metadata space Normally in: btrfs_delalloc_release_extents() |- btrfs_inode_rsv_release() Here we do calculation on whether we should release or not. E.g. for 64K buffered write, the metadata rsv works like: /* The first page */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=0 total: num_bytes=calc_inode_reservations() /* The first page caused one outstanding extent, thus needs metadata rsv */ /* The 2nd page */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed /* The 2nd page doesn't cause new outstanding extent, needs no new meta rsv, so we free what we have reserved */ /* The 3rd~16th pages */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed (still space for one outstanding extent) This means, if btrfs_delalloc_release_extents() determines to free some space, then those space should be freed NOW. So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other than btrfs_qgroup_convert_reserved_meta(). The good news is: - The callers are not that hot The hottest caller is in btrfs_buffered_write(), which is already fixed by commit 336a8bb8e36a ("btrfs: Fix wrong btrfs_delalloc_release_extents parameter"). Thus it's not that easy to cause false EDQUOT. - The trans commit in advance for qgroup would hide the bug Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction commit more aggressive"), when btrfs qgroup metadata free space is slow, it will try to commit transaction and free the wrongly converted PERTRANS space, so it's not that easy to hit such bug. [FIX] So to fix the problem, remove the @qgroup_free parameter for btrfs_delalloc_release_extents(), and always pass true to btrfs_inode_rsv_release(). Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-01Btrfs: fix memory leak due to concurrent append writes with fiemapFilipe Manana
When we have a buffered write that starts at an offset greater than or equals to the file's size happening concurrently with a full ranged fiemap, we can end up leaking an extent state structure. Suppose we have a file with a size of 1Mb, and before the buffered write and fiemap are performed, it has a single extent state in its io tree representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set. The following sequence diagram shows how the memory leak happens if a fiemap a buffered write, starting at offset 1Mb and with a length of 4Kb, are performed concurrently. CPU 1 CPU 2 extent_fiemap() --> it's a full ranged fiemap range from 0 to LLONG_MAX - 1 (9223372036854775807) --> locks range in the inode's io tree --> after this we have 2 extent states in the io tree: --> 1 for range [0, 1Mb[ with the bits EXTENT_LOCKED and EXTENT_DELALLOC_BITS set --> 1 for the range [1Mb, LLONG_MAX[ with the EXTENT_LOCKED bit set --> start buffered write at offset 1Mb with a length of 4Kb btrfs_file_write_iter() btrfs_buffered_write() --> cached_state is NULL lock_and_cleanup_extent_if_need() --> returns 0 and does not lock range because it starts at current i_size / eof --> cached_state remains NULL btrfs_dirty_pages() btrfs_set_extent_delalloc() (...) __set_extent_bit() --> splits extent state for range [1Mb, LLONG_MAX[ and now we have 2 extent states: --> one for the range [1Mb, 1Mb + 4Kb[ with EXTENT_LOCKED set --> another one for the range [1Mb + 4Kb, LLONG_MAX[ with EXTENT_LOCKED set as well --> sets EXTENT_DELALLOC on the extent state for the range [1Mb, 1Mb + 4Kb[ --> caches extent state [1Mb, 1Mb + 4Kb[ into @cached_state because it has the bit EXTENT_LOCKED set --> btrfs_buffered_write() ends up with a non-NULL cached_state and never calls anything to release its reference on it, resulting in a memory leak Fix this by calling free_extent_state() on cached_state if the range was not locked by lock_and_cleanup_extent_if_need(). The same issue can happen if anything else other than fiemap locks a range that covers eof and beyond. This could be triggered, sporadically, by test case generic/561 from the fstests suite, which makes duperemove run concurrently with fsstress, and duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set the leak is reported in dmesg/syslog when removing the btrfs module with a message like the following: [77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1 Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak. CC: stable@vger.kernel.org # 4.16+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: stop clearing EXTENT_DIRTY in inode I/O treeOmar Sandoval
Since commit fee187d9d9dd ("Btrfs: do not set EXTENT_DIRTY along with EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we can simplify and stop trying to clear it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: treat RWF_{,D}SYNC writes as sync for CRCsOmar Sandoval
The VFS indicates a synchronous write to ->write_iter() via iocb->ki_flags. The IOCB_{,D}SYNC flags may be set based on the file (see iocb_flags()) or the RWF_* flags passed to a syscall like pwritev2() (see kiocb_set_rw_flags()). However, in btrfs_file_write_iter(), we're checking if a write is synchronous based only on the file; we use this to decide when to bump the sync_writers counter and thus do CRCs synchronously. Make sure we do this for all synchronous writes as determined by the VFS. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add const ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: use correct count in btrfs_file_write_iter()Omar Sandoval
generic_write_checks() may modify iov_iter_count(), so we must get the count after the call, not before. Using the wrong one has a couple of consequences: 1. We check a longer range in check_can_nocow() for nowait than we're actually writing. 2. We create extra hole extent maps in btrfs_cont_expand(). As far as I can tell, this is harmless, but I might be missing something. These issues are pretty minor, but let's fix it before something more important trips on it. Fixes: edf064e7c6fe ("btrfs: nowait aio support") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: rename the btrfs_calc_*_metadata_size helpersJosef Bacik
btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that it doesn't take into account any splitting at the levels, because truncate will never split nodes. However truncate _and_ changing will never split nodes, so rename btrfs_calc_trunc_metadata_size to btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely for inserting items, so rename this to btrfs_calc_insert_metadata_size. Making these clearer will help when I start using them differently in upcoming patches. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: Remove leftover of in-band dedupeNikolay Borisov
It's unlikely in-band dedupe is going to land so just remove any leftovers - dedupe.h header as well as the 'dedupe' parameter to btrfs_set_extent_delalloc. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extentsFilipe Manana
When cloning extents (or deduplicating) we create a transaction with a space reservation that considers we will drop or update a single file extent item of the destination inode (that we modify a single leaf). That is fine for the vast majority of scenarios, however it might happen that we need to drop many file extent items, and adjust at most two file extent items, in the destination root, which can span multiple leafs. This will lead to either the call to btrfs_drop_extents() to fail with ENOSPC or the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode() (called through clone_finish_inode_update()) to fail with ENOSPC. Such failure results in a transaction abort, leaving the filesystem in a read-only mode. In order to fix this we need to follow the same approach as the hole punching code, where we create a local reservation with 1 unit and keep ending and starting transactions, after balancing the btree inode, when __btrfs_drop_extents() returns ENOSPC. So fix this by making the extent cloning call calls the recently added btrfs_punch_hole_range() helper, which is what does the mentioned work for hole punching, and make sure whenever we drop extent items in a transaction, we also add a replacing file extent item, to avoid corruption (a hole) if after ending a transaction and before starting a new one, the old transaction gets committed and a power failure happens before we finish cloning. A test case for fstests follows soon. Reported-by: David Goodwin <david@codepoets.co.uk> Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/ Reported-by: Sam Tygier <sam@tygier.co.uk> Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/ Fixes: b6f3409b2197e8f ("Btrfs: reserve sufficient space for ioctl clone") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09Btrfs: factor out extent dropping code from hole punch handlerFilipe Manana
Move the code that is responsible for dropping extents in a range out of btrfs_punch_hole() into a new helper function, btrfs_punch_hole_range(), so that later it can be used by the reflinking (extent cloning and dedup) code to fix a ENOSPC bug. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04btrfs: migrate the delalloc space stuff to it's own homeJosef Bacik
We have code for data and metadata reservations for delalloc. There's quite a bit of code here, and it's used in a lot of places so I've separated it out to it's own file. inode.c and file.c are already pretty large, and this code is complicated enough to live in its own space. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: drop default value assignments in enumsDavid Sterba
A few more instances whre we don't need to specify the values as long as they are the same that enum assigns automatically. All of the enums are in-memory only and nothing relies on the exact values. Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02Btrfs: add missing inode version, ctime and mtime updates when punching holeFilipe Manana
If the range for which we are punching a hole covers only part of a page, we end up updating the inode item but we skip the update of the inode's iversion, mtime and ctime. Fix that by ensuring we update those properties of the inode. A patch for fstests test case generic/059 that tests this as been sent along with this fix. Fixes: 2aaa66558172b0 ("Btrfs: add hole punching") Fixes: e8c1c76e804b18 ("Btrfs: add missing inode update when punching hole") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: Return EAGAIN if we can't start no snpashot write in check_can_nocowNikolay Borisov
The first thing code does in check_can_nocow is trying to block concurrent snapshots. If this fails (due to snpashot already being in progress) the function returns ENOSPC which makes no sense. Instead return EAGAIN. Despite this return value not being propagated to callers it's good practice to return the closest in terms of semantics error code. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: Use newly introduced btrfs_lock_and_flush_ordered_rangeNikolay Borisov
There several functions which open code btrfs_lock_and_flush_ordered_range, just replace them with a call to the function. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16Btrfs: fix race between ranged fsync and writeback of adjacent rangesFilipe Manana
When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode) that happens to be ranged, which happens during a msync() or writes for files opened with O_SYNC for example, we can end up with a corrupt log, due to different file extent items representing ranges that overlap with each other, or hit some assertion failures. When doing a ranged fsync we only flush delalloc and wait for ordered exents within that range. If while we are logging items from our inode ordered extents for adjacent ranges complete, we end up in a race that can make us insert the file extent items that overlap with others we logged previously and the assertion failures. For example, if tree-log.c:copy_items() receives a leaf that has the following file extents items, all with a length of 4K and therefore there is an implicit hole in the range 68K to 72K - 1: (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ... It copies them to the log tree. However due to the need to detect implicit holes, it may release the path, in order to look at the previous leaf to detect an implicit hole, and then later it will search again in the tree for the first file extent item key, with the goal of locking again the leaf (which might have changed due to concurrent changes to other inodes). However when it locks again the leaf containing the first key, the key corresponding to the extent at offset 72K may not be there anymore since there is an ordered extent for that range that is finishing (that is, somewhere in the middle of btrfs_finish_ordered_io()), and it just removed the file extent item but has not yet replaced it with a new file extent item, so the part of copy_items() that does hole detection will decide that there is a hole in the range starting from 68K to 76K - 1, and therefore insert a file extent item to represent that hole, having a key offset of 68K. After that we now have a log tree with 2 different extent items that have overlapping ranges: 1) The file extent item copied before copy_items() released the path, which has a key offset of 72K and a length of 4K, representing the file range 72K to 76K - 1. 2) And a file extent item representing a hole that has a key offset of 68K and a length of 8K, representing the range 68K to 76K - 1. This item was inserted after releasing the path, and overlaps with the extent item inserted before. The overlapping extent items can cause all sorts of unpredictable and incorrect behaviour, either when replayed or if a fast (non full) fsync happens later, which can trigger a BUG_ON() when calling btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a trace like the following: [61666.783269] ------------[ cut here ]------------ [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182! [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP (...) [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000 [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs] [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246 [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000 [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937 [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000 [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418 [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000 [61666.786253] FS: 00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000 [61666.786253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0 [61666.786253] Call Trace: [61666.786253] __btrfs_drop_extents+0x5e3/0xaad [btrfs] [61666.786253] ? time_hardirqs_on+0x9/0x14 [61666.786253] btrfs_log_changed_extents+0x294/0x4e0 [btrfs] [61666.786253] ? release_extent_buffer+0x38/0xb4 [btrfs] [61666.786253] btrfs_log_inode+0xb6e/0xcdc [btrfs] [61666.786253] ? lock_acquire+0x131/0x1c5 [61666.786253] ? btrfs_log_inode_parent+0xee/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs] [61666.786253] btrfs_log_inode_parent+0x223/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? lockref_get_not_zero+0x2c/0x34 [61666.786253] ? rcu_read_unlock+0x3e/0x5d [61666.786253] btrfs_log_dentry_safe+0x60/0x7b [btrfs] [61666.786253] btrfs_sync_file+0x317/0x42c [btrfs] [61666.786253] vfs_fsync_range+0x8c/0x9e [61666.786253] SyS_msync+0x13c/0x1c9 [61666.786253] entry_SYSCALL_64_fastpath+0x18/0xad A sample of a corrupt log tree leaf with overlapping extents I got from running btrfs/072: item 14 key (295 108 200704) itemoff 2599 itemsize 53 extent data disk bytenr 0 nr 0 extent data offset 0 nr 458752 ram 458752 item 15 key (295 108 659456) itemoff 2546 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 606208 nr 163840 ram 770048 item 16 key (295 108 663552) itemoff 2493 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 610304 nr 155648 ram 770048 item 17 key (295 108 819200) itemoff 2440 itemsize 53 extent data disk bytenr 4334788608 nr 4096 extent data offset 0 nr 4096 ram 4096 The file extent item at offset 659456 (item 15) ends at offset 823296 (659456 + 163840) while the next file extent item (item 16) starts at offset 663552. Another different problem that the race can trigger is a failure in the assertions at tree-log.c:copy_items(), which expect that the first file extent item key we found before releasing the path exists after we have released path and that the last key we found before releasing the path also exists after releasing the path: $ cat -n fs/btrfs/tree-log.c 4080 if (need_find_last_extent) { 4081 /* btrfs_prev_leaf could return 1 without releasing the path */ 4082 btrfs_release_path(src_path); 4083 ret = btrfs_search_slot(NULL, inode->root, &first_key, 4084 src_path, 0, 0); 4085 if (ret < 0) 4086 return ret; 4087 ASSERT(ret == 0); (...) 4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) { 4104 ret = btrfs_next_leaf(inode->root, src_path); 4105 if (ret < 0) 4106 return ret; 4107 ASSERT(ret == 0); 4108 src = src_path->nodes[0]; 4109 i = 0; 4110 need_find_last_extent = true; 4111 } (...) The second assertion implicitly expects that the last key before the path release still exists, because the surrounding while loop only stops after we have found that key. When this assertion fails it produces a stack like this: [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107 [139590.037406] ------------[ cut here ]------------ [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546! [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G W 5.0.0-btrfs-next-46 #1 (...) [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs] (...) [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282 [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000 [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868 [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000 [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013 [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001 [139590.042501] FS: 00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000 [139590.042847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0 [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [139590.044250] Call Trace: [139590.044631] copy_items+0xa3f/0x1000 [btrfs] [139590.045009] ? generic_bin_search.constprop.32+0x61/0x200 [btrfs] [139590.045396] btrfs_log_inode+0x7b3/0xd70 [btrfs] [139590.045773] btrfs_log_inode_parent+0x2b3/0xce0 [btrfs] [139590.046143] ? do_raw_spin_unlock+0x49/0xc0 [139590.046510] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [139590.046872] btrfs_sync_file+0x3b6/0x440 [btrfs] [139590.047243] btrfs_file_write_iter+0x45b/0x5c0 [btrfs] [139590.047592] __vfs_write+0x129/0x1c0 [139590.047932] vfs_write+0xc2/0x1b0 [139590.048270] ksys_write+0x55/0xc0 [139590.048608] do_syscall_64+0x60/0x1b0 [139590.048946] entry_SYSCALL_64_after_hwframe+0x49/0xbe [139590.049287] RIP: 0033:0x7f2efc4be190 (...) [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190 [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003 [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60 [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003 [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000 (...) [139590.055128] ---[ end trace 193f35d0215cdeeb ]--- So fix this race between a full ranged fsync and writeback of adjacent ranges by flushing all delalloc and waiting for all ordered extents to complete before logging the inode. This is the simplest way to solve the problem because currently the full fsync path does not deal with ranges at all (it assumes a full range from 0 to LLONG_MAX) and it always needs to look at adjacent ranges for hole detection. For use cases of ranged fsyncs this can make a few fsyncs slower but on the other hand it can make some following fsyncs to other ranges do less work or no need to do anything at all. A full fsync is rare anyway and happens only once after loading/creating an inode and once after less common operations such as a shrinking truncate. This is an issue that exists for a long time, and was often triggered by generic/127, because it does mmap'ed writes and msync (which triggers a ranged fsync). Adding support for the tree checker to detect overlapping extents (next patch in the series) and trigger a WARN() when such cases are found, and then calling btrfs_check_leaf_full() at the end of btrfs_insert_file_extent() made the issue much easier to detect. Running btrfs/072 with that change to the tree checker and making fsstress open files always with O_SYNC made it much easier to trigger the issue (as triggering it with generic/127 is very rare). CC: stable@vger.kernel.org # 3.16+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-03btrfs: don't double unlock on error in btrfs_punch_holeJosef Bacik
If we have an error writing out a delalloc range in btrfs_punch_hole_lock_range we'll unlock the inode and then goto out_only_mutex, where we will again unlock the inode. This is bad, don't do this. Fixes: f27451f22996 ("Btrfs: add support for fallocate's zero range operation") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent()Qu Wenruo
Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long parameter list and the confusing @owner parameter. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()Qu Wenruo
Use the new btrfs_ref structure and replace parameter list to clean up the usage of owner and level to distinguish the extent types. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota ↵Robbie Ko
reserve When doing fallocate, we first add the range to the reserve_list and then reserve the quota. If quota reservation fails, we'll release all reserved parts of reserve_list. However, cur_offset is not updated to indicate that this range is already been inserted into the list. Therefore, the same range is freed twice. Once at list_for_each_entry loop, and once at the end of the function. This will result in WARN_ON on bytes_may_use when we free the remaining space. At the end, under the 'out' label we have a call to: btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset); The start offset, third argument, should be cur_offset. Everything from alloc_start to cur_offset was freed by the list_for_each_entry_safe_loop. Fixes: 18513091af94 ("btrfs: update btrfs_space_info's bytes_may_use timely") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from eb in btrfs_leaf_free_spaceDavid Sterba
We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: use BUG() instead of BUG_ON(1)Arnd Bergmann
BUG_ON(1) leads to bogus warnings from clang when CONFIG_PROFILE_ANNOTATED_BRANCHES is set: fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] BUG_ON(1); ^~~~~~~~~ include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON' #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0) ^~~~~~~~~~~~~~~~~~~ include/linux/compiler.h:48:23: note: expanded from macro 'unlikely' # define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x))) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here max_chunk_size); ^~~~~~~~~~~~~~ include/linux/kernel.h:860:36: note: expanded from macro 'min' #define min(x, y) __careful_cmp(x, y, <) ^ include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp' __cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op)) ^ include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once' typeof(y) unique_y = (y); \ ^ fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true BUG_ON(1); ^ include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON' #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0) ^ fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning u64 max_chunk_size; ^ = 0 Change it to BUG() so clang can see that this code path can never continue. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29Btrfs: remove no longer used 'sync' member from transaction handleFilipe Manana
Commit db2462a6ad3d ("btrfs: don't run delayed refs in the end transaction logic") removed its last use, so now it does absolutely nothing, therefore remove it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: Remove unused arguments from btrfs_get_extent_fiemapNikolay Borisov
This function is a simple wrapper over btrfs_get_extent that returns either: a) A real extent in the passed range or b) Adjusted extent based on whether delalloc bytes are found backing up a hole. To support these semantics it doesn't need the page/pg_offset/create arguments which are passed to btrfs_get_extent in case an extent is to be created. So simplify the function by removing the unused arguments. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Fix typos in comments and stringsAndrea Gelmini
The typos accumulate over time so once in a while time they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: use offset_in_page instead of open-coding itJohannes Thumshirn
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an offset into a page. So replace them by the offset_in_page() macro instead of open-coding it if they're not used as an alignment check. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: remove no longer used io_err from btrfs_log_ctxFilipe Manana
The io_err field of struct btrfs_log_ctx is no longer used after the recent simplification of the fast fsync path, where we now wait for ordered extents to complete before logging the inode. We did this in commit b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync time") and commit a2120a473a80 ("btrfs: clean up the left over logged_list usage") removed its last use. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-28Merge tag 'for-4.20-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Some of these bugs are being hit during testing so we'd like to get them merged, otherwise there are usual stability fixes for stable trees" * tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: relocation: set trans to be NULL after ending transaction Btrfs: fix race between enabling quotas and subvolume creation Btrfs: send, fix infinite loop due to directory rename dependencies Btrfs: ensure path name is null terminated at btrfs_control_ioctl Btrfs: fix rare chances for data loss when doing a fast fsync btrfs: Always try all copies when reading extent buffers
2018-11-13Btrfs: fix rare chances for data loss when doing a fast fsyncFilipe Manana
After the simplification of the fast fsync patch done recently by commit b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync time") and commit e7175a692765 ("btrfs: remove the wait ordered logic in the log_one_extent path"), we got a very short time window where we can get extents logged without writeback completing first or extents logged without logging the respective data checksums. Both issues can only happen when doing a non-full (fast) fsync. As soon as we enter btrfs_sync_file() we trigger writeback, then lock the inode and then wait for the writeback to complete before starting to log the inode. However before we acquire the inode's lock and after we started writeback, it's possible that more writes happened and dirtied more pages. If that happened and those pages get writeback triggered while we are logging the inode (for example, the VM subsystem triggering it due to memory pressure, or another concurrent fsync), we end up seeing the respective extent maps in the inode's list of modified extents and will log matching file extent items without waiting for the respective ordered extents to complete, meaning that either of the following will happen: 1) We log an extent after its writeback finishes but before its checksums are added to the csum tree, leading to -EIO errors when attempting to read the extent after a log replay. 2) We log an extent before its writeback finishes. Therefore after the log replay we will have a file extent item pointing to an unwritten extent (and without the respective data checksums as well). This could not happen before the fast fsync patch simplification, because for any extent we found in the list of modified extents, we would wait for its respective ordered extent to finish writeback or collect its checksums for logging if it did not complete yet. Fix this by triggering writeback again after acquiring the inode's lock and before waiting for ordered extents to complete. Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the log_one_extent path") Fixes: b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync time") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-02Merge tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull vfs dedup fixes from Dave Chinner: "This reworks the vfs data cloning infrastructure. We discovered many issues with these interfaces late in the 4.19 cycle - the worst of them (data corruption, setuid stripping) were fixed for XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all the problems was needed. That rework is the contents of this pull request. Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use a common .remap_file_range method and supply generic bounds and sanity checking functions that are shared with the data write path. The current VFS infrastructure has problems with rlimit, LFS file sizes, file time stamps, maximum filesystem file sizes, stripping setuid bits, etc and so they are addressed in these commits. We also introduce the ability for the ->remap_file_range methods to return short clones so that clones for vfs_copy_file_range() don't get rejected if the entire range can't be cloned. It also allows filesystems to sliently skip deduplication of partial EOF blocks if they are not capable of doing so without requiring errors to be thrown to userspace. Existing filesystems are converted to user the new remap_file_range method, and both XFS and ocfs2 are modified to make use of the new generic checking infrastructure" * tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits) xfs: remove [cm]time update from reflink calls xfs: remove xfs_reflink_remap_range xfs: remove redundant remap partial EOF block checks xfs: support returning partial reflink results xfs: clean up xfs_reflink_remap_blocks call site xfs: fix pagecache truncation prior to reflink ocfs2: remove ocfs2_reflink_remap_range ocfs2: support partial clone range and dedupe range ocfs2: fix pagecache truncation prior to reflink ocfs2: truncate page cache for clone destination file before remapping vfs: clean up generic_remap_file_range_prep return value vfs: hide file range comparison function vfs: enable remap callers that can handle short operations vfs: plumb remap flags through the vfs dedupe functions vfs: plumb remap flags through the vfs clone functions vfs: make remap_file_range functions take and return bytes completed vfs: remap helper should update destination inode metadata vfs: pass remap flags to generic_remap_checks vfs: pass remap flags to generic_remap_file_range_prep vfs: combine the clone and dedupe into a single remap_file_range ...
2018-10-30vfs: combine the clone and dedupe into a single remap_file_rangeDarrick J. Wong
Combine the clone_file_range and dedupe_file_range operations into a single remap_file_range file operation dispatch since they're fundamentally the same operation. The differences between the two can be made in the prep functions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-19btrfs: move the dio_sem higher up the callchainJosef Bacik
We're getting a lockdep splat because we take the dio_sem under the log_mutex. What we really need is to protect fsync() from logging an extent map for an extent we never waited on higher up, so just guard the whole thing with dio_sem. ====================================================== WARNING: possible circular locking dependency detected 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Not tainted ------------------------------------------------------ aio-dio-invalid/30928 is trying to acquire lock: 0000000092621cfd (&mm->mmap_sem){++++}, at: get_user_pages_unlocked+0x5a/0x1e0 but task is already holding lock: 00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #5 (&ei->dio_sem){++++}: lock_acquire+0xbd/0x220 down_write+0x51/0xb0 btrfs_log_changed_extents+0x80/0xa40 btrfs_log_inode+0xbaf/0x1000 btrfs_log_inode_parent+0x26f/0xa80 btrfs_log_dentry_safe+0x50/0x70 btrfs_sync_file+0x357/0x540 do_fsync+0x38/0x60 __ia32_sys_fdatasync+0x12/0x20 do_fast_syscall_32+0x9a/0x2f0 entry_SYSENTER_compat+0x84/0x96 -> #4 (&ei->log_mutex){+.+.}: lock_acquire+0xbd/0x220 __mutex_lock+0x86/0xa10 btrfs_record_unlink_dir+0x2a/0xa0 btrfs_unlink+0x5a/0xc0 vfs_unlink+0xb1/0x1a0 do_unlinkat+0x264/0x2b0 do_fast_syscall_32+0x9a/0x2f0 entry_SYSENTER_compat+0x84/0x96 -> #3 (sb_internal#2){.+.+}: lock_acquire+0xbd/0x220 __sb_start_write+0x14d/0x230 start_transaction+0x3e6/0x590 btrfs_evict_inode+0x475/0x640 evict+0xbf/0x1b0 btrfs_run_delayed_iputs+0x6c/0x90 cleaner_kthread+0x124/0x1a0 kthread+0x106/0x140 ret_from_fork+0x3a/0x50 -> #2 (&fs_info->cleaner_delayed_iput_mutex){+.+.}: lock_acquire+0xbd/0x220 __mutex_lock+0x86/0xa10 btrfs_alloc_data_chunk_ondemand+0x197/0x530 btrfs_check_data_free_space+0x4c/0x90 btrfs_delalloc_reserve_space+0x20/0x60 btrfs_page_mkwrite+0x87/0x520 do_page_mkwrite+0x31/0xa0 __handle_mm_fault+0x799/0xb00 handle_mm_fault+0x7c/0xe0 __do_page_fault+0x1d3/0x4a0 async_page_fault+0x1e/0x30 -> #1 (sb_pagefaults){.+.+}: lock_acquire+0xbd/0x220 __sb_start_write+0x14d/0x230 btrfs_page_mkwrite+0x6a/0x520 do_page_mkwrite+0x31/0xa0 __handle_mm_fault+0x799/0xb00 handle_mm_fault+0x7c/0xe0 __do_page_fault+0x1d3/0x4a0 async_page_fault+0x1e/0x30 -> #0 (&mm->mmap_sem){++++}: __lock_acquire+0x42e/0x7a0 lock_acquire+0xbd/0x220 down_read+0x48/0xb0 get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_fast+0xa4/0x150 iov_iter_get_pages+0xc3/0x340 do_direct_IO+0xf93/0x1d70 __blockdev_direct_IO+0x32d/0x1c20 btrfs_direct_IO+0x227/0x400 generic_file_direct_write+0xcf/0x180 btrfs_file_write_iter+0x308/0x58c aio_write+0xf8/0x1d0 io_submit_one+0x3a9/0x620 __ia32_compat_sys_io_submit+0xb2/0x270 do_int80_syscall_32+0x5b/0x1a0 entry_INT80_compat+0x88/0xa0 other info that might help us debug this: Chain exists of: &mm->mmap_sem --> &ei->log_mutex --> &ei->dio_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->dio_sem); lock(&ei->log_mutex); lock(&ei->dio_sem); lock(&mm->mmap_sem); *** DEADLOCK *** 1 lock held by aio-dio-invalid/30928: #0: 00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400 stack backtrace: CPU: 0 PID: 30928 Comm: aio-dio-invalid Not tainted 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Call Trace: dump_stack+0x7c/0xbb print_circular_bug.isra.37+0x297/0x2a4 check_prev_add.constprop.45+0x781/0x7a0 ? __lock_acquire+0x42e/0x7a0 validate_chain.isra.41+0x7f0/0xb00 __lock_acquire+0x42e/0x7a0 lock_acquire+0xbd/0x220 ? get_user_pages_unlocked+0x5a/0x1e0 down_read+0x48/0xb0 ? get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_fast+0xa4/0x150 iov_iter_get_pages+0xc3/0x340 do_direct_IO+0xf93/0x1d70 ? __alloc_workqueue_key+0x358/0x490 ? __blockdev_direct_IO+0x14b/0x1c20 __blockdev_direct_IO+0x32d/0x1c20 ? btrfs_run_delalloc_work+0x40/0x40 ? can_nocow_extent+0x490/0x490 ? kvm_clock_read+0x1f/0x30 ? can_nocow_extent+0x490/0x490 ? btrfs_run_delalloc_work+0x40/0x40 btrfs_direct_IO+0x227/0x400 ? btrfs_run_delalloc_work+0x40/0x40 generic_file_direct_write+0xcf/0x180 btrfs_file_write_iter+0x308/0x58c aio_write+0xf8/0x1d0 ? kvm_clock_read+0x1f/0x30 ? __might_fault+0x3e/0x90 io_submit_one+0x3a9/0x620 ? io_submit_one+0xe5/0x620 __ia32_compat_sys_io_submit+0xb2/0x270 do_int80_syscall_32+0x5b/0x1a0 entry_INT80_compat+0x88/0xa0 CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15Btrfs: don't clean dirty pages during buffered writesChris Mason
During buffered writes, we follow this basic series of steps: again: lock all the pages wait for writeback on all the pages Take the extent range lock wait for ordered extents on the whole range clean all the pages if (copy_from_user_in_atomic() hits a fault) { drop our locks goto again; } dirty all the pages release all the locks The extra waiting, cleaning and locking are there to make sure we don't modify pages in flight to the drive, after they've been crc'd. If some of the pages in the range were already dirty when the write began, and we need to goto again, we create a window where a dirty page has been cleaned and unlocked. It may be reclaimed before we're able to lock it again, which means we'll read the old contents off the drive and lose any modifications that had been pending writeback. We don't actually need to clean the pages. All of the other locking in place makes sure we don't start IO on the pages, so we can just leave them dirty for the duration of the write. Fixes: 73d59314e6ed (the original btrfs merge) CC: stable@vger.kernel.org # v4.4+ Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: switch update_size to bool in btrfs_block_rsv_migrate and ↵Lu Fengqi
btrfs_rsv_add_bytes Using true and false here is closer to the expected semantic than using 0 and 1. No functional change. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: prune unused includesDavid Sterba
Remove includes if none of the interfaces and exports is used in the given source file. Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: simplify pointer chasing of local fs_info variablesDavid Sterba
Functions that get btrfs inode can simply reach the fs_info by dereferencing the root and this looks a bit more straightforward compared to the btrfs_sb(...) indirection. If the transaction handle is available and not NULL it's used instead. Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: Use iocb to derive pos instead of passing a separate parameterGoldwyn Rodrigues
struct kiocb carries the ki_pos, so there is no need to pass it as a separate function parameter. generic_file_direct_write() increments ki_pos, so we now assign pos after the function. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> [ rename to btrfs_buffered_write ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: Get rid of the confusing btrfs_file_extent_inline_lenQu Wenruo
We used to call btrfs_file_extent_inline_len() to get the uncompressed data size of an inlined extent. However this function is hiding evil, for compressed extent, it has no choice but to directly read out ram_bytes from btrfs_file_extent_item. While for uncompressed extent, it uses item size to calculate the real data size, and ignoring ram_bytes completely. In fact, for corrupted ram_bytes, due to above behavior kernel btrfs_print_leaf() can't even print correct ram_bytes to expose the bug. Since we have the tree-checker to verify all EXTENT_DATA, such mismatch can be detected pretty easily, thus we can trust ram_bytes without the evil btrfs_file_extent_inline_len(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: remove remaing full_sync logic from btrfs_sync_fileDavid Sterba
The logic to check if the inode is already in the log can now be simplified since we always wait for the ordered extents to complete before deciding whether the inode needs to be logged. The big comment about it can go away too. CC: Filipe Manana <fdmanana@suse.com> Suggested-by: Filipe Manana <fdmanana@suse.com> [ code and changelog copied from mail discussion ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: always wait on ordered extents at fsync timeJosef Bacik
There's a priority inversion that exists currently with btrfs fsync. In some cases we will collect outstanding ordered extents onto a list and only wait on them at the very last second. However this "very last second" falls inside of a transaction handle, so if we are in a lower priority cgroup we can end up holding the transaction open for longer than needed, so if a high priority cgroup is also trying to fsync() it'll see latency. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>