From 7dbeaad0af7d0a1a2a8e41d04e90964368ddfcc5 Mon Sep 17 00:00:00 2001 From: Qu Wenruo Date: Wed, 10 Jun 2020 09:04:43 +0800 Subject: btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak [BUG] The following simple workload from fsstress can lead to qgroup reserved data space leak: 0/0: creat f0 x:0 0 0 0/0: creat add id=0,parent=-1 0/1: write f0[259 1 0 0 0 0] [600030,27288] 0 0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat() 0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0 This would cause btrfs qgroup to leak 20480 bytes for data reserved space. If btrfs qgroup limit is enabled, such leak can lead to unexpected early EDQUOT and unusable space. [CAUSE] When doing direct IO, kernel will try to writeback existing buffered page cache, then invalidate them: generic_file_direct_write() |- filemap_write_and_wait_range(); |- invalidate_inode_pages2_range(); However for btrfs, the bi_end_io hook doesn't finish all its heavy work right after bio ends. In fact, it delays its work further: submit_extent_page(end_io_func=end_bio_extent_writepage); end_bio_extent_writepage() |- btrfs_writepage_endio_finish_ordered() |- btrfs_init_work(finish_ordered_fn); <<< Work queue execution >>> finish_ordered_fn() |- btrfs_finish_ordered_io(); |- Clear qgroup bits This means, when filemap_write_and_wait_range() returns, btrfs_finish_ordered_io() is not guaranteed to be executed, thus the qgroup bits for related range are not cleared. Now into how the leak happens, this will only focus on the overlapping part of buffered and direct IO part. 1. After buffered write The inode had the following range with QGROUP_RESERVED bit: 596 616K |///////////////| Qgroup reserved data space: 20K 2. Writeback part for range [596K, 616K) Write back finished, but btrfs_finish_ordered_io() not get called yet. So we still have: 596K 616K |///////////////| Qgroup reserved data space: 20K 3. Pages for range [596K, 616K) get released This will clear all qgroup bits, but don't update the reserved data space. So we have: 596K 616K | | Qgroup reserved data space: 20K That number doesn't match the qgroup bit range anymore. 4. Dio prepare space for range [596K, 700K) Qgroup reserved data space for that range, we got: 596K 616K 700K |///////////////|///////////////////////| Qgroup reserved data space: 20K + 104K = 124K 5. btrfs_finish_ordered_range() gets executed for range [596K, 616K) Qgroup free reserved space for that range, we got: 596K 616K 700K | |///////////////////////| We need to free that range of reserved space. Qgroup reserved data space: 124K - 20K = 104K 6. btrfs_finish_ordered_range() gets executed for range [596K, 700K) However qgroup bit for range [596K, 616K) is already cleared in previous step, so we only free 84K for qgroup reserved space. 596K 616K 700K | | | We need to free that range of reserved space. Qgroup reserved data space: 104K - 84K = 20K Now there is no way to release that 20K unless disabling qgroup or unmounting the fs. [FIX] This patch will change the timing of btrfs_qgroup_release/free_data() call. Here it uses buffered COW write as an example. The new timing | The old timing ----------------------------------------+--------------------------------------- btrfs_buffered_write() | btrfs_buffered_write() |- btrfs_qgroup_reserve_data() | |- btrfs_qgroup_reserve_data() | btrfs_run_delalloc_range() | btrfs_run_delalloc_range() |- btrfs_add_ordered_extent() | |- btrfs_qgroup_release_data() | The reserved is passed into | btrfs_ordered_extent structure | | btrfs_finish_ordered_io() | btrfs_finish_ordered_io() |- The reserved space is passed to | |- btrfs_qgroup_release_data() btrfs_qgroup_record | The resereved space is passed | to btrfs_qgroup_recrod | btrfs_qgroup_account_extents() | btrfs_qgroup_account_extents() |- btrfs_qgroup_free_refroot() | |- btrfs_qgroup_free_refroot() The point of such change is to ensure, when ordered extents are submitted, the qgroup reserved space is already released, to keep the timing aligned with file_write_and_wait_range(). So that qgroup data reserved space is all bound to btrfs_ordered_extent and solve the timing mismatch. Fixes: f695fdcef83a ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space") Suggested-by: Josef Bacik Reviewed-by: Josef Bacik Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/ordered-data.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'fs/btrfs/ordered-data.h') diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index c01c9698250b..4a506c5598f8 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -92,6 +92,9 @@ struct btrfs_ordered_extent { /* compression algorithm */ int compress_type; + /* Qgroup reserved space */ + int qgroup_rsv; + /* reference count */ refcount_t refs; -- cgit From cd8d39f4aeb3b69af2478112366584256248eb2b Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Mon, 15 Jun 2020 10:36:48 +0100 Subject: btrfs: remove no longer used log_list member of struct btrfs_ordered_extent The 'log_list' member of an ordered extent was used keep track of which ordered extents we needed to wait after logging metadata, but is not used anymore since commit 5636cf7d6dc86f ("btrfs: remove the logged extents infrastructure"), as we now always wait on ordered extent completion before logging metadata. So just remove it since it's doing nothing and making each ordered extent structure waste more memory (2 pointers). Reviewed-by: Johannes Thumshirn Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/ordered-data.h | 3 --- 1 file changed, 3 deletions(-) (limited to 'fs/btrfs/ordered-data.h') diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 4a506c5598f8..435f93c46c32 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -104,9 +104,6 @@ struct btrfs_ordered_extent { /* list of checksums for insertion when the extent io is done */ struct list_head list; - /* If we need to wait on this to be done */ - struct list_head log_list; - /* If the transaction needs to wait on this ordered extent */ struct list_head trans_list; -- cgit From 3ef64143a7963fd882ab52fee1cc1c9ba2e408e0 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Mon, 15 Jun 2020 10:36:58 +0100 Subject: btrfs: remove no longer used trans_list member of struct btrfs_ordered_extent The 'trans_list' member of an ordered extent was used to keep track of the ordered extents for which a transaction commit had to wait. These were ordered extents that were started and logged by an fsync. However we don't do that anymore and before we stopped doing it we changed the approach to wait for the ordered extents in commit 161c3549b45aee ("Btrfs: change how we wait for pending ordered extents"), which stopped using that list and therefore the 'trans_list' member is not used anymore since that commit. So just remove it since it's doing nothing and making each ordered extent structure waste memory (2 pointers). Reviewed-by: Johannes Thumshirn Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/ordered-data.h | 3 --- 1 file changed, 3 deletions(-) (limited to 'fs/btrfs/ordered-data.h') diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 435f93c46c32..a24a1f2d5f9d 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -104,9 +104,6 @@ struct btrfs_ordered_extent { /* list of checksums for insertion when the extent io is done */ struct list_head list; - /* If the transaction needs to wait on this ordered extent */ - struct list_head trans_list; - /* used to wait for the BTRFS_ORDERED_COMPLETE bit */ wait_queue_head_t wait; -- cgit From c3504372699bff6daeda207b4e30256c39f584c1 Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Wed, 3 Jun 2020 08:55:03 +0300 Subject: btrfs: make btrfs_lookup_ordered_extent take btrfs_inode It doesn't use the generic vfs inode for anything use btrfs_inode directly. Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/ordered-data.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/btrfs/ordered-data.h') diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index a24a1f2d5f9d..f2a78f8f6bce 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -163,7 +163,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, int compress_type); void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry, struct btrfs_ordered_sum *sum); -struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct inode *inode, +struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode, u64 file_offset); void btrfs_start_ordered_extent(struct inode *inode, struct btrfs_ordered_extent *entry, int wait); -- cgit From e7fbf60453a7eae2a36ac9096c84ccb1067dabdf Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Wed, 3 Jun 2020 08:55:13 +0300 Subject: btrfs: make btrfs_add_ordered_extent take btrfs_inode Preparation to converting its callers to taking btrfs_inode. Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/ordered-data.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/btrfs/ordered-data.h') diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index f2a78f8f6bce..6c42f307b87f 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -151,7 +151,7 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode, struct btrfs_ordered_extent **cached, u64 *file_offset, u64 io_size, int uptodate); -int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, +int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, int type); int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, -- cgit From 4cc612090ba5828fb301623fa8cdf0d7a165f91c Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Wed, 3 Jun 2020 08:55:15 +0300 Subject: btrfs: make btrfs_add_ordered_extent_compress take btrfs_inode It simpy forwards its inode argument to __btrfs_add_ordered_extent which already takes btrfs_inode. Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/ordered-data.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/btrfs/ordered-data.h') diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 6c42f307b87f..03865f721164 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -157,7 +157,7 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, int type); -int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, +int btrfs_add_ordered_extent_compress(struct btrfs_inode *inode, u64 file_offset, u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, int type, int compress_type); -- cgit From 7095821ee1f57d9d4b179463738776e0fcd28d38 Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Wed, 3 Jun 2020 08:55:23 +0300 Subject: btrfs: make btrfs_dec_test_first_ordered_pending take btrfs_inode It doesn't really need vfs_inode but btrfs_inode. Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/ordered-data.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/btrfs/ordered-data.h') diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 03865f721164..c2432f0165e1 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -147,7 +147,7 @@ void btrfs_remove_ordered_extent(struct inode *inode, int btrfs_dec_test_ordered_pending(struct inode *inode, struct btrfs_ordered_extent **cached, u64 file_offset, u64 io_size, int uptodate); -int btrfs_dec_test_first_ordered_pending(struct inode *inode, +int btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode, struct btrfs_ordered_extent **cached, u64 *file_offset, u64 io_size, int uptodate); -- cgit From c1e095202caae87ac4a5c353691a492692d61857 Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Wed, 3 Jun 2020 08:55:30 +0300 Subject: btrfs: make btrfs_add_ordered_extent_dio take btrfs_inode Simply forwards its argument so let's get rid of one extra BTRFS_I by taking btrfs_inode directly. Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/ordered-data.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/btrfs/ordered-data.h') diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index c2432f0165e1..d61ea9c880a3 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -154,7 +154,7 @@ int btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode, int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, int type); -int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, +int btrfs_add_ordered_extent_dio(struct btrfs_inode *inode, u64 file_offset, u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, int type); int btrfs_add_ordered_extent_compress(struct btrfs_inode *inode, u64 file_offset, -- cgit