From d0164adc89f6bb374d304ffcc375c6d2652fe67d Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Fri, 6 Nov 2015 16:28:21 -0800 Subject: mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Acked-by: Michal Hocko Acked-by: Johannes Weiner Cc: Christoph Lameter Cc: David Rientjes Cc: Vitaly Wool Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/jbd2/transaction.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'fs/jbd2/transaction.c') diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 6b8338ec2464..89463eee6791 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -1937,8 +1937,8 @@ out: * @journal: journal for operation * @page: to try and free * @gfp_mask: we use the mask to detect how hard should we try to release - * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to - * release the buffers. + * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit + * code to release the buffers. * * * For all the buffers on this page, -- cgit From bc23f0c8d7ccd8d924c4e70ce311288cb3e61ea8 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Tue, 24 Nov 2015 15:34:35 -0500 Subject: jbd2: Fix unreclaimed pages after truncate in data=journal mode Ted and Namjae have reported that truncated pages don't get timely reclaimed after being truncated in data=journal mode. The following test triggers the issue easily: for (i = 0; i < 1000; i++) { pwrite(fd, buf, 1024*1024, 0); fsync(fd); fsync(fd); ftruncate(fd, 0); } The reason is that journal_unmap_buffer() finds that truncated buffers are not journalled (jh->b_transaction == NULL), they are part of checkpoint list of a transaction (jh->b_cp_transaction != NULL) and have been already written out (!buffer_dirty(bh)). We clean such buffers but we leave them in the checkpoint list. Since checkpoint transaction holds a reference to the journal head, these buffers cannot be released until the checkpoint transaction is cleaned up. And at that point we don't call release_buffer_page() anymore so pages detached from mapping are lingering in the system waiting for reclaim to find them and free them. Fix the problem by removing buffers from transaction checkpoint lists when journal_unmap_buffer() finds out they don't have to be there anymore. Reported-and-tested-by: Namjae Jeon Fixes: de1b794130b130e77ffa975bb58cb843744f9ae5 Signed-off-by: Jan Kara Signed-off-by: Theodore Ts'o Cc: stable@vger.kernel.org --- fs/jbd2/transaction.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'fs/jbd2/transaction.c') diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 6b8338ec2464..b99621277c66 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -2152,6 +2152,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh, if (!buffer_dirty(bh)) { /* bdflush has written it. We can drop it now */ + __jbd2_journal_remove_checkpoint(jh); goto zap_buffer; } @@ -2181,6 +2182,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh, /* The orphan record's transaction has * committed. We can cleanse this buffer */ clear_buffer_jbddirty(bh); + __jbd2_journal_remove_checkpoint(jh); goto zap_buffer; } } -- cgit From 087ffd4eae9929afd06f6a709861df3c3508492a Mon Sep 17 00:00:00 2001 From: Junxiao Bi Date: Fri, 4 Dec 2015 12:29:28 -0500 Subject: jbd2: fix null committed data return in undo_access introduced jbd2_write_access_granted() to improve write|undo_access speed, but missed to check the status of b_committed_data which caused a kernel panic on ocfs2. [ 6538.405938] ------------[ cut here ]------------ [ 6538.406686] kernel BUG at fs/ocfs2/suballoc.c:2400! [ 6538.406686] invalid opcode: 0000 [#1] SMP [ 6538.406686] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront xen_fbfront parport_pc parport pcspkr i2c_piix4 acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix cirrus ttm drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect i2c_core syscopyarea dm_mirror dm_region_hash dm_log dm_mod [ 6538.406686] CPU: 1 PID: 16265 Comm: mmap_truncate Not tainted 4.3.0 #1 [ 6538.406686] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014 [ 6538.406686] task: ffff88007c2bab00 ti: ffff880075b78000 task.ti: ffff880075b78000 [ 6538.406686] RIP: 0010:[] [] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2] [ 6538.406686] RSP: 0018:ffff880075b7b7f8 EFLAGS: 00010246 [ 6538.406686] RAX: ffff8800760c5b40 RBX: ffff88006c06a000 RCX: ffffffffa06e6df0 [ 6538.406686] RDX: 0000000000000000 RSI: ffff88007a6f6ea0 RDI: ffff88007a760430 [ 6538.406686] RBP: ffff880075b7b878 R08: 0000000000000002 R09: 0000000000000001 [ 6538.406686] R10: ffffffffa06769be R11: 0000000000000000 R12: 0000000000000001 [ 6538.406686] R13: ffffffffa06a1750 R14: 0000000000000001 R15: ffff88007a6f6ea0 [ 6538.406686] FS: 00007f17fde30720(0000) GS:ffff88007f040000(0000) knlGS:0000000000000000 [ 6538.406686] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6538.406686] CR2: 0000000000601730 CR3: 000000007aea0000 CR4: 00000000000406e0 [ 6538.406686] Stack: [ 6538.406686] ffff88007c2bb5b0 ffff880075b7b8e0 ffff88007a7604b0 ffff88006c640800 [ 6538.406686] ffff88007a7604b0 ffff880075d77390 0000000075b7b878 ffffffffa06a309d [ 6538.406686] ffff880075d752d8 ffff880075b7b990 ffff880075b7b898 0000000000000000 [ 6538.406686] Call Trace: [ 6538.406686] [] ? ocfs2_read_group_descriptor+0x6d/0xa0 [ocfs2] [ 6538.406686] [] _ocfs2_free_suballoc_bits+0xe4/0x320 [ocfs2] [ 6538.406686] [] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2] [ 6538.406686] [] _ocfs2_free_clusters+0xee/0x210 [ocfs2] [ 6538.406686] [] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2] [ 6538.406686] [] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2] [ 6538.406686] [] ? ocfs2_extend_trans+0x50/0x1a0 [ocfs2] [ 6538.406686] [] ocfs2_free_clusters+0x15/0x20 [ocfs2] [ 6538.406686] [] ocfs2_replay_truncate_records+0xfc/0x290 [ocfs2] [ 6538.406686] [] ? ocfs2_start_trans+0xec/0x1d0 [ocfs2] [ 6538.406686] [] __ocfs2_flush_truncate_log+0x140/0x2d0 [ocfs2] [ 6538.406686] [] ? ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x170 [ocfs2] [ 6538.406686] [] ocfs2_remove_btree_range+0x374/0x630 [ocfs2] [ 6538.406686] [] ? jbd2_journal_stop+0x25b/0x470 [jbd2] [ 6538.406686] [] ocfs2_commit_truncate+0x305/0x670 [ocfs2] [ 6538.406686] [] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2] [ 6538.406686] [] ocfs2_truncate_file+0x297/0x380 [ocfs2] [ 6538.406686] [] ? jbd2_journal_begin_ordered_truncate+0x64/0xc0 [jbd2] [ 6538.406686] [] ocfs2_setattr+0x572/0x860 [ocfs2] [ 6538.406686] [] ? current_fs_time+0x3f/0x50 [ 6538.406686] [] notify_change+0x1d7/0x340 [ 6538.406686] [] ? generic_getxattr+0x79/0x80 [ 6538.406686] [] do_truncate+0x66/0x90 [ 6538.406686] [] ? __audit_syscall_entry+0xb0/0x110 [ 6538.406686] [] do_sys_ftruncate.clone.0+0xf3/0x120 [ 6538.406686] [] SyS_ftruncate+0xe/0x10 [ 6538.406686] [] entry_SYSCALL_64_fastpath+0x12/0x71 [ 6538.406686] Code: 28 48 81 ee b0 04 00 00 48 8b 92 50 fb ff ff 48 8b 80 b0 03 00 00 48 39 90 88 00 00 00 0f 84 30 fe ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 [ 6538.406686] RIP [] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2] [ 6538.406686] RSP [ 6538.691128] ---[ end trace 31cd7011d6770d7e ]--- [ 6538.694492] Kernel panic - not syncing: Fatal exception [ 6538.695484] Kernel Offset: disabled Fixes: de92c8caf16c("jbd2: speedup jbd2_journal_get_[write|undo]_access()") Cc: Signed-off-by: Junxiao Bi Signed-off-by: Theodore Ts'o --- fs/jbd2/transaction.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'fs/jbd2/transaction.c') diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index b99621277c66..1498ad9f731a 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -1009,7 +1009,8 @@ out: } /* Fast check whether buffer is already attached to the required transaction */ -static bool jbd2_write_access_granted(handle_t *handle, struct buffer_head *bh) +static bool jbd2_write_access_granted(handle_t *handle, struct buffer_head *bh, + bool undo) { struct journal_head *jh; bool ret = false; @@ -1036,6 +1037,9 @@ static bool jbd2_write_access_granted(handle_t *handle, struct buffer_head *bh) jh = READ_ONCE(bh->b_private); if (!jh) goto out; + /* For undo access buffer must have data copied */ + if (undo && !jh->b_committed_data) + goto out; if (jh->b_transaction != handle->h_transaction && jh->b_next_transaction != handle->h_transaction) goto out; @@ -1073,7 +1077,7 @@ int jbd2_journal_get_write_access(handle_t *handle, struct buffer_head *bh) struct journal_head *jh; int rc; - if (jbd2_write_access_granted(handle, bh)) + if (jbd2_write_access_granted(handle, bh, false)) return 0; jh = jbd2_journal_add_journal_head(bh); @@ -1210,7 +1214,7 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh) char *committed_data = NULL; JBUFFER_TRACE(jh, "entry"); - if (jbd2_write_access_granted(handle, bh)) + if (jbd2_write_access_granted(handle, bh, true)) return 0; jh = jbd2_journal_add_journal_head(bh); -- cgit From a1c6f05733c27ba7067c06c095f49e8732a5ae17 Mon Sep 17 00:00:00 2001 From: Dmitry Monakhov Date: Mon, 13 Apr 2015 16:31:37 +0400 Subject: fs: use block_device name vsprintf helper Signed-off-by: Dmitry Monakhov Signed-off-by: Al Viro --- fs/jbd2/transaction.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'fs/jbd2/transaction.c') diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index ca181e81c765..081dff087fc0 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -764,13 +764,11 @@ void jbd2_journal_unlock_updates (journal_t *journal) static void warn_dirty_buffer(struct buffer_head *bh) { - char b[BDEVNAME_SIZE]; - printk(KERN_WARNING - "JBD2: Spotted dirty metadata buffer (dev = %s, blocknr = %llu). " + "JBD2: Spotted dirty metadata buffer (dev = %pg, blocknr = %llu). " "There's a risk of filesystem corruption in case of system " "crash.\n", - bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr); + bh->b_bdev, (unsigned long long)bh->b_blocknr); } /* Call t_frozen trigger and copy buffer data into jh->b_frozen_data. */ -- cgit From 490c1b444ce653d0784a9a5fa4d11287029feeb9 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Sun, 13 Mar 2016 17:38:20 -0400 Subject: jbd2: do not fail journal because of frozen_buffer allocation failure Journal transaction might fail prematurely because the frozen_buffer is allocated by GFP_NOFS request: [ 72.440013] do_get_write_access: OOM for frozen_buffer [ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access [ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory (...snipped....) [ 72.495559] do_get_write_access: OOM for frozen_buffer [ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access [ 72.496839] do_get_write_access: OOM for frozen_buffer [ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access [ 72.505766] Aborting journal on device sda1-8. [ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" because small GPF_NOFS allocations never failed. This allocation seems essential for the journal and GFP_NOFS is too restrictive to the memory allocator so let's use __GFP_NOFAIL here to emulate the previous behavior. Signed-off-by: Michal Hocko Signed-off-by: Theodore Ts'o --- fs/jbd2/transaction.c | 22 +++++----------------- 1 file changed, 5 insertions(+), 17 deletions(-) (limited to 'fs/jbd2/transaction.c') diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 081dff087fc0..01e4652d88f6 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -966,14 +966,8 @@ repeat: if (!frozen_buffer) { JBUFFER_TRACE(jh, "allocate memory for buffer"); jbd_unlock_bh_state(bh); - frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS); - if (!frozen_buffer) { - printk(KERN_ERR "%s: OOM for frozen_buffer\n", - __func__); - JBUFFER_TRACE(jh, "oom!"); - error = -ENOMEM; - goto out; - } + frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size, + GFP_NOFS | __GFP_NOFAIL); goto repeat; } jh->b_frozen_data = frozen_buffer; @@ -1226,15 +1220,9 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh) goto out; repeat: - if (!jh->b_committed_data) { - committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS); - if (!committed_data) { - printk(KERN_ERR "%s: No memory for committed data\n", - __func__); - err = -ENOMEM; - goto out; - } - } + if (!jh->b_committed_data) + committed_data = jbd2_alloc(jh2bh(jh)->b_size, + GFP_NOFS|__GFP_NOFAIL); jbd_lock_bh_state(bh); if (!jh->b_committed_data) { -- cgit From 09cbfeaf1a5a67bfb3201e0c83c810cecb2efa5a Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 1 Apr 2016 15:29:47 +0300 Subject: mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ; - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov Acked-by: Michal Hocko Signed-off-by: Linus Torvalds --- fs/jbd2/transaction.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'fs/jbd2/transaction.c') diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 01e4652d88f6..67c103867bf8 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -2263,7 +2263,7 @@ int jbd2_journal_invalidatepage(journal_t *journal, struct buffer_head *head, *bh, *next; unsigned int stop = offset + length; unsigned int curr_off = 0; - int partial_page = (offset || length < PAGE_CACHE_SIZE); + int partial_page = (offset || length < PAGE_SIZE); int may_free = 1; int ret = 0; @@ -2272,7 +2272,7 @@ int jbd2_journal_invalidatepage(journal_t *journal, if (!page_has_buffers(page)) return 0; - BUG_ON(stop > PAGE_CACHE_SIZE || stop < length); + BUG_ON(stop > PAGE_SIZE || stop < length); /* We will potentially be playing with lists other than just the * data lists (especially for journaled data mode), so be -- cgit