summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2013-04-03ext4: unregister es_shrinker if mount failedDmitry Monakhov
Otherwise destroyed ext_sb_info will be part of global shinker list and result in the following OOPS: JBD2: corrupted journal superblock JBD2: recovery failed EXT4-fs (dm-2): error loading journal general protection fault: 0000 [#1] SMP Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\ mod CPU 1 Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136 /DH55TC RIP: 0010:[<ffffffff811bfb2d>] [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0 RSP: 0000:ffff88011d5cbcd8 EFLAGS: 00010207 RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246 RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848 R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000 FS: 00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80) Stack: ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1 00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000 ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000 Call Trace: [<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0 [<ffffffff81249381>] mount_bdev+0x331/0x340 [<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180 [<ffffffff81362035>] ext4_mount+0x15/0x20 [<ffffffff8124869a>] mount_fs+0x9a/0x2e0 [<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170 [<ffffffff81279c02>] do_new_mount+0x172/0x2e0 [<ffffffff8127aa56>] do_mount+0x376/0x380 [<ffffffff8127ab98>] sys_mount+0x138/0x150 [<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b Code: 8b 05 88 04 eb 00 48 3d 90 ff 06 82 48 8d 58 e8 75 19 4c 89 e7 e8 e4 d7 2c 00 48 c7 c7 00 ff 06 82 e8 58 5f ef ff 5b 41 5c c9 c3 <48> 8b 4b 18 48 8b 73 20 48 89 da 31 c0 48 c7 c7 c5 a0 e4 81 e\ 8 RIP [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0 RSP <ffff88011d5cbcd8> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-04-03ext4: fix journal callback list traversalDmitry Monakhov
It is incorrect to use list_for_each_entry_safe() for journal callback traversial because ->next may be removed by other task: ->ext4_mb_free_metadata() ->ext4_mb_free_metadata() ->ext4_journal_callback_del() This results in the following issue: WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250() Hardware name: list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107 Call Trace: [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80 [<ffffffff810ac6be>] kthread+0x10e/0x120 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 This patch fix the issue as follows: - ext4_journal_commit_callback() make list truly traversial safe simply by always starting from list_head - fix race between two ext4_journal_callback_del() and ext4_journal_callback_try_del() Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.com
2013-04-03jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callbackDmitry Monakhov
The following race is possible: [kjournald2] other_task jbd2_journal_commit_transaction() j_state = T_FINISHED; spin_unlock(&journal->j_list_lock); ->jbd2_journal_remove_checkpoint() ->jbd2_journal_free_transaction(); ->kmem_cache_free(transaction) ->j_commit_callback(journal, transaction); -> USE_AFTER_FREE WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250() Hardware name: list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107 Call Trace: [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80 [<ffffffff810ac6be>] kthread+0x10e/0x120 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 In order to demonstrace this issue one should mount ext4 with mount -o discard option on SSD disk. This makes callback longer and race window becomes wider. In order to fix this we should mark transaction as finished only after callbacks have completed Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-04-03ext4: support simple conversion of extent-mapped inodes to use i_blocksTheodore Ts'o
In order to make it simpler to test the code which support i_blocks/indirect-mapped inodes, support the conversion of inodes which are less than 12 blocks and which are contained in no more than a single extent. The primary intended use of this code is to converting freshly created zero-length files and empty directories. Note that the version of chattr in e2fsprogs 1.42.7 and earlier has a check that prevents the clearing of the extent flag. A simple patch which allows "chattr -e <file>" to work will be checked into the e2fsprogs git repository. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4/jbd2: don't wait (forever) for stale tid caused by wraparoundTheodore Ts'o
In the case where an inode has a very stale transaction id (tid) in i_datasync_tid or i_sync_tid, it's possible that after a very large (2**31) number of transactions, that the tid number space might wrap, causing tid_geq()'s calculations to fail. Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily", attempted to fix this problem, but it only avoided kjournald spinning forever by fixing the logic in jbd2_log_start_commit(). Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c that might call jbd2_log_start_commit() with a stale tid, those functions will subsequently call jbd2_log_wait_commit() with the same stale tid, and then wait for a very long time. To fix this, we replace the calls to jbd2_log_start_commit() and jbd2_log_wait_commit() with a call to a new function, jbd2_complete_transaction(), which will correctly handle stale tid's. As a bonus, jbd2_complete_transaction() will avoid locking j_state_lock for writing unless a commit needs to be started. This should have a small (but probably not measurable) improvement for ext4's scalability. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Ben Hutchings <ben@decadent.org.uk> Reported-by: George Barnett <gbarnett@atlassian.com> Cc: stable@vger.kernel.org
2013-04-03ext4: add might_sleep() annotationsTheodore Ts'o
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03ext4: add mutex_is_locked() assertion to ext4_truncate()Theodore Ts'o
[ Added fixup from Lukáš Czerner which only checks the assertion when the inode is not new and is not being freed. ] Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03nfsd: remove /proc/fs/nfs when create /proc/fs/nfs/exports errorfanchaoting
when create /proc/fs/nfs/exports error, we should remove /proc/fs/nfs, if don't do it, it maybe cause Memory leak. Signed-off-by: fanchaoting <fanchaoting@cn.fujitsu.com> Reviewed-by: chendt.fnst <chendt.fnst@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd: don't run get_file if nfs4_preprocess_stateid_op return errorfanchaoting
we should return error status directly when nfs4_preprocess_stateid_op return error. Signed-off-by: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd: convert the file_hashtbl to a hlistJeff Layton
We only ever traverse the hash chains in the forward direction, so a double pointer list head isn't really necessary. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Unfortunately, we introduced some big-endian bugs during the last merge window. Fortunately, Cai and Christian noticed before 3.9 shipped." * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix big-endian bugs which could cause fs corruptions
2013-04-03xfs: Add ratelimited printk for different alert levelsRich Johnston
Ratelimited printk will be useful in printing xfs messages which are otherwise not required to be printed always due to their high rate (to prevent kernel ring buffer from overflowing), while at the same time required to be printed. Signed-off-by: Raghavendra D Prabhu <rprabhu@wnohang.net> Reviewed-by: Rich Johnston <rjohnston@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-04-03sysfs: fix use after free in case of concurrent read/write and readdirMing Lei
The inode->i_mutex isn't hold when updating filp->f_pos in read()/write(), so the filp->f_pos might be read as 0 or 1 in readdir() when there is concurrent read()/write() on this same file, then may cause use after free in readdir(). The bug can be reproduced with Li Zefan's test code on the link: https://patchwork.kernel.org/patch/2160771/ This patch fixes the use after free under this situation. Cc: stable <stable@vger.kernel.org> Reported-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-03Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull reiserfs fix from Jan Kara: "A fix for reiserfs xattr bug exposed by changes to lookup_one_len()" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: reiserfs: Fix warning and inode leak when deleting inode with xattrs
2013-04-03ext4: refactor truncate codeTheodore Ts'o
Move common code in ext4_ind_truncate() and ext4_ext_truncate() into ext4_truncate(). This saves over 60 lines of code. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: refactor punch hole codeTheodore Ts'o
Move common code in ext4_ind_punch_hole() and ext4_ext_punch_hole() into ext4_punch_hole(). This saves over 150 lines of code. This also fixes a potential bug when the punch_hole() code is racing against indirect-to-extents or extents-to-indirect migation. We are currently using i_mutex to protect against changes to the inode flag; specifically, the append-only, immutable, and extents inode flags. So we need to take i_mutex before deciding whether to use the extents-specific or indirect-specific punch_hole code. Also, there was a missing call to ext4_inode_block_unlocked_dio() in the indirect punch codepath. This was added in commit 02d262dffcf4c to block DIO readers racing against the punch operation in the codepath for extent-mapped inodes, but it was missing for indirect-block mapped inodes. One of the advantages of refactoring the code is that it makes such oversights much less likely. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: fold ext4_alloc_blocks() in ext4_alloc_branch()Theodore Ts'o
The older code was far more complicated than it needed to be because of how we spliced in the ext4's new multiblock allocator into ext3's indirect block code. By folding ext4_alloc_blocks() into ext4_alloc_branch(), we make the code far more understable, shave off over 130 lines of code and half a kilobyte of compiled object code. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: fold ext4_generic_write_end() into ext4_write_end()Zheng Liu
After collapsing the handling of data ordered and data writeback codepath, ext4_generic_write_end() has only one caller, ext4_write_end(). So we fold it into ext4_write_end(). Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03ext4: collapse handling of data=ordered and data=writeback codepathsTheodore Ts'o
The only difference between how we handle data=ordered and data=writeback is a single call to ext4_jbd2_file_inode(). Eliminate code duplication by factoring out redundant the code paths. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03ext4: fix big-endian bugs which could cause fs corruptionsZheng Liu
When an extent was zeroed out, we forgot to do convert from cpu to le16. It could make us hit a BUG_ON when we try to write dirty pages out. So fix it. [ Also fix a bug found by Dmitry Monakhov where we were missing le32_to_cpu() calls in the new indirect punch hole code. There are a number of other big endian warnings found by static code analyzers, but we'll wait for the next merge window to fix them all up. These fixes are designed to be Obviously Correct by code inspection, and easy to demonstrate that it won't make any difference (and hence, won't introduce any bugs) on little endian architectures such as x86. --tytso ] Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: CAI Qian <caiqian@redhat.com> Reported-by: Christian Kujau <lists@nerdbynature.de> Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-04-03nfsd4: don't destroy in-use sessionJ. Bruce Fields
This changes session destruction to be similar to client destruction in that attempts to destroy a session while in use (which should be rare corner cases) result in DELAY. This simplifies things somewhat and helps meet a coming 4.2 requirement. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd4: don't destroy in-use clientsJ. Bruce Fields
When a setclientid_confirm or create_session confirms a client after a client reboot, it also destroys any previous state held by that client. The shutdown of that previous state must be careful not to free the client out from under threads processing other requests that refer to the client. This is a particular problem in the NFSv4.1 case when we hold a reference to a session (hence a client) throughout compound processing. The server attempts to handle this by unhashing the client at the time it's destroyed, then delaying the final free to the end. But this still leaves some races in the current code. I believe it's simpler just to fail the attempt to destroy the client by returning NFS4ERR_DELAY. This is a case that should never happen anyway. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd4: simplify bind_conn_to_session lockingJ. Bruce Fields
The locking here is very fiddly, and there's no reason for us to be setting cstate->session, since this is the only op in the compound. Let's just take the state lock and drop the reference counting. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd4: fix destroy_session raceJ. Bruce Fields
destroy_session uses the session and client without continuously holding any reference or locks. Put the whole thing under the state lock for now. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd4: clientid lookup cleanupJ. Bruce Fields
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd4: destroy_clientid simplificationJ. Bruce Fields
I'm not sure what the check for clientid expiry was meant to do here. The check for a matching session is redundant given the previous check for state: a client without state is, in particular, a client without sessions. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd4: remove some dprintk'sJ. Bruce Fields
E.g. printk's that just report the return value from an op are uninteresting as we already do that in the main proc_compound loop. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd4: STALE_STATEID cleanupJ. Bruce Fields
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd4: warn on odd create_session stateJ. Bruce Fields
This should never happen. (Note: the comparable case in setclientid_confirm *can* happen, since updating a client record can result in both confirmed and unconfirmed records with the same clientid.) Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd: fix bug on nfs4 stateid deallocationycnian@gmail.com
NFS4_OO_PURGE_CLOSE is not handled properly. To avoid memory leak, nfs4 stateid which is pointed by oo_last_closed_stid is freed in nfsd4_close(), but NFS4_OO_PURGE_CLOSE isn't cleared meanwhile. So the stateid released in THIS close procedure may be freed immediately in the coming encoding function. Sorry that Signed-off-by was forgotten in last version. Signed-off-by: Yanchuan Nian <ycnian@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd: remove unused macro in nfsv4Yanchuan Nian
lk_rflags is never used anywhere, and rflags is not defined in struct nfsd4_lock. Signed-off-by: Yanchuan Nian <ycnian@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd4: fix use-after-free of 4.1 client on connection lossJ. Bruce Fields
Once we drop the lock here there's nothing keeping the client around: the only lock still held is the xpt_lock on this socket, but this socket no longer has any connection with the client so there's no way for other code to know we're still using the client. The solution is simple: all nfsd4_probe_callback does is set a few variables and queue some work, so there's no reason we can't just keep it under the lock. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd4: fix race on client shutdownJ. Bruce Fields
Dropping the session's reference count after the client's means we leave a window where the session's se_client pointer is NULL. An xpt_user callback that encounters such a session may then crash: [ 303.956011] BUG: unable to handle kernel NULL pointer dereference at 0000000000000318 [ 303.959061] IP: [<ffffffff81481a8e>] _raw_spin_lock+0x1e/0x40 [ 303.959061] PGD 37811067 PUD 3d498067 PMD 0 [ 303.959061] Oops: 0002 [#8] PREEMPT SMP [ 303.959061] Modules linked in: md5 nfsd auth_rpcgss nfs_acl snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_page_alloc microcode psmouse snd_timer serio_raw pcspkr evdev snd soundcore i2c_piix4 i2c_core intel_agp intel_gtt processor button nfs lockd sunrpc fscache ata_generic pata_acpi ata_piix uhci_hcd libata btrfs usbcore usb_common crc32c scsi_mod libcrc32c zlib_deflate floppy virtio_balloon virtio_net virtio_pci virtio_blk virtio_ring virtio [ 303.959061] CPU 0 [ 303.959061] Pid: 264, comm: nfsd Tainted: G D 3.8.0-ARCH+ #156 Bochs Bochs [ 303.959061] RIP: 0010:[<ffffffff81481a8e>] [<ffffffff81481a8e>] _raw_spin_lock+0x1e/0x40 [ 303.959061] RSP: 0018:ffff880037877dd8 EFLAGS: 00010202 [ 303.959061] RAX: 0000000000000100 RBX: ffff880037a2b698 RCX: ffff88003d879278 [ 303.959061] RDX: ffff88003d879278 RSI: dead000000100100 RDI: 0000000000000318 [ 303.959061] RBP: ffff880037877dd8 R08: ffff88003c5a0f00 R09: 0000000000000002 [ 303.959061] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 303.959061] R13: 0000000000000318 R14: ffff880037a2b680 R15: ffff88003c1cbe00 [ 303.959061] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000 [ 303.959061] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 303.959061] CR2: 0000000000000318 CR3: 000000003d49c000 CR4: 00000000000006f0 [ 303.959061] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 303.959061] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 303.959061] Process nfsd (pid: 264, threadinfo ffff880037876000, task ffff88003c1fd0a0) [ 303.959061] Stack: [ 303.959061] ffff880037877e08 ffffffffa03772ec ffff88003d879000 ffff88003d879278 [ 303.959061] ffff88003d879080 0000000000000000 ffff880037877e38 ffffffffa0222a1f [ 303.959061] 0000000000107ac0 ffff88003c22e000 ffff88003d879000 ffff88003c1cbe00 [ 303.959061] Call Trace: [ 303.959061] [<ffffffffa03772ec>] nfsd4_conn_lost+0x3c/0xa0 [nfsd] [ 303.959061] [<ffffffffa0222a1f>] svc_delete_xprt+0x10f/0x180 [sunrpc] [ 303.959061] [<ffffffffa0223d96>] svc_recv+0xe6/0x580 [sunrpc] [ 303.959061] [<ffffffffa03587c5>] nfsd+0xb5/0x140 [nfsd] [ 303.959061] [<ffffffffa0358710>] ? nfsd_destroy+0x90/0x90 [nfsd] [ 303.959061] [<ffffffff8107ae00>] kthread+0xc0/0xd0 [ 303.959061] [<ffffffff81010000>] ? perf_trace_xen_mmu_set_pte_at+0x50/0x100 [ 303.959061] [<ffffffff8107ad40>] ? kthread_freezable_should_stop+0x70/0x70 [ 303.959061] [<ffffffff814898ec>] ret_from_fork+0x7c/0xb0 [ 303.959061] [<ffffffff8107ad40>] ? kthread_freezable_should_stop+0x70/0x70 [ 303.959061] Code: ff ff 5d c3 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 65 48 8b 04 25 f0 c6 00 00 48 89 e5 83 80 44 e0 ff ff 01 b8 00 01 00 00 <3e> 66 0f c1 07 0f b6 d4 38 c2 74 0f 66 0f 1f 44 00 00 f3 90 0f [ 303.959061] RIP [<ffffffff81481a8e>] _raw_spin_lock+0x1e/0x40 [ 303.959061] RSP <ffff880037877dd8> [ 303.959061] CR2: 0000000000000318 [ 304.001218] ---[ end trace 2d809cd4a7931f5a ]--- [ 304.001903] note: nfsd[264] exited with preempt_count 2 Reported-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd4: handle seqid-mutating open errors from xdr decodingJ. Bruce Fields
If a client sets an owner (or group_owner or acl) attribute on open for create, and the mapping of that owner to an id fails, then we return BAD_OWNER. But BAD_OWNER is a seqid-mutating error, so we can't shortcut the open processing that case: we have to at least look up the owner so we can find the seqid to bump. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd4: remove BUG_ONJ. Bruce Fields
This BUG_ON just crashes the thread a little earlier than it would otherwise--it doesn't seem useful. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd: scale up the number of DRC hash buckets with cache sizeJeff Layton
We've now increased the size of the duplicate reply cache by quite a bit, but the number of hash buckets has not changed. So, we've gone from an average hash chain length of 16 in the old code to 4096 when the cache is its largest. Change the code to scale out the number of buckets with the max size of the cache. At the same time, we also need to fix the hash function since the existing one isn't really suitable when there are more than 256 buckets. Move instead to use the stock hash_32 function for this. Testing on a machine that had 2048 buckets showed that this gave a smaller longest:average ratio than the existing hash function: The formula here is longest hash bucket searched divided by average number of entries per bucket at the time that we saw that longest bucket: old hash: 68/(39258/2048) == 3.547404 hash_32: 45/(33773/2048) == 2.728807 Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd: keep stats on worst hash balancing seen so farJeff Layton
The typical case with the DRC is a cache miss, so if we keep track of the max number of entries that we've ever walked over in a search, then we should have a reasonable estimate of the longest hash chain that we've ever seen. With that, we'll also keep track of the total size of the cache when we see the longest chain. In the case of a tie, we prefer to track the smallest total cache size in order to properly gauge the worst-case ratio of max vs. avg chain length. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd: add new reply_cache_stats file in nfsdfsJeff Layton
For presenting statistics relating to duplicate reply cache. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd: track memory utilization by the DRCJeff Layton
Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd: break out comparator into separate functionJeff Layton
Break out the function that compares the rqstp and checksum against a reply cache entry. While we're at it, track the efficacy of the checksum over the NFS data by tracking the cases where we would have incorrectly matched a DRC entry if we had not tracked it or the length. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03nfsd: eliminate one of the DRC cache searchesJeff Layton
The most common case is to do a search of the cache, followed by an insert. In the case where we have to allocate an entry off the slab, then we end up having to redo the search, which is wasteful. Better optimize the code for the common case by eliminating the initial search of the cache and always preallocating an entry. In the case of a cache hit, we'll end up just freeing that entry but that's preferable to an extra search. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-03f2fs: reduce redundant spin_lock operationsJaegeuk Kim
This patch reduces redundant spin_lock operations in alloc_nid_failed(). The alloc_nid_failed() does not need to delete entry and add one again by triggering spin_lock and spin_unlock redundantly. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03f2fs: add NULL pointer checkP J P
Commit - fa9150a84c - replaces a call to generic_writepages() in f2fs_write_data_pages() with write_cache_pages(), with a function pointer argument pointing to routine: __f2fs_writepage. -> https://git.kernel.org/linus/fa9150a84ca333f68127097c4fa1eda4b3913a22 This patch adds a NULL pointer check in f2fs_write_data_pages() to avoid a possible NULL pointer dereference, in case if - mapping->a_ops->writepage - is NULL. Signed-off-by: P J P <ppandit@redhat.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03f2fs: fix the bitmap consistency of dirty segmentsJaegeuk Kim
Like below, there are 8 segment bitmaps for SSR victim candidates. enum dirty_type { DIRTY_HOT_DATA, /* dirty segments assigned as hot data logs */ DIRTY_WARM_DATA, /* dirty segments assigned as warm data logs */ DIRTY_COLD_DATA, /* dirty segments assigned as cold data logs */ DIRTY_HOT_NODE, /* dirty segments assigned as hot node logs */ DIRTY_WARM_NODE, /* dirty segments assigned as warm node logs */ DIRTY_COLD_NODE, /* dirty segments assigned as cold node logs */ DIRTY, /* to count # of dirty segments */ PRE, /* to count # of entirely obsolete segments */ NR_DIRTY_TYPE }; The upper 6 bitmaps indicates segments dirtied by active log areas respectively. And, the DIRTY bitmap integrates all the 6 bitmaps. For example, o DIRTY_HOT_DATA : 1010000 o DIRTY_WARM_DATA: 0100000 o DIRTY_COLD_DATA: 0001000 o DIRTY_HOT_NODE : 0000010 o DIRTY_WARM_NODE: 0000001 o DIRTY_COLD_NODE: 0000000 In this case, o DIRTY : 1111011, which means that we should guarantee the consistency between DIRTY and other bitmaps concreately. However, the SSR mode selects victims freely from any log types, which can set multiple bits across the various bitmap types. So, this patch eliminates this inconsistency. Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03f2fs: avoid race for summary informationJaegeuk Kim
In order to do GC more reliably, I'd like to lock the vicitm summary page until its GC is completed, and also prevent any checkpoint process. Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03f2fs: allocate remained free segments in the LFS modeJaegeuk Kim
This patch adds a new condition that allocates free segments in the current active section even if SSR is needed. Otherwise, f2fs cannot allocate remained free segments in the section since SSR finds dirty segments only. Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03f2fs: check completion of foreground GCJaegeuk Kim
The foreground GCs are triggered under not enough free sections. So, we should not skip moving valid blocks in the victim segments. Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03f2fs: change GC bitmaps to apply the section granularityJaegeuk Kim
This patch removes a bitmap for victim segments selected by foreground GC, and modifies the other bitmap for victim segments selected by background GC. 1) foreground GC bitmap : We don't need to manage this, since we just only one previous victim section number instead of the whole victim history. The f2fs uses the victim section number in order not to allocate currently GC'ed section to current active logs. 2) background GC bitmap : This bitmap is used to avoid selecting victims repeatedly by background GCs. In addition, the victims are able to be selected by foreground GCs, since there is no need to read victim blocks during foreground GCs. By the fact that the foreground GC reclaims segments in a section unit, it'd be better to manage this bitmap based on the section granularity. Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03f2fs: allocate new segment aligned with sectionsJaegeuk Kim
When allocating a new segment under the LFS mode, we should keep the section boundary. Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-04-03f2fs: remove redundant lock_page callsJaegeuk Kim
In get_node_page, we do not need to call lock_page all the time. If the node page is cached as uptodate, 1. grab_cache_page locks the page, 2. read_node_page unlocks the page, and 3. lock_page is called for further process. Let's avoid this. Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>