path: root/fs
Age | Commit message | Author
2016-02-22 | f2fs: increase i_size to avoid missing data | Jaegeuk Kim
While finsert is dirtying pages, we should increase i_size right away. Otherwise, the moved page can be dropped by the following filemap_write_and_wait_range before i_size is updated. In particular, this can happen via

	if ((page->index >= end_index + 1) || !offset)
		goto out;

in f2fs_write_data_page. This should resolve the below xfstests/091 failure reported by Dave.

$ diff -u tests/generic/091.out /home/dave/src/xfstests-dev/results//f2fs/generic/091.out.bad
--- tests/generic/091.out	2014-01-20 16:57:33.000000000 +1100
+++ /home/dave/src/xfstests-dev/results//f2fs/generic/091.out.bad	2016-02-08 15:21:02.701375087 +1100
@@ -1,7 +1,18 @@
 QA output created by 091
 fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
+mapped writes DISABLED
+skipping insert range behind EOF
+skipping insert range behind EOF
+truncating to largest ever: 0x11e00
+dowrite: write: Invalid argument
+LOG DUMP (7 total operations):
+1( 1 mod 256): SKIPPED (no operation)
+2( 2 mod 256): SKIPPED (no operation)
+3( 3 mod 256): FALLOC 0x2e0f2 thru 0x3134a (0x3258 bytes) PAST_EOF
+4( 4 mod 256): SKIPPED (no operation)
+5( 5 mod 256): SKIPPED (no operation)
+6( 6 mod 256): TRUNCATE UP from 0x0 to 0x11e00
+7( 7 mod 256): WRITE 0x73400 thru 0x79fff (0x6c00 bytes) HOLE
+Log of operations saved to "/mnt/test/junk.fsxops"; replay with --replay-ops
+Correct content saved for comparison
+(maybe hexdump "/mnt/test/junk" vs "/mnt/test/junk.fsxgood")

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: preallocate blocks for buffered aio writes | Jaegeuk Kim
This patch preallocates data blocks for buffered aio writes. With this patch, we can avoid redundant locking and unlocking of node pages for consecutive aio requests.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: move dio preallocation into f2fs_file_write_iter | Jaegeuk Kim
This patch moves preallocation code for direct IOs into f2fs_file_write_iter.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: fix missing skip pages info | Yunlei He
Fix the missing skipped-pages info in the f2fs_writepages trace event.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: introduce f2fs_submit_merged_bio_cond | Chao Yu
f2fs uses a single bio buffer per data type (META/NODE/DATA) to cache writes located at block addresses that are as contiguous as possible. After submitting, some writes may still be cached in the bio buffer, so we have to flush them by calling f2fs_submit_merged_bio. Unfortunately, in high-concurrency scenarios the bio buffer may already have been flushed by someone else before we submit it, because:
a) there is no space left in the bio buffer;
b) a request of a different type (SYNC, ASYNC) was added;
c) a discontinuous block address was added.
Under those conditions, calling f2fs_submit_merged_bio anyway is devastating, because it can break the subsequent merging of writes in the bio buffer, splitting one big bio into two smaller ones.
This patch introduces f2fs_submit_merged_bio_cond, which submits the bio buffer conditionally. Before submitting, it checks whether:
- a page in the DATA-type bio buffer matches the specified page;
- a page in the DATA-type bio buffer belongs to the specified inode;
- a page in the NODE-type bio buffer belongs to the specified inode.
If there is no eligible page in the bio buffer, we skip the submitting step, which gives us more chances to merge consecutive block IOs in the bio cache.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
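A minimal sketch of that conditional-submit idea; the helper has_merged_page() and the exact locking are illustrative assumptions, not the actual f2fs code:

static void submit_merged_bio_cond(struct f2fs_bio_info *io,
				   struct inode *inode, struct page *page)
{
	bool need_submit = false;

	down_read(&io->io_rwsem);
	if (io->bio)	/* scan the cached bio for a matching page/inode */
		need_submit = has_merged_page(io->bio, inode, page);
	up_read(&io->io_rwsem);

	/* No eligible page cached: skip submission, so later writes can
	 * still merge into the same bio. */
	if (need_submit)
		f2fs_submit_merged_bio(io->sbi, io->fio.type, WRITE);
}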
2016-02-22 | f2fs: fix conflict on page->private usage | Jaegeuk Kim
This patch fixes a conflict on the page->private value between f2fs_trace_pid and atomic pages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: flush bios to handle cp_error in put_super | Jaegeuk Kim
Sometimes, if cp_error is set, pages remain under writeback, resulting in a kernel hang in put_super.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: wait on page's writeback in writepages path | Jaegeuk Kim
As in f2fs_write_cache_pages, let's do the same for node and meta pages. Especially for node blocks, we should do this before marking their fsync and dentry flags.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: speed up handling holes in fiemap | Chao Yu
This patch makes f2fs_map_blocks support returning the next potential page offset, which skips the hole region in the indirect tree of an inode, and uses it to speed up fiemap in the big-hole case.

Test method:
  xfs_io -f /mnt/f2fs/file -c "pwrite 1099511627776 4096"
  time xfs_io -f /mnt/f2fs/file -c "fiemap -v"

Before:
  /mnt/f2fs/file:
  EXT: FILE-OFFSET               BLOCK-RANGE   TOTAL       FLAGS
    0: [0..2147483647]:          hole          2147483648
    1: [2147483648..2147483655]: 81920..81927  8           0x1
  real    3m3.518s
  user    0m0.000s
  sys     3m3.456s

After:
  /mnt/f2fs/file:
  EXT: FILE-OFFSET               BLOCK-RANGE   TOTAL       FLAGS
    0: [0..2147483647]:          hole          2147483648
    1: [2147483648..2147483655]: 81920..81927  8           0x1
  real    0m0.008s
  user    0m0.000s
  sys     0m0.008s

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: introduce get_next_page_offset to speed up SEEK_DATA | Chao Yu
When seeking data in ->llseek, if we encounter a big hole which covers several dnode pages, we try to seek data from the index of the first page of the next dnode page; at most we can skip searching (ADDRS_PER_BLOCK - 1) pages. However, this is still inefficient: if an indirect or double-indirect pointer is NULL, no dnode pages exist in the subtree that pointer covers, so it is unnecessary to search that whole region.
This patch introduces get_next_page_offset to calculate the next page offset based on the current search level and the max search level returned from get_dnode_of_data; with this, we can skip an entire area whose indirect or double-indirect node block does not exist.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
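Roughly, the skip computation can be sketched like this; hedged: ADDRS_PER_BLOCK/NIDS_PER_BLOCK are f2fs's per-node fanout constants, but the function and its offset handling (e.g. the inode's own direct pointers) are illustrative, not the patch's actual get_next_page_offset:

static pgoff_t next_offset_after_hole(pgoff_t pgofs, int missing_level)
{
	pgoff_t span;

	switch (missing_level) {
	case 1:	/* a direct node block is missing */
		span = ADDRS_PER_BLOCK;
		break;
	case 2:	/* an indirect node block is missing */
		span = ADDRS_PER_BLOCK * NIDS_PER_BLOCK;
		break;
	default: /* a double-indirect node block is missing */
		span = ADDRS_PER_BLOCK * NIDS_PER_BLOCK * NIDS_PER_BLOCK;
		break;
	}
	/* first offset belonging to the next subtree of that size */
	return (pgofs / span + 1) * span;
}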
2016-02-22 | f2fs: remove unneeded pointer conversion | Chao Yu
There is a redundant pointer conversion in the following call stack: at position (a), the inode is converted to f2fs_inode_info; at position (b), the f2fs_inode_info is converted back to the inode.

  - truncate_blocks(inode, ..)
    - fi = F2FS_I(inode)                 ---(a)
    - ADDRS_PER_PAGE(node_page, fi)
      - addrs_per_inode(fi)
        - inode = &fi->vfs_inode         ---(b)
        - f2fs_has_inline_xattr(inode)
          - fi = F2FS_I(inode)
          - is_inode_flag_set(fi, ..)

In order to avoid the unneeded conversion, alter ADDRS_PER_PAGE and addrs_per_inode to accept a parameter of inode-pointer type.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
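The shape of the interface change, sketched from the description above (the macro bodies are approximate, not copied from the patch):

/* Before: callers converted inode to f2fs_inode_info themselves. */
#define ADDRS_PER_PAGE(page, fi)	\
	(IS_INODE(page) ? addrs_per_inode(fi) : ADDRS_PER_BLOCK)

/* After: take the inode directly, avoiding the (a)/(b) round trip above. */
#define ADDRS_PER_PAGE(page, inode)	\
	(IS_INODE(page) ? addrs_per_inode(inode) : ADDRS_PER_BLOCK)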
2016-02-22 | f2fs: simplify __allocate_data_blocks | Chao Yu
This patch uses the existing function f2fs_map_block to simplify the implementation of __allocate_data_blocks.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: simplify f2fs_map_blocks | Chao Yu
In f2fs_map_blocks, we use duplicated code to handle the mapping of the first block and of the following blocks, which is unnecessary. This patch simplifies f2fs_map_blocks to avoid the copied code.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: introduce lifetime write IO statistics | Shuoran Liu
This patch introduces lifetime write IO statistics exposed through the sysfs interface. The write IO amount is obtained from the block layer, accumulated in the file system, and stored in the hot node summary of the checkpoint.

Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Pengyang Hou <houpengyang@huawei.com>
[Jaegeuk Kim: add sysfs documentation]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: give scheduling point in shrinking path | Jaegeuk Kim
We need to give a chance to reschedule while shrinking slab entries.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: improve shrink performance of extent nodes | Hou Pengyang
In the worst case, we need to scan the whole radix tree and every rb-tree to free the victim extent_nodes when shrinking.
Pengyang initially introduced a victim_list to record the victim extent_nodes and free them by just scanning a list. Later, Chao Yu enhanced the original patch to improve the memory footprint by removing the victim list. The lru-list shrinking policy becomes:
 1) lock the lru list's lock
 2) trylock the extent tree's lock
 3) remove the extent node from the lru list
 4) unlock the lru list's lock
 5) do the shrink
 6) repeat 1) to 5)

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
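A hedged sketch of that 1)-6) order; the symbol names approximate f2fs's extent cache, not the exact code:

static unsigned int shrink_extent_nodes(struct f2fs_sb_info *sbi,
					unsigned int nr_to_free)
{
	unsigned int freed = 0;
	struct extent_node *en;
	struct extent_tree *et;

	while (freed < nr_to_free) {
		spin_lock(&sbi->extent_lock);			/* 1) lru lock */
		en = list_first_entry_or_null(&sbi->extent_list,
					      struct extent_node, list);
		if (!en) {
			spin_unlock(&sbi->extent_lock);
			break;
		}
		et = en->et;
		if (!write_trylock(&et->lock)) {		/* 2) trylock tree */
			/* contended: rotate and try the next entry */
			list_move_tail(&en->list, &sbi->extent_list);
			spin_unlock(&sbi->extent_lock);
			continue;
		}
		list_del_init(&en->list);			/* 3) off the lru */
		spin_unlock(&sbi->extent_lock);			/* 4) drop lru lock */

		__detach_extent_node(sbi, et, en);		/* 5) do shrink */
		kmem_cache_free(extent_node_slab, en);
		write_unlock(&et->lock);
		freed++;
	}							/* 6) repeat 1)-5) */
	return freed;
}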
2016-02-22 | f2fs: don't set cached_en if it will be freed | Jaegeuk Kim
If en has an empty list pointer, it will be freed soon, so we don't need to set cached_en to it.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: move extent_node list operations being coupled with rbtree operation | Jaegeuk Kim
This patch moves extent_node list operations to be handled together with its rbtree operations.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: reconstruct the code to free an extent_node | Hou Pengyang
There are three steps to free an extent node: 1) list_del_init, 2) __detach_extent_node, 3) kmem_cache_free.
In the f2fs_destroy_extent_tree path the order is 1 -> 2 -> 3, but in the f2fs_update_extent_tree_range path it is 2 -> 1 -> 3. This patch makes the order 1 -> 2 -> 3 everywhere.
This makes sense because the next patch introduces a victim list in the shrink_extent_tree path, and we can then check whether an extent_node is on the victim list via list_empty(). So it is necessary to put step 1) first.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
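The unified teardown order, sketched; the helper names follow the message, while the wrapper itself is illustrative:

static void __free_extent_node(struct f2fs_sb_info *sbi,
			       struct extent_tree *et,
			       struct extent_node *en)
{
	/* 1) unhook first, so list_empty(&en->list) later tells whether
	 * the node is still queued on the lru/victim list */
	list_del_init(&en->list);
	__detach_extent_node(sbi, et, en);	/* 2) remove from rbtree */
	kmem_cache_free(extent_node_slab, en);	/* 3) release memory */
}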
2016-02-22 | f2fs: use wq_has_sleeper for cp_wait wait_queue | Jaegeuk Kim
We need to use wq_has_sleeper, which includes the needed smp_mb, to handle cp_wait concurrency correctly.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
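The pattern being adopted, sketched: the barrier pairs with the one on the waiter side in prepare_to_wait(), so a waker cannot miss a sleeper that checked the condition just before sleeping. The wake_cp_waiters() wrapper is illustrative:

static inline bool wq_has_sleeper(wait_queue_head_t *wq)
{
	smp_mb();	/* order the wake-condition write vs. this check */
	return waitqueue_active(wq);
}

/* waker side, e.g. on write-back completion: */
static void wake_cp_waiters(struct f2fs_sb_info *sbi)
{
	if (wq_has_sleeper(&sbi->cp_wait))
		wake_up(&sbi->cp_wait);
}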
2016-02-22 | f2fs: avoid unnecessary search while finding victim in gc | Fan Li
The variable nsearched in get_victim_by_default() indicates the number of dirty segments we have already checked. There are two problems with the way it is updated:
1. When p.ofs_unit is greater than 1, the victim we find consists of multiple segments, possibly more than one of them dirty, but nsearched always increases by 1.
2. If a segment has been found but not chosen, nsearched won't increase. So even when we have checked all dirty segments, nsearched may still be less than p.max_search.
Both problems cause unnecessary searching after all dirty segments have already been checked.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: delete unnecessary wait for page writeback | Yunlei He
There is no need to wait for writeback of an inline file page, since no one uses it, so this patch deletes the unnecessary wait.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: use wait_for_stable_page to avoid contention | Jaegeuk Kim
In write_begin, if the storage does not require stable pages, we don't need to wait for writeback before updating the page's contents. This patch uses wait_for_stable_page, which waits only when the backing device requires it, instead of wait_on_page_writeback.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
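A sketch of the substitution; the wrapper name is hypothetical, while wait_for_stable_page() is the existing VFS helper:

static void f2fs_prepare_write_page(struct page *page)
{
	/* before: unconditional wait */
	/* wait_on_page_writeback(page); */

	/* after: waits only when the bdi demands stable page content
	 * during writeback, and returns immediately otherwise */
	wait_for_stable_page(page);
}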
2016-02-22 | f2fs: enhance foreground GC | Chao Yu
If we configure a section to consist of multiple segments, foreground GC does garbage collection with the following approach:

  for each segment in victim section
       blk_start_plug
       for each valid block in segment
            write out by OPU method
       submit bio cache                   <---
       blk_finish_plug                    <---

There are two issues:
1) Most of the time, 'submit bio cache' breaks the merging in the current bio buffer with writes from the next segments, resulting in smaller bio submissions.
2) The block plug only covers IO submission within one segment, which reduces the opportunity of merging IOs from multiple segments in one plug.

So refactor the code into the structure below to strive for the biggest opportunity of merging IOs:

  blk_start_plug
  for each segment in victim section
       for each valid block in segment
            write out by OPU method
  submit bio cache
  blk_finish_plug

Test method:
1. mkfs.f2fs -s 8 /dev/sdX
2. touch 32 files
3. write 2M data into each file
4. punch 1.5M data from offset 0 for each file
5. trigger foreground gc through ioctl

Before the patch, 40 bios are submitted in total:
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65536, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65776, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66016, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66256, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66496, size = 32768
----repeat for 8 times

After the patch, 35 bios are submitted in total:
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65536, size = 122880
----repeat 34 times
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 73696, size = 16384

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
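The refactoring, sketched in C; move_valid_blocks() is an assumed helper standing in for the per-block OPU write-out:

static void gc_write_victim_section(struct f2fs_sb_info *sbi,
				    unsigned int start_segno, int nr_segs)
{
	struct blk_plug plug;
	unsigned int segno;

	blk_start_plug(&plug);			/* now spans the whole section */
	for (segno = start_segno; segno < start_segno + nr_segs; segno++)
		move_valid_blocks(sbi, segno);	/* assumed helper: OPU writes */
	f2fs_submit_merged_bio(sbi, DATA, WRITE);	/* single final submit */
	blk_finish_plug(&plug);
}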
2016-02-22 | f2fs: don't need to call set_page_dirty for io error | Jaegeuk Kim
If end_io gets an error, we don't need to set the page dirty, since we already called f2fs_stop_checkpoint, which will not flush any data. This resolves the following warning:

======================================================
[ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
4.4.0+ #9 Tainted: G O
------------------------------------------------------
xfs_io/26773 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 (&(&sbi->inode_lock[i])->rlock){+.+...}, at: [<ffffffffc025483f>] update_dirty_page+0x6f/0xd0 [f2fs]

and this task is already holding:
 (&(&q->__queue_lock)->rlock){-.-.-.}, at: [<ffffffff81396ea2>] blk_queue_bio+0x422/0x490

which would create a new lock dependency:
 (&(&q->__queue_lock)->rlock){-.-.-.} -> (&(&sbi->inode_lock[i])->rlock){+.+...}

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: avoid needless sync_inode_page when reading inline_data | Jaegeuk Kim
In write_begin, if there is inline_data, f2fs loads it into the 0'th data page. Since it's the read path, we don't need to sync its inode page.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: don't need to sync node page at every time | Jaegeuk Kim
In write_end, we don't need to sync the inode page every time. Instead, we can expect f2fs_write_inode to update it later.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: avoid multiple node page writes due to inline_data | Jaegeuk Kim
The scenario is:
1. create fully node blocks
2. flush node blocks
3. write inline_data for all the node blocks again
4. flush node blocks redundantly

So, this patch tries to flush inline_data when flushing node blocks.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: do f2fs_balance_fs when block is allocated | Jaegeuk Kim
We should consider data block allocation to trigger f2fs_balance_fs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: fix to overcome inline_data floods | Jaegeuk Kim
The scenario is:
1. create lots of node blocks
2. sync
3. write lots of inline_data
-> got a panic due to no free space

In that case, we should flush node blocks when writing inline_data in #3, and trigger gc as well.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: use writepages->lock for WB_SYNC_ALL | Jaegeuk Kim
If there are many writepages calls from multiple threads in the background, we don't need to serialize them to merge all the bios, since it's background work. In such a case, it's better to run writepages concurrently.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: remove needless condition check | Jaegeuk Kim
This patch removes a needless condition variable.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: correct search area in get_new_segment | Chao Yu
get_new_segment starts from the current segment position and tries to find a free segment among its right neighbors located in the same section. But previously our search area was set to [current segment, max segment], which means we may search many more bits in the free_segmap bitmap in the worse cases. So here we correct the search area to [current segment, last segment in section] to avoid unnecessary searching.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
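The corrected bound, sketched; GET_SECNO and segs_per_sec approximate f2fs's real macros/fields, and the wrapper itself is illustrative:

static unsigned int find_free_segment_in_section(struct f2fs_sb_info *sbi,
						 unsigned long *free_segmap,
						 unsigned int hint)
{
	/* end at the last segment of hint's section,
	 * not at the last segment of the whole main area */
	unsigned int last = (GET_SECNO(sbi, hint) + 1) * sbi->segs_per_sec;

	/* caller falls back to another section when this returns >= last */
	return find_next_zero_bit(free_segmap, last, hint);
}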
2016-02-22 | f2fs: export dirty_nats_ratio in sysfs | Chao Yu
This patch exports a new sysfs entry 'dirty_nats_ratio' to control the threshold of dirty nat entries; if the current ratio exceeds the configured threshold, a checkpoint will be triggered in f2fs_balance_fs_bg to flush dirty nats.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | f2fs: flush dirty nat entries when exceeding threshold | Chao Yu
When testing f2fs with xfstests, generic/251 gets stuck for a long time. The test uses the steps below to obtain freshly released space in the device, in order to prepare for the following fstrim test:
1. rm -rf /mnt/dir
2. mkdir /mnt/dir/
3. cp -axT `pwd`/ /mnt/dir/
4. goto 1

During the preparation step, all nat entries are cached in the nat cache; most of them are dirty entries with invalid blkaddr, which means the nodes related to these entries have been truncated and could be reused once the dirty entries are checkpointed. However, no checkpoint was triggered, so nid allocators (e.g. mkdir, creat) run into the long journey of iterating all NAT pages, looking for free nids in alloc_nid->build_free_nids.
Here, in f2fs_balance_fs_bg we give another chance to do a checkpoint to flush nat entries, so they can be reused in the free nid cache, whenever the dirty entry count exceeds 10% of the max count.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
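A hedged sketch of the resulting background check; the field names follow f2fs's nat manager, but treat them as assumptions:

static bool excess_dirty_nats(struct f2fs_sb_info *sbi)
{
	struct f2fs_nm_info *nm_i = NM_I(sbi);

	/* dirty entries vs. the configured percentage of all nat entries */
	return nm_i->dirty_nat_cnt >=
		nm_i->max_nid * nm_i->dirty_nats_ratio / 100;
}

void f2fs_balance_fs_bg(struct f2fs_sb_info *sbi)
{
	/* ... existing background triggers ... */
	if (excess_dirty_nats(sbi))
		f2fs_sync_fs(sbi->sb, true);	/* checkpoint flushes dirty nats */
}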
2016-02-22 | f2fs: relocate is_merged_page | Chao Yu
The operations in is_merged_page are related to the inner bio cache, so move it to data.c.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 | mbcache2: Use referenced bit instead of LRU | Jan Kara
Currently we maintain a perfect LRU list by moving an entry to the tail of the list when it gets used. However, these operations on a cache-global list are relatively expensive.
In this patch we switch to lazy updates of the LRU list: whenever an entry gets used, we set a referenced bit on it; when reclaiming entries, we give referenced entries another round in the LRU. Since the list is not a real LRU anymore, rename it to just 'list'.
In my testing this logic gives about a 30% boost to workloads with mostly unique xattr blocks (e.g. xattr-bench with 10 files and 10000 unique xattr values).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
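A sketch of the referenced-bit scheme; struct and field names are illustrative, not the exact mbcache2 code:

/* lookup path: mark the entry used; no global list manipulation */
static void mb2_cache_entry_touch(struct mb2_cache_entry *entry)
{
	set_bit(MB2_CACHE_ENTRY_REFERENCED, &entry->e_flags);
}

/* reclaim path: evict the first unreferenced entry, giving referenced
 * entries a second chance by rotating them to the tail */
static struct mb2_cache_entry *mb2_cache_reclaim_one(struct mb2_cache *cache)
{
	struct mb2_cache_entry *entry;

	spin_lock(&cache->c_list_lock);
	while (!list_empty(&cache->c_list)) {
		entry = list_first_entry(&cache->c_list,
					 struct mb2_cache_entry, e_list);
		if (!test_and_clear_bit(MB2_CACHE_ENTRY_REFERENCED,
					&entry->e_flags)) {
			list_del_init(&entry->e_list);
			spin_unlock(&cache->c_list_lock);
			return entry;
		}
		list_move_tail(&entry->e_list, &cache->c_list);
	}
	spin_unlock(&cache->c_list_lock);
	return NULL;
}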
2016-02-22 | ecryptfs_encrypt_and_encode_filename(): drop unused argument | Al Viro
The last time it got something other than NULL as crypt_stat was back in 2009...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-22 | Merge branch 'multipath' | Trond Myklebust
* multipath:
  NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback
  pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3
  SUNRPC: Allow addition of new transports to a struct rpc_clnt
  NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections
  SUNRPC: Make NFS swap work with multipath
  SUNRPC: Add a helper to apply a function to all the rpc_clnt's transports
  SUNRPC: Allow caller to specify the transport to use
  SUNRPC: Use the multipath iterator to assign a transport to each task
  SUNRPC: Make rpc_clnt store the multipath iterators
  SUNRPC: Add a structure to track multiple transports
  SUNRPC: Make freeing of struct xprt rcu-safe
  SUNRPC: Uninline xprt_get(); It isn't performance critical.
  SUNRPC: Reorder rpc_task to put waitqueue related info in same cachelines
  SUNRPC: Remove unused function rpc_task_reset_client
2016-02-22 | Merge branch 'nfsv41_cb' | Trond Myklebust
* nfsv41_cb:
  NFSv4.x: Fix NFS4ERR_RETRY_UNCACHED_REP in nfs4_callback_sequence
  NFSv4.x: Allow multiple callbacks in flight
  NFSv4.x: Fix wraparound issues when validing the callback sequence id
  NFSv4.x: Enforce the ca_maxresponsesize_cached on the back channel
  NFSv4.x: CB_SEQUENCE should return NFS4ERR_DELAY if still executing
  NFSv4.x: Remove hard coded slotids in callback channel
2016-02-22 | ecryptfs_lookup(): use lookup_one_len_unlocked() | Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-22 | NFSv4.x/pnfs: Fix a race between layoutget and bulk recalls | Trond Myklebust
Replace another case where the layout 'plh_block_lgets' can trigger infinite loops in send_layoutget().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-22 | NFSv4.x/pnfs: Fix a race between layoutget and pnfs_destroy_layout | Trond Myklebust
If the server reboots while there is a layoutget outstanding, then the call to pnfs_choose_layoutget_stateid() will fail with an EAGAIN error, which causes an infinite loop in send_layoutget(). The reason why we never break out of the loop is that the layout field 'plh_block_lgets' is never cleared.
The fix is to replace plh_block_lgets with NFS_LAYOUT_INVALID_STID, which can be reset after a new layoutget.

Fixes: ab7d763e477c5 ("pNFS: Ensure nfs4_layoutget_prepare returns...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-22 | DLM: Save and restore socket callbacks properly | Bob Peterson
This patch fixes the problems with patch b3a5bbfd7.
1. It removes a return statement from lowcomms_error_report, because it needs to call the original error report in all paths through the function.
2. All socket callbacks are saved and restored, not just sk_error_report, and this is done with proper locking, like sunrpc does.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-02-22 | DLM: Replace nodeid_to_addr with kernel_getpeername | Bob Peterson
This patch replaces the call to nodeid_to_addr with a call to kernel_getpeername. This avoids taking a spinlock because it may potentially be called from a softirq context.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-02-22 | mbcache2: limit cache size | Jan Kara
So far the number of entries in mbcache is limited only by pressure from the shrinker. Since too many entries degrade the hash table, and we generally expect that caching more entries has diminishing returns, limit the number of entries the same way as in the old mbcache: to 16 * hash table size.
Once we exceed the desired maximum number of entries, we schedule background work to reclaim entries. If the background work cannot keep up and the number of entries exceeds two times the desired maximum, we reclaim some entries directly when allocating a new entry.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
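A sketch of that two-level cap; the count field, work item, and mb2_cache_shrink() helper are assumptions for illustration:

#define SYNC_RECLAIM_BATCH	64

static void mb2_cache_note_new_entry(struct mb2_cache *cache)
{
	unsigned long count = atomic_long_inc_return(&cache->c_entry_count);

	if (count > cache->c_max_entries)		/* soft limit: 16 * hash size */
		schedule_work(&cache->c_shrink_work);	/* background reclaim */
	if (count > 2 * cache->c_max_entries)		/* hard limit */
		mb2_cache_shrink(cache, SYNC_RECLAIM_BATCH);	/* reclaim inline */
}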
2016-02-22 | mbcache: remove mbcache | Jan Kara
Both ext2 and ext4 are now converted to mbcache2. Remove the old mbcache code.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 | ext2: convert to mbcache2 | Jan Kara
The conversion is generally straightforward. We convert the filesystem from a global cache to a per-fs one. Similarly to ext4, the tricky part is that the xattr block corresponding to a found mbcache entry can get freed before we get the buffer lock for that block. So we have to check whether the entry is still valid after getting the buffer lock.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 | ext4: convert to mbcache2 | Jan Kara
The conversion is generally straightforward. The only tricky part is that the xattr block corresponding to a found mbcache entry can get freed before we get the buffer lock for that block. So we have to check whether the entry is still valid after getting the buffer lock.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 | mbcache2: reimplement mbcache | Jan Kara
The original mbcache was designed to have more features than what ext? filesystems ended up using. It supported an entry being in more hashes, it had home-grown rwlocking of each entry, and one cache could hold entries from multiple filesystems. This genericity also resulted in more complex locking, larger cache entries, and generally more code complexity.
This is a reimplementation of the mbcache functionality to exactly fit the purpose ext? filesystems use it for. Cache entries are now considerably smaller (7 instead of 13 longs), the code is considerably smaller as well (414 vs 913 lines of code), and IMO also simpler. The new code is also much more lightweight.
I have measured the speed using an artificial xattr-bench benchmark, which spawns P processes; each process sets xattrs for F different files, and the value of each xattr is randomly chosen from a pool of V values. Averages of runtimes for 5 runs for various combinations of parameters are below. The first value in each cell is the old mbcache, the second value is the new mbcache.

V=10
F\P     1            2            4            8            16           32            64
10      0.158,0.157  0.208,0.196  0.500,0.277  0.798,0.400  3.258,0.584  13.807,1.047  61.339,2.803
100     0.172,0.167  0.279,0.222  0.520,0.275  0.825,0.341  2.981,0.505  12.022,1.202  44.641,2.943
1000    0.185,0.174  0.297,0.239  0.445,0.283  0.767,0.340  2.329,0.480  6.342,1.198   16.440,3.888

V=100
F\P     1            2            4            8            16           32            64
10      0.162,0.153  0.200,0.186  0.362,0.257  0.671,0.496  1.433,0.943  3.801,1.345   7.938,2.501
100     0.153,0.160  0.221,0.199  0.404,0.264  0.945,0.379  1.556,0.485  3.761,1.156   7.901,2.484
1000    0.215,0.191  0.303,0.246  0.471,0.288  0.960,0.347  1.647,0.479  3.916,1.176   8.058,3.160

V=1000
F\P     1            2            4            8            16           32            64
10      0.151,0.129  0.210,0.163  0.326,0.245  0.685,0.521  1.284,0.859  3.087,2.251   6.451,4.801
100     0.154,0.153  0.211,0.191  0.276,0.282  0.687,0.506  1.202,0.877  3.259,1.954   8.738,2.887
1000    0.145,0.179  0.202,0.222  0.449,0.319  0.899,0.333  1.577,0.524  4.221,1.240   9.782,3.579

V=10000
F\P     1            2            4            8            16           32            64
10      0.161,0.154  0.198,0.190  0.296,0.256  0.662,0.480  1.192,0.818  2.989,2.200   6.362,4.746
100     0.176,0.174  0.236,0.203  0.326,0.255  0.696,0.511  1.183,0.855  4.205,3.444   19.510,17.760
1000    0.199,0.183  0.240,0.227  1.159,1.014  2.286,2.154  6.023,6.039  ---,10.933    ---,36.620

V=100000
F\P     1            2            4            8            16           32            64
10      0.171,0.162  0.204,0.198  0.285,0.230  0.692,0.500  1.225,0.881  2.990,2.243   6.379,4.771
100     0.151,0.171  0.220,0.210  0.295,0.255  0.720,0.518  1.226,0.844  3.423,2.831   19.234,17.544
1000    0.192,0.189  0.249,0.225  1.162,1.043  2.257,2.093  5.853,4.997  ---,10.399    ---,32.198

We see that the new code is faster in pretty much all the cases, and starting from 4 processes there are significant gains with the new code, resulting in up to 20-times shorter runtimes. Also, for large numbers of cached entries, all values for the old code could not be measured, as the kernel started hitting softlockups and died before the test completed.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>