summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2011-10-11xfs: don't serialise adjacent concurrent direct IO appending writesDave Chinner
For append write workloads, extending the file requires a certain amount of exclusive locking to be done up front to ensure sanity in things like ensuring that we've zeroed any allocated regions between the old EOF and the start of the new IO. For single threads, this typically isn't a problem, and for large IOs we don't serialise enough for it to be a problem for two threads on really fast block devices. However for smaller IO and larger thread counts we have a problem. Take 4 concurrent sequential, single block sized and aligned IOs. After the first IO is submitted but before it completes, we end up with this state: IO 1 IO 2 IO 3 IO 4 +-------+-------+-------+-------+ ^ ^ | | | | | | | \- ip->i_new_size \- ip->i_size And the IO is done without exclusive locking because offset <= ip->i_size. When we submit IO 2, we see offset > ip->i_size, and grab the IO lock exclusive, because there is a chance we need to do EOF zeroing. However, there is already an IO in progress that avoids the need for IO zeroing because offset <= ip->i_new_size. hence we could avoid holding the IO lock exlcusive for this. Hence after submission of the second IO, we'd end up this state: IO 1 IO 2 IO 3 IO 4 +-------+-------+-------+-------+ ^ ^ | | | | | | | \- ip->i_new_size \- ip->i_size There is no need to grab the i_mutex of the IO lock in exclusive mode if we don't need to invalidate the page cache. Taking these locks on every direct IO effective serialises them as taking the IO lock in exclusive mode has to wait for all shared holders to drop the lock. That only happens when IO is complete, so effective it prevents dispatch of concurrent direct IO writes to the same inode. And so you can see that for the third concurrent IO, we'd avoid exclusive locking for the same reason we avoided the exclusive lock for the second IO. Fixing this is a bit more complex than that, because we need to hold a write-submission local value of ip->i_new_size to that clearing the value is only done if no other thread has updated it before our IO completes..... Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-10-11xfs: don't serialise direct IO reads on page cache checksDave Chinner
There is no need to grab the i_mutex of the IO lock in exclusive mode if we don't need to invalidate the page cache. Taking these locks on every direct IO effective serialises them as taking the IO lock in exclusive mode has to wait for all shared holders to drop the lock. That only happens when IO is complete, so effective it prevents dispatch of concurrent direct IO reads to the same inode. Fix this by taking the IO lock shared to check the page cache state, and only then drop it and take the IO lock exclusively if there is work to be done. Hence for the normal direct IO case, no exclusive locking will occur. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Joern Engel <joern@logfs.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-10-11cifs: Display strictcache mount option in /proc/mountsSachin Prabhu
Commit d39454ffe4a3c85428483b8a8a8e5e797b6363d5 adds a strictcache mount option. This patch allows the display of this mount option in /proc/mounts when listing shares mounted with the strictcache mount option. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2011-10-11nfsd4: more robust ignoring of WANT bits in OPENJ. Bruce Fields
Mask out the WANT bits right at the start instead of on each use. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-10-11nfsd4: move name-length checks to xdrJ. Bruce Fields
Again, these checks are better in the xdr code. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-10-11xfs: revert to using a kthread for AIL pushingChristoph Hellwig
Currently we have a few issues with the way the workqueue code is used to implement AIL pushing: - it accidentally uses the same workqueue as the syncer action, and thus can be prevented from running if there are enough sync actions active in the system. - it doesn't use the HIGHPRI flag to queue at the head of the queue of work items At this point I'm not confident enough in getting all the workqueue flags and tweaks right to provide a perfectly reliable execution context for AIL pushing, which is the most important piece in XFS to make forward progress when the log fills. Revert back to use a kthread per filesystem which fixes all the above issues at the cost of having a task struct and stack around for each mounted filesystem. In addition this also gives us much better ways to diagnose any issues involving hung AIL pushing and removes a small amount of code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-10-11xfs: force the log if we encounter pinned buffers in .iop_pushbufChristoph Hellwig
We need to check for pinned buffers even in .iop_pushbuf given that inode items flush into the same buffers that may be pinned directly due operations on the unlinked inode list operating directly on buffers. To do this add a return value to .iop_pushbuf that tells the AIL push about this and use the existing log force mechanisms to unpin it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-10-11xfs: do not update xa_last_pushed_lsn for locked itemsChristoph Hellwig
If an item was locked we should not update xa_last_pushed_lsn and thus skip it when restarting the AIL scan as we need to be able to lock and write it out as soon as possible. Otherwise heavy lock contention might starve AIL pushing too easily, especially given the larger backoff once we moved xa_last_pushed_lsn all the way to the target lsn. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-10-11Btrfs: make sure not to defrag extents past i_sizeChris Mason
The btrfs file defrag code will loop through the extents and force COW on them. But there is a concurrent truncate in the middle of the defrag, it might end up defragging the same range over and over again. The problem is that writepage won't go through and do anything on pages past i_size, so the cow won't happen, so the file will appear to still be fragmented. defrag will end up hitting the same extents again and again. In the worst case, the truncate can actually live lock with the defrag because the defrag keeps creating new ordered extents which the truncate code keeps waiting on. The fix here is to make defrag check for i_size inside the main loop, instead of just once before the looping starts. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-10-11nfsd4: move access/deny validity checks to xdr codeJ. Bruce Fields
I'd rather put more of these sorts of checks into standardized xdr decoders for the various types rather than have them cluttering up the core logic in nfs4proc.c and nfs4state.c. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-10-10nfsd4: ignore WANT bits in open downgradeJ. Bruce Fields
We don't use WANT bits yet--and sending them can probably trigger a BUG() further down. Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-10-10nfsd4: cleanup state.h commentsJ. Bruce Fields
These comments are mostly out of date. Reported-by: Bryan Schumaker <bjschuma@netapp.com>
2011-10-10nfsd4: clean up downgrading codeJ. Bruce Fields
In response to some review comments, get rid of the somewhat obscure for-loop with bitops, and improve a comment. Reported-by: Steve Dickson <steved@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-10-10nfsd4: fix state lock usage in LOCKUJ. Bruce Fields
In commit 5ec094c1096ab3bb795651855d53f18daa26afde "nfsd4: extend state lock over seqid replay logic" I modified the exit logic of all the seqid-based procedures except nfsd4_locku(). Fix the oversight. The result of the bug was a double-unlock while handling the LOCKU procedure, and a warning like: [ 142.150014] WARNING: at kernel/mutex-debug.c:78 debug_mutex_unlock+0xda/0xe0() ... [ 142.152927] Pid: 742, comm: nfsd Not tainted 3.1.0-rc1-SLIM+ #9 [ 142.152927] Call Trace: [ 142.152927] [<ffffffff8105fa4f>] warn_slowpath_common+0x7f/0xc0 [ 142.152927] [<ffffffff8105faaa>] warn_slowpath_null+0x1a/0x20 [ 142.152927] [<ffffffff810960ca>] debug_mutex_unlock+0xda/0xe0 [ 142.152927] [<ffffffff813e4200>] __mutex_unlock_slowpath+0x80/0x140 [ 142.152927] [<ffffffff813e42ce>] mutex_unlock+0xe/0x10 [ 142.152927] [<ffffffffa03bd3f5>] nfs4_lock_state+0x35/0x40 [nfsd] [ 142.152927] [<ffffffffa03b0b71>] nfsd4_proc_compound+0x2a1/0x690 [nfsd] [ 142.152927] [<ffffffffa039f9fb>] nfsd_dispatch+0xeb/0x230 [nfsd] [ 142.152927] [<ffffffffa02b1055>] svc_process_common+0x345/0x690 [sunrpc] [ 142.152927] [<ffffffff81058d10>] ? try_to_wake_up+0x280/0x280 [ 142.152927] [<ffffffffa02b16e2>] svc_process+0x102/0x150 [sunrpc] [ 142.152927] [<ffffffffa039f0bd>] nfsd+0xbd/0x160 [nfsd] [ 142.152927] [<ffffffffa039f000>] ? 0xffffffffa039efff [ 142.152927] [<ffffffff8108230c>] kthread+0x8c/0xa0 [ 142.152927] [<ffffffff813e8694>] kernel_thread_helper+0x4/0x10 [ 142.152927] [<ffffffff81082280>] ? kthread_worker_fn+0x190/0x190 [ 142.152927] [<ffffffff813e8690>] ? gs_change+0x13/0x13 Reported-by: Bryan Schumaker <bjschuma@netapp.com> Tested-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-10-10Btrfs: fix recursive auto-defragLi Zefan
Follow those steps: # mount -o autodefrag /dev/sda7 /mnt # dd if=/dev/urandom of=/mnt/tmp bs=200K count=1 # sync # dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc and then it'll go into a loop: writeback -> defrag -> writeback ... It's because writeback writes [8K, 200K] and then writes [0, 8K]. I tried to make writeback know if the pages are dirtied by defrag, but the patch was a bit intrusive. Here I simply set writeback_index when we defrag a file. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-10-10udf: Rename udf_warning to udf_warnJoe Perches
Rename udf_warning to udf_warn for consistency with normal logging uses of pr_warn. Rename function udf_warning to _udf_warn. Remove __func__ from uses and move __func__ to a new udf_warn macro that calls _udf_warn. Add \n's to uses of udf_warn, remove \n from _udf_warn. Coalesce formats. Reviewed-by: NamJae Jeon <linkinjeon@gmail.com> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jan Kara <jack@suse.cz>
2011-10-10udf: Rename udf_error to udf_errJoe Perches
Rename udf_error to udf_err for consistency with normal logging uses of pr_err. Rename function udf_err to _udf_err. Remove __func__ from uses and move __func__ to a new udf_err macro that calls _udf_err. Some of the udf_error uses had \n terminations, some did not so standardize \n's to udf_err uses, remove \n from _udf_err function. Coalesce udf_err formats. One message prefixed with udf_read_super is now prefixed with udf_fill_super. Reviewed-by: NamJae Jeon <linkinjeon@gmail.com> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jan Kara <jack@suse.cz>
2011-10-10udf: Promote some debugging messages to udf_errorJoe Perches
If there is a problem with a scratched disc or loader, it's valuable to know which error occurred. Convert some debug messages to udf_error, neaten those messages too. Add the calculated tag checksum and the read checksum to error message. Make udf_error a public function and move the logging prototypes together. Original-patch-by: NamJae Jeon <linkinjeon@gmail.com> Reviewed-by: NamJae Jeon <linkinjeon@gmail.com> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jan Kara <jack@suse.cz>
2011-10-10ext3: Remove the obsolete broken EXT3_IOC32_WAIT_FOR_READONLY.Tao Ma
There are no user of EXT3_IOC32_WAIT_FOR_READONLY and also it is broken. No one set the set_ro_timer, no one wake up us and our state is set to TASK_INTERRUPTIBLE not RUNNING. So remove it. Cc: Jan Kara <jack@suse.cz> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz>
2011-10-10Merge git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
* git://git.samba.org/sfrench/cifs-2.6: [CIFS] Fix first time message on mount, ntlmv2 upgrade delayed to 3.2
2011-10-08ext4: fix ext4 so it works without CONFIG_PROC_FSFabrice Jouhaud
This fixes a bug which was introduced in dd68314ccf3fb. The problem came from the test of the return value of proc_mkdir which is always false without procfs, and this would initialization of ext4. Signed-off-by: Fabrice Jouhaud <yargil@free.fr> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08ext4: use le32_to_cpu for ext4_extent_idx.ei_block in ext4_ext_search_left()Tao Ma
ext4_extent_idx.e_block is __le32, so use le32_to_cpu() in ext4_ext_search_left(). Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08ext4: remove the obsolete/broken EXT4_IOC_WAIT_FOR_READONLY ioctlTao Ma
There are no users of the EXT4_IOC_WAIT_FOR_READONLY ioctl, and it is also broken. No one sets the set_ro_timer, no one wakes up us and our state is set to TASK_INTERRUPTIBLE not RUNNING. So remove it. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08ext4: fix the comment describing ext4_ext_search_right()Tao Ma
The comment describing what ext4_ext_search_right() does is incorrect. We return 0 in *phys when *logical is the 'largest' allocated block, not smallest. Fix a few other typos while we're at it. Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Tao Ma <boyu.mt@taobao.com>
2011-10-08ext4: remove deprecated oldallocLukas Czerner
For a long time now orlov is the default block allocator in the ext4. It performs better than the old one and no one seems to claim otherwise so we can safely drop it and make oldalloc and orlov mount option deprecated. This is a part of the effort to reduce number of ext4 options hence the test matrix. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-07[CIFS] Fix first time message on mount, ntlmv2 upgrade delayed to 3.2Steve French
Microsoft has a bug with ntlmv2 that requires use of ntlmssp, but we didn't get the required information on when/how to use ntlmssp to old (but once very popular) legacy servers (various NT4 fixpacks for example) until too late to merge for 3.1. Will upgrade to NTLMv2 in NTLMSSP in 3.2 Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com>
2011-10-06ext4: Free resources in some error path in ext4_fill_superTao Ma
Some of the error path in ext4_fill_super don't release the resouces properly. So this patch just try to release them in the right way. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-06ext4: Free resources in ext4_mb_init()'s error pathsTao Ma
In commit 79a77c5ac, we move ext4_mb_init_backend after the allocation of s_locality_group to avoid memory leak in error path, but there are still some other error paths in ext4_mb_init that need to do the same work. So this patch adds all the error patch for ext4_mb_init. And all the pointers are reset to NULL in case the caller may double free them. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-06udf: Add readpages support for udf.Namjae Jeon
Use mpage_readpages() instead of multiple calls to udf_readpage() to reduce the CPU utilization and make performance higher. Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2011-10-05ext3/balloc.c: local functions should be staticH Hartley Sweeten
This quites the sparse noise: warning: symbol 'ext3_trim_all_free' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>
2011-10-04ore/exofs: Change the type of the devices array (API change)Boaz Harrosh
In the pNFS obj-LD the device table at the layout level needs to point to a device_cache node, where it is possible and likely that many layouts will point to the same device-nodes. In Exofs we have a more orderly structure where we have a single array of devices that repeats twice for a round-robin view of the device table This patch moves to a model that can be used by the pNFS obj-LD where struct ore_components holds an array of ore_dev-pointers. (ore_dev is newly defined and contains a struct osd_dev *od member) Each pointer in the array of pointers will point to a bigger user-defined dev_struct. That can be accessed by use of the container_of macro. In Exofs an __alloc_dev_table() function allocates the ore_dev-pointers array as well as an exofs_dev array, in one allocation and does the addresses dance to set everything pointing correctly. It still keeps the double allocation trick for the inodes round-robin view of the table. The device table is always allocated dynamically, also for the single device case. So it is unconditionally freed at umount. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-03Merge branch 'btrfs-3.0' of git://github.com/chrismason/linuxLinus Torvalds
* 'btrfs-3.0' of git://github.com/chrismason/linux: Btrfs: force a page fault if we have a shorty copy on a page boundary
2011-10-03ore: Make ore_striping_info and ore_calc_stripe_info publicBoaz Harrosh
The struct ore_striping_info will be used later in other structures. And ore_calc_stripe_info as well. Rename them make struct ore_striping_info public. ore_calc_stripe_info is still static, will be made public on first use. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-03exofs: Remove unused data_map member from exofs_sb_infoBoaz Harrosh
The struct pnfs_osd_data_map data_map member of exofs_sb_info was never used after mount. In fact all it's members were duplicated by the ore_layout structure. So just remove the duplicated information. Also removed some stupid, but perfectly supported, restrictions on layout parameters. The case where num_devices is not divisible by mirror_count+1 is perfectly fine since the rotating device view will eventually use all the devices it can get. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com>
2011-10-03exofs: Rename struct ore_components comps => ocBoaz Harrosh
ore_components already has a comps member so this leads to things like comps->comps which is annoying. the name oc was already used in new code. So rename all old usage of ore_components comps => ore_components oc. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-03exofs/super.c: local functions should be staticH Hartley Sweeten
This quiets the following sparse noise: warning: symbol 'exofs_sync_fs' was not declared. Should it be static? warning: symbol 'exofs_free_sbi' was not declared. Should it be static? warning: symbol 'exofs_get_parent' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-03exofs/ore.c: local functions should be staticH Hartley Sweeten
This quiets the sparse noise: warning: symbol '_calc_trunk_info' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-03writeback: per-bdi background thresholdWu Fengguang
One thing puzzled me is that in JBOD case, the per-disk writeout performance is smaller than the corresponding single-disk case even when they have comparable bdi_thresh. Tracing shows find that in single disk case, bdi_writeback is always kept high while in JBOD case, it could drop low from time to time and correspondingly bdi_reclaimable could sometimes rush high. The fix is to watch bdi_reclaimable and kick background writeback as soon as it goes high. This resembles the global background threshold but in per-bdi manner. The trick is, as long as bdi_reclaimable does not go high, bdi_writeback naturally won't go low because bdi_reclaimable+bdi_writeback ~= bdi_thresh. With less fluctuated writeback pages, JBOD performance is observed to increase noticeably in various cases. vmstat:nr_written values before/after patch: 3.1.0-rc4-wo-underrun+ 3.1.0-rc4-bgthresh3+ ------------------------ ------------------------ 125596480 +25.9% 158179363 JBOD-10HDD-16G/ext4-100dd-1M-24p-16384M-20:10-X 61790815 +110.4% 130032231 JBOD-10HDD-16G/ext4-10dd-1M-24p-16384M-20:10-X 58853546 -0.1% 58823828 JBOD-10HDD-16G/ext4-1dd-1M-24p-16384M-20:10-X 110159811 +24.7% 137355377 JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X 69544762 +10.8% 77080047 JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X 50644862 +0.5% 50890006 JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X 42677090 +28.0% 54643527 JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X 47491324 +13.3% 53785605 JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X 52548986 +0.9% 53001031 JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X 26783091 +36.8% 36650248 JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X 35526347 +14.0% 40492312 JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X 44670723 -1.1% 44177606 JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X 127996037 +22.4% 156719990 JBOD-10HDD-thresh=2G/ext4-100dd-1M-24p-16384M-2048M:10-X 57518856 +3.8% 59677625 JBOD-10HDD-thresh=2G/ext4-10dd-1M-24p-16384M-2048M:10-X 51919909 +12.2% 58269894 JBOD-10HDD-thresh=2G/ext4-1dd-1M-24p-16384M-2048M:10-X 86410514 +79.0% 154660433 JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X 40132519 +38.6% 55617893 JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X 48423248 +7.5% 52042927 JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X 206041046 +44.1% 296846536 JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X 72312903 -19.4% 58272885 JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X 50635672 -0.5% 50384787 JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X 68308534 +115.7% 147324758 JBOD-10HDD-thresh=800M/ext4-100dd-1M-24p-16384M-800M:10-X 57882933 +14.5% 66269621 JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X 52183472 +12.8% 58855181 JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X 53788956 +94.2% 104460352 JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X 44493342 +35.5% 60298210 JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X 42641209 +18.9% 50681038 JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: add bg_threshold parameter to __bdi_update_bandwidth()Wu Fengguang
No behavior change. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-02btrfs: use readahead API for scrubArne Jansen
Scrub uses a simple tree-enumeration to bring the relevant portions of the extent- and csum-tree into the page cache before starting the scrub-I/O. This is now replaced by using the new readahead-API. During readahead the scrub is being accounted as paused, so it won't hold off transaction commits. This change raises the average disk bandwith utilisation on my test volume from 70% to 90%. On another volume, the time for a test run went down from 89s to 43s. Changes v5: - reada1/2 are now of type struct reada_control * Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02btrfs: hooks for readaheadArne Jansen
This adds the hooks needed for readahead. In the readpage_end_io_hook, the extent state is checked for the EXTENT_READAHEAD flag. Only in this case the readahead hook is called, to keep the impact on non-ra as low as possible. Additionally, a hook for a failed IO is added, otherwise readahead would wait indefinitely for the extent to finish. Changes for v2: - eliminate race condition Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02btrfs: initial readahead code and prototypesArne Jansen
This is the implementation for the generic read ahead framework. To trigger a readahead, btrfs_reada_add must be called. It will start a read ahead for the given range [start, end) on tree root. The returned handle can either be used to wait on the readahead to finish (btrfs_reada_wait), or to send it to the background (btrfs_reada_detach). The read ahead works as follows: On btrfs_reada_add, the root of the tree is inserted into a radix_tree. reada_start_machine will then search for extents to prefetch and trigger some reads. When a read finishes for a node, all contained node/leaf pointers that lie in the given range will also be enqueued. The reads will be triggered in sequential order, thus giving a big win over a naive enumeration. It will also make use of multi-device layouts. Each disk will have its on read pointer and all disks will by utilized in parallel. Also will no two disks read both sides of a mirror simultaneously, as this would waste seeking capacity. Instead both disks will read different parts of the filesystem. Any number of readaheads can be started in parallel. The read order will be determined globally, i.e. 2 parallel readaheads will normally finish faster than the 2 started one after another. Changes v2: - protect root->node by transaction instead of node_lock - fix missed branches: The readahead had a too simple check to determine if a branch from a node should be checked or not. It now also records the upper bound of each node to see if the requested RA range lies within. - use KERN_CONT to debug output, to avoid line breaks - defer reada_start_machine to worker to avoid deadlock Changes v3: - protect root->node by rcu Changes v5: - changed EIO-semantics of reada_tree_block_flagged - remove spin_lock from reada_control and make elems an atomic_t - remove unused read_total from reada_control - kill reada_key_cmp, use btrfs_comp_cpu_keys instead - use kref-style release functions where possible - return struct reada_control * instead of void * from btrfs_reada_add Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02btrfs: state information for readaheadArne Jansen
Add state information for readahead to btrfs_fs_info and btrfs_device Changes v2: - don't wait in radix_trees - add own set of workers for readahead Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02btrfs: add READAHEAD extent buffer flagArne Jansen
Add a READAHEAD extent buffer flag. Add a function to trigger a read with this flag set. Changes v2: - use extent buffer flags instead of extent state flags Changes v5: - adapt to changed read_extent_buffer_pages interface - don't return eb from reada_tree_block_flagged if it has CORRUPT flag set Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02btrfs: add an extra wait mode to read_extent_buffer_pagesArne Jansen
read_extent_buffer_pages currently has two modes, either trigger a read without waiting for anything, or wait for the I/O to finish. The former also bails when it's unable to lock the page. This patch now adds an additional parameter to allow it to block on page lock, but don't wait for completion. Changes v5: - merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and WAIT_PAGE_LOCK Change v6: - fix bug introduced in v5 Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-09-30Merge branch 'btrfs-3.0' into for-linusChris Mason
2011-09-30Btrfs: force a page fault if we have a shorty copy on a page boundaryJosef Bacik
A user reported a problem where ceph was getting into 100% cpu usage while doing some writing. It turns out it's because we were doing a short write on a not uptodate page, which means we'd fall back at one page at a time and fault the page in. The problem is our position is on the page boundary, so our fault in logic wasn't actually reading the page, so we'd just spin forever or until the page got read in by somebody else. This will force a readpage if we end up doing a short copy. Alexandre could reproduce this easily with ceph and reports it fixes his problem. I also wrote a reproducer that no longer hangs my box with this patch. Thanks, Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-29btrfs: integrating raid-repair and scrub-fixup-nodatasumJan Schmidt
This ties nodatasum fixup in scrub together with raid repair patches. While both series are working fine alone, scrub will report uncorrectable errors if they occur in a nodatasum extent *and* the page is in the page cache. Previously, we would have triggered readpage to find good data and do the repair. However, readpage wouldn't read anything in the case where the page is up to date in the cache. So, we simply take that good data we have and call repair_io_failure directly (unless the page in the cache is dirty). Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs: Moved repair code from inode.c to extent_io.cJan Schmidt
The raid-retry code in inode.c can be generalized so that it works for metadata as well. Thus, this patch moves it to extent_io.c and makes the raid-retry code a raid-repair code. Repair works that way: Whenever a read error occurs and we have more mirrors to try, note the failed mirror, and retry another. If we find a good one, check if we did note a failure earlier and if so, do not allow the read to complete until after the bad sector was written with the good data we just fetched. As we have the extent locked while reading, no one can change the data in between. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs: Put mirror_num in bi_bdevJan Schmidt
The error correction code wants to make sure that only the bad mirror is rewritten. Thus, we need to know which mirror is the bad one. I did not find a more apropriate field than bi_bdev. But I think using this is fine, because it is modified by the block layer, anyway, and should not be read after the bio returned. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>