summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2012-07-03md: fix up plugging (again).NeilBrown
The value returned by "mddev_check_plug" is only valid until the next 'schedule' as that will unplug things. This could happen at any call to mempool_alloc. So just calling mddev_check_plug at the start doesn't really make sense. So call it just before, or just after, queuing things for the thread. As the action that happens at unplug is to wake the thread, this makes lots of sense. If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we wake thread immediately. RAID5 is a bit different. Requests are queued for the thread and the thread is woken by release_stripe. So we don't need to wake the thread on failure. However the thread doesn't perform certain actions when there is any active plug, so it is important to install a plug before waking the thread. So for RAID5 we install the plug *before* queuing the request and waking the thread. Without this patch it is possible for raid1 or raid10 to queue a request without then waking the thread, resulting in the array locking up. Also change raid10 to only flush_pending_write when there are not active plugs, just like raid1. This patch is suitable for 3.0 or later. I plan to submit it to -stable, but I'll like to let it spend a few weeks in mainline first to be sure it is completely safe. Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md: support re-add of recovering devices.NeilBrown
We currently only allow a device to be re-added if it appear to be in-sync. This is overly restrictive as it may be desirable to re-add a device that is in the middle of recovery. So remove the test for "InSync" - the test on rdev->raid_disk is sufficient to ensure that the re-add will succeed. Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md/raid1: fix bug in read_balance introduced by hot-replaceNeilBrown
When we added hot_replace we doubled the number of devices that could be in a RAID1 array. So we doubled how far read_balance would search. Unfortunately we didn't double the point at which it looped back to the beginning - so it effectively loops over all non-replacement disks twice. This doesn't cause bad behaviour, but it pointless and means we never read from replacement devices. Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03raid5: delayed stripe fixShaohua Li
There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but the two bits have relationship. A delayed stripe can be moved to hold list only when preread active stripe count is below IO_THRESHOLD. If a stripe has both the bits set, such stripe will be in delayed list and preread count not 0, which will make such stripe never leave delayed list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md/raid456: When read error cannot be recovered, record bad blockmajianpeng
We may not be able to fix a bad block if: - the array is degraded - the over-write fails. In these cases we currently eject the device, but we should record a bad block if possible. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md: make 'name' arg to md_register_thread non-optional.NeilBrown
Having the 'name' arg optional and defaulting to the current personality name is no necessary and leads to errors, as when changing the level of an array we can end up using the name of the old level instead of the new one. So make it non-optional and always explicitly pass the name of the level that the array will be. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md/raid10: fix failure when trying to repair a read error.NeilBrown
commit 58c54fcca3bac5bf9290cfed31c76e4c4bfbabaf md/raid10: handle further errors during fix_read_error better. in 3.1 added "r10_sync_page_io" which takes an IO size in sectors. But we were passing the IO size in bytes!!! This resulting in bio_add_page failing, and empty request being sent down, and a consequent BUG_ON in scsi_lib. [fix missing space in error message at same time] This fix is suitable for 3.1.y and later. Cc: stable@vger.kernel.org Reported-by: Christian Balzer <chibi@gol.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03powerpc/mm: remove obsolete comment about page size name arrayScott Wood
The array of names in hugetlbpage.c no longer exists. Signed-off-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: Kill flatdevtree_env.h tooPaul Bolle
Commit 430b01e8f5e524a2bfa50074d97d0bdc2505807b ("[POWERPC] Kill flatdevtree.c") killed the two files including flatdevtree_env.h. It was apparently just an oversight to not kill that header too. Kill it now. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: Fix kernel-doc warningWanpeng Li
Warning(arch/powerpc/kernel/pci_of_scan.c:210): Excess function parameter 'node' description in 'of_scan_pci_bridge' Warning(arch/powerpc/kernel/vio.c:636): No description found for parameter 'desired' Warning(arch/powerpc/kernel/vio.c:636): Excess function parameter 'new_desired' description in 'vio_cmo_set_dev_desired' Warning(arch/powerpc/kernel/vio.c:1270): No description found for parameter 'viodrv' Warning(arch/powerpc/kernel/vio.c:1270): Excess function parameter 'drv' description in '__vio_register_driver' Warning(arch/powerpc/kernel/vio.c:1289): No description found for parameter 'viodrv' Warning(arch/powerpc/kernel/vio.c:1289): Excess function parameter 'driver' description in 'vio_unregister_driver' Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: Fix assmption of end_of_DRAM() returns end addressBharat Bhushan
memblock_end_of_DRAM() returns end_address + 1, not end address. While some code assumes that it returns end address. Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: Optimise the 64bit optimised __clear_userAnton Blanchard
I blame Mikey for this. He elevated my slightly dubious testcase: to benchmark status. And naturally we need to be number 1 at creating zeros. So lets improve __clear_user some more. As Paul suggests we can use dcbz for large lengths. This patch gets the destination cacheline aligned then uses dcbz on whole cachelines. Before: 10485760000 bytes (10 GB) copied, 0.414744 s, 25.3 GB/s After: 10485760000 bytes (10 GB) copied, 0.268597 s, 39.0 GB/s 39 GB/s, a new record. Signed-off-by: Anton Blanchard <anton@samba.org> Tested-by: Olof Johansson <olof@lixom.net> Acked-by: Olof Johansson <olof@lixom.net> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/iommu: Implement IOMMU pools to improve multiqueue adapter performanceAnton Blanchard
At the moment all queues in a multiqueue adapter will serialise against the IOMMU table lock. This is proving to be a big issue, especially with 10Gbit ethernet. This patch creates 4 pools and tries to spread the load across them. If the table is under 1GB in size we revert back to the original behaviour of 1 pool and 1 largealloc pool. We create a hash to map CPUs to pools. Since we prefer interrupts to be affinitised to primary CPUs, without some form of hashing we are very likely to end up using the same pool. As an example, POWER7 has 4 way SMT and with 4 pools all primary threads will map to the same pool. The largealloc pool is reduced from 1/2 to 1/4 of the space to partially offset the overhead of breaking the table up into pools. Some performance numbers were obtained with a Chelsio T3 adapter on two POWER7 boxes, running a 100 session TCP round robin test. Performance improved 69% with this patch applied. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/iommu: Push spinlock into iommu_range_alloc and __iommu_freeAnton Blanchard
In preparation for IOMMU pools, push the spinlock into iommu_range_alloc and __iommu_free. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/iommu: Reduce spinlock coverage in iommu_freeAnton Blanchard
This patch moves tce_free outside of the lock in iommu_free. Some performance numbers were obtained with a Chelsio T3 adapter on two POWER7 boxes, running a 100 session TCP round robin test. Performance improved 25% with this patch applied. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/iommu: Reduce spinlock coverage in iommu_alloc and iommu_freeAnton Blanchard
We currently hold the IOMMU spinlock around tce_build and tce_flush. This causes our spinlock hold times to be much higher than required and can impact multiqueue adapters. This patch moves tce_build and tce_flush outside of the lock in iommu_alloc, and tce_flush outside of the lock in iommu_free. Some performance numbers were obtained with a Chelsio T3 adapter on two POWER7 boxes, running a 100 session TCP round robin test. Performance improved 32% with this patch applied. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/pseries: Disable interrupts around IOMMU percpu data accessesAnton Blanchard
tce_buildmulti_pSeriesLP uses a per cpu page to communicate with the hypervisor. We currently rely on the IOMMU table spinlock but subsequent patches will be removing that so disable interrupts around all accesses of tce_page. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: POWER7 optimised memcpy using VMX and enhanced prefetchAnton Blanchard
Implement a POWER7 optimised memcpy using VMX and enhanced prefetch instructions. This is a copy of the POWER7 optimised copy_to_user/copy_from_user loop. Detailed implementation and performance details can be found in commit a66086b8197d (powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX). I noticed memcpy issues when profiling a RAID6 workload: .memcpy .async_memcpy .async_copy_data .__raid_run_ops .handle_stripe .raid5d .md_thread I created a simplified testcase by building a RAID6 array with 4 1GB ramdisks (booting with brd.rd_size=1048576): # mdadm -CR -e 1.2 /dev/md0 --level=6 -n4 /dev/ram[0-3] I then timed how long it took to write to the entire array: # dd if=/dev/zero of=/dev/md0 bs=1M Before: 892 MB/s After: 999 MB/s A 12% improvement. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: Use enhanced touch instructions in POWER7 copy_to_user/copy_from_userAnton Blanchard
Version 2.06 of the POWER ISA introduced enhanced touch instructions, allowing us to specify a number of attributes including the length of a stream. This patch adds a software stream for both loads and stores in the POWER7 copy_tofrom_user loop. Since the setup is quite complicated and we have to use an eieio to ensure correct ordering of the "GO" command we only do this for copies above 4kB. To quantify any performance improvements we need a working set bigger than the caches so we operate on a 1GB file: # dd if=/dev/zero of=/tmp/foo bs=1M count=1024 And we compare how fast we can read the file: # dd if=/tmp/foo of=/dev/null bs=1M before: 7.7 GB/s after: 9.6 GB/s A 25% improvement. The worst case for this patch will be a completely L1 cache contained copy of just over 4kB. We can test this with the copy_to_user testcase we used to tune copy_tofrom_user originally: http://ozlabs.org/~anton/junkcode/copy_to_user.c # time ./copy_to_user2 -l 4224 -i 10000000 before: 6.807 s after: 6.946 s A 2% slowdown, which seems reasonable considering our data is unlikely to be completely L1 contained. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/pci: cleanup on duplicate assignmentGavin Shan
While creating the PCI root bus through function pci_create_root_bus() of PCI core, it should have assigned the secondary bus number for the newly created PCI root bus. Thus we needn't do the explicit assignment for the secondary bus number again in pcibios_scan_phb(). Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/numa: Fix OF node refcounting bugGavin Shan
The form affinity for NUMA is set to 1 if the firmware supports OPAL. Otherwise, we have to retrieve that from OF node "/chosen". For the latter case, OF node "/chosen" reference count was never decreased. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: POWER7 optimised copy_page using VMX and enhanced prefetchAnton Blanchard
Implement a POWER7 optimised copy_page using VMX and enhanced prefetch instructions. We use enhanced prefetch hints to prefetch both the load and store side. We copy a cacheline at a time and fall back to regular loads and stores if we are unable to use VMX (eg we are in an interrupt). The following microbenchmark was used to assess the impact of the patch: http://ozlabs.org/~anton/junkcode/page_fault_file.c We test MAP_PRIVATE page faults across a 1GB file, 100 times: # time ./page_fault_file -p -l 1G -i 100 Before: 22.25s After: 18.89s 17% faster Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: Rename copyuser_power7_vmx.c to vmx-helper.cAnton Blanchard
Subsequent patches will add more VMX library functions and it makes sense to keep all the c-code helper functions in the one file. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: Clear RI and EE at the same time in system call exitAnton Blanchard
mtmsrd is an expensive instruction, we save a few cycles by doing it once instead of twice. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: Use enhanced touch instructions in POWER7 copy_to_user/copy_from_userAnton Blanchard
Version 2.06 of the POWER ISA introduced enhanced touch instructions, allowing us to specify a number of attributes including the length of a stream. This patch adds a software stream for both loads and stores in the POWER7 copy_tofrom_user loop. Since the setup is quite complicated and we have to use an eieio to ensure correct ordering of the "GO" command we only do this for copies above 4kB. To quantify any performance improvements we need a working set bigger than the caches so we operate on a 1GB file: # dd if=/dev/zero of=/tmp/foo bs=1M count=1024 And we compare how fast we can read the file: # dd if=/tmp/foo of=/dev/null bs=1M before: 7.7 GB/s after: 9.6 GB/s A 25% improvement. The worst case for this patch will be a completely L1 cache contained copy of just over 4kB. We can test this with the copy_to_user testcase we used to tune copy_tofrom_user originally: http://ozlabs.org/~anton/junkcode/copy_to_user.c # time ./copy_to_user2 -l 4224 -i 10000000 before: 6.807 s after: 6.946 s A 2% slowdown, which seems reasonable considering our data is unlikely to be completely L1 contained. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/smp: remove call to ipi_call_lock()/ipi_call_unlock()Yong Zhang
1) call_function.lock used in smp_call_function_many() is just to protect call_function.queue and &data->refs, cpu_online_mask is outside of the lock. And it's not necessary to protect cpu_online_mask, because data->cpumask is pre-calculate and even if a cpu is brougt up when calling arch_send_call_function_ipi_mask(), it's harmless because validation test in generic_smp_call_function_interrupt() will take care of it. 2) For cpu down issue, stop_machine() will guarantee that no concurrent smp_call_fuction() is processing. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: 64bit optimised __clear_userAnton Blanchard
I noticed __clear_user high up in a profile of one of my RAID stress tests. The testcase was doing a dd from /dev/zero which ends up calling __clear_user. __clear_user is basically a loop with a single 4 byte store which is horribly slow. We can do much better by aligning the desination and doing 32 bytes of 8 byte stores in a loop. The following testcase was used to verify the patch: http://ozlabs.org/~anton/junkcode/stress_clear_user.c To show the improvement in performance I ran a dd from /dev/zero to /dev/null on a POWER7 box: Before: # dd if=/dev/zero of=/dev/null bs=1M count=10000 10485760000 bytes (10 GB) copied, 3.72379 s, 2.8 GB/s After: # time dd if=/dev/zero of=/dev/null bs=1M count=10000 10485760000 bytes (10 GB) copied, 0.728318 s, 14.4 GB/s Over 5x faster. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: tracing: Avoid tracepoint duplication with DECLARE_EVENT_CLASSAnton Blanchard
irq_entry, irq_exit, timer_interrupt_entry and timer_interrupt_exit all do the same thing so use DECLARE_EVENT_CLASS to avoid duplicating everything 4 times. This saves quite a lot of space in both instruction text and data: text data bss dec hex filename 9265 19622 16 28903 70e7 arch/powerpc/kernel/irq.o 6817 19019 16 25852 64fc arch/powerpc/kernel/irq.o Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: Enable jump label supportAnton Blanchard
When looking through some instruction traces I noticed our tracepoint checks were inline. It turns out we don't have CONFIG_JUMP_LABEL enabled. By enabling CONFIG_JUMP_LABEL we replace a load/compare/branch with a nop at every tracepoint call. For example in do_IRQ: CONFIG_JUMP_LABEL disabled: stdx 3,11,9 lwz 0,8(29) cmpwi 7,0,0 bne- 7,.L124 bl .irq_enter CONFIG_JUMP_LABEL enabled: stdx 3,11,9 nop bl .irq_enter Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/pseries/cpuidle: Replace pseries_notify_cpuidle_add call with notifierDeepthi Dharwar
The following patch is to remove the pseries_notify_add_cpu() call and replace it by a hot plug notifier. This would prevent cpuidle resources being released and allocated each time cpu comes online on pseries. The earlier design was causing a lockdep problem in start_secondary as reported on this thread -https://lkml.org/lkml/2012/5/17/2 This applies on 3.4-rc7 Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/pseries/iommu: remove default window before attempting DDW manipulationNishanth Aravamudan
An upcoming release of firmware will add DDW extensions, in particular an API to "reset" the DMA window to the original configuration (32-bit, 2GB in size). With that API available, we can safely remove the default window, increasing the resources available to firmware for creation of larger windows for the slot in question -- if we encounter an error, we can use the new API to reset the state of the slot. Further, this same release of firmware will make it a hard requirement for OSes to release the existing window before any other windows will be shown as available, to avoid conflicts in addressing between the two windows. In anticipation of these changes, always remove the default window before we do any DDW manipulations. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/ftrace: Use patch_instruction instead of probe_kernel_write()Steven Rostedt
The patch_instruction() interface is made to modify kernel text. It is safer to use that then the probe_kernel_write() when modifying kernel code. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: Have patch_instruction detect faultsSteven Rostedt
For ftrace to use the patch_instruction code, it needs to check for faults on write. Ftrace updates code all over the kernel, and we need to know if code is updated or not due to protections that are placed on some portions of the kernel. If ftrace does not detect a fault, it will error later on, and it will be much more difficult to find the problem. By changing patch_instruction() to detect faults, then ftrace will be able to make use of it too. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/ftrace: Have PPC skip updating with stop_machine()Steven Rostedt
PowerPC does not have the synchronization issues that x86 has with modifying code on one CPU while another CPU is executing it. The other CPU will either see the old or new code without any issues, unlike x86 which may issue a GPF. Instead of calling the heavy stop_machine, just update the code. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/boot: Only build board support files when required.Tony Breeds
Currently we build all board files regardless of the final zImage target. This is sub-optimal (in terms on compilation) and leads to problems in one platform needlessly causing failures for other platforms. Use the Kconfig variables to selectively construct this board files to build. Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-02Merge branch 'merge' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull a couple more powerpc fixes from Benjamin Herrenschmidt: "Here are two more fixes that I "missed" when scrubbing patchwork last week which are worth still having in 3.5." * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc/kvm: sldi should be sld powerpc/xmon: Use cpumask iterator to avoid warning
2012-07-03security: document no_new_privsAndy Lutomirski
Document no_new_privs. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: James Morris <james.l.morris@oracle.com>
2012-07-03md/raid5: fix refcount problem when blocked_rdev is set.NeilBrown
commit 43220aa0f22cd3ce5b30246d50ccd696d119edea md/raid5: fix a hang on device failure. fixed a hang, but introduced a refcounting in-balance so that if the presence of bad-blocks ever caused an rdev to be 'blocked' we would increment the refcount on the rdev and never decrement it. So added the needed rdev_dec_pending when md_wait_for_blocked_rdev is not called. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md:Add blk_plug in sync_thread.majianpeng
Add blk_plug in sync_thread will increase the performance of sync. Because sync_thread did not blk_plug,so when raid sync, the bio merge not well. Testing environment: SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller. OS:Linux xxx 3.5.0-rc2+ #340 SMP Tue Jun 12 09:00:25 CST 2012 x86_64 x86_64 x86_64 GNU/Linux. RAID5: four ST31000524NS disk. Without blk_plug:recovery speed about 63M/Sec; Add blk_plug:recovery speed about 120M/Sec. Using blktrace: blktrace -d /dev/sdb -w 60 -o -|blkparse -i - without blk_plug: Total (8,16): Reads Queued: 309811, 1239MiB Writes Queued: 0, 0KiB Read Dispatches: 283583, 1189MiB Write Dispatches: 0, 0KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 273351, 1149MiB Writes Completed: 0, 0KiB Read Merges: 23533, 94132KiB Write Merges: 0, 0KiB IO unplugs: 0 Timer unplugs: 0 add blk_plug: Total (8,16): Reads Queued: 428697, 1714MiB Writes Queued: 0, 0KiB Read Dispatches: 3954, 1714MiB Write Dispatches: 0, 0KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 3956, 1715MiB Writes Completed: 0, 0KiB Read Merges: 424743, 1698MiB Write Merges: 0, 0KiB IO unplugs: 0 Timer unplugs: 3384 The ratio of merge will be markedly increased. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md/raid5: In ops_run_io, inc nr_pending before calling md_wait_for_blocked_rdevmajianpeng
In ops_run_io(), the call to md_wait_for_blocked_rdev will decrement nr_pending so we lose the reference we hold on the rdev. So atomic_inc it first to maintain the reference. This bug was introduced by commit 73e92e51b7969ef5477d md/raid5. Don't write to known bad block on doubtful devices. which appeared in 3.0, so patch is suitable for stable kernels since then. Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md/raid5: Do not add data_offset before call to is_badblockmajianpeng
In chunk_aligned_read() we are adding data_offset before calling is_badblock. But is_badblock also adds data_offset, so that is bad. So move the addition of data_offset to after the call to is_badblock. This bug was introduced by commit 31c176ecdf3563140e639 md/raid5: avoid reading from known bad blocks. which first appeared in 3.0. So that patch is suitable for any -stable kernel from 3.0.y onwards. However it will need minor revision for most of those (as the comment didn't appear until recently). Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md/raid5: prefer replacing failed devices over want-replacement devices.NeilBrown
If a RAID5 has both a failed device and a device marked as 'WantReplacement', then we should preferentially replace the failed device. However the current code replaces whichever is found first. So split into 2 loops, check fail failed/missing first, and only check for WantReplacement if nothing is failed or missing. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md/raid10: Don't try to recovery unmatched (and unused) chunks.NeilBrown
If a RAID10 has an odd number of chunks - as might happen when there are an odd number of devices - the last chunk has no pair and so is not mirrored. We don't store data there, but when recovering the last device in an array we retry to recover that last chunk from a non-existent location. This results in an error, and the recovery aborts. When we get to that last chunk we should just stop - there is nothing more to do anyway. This bug has been present since the introduction of RAID10, so the patch is appropriate for any -stable kernel. Cc: stable@vger.kernel.org Reported-by: Christian Balzer <chibi@gol.com> Tested-by: Christian Balzer <chibi@gol.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-02KVM: Sanitize KVM_IRQFD flagsAlex Williamson
We only know of one so far. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-02KVM: Add missing KVM_IRQFD API documentationAlex Williamson
Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-02KVM: Pass kvm_irqfd to functionsAlex Williamson
Prune this down to just the struct kvm_irqfd so we can avoid changing function definition for every flag or field we use. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-02Merge branch 'fixes' of git://github.com/hzhuang1/linux into fixesArnd Bergmann
* 'fixes' of git://github.com/hzhuang1/linux: ARM: pxa: hx4700: Fix basic suspend/resume
2012-07-02Btrfs: run delayed directory updates during log replayChris Mason
While we are resolving directory modifications in the tree log, we are triggering delayed metadata updates to the filesystem btrees. This commit forces the delayed updates to run so the replay code can find any modifications done. It stops us from crashing because the directory deleltion replay expects items to be removed immediately from the tree. Signed-off-by: Chris Mason <chris.mason@fusionio.com> cc: stable@kernel.org
2012-07-02Btrfs: hold a ref on the inode during writepagesJosef Bacik
We can race with unlink and not actually be able to do our igrab in btrfs_add_ordered_extent. This will result in all sorts of problems. Instead of doing the complicated work to try and handle returning an error properly from btrfs_add_ordered_extent, just hold a ref to the inode during writepages. If we cannot grab a ref we know we're freeing this inode anyway and can just drop the dirty pages on the floor, because screw them we're going to invalidate them anyway. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-02Btrfs: fix tree log remove space corner caseJosef Bacik
The tree log stuff can have allocated space that we end up having split across a bitmap and a real extent. The free space code does not deal with this, it assumes that if it finds an extent or bitmap entry that the entire range must fall within the entry it finds. This isn't necessarily the case, so rework the remove function so it can handle this case properly. This fixed two panics the user hit, first in the case where the space was initially in a bitmap and then in an extent entry, and then the reverse case. Thanks, Reported-and-tested-by: Shaun Reich <sreich@kde.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com>