summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2014-01-14libceph: fix preallocation check in get_reply()Ilya Dryomov
The check that makes sure that we have enough memory allocated to read in the entire header of the message in question is currently busted. It compares front_len of the incoming message with iov_len field of ceph_msg::front structure, which is used primarily to indicate the amount of data already read in, and not the size of the allocated buffer. Under certain conditions (e.g. a short read from a socket followed by that socket's shutdown and owning ceph_connection reset) this results in a warning similar to [85688.975866] libceph: get_reply front 198 > preallocated 122 (4#0) and, through another bug, leads to forever hung tasks and forced reboots. Fix this by comparing front_len with front_alloc_len field of struct ceph_msg, which stores the actual size of the buffer. Fixes: http://tracker.ceph.com/issues/5425 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-14libceph: rename front to front_len in get_reply()Ilya Dryomov
Rename front local variable to front_len in get_reply() to make its purpose more clear. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-14libceph: rename ceph_msg::front_max to front_alloc_lenIlya Dryomov
Rename front_max field of struct ceph_msg to front_alloc_len to make its purpose more clear. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-14f2fs: add delimiter to seperate name and value in debug phraseChangman Lee
Support for f2fs-tools/tools/f2stat to monitor /sys/kernel/debug/f2fs/status Signed-off-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-14f2fs: use spinlock rather than mutex for better speedGu Zheng
With the 2 previous changes, all the long time operations are moved out of the protection region, so here we can use spinlock rather than mutex (orphan_inode_mutex) for lower overhead. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-14f2fs: move alloc new orphan node out of lock protection regionGu Zheng
Move alloc new orphan node out of lock protection region. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-14f2fs: move grabing orphan pages out of protection regionGu Zheng
Move grabing orphan block page out of protection region, and grab all the orphan block pages ahead. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> [Jaegeuk Kim: remove unnecessary code pointed by Chao Yu] Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-14IB/usnic: Append documentation to usnic_transport.h and cleanupUpinder Malhi
Add comment describing usnic_transport_rsrv port and remove extraneous space from usnic_transport.c. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14f2fs: remove the needless parameter of f2fs_wait_on_page_writebackYuan Zhong
"boo sync" parameter is never referenced in f2fs_wait_on_page_writeback. We should remove this parameter. Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2014-01-14IB/usnic: Fix typo "Ignorning" -> "Ignoring"Roland Dreier
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Expose flows via debugfsUpinder Malhi
Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Use for_each_sg instead of a for-loopUpinder Malhi
Use for_each_sg() instead of an explicit for-loop to iterate over scatter-gather list. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Remove superflous parenthesesUpinder Malhi
Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/core: Add RDMA_TRANSPORT_USNIC_UDPUpinder Malhi
Add RDMA_TRANSPORT_USNIC_UDP which will be used by usNIC. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Add UDP support in usnic_ib_qp_grp.[hc]Upinder Malhi
UDP support for qp_grps/qps. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Add UDP support in u*verbs.c, u*main.c and u*util.hUpinder Malhi
Add supports for: 1) Parsing the socket file descriptor pass down from userspace. 2) IP notifiers 3) Encoding the IP in the GID 4) Other aux. changes to support UDP Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB:usnic: Add UDP support to usnic_transport.[hc]Upinder Malhi
This patch provides API for rest of usNIC code to increment or decrement socket's reference count. Auxiliary socket APIs are also provided. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Add UDP support to usnic_fwd.[hc]Upinder Malhi
Add *ip field* to *struct usnic_fwd_dev* as well as new *functions* to manipulate the *ip field.* Furthermore, add new functions for programming UDP flows in the forwarding device. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Update ABI and Version file for UDP supportUpinder Malhi
Expand the kernel/userspace interface so userspace may push down a socket file descriptor to usNIC. Also, bump up the abi and version numbers. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Port over sysfs to new usnic_fwd.hUpinder Malhi
This patch ports usnic_ib_sysfs.c to the new interface of usnic_fwd.h. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Port over usnic_ib_qp_grp.[hc] to new usnic_fwd.hUpinder Malhi
This patch ports usnic_ib_qp_grp.[hc] to the new interface of usnic_fwd.h. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Port over main.c and verbs.c to the usnic_fwd.hUpinder Malhi
This patch ports usnic_ib_main.c, usnic_ib_verbs.c and usnic_ib.h to the new interface of usnic_fwd.h. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Push all forwarding state to usnic_fwd.[hc]Upinder Malhi
Push all of the usnic device forwarding state - such as mtu, mac - to usnic_fwd_dev. Furthermore, usnic_fwd.h exposes a improved interface for rest of the usnic code. The primary improvement is that usnic_fwd.h's flow management interface takes in high-level *filter* and *action* structures now, instead of low-level paramaters such as vnic_idx, rq_idx. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Add struct usnic_transport_specUpinder Malhi
Add *struct usnic_transport_spec* for passing around transport specifications. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Change WARN_ON to lockdep_assert_heldUpinder Malhi
usNIC calls WARN_ON(spin_is_locked..) at few places. In some of these instances, the call is made while holding a spinlock. Change all WARN_ON(spin_is_locked...) calls in usNIC to lockdep_assert_held to make it fool-proof bc the latter can be called while holding a spinlock and unlike spin_is_locked, lockdep_assert_held also works correctly on UP. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14IB/usnic: Add Cisco VIC low-level hardware driverUpinder Malhi
This adds a driver that allows userspace to use UD-like QPs over a proprietary Cisco transport with Cisco's Virtual Interface Cards (VICs), including VIC 1240 and 1280 cards. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Christian Benvenuti <benve@cisco.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14OMAPDSS: DISPC: Fix 34xx overlay scaling calculationIvaylo Dimitrov
commit 7faa92339bbb1e6b9a80983b206642517327eb75 OMAPDSS: DISPC: Handle synclost errors in OMAP3 introduces limits check to prevent SYNCLOST errors on OMAP3. However, it misses the logic found in Nokia kernels that is needed to correctly calculate whether 3 tap or 5 tap rescaler to be used as well as the logic to fallback to 3 taps if 5 taps clock results in too tight horizontal timings. Without that patch "horizontal timing too tight" errors are seen when a video with resolution above 640x350 is tried to be played. The patch is a forward-ported logic found in Nokia N900 and N9/50 kernels. Signed-off-by: Ivaylo Dimitrov <ivo.g.dimitrov.75@gmail.com> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
2014-01-13bridge: move br_net_exit() to br.cWANG Cong
And it can become static. Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13net: usbnet: fix SG initialisationBjørn Mork
Commit 60e453a940ac ("USBNET: fix handling padding packet") added an extra SG entry in case padding is necessary, but failed to update the initialisation of the list. This can cause list traversal to fall off the end of the list, resulting in an oops. Fixes: 60e453a940ac ("USBNET: fix handling padding packet") Reported-by: Thomas Kear <thomas@kear.co.nz> Cc: Ming Lei <ming.lei@canonical.com> Signed-off-by: Bjørn Mork <bjorn@mork.no> Tested-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13net: 3com: fix warning for incorrect type in argumentdingtianhong
The commit c466a9b2b329f7d9982c14eedc83a923d3bc711c (net: 3com: slight optimization of addr compare) cause a warning: "passing argument 1 of 'ether_addr_equal' from incompatible pointer type", so fix it. I think julia will convert ether_addr_equal to ether_addr_equal_64bits later. Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13net: qlcnic: fix warning for incorrect type in argumentdingtianhong
The commit 6878f79a8b71e8c7b0587a1185584f54fd31f185 (net: qlcnic: slight optimization of addr compare) cause a warning "sparse: incorrect type in argument 2 (different type sizes)", so fix it. I think julia will convert ether_addr_equal to ether_addr_equal_64bits later. Cc: Himanshu Madhani <himanshu.madhani@qlogic.com> Cc: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13sh_eth: fix garbled TX error messageSergei Shtylyov
sh_eth_error() in case of a TX error tries to print a message using 2 dev_err() calls with the first string not finished by '\n', so that the resulting message would inevitably come out garbled, with something like "3net eth0: " inserted in the middle. Avoid that by merging 2 calls into one. While at it, insert an empty line after the nearby declaration. Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Reviewed-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Conflicts: net/xfrm/xfrm_policy.c Steffen Klassert says: ==================== This pull request has a merge conflict between commits be7928d20bab ("net: xfrm: xfrm_policy: fix inline not at beginning of declaration") and da7c224b1baa ("net: xfrm: xfrm_policy: silence compiler warning") from the net-next tree and commit 2f3ea9a95c58 ("xfrm: checkpatch erros with inline keyword position") from the ipsec-next tree. The version from net-next can be used, like it is done in linux-next. 1) Checkpatch cleanups, from Weilong Chen. 2) Fix lockdep complaints when pktgen is used with IPsec, from Fan Du. 3) Update pktgen to allow any combination of IPsec transport/tunnel mode and AH/ESP/IPcomp type, from Fan Du. 4) Make pktgen_dst_metrics static, Fengguang Wu. 5) Compile fix for pktgen when CONFIG_XFRM is not set, from Fan Du. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13mtd: nand: use __packed shorthandBrian Norris
To be consistent with the rest of include/linux/mtd/nand.h, we should use the __packed shorthand instead of __attribute__((packed)). Signed-off-by: Brian Norris <computersforpeace@gmail.com> Acked-by: Huang Shijie <b32955@freescale.com>
2014-01-13mtd: nand: support Micron READ RETRYBrian Norris
Micron provides READ RETRY support via the ONFI vendor-specific parameter block (to indicate how many read-retry modes are available) and the ONFI {GET,SET}_FEATURES commands with a vendor-specific feature address (to support reading/switching the current read-retry mode). The recommended sequence is as follows: 1. Perform PAGE_READ operation 2. If no ECC error, we are done 3. Run SET_FEATURES with feature address 89h, mode 1 4. Retry PAGE_READ operation 5. If ECC error and there are remaining supported modes, increment the mode and return to step 3. Otherwise, this is a true ECC error. 6. Run SET_FEATURES with feature address 89h, mode 0, to return to the default state. This patch implements the chip->setup_read_retry() callback for Micron and fills in the chip->read_retries. Tested on Micron MT29F32G08CBADA, which supports 8 read-retry modes. The Micron vendor-specific table was checked against the datasheets for the following Micron NAND: Needs retry Cell-type Part number Vendor revision Byte 180 ----------- --------- ---------------- --------------- ------------ No SLC MT29F16G08ABABA 1 Reserved (0) No MLC MT29F32G08CBABA 1 Reserved (0) No SLC MT29F1G08AACWP 1 0 Yes MLC MT29F32G08CBADA 1 08h Yes MLC MT29F64G08CBABA 2 08h Signed-off-by: Brian Norris <computersforpeace@gmail.com> Acked-by: Huang Shijie <b32955@freescale.com>
2014-01-13mtd: nand: add generic READ RETRY supportBrian Norris
Modern MLC (and even SLC?) NAND can experience a large number of bitflips (beyond the recommended correctability capacity) due to drifts in the voltage threshold (Vt). These bitflips can cause ECC errors to occur well within the expected lifetime of the flash. To account for this, some manufacturers provide a mechanism for shifting the Vt threshold after a corrupted read. The generic pattern seems to be that a particular flash has N read retry modes (where N = 0, traditionally), and after an ECC failure, the host should reconfigure the flash to use the next available mode, then retry the read operation. This process repeats until all bitfips can be corrected or until the host has tried all available retry modes. This patch adds the infrastructure support for a vendor-specific/flash-specific callback, used for setting the read-retry mode (i.e., voltage threshold). For now, this patch always returns the flash to mode 0 (the default mode) after a successful read-retry, according to the flowchart found in Micron's datasheets. This may need to change in the future if it is determined that eventually, mode 0 is insufficient for the majority of the flash cells (and so for performance reasons, we should leave the flash in mode 1, 2, etc.). Signed-off-by: Brian Norris <computersforpeace@gmail.com> Acked-by: Huang Shijie <b32955@freescale.com>
2014-01-13mtd: nand: add ONFI vendor block for MicronBrian Norris
Signed-off-by: Brian Norris <computersforpeace@gmail.com> Acked-by: Huang Shijie <b32955@freescale.com>
2014-01-13mtd: nand: localize ECC failures per pageBrian Norris
ECC failures can be tracked at the page level, not the do_read_ops level (i.e., a potentially multi-page transaction). This helps prepare for READ RETRY support. Signed-off-by: Brian Norris <computersforpeace@gmail.com> Acked-by: Huang Shijie <b32955@freescale.com>
2014-01-13inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait socketsNeal Cardwell
Fix inet_diag_dump_icsk() to reflect the fact that both TCP_TIME_WAIT and TCP_FIN_WAIT2 connections are represented by inet_timewait_sock (not just TIME_WAIT), and for such sockets the tw_substate field holds the real state, which can be either TCP_TIME_WAIT or TCP_FIN_WAIT2. This brings the inet_diag state-matching code in line with the field it uses to populate idiag_state. This is also analogous to the info exported in /proc/net/tcp, where get_tcp4_sock() exports sk->sk_state and get_timewait4_sock() exports tw->tw_substate. Before fixing this, (a) neither "ss -nemoi" nor "ss -nemoi state fin-wait-2" would return a socket in TCP_FIN_WAIT2; and (b) "ss -nemoi state time-wait" would also return sockets in state TCP_FIN_WAIT2. This is an old bug that predates 05dbc7b ("tcp/dccp: remove twchain"). Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13Merge branch 'bonding_rcu'David S. Miller
Veaceslav Falico says: ==================== bonding: fix bond_3ad RCU usage While digging through bond_3ad.c I've found that the RCU usage there is just wrong - it's used as a kind of mutex/spinlock instead of RCU. v3->v4: remove useless goto and wrap __get_first_agg() in proper RCU. v2->v3: make bond_3ad_set_carrier() use RCU read lock for the whole function, so that all other functions will be protected by RCU as well. This way we can use _rcu variants everywhere. v1->v2: use generic primitives instead of _rcu ones cause we can hold RTNL lock without RCU one, which is still safe. This patchset is on top of bond_3ad.c cleanup: http://www.spinics.net/lists/netdev/msg265447.html ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13bonding: fix __get_active_agg() RCU logicVeaceslav Falico
Currently, the implementation is meaningless - once again, we take the slave structure and use it after we've exited RCU critical section. Fix this by removing the rcu_read_lock() from __get_active_agg(), and ensuring that all its callers are holding RCU. Fixes: be79bd048 ("bonding: add RCU for bond_3ad_state_machine_handler()") CC: dingtianhong@huawei.com CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13bonding: fix __get_first_agg RCU usageVeaceslav Falico
Currently, the RCU read lock usage is just wrong - it gets the slave struct under RCU and continues to use it when RCU lock is released. However, it's still safe to do this cause we didn't need the rcu_read_lock() initially - all of the __get_first_agg() callers are either holding RCU read lock or the RTNL lock, so that we can't sync while in it. Fixes: be79bd048 ("bonding: add RCU for bond_3ad_state_machine_handler()") CC: dingtianhong@huawei.com CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13bonding: fix bond_3ad_set_carrier() RCU usageVeaceslav Falico
Currently, its usage is just plainly wrong. It first gets a slave under RCU, and, after releasing the RCU lock, continues to use it - whilst it can be freed. Fix this by ensuring that bond_3ad_set_carrier() holds RCU till it uses its slave (or its agg). Fixes: be79bd048ab ("bonding: add RCU for bond_3ad_state_machine_handler()") CC: dingtianhong@huawei.com CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13arch: Re-sort some Kbuild files to hopefully help avoid some conflictsStephen Rothwell
Checkin: 93ea02bb8435 arch: Clean up asm/barrier.h implementations using asm-generic/barrier.h ... unfortunately left some Kbuild files out of order, which caused unnecessary merge conflicts, in particular with checkin: e3fec2f74f7f lib: Add missing arch generic-y entries for asm-generic/hash.h Put them back in order to make the upcoming merges cleaner. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20140114164420.d296fbcc4be3a5f126c86069@canb.auug.org.au Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Miller <davem@davemloft.net>
2014-01-13Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-mergeDavid S. Miller
Included changes: - drop dependency against CRC16 - move to new release version - add size check at compile time for packet structs - update copyright years in every file - implement new bonding/interface alternation feature Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13net: resort some Kbuild files to hopefully help avoid some conflictsStephen Rothwell
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14md: ensure metadata is writen after raid level change.NeilBrown
level_store() currently does not make sure the metadata is updates to reflect the new raid level. It simply sets MD_CHANGE_DEVS. Any level with a ->thread will quickly notice this and update the metadata. However RAID0 and Linear do not have a thread so no metadata update happens until the array is stopped. At that point the metadata is written. This is later that we would like. While the delay doesn't risk any data it can cause confusion. So if there is no md thread, immediately update the metadata after a level change. Reported-by: Richard Michael <rmichael@edgeofthenet.org> Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md/raid10: avoid fullsync when not necessary.NeilBrown
This is the raid10 equivalent of commit 4f0a5e012cf41321d611e7cad63e1017d143d138 MD RAID1: Further conditionalize 'fullsync' If a device in a newly assembled array is not fully recovered we currently do a fully resync by don't need to. Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md: allow a partially recovered device to be hot-added to an array.NeilBrown
When adding a new device into an array it is normally important to clear any stale data from ->recovery_offset else the new device may not be recovered properly. However when re-adding a device which is known to be nearly in-sync, this is not needed and can be detrimental. The (bitmap-based) resync will still happen, and further recovery is only needed from where-ever it was already up to. So if save_raid_disk is set, signifying a re-add, don't clear ->recovery_offset. Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md: Change handling of save_raid_disk and metadata update during recovery.NeilBrown
Since commit d70ed2e4fafdbef0800e739 MD: Allow restarting an interrupted incremental recovery. we don't write out the metadata to devices while they are recovering. This had a good reason, but has unfortunate consequences. This patch changes things to make them work better. At issue is what happens if the array is shut down while a recovery is happening, particularly a bitmap-guided recovery. Ideally the recovery should pick up where it left off. However the metadata cannot represent the state "A recovery is in process which is guided by the bitmap". Before the above mentioned commit, we wrote metadata to the device which said "this is being recovered and it is up to <here>". So after a restart, a full recovery (not bitmap-guided) would happen from where-ever it was up to. After the commit the metadata wasn't updated so it still said "This device is fully in sync with <this> event count". That leads to a bitmap-based recovery following the whole bitmap, which should be a lot less work than a full recovery from some starting point. So this was an improvement. However updates some metadata but not all leads to other problems. In particular, the metadata written to the fully-up-to-date device record that the array has all devices present (even though some are recovering). So on restart, mdadm wants to find all devices and expects them to have current event counts. Obviously it doesn't (some have old event counts) so (when assembling with --incremental) it waits indefinitely for the rest of the expected devices. It really is wrong to not update all the metadata together. Do that is bound to cause confusion. Instead, we should make it possible to record the truth in the metadata. i.e. we need to be able to record that a device is being recovered based on the bitmap. We already have a Feature flag to say that recovery is happening. We now add another one to say that it is a bitmap-based recovery. With this we can remove the code that disables the write-out of metadata on some devices. So this patch: - moves the setting of 'saved_raid_disk' from add_new_disk to the validate_super methods. This makes sure it is always set properly, both when adding a new device to an array, and when assembling an array from a collection of devices. - Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only used if MD_FEATURE_RECOVERY_OFFSET is set, and record that a bitmap-based recovery is allowed. This is only present in v1.x metadata. v0.90 doesn't support devices which are in the middle of recovery at all. - Only skips writing metadata to Faulty devices. - Also allows rdev state to be set to "-insync" via sysfs. This can be used for external-metadata arrays. When the 'role' is set the device is assumed to be in-sync. If, after setting the role, we set the state to "-insync", the role is moved to saved_raid_disk which effectively says the device is partly in-sync with that slot and needs a bitmap recovery. Cc: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>