summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2013-06-14idle: Enable interrupts in the weak arch_cpu_idle() implementationJames Bottomley
PARISC bootup triggers the warning at kernel/cpu/idle.c:96. That's caused by the weak arch_cpu_idle() implementation, which is provided to avoid that architectures implement idle_poll over and over. The switchover to polling mode happens in the first call of the weak arch_cpu_idle() implementation, but that code fails to reenable interrupts and therefor triggers the warning. Fix this by enabling interrupts in the weak arch_cpu_idle() code. [ tglx: Made the changelog match the patch ] Signed-off-by: James Bottomley <JBottomley@Parallels.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1371236142.2726.43.camel@dabdike Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-14xfs: don't shutdown log recovery on validation errorsDave Chinner
Unfortunately, we cannot guarantee that items logged multiple times and replayed by log recovery do not take objects back in time. When they are taken back in time, the go into an intermediate state which is corrupt, and hence verification that occurs on this intermediate state causes log recovery to abort with a corruption shutdown. Instead of causing a shutdown and unmountable filesystem, don't verify post-recovery items before they are written to disk. This is less than optimal, but there is no way to detect this issue for non-CRC filesystems If log recovery successfully completes, this will be undone and the object will be consistent by subsequent transactions that are replayed, so in most cases we don't need to take drastic action. For CRC enabled filesystems, leave the verifiers in place - we need to call them to recalculate the CRCs on the objects anyway. This recovery problem can be solved for such filesystems - we have a LSN stamped in all metadata at writeback time that we can to determine whether the item should be replayed or not. This is a separate piece of work, so is not addressed by this patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 9222a9cf86c0d64ffbedf567412b55da18763aa3)
2013-06-14xfs: ensure btree root split sets blkno correctlyDave Chinner
For CRC enabled filesystems, the BMBT is rooted in an inode, so it passes through a different code path on root splits than the freespace and inode btrees. This is much less traversed by xfstests than the other trees. When testing on a 1k block size filesystem, I've been seeing ASSERT failures in generic/234 like: XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317 which are generally preceded by a lblock check failure. I noticed this in the bmbt stats: $ pminfo -f xfs.btree.block_map xfs.btree.block_map.lookup value 39135 xfs.btree.block_map.compare value 268432 xfs.btree.block_map.insrec value 15786 xfs.btree.block_map.delrec value 13884 xfs.btree.block_map.newroot value 2 xfs.btree.block_map.killroot value 0 ..... Very little coverage of root splits and merges. Indeed, on a 4k filesystem, block_map.newroot and block_map.killroot are both zero. i.e. the code is not exercised at all, and it's the only generic btree infrastructure operation that is not exercised by a default run of xfstests. Turns out that on a 1k filesystem, generic/234 accounts for one of those two root splits, and that is somewhat of a smoking gun. In fact, it's the same problem we saw in the directory/attr code where headers are memcpy()d from one block to another without updating the self describing metadata. Simple fix - when copying the header out of the root block, make sure the block number is updated correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit ade1335afef556df6538eb02e8c0dc91fbd9cc37)
2013-06-14xfs: fix implicit padding in directory and attr CRC formatsDave Chinner
Michael L. Semon has been testing CRC patches on a 32 bit system and been seeing assert failures in the directory code from xfs/080. Thanks to Michael's heroic efforts with printk debugging, we found that the problem was that the last free space being left in the directory structure was too small to fit a unused tag structure and it was being corrupted and attempting to log a region out of bounds. Hence the assert failure looked something like: ..... #5 calling xfs_dir2_data_log_unused() 36 32 #1 4092 4095 4096 #2 8182 8183 4096 XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568 Where #1 showed the first region of the dup being logged (i.e. the last 4 bytes of a directory buffer) and #2 shows the corrupt values being calculated from the length of the dup entry which overflowed the size of the buffer. It turns out that the problem was not in the logging code, nor in the freespace handling code. It is an initial condition bug that only shows up on 32 bit systems. When a new buffer is initialised, where's the freespace that is set up: [ 172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname() [ 172.316346] #9 calling xfs_dir2_data_log_unused() [ 172.316351] #1 calling xfs_trans_log_buf() 60 63 4096 [ 172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096 Note the offset of the first region being logged? It's 60 bytes into the buffer. Once I saw that, I pretty much knew that the bug was going to be caused by this. Essentially, all direct entries are rounded to 8 bytes in length, and all entries start with an 8 byte alignment. This means that we can decode inplace as variables are naturally aligned. With the directory data supposedly starting on a 8 byte boundary, and all entries padded to 8 bytes, the minimum freespace in a directory block is supposed to be 8 bytes, which is large enough to fit a unused data entry structure (6 bytes in size). The fact we only have 4 bytes of free space indicates a directory data block alignment problem. And what do you know - there's an implicit hole in the directory data block header for the CRC format, which means the header is 60 byte on 32 bit intel systems and 64 bytes on 64 bit systems. Needs padding. And while looking at the structures, I found the same problem in the attr leaf header. Fix them both. Note that this only affects 32 bit systems with CRCs enabled. Everything else is just fine. Note that CRC enabled filesystems created before this fix on such systems will not be readable with this fix applied. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Debugged-by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 8a1fd2950e1fe267e11fc8c85dcaa6b023b51b60)
2013-06-14xfs: don't emit v5 superblock warnings on writeDave Chinner
We write the superblock every 30s or so which results in the verifier being called. Right now that results in this output every 30s: XFS (vda): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled! Use of these features in this kernel is at your own risk! And spamming the logs. We don't need to check for whether we support v5 superblocks or whether there are feature bits we don't support set as these are only relevant when we first mount the filesytem. i.e. on superblock read. Hence for the write verification we can just skip all the checks (and hence verbose output) altogether. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 34510185abeaa5be9b178a41c0a03d30aec3db7e)
2013-06-14staging:iio:ad7291: Rework regulator handlingLars-Peter Clausen
Add platform_data for the driver which allows to specify whether an external reference voltage is used or not. It is not possible to use the return value of regulator_get() for this since the regulator framework is able to return a dummy regulator in case no regulator has been specified. In this case the driver will always get a valid regulator, no matter if it has been specified or not. Also make the regulator non-optional if platform_data states that an external reference voltage should be used. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
2013-06-14staging:iio:ad7291: Use sign_extend32 instead of open-coding itLars-Peter Clausen
This makes it a bit more obvious what is going on than open-coding it. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
2013-06-14staging:iio:ad7291: Use i2c_smbus_{read,write}_word_swapped instead of ↵Lars-Peter Clausen
open-coding it No need to swap the upper and lower byte by hand if there is a helper function which already does this. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
2013-06-14xhci: remove BUG() in xhci_get_endpoint_type()Mathias Nyman
If the endpoint type is unknown, set it to 0 and fail gracefully instead of causing a kernel panic. Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
2013-06-14staging:iio:ad7291: Simplify threshold register lookupLars-Peter Clausen
The AD7291 register map is nicely uniform and we can easily compute the threshold register address for a certain channel using simple math instead of using a look-up table. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
2013-06-14xhci: Remove BUG in xhci_setup_addressable_virt_devMathias Nyman
We may have more speed types in the future, so fail gracefully, rather than causing the kernel to panic. BUG() was called if the device speed was unknown when setting max packet size. Set the max packet size at the same time as the slot speed and get rid of one switch statement with BUG() option completely. [Note: Sarah merged a patch that she wrote that touched the xhci_setup_addressable_virt_dev function with this patch from Mathias for clarity.] Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
2013-06-14staging:iio:ad7291: Remove unnecessary dev_info() from probe()Lars-Peter Clausen
No need spam the log during probe if nothing went wrong. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
2013-06-14staging:iio:ad7291: Remove userspace resetLars-Peter Clausen
There is no reason why userspace should want to trigger a manual reset of the device, so remove this functionality. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
2013-06-14staging:iio:ad7291: Use IIO_VAL_FRACTIONAL_LOG2Lars-Peter Clausen
This is a bit more straight forward and also allows better precession when using the an in-kernel consumer. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
2013-06-14xhci: Remove BUG_ON in xhci_get_input_control_ctx.Sarah Sharp
Fail gracefully, instead of causing the kernel to panic, if the input control context doesn't have the right type (XHCI_CTX_TYPE_INPUT). Push finding the pointer to the input control context up into functions that can fail. Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> Cc: John Youn <johnyoun@synopsys.com>
2013-06-14xhci: Remove BUG_ON() in xhci_alloc_container_ctx.Sarah Sharp
It's horrible coding style to panic the kernel when someone passes you an argument value you didn't expect. In the future, we may want to add additional context types, so it's better to gracefully handle additional context types instead of panicking. Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> Cc: John Youn <johnyoun@synopsys.com>
2013-06-14dlm: disable nagle for SCTPMike Christie
For TCP we disable Nagle and I cannot think of why it would be needed for SCTP. When disabled it seems to improve dlm_lock operations like it does for TCP. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14dlm: retry failed SCTP sendsMike Christie
Currently if a SCTP send fails, we lose the data we were trying to send because the writequeue_entry is released when we do the send. When this happens other nodes will then hang waiting for a reply. This adds support for SCTP to retry the send operation. I also removed the retry limit for SCTP use, because we want to make sure we try every path during init time and for longer failures we want to continually retry in case paths come back up while trying other paths. We will do this until userspace tells us to stop. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14dlm: try other IPs when sctp init assoc failsMike Christie
Currently, if we cannot create a association to the first IP addr that is added to DLM, the SCTP init assoc code will just retry the same IP. This patch adds a simple failover schemes where we will try one of the addresses that was passed into DLM. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14dlm: clear correct bit during sctp init failure handlingMike Christie
We should be testing and cleaing the init pending bit because later when sctp_init_assoc is recalled it will be checking that it is not set and set the bit. We do not want to touch CF_CONNECT_PENDING here because we will queue swork and process_send_sockets will then call the connect_action function. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14dlm: set sctp assoc id during setupMike Christie
sctp_assoc was not getting set so later lookups failed. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14dlm: clear correct init bit during sctp setupMike Christie
We were clearing the base con's init pending flags, but the con for the node was the one with the pending bit set. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14GFS2: fix regression in dir_double_exhashBob Peterson
Recent commit e8830d8 introduced a bug in function dir_double_exhash; it was failing to set h in the fall-back case. This patch corrects it. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-14GFS2: Add atomic_open supportSteven Whitehouse
I've restricted atomic_open to only operate on regular files, although I still don't understand why atomic_open should not be possible also for directories on GFS2. That can always be added in later though, if it makes sense. The ->atomic_open function can be passed negative dentries, which in most cases means either ENOENT (->lookup) or a call to d_instantiate (->create). In the GFS2 case though, we need to actually perform the look up, since we do not know whether there has been a new inode created on another node. The look up calls d_splice_alias which then tries to rehash the dentry - so the solution here is to simply check for that in d_splice_alias. The same issue is likely to affect any other cluster filesystem implementing ->atomic_open Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields fieldses org> Cc: Jeff Layton <jlayton@redhat.com>
2013-06-14ARM64: mm: THP support.Steve Capper
Bring Transparent HugePage support to ARM. The size of a transparent huge page depends on the normal page size. A transparent huge page is always represented as a pmd. If PAGE_SIZE is 4KB, THPs are 2MB. If PAGE_SIZE is 64KB, THPs are 512MB. Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
2013-06-14ARM64: mm: Raise MAX_ORDER for 64KB pages and THP.Steve Capper
The buddy allocator has a default MAX_ORDER of 11, which is too low to allocate enough memory for 512MB Transparent HugePages if our base page size is 64KB. This patch introduces MAX_ZONE_ORDER and sets it to 14 when 64KB pages are used in conjuction with THP, otherwise the default value of 11 is used. Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
2013-06-14ARM64: mm: HugeTLB support.Steve Capper
Add huge page support to ARM64, different huge page sizes are supported depending on the size of normal pages: PAGE_SIZE is 4KB: 2MB - (pmds) these can be allocated at any time. 1024MB - (puds) usually allocated on bootup with the command line with something like: hugepagesz=1G hugepages=6 PAGE_SIZE is 64KB: 512MB - (pmds) usually allocated on bootup via command line. Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
2013-06-14ARM64: mm: Move PTE_PROT_NONE bit.Steve Capper
Under ARM64, PTEs can be broadly categorised as follows: - Present and valid: Bit #0 is set. The PTE is valid and memory access to the region may fault. - Present and invalid: Bit #0 is clear and bit #1 is set. Represents present memory with PROT_NONE protection. The PTE is an invalid entry, and the user fault handler will raise a SIGSEGV. - Not present (file or swap): Bits #0 and #1 are clear. Memory represented has been paged out. The PTE is an invalid entry, and the fault handler will try and re-populate the memory where necessary. Huge PTEs are block descriptors that have bit #1 clear. If we wish to represent PROT_NONE huge PTEs we then run into a problem as there is no way to distinguish between regular and huge PTEs if we set bit #1. To resolve this ambiguity this patch moves PTE_PROT_NONE from bit #1 to bit #2 and moves PTE_FILE from bit #2 to bit #3. The number of swap/file bits is reduced by 1 as a consequence, leaving 60 bits for file and swap entries. Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
2013-06-14ARM64: mm: Make PAGE_NONE pages read only and no-execute.Steve Capper
If we consider the following code sequence: my_pte = pte_modify(entry, myprot); x = pte_write(my_pte); y = pte_exec(my_pte); If myprot comes from a PROT_NONE page, then x and y will both be true which is undesireable behaviour. This patch sets the no-execute and read-only bits for PAGE_NONE such that the code above will return false for both x and y. Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
2013-06-14ARM64: mm: Restore memblock limit when map_mem finished.Steve Capper
In paging_init the memblock limit is set to restrict any addresses returned by early_alloc to fit within the initial direct kernel mapping in swapper_pg_dir. This allows map_mem to allocate puds, pmds and ptes from the initial direct kernel mapping. The limit stays low after paging_init() though, meaning any bootmem allocations will be from a restricted subset of memory. Gigabyte huge pages, for instance, are normally allocated from bootmem as their order (18) is too large for the default buddy allocator (MAX_ORDER = 11). This patch restores the memblock limit when map_mem has finished, allowing gigabyte huge pages (and other objects) to be allocated from all of bootmem. Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
2013-06-14mm: thp: Correct the HPAGE_PMD_ORDER check.Steve Capper
All Transparent Huge Pages are allocated by the buddy allocator. A compile time check is in place that fails when the order of a transparent huge page is too large to be allocated by the buddy allocator. Unfortunately that compile time check passes when: HPAGE_PMD_ORDER == MAX_ORDER ( which is incorrect as the buddy allocator can only allocate memory of order strictly less than MAX_ORDER. ) This patch updates the compile time check to fail in the above case. Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2013-06-14x86: mm: Remove general hugetlb code from x86.Steve Capper
huge_pte_alloc, huge_pte_offset and follow_huge_p[mu]d have already been copied over to mm. This patch removes the x86 copies of these functions and activates the general ones by enabling: CONFIG_ARCH_WANT_GENERAL_HUGETLB Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2013-06-14mm: hugetlb: Copy general hugetlb code from x86 to mm.Steve Capper
The huge_pte_alloc, huge_pte_offset and follow_huge_p[mu]d functions in x86/mm/hugetlbpage.c do not rely on any architecture specific knowledge other than the fact that pmds and puds can be treated as huge ptes. To allow other architectures to use this code (and reduce the need for code duplication), this patch copies these functions into mm, replaces the use of pud_large with pud_huge and provides a config flag to activate them: CONFIG_ARCH_WANT_GENERAL_HUGETLB If CONFIG_ARCH_WANT_HUGE_PMD_SHARE is also active then the huge_pmd_share code will be called by huge_pte_alloc (othewise we call pmd_alloc and skip the sharing code). Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2013-06-14x86: mm: Remove x86 version of huge_pmd_share.Steve Capper
The huge_pmd_share code has been copied over to mm/hugetlb.c to make it accessible to other architectures. Remove the x86 copy of the huge_pmd_share code and enable the ARCH_WANT_HUGE_PMD_SHARE config flag. That way we reference the general one. Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2013-06-14mm: hugetlb: Copy huge_pmd_share from x86 to mm.Steve Capper
Under x86, multiple puds can be made to reference the same bank of huge pmds provided that they represent a full PUD_SIZE of shared huge memory that is aligned to a PUD_SIZE boundary. The code to share pmds does not require any architecture specific knowledge other than the fact that pmds can be indexed, thus can be beneficial to some other architectures. This patch copies the huge pmd sharing (and unsharing) logic from x86/ to mm/ and introduces a new config option to activate it: CONFIG_ARCH_WANTS_HUGE_PMD_SHARE Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2013-06-14tcm_qla2xxx: Fix residual for underrun commands that failRoland Dreier
Suppose an initiator sends a DATA IN command with an allocation length shorter than the FC transfer length -- we get a target message like TARGET_CORE[qla2xxx]: Expected Transfer Length: 256 does not match SCSI CDB Length: 0 for SAM Opcode: 0x12 In that case, the target core adjusts the data_length and sets se_cmd->residual_count for the underrun. But now suppose that command fails and we end up in tcm_qla2xxx_queue_status() -- that function unconditionally overwrites residual_count with the already adjusted data_length, and the initiator will burp with a message like qla2xxx [0000:00:06.0]-301d:0: Dropped frame(s) detected (0x100 of 0x100 bytes). Fix this by adding on to the existing underflow residual count instead. Signed-off-by: Roland Dreier <roland@purestorage.com> Cc: Giridhar Malavali <giridhar.malavali@qlogic.com> Cc: Chad Dupuis <chad.dupuis@qlogic.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2013-06-14target/iscsi: don't corrupt bh_count in iscsit_stop_time2retain_timer()Jörn Engel
Here is a fun one. Bug seems to have been introduced by commit 140854cb, almost two years ago. I have no idea why we only started seeing it now, but we did. Rough callgraph: core_tpg_set_initiator_node_queue_depth() `-> spin_lock_irqsave(&tpg->session_lock, flags); `-> lio_tpg_shutdown_session() `-> iscsit_stop_time2retain_timer() `-> spin_unlock_bh(&se_tpg->session_lock); `-> spin_lock_bh(&se_tpg->session_lock); `-> spin_unlock_irqrestore(&tpg->session_lock, flags); core_tpg_set_initiator_node_queue_depth() used to call spin_lock_bh(), but 140854cb changed that to spin_lock_irqsave(). However, lio_tpg_shutdown_session() still claims to be called with spin_lock_bh() held, as does iscsit_stop_time2retain_timer(): * Called with spin_lock_bh(&struct se_portal_group->session_lock) held Stale documentation is mostly annoying, but in this case the dropping the lock with the _bh variant is plain wrong. It is also wrong to drop locks two functions below the lock-holder, but I will ignore that bit for now. After some more locking and unlocking we eventually hit this backtrace: ------------[ cut here ]------------ WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0xe8/0x100() Pid: 24645, comm: lio_helper.py Tainted: G O 3.6.11+ Call Trace: [<ffffffff8103e5ff>] warn_slowpath_common+0x7f/0xc0 [<ffffffffa040ae37>] ? iscsit_inc_conn_usage_count+0x37/0x50 [iscsi_target_mod] [<ffffffff8103e65a>] warn_slowpath_null+0x1a/0x20 [<ffffffff810472f8>] local_bh_enable_ip+0xe8/0x100 [<ffffffff815b8365>] _raw_spin_unlock_bh+0x15/0x20 [<ffffffffa040ae37>] iscsit_inc_conn_usage_count+0x37/0x50 [iscsi_target_mod] [<ffffffffa041149a>] iscsit_stop_session+0xfa/0x1c0 [iscsi_target_mod] [<ffffffffa0417fab>] lio_tpg_shutdown_session+0x7b/0x90 [iscsi_target_mod] [<ffffffffa033ede4>] core_tpg_set_initiator_node_queue_depth+0xe4/0x290 [target_core_mod] [<ffffffffa0409032>] iscsit_tpg_set_initiator_node_queue_depth+0x12/0x20 [iscsi_target_mod] [<ffffffffa0415c29>] lio_target_nacl_store_cmdsn_depth+0xa9/0x180 [iscsi_target_mod] [<ffffffffa0331b49>] target_fabric_nacl_base_attr_store+0x39/0x40 [target_core_mod] [<ffffffff811b857d>] configfs_write_file+0xbd/0x120 [<ffffffff81148f36>] vfs_write+0xc6/0x180 [<ffffffff81149251>] sys_write+0x51/0x90 [<ffffffff815c0969>] system_call_fastpath+0x16/0x1b ---[ end trace 3747632b9b164652 ]--- As a pure band-aid, this patch drops the _bh. Signed-off-by: Joern Engel <joern@logfs.org> Cc: stable <stable@vger.kernel.org> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2013-06-13Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is an assortment of crash fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: stop all workers before cleaning up roots Btrfs: fix use-after-free bug during umount Btrfs: init relocate extent_io_tree with a mapping btrfs: Drop inode if inode root is NULL Btrfs: don't delete fs_roots until after we cleanup the transaction
2013-06-13mei: me: clear interrupts on the resume pathTomas Winkler
We need to clear pending interrupts on the resume path. This brings the device into defined state before starting the reset flow This should solve suspend/resume issues: mei_me : wait hw ready failed. status = 0x0 mei_me : version message write failed Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-13mei: nfc: fix nfc device freeingTomas Winkler
The nfc_dev is a static variable and is not cleaned properly upon reset mainly ndev->cl and ndev->cl_info are not set to NULL after freeing which mei_stop:198: mei_me 0000:00:16.0: stopping the device. [ 404.253427] general protection fault: 0000 [#2] SMP [ 404.253437] Modules linked in: mei_me(-) binfmt_misc snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave fuse loop dm_mod hid_generic usbhid hid coretemp acpi_cpufreq mperf kvm_intel kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul snd_hda_codec_hdmi glue_helper aes_x86_64 e1000e snd_hda_intel snd_hda_codec ehci_pci iTCO_wdt iTCO_vendor_support ehci_hcd snd_hwdep xhci_hcd snd_pcm usbcore ptp mei sg microcode snd_timer pps_core i2c_i801 snd pcspkr battery rtc_cmos lpc_ich mfd_core soundcore usb_common snd_page_alloc ac ext3 jbd mbcache drm_kms_helper drm intel_agp i2c_algo_bit intel_gtt i2c_core sd_mod crc_t10dif thermal fan video button processor thermal_sys hwmon ahci libahci libata scsi_mod [last unloaded: mei_me] [ 404.253591] CPU: 0 PID: 5551 Comm: modprobe Tainted: G D W 3.10.0-rc3 #1 [ 404.253611] task: ffff880143cd8300 ti: ffff880144a2a000 task.ti: ffff880144a2a000 [ 404.253619] RIP: 0010:[<ffffffff81334e5d>] [<ffffffff81334e5d>] device_del+0x1d/0x1d0 [ 404.253638] RSP: 0018:ffff880144a2bcf8 EFLAGS: 00010206 [ 404.253645] RAX: 2020302e30202030 RBX: ffff880144fdb000 RCX: 0000000000000086 [ 404.253652] RDX: 0000000000000001 RSI: 0000000000000086 RDI: ffff880144fdb000 [ 404.253659] RBP: ffff880144a2bd18 R08: 0000000000000651 R09: 0000000000000006 [ 404.253666] R10: 0000000000000651 R11: 0000000000000006 R12: ffff880144fdb000 [ 404.253673] R13: ffff880149371098 R14: ffff880144482c00 R15: ffffffffa04710e0 [ 404.253681] FS: 00007f251c59a700(0000) GS:ffff88014e200000(0000) knlGS:0000000000000000 [ 404.253689] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 404.253696] CR2: ffffffffff600400 CR3: 0000000145319000 CR4: 00000000001407f0 [ 404.253703] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 404.253710] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 404.253716] Stack: [ 404.253720] ffff880144fdb000 ffff880143ffe000 ffff880149371098 ffffffffa0471000 [ 404.253732] ffff880144a2bd38 ffffffff8133502d ffff88014e20cf48 ffff880143ffe1d8 [ 404.253744] ffff880144a2bd48 ffffffffa02a4749 ffff880144a2bd58 ffffffffa02a4ba1 [ 404.253755] Call Trace: [ 404.253766] [<ffffffff8133502d>] device_unregister+0x1d/0x60 [ 404.253787] [<ffffffffa02a4749>] mei_cl_remove_device+0x9/0x10 [mei] [ 404.253804] [<ffffffffa02a4ba1>] mei_nfc_host_exit+0x21/0x30 [mei] [ 404.253819] [<ffffffffa029c2dd>] mei_stop+0x3d/0x90 [mei] [ 404.253830] [<ffffffffa046e220>] mei_me_remove+0x60/0xe0 [mei_me] [ 404.253843] [<ffffffff81278f37>] pci_device_remove+0x37/0xb0 [ 404.253855] [<ffffffff81337c68>] __device_release_driver+0x98/0x100 [ 404.253865] [<ffffffff81337d80>] driver_detach+0xb0/0xc0 [ 404.253876] [<ffffffff81336b4f>] bus_remove_driver+0x8f/0x120 [ 404.253891] [<ffffffff81075990>] ? try_to_wake_up+0x2b0/0x2b0 [ 404.253903] [<ffffffff81338a48>] driver_unregister+0x58/0x90 [ 404.253913] [<ffffffff8127906b>] pci_unregister_driver+0x2b/0xb0 [ 404.253924] [<ffffffffa046f244>] mei_me_driver_exit+0x10/0xdcc [mei_me] [ 404.253936] [<ffffffff810a50d8>] SyS_delete_module+0x198/0x2b0 [ 404.253949] [<ffffffff814850d9>] ? do_page_fault+0x9/0x10 [ 404.253961] [<ffffffff81489692>] system_call_fastpath+0x16/0x1b [ 404.253967] Code: 41 5c 41 5d 41 5e 41 5f c9 c3 0f 1f 40 00 55 48 89 e5 41 56 41 55 41 54 49 89 fc 53 48 8b 87 88 00 00 00 4c 8b 37 48 85 c0 74 18 <48> 8b 78 78 4c 89 e2 be 02 00 00 00 48 81 c7 f8 00 00 00 e8 3b [ 404.254048] RIP [<ffffffff81334e5d>] device_del+0x1d/0x1d0 Cc: Samuel Ortiz <sameo@linux.intel.com> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-13mei: init: Flush scheduled work before resetting the deviceSamuel Ortiz
Flushing pending work items before resetting the device makes more sense than doing so afterwards. Some of them, like e.g. the NFC initialization one, find themselves with client IDs changed after the reset, eventually leading to trigger a client.c:mei_me_cl_by_id() warning after a few modprobe/rmmod cycles. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-14ARM: davinci: remove __init atrribute from function declarationLad, Prabhakar
__init attribute does not make sense on function declarations. Get rid of them in mach-davinci. Signed-off-by: Lad, Prabhakar <prabhakar.csengg@gmail.com> Cc: Sekhar Nori <nsekhar@ti.com> Signed-off-by: Sekhar Nori <nsekhar@ti.com>
2013-06-13cpuset: rename @cont to @cgrpLi Zefan
Cont is short for container. control group was named process container at first, but then people found container already has a meaning in linux kernel. Clean up the leftover variable name @cont. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13cgroup: update sane_behavior documentationTejun Heo
f12dc02014 ("cgroup: mark "tasks" cgroup file as insane") and cc5943a781 ("cgroup: mark "notify_on_release" and "release_agent" cgroup files insane") forgot to update the changed behavior documentation in cgroup.h. Update it. Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13cgroup: use percpu refcnt for cgroup_subsys_statesTejun Heo
A css (cgroup_subsys_state) is how each cgroup is represented to a controller. As such, it can be used in hot paths across the various subsystems different controllers are associated with. One of the common operations is reference counting, which up until now has been implemented using a global atomic counter and can have significant adverse impact on scalability. For example, css refcnt can be gotten and put multiple times by blkcg for each IO request. For highops configurations which try to do as much per-cpu as possible, the global frequent refcnting can be very expensive. In general, given the various and hugely diverse paths css's end up being used from, we need to make it cheap and highly scalable. In its usage, css refcnting isn't very different from module refcnting. This patch converts css refcnting to use the recently added percpu_ref. css_get/tryget/put() directly maps to the matching percpu_ref operations and the deactivation logic is no longer necessary as percpu_ref already has refcnt killing. The only complication is that as the refcnt is per-cpu, percpu_ref_kill() in itself doesn't ensure that further tryget operations will fail, which we need to guarantee before invoking ->css_offline()'s. This is resolved collecting kill confirmation using percpu_ref_kill_and_confirm() and initiating the offline phase of destruction after all css refcnt's are confirmed to be seen as killed on all CPUs. The previous patches already splitted destruction into two phases, so percpu_ref_kill_and_confirm() can be hooked up easily. This patch removes css_refcnt() which is used for rcu dereference sanity check in css_id(). While we can add a percpu refcnt API to ask the same question, css_id() itself is scheduled to be removed fairly soon, so let's not bother with it. Just drop the sanity check and use rcu_dereference_raw() instead. v2: - init_cgroup_css() was calling percpu_ref_init() without checking the return value. This causes two problems - the obvious lack of error handling and percpu_ref_init() being called from cgroup_init_subsys() before the allocators are up, which triggers warnings but doesn't cause actual problems as the refcnt isn't used for roots anyway. Fix both by moving percpu_ref_init() to cgroup_create(). - The base references were put too early by percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the refs one extra time. This wasn't noticeable because css's go through another RCU grace period before being freed. Update cgroup_destroy_locked() to grab an extra reference before killing the refcnts. This problem was noticed by Kent. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <koverstreet@google.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Alasdair G. Kergon" <agk@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Glauber Costa <glommer@gmail.com>
2013-06-13Merge branch 'for-3.11' of ↵Tejun Heo
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.11 This is to receive percpu_refcount which will replace atomic_t reference count in cgroup_subsys_state. Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13cgroup: split cgroup destruction into two stepsTejun Heo
Split cgroup_destroy_locked() into two steps and put the latter half into cgroup_offline_fn() which is executed from a work item. The latter half is responsible for offlining the css's, removing the cgroup from internal lists, and propagating release notification to the parent. The separation is to allow using percpu refcnt for css. Note that this allows for other cgroup operations to happen between the first and second halves of destruction, including creating a new cgroup with the same name. As the target cgroup is marked DEAD in the first half and cgroup internals don't care about the names of cgroups, this should be fine. A comment explaining this will be added by the next patch which implements the actual percpu refcnting. As RCU freeing is guaranteed to happen after the second step of destruction, we can use the same work item for both. This patch renames cgroup->free_work to ->destroy_work and uses it for both purposes. INIT_WORK() is now performed right before queueing the work item. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13cgroup: reorder the operations in cgroup_destroy_locked()Tejun Heo
This patch reorders the operations in cgroup_destroy_locked() such that the userland visible parts happen before css offlining and removal from the ->sibling list. This will be used to make css use percpu refcnt. While at it, split out CGRP_DEAD related comment from the refcnt deactivation one and correct / clarify how different guarantees are met. While this patch changes the specific order of operations, it shouldn't cause any noticeable behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13percpu-refcount: implement percpu_tryget() along with ↵Tejun Heo
percpu_ref_kill_and_confirm() Implement percpu_tryget() which stops giving out references once the percpu_ref is visible as killed. Because the refcnt is per-cpu, different CPUs will start to see a refcnt as killed at different points in time and tryget() may continue to succeed on subset of cpus for a while after percpu_ref_kill() returns. For use cases where it's necessary to know when all CPUs start to see the refcnt as dead, percpu_ref_kill_and_confirm() is added. The new function takes an extra argument @confirm_kill which is invoked when the refcnt is guaranteed to be viewed as killed on all CPUs. While this isn't the prettiest interface, it doesn't force synchronous wait and is much safer than requiring the caller to do its own call_rcu(). v2: Patch description rephrased to emphasize that tryget() may continue to succeed on some CPUs after kill() returns as suggested by Kent. v3: Function comment in percpu_ref_kill_and_confirm() updated warning people to not depend on the implied RCU grace period from the confirm callback as it's an implementation detail. Signed-off-by: Tejun Heo <tj@kernel.org> Slightly-Grumpily-Acked-by: Kent Overstreet <koverstreet@google.com>
2013-06-13sctp: fully initialize sctp_outq in sctp_outq_initNeil Horman
In commit 2f94aabd9f6c925d77aecb3ff020f1cc12ed8f86 (refactor sctp_outq_teardown to insure proper re-initalization) we modified sctp_outq_teardown to use sctp_outq_init to fully re-initalize the outq structure. Steve West recently asked me why I removed the q->error = 0 initalization from sctp_outq_teardown. I did so because I was operating under the impression that sctp_outq_init would properly initalize that value for us, but it doesn't. sctp_outq_init operates under the assumption that the outq struct is all 0's (as it is when called from sctp_association_init), but using it in __sctp_outq_teardown violates that assumption. We should do a memset in sctp_outq_init to ensure that the entire structure is in a known state there instead. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: "West, Steve (NSN - US/Fort Worth)" <steve.west@nsn.com> CC: Vlad Yasevich <vyasevich@gmail.com> CC: netdev@vger.kernel.org CC: davem@davemloft.net Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>