git.armlinux.org.uk/linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2016-05-31	Btrfs: fix race between device replace and read repair	Filipe Manana
	While we are finishing a device replace operation we can have a concurrent task trying to do a read repair operation, in which case it will call btrfs_map_block() to get a struct btrfs_bio which can have a stripe that points to the source device of the device replace operation. This allows for the read repair task to dereference the stripe's device pointer after the device replace operation has freed the source device, resulting in an invalid memory access. This is similar to the problem solved by my previous patch in the same series and named "Btrfs: fix race between device replace and discard". So fix this by surrounding the call to btrfs_map_block() and the code that uses the returned struct btrfs_bio with calls to btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the proper serialization with the finishing phase of the device replace operation. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-31	Btrfs: fix race between device replace and discard	Filipe Manana
	While we are finishing a device replace operation, we can make a discard operation (fs mounted with -o discard) do an invalid memory access like the one reported by the following trace: [ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP [ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs] [ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1 [ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000 [ 3206.388595] RIP: 0010:[<ffffffff8124d233>] [<ffffffff8124d233>] blkdev_issue_discard+0x5c/0x2a7 [ 3206.388595] RSP: 0018:ffff880171b9bb80 EFLAGS: 00010246 [ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000 [ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48 [ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001 [ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8 [ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b [ 3206.388595] FS: 00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000 [ 3206.388595] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0 [ 3206.388595] Stack: [ 3206.388595] 0000000000004878 0000000000000000 0000000002400040 0000000000000000 [ 3206.388595] 0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0 [ 3206.388595] ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0 [ 3206.388595] Call Trace: [ 3206.388595] [<ffffffffa042899e>] btrfs_issue_discard+0x12f/0x143 [btrfs] [ 3206.388595] [<ffffffffa042899e>] ? btrfs_issue_discard+0x12f/0x143 [btrfs] [ 3206.388595] [<ffffffffa042e862>] btrfs_discard_extent+0x87/0xde [btrfs] [ 3206.388595] [<ffffffffa04303b5>] btrfs_finish_extent_commit+0xb2/0x1df [btrfs] [ 3206.388595] [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b [ 3206.388595] [<ffffffffa04464c4>] btrfs_commit_transaction+0x7fc/0x980 [btrfs] [ 3206.388595] [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b [ 3206.388595] [<ffffffffa0459af6>] btrfs_sync_file+0x38f/0x428 [btrfs] [ 3206.388595] [<ffffffff811a8292>] vfs_fsync_range+0x8c/0x9e [ 3206.388595] [<ffffffff811a82c0>] vfs_fsync+0x1c/0x1e [ 3206.388595] [<ffffffff811a8417>] do_fsync+0x31/0x4a [ 3206.388595] [<ffffffff811a8637>] SyS_fsync+0x10/0x14 [ 3206.388595] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 3206.388595] [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14 [ 3206.388595] [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa This happens because when we call btrfs_map_block() from btrfs_discard_extent() to get a btrfs_bio structure, the device replace operation has not finished yet, but before we use the device of one of the stripes from the returned btrfs_bio structure, the device object is freed. This is illustrated by the following diagram. CPU 1 CPU 2 btrfs_dev_replace_start() (...) btrfs_dev_replace_finishing() btrfs_start_transaction() btrfs_commit_transaction() (...) btrfs_sync_file() btrfs_start_transaction() (...) btrfs_commit_transaction() btrfs_finish_extent_commit() btrfs_discard_extent() btrfs_map_block() --> returns a struct btrfs_bio with a stripe that has a device field pointing to source device of the replace operation (the device that is being replaced) mutex_lock(&uuid_mutex) mutex_lock(&fs_info->fs_devices->device_list_mutex) mutex_lock(&fs_info->chunk_mutex) btrfs_dev_replace_update_device_in_mapping_tree() --> iterates the mapping tree and for each extent map that has a stripe pointing to the source device, it updates the stripe to point to the target device instead btrfs_rm_dev_replace_blocked() --> waits for fs_info->bio_counter to go down to 0 btrfs_rm_dev_replace_remove_srcdev() --> removes source device from the list of devices mutex_unlock(&fs_info->chunk_mutex) mutex_unlock(&fs_info->fs_devices->device_list_mutex) mutex_unlock(&uuid_mutex) btrfs_rm_dev_replace_free_srcdev() --> frees the source device --> iterates over all stripes of the returned struct btrfs_bio --> for each stripe it dereferences its device pointer --> it ends up finding a pointer to the device used as the source device for the replace operation and that was already freed So fix this by surrounding the call to btrfs_map_block(), and the code that uses the returned struct btrfs_bio, with calls to btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that the finishing phase of the device replace operation blocks until the the bio counter decreases to zero before it frees the source device. This is the same approach we do at btrfs_map_bio() for example. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30	Merge branch 'uuid' (lib/uuid fixes from Andy)	Linus Torvalds
	Merge lib/uuid fixes from Andy Shevchenko. * emailed patches from Andy Shevchenko <andriy.shevchenko@linux.intel.com>: lib/uuid.c: use correct offset in uuid parser lib/uuid: add a test module
2016-05-30	lib/uuid.c: use correct offset in uuid parser	Bjørn Mork
	Use '+ 0' and '+ 1' as offsets, like they were intended, instead of adding to the result. Fixes: 2b1b0d66704a ("lib/uuid.c: introduce a few more generic helpers") Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-30	lib/uuid: add a test module	Andy Shevchenko
	It appears that somehow I missed a test of the latest UUID rework which landed in the kernel. Present a small test module to avoid such cases in the future. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-30	Merge branch 'linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "This fixes the following issues: - missing selection in public_key that may result in a build failure - Potential crash in error path in omap-sham - ccp AES XTS bug that affects requests larger than 4096" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: ccp - Fix AES XTS error for request sizes above 4096 crypto: public_key: select CRYPTO_AKCIPHER crypto: omap-sham - potential Oops on error in probe
2016-05-30	libceph: use %s instead of %pE in dout()s	Ilya Dryomov
	Commit d30291b985d1 ("libceph: variable-sized ceph_object_id") changed dout()s in what is now encode_request() and ceph_object_locator_to_pg() to use %pE, mostly to document that, although all rbd and cephfs object names are NULL-terminated strings, ceph_object_id will handle any RADOS object name, including the one containing NULs, just fine. However, it turns out that vbin_printf() can't handle anything but ints and %s - all %p suffixes are ignored. The buffer %p** points to isn't recorded, resulting in trash in the messages if the buffer had been reused by the time bstr_printf() got to it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-30	libceph: put request only if it's done in handle_reply()	Ilya Dryomov
	handle_reply() may be called twice on the same request: on ack and then on commit. This occurs on btrfs-formatted OSDs or if cephfs sync write path is triggered - CEPH_OSD_FLAG_ACK \| CEPH_OSD_FLAG_ONDISK. handle_reply() handles this with the help of done_request(). Fixes: 5aea3dcd5021 ("libceph: a major OSD client update") Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-30	libceph: change ceph_osdmap_flag() to take osdc	Ilya Dryomov
	For the benefit of every single caller, take osdc instead of map. Also, now that osdc->osdmap can't ever be NULL, drop the check. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-30	gpio: drop lock before reading GPIO direction	Linus Walleij
	When adding the gpiochip, the GPIO HW drivers' callback get_direction() could get called in atomic context. Some of the GPIO HW drivers may sleep when accessing the register. Move the lock before initializing the descriptors. Reported-by: Laxman Dewangan <ldewangan@nvidia.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2016-05-30	gpio: bail out silently on NULL descriptors	Linus Walleij
	In fdeb8e1547cb9dd39d5d7223b33f3565cf86c28e ("gpio: reflect base and ngpio into gpio_device") assumed that GPIO descriptors are either valid or error pointers, but gpiod_get_[index_]optional() actually return NULL descriptors and then all subsequent calls should just bail out. Cc: stable@vger.kernel.org Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Andrew Lunn <andrew@lunn.ch> Fixes: fdeb8e1547cb ("gpio: reflect base and ngpio into gpio_device") Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2016-05-30	drm/i915: kill STANDARD/CURSOR plane screams	Ville Syrjälä
	Stop yelling the plane type. "STANDARD" doesn't mean anything anyway. Let's just use the plane name here. Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1464371966-15190-8-git-send-email-ville.syrjala@linux.intel.com Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
2016-05-30	drm/i915: Give encoders useful names	Ville Syrjälä
	Rather than let the core generate usless encoder names, let's pass in something that actually identifies the piece of hardware we're dealing with. v2: Use 'DSI %c' instead of 'MIPI %c' for DSI encoders (Jani) v3: Use port_name() in DSI code since we have it Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1464371966-15190-7-git-send-email-ville.syrjala@linux.intel.com Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
2016-05-30	drm/i915: Give meaningful names to all the planes	Ville Syrjälä
	Let's name our planes in a way that makes sense wrt. the spec: - skl+ -> "plane 1A", "plane 2A", "plane 1C", "cursor A" etc. - g4x+ -> "primary A", "primary B", "sprite A", "cursor C" etc. - pre-g4x -> "plane A", "cursor B" etc. v2: Rebase on top of the fixed/cleaned error paths Use a local 'name' variable to make things easier v3: Pass the name as a function argument to drm_universal_plane_init() (Jani) v3: Pass the printf style string to drm_universal_plane_init() Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1464371966-15190-6-git-send-email-ville.syrjala@linux.intel.com Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
2016-05-30	drm/i915: Don't leak primary/cursor planes on crtc init failure	Ville Syrjälä
	Call intel_plane_destroy() instead of drm_plane_cleanup() so that we also free the plane struct itself when bailing out of the crtc init. And make intel_plane_destroy() NULL tolerant to avoid having to check for it in the caller. Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1464371966-15190-5-git-send-email-ville.syrjala@linux.intel.com Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
2016-05-30	drm/i915: Set crtc->name to "pipe A", "pipe B", etc.	Ville Syrjälä
	v2: Fix intel_crtc leak on failure to allocate the name Use a local 'name' variable to make things easier v3: Pass the name as a function arguemnt to drm_crtc_init_with_planes() (Jani) v4: Pass the printf style format string to drm_crtc_init_with_planes() v5: Drop spurious code changes Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1464371966-15190-4-git-send-email-ville.syrjala@linux.intel.com Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
2016-05-30	drm/i915: Use plane->name in debug prints	Ville Syrjälä
	We have plane->name, so let's use that in debug messages instead of just printing the more or less useless object ID. v2: slap on a commit message Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1464371966-15190-3-git-send-email-ville.syrjala@linux.intel.com Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
2016-05-30	drm/i915: Use crtc->name in debug messages	Ville Syrjälä
	We have crtc->name, so let's use that in debug messages instead of just printing the more or less useless object ID. v2: Rebased due to intel_dpll_mgr.c, slap on a commit message Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1464371966-15190-2-git-send-email-ville.syrjala@linux.intel.com Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
2016-05-30	gpio: handle compatible ioctl() pointers	Linus Walleij
	If we're using the compatible ioctl() we need to handle the argument pointer in a special way or there will be trouble. Fixes: 3c702e9987e2 ("gpio: add a userspace chardev ABI for GPIOs") Reported-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2016-05-30	vfio/type1: Fix build warning	Alex Williamson
	This function cannot actually be called with npage = 0, so in practice this doesn't return an uninitialized value. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-05-30	vfio/pci: Fix ordering of eventfd vs virqfd shutdown	Alex Williamson
	Both the INTx and MSI/X disable paths do an eventfd_ctx_put() for the trigger eventfd before calling vfio_virqfd_disable() any potential mask and unmask eventfds. This opens a use-after-free race where an inopportune irqfd can reference the freed signalling eventfd. Reorder to avoid this possibility. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-05-30	cpufreq: intel_pstate: Downgrade print level for _PPC	Srinivas Pandruvada
	Downgrade pr_info to pr_debug for the "_PPC limits will be enforced" message. In server systems with many cores this message is annoying. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw: Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-05-30	Btrfs: fix race between device replace and chunk allocation	Filipe Manana
	While iterating and copying extents from the source device, the device replace code keeps adjusting a left cursor that is used to make sure that once we finish processing a device extent, any future writes to extents from the corresponding block group will get into both the source and target devices. This left cursor is also used for resuming the device replace operation at mount time. However using this left cursor to decide whether writes go into both devices or only the source device is not enough to guarantee we don't miss copying extents into the target device. There are two cases where the current approach fails. The first one is related to when there are holes in the device and they get allocated for new block groups while the device replace operation is iterating the device extents (more on this explained below). The second one is that when that loop over the device extents finishes, we start dellaloc, wait for all ordered extents and then commit the current transaction, we might have got new block groups allocated that are now using a device extent that has an offset greater then or equals to the value of the left cursor, in which case writes to extents belonging to these new block groups will get issued only to the source device. For the first case where the current approach of using a left cursor fails, consider the source device currently has the following layout: [ extent bg A ] [ hole, unallocated space ] [extent bg B ] 3Gb 4Gb 5Gb While we are iterating the device extents from the source device using the commit root of the device tree, the following happens: CPU 1 CPU 2 <we are at transaction N> scrub_enumerate_chunks() --> searches the device tree for extents belonging to the source device using the device tree's commit root --> 1st iteration finds extent belonging to block group A --> sets block group A to RO mode (btrfs_inc_block_group_ro) --> sets cursor left to found_key.offset which is 3Gb --> scrub_chunk() starts copies all allocated extents from block group's A stripe at source device into target device btrfs_alloc_chunk() --> allocates device extent in the range [4Gb, 5Gb[ from the source device for a new block group C extent allocated from block group C for a direct IO, buffered write or btree node/leaf extent is written to, perhaps in response to a writepages() call from the VM or directly through direct IO the write is made only against the source device and not against the target device because the extent's offset is in the interval [4Gb, 5Gb[ which is larger then the value of cursor_left (3Gb) --> scrub_chunks() finishes --> updates left cursor from 3Gb to 4Gb --> btrfs_dec_block_group_ro() sets block group A back to RW mode <we are still at transaction N> --> 2nd iteration finds extent belonging to block group B - it did not find the new extent in the range [4Gb, 5Gb[ for block group C because we are using the device tree's commit root or even because the block group's items are not all yet inserted in the respective btrees, that is, the block group is still attached to some transaction handle's new_bgs list and btrfs_create_pending_block_groups() was not called yet against that transaction handle, so the device extent items were not yet inserted into the devices tree <we are still at transaction N> --> so we end not copying anything from the newly allocated device extent from the source device to the target device So fix this by making __btrfs_map_block() always redirect writes to the target device as well, independently of the left cursor's value. With this change the left cursor is now used only for the purpose of tracking progress and allow a mount operation to resume a device replace. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30	Btrfs: fix race setting block group back to RW mode during device replace	Filipe Manana
	After it finishes processing a device extent, the device replace code sets back the block group to RW mode and then after that it sets the left cursor to match the logical end address of the block group, so that future writes into extents belonging to the block group go both the source (old) and target (new) devices. However from the moment we turn the block group back to RW mode we have a short time window, that lasts until we update the left cursor's value, where extents can be allocated from the block group and written to, in which case they will not be copied/written to the target (new) device. Fix this by updating the left cursor's value before turning the block group back to RW mode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30	Btrfs: fix unprotected assignment of the left cursor for device replace	Filipe Manana
	We were assigning new values to fields of the device replace object without holding the respective lock after processing each device extent. This is important for the left cursor field which can be accessed by a concurrent task running __btrfs_map_block (which, correctly, takes the device replace lock). So change these fields while holding the device replace lock. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30	Btrfs: fix race setting block group readonly during device replace	Filipe Manana
	When we do a device replace, for each device extent we find from the source device, we set the corresponding block group to readonly mode to prevent writes into it from happening while we are copying the device extent from the source to the target device. However just before we set the block group to readonly mode some concurrent task might have already allocated an extent from it or decided it could perform a nocow write into one of its extents, which can make the device replace process to miss copying an extent since it uses the extent tree's commit root to search for extents and only once it finishes searching for all extents belonging to the block group it does set the left cursor to the logical end address of the block group - this is a problem if the respective ordered extents finish while we are searching for extents using the extent tree's commit root and no transaction commit happens while we are iterating the tree, since it's the delayed references created by the ordered extents (when they complete) that insert the extent items into the extent tree (using the non-commit root of course). Example: CPU 1 CPU 2 btrfs_dev_replace_start() btrfs_scrub_dev() scrub_enumerate_chunks() --> finds device extent belonging to block group X <transaction N starts> starts buffered write against some inode writepages is run against that inode forcing dellaloc to run btrfs_writepages() extent_writepages() extent_write_cache_pages() __extent_writepage() writepage_delalloc() run_delalloc_range() cow_file_range() btrfs_reserve_extent() --> allocates an extent from block group X (which is not yet in RO mode) btrfs_add_ordered_extent() --> creates ordered extent Y flush_epd_write_bio() --> bio against the extent from block group X is submitted btrfs_inc_block_group_ro(bg X) --> sets block group X to readonly scrub_chunk(bg X) scrub_stripe(device extent from srcdev) --> keeps searching for extent items belonging to the block group using the extent tree's commit root --> it never blocks due to fs_info->scrub_pause_req as no one tries to commit transaction N --> copies all extents found from the source device into the target device --> finishes search loop bio completes ordered extent Y completes and creates delayed data reference which will add an extent item to the extent tree when run (typically at transaction commit time) --> so the task doing the scrub/device replace at CPU 1 misses this and does not copy this extent into the new/target device btrfs_dec_block_group_ro(bg X) --> turns block group X back to RW mode dev_replace->cursor_left is set to the logical end offset of block group X So fix this by waiting for all cow and nocow writes after setting a block group to readonly mode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30	Btrfs: fix race between device replace and block group removal	Filipe Manana
	When it's finishing, the device replace code iterates all extent maps representing block group and for each one that has a stripe that refers to the source device, it replaces its device with the target device. However when it replaces the source device with the target device it, the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID), only after its ID is changed to match the one from the source device. This leads to races with the chunk removal code that can temporarly see a device with an ID of 0ULL and then attempt to use that ID to remove items from the device tree and fail, causing a transaction abort: [ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished [ 9238.594377] ------------[ cut here ]------------ [ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs] [ 9238.594403] BTRFS: Transaction aborted (error 1) [ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core se rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod fl oppy [last unloaded: btrfs] [ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1 [ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 9238.594421] 0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0 [ 9238.594422] 0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18 [ 9238.594423] 0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0 [ 9238.594424] Call Trace: [ 9238.594428] [<ffffffff8126b42c>] dump_stack+0x67/0x90 [ 9238.594430] [<ffffffff81052b14>] __warn+0xc2/0xdd [ 9238.594432] [<ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53 [ 9238.594434] [<ffffffff8116c311>] ? kmem_cache_free+0x128/0x188 [ 9238.594450] [<ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs] [ 9238.594452] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc [ 9238.594464] [<ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs] [ 9238.594476] [<ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs] [ 9238.594489] [<ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs] [ 9238.594490] [<ffffffff8106f403>] kthread+0xd4/0xdc [ 9238.594494] [<ffffffff8149e242>] ret_from_fork+0x22/0x40 [ 9238.594495] [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286 [ 9238.594496] ---[ end trace 183efbe50275f059 ]--- The sequence of steps leading to this is like the following: CPU 1 CPU 2 btrfs_dev_replace_finishing() at this point dev_replace->tgtdev->devid == BTRFS_DEV_REPLACE_DEVID (0ULL) ... btrfs_start_transaction() btrfs_commit_transaction() btrfs_delete_unused_bgs() btrfs_remove_chunk() looks up for the extent map corresponding to the chunk lock_chunks() (chunk_mutex) check_system_chunk() unlock_chunks() (chunk_mutex) locks fs_info->chunk_mutex btrfs_dev_replace_update_device_in_mapping_tree() --> iterates fs_info->mapping_tree and replaces the device in every extent map's map->stripes[] with dev_replace->tgtdev, which still has an id of 0ULL (BTRFS_DEV_REPLACE_DEVID) iterates over all stripes from the extent map --> calls btrfs_free_dev_extent() passing it the target device that still has an ID of 0ULL --> btrfs_free_dev_extent() fails --> aborts current transaction finishes setting up the target device, namely it sets tgtdev->devid to the value of srcdev->devid (which is necessarily > 0) frees the srcdev unlocks fs_info->chunk_mutex So fix this by taking the device list mutex while processing the stripes for the chunk's extent map. This is similar to the race between device replace and block group creation that was fixed by commit 50460e37186a ("Btrfs: fix race when finishing dev replace leading to transaction abort"). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30	Btrfs: fix race between readahead and device replace/removal	Filipe Manana
	The list of devices is protected by the device_list_mutex and the device replace code, in its finishing phase correctly takes that mutex before removing the source device from that list. However the readahead code was iterating that list without acquiring the respective mutex leading to crashes later on due to invalid memory accesses: [125671.831036] general protection fault: 0000 [#1] PREEMPT SMP [125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport processor ser [125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: G W 4.6.0-rc7-btrfs-next-29+ #1 [125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs] [125671.834973] task: ffff8801ac520540 ti: ffff8801ac918000 task.ti: ffff8801ac918000 [125671.834973] RIP: 0010:[<ffffffff81270479>] [<ffffffff81270479>] __radix_tree_lookup+0x6a/0x105 [125671.834973] RSP: 0018:ffff8801ac91bc28 EFLAGS: 00010206 [125671.834973] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6a RCX: 0000000000000000 [125671.834973] RDX: 0000000000000000 RSI: 00000000000c1bff RDI: ffff88002ebd62a8 [125671.834973] RBP: ffff8801ac91bc70 R08: 0000000000000001 R09: 0000000000000000 [125671.834973] R10: ffff8801ac91bc70 R11: 0000000000000000 R12: ffff88002ebd62a8 [125671.834973] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000c1bff [125671.834973] FS: 0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000 [125671.834973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [125671.834973] CR2: 000000000073cae4 CR3: 00000000b7723000 CR4: 00000000000006e0 [125671.834973] Stack: [125671.834973] 0000000000000000 ffff8801422d5600 ffff8802286bbc00 0000000000000000 [125671.834973] 0000000000000001 ffff8802286bbc00 00000000000c1bff 0000000000000000 [125671.834973] ffff88002e639eb8 ffff8801ac91bc80 ffffffff81270541 ffff8801ac91bcb0 [125671.834973] Call Trace: [125671.834973] [<ffffffff81270541>] radix_tree_lookup+0xd/0xf [125671.834973] [<ffffffffa04ae6a6>] reada_peer_zones_set_lock+0x3e/0x60 [btrfs] [125671.834973] [<ffffffffa04ae8b9>] reada_pick_zone+0x29/0x103 [btrfs] [125671.834973] [<ffffffffa04af42f>] reada_start_machine_worker+0x129/0x2d3 [btrfs] [125671.834973] [<ffffffffa04880be>] btrfs_scrubparity_helper+0x185/0x3aa [btrfs] [125671.834973] [<ffffffffa0488341>] btrfs_readahead_helper+0xe/0x10 [btrfs] [125671.834973] [<ffffffff81069691>] process_one_work+0x271/0x4e9 [125671.834973] [<ffffffff81069dda>] worker_thread+0x1eb/0x2c9 [125671.834973] [<ffffffff81069bef>] ? rescuer_thread+0x2b3/0x2b3 [125671.834973] [<ffffffff8106f403>] kthread+0xd4/0xdc [125671.834973] [<ffffffff8149e242>] ret_from_fork+0x22/0x40 [125671.834973] [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286 So fix this by taking the device_list_mutex in the readahead code. We can't use here the lighter approach of using a rcu_read_lock() and rcu_read_unlock() pair together with a list_for_each_entry_rcu() call because we end up doing calls to sleeping functions (kzalloc()) in the respective code path. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30	ACPI / Thermal / video: fix max_level incorrect value	Aaron Lu
	commit 059500940def (ACPI/video: export acpi_video_get_levels) mistakenly dropped the correct value of max_level and that caused the set_level function following failed and the acpi_video backlight interface didn't get created. Fix this by passing back the correct max_level value. While at it, also fix the param used in acpi_video_device_lcd_query_levels where acpi_handle is expected but acpi_video_device is passed. Fixes: 059500940def (ACPI/video: export acpi_video_get_levels) Reported-and-tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-05-30	kernel-doc: reset contents and section harder	Jani Nikula
	If the documentation comment does not have params or sections, the section heading may leak from the previous documentation comment. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc: concatenate contents of colliding sections	Jani Nikula
	If there are multiple sections with the same section name, the current implementation results in several sections by the same heading, with the content duplicated from the last section to all. Even if there's the error message, a more graceful approach is to combine all the identically named sections into one, with concatenated contents. With the supported sections already limited to select few, there are massively fewer collisions than there used to be, but this is still useful for e.g. when function parameters are documented in the middle of a documentation comment, with description spread out above and below. (This is not a recommended documentation style, but used in the kernel nonetheless.) We can now also demote the error to a warning. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc: limit the "section header:" detection to a select few	Jani Nikula
	kernel-doc currently identifies anything matching "section header:" (specifically a string of word characters and spaces followed by a colon) as a new section in the documentation comment, and renders the section header accordingly. Unfortunately, this turns all uses of colon into sections, mostly unintentionally. Considering the output, erroneously creating sections when not intended is always worse than erroneously not creating sections when intended. For example, a line with "http://example.com" turns into a "http" heading followed by "//example.com" in normal text style, which is quite ugly. OTOH, "WARNING: Beware of the Leopard" is just fine even if "WARNING" does not turn into a heading. It is virtually impossible to change all the kernel-doc comments, either way. The compromise is to pick the most commonly used and depended on section headers (with variants) and accept them as section headers. The accepted section headers are, case insensitive: * description: * context: * return: * returns: Additionally, case sensitive: * @return: All of the above are commonly used in the kernel-doc comments, and will result in worse output if not identified as section headers. Also, kernel-doc already has some special handling for all of them, so there's nothing particularly controversial in adding more special treatment for them. While at it, improve the whitespace handling surrounding section names. Do not consider the whitespace as part of the name. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc/rst: remove fixme comment	Jani Nikula
	Yes, for our purposes the type should contain typedef. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc/rst: use undescribed instead of _undescribed_	Jani Nikula
	The latter isn't special to rst. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc: strip leading whitespace from continued param descs	Jani Nikula
	If a param description spans multiple lines, check any leading whitespace in the first continuation line, and remove same amount of whitespace from following lines. This allows indentation in the multi-line parameter descriptions for aesthetical reasons while not causing accidentally significant indentation in the rst output. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc: improve handling of whitespace on the first line param description	Jani Nikula
	Handle whitespace on the first line of param text as if it was the empty string. There is no need to add the newline in this case. This improves the rst output in particular, where blank lines may be problematic in parameter lists. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc/rst: change the output layout	Jani Nikula
	Move away from field lists, and simply use strong emphasis for section headings on lines of their own. Do not use rst section headings, because their nesting depth depends on the surrounding context, which kernel-doc has no knowledge of. Also, they do not need to end up in any table of contexts or indexes. There are two related immediate benefits. Field lists are typically rendered in two columns, while the new style uses the horizontal width better. With no extra indent on the left, there's no need to be as fussy about it. Field lists are more susceptible to indentation problems than the new style. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc: strip leading blank lines from inline doc comments	Jani Nikula
	The inline member markup allows whitespace lines before the actual documentation starts. Strip the leading blank lines. This improves the rst output. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc/rst: blank lines in output are not needed	Jani Nikula
	Current approach leads to two blank lines, while one is enough. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc: fix wrong code indentation	Jani Nikula
	No functional changes. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc: do not regard $, %, or & prefixes as special in section names	Jani Nikula
	The use of these is confusing in the script, and per this grep, they're not used anywhere anyway: $ git grep " \* [%$&][a-zA-Z0-9_]:" -- .[ch] \| grep -v "\$$Id\\|Revision\\|Date$" While at it, throw out the constants array, nothing is ever put there again. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc/rst: highlight function/struct/enum purpose lines too	Jani Nikula
	Let the user use @foo, &bar, %baz, etc. in the first kernel-doc purpose line too. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc/rst: drop redundant unescape in highlighting	Jani Nikula
	This bit is already done by xml_unescape() above. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc/rst: add support for struct/union/enum member references	Jani Nikula
	Link "&foo->bar", "&foo->bar()", "&foo.bar", and "&foo.bar()" to the struct/union/enum foo definition. The members themselves do not currently have anchors to link to, but this is better than nothing, and promotes a universal notation. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc/rst: add support for &union foo and &typedef foo references	Jani Nikula
	Let the user use "&union foo" and "&typedef foo" to reference foo. The difference to using "union &foo", "typedef &foo", or just "&foo" (which are valid too) is that "union" and "typedef" become part of the link text. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc/rst: &foo references are more universal than structs	Jani Nikula
	It's possible to use &foo to reference structs, enums, typedefs, etc. in the Sphinx C domain. Thus do not prefix the links with "struct". Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc/rst: reference functions according to C domain spec	Jani Nikula
	The Sphinx C domain spec says function references should include the parens (). Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc/rst: do not output DOC: section titles for requested ones	Jani Nikula
	If the user requests a specific DOC: section by name, do not output its section title. In these cases, the surrounding context already has a heading, and the DOC: section title is only used as an identifier and a heading for clarity in the source file. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc: add names for output selection	Jani Nikula
	Make the output selection a bit more readable by adding constants for the various types of output selection. While at it, actually call the variable for choosing what to output $output_selection. No functional changes. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2016-05-30	kernel-doc: add names for states and substates	Jani Nikula
	Make the state machine a bit more readable by adding constants for parser states and inline member documentation parser substates. While at it, rename the "split" documentation to "inline" documentation. No functional changes. Signed-off-by: Jani Nikula <jani.nikula@intel.com>