linux-arm.git - Russell King's ARM Linux kernel tree

Age	Commit message (Collapse)	Author
2016-05-13	ASoC: fsl_ssi: Fix channel slipping in Playback at startup	Arnaud Mouiche
	Previously, SCR.SSIEN and SCR.TE were enabled at once if no capture stream was also running. This may not give a chance for the DMA to write the first sample in TX FIFO before the streaming starts on the PCM bus, inserting void samples first. Those void samples are then responsible for slipping the channels. Signed-off-by: Arnaud Mouiche <arnaud.mouiche@invoxia.com> Reviewed-by: Fabio Estevam <fabio.estevam@nxp.com> Tested-by: Caleb Crome <caleb@crome.org> Signed-off-by: Mark Brown <broonie@kernel.org>
2016-05-13	ASoC: fsl_ssi: Fix samples being dropped at Playback startup	Arnaud Mouiche
	If the capture is already running while playback is started, it is highly probable (>80% in a 8 channels scenario) that samples are lost between the DMA and TX fifo. The reason is that SIER.TDMAE is set before STCR.TFEN0, leaving a time window where the FIFO doesn't receive the samples written by the DMA. This particular case happened only if capture is already enabled as SCR.SSIEN is already set at the playback startup instant. Signed-off-by: Arnaud Mouiche <arnaud.mouiche@invoxia.com> Reviewed-by: Fabio Estevam <fabio.estevam@nxp.com> Tested-by: Caleb Crome <caleb@crome.org> Signed-off-by: Mark Brown <broonie@kernel.org>
2016-05-13	ASoC: fsl_ssi: Save a dev reference for dev_err() purpose.	Arnaud Mouiche
	Most of functions only receive the ssi_private reference and don't have a knowledge of 'dev' pointer, even for debug purpose. Signed-off-by: Arnaud Mouiche <arnaud.mouiche@invoxia.com> Tested-by: Caleb Crome <caleb@crome.org> Signed-off-by: Mark Brown <broonie@kernel.org>
2016-05-13	ASoC: fsl_ssi: The IPG/5 limitation concerns the bitclk, not the sysclk.	Arnaud Mouiche
	im6sl reference manual 47.7.4: " Bit clock - Used to serially clock the data bits in and out of the SSI port. This clock is either generated internally (from SSI's sys clock) or taken from external clock source (through the Tx/Rx clock ports). [...] Care should be taken to ensure that the bit clock frequency (either internally generated by dividing the SSI's sys clock or sourced from external device through Tx/Rx clock ports) is never greater than 1/5 of the ipg_clk (from CCM) frequency. " Since, in master mode, the sysclk is a multiple of bitclk, we can easily reach a high sysclk value, whereas keeping a reasonable bitclk. ex: 8ch x 16bit x 48kHz = 6144000, requires a 24576000 sysclk (PM=1) yet ipg_clk/5 = 66Mhz/5 = 13.2 Signed-off-by: Arnaud Mouiche <arnaud.mouiche@invoxia.com> Reviewed-by: Fabio Estevam <fabio.estevam@nxp.com> Tested-by: Caleb Crome <caleb@crome.org> Signed-off-by: Mark Brown <broonie@kernel.org>
2016-05-13	ASoC: fsl_ssi: Real hardware channels max number is 32	Arnaud Mouiche
	The max number of slots in TDM mode is 32: - Frame Rate Divider Control is a 5bit value - Time slot mask registers control 32 slots. Signed-off-by: Arnaud Mouiche <arnaud.mouiche@invoxia.com> Reviewed-by: Fabio Estevam <fabio.estevam@nxp.com> Tested-by: Caleb Crome <caleb@crome.org> Signed-off-by: Mark Brown <broonie@kernel.org>
2016-05-13	spi: pic32-sqi: Fix linker error, undefined reference to `bad_dma_ops'.	Purna Chandra Mandal
	Even if DMA support is disabled code using DMA mapping APIs compiles fine, but fails in linking. ------- drivers/built-in.o: In function `ring_desc_ring_free': spi-pic32-sqi.c:(.text+0x2cfbe0): undefined reference to `bad_dma_ops' spi-pic32-sqi.c:(.text+0x2cfbe4): undefined reference to `bad_dma_ops' drivers/built-in.o: In function `pic32_sqi_probe': spi-pic32-sqi.c:(.text+0x2cfe48): undefined reference to `bad_dma_ops' spi-pic32-sqi.c:(.text+0x2cfeb0): undefined reference to `bad_dma_ops' spi-pic32-sqi.c:(.text+0x2cff38): undefined reference to `bad_dma_ops' -------- Correct dependency by adding 'depends on HAS_DMA' in Kconfig. Signed-off-by: Purna Chandra Mandal <purna.mandal@microchip.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2016-05-13	ASoC: pcm5102a: Add support for PCM5102A codec	Florian Meier
	Some definitions to support the PCM5102A codec by Texas Instruments. Signed-off-by: Florian Meier <florian.meier@koalo.de> Changes to original patch by Florian Meier: * rebased (Makefile and Kconfig * fixed checkpath errors (spaces, newlines) * added dt-binding documentation Signed-off-by: Martin Sperl <kernel@martin.sperl.org> Signed-off-by: Mark Brown <broonie@kernel.org>
2016-05-13	ASoC: hdac_hdmi: add link management	Vinod Koul
	Manage the hda idisp link using shiny new link APIs. We need to keep link On while we probe and also hold the reference in runtime resume and drop in suspend Signed-off-by: Jeeja KP <jeeja.kp@intel.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2016-05-13	ASoC: Intel: Skylake: add link management	Vinod Koul
	Use shiny new link APIs to manage the links. Also remove old link configuration logic from driver. We need to keep link and cmd dma to off during active suspend to allow system to enter low power state and turn it on if the link and cmd dma was on before active suspend in active resume. Signed-off-by: Jeeja KP <jeeja.kp@intel.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2016-05-13	ALSA: hdac: add link pm and ref counting	Vinod Koul
	The HDA links can be switched off when not is use, similarly command DMA can be stopped as well. This calls for a reference counting mechanism on the link by it's users to manage the link power. The DMA can be turned off when all links are off For this we add two APIs snd_hdac_ext_bus_link_get snd_hdac_ext_bus_link_put They help users to turn up/down link and manage the DMA as well Signed-off-by: Jeeja KP <jeeja.kp@intel.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com> Acked-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Mark Brown <broonie@kernel.org>
2016-05-13	i2c: st: Implement bus clear	Peter Griffin
	>From I2C specifications: http://www.nxp.com/documents/user_manual/UM10204.pdf Chapter 3.1.16, when the i2c device held the SDA line low, the master should send 9 clocks pulses to try to recover. Signed-off-by: Frederic Pillon <frederic.pillon@st.com> Signed-off-by: Peter Griffin <peter.griffin@linaro.org> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2016-05-13	i2c: only check scl functions when using generic recovery	Wolfram Sang
	A custom recovery function doesn't need these pointers to be populated because it may work differently internally. Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Tested-by: Peter Griffin <peter.griffin@linaro.org>
2016-05-13	Merge remote-tracking branches 'regulator/fix/axp20x', ↵	Mark Brown
	'regulator/fix/da9063', 'regulator/fix/gpio' and 'regulator/fix/s2mps11' into regulator-linus
2016-05-13	Merge branch 'kvm-ppc-next' of ↵	Paolo Bonzini
	git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
2016-05-13	Merge remote-tracking branches 'regmap/topic/doc' and 'regmap/topic/flat' ↵	Mark Brown
	into regmap-next
2016-05-13	Merge remote-tracking branches 'regmap/fix/be', 'regmap/fix/doc' and ↵	Mark Brown
	'regmap/fix/spmi' into regmap-linus
2016-05-13	Merge remote-tracking branch 'regmap/fix/mmio' into regmap-linus	Mark Brown

2016-05-13	crypto: qat - change the adf_ctl_stop_devices to void	Tadeusz Struk
	Change the adf_ctl_stop_devices to a void function. Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-05-13	dmaengine: vdma: Add clock support	Kedareswara rao Appana
	Added basic clock support for axi dma's. The clocks are requested at probe and released at remove. Reviewed-by: Shubhrajyoti Datta <shubhraj@xilinx.com> Signed-off-by: Kedareswara rao Appana <appanad@xilinx.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2016-05-13	Documentation: DT: vdma: Add clock support for dmas	Kedareswara rao Appana
	This patch updates the binding doc with clock description for AXI DMA's. Acked-by: Rob Herring <robh@kernel.org> Acked-by: Sören Brinkmann <soren.brinkmann@xilinx.com> Signed-off-by: Kedareswara rao Appana <appanad@xilinx.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2016-05-13	dmaengine: vdma: Add config structure to differentiate dmas	Kedareswara rao Appana
	This patch adds config structure in the driver to differentiate AXI DMA's and to add more features(clock support etc..) to these DMA's. Signed-off-by: Kedareswara rao Appana <appanad@xilinx.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2016-05-13	MAINTAINERS: Update Tegra DMA maintainers	Jon Hunter
	Update the Tegra DMA driver maintainer field to include the newly added Tegra210 ADMA and add Jon Hunter as a co-maintainer for Tegra DMA. Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2016-05-13	dmaengine: tegra-adma: Add support for Tegra210 ADMA	Jon Hunter
	Add support for the Tegra210 Audio DMA controller that is used for transferring data between system memory and the Audio sub-system. The driver only supports cyclic transfers because this is being solely used for audio. This driver is based upon the work by Dara Ramesh <dramesh@nvidia.com>. Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2016-05-13	Documentation: DT: Add binding documentation for NVIDIA ADMA	Jon Hunter
	Add device-tree binding documentation for the Tegra210 Audio DMA controller. Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2016-05-13	Merge branch 'drm-fixes-4.6' of git://people.freedesktop.org/~agd5f/linux ↵	Dave Airlie
	into drm-fixes DP mode validation regression fix. * 'drm-fixes-4.6' of git://people.freedesktop.org/~agd5f/linux: drm/amdgpu: fix DP mode validation drm/radeon: fix DP mode validation
2016-05-13	xen-netback: fix extra_info handling in xenvif_tx_err()	Paul Durrant
	Patch 562abd39 "xen-netback: support multiple extra info fragments passed from frontend" contained a mistake which can result in an in- correct number of responses being generated when handling errors encountered when processing packets containing extra info fragments. This patch fixes the problem. Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reported-by: Jan Beulich <JBeulich@suse.com> Cc: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-13	udp: Resolve NULL pointer dereference over flow-based vxlan device	Alexander Duyck
	While testing an OpenStack configuration using VXLANs I saw the following call trace: RIP: 0010:[<ffffffff815fad49>] udp4_lib_lookup_skb+0x49/0x80 RSP: 0018:ffff88103867bc50 EFLAGS: 00010286 RAX: ffff88103269bf00 RBX: ffff88103269bf00 RCX: 00000000ffffffff RDX: 0000000000004300 RSI: 0000000000000000 RDI: ffff880f2932e780 RBP: ffff88103867bc60 R08: 0000000000000000 R09: 000000009001a8c0 R10: 0000000000004400 R11: ffffffff81333a58 R12: ffff880f2932e794 R13: 0000000000000014 R14: 0000000000000014 R15: ffffe8efbfd89ca0 FS: 0000000000000000(0000) GS:ffff88103fd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000488 CR3: 0000000001c06000 CR4: 00000000001426e0 Stack: ffffffff81576515 ffffffff815733c0 ffff88103867bc98 ffffffff815fcc17 ffff88103269bf00 ffffe8efbfd89ca0 0000000000000014 0000000000000080 ffffe8efbfd89ca0 ffff88103867bcc8 ffffffff815fcf8b ffff880f2932e794 Call Trace: [<ffffffff81576515>] ? skb_checksum+0x35/0x50 [<ffffffff815733c0>] ? skb_push+0x40/0x40 [<ffffffff815fcc17>] udp_gro_receive+0x57/0x130 [<ffffffff815fcf8b>] udp4_gro_receive+0x10b/0x2c0 [<ffffffff81605863>] inet_gro_receive+0x1d3/0x270 [<ffffffff81589e59>] dev_gro_receive+0x269/0x3b0 [<ffffffff8158a1b8>] napi_gro_receive+0x38/0x120 [<ffffffffa0871297>] gro_cell_poll+0x57/0x80 [vxlan] [<ffffffff815899d0>] net_rx_action+0x160/0x380 [<ffffffff816965c7>] __do_softirq+0xd7/0x2c5 [<ffffffff8107d969>] run_ksoftirqd+0x29/0x50 [<ffffffff8109a50f>] smpboot_thread_fn+0x10f/0x160 [<ffffffff8109a400>] ? sort_range+0x30/0x30 [<ffffffff81096da8>] kthread+0xd8/0xf0 [<ffffffff81693c82>] ret_from_fork+0x22/0x40 [<ffffffff81096cd0>] ? kthread_park+0x60/0x60 The following trace is seen when receiving a DHCP request over a flow-based VXLAN tunnel. I believe this is caused by the metadata dst having a NULL dev value and as a result dev_net(dev) is causing a NULL pointer dereference. To resolve this I am replacing the check for skb_dst(skb)->dev with just skb->dev. This makes sense as the callers of this function are usually in the receive path and as such skb->dev should always be populated. In addition other functions in the area where these are called are already using dev_net(skb->dev) to determine the namespace the UDP packet belongs in. Fixes: 63058308cd55 ("udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-13	sunrpc: set SOCK_FASYNC	Eric Dumazet
	sunrpc is using SOCKWQ_ASYNC_NOSPACE without setting SOCK_FASYNC, so the recent optimizations done in sk_set_bit() and sk_clear_bit() broke it. There is still the risk that a subsequent sock_fasync() call would clear SOCK_FASYNC, but sunrpc does not use this yet. Fixes: 9317bb69824e ("net: SOCKWQ_ASYNC_NOSPACE optimizations") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jiri Pirko <jiri@resnulli.us> Reported-by: Huang, Ying <ying.huang@intel.com> Tested-by: Jiri Pirko <jiri@resnulli.us> Tested-by: Huang, Ying <ying.huang@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-13	Merge tag 'perf-urgent-for-mingo-20160512' of ↵	Ingo Molnar
	git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf/urgent fixes from Arnaldo Carvalho de Melo: - Fallback to usermode-only counters when perf_event_paranoid > 1, which is the case now (Arnaldo Carvalho de Melo) - Do not reassign parg after collapse_tree() in libtraceevent, which may cause tool crashes (Steven Rostedt) - Fix the build on Fedora Rawhide, where readdir_r() is deprecated and also wrt -Werror=unused-const-variable= + x86_32_regoffset_table on !x86_64 (Arnaldo Carvalho de Melo) - Fix the build on Ubuntu 12.04.5, where dwarf_getlocations() isn't available, i.e. libdw-dev < 0.157 (Arnaldo Carvalho de Melo) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-13	ext4: pre-zero allocated blocks for DAX IO	Jan Kara
	Currently ext4 treats DAX IO the same way as direct IO. I.e., it allocates unwritten extents before IO is done and converts unwritten extents afterwards. However this way DAX IO can race with page fault to the same area: ext4_ext_direct_IO() dax_fault() dax_io() get_block() - allocates unwritten extent copy_from_iter_pmem() get_block() - converts unwritten block to written and zeroes it out ext4_convert_unwritten_extents() So data written with DAX IO gets lost. Similarly dax_new_buf() called from dax_io() can overwrite data that has been already written to the block via mmap. Fix the problem by using pre-zeroed blocks for DAX IO the same way as we use them for DAX mmap. The downside of this solution is that every allocating write writes each block twice (once zeros, once data). Fixing the race with locking is possible as well however we would need to lock-out faults for the whole range written to by DAX IO. And that is not easy to do without locking-out faults for the whole file which seems too aggressive. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13	ext4: refactor direct IO code	Jan Kara
	Currently ext4 direct IO handling is split between ext4_ext_direct_IO() and ext4_ind_direct_IO(). However the extent based function calls into the indirect based one for some cases and for example it is not able to handle file extending. Previously it was not also properly handling retries in case of ENOSPC errors. With DAX things would get even more contrieved so just refactor the direct IO code and instead of indirect / extent split do the split to read vs writes. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13	ext4: fix race in transient ENOSPC detection	Jan Kara
	When there are blocks to free in the running transaction, block allocator can return ENOSPC although the filesystem has some blocks to free. We use ext4_should_retry_alloc() to force commit of the current transaction and return whether anything was committed so that it makes sense to retry the allocation. However the transaction may get committed after block allocation fails but before we call ext4_should_retry_alloc(). So ext4_should_retry_alloc() returns false because there is nothing to commit and we wrongly return ENOSPC. Fix the race by unconditionally returning 1 from ext4_should_retry_alloc() when we tried to commit a transaction. This should not add any unnecessary retries since we had a transaction running a while ago when trying to allocate blocks and we want to retry the allocation once that transaction has committed anyway. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13	ext4: handle transient ENOSPC properly for DAX	Jan Kara
	ext4_dax_get_blocks() was accidentally omitted fixing get blocks handlers to properly handle transient ENOSPC errors. Fix it now to use ext4_get_blocks_trans() helper which takes care of these errors. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13	dax: call get_blocks() with create == 1 for write faults to unwritten extents	Jan Kara
	Currently, __dax_fault() does not call get_blocks() callback with create argument set, when we got back unwritten extent from the initial get_blocks() call during a write fault. This is because originally filesystems were supposed to convert unwritten extents to written ones using complete_unwritten() callback. Later this was abandoned in favor of using pre-zeroed blocks however the condition whether get_blocks() needs to be called with create == 1 remained. Fix the condition so that filesystems are not forced to zero-out and convert unwritten extents when get_blocks() is called with create == 0 (which introduces unnecessary overhead for read faults and can be problematic as the filesystem may possibly be read-only). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13	ARC: pae: STRICT_MM_TYPECHECKS was broken	Vineet Gupta
	Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2016-05-12	jfs: Switch to generic xattr handlers	Andreas Gruenbacher
	This is mostly the same as on other filesystems except for attribute names with an "os2." prefix: for those, the prefix is not stored on disk, and on-attribute names without a prefix have "os2." added. As on several other filesystems, the underlying function for setting/removing xattrs (__jfs_setxattr) removes attributes when the value is NULL, so the set xattr handlers will work as expected. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12	jfs: Clean up xattr name mapping	Andreas Gruenbacher
	Instead of stripping "os2." prefixes in __jfs_setxattr, make callers strip them, as __jfs_getxattr already does. With that change, use the same name mapping function in jfs_{get,set,remove}xattr. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12	gfs2: Switch to generic xattr handlers	Al Viro
	Switch to the generic xattr handlers and take the necessary glocks at the layer below. The following are the new xattr "entry points"; they are called with the glock held already in the following cases: gfs2_xattr_get: From SELinux, during lookups. gfs2_xattr_set: The glock is never held. gfs2_get_acl: From gfs2_create_inode -> posix_acl_create and gfs2_setattr -> posix_acl_chmod. gfs2_set_acl: From gfs2_setattr -> posix_acl_chmod. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-13	Merge tag 'drm/panel/for-4.7-rc1' of ↵	Dave Airlie
	git://anongit.freedesktop.org/tegra/linux into drm-next drm/panel: Changes for v4.7-rc1 This contains support for a bunch of new panels in the simple panel driver along with some cleanup and support for a new Analogix HDMI to DP bridge. * tag 'drm/panel/for-4.7-rc1' of git://anongit.freedesktop.org/tegra/linux: drm/panel: simple: Add support for TPK U.S.A. LLC Fusion 7" and 10.1" panels drm/bridge: Add Analogix anx78xx support devicetree: Add ANX7814 SlimPort transmitter binding of: Add vendor prefix for Analogix Semiconductor drm/dp: Add define to set 0.5% down-spread in MAX_DOWNSPREAD register drm/panel: simple: Add support for Innolux AT070TN92 drm/panel: simple: Remove useless drm_mode_set_name() drm/panel: simple: Set appropriate mode type drm/panel: simple: Add timings for the Olimex LCD-OLinuXino-4.3TS drm/panel: simple: Add the 7" DPI panel from Adafruit of: Add vendor prefix for On Tat Industrial Company.
2016-05-12	Merge branch 'akpm' (patches from Andrew)	Linus Torvalds
	Merge fixes from Andrew Morton: "4 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: thp: calculate the mapcount correctly for THP pages during WP faults ksm: fix conflict between mmput and scan_get_next_rmap_item ocfs2: fix posix_acl_create deadlock ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang
2016-05-13	Btrfs: add semaphore to synchronize direct IO writes with fsync	Filipe Manana
	Due to the optimization of lockless direct IO writes (the inode's i_mutex is not held) introduced in commit 38851cc19adb ("Btrfs: implement unlocked dio write"), we started having races between such writes with concurrent fsync operations that use the fast fsync path. These races were addressed in the patches titled "Btrfs: fix race between fsync and lockless direct IO writes" and "Btrfs: fix race between fsync and direct IO writes for prealloc extents". The races happened because the direct IO path, like every other write path, does create extent maps followed by the corresponding ordered extents while the fast fsync path collected first ordered extents and then it collected extent maps. This made it possible to log file extent items (based on the collected extent maps) without waiting for the corresponding ordered extents to complete (get their IO done). The two fixes mentioned before added a solution that consists of making the direct IO path create first the ordered extents and then the extent maps, while the fsync path attempts to collect any new ordered extents once it collects the extent maps. This was simple and did not require adding any synchonization primitive to any data structure (struct btrfs_inode for example) but it makes things more fragile for future development endeavours and adds an exceptional approach compared to the other write paths. This change adds a read-write semaphore to the btrfs inode structure and makes the direct IO path create the extent maps and the ordered extents while holding read access on that semaphore, while the fast fsync path collects extent maps and ordered extents while holding write access on that semaphore. The logic for direct IO write path is encapsulated in a new helper function that is used both for cow and nocow direct IO writes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13	Btrfs: fix race between block group relocation and nocow writes	Filipe Manana
	Relocation of a block group waits for all existing tasks flushing dellaloc, starting direct IO writes and any ordered extents before starting the relocation process. However for direct IO writes that end up doing nocow (inode either has the flag nodatacow set or the write is against a prealloc extent) we have a short time window that allows for a race that makes relocation proceed without waiting for the direct IO write to complete first, resulting in data loss after the relocation finishes. This is illustrated by the following diagram: CPU 1 CPU 2 btrfs_relocate_block_group(bg X) direct IO write starts against an extent in block group X using nocow mode (inode has the nodatacow flag or the write is for a prealloc extent) btrfs_direct_IO() btrfs_get_blocks_direct() --> can_nocow_extent() returns 1 btrfs_inc_block_group_ro(bg X) --> turns block group into RO mode btrfs_wait_ordered_roots() --> returns and does not know about the DIO write happening at CPU 2 (the task there has not created yet an ordered extent) relocate_block_group(bg X) --> rc->stage == MOVE_DATA_EXTENTS find_next_extent() --> returns extent that the DIO write is going to write to relocate_data_extent() relocate_file_extent_cluster() --> reads the extent from disk into pages belonging to the relocation inode and dirties them --> creates DIO ordered extent btrfs_submit_direct() --> submits bio against a location on disk obtained from an extent map before the relocation started btrfs_wait_ordered_range() --> writes all the pages read before to disk (belonging to the relocation inode) relocation finishes bio completes and wrote new data to the old location of the block group So fix this by tracking the number of nocow writers for a block group and make sure relocation waits for that number to go down to 0 before starting to move the extents. The same race can also happen with buffered writes in nocow mode since the patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes when relocating", because we are no longer flushing all delalloc which served as a synchonization mechanism (due to page locking) and ensured the ordered extents for nocow buffered writes were created before we called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow mode existed before that patch (no pages are locked or used during direct IO) and that fixed only races with direct IO writes that do cow. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13	Btrfs: fix race between fsync and direct IO writes for prealloc extents	Filipe Manana
	When we do a direct IO write against a preallocated extent (fallocate) that does not go beyond the i_size of the inode, we do the write operation without holding the inode's i_mutex (an optimization that landed in commit 38851cc19adb ("Btrfs: implement unlocked dio write")). This allows for a very tiny time window where a race can happen with a concurrent fsync using the fast code path, as the direct IO write path creates first a new extent map (no longer flagged as a prealloc extent) and then it creates the ordered extent, while the fast fsync path first collects ordered extents and then it collects extent maps. This allows for the possibility of the fast fsync path to collect the new extent map without collecting the new ordered extent, and therefore logging an extent item based on the extent map without waiting for the ordered extent to be created and complete. This can result in a situation where after a log replay we end up with an extent not marked anymore as prealloc but it was only partially written (or not written at all), exposing random, stale or garbage data corresponding to the unwritten pages and without any checksums in the csum tree covering the extent's range. This is an extension of what was done in commit de0ee0edb21f ("Btrfs: fix race between fsync and lockless direct IO writes"). So fix this by creating first the ordered extent and then the extent map, so that this way if the fast fsync patch collects the new extent map it also collects the corresponding ordered extent. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13	Btrfs: fix number of transaction units for renames with whiteout	Filipe Manana
	When we do a rename with the whiteout flag, we need to create the whiteout inode, which in the worst case requires 5 transaction units (1 inode item, 1 inode ref, 2 dir items and 1 xattr if selinux is enabled). So bump the number of transaction units from 11 to 16 if the whiteout flag is set. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13	Btrfs: pin logs earlier when doing a rename exchange operation	Filipe Manana
	The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(), which had a race fixed by my previous patch titled "Btrfs: pin log earlier when renaming", and so it suffers from the same problem. We pin the logs of the affected roots after we insert the new inode references, leaving a time window where concurrent tasks logging the inodes can end up logging both the new and old references, resulting in log trees that when replayed can turn the metadata into inconsistent states. This behaviour was added to btrfs_rename() in 2009 without any explanation about why not pinning the logs earlier, just leaving a comment about the posibility for the race. As of today it's perfectly safe and sane to pin the logs before we start doing any of the steps involved in the rename operation. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13	Btrfs: unpin logs if rename exchange operation fails	Filipe Manana
	If rename exchange operations fail at some point after we pinned any of the logs, we end up aborting the current transaction but never unpin the logs, which leaves concurrent tasks that are trying to sync the logs (as part of an fsync request from user space) blocked forever and preventing the filesystem from being unmountable. Fix this by safely unpinning the log. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13	Btrfs: fix inode leak on failure to setup whiteout inode in rename	Filipe Manana
	If we failed to fully setup the whiteout inode during a rename operation with the whiteout flag, we ended up leaking the inode, not decrementing its link count nor removing all its items from the fs/subvol tree. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13	btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT	Dan Fuhry
	Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new behavior in the renameat2() syscall. This behavior is primarily used by overlayfs. This patch adds support for these flags to btrfs, enabling it to be used as a fully functional upper layer for overlayfs. RENAME_EXCHANGE support was written by Davide Italiano originally submitted on 2 April 2015. Signed-off-by: Davide Italiano <dccitaliano@gmail.com> Signed-off-by: Dan Fuhry <dfuhry@datto.com> [ remove unlikely ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13	Btrfs: pin log earlier when renaming	Filipe Manana
	We were pinning the log right after the first step in the rename operation (inserting inode ref for the new name in the destination directory) instead of doing it before. This behaviour was introduced in 2009 for some reason that was not mentioned neither on the changelog nor any comment, with the drawback of a small time window where concurrent log writers can end up logging the new inode reference for the inode we are renaming while the rename operation is in progress (so that we can end up with a log containing both the new and old references). As of today there's no reason to not pin the log before that first step anymore, so just fix this. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13	Btrfs: unpin log if rename operation fails	Filipe Manana
	If rename operations fail at some point after we pinned the log, we end up aborting the current transaction but never unpin the log, which leaves concurrent tasks that are trying to sync the log (as part of an fsync request from user space) blocked forever and preventing the filesystem from being unmountable. Fix this by safely unpinning the log. Signed-off-by: Filipe Manana <fdmanana@suse.com>