git.armlinux.org.uk/linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2018-05-31	crypto: chtls - wait for memory sendmsg, sendpage	Atul Gupta
	address suspicious code <gustavo@embeddedor.com> 1210 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); 1211 } The issue is that in the code above, set_bit is never reached due to the 'continue' statement at line 1208. Also reported by bug report:<dan.carpenter@oracle.com> 1210 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Not reachable. Its required to wait for buffer in the send path and takes care of unaddress and un-handled SOCK_NOSPACE. v2: use csk_mem_free where appropriate proper indent of goto do_nonblock replace out with do_rm_wq Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Atul Gupta <atul.gupta@chelsio.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: chtls - key len correction	Atul Gupta
	corrected the key length to copy 128b key. Removed 192b and 256b key as user input supports key of size 128b in gcm_ctx Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Atul Gupta <atul.gupta@chelsio.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: salsa20 - Revert "crypto: salsa20 - export generic helpers"	Eric Biggers
	This reverts commit eb772f37ae8163a89e28a435f6a18742ae06653b, as now the x86 Salsa20 implementation has been removed and the generic helpers are no longer needed outside of salsa20_generic.c. We could keep this just in case someone else wants to add a new optimized Salsa20 implementation. But given that we have ChaCha20 now too, I think it's unlikely. And this can always be reverted back. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: x86/salsa20 - remove x86 salsa20 implementations	Eric Biggers
	The x86 assembly implementations of Salsa20 use the frame base pointer register (%ebp or %rbp), which breaks frame pointer convention and breaks stack traces when unwinding from an interrupt in the crypto code. Recent (v4.10+) kernels will warn about this, e.g. WARNING: kernel stack regs at 00000000a8291e69 in syzkaller047086:4677 has bad 'bp' value 000000001077994c [...] But after looking into it, I believe there's very little reason to still retain the x86 Salsa20 code. First, these are not vectorized (SSE2/SSSE3/AVX2) implementations, which would be needed to get anywhere close to the best Salsa20 performance on any remotely modern x86 processor; they're just regular x86 assembly. Second, it's still unclear that anyone is actually using the kernel's Salsa20 at all, especially given that now ChaCha20 is supported too, and with much more efficient SSSE3 and AVX2 implementations. Finally, in benchmarks I did on both Intel and AMD processors with both gcc 8.1.0 and gcc 4.9.4, the x86_64 salsa20-asm is actually slightly slower than salsa20-generic (~3% slower on Skylake, ~10% slower on Zen), while the i686 salsa20-asm is only slightly faster than salsa20-generic (~15% faster on Skylake, ~20% faster on Zen). The gcc version made little difference. So, the x86_64 salsa20-asm is pretty clearly useless. That leaves just the i686 salsa20-asm, which based on my tests provides a 15-20% speed boost. But that's without updating the code to not use %ebp. And given the maintenance cost, the small speed difference vs. salsa20-generic, the fact that few people still use i686 kernels, the doubt that anyone is even using the kernel's Salsa20 at all, and the fact that a SSE2 implementation would almost certainly be much faster on any remotely modern x86 processor yet no one has cared enough to add one yet, I don't think it's worthwhile to keep. Thus, just remove both the x86_64 and i686 salsa20-asm implementations. Reported-by: syzbot+ffa3a158337bbc01ff09@syzkaller.appspotmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: ccp - Add GET_ID SEV command	Janakarajan Natarajan
	The GET_ID command, added as of SEV API v0.16, allows the SEV firmware to be queried about a unique CPU ID. This unique ID can then be used to obtain the public certificate containing the Chip Endorsement Key (CEK) public key signed by the AMD SEV Signing Key (ASK). For more information please refer to "Section 5.12 GET_ID" of https://support.amd.com/TechDocs/55766_SEV-KM%20API_Specification.pdf Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: ccp - Add DOWNLOAD_FIRMWARE SEV command	Janakarajan Natarajan
	The DOWNLOAD_FIRMWARE command, added as of SEV API v0.15, allows the OS to install SEV firmware newer than the currently active SEV firmware. For the new SEV firmware to be applied it must: * Pass the validation test performed by the existing firmware. * Be of the same build or a newer build compared to the existing firmware. For more information please refer to "Section 5.11 DOWNLOAD_FIRMWARE" of https://support.amd.com/TechDocs/55766_SEV-KM%20API_Specification.pdf Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: qat - Add MODULE_FIRMWARE for all qat drivers	Conor McLoughlin
	Signed-off-by: Conor McLoughlin <conor.mcloughlin@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: ccree - silence debug prints	Gilad Ben-Yossef
	The cache parameter register configuration was being too verbose. Use dev_dbg() to only provide the information if needed. Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com> Reviewed-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: ccree - better clock handling	Gilad Ben-Yossef
	Use managed clock handling, differentiate between no clock (possibly OK) and clock init failure (never OK) and correctly handle clock detection being deferred. Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com> Reviewed-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: ccree - correct host regs offset	Gilad Ben-Yossef
	The product signature and HW revision register have different offset on the older HW revisions. This fixes the problem of the driver failing sanity check on silicon despite working on the FPGA emulation systems. Fixes: 27b3b22dd98c ("crypto: ccree - add support for older HW revs") Cc: stable@vger.kernel.org Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com> Reviewed-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: chelsio - Remove separate buffer used for DMA map B0 block in CCM	Harsh Jain
	Extends memory required for IV to include B0 Block and DMA map in single operation. Signed-off-by: Harsh Jain <harsh@chelsio.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypt: chelsio - Send IV as Immediate for cipher algo	Harsh Jain
	Send IV in WR as immediate instead of dma mapped entry for cipher. Signed-off-by: Harsh Jain <harsh@chelsio.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: chelsio - Return -ENOSPC for transient busy indication.	Harsh Jain
	Change the return type based on following patch https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg28552.html Signed-off-by: Harsh Jain <harsh@chelsio.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: caam/qi - fix warning in init_cgr()	Horia Geantă
	Coverity warns about an "Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN)" when computing the congestion threshold value. Even though it is highly unlikely for an overflow to happen, use this as an opportunity to simplify the code. Signed-off-by: Horia Geantă <horia.geanta@nxp.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: caam - fix rfc4543 descriptors	Horia Geantă
	In some cases the CCB DMA-based internal transfer started by the MOVE command (src=M3 register, dst=descriptor buffer) does not finish in time and DECO executes the unpatched descriptor. This leads eventually to a DECO Watchdog Timer timeout error. To make sure the transfer ends, change the MOVE command to be blocking. Signed-off-by: Horia Geantă <horia.geanta@nxp.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: caam - fix MC firmware detection	Horia Geantă
	Management Complex (MC) f/w detection is based on CTPR_MS[DPAA2] bit. This is incorrect since: -the bit is set for all CAAM blocks integrated in SoCs with a certain Layerscape Chassis -some SoCs with LS Chassis don't have an MC block (thus no MC f/w) To fix this, MC f/w detection will be based on the presence of "fsl,qoriq-mc" compatible string in the device tree. Fixes: 297b9cebd2fc0 ("crypto: caam/jr - add support for DPAA2 parts") Signed-off-by: Horia Geantă <horia.geanta@nxp.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: clarify licensing of OpenSSL asm code	Adam Langley
	Several source files have been taken from OpenSSL. In some of them a comment that "permission to use under GPL terms is granted" was included below a contradictory license statement. In several cases, there was no indication that the license of the code was compatible with the GPLv2. This change clarifies the licensing for all of these files. I've confirmed with the author (Andy Polyakov) that a) he has licensed the files with the GPLv2 comment under that license and b) that he's also happy to license the other files under GPLv2 too. In one case, the file is already contained in his CRYPTOGAMS bundle, which has a GPLv2 option, and so no special measures are needed. In all cases, the license status of code has been clarified by making the GPLv2 license prominent. The .S files have been regenerated from the updated .pl files. This is a comment-only change. No code is changed. Signed-off-by: Adam Langley <agl@chromium.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: morus - Mark MORUS SIMD glue as x86-specific	Ondrej Mosnacek
	Commit 56e8e57fc3a7 ("crypto: morus - Add common SIMD glue code for MORUS") accidetally consiedered the glue code to be usable by different architectures, but it seems to be only usable on x86. This patch moves it under arch/x86/crypto and adds 'depends on X86' to the Kconfig options and also removes the prompt to hide these internal options from the user. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: testmgr - eliminate redundant decryption test vectors	Eric Biggers
	Currently testmgr has separate encryption and decryption test vectors for symmetric ciphers. That's massively redundant, since with few exceptions (mostly mistakes, apparently), all decryption tests are identical to the encryption tests, just with the input/result flipped. Therefore, eliminate the redundancy by removing the decryption test vectors and updating testmgr to test both encryption and decryption using what used to be the encryption test vectors. Naming is adjusted accordingly: each cipher_testvec now has a 'ptext' (plaintext), 'ctext' (ciphertext), and 'len' instead of an 'input', 'result', 'ilen', and 'rlen'. Note that it was always the case that 'ilen == rlen'. AES keywrap ("kw(aes)") is special because its IV is generated by the encryption. Previously this was handled by specifying 'iv_out' for encryption and 'iv' for decryption. To make it work cleanly with only one set of test vectors, put the IV in 'iv', remove 'iv_out', and add a boolean that indicates that the IV is generated by the encryption. In total, this removes over 10000 lines from testmgr.h, with no reduction in test coverage since prior patches already copied the few unique decryption test vectors into the encryption test vectors. This covers all algorithms that used 'struct cipher_testvec', e.g. any block cipher in the ECB, CBC, CTR, XTS, LRW, CTS-CBC, PCBC, OFB, or keywrap modes, and Salsa20 and ChaCha20. No change is made to AEAD tests, though we probably can eliminate a similar redundancy there too. The testmgr.h portion of this patch was automatically generated using the following awk script, with some slight manual fixups on top (updated 'struct cipher_testvec' definition, updated a few comments, and fixed up the AES keywrap test vectors): BEGIN { OTHER = 0; ENCVEC = 1; DECVEC = 2; DECVEC_TAIL = 3; mode = OTHER } /^static const struct cipher_testvec._enc_/ { sub("_enc", ""); mode = ENCVEC } /^static const struct cipher_testvec._dec_/ { mode = DECVEC } mode == ENCVEC && !/\.ilen[[:space:]]=/ { sub(/\.input[[:space:]]=$/, ".ptext =") sub(/\.input[[:space:]]=/, ".ptext\t=") sub(/\.result[[:space:]]=$/, ".ctext =") sub(/\.result[[:space:]]=/, ".ctext\t=") sub(/\.rlen[[:space:]]=/, ".len\t=") print } mode == DECVEC_TAIL && /[^[:space:]]/ { mode = OTHER } mode == OTHER { print } mode == ENCVEC && /^};/ { mode = OTHER } mode == DECVEC && /^};/ { mode = DECVEC_TAIL } Note that git's default diff algorithm gets confused by the testmgr.h portion of this patch, and reports too many lines added and removed. It's better viewed with 'git diff --minimal' (or 'git show --minimal'), which reports "2 files changed, 919 insertions(+), 11723 deletions(-)". Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: testmgr - add extra kw(aes) encryption test vector	Eric Biggers
	One "kw(aes)" decryption test vector doesn't exactly match an encryption test vector with input and result swapped. In preparation for removing the decryption test vectors, add this test vector to the encryption test vectors, so we don't lose any test coverage. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: testmgr - add extra ecb(tnepres) encryption test vectors	Eric Biggers
	None of the four "ecb(tnepres)" decryption test vectors exactly match an encryption test vector with input and result swapped. In preparation for removing the decryption test vectors, add these to the encryption test vectors, so we don't lose any test coverage. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: testmgr - make an cbc(des) encryption test vector chunked	Eric Biggers
	One "cbc(des)" decryption test vector doesn't exactly match an encryption test vector with input and result swapped. It's almost the same as one, but the decryption version is "chunked" while the encryption version is "unchunked". In preparation for removing the decryption test vectors, make the encryption one both chunked and unchunked, so we don't lose any test coverage. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31	crypto: testmgr - add extra ecb(des) encryption test vectors	Eric Biggers
	Two "ecb(des)" decryption test vectors don't exactly match any of the encryption test vectors with input and result swapped. In preparation for removing the decryption test vectors, add these to the encryption test vectors, so we don't lose any test coverage. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-30	ASoC: core: Fix return code shown on error for hw_params	Jon Hunter
	When the call to hw_params for a component fails, the error code is held by the variable '__ret' but the error message displays the value held by the variable 'ret'. Fix the return code shown when hw_params fails for a component. Fixes: b8135864d4d3 ("ASoC: snd_soc_component_driver has snd_pcm_ops") Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2018-05-30	btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_exist	Su Yue
	The error code does not match the reason of failure and may confuse the callers. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	PCI: mobiveil: Add Mobiveil PCIe Host Bridge IP driver DT bindings	Subrahmanya Lingappa
	Add DT bindings for the Mobiveil PCIe Host Bridge IP driver and update the vendor prefixes file. Signed-off-by: Subrahmanya Lingappa <l.subrahmanya@mobiveil.co.in> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rob Herring <robh@kernel.org>
2018-05-30	Merge branch 'for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Martin Schwidefsky: - a missing -msoft-float for the compile of the kexec purgatory - a fix for the dasd driver to avoid the double use of a field in the 'struct request' [ That latter one is being discussed, and Christoph asked for something cleaner, but for now it's a fix ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/dasd: use blk_mq_rq_from_pdu for per request data s390/purgatory: Fix endless interrupt loop
2018-05-30	btrfs: raid56: Remove VLA usage	Kees Cook
	In the quest to remove all stack VLA usage from the kernel[1], this allocates the working buffers during regular init, instead of using stack space. This refactors the allocation code a bit to make it easier to review. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	xfs: repair superblocks	Darrick J. Wong
	If one of the backup superblocks is found to differ seriously from superblock 0, write out a fresh copy from the in-core sb. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30	xfs: add helpers to attach quotas to inodes	Darrick J. Wong
	Add a helper routine to attach quota information to inodes that are about to undergo repair. If that fails, we need to schedule a quotacheck for the next mount but allow the corrupted metadata repair to continue. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30	xfs: recover AG btree roots from rmap data	Darrick J. Wong
	Add a helper function to help us recover btree roots from the rmap data. Callers pass in a list of rmap owner codes, buffer ops, and magic numbers. We iterate the rmap records looking for owner matches, and then read the matching blocks to see if the magic number & uuid match. If so, we then read-verify the block, and if that passes then we retain a pointer to the block with the highest level, assuming that by the end of the call we will have found the root. This will be used to reset the AGF/AGI btree root fields during their rebuild procedures. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30	xfs: add helpers to dispose of old btree blocks after a repair	Darrick J. Wong
	Now that we've plumbed in the ability to construct a list of dead btree blocks following a repair, add more helpers to dispose of them. This is done by examining the rmapbt -- if the btree was the only owner we can free the block, otherwise it's crosslinked and we can only remove the rmapbt record. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30	xfs: add helpers to collect and sift btree block pointers during repair	Darrick J. Wong
	Add some helpers to assemble a list of fs block extents. Generally, repair functions will iterate the rmapbt to make a list (1) of all extents owned by the nominal owner of the metadata structure; then they will iterate all other structures with the same rmap owner to make a list (2) of active blocks; and finally we have a subtraction function to subtract all the blocks in (2) from (1), with the result that (1) is now a list of blocks that were owned by the old btree and must be disposed. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30	xfs: add helpers to allocate and initialize fresh btree roots	Darrick J. Wong
	Add a pair of helper functions to allocate and initialize fresh btree roots. The repair functions will use these as part of recreating corrupted metadata. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-05-30	xfs: add helpers to deal with transaction allocation and rolling	Darrick J. Wong
	For repairs, we need to reserve at least as many blocks as we think we're going to need to rebuild the data structure, and we're going to need some helpers to roll transactions while maintaining locks on the AG headers so that other threads cannot wander into the middle of a repair. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-05-30	xfs: grab the per-ag structure whenever relevant	Darrick J. Wong
	Grab and hold the per-AG data across a scrub run whenever relevant. This helps us avoid repeated trips through rcu and the radix tree in the repair code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30	btrfs: return error value if create_io_em failed in cow_file_range	Su Yue
	In cow_file_range(), create_io_em() may fail, but its return value is not recorded. Then return value may be 0 even it failed which is a wrong behavior. Let cow_file_range() return PTR_ERR(em) if create_io_em() failed. Fixes: 6f9994dbabe5 ("Btrfs: create a helper to create em for IO") CC: stable@vger.kernel.org # 4.11+ Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshot	Gu JinXiang
	Since there is no more use of qgroup_reserved member in struct btrfs_pending_snapshot, remove it. Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	btrfs: drop unused parameter qgroup_reserved	Gu JinXiang
	Since commit 7775c8184ec0 ("btrfs: remove unused parameter from btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used by caller of function btrfs_subvolume_reserve_metadata. So remove it. Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	btrfs: balance dirty metadata pages in btrfs_finish_ordered_io	Ethan Lien
	[Problem description and how we fix it] We should balance dirty metadata pages at the end of btrfs_finish_ordered_io, since a small, unmergeable random write can potentially produce dirty metadata which is multiple times larger than the data itself. For example, a small, unmergeable 4KiB write may produce: 16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree 16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree 16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree Although we do call balance dirty pages in write side, but in the buffered write path, most metadata are dirtied only after we reach the dirty background limit (which by far only counts dirty data pages) and wakeup the flusher thread. If there are many small, unmergeable random writes spread in a large btree, we'll find a burst of dirty pages exceeds the dirty_bytes limit after we wakeup the flusher thread - which is not what we expect. In our machine, it caused out-of-memory problem since a page cannot be dropped if it is marked dirty. Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay, but since we do btrfs_finish_ordered_io in a separate worker, it will not stop the flusher consuming dirty pages. Also, we use different worker for metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle the size of dirty metadata pages. [Reproduce steps] To reproduce the problem, we need to do 4KiB write randomly spread in a large btree. In our 2GiB RAM machine: 1) Create 4 subvolumes. 2) Run fio on each subvolume: [global] direct=0 rw=randwrite ioengine=libaio bs=4k iodepth=16 numjobs=1 group_reporting size=128G runtime=1800 norandommap time_based randrepeat=0 3) Take snapshot on each subvolume and repeat fio on existing files. 4) Repeat step (3) until we get large btrees. In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of metadata in each subvolume tree and 12GiB of metadata in extent tree. 5) Stop all fio, take snapshot again, and wait until all delayed work is completed. 6) Start all fio. Few seconds later we hit OOM when the flusher starts to work. It can be reproduced even when using nocow write. Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	btrfs: lift some btrfs_cross_ref_exist checks in nocow path	Ethan Lien
	In nocow path, we check if the extent is snapshotted in btrfs_cross_ref_exist(). We can do the similar check earlier and avoid unnecessary search into extent tree. A fio test on a Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows: [global] group_reporting time_based thread=1 ioengine=libaio bs=4k iodepth=32 size=64G runtime=180 numjobs=8 rw=randwrite [file1] filename=/mnt/nocow/testfile IOPS result: unpatched patched 1 fio round: 46670 46958 snapshot 2 fio round: 51826 54498 3 fio round: 59767 61289 After snapshot, the first fio get about 5% performance gain. As we continually write to the same file, all writes will resume to nocow mode and eventually we have no performance gain. Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	btrfs: Remove fs_info argument from btrfs_uuid_tree_rem	Lu Fengqi
	This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ rename the function ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	btrfs: Remove fs_info argument from btrfs_uuid_tree_add	Lu Fengqi
	This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	Btrfs: remove unused check of skip_locking	Liu Bo
	The check is superfluous since all callers who set search_for_commit also have skip_locking set. ASSERT() is put in place to ensure skip_locking is set by new callers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	Btrfs: remove always true check in unlock_up	Liu Bo
	As unlock_up() is written as for () { if (!path->locks[i]) break; ... if (... && path->locks[i]) { } } Apparently, @path->locks[i] is always true at this 'if'. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	Btrfs: grab write lock directly if write_lock_level is the max level	Liu Bo
	Typically, when acquiring root node's lock, btrfs tries its best to get read lock and trade for write lock if @write_lock_level implies to do so. In case of (cow && (p->keep_locks \|\| p->lowest_level)), write_lock_level is set to BTRFS_MAX_LEVEL, which means we need to acquire root node's write lock directly. In this particular case, the dance of acquiring read lock and then trading for write lock can be saved. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	Btrfs: move get root out of btrfs_search_slot to a helper	Liu Bo
	It's good to have a helper instead of having all get-root details open-coded. The new helper locks (if necessary) and sets root node of the path. Also invert the checks to make the code flow easier to read. There is no functional change in this commit. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	Btrfs: use more straightforward extent_buffer_uptodate check	Liu Bo
	If parent_transid "0" is passed to btrfs_buffer_uptodate(), btrfs_buffer_uptodate() is equivalent to extent_buffer_uptodate(), but extent_buffer_uptodate() is preferred since we don't have to look into verify_parent_transid(). Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	Btrfs: remove superfluous free_extent_buffer in read_block_for_search	Liu Bo
	read_block_for_search() can be simplified as: tmp = find_extent_buffer(); if (tmp) return; ... free_extent_buffer(); read_tree_block(); Apparently, @tmp must be NULL at this point, free_extent_buffer() is not needed. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30	btrfs: drop unused space_info parameter from create_space_info	Lu Fengqi
	Since commit dc2d3005d27d ("btrfs: remove dead create_space_info calls"), there is only one caller btrfs_init_space_info. However, it doesn't need create_space_info to return space_info at all. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>