summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-02-22mbcache2: reimplement mbcacheJan Kara
Original mbcache was designed to have more features than what ext? filesystems ended up using. It supported entry being in more hashes, it had a home-grown rwlocking of each entry, and one cache could cache entries from multiple filesystems. This genericity also resulted in more complex locking, larger cache entries, and generally more code complexity. This is reimplementation of the mbcache functionality to exactly fit the purpose ext? filesystems use it for. Cache entries are now considerably smaller (7 instead of 13 longs), the code is considerably smaller as well (414 vs 913 lines of code), and IMO also simpler. The new code is also much more lightweight. I have measured the speed using artificial xattr-bench benchmark, which spawns P processes, each process sets xattr for F different files, and the value of xattr is randomly chosen from a pool of V values. Averages of runtimes for 5 runs for various combinations of parameters are below. The first value in each cell is old mbache, the second value is the new mbcache. V=10 F\P 1 2 4 8 16 32 64 10 0.158,0.157 0.208,0.196 0.500,0.277 0.798,0.400 3.258,0.584 13.807,1.047 61.339,2.803 100 0.172,0.167 0.279,0.222 0.520,0.275 0.825,0.341 2.981,0.505 12.022,1.202 44.641,2.943 1000 0.185,0.174 0.297,0.239 0.445,0.283 0.767,0.340 2.329,0.480 6.342,1.198 16.440,3.888 V=100 F\P 1 2 4 8 16 32 64 10 0.162,0.153 0.200,0.186 0.362,0.257 0.671,0.496 1.433,0.943 3.801,1.345 7.938,2.501 100 0.153,0.160 0.221,0.199 0.404,0.264 0.945,0.379 1.556,0.485 3.761,1.156 7.901,2.484 1000 0.215,0.191 0.303,0.246 0.471,0.288 0.960,0.347 1.647,0.479 3.916,1.176 8.058,3.160 V=1000 F\P 1 2 4 8 16 32 64 10 0.151,0.129 0.210,0.163 0.326,0.245 0.685,0.521 1.284,0.859 3.087,2.251 6.451,4.801 100 0.154,0.153 0.211,0.191 0.276,0.282 0.687,0.506 1.202,0.877 3.259,1.954 8.738,2.887 1000 0.145,0.179 0.202,0.222 0.449,0.319 0.899,0.333 1.577,0.524 4.221,1.240 9.782,3.579 V=10000 F\P 1 2 4 8 16 32 64 10 0.161,0.154 0.198,0.190 0.296,0.256 0.662,0.480 1.192,0.818 2.989,2.200 6.362,4.746 100 0.176,0.174 0.236,0.203 0.326,0.255 0.696,0.511 1.183,0.855 4.205,3.444 19.510,17.760 1000 0.199,0.183 0.240,0.227 1.159,1.014 2.286,2.154 6.023,6.039 ---,10.933 ---,36.620 V=100000 F\P 1 2 4 8 16 32 64 10 0.171,0.162 0.204,0.198 0.285,0.230 0.692,0.500 1.225,0.881 2.990,2.243 6.379,4.771 100 0.151,0.171 0.220,0.210 0.295,0.255 0.720,0.518 1.226,0.844 3.423,2.831 19.234,17.544 1000 0.192,0.189 0.249,0.225 1.162,1.043 2.257,2.093 5.853,4.997 ---,10.399 ---,32.198 We see that the new code is faster in pretty much all the cases and starting from 4 processes there are significant gains with the new code resulting in upto 20-times shorter runtimes. Also for large numbers of cached entries all values for the old code could not be measured as the kernel started hitting softlockups and died before the test completed. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-21ext4: iterate over buffer heads correctly in move_extent_per_page()Eryu Guan
In commit bcff24887d00 ("ext4: don't read blocks from disk after extents being swapped") bh is not updated correctly in the for loop and wrong data has been written to disk. generic/324 catches this on sub-page block size ext4. Fixes: bcff24887d00 ("ext4: don't read blocks from disk after extentsbeing swapped") Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2016-02-21ext4: make sure to revoke all the freeable blocks in ext4_free_blocksDaeho Jeong
Now, ext4_free_blocks() doesn't revoke data blocks of per-file data journalled inode and it can cause file data inconsistency problems. Even though data blocks of per-file data journalled inode are already forgotten by jbd2_journal_invalidatepage() in advance of invoking ext4_free_blocks(), we still need to revoke the data blocks here. Moreover some of the metadata blocks, which are not found by sb_find_get_block(), are still needed to be revoked, but this is also missing here. Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2016-02-21vfs: define kernel_copy_file_from_fd()Mimi Zohar
This patch defines kernel_read_file_from_fd(), a wrapper for the VFS common kernel_read_file(). Changelog: - Separated from the kernel modules patch Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2016-02-21security: define kernel_read_file hookMimi Zohar
The kernel_read_file security hook is called prior to reading the file into memory. Changelog v4+: - export security_kernel_read_file() Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Luis R. Rodriguez <mcgrof@kernel.org> Acked-by: Casey Schaufler <casey@schaufler-ca.com>
2016-02-21vfs: define kernel_read_file_from_pathMimi Zohar
This patch defines kernel_read_file_from_path(), a wrapper for the VFS common kernel_read_file(). Changelog: - revert error msg regression - reported by Sergey Senozhatsky - Separated from the IMA patch Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "This is unusually large, partly due to the EFI fixes that prevent accidental deletion of EFI variables through efivarfs that may brick machines. These fixes are somewhat involved to maintain compatibility with existing install methods and other usage modes, while trying to turn off the 'rm -rf' bricking vector. Other fixes are for large page ioremap()s and for non-temporal user-memcpy()s" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Fix vmalloc_fault() to handle large pages properly hpet: Drop stale URLs x86/uaccess/64: Handle the caching of 4-byte nocache copies properly in __copy_user_nocache() x86/uaccess/64: Make the __copy_user_nocache() assembly code more readable lib/ucs2_string: Correct ucs2 -> utf8 conversion efi: Add pstore variables to the deletion whitelist efi: Make efivarfs entries immutable by default efi: Make our variable validation list include the guid efi: Do variable name validation tests in utf8 efi: Use ucs2_as_utf8 in efivarfs instead of open coding a bad version lib/ucs2_string: Add ucs2 -> utf8 helper functions
2016-02-20fs/pnode.c: treat zero mnt_group_id-s as unequalMaxim Patlasov
propagate_one(m) calculates "type" argument for copy_tree() like this: > if (m->mnt_group_id == last_dest->mnt_group_id) { > type = CL_MAKE_SHARED; > } else { > type = CL_SLAVE; > if (IS_MNT_SHARED(m)) > type |= CL_MAKE_SHARED; > } The "type" argument then governs clone_mnt() behavior with respect to flags and mnt_master of new mount. When we iterate through a slave group, it is possible that both current "m" and "last_dest" are not shared (although, both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison above erroneously makes new mount shared and sets its mnt_master to last_source->mnt_master. The patch fixes the problem by handling zero mnt_group_id-s as though they are unequal. The similar problem exists in the implementation of "else" clause above when we have to ascend upward in the master/slave tree by calling: > last_source = last_source->mnt_master; > last_dest = last_source->mnt_parent; proper number of times. The last step is governed by "n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if both are zero. The patch fixes this case in the same way as the former one. [AV: don't open-code an obvious helper...] Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20affs_do_readpage_ofs(): just use kmap_atomic() around memcpy()Al Viro
It forgets kunmap() on a failure exit, but there's really no point keeping the page kmapped at all - after all, what we are doing is a bunch of memcpy() into the parts of page, so kmap_atomic()/kunmap_atomic() just around those memcpy() is enough. Spotted-by: Insu Yun <wuninsu@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20xattr handlers: plug a lock leak in simple_xattr_listMateusz Guzik
The code could leak xattrs->lock on error. Problem introduced with 786534b92f3ce68f4 "tmpfs: listxattr should include POSIX ACL xattrs". Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20fs: allow no_seek_end_llseek to actually seekWouter van Kesteren
The user-visible impact of the issue is for example that without this patch sensors-detect breaks when trying to seek in /dev/cpu/0/cpuid. '~0ULL' is a 'unsigned long long' that when converted to a loff_t, which is signed, gets turned into -1. later in vfs_setpos we have 'if (offset > maxsize)', which makes it always return EINVAL. Fixes: b25472f9b961 ("new helpers: no_seek_end_llseek{,_size}()") Signed-off-by: Wouter van Kesteren <woutershep@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-19Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "Miscellaneous ext4 bug fixes for v4.5" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix crashes in dioread_nolock mode ext4: fix bh->b_state corruption ext4: fix memleak in ext4_readdir() ext4: remove unused parameter "newblock" in convert_initialized_extent() ext4: don't read blocks from disk after extents being swapped ext4: fix potential integer overflow ext4: add a line break for proc mb_groups display ext4: ioctl: fix erroneous return value ext4: fix scheduling in atomic on group checksum failure ext4 crypto: move context consistency check to ext4_file_open() ext4 crypto: revalidate dentry after adding or removing the key
2016-02-19Merge branch 'for-linus-4.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "My for-linus-4.5 branch has a btrfs DIO error passing fix. I know how much you love DIO, so I'm going to suggest against reading it. We'll follow up with a patch to drop the error arg from dio_end_io in the next merge window." * 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix direct IO requests not reporting IO error to user space
2016-02-19Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge fixes from Andrew Morton: "10 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: slab: free kmem_cache_node after destroy sysfs file ipc/shm: handle removed segments gracefully in shm_mmap() MAINTAINERS: update Kselftest Framework mailing list devm_memremap_release(): fix memremap'd addr handling mm/hugetlb.c: fix incorrect proc nr_hugepages value mm, x86: fix pte_page() crash in gup_pte_range() fsnotify: turn fsnotify reaper thread into a workqueue job Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread" mm: fix regression in remap_file_pages() emulation thp, dax: do not try to withdraw pgtable from non-anon VMA
2016-02-19orangefs: get rid of op refcountsAl Viro
not needed anymore Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19orangefs: have ..._clean_interrupted_...() wait for copy to/from daemonAl Viro
* turn all those list_del(&op->list) into list_del_init() * don't pick ops that are already given up in control device ->read()/->write_iter(). * have orangefs_clean_interrupted_operation() notice if op is currently being copied to/from daemon (by said ->read()/->write_iter()) and wait for that to finish. * when we are done copying to/from daemon and find that it had been given up while we were doing that, wake the waiting ..._clean_interrupted_... As the result, we are guaranteed that orangefs_clean_interrupted_operation(op) doesn't return until nobody else can see op. Moreover, we don't need to play with op refcounts anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19orangefs: set correct ->downcall.status on failing to copy reply from daemonAl Viro
... and clean the end of control device ->write_iter() while we are at it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19Orangefs: remove vestigial ASYNC codeMike Marshall
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19Orangefs: make some gossip statements more helpful.Mike Marshall
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19orangefs: get rid of op->doneAl Viro
shouldn't be needed now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19orangefs_readdir_index_put(): get rid of bufmap argumentAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19orangefs: bufmap rewriteAl Viro
new waiting-for-slot logics: * make request for slot wait for bufmap to be set up if it comes before it's installed *OR* while it's running down * make closing control device wait for all slots to be freed * waiting itself rewritten to (open-coded) analogues of wait_event_... primitives - we would need wait_event_locked() and, pardon an obscenely long name, wait_event_interruptible_exclusive_timeout_locked(). * we never wait for more than slot_timeout_secs in total and, if during the wait the daemon goes away, we only allow ORANGEFS_BUFMAP_WAIT_TIMEOUT_SECS for it to come back. * (cosmetical) bitmap is used instead of an array of zeroes and ones * old (and only reached if we are about to corrupt memory) waiting for daemon restart in service_operation() removed. [Martin's fixes folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19orangefs_bufmap_..._query(): don't bother with refcountsAl Viro
... just hold the spinlock while fetching the field in question. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19orangefs: lift handling of timeouts and attempts count to service_operation()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19service_operation(): don't block signals, just use ..._killableAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19orangefs: sanitize handling of request listAl Viro
* checking that daemon is running (to decide whether we want to limit the timeout) should be done *after* the damn thing is included into the list; doing that before means that if the daemon gets shut down in between, we'll end up waiting indefinitely (== up to kill -9). * cancels should go into the head of the queue - the sooner they are picked, the less work daemon has to do and the sooner we get to free the slot held by aborted operation. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19orangefs: get rid of loop in wait_for_matching_downcall()Al Viro
turn op->waitq into struct completion... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19orangefs: use S_ISREG(mode) and friends instead of mode & S_IFREG.Martin Brandenburg
Suggestion from Dan Carpenter. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19orangefs: delay freeing slot until cancel completesAl Viro
Make cancels reuse the aborted read/write op, to make sure they do not fail on lack of memory. Don't issue a cancel unless the daemon has seen our read/write, has not replied and isn't being shut down. If cancel *is* issued, don't wait for it to complete; stash the slot in there and just have it freed when cancel is finally replied to or purged (and delay dropping the reference until then, obviously). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19ext4: Make Q_GETNEXTQUOTA work for quota in hidden inodesEric Sandeen
We forgot to set .get_nextdqblk operation in quotactl_ops structure used by ext4 when quota is using hidden inode thus the operation was not really supported. Fix the omission. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-19ext4: fix crashes in dioread_nolock modeJan Kara
Competing overwrite DIO in dioread_nolock mode will just overwrite pointer to io_end in the inode. This may result in data corruption or extent conversion happening from IO completion interrupt because we don't properly set buffer_defer_completion() when unlocked DIO races with locked DIO to unwritten extent. Since unlocked DIO doesn't need io_end for anything, just avoid allocating it and corrupting pointer from inode for locked DIO. A cleaner fix would be to avoid these games with io_end pointer from the inode but that requires more intrusive changes so we leave that for later. Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-19ext4: fix bh->b_state corruptionJan Kara
ext4 can update bh->b_state non-atomically in _ext4_get_block() and ext4_da_get_block_prep(). Usually this is fine since bh is just a temporary storage for mapping information on stack but in some cases it can be fully living bh attached to a page. In such case non-atomic update of bh->b_state can race with an atomic update which then gets lost. Usually when we are mapping bh and thus updating bh->b_state non-atomically, nobody else touches the bh and so things work out fine but there is one case to especially worry about: ext4_finish_bio() uses BH_Uptodate_Lock on the first bh in the page to synchronize handling of PageWriteback state. So when blocksize < pagesize, we can be atomically modifying bh->b_state of a buffer that actually isn't under IO and thus can race e.g. with delalloc trying to map that buffer. The result is that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock. Fix the problem by always updating bh->b_state bits atomically. CC: stable@vger.kernel.org Reported-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-18fsnotify: turn fsnotify reaper thread into a workqueue jobJeff Layton
We don't require a dedicated thread for fsnotify cleanup. Switch it over to a workqueue job instead that runs on the system_unbound_wq. In the interest of not thrashing the queued job too often when there are a lot of marks being removed, we delay the reaper job slightly when queueing it, to allow several to gather on the list. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Tested-by: Eryu Guan <guaneryu@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread"Jeff Layton
This reverts commit c510eff6beba ("fsnotify: destroy marks with call_srcu instead of dedicated thread"). Eryu reported that he was seeing some OOM kills kick in when running a testcase that adds and removes inotify marks on a file in a tight loop. The above commit changed the code to use call_srcu to clean up the marks. While that does (in principle) work, the srcu callback job is limited to cleaning up entries in small batches and only once per jiffy. It's easily possible to overwhelm that machinery with too many call_srcu callbacks, and Eryu's reproduer did just that. There's also another potential problem with using call_srcu here. While you can obviously sleep while holding the srcu_read_lock, the callbacks run under local_bh_disable, so you can't sleep there. It's possible when putting the last reference to the fsnotify_mark that we'll end up putting a chain of references including the fsnotify_group, uid, and associated keys. While I don't see any obvious ways that that could occurs, it's probably still best to avoid using call_srcu here after all. This patch reverts the above patch. A later patch will take a different approach to eliminated the dedicated thread here. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reported-by: Eryu Guan <guaneryu@gmail.com> Tested-by: Eryu Guan <guaneryu@gmail.com> Cc: Jan Kara <jack@suse.com> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18vfs: define kernel_read_file_id enumerationMimi Zohar
To differentiate between the kernel_read_file() callers, this patch defines a new enumeration named kernel_read_file_id and includes the caller identifier as an argument. Subsequent patches define READING_KEXEC_IMAGE, READING_KEXEC_INITRAMFS, READING_FIRMWARE, READING_MODULE, and READING_POLICY. Changelog v3: - Replace the IMA specific enumeration with a generic one. Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk>
2016-02-18vfs: define a generic function to read a file from the kernelMimi Zohar
For a while it was looked down upon to directly read files from Linux. These days there exists a few mechanisms in the kernel that do just this though to load a file into a local buffer. There are minor but important checks differences on each. This patch set is the first attempt at resolving some of these differences. This patch introduces a common function for reading files from the kernel with the corresponding security post-read hook and function. Changelog v4+: - export security_kernel_post_read_file() - Fengguang Wu v3: - additional bounds checking - Luis v2: - To simplify patch review, re-ordered patches Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Reviewed-by: Luis R. Rodriguez <mcgrof@suse.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk>
2016-02-18x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smapsDave Hansen
The protection key can now be just as important as read/write permissions on a VMA. We need some debug mechanism to help figure out if it is in play. smaps seems like a logical place to expose it. arch/x86/kernel/setup.c is a bit of a weirdo place to put this code, but it already had seq_file.h and there was not a much better existing place to put it. We also use no #ifdef. If protection keys is .config'd out we will effectively get the same function as if we used the weak generic function. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Joerg Roedel <jroedel@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Mark Williamson <mwilliamson@undo-software.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210227.4F8EB3F8@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-18quota: Forbid Q_GETQUOTA and Q_GETNEXTQUOTA for frozen filesystemJan Kara
Commit 7955118eafc4 (quota: Allow Q_GETQUOTA for frozen filesystem) allowed Q_GETQUOTA call for frozen filesystem. It makes sense on the first look but zero-day testing has shown that with this change ext4 warns about starting a transaction for frozen filesystem. This happens because ext4_acquire_dquot() prepares for allocating space for new quota structure. Although it would be possible to implement Q_GETQUOTA for ext4 without allocating space for non-existent structures, the matter further complicates because OCFS2 needs to update on-disk structure use count when a new cluster node loads quota information from disk. So just revert the change and forbid Q_GETQUOTA together with Q_GETNEXTQUOTA for frozen filesystem. Add comment to quotactl_cmd_write() to save us from repeating this excercise in a few years when I forget again. Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-18quota: Fix possible races during quota loadingJan Kara
When loading new quota structure from disk, there is a possibility caller of dqget() will see uninitialized data due to CPU reordering loads or stores - loads from dquot can be reordered before test of DQ_ACTIVE_B bit or setting of this bit could be reordered before filling of the structure. Fix the issue by adding proper memory barriers. Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-18btrfs: fix memory leak of fs_info in block group cacheKinglong Mee
When starting up linux with btrfs filesystem, I got many memory leak messages by kmemleak as, unreferenced object 0xffff880066882000 (size 4096): comm "modprobe", pid 730, jiffies 4294690024 (age 196.599s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8174d52e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811d09aa>] kmem_cache_alloc_trace+0xea/0x1e0 [<ffffffffa03620fb>] btrfs_alloc_dummy_fs_info+0x6b/0x2a0 [btrfs] [<ffffffffa03624fc>] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs] [<ffffffffa0360aa9>] btrfs_test_free_space_cache+0x39/0xed0 [btrfs] [<ffffffffa03b5a74>] trace_raw_output_xfs_attr_class+0x54/0xe0 [xfs] [<ffffffff81002122>] do_one_initcall+0xb2/0x1f0 [<ffffffff811765aa>] do_init_module+0x5e/0x1e9 [<ffffffff810fec09>] load_module+0x20a9/0x2690 [<ffffffff810ff439>] SyS_finit_module+0xb9/0xf0 [<ffffffff81757daf>] entry_SYSCALL_64_fastpath+0x12/0x76 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff8800573f8000 (size 10256): comm "modprobe", pid 730, jiffies 4294690185 (age 196.460s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8174d52e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff8119ca6e>] kmalloc_order+0x5e/0x70 [<ffffffff8119caa4>] kmalloc_order_trace+0x24/0x90 [<ffffffffa03620b3>] btrfs_alloc_dummy_fs_info+0x23/0x2a0 [btrfs] [<ffffffffa03624fc>] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs] [<ffffffffa036603d>] run_test+0xfd/0x320 [btrfs] [<ffffffffa0366f34>] btrfs_test_free_space_tree+0x94/0xee [btrfs] [<ffffffffa03b5aab>] trace_raw_output_xfs_attr_class+0x8b/0xe0 [xfs] [<ffffffff81002122>] do_one_initcall+0xb2/0x1f0 [<ffffffff811765aa>] do_init_module+0x5e/0x1e9 [<ffffffff810fec09>] load_module+0x20a9/0x2690 [<ffffffff810ff439>] SyS_finit_module+0xb9/0xf0 [<ffffffff81757daf>] entry_SYSCALL_64_fastpath+0x12/0x76 [<ffffffffffffffff>] 0xffffffffffffffff This patch lets btrfs using fs_info stored in btrfs_root for block group cache directly without allocating a new one. Fixes: d0bd456074 ("Btrfs: add fragment=* debug mount option") Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18btrfs: Continue write in case of can_not_nocowZhao Lei
btrfs failed in xfstests btrfs/080 with -o nodatacow. Can be reproduced by following script: DEV=/dev/vdg MNT=/mnt/tmp umount $DEV &>/dev/null mkfs.btrfs -f $DEV mount -o nodatacow $DEV $MNT dd if=/dev/zero of=$MNT/test bs=1 count=2048 & btrfs subvolume snapshot -r $MNT $MNT/test_snap & wait -- We can see dd failed on NO_SPACE. Reason: __btrfs_buffered_write should run cow write when no_cow impossible, and current code is designed with above logic. But check_can_nocow() have 2 type of return value(0 and <0) on can_not_no_cow, and current code only continue write on first case, the second case happened in doing subvolume. Fix: Continue write when check_can_nocow() return 0 and <0. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
2016-02-18btrfs: drop null testing before destroy functionsKinglong Mee
Cleanup. kmem_cache_destroy has support NULL argument checking, so drop the double null testing before calling it. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18btrfs: fix build warningSudip Mukherjee
We were getting build warning about: fs/btrfs/extent-tree.c:7021:34: warning: ‘used_bg’ may be used uninitialized in this function It is not a valid warning as used_bg is never used uninitilized since locked is initially false so we can never be in the section where 'used_bg' is used. But gcc is not able to understand that and we can initialize it while declaring to silence the warning. Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18btrfs: use proper type for failrec in extent_stateDavid Sterba
We use the private member of extent_state to store the failrec and play pointless pointer games. Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18btrfs: Replace CURRENT_TIME by current_fs_time()Deepa Dinamani
CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_fs_time() instead. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: linux-btrfs@vger.kernel.org Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18btrfs: remove open-coded swap() in backref.c:__merge_refsDave Jones
The kernel provides a swap() that does the same thing as this code. Signed-off-by: Dave Jones <dsj@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18btrfs: remove redundant error checkByongho Lee
While running btrfs_mksubvol(), d_really_is_positive() is called twice. First in btrfs_mksubvol() and second inside btrfs_may_create(). So I remove the first one. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18btrfs: simplify expression in btrfs_calc_trans_metadata_size()Byongho Lee
Simplify expression in btrfs_calc_trans_metadata_size(). Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18Btrfs: check reserved when deciding to background flushJosef Bacik
We will sometimes start background flushing the various enospc related things (delayed nodes, delalloc, etc) if we are getting close to reserving all of our available space. We don't want to do this however when we are actually using this space as it causes unneeded thrashing. We currently try to do this by checking bytes_used >= thresh, but bytes_used is only part of the equation, we need to use bytes_reserved as well as this represents space that is very likely to become bytes_used in the future. My tracing tool will keep count of the number of times we kick off the async flusher, the following are counts for the entire run of generic/027 No Patch Patch avg: 5385 5009 median: 5500 4916 We skewed lower than the average with my patch and higher than the average with the patch, overall it cuts the flushing from anywhere from 5-10%, which in the case of actual ENOSPC is quite helpful. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18Btrfs: add transaction space reservation tracepointsJosef Bacik
There are a few places where we add to trans->bytes_reserved but don't have the corresponding trace point. With these added my tool no longer sees transaction leaks. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>