Age | Commit message (Collapse) | Author |
|
Original mbcache was designed to have more features than what ext?
filesystems ended up using. It supported entry being in more hashes, it
had a home-grown rwlocking of each entry, and one cache could cache
entries from multiple filesystems. This genericity also resulted in more
complex locking, larger cache entries, and generally more code
complexity.
This is reimplementation of the mbcache functionality to exactly fit the
purpose ext? filesystems use it for. Cache entries are now considerably
smaller (7 instead of 13 longs), the code is considerably smaller as
well (414 vs 913 lines of code), and IMO also simpler. The new code is
also much more lightweight.
I have measured the speed using artificial xattr-bench benchmark, which
spawns P processes, each process sets xattr for F different files, and
the value of xattr is randomly chosen from a pool of V values. Averages
of runtimes for 5 runs for various combinations of parameters are below.
The first value in each cell is old mbache, the second value is the new
mbcache.
V=10
F\P 1 2 4 8 16 32 64
10 0.158,0.157 0.208,0.196 0.500,0.277 0.798,0.400 3.258,0.584 13.807,1.047 61.339,2.803
100 0.172,0.167 0.279,0.222 0.520,0.275 0.825,0.341 2.981,0.505 12.022,1.202 44.641,2.943
1000 0.185,0.174 0.297,0.239 0.445,0.283 0.767,0.340 2.329,0.480 6.342,1.198 16.440,3.888
V=100
F\P 1 2 4 8 16 32 64
10 0.162,0.153 0.200,0.186 0.362,0.257 0.671,0.496 1.433,0.943 3.801,1.345 7.938,2.501
100 0.153,0.160 0.221,0.199 0.404,0.264 0.945,0.379 1.556,0.485 3.761,1.156 7.901,2.484
1000 0.215,0.191 0.303,0.246 0.471,0.288 0.960,0.347 1.647,0.479 3.916,1.176 8.058,3.160
V=1000
F\P 1 2 4 8 16 32 64
10 0.151,0.129 0.210,0.163 0.326,0.245 0.685,0.521 1.284,0.859 3.087,2.251 6.451,4.801
100 0.154,0.153 0.211,0.191 0.276,0.282 0.687,0.506 1.202,0.877 3.259,1.954 8.738,2.887
1000 0.145,0.179 0.202,0.222 0.449,0.319 0.899,0.333 1.577,0.524 4.221,1.240 9.782,3.579
V=10000
F\P 1 2 4 8 16 32 64
10 0.161,0.154 0.198,0.190 0.296,0.256 0.662,0.480 1.192,0.818 2.989,2.200 6.362,4.746
100 0.176,0.174 0.236,0.203 0.326,0.255 0.696,0.511 1.183,0.855 4.205,3.444 19.510,17.760
1000 0.199,0.183 0.240,0.227 1.159,1.014 2.286,2.154 6.023,6.039 ---,10.933 ---,36.620
V=100000
F\P 1 2 4 8 16 32 64
10 0.171,0.162 0.204,0.198 0.285,0.230 0.692,0.500 1.225,0.881 2.990,2.243 6.379,4.771
100 0.151,0.171 0.220,0.210 0.295,0.255 0.720,0.518 1.226,0.844 3.423,2.831 19.234,17.544
1000 0.192,0.189 0.249,0.225 1.162,1.043 2.257,2.093 5.853,4.997 ---,10.399 ---,32.198
We see that the new code is faster in pretty much all the cases and
starting from 4 processes there are significant gains with the new code
resulting in upto 20-times shorter runtimes. Also for large numbers of
cached entries all values for the old code could not be measured as the
kernel started hitting softlockups and died before the test completed.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In commit bcff24887d00 ("ext4: don't read blocks from disk after extents
being swapped") bh is not updated correctly in the for loop and wrong
data has been written to disk. generic/324 catches this on sub-page
block size ext4.
Fixes: bcff24887d00 ("ext4: don't read blocks from disk after extentsbeing swapped")
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
Now, ext4_free_blocks() doesn't revoke data blocks of per-file data
journalled inode and it can cause file data inconsistency problems.
Even though data blocks of per-file data journalled inode are already
forgotten by jbd2_journal_invalidatepage() in advance of invoking
ext4_free_blocks(), we still need to revoke the data blocks here.
Moreover some of the metadata blocks, which are not found by
sb_find_get_block(), are still needed to be revoked, but this is also
missing here.
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
This patch defines kernel_read_file_from_fd(), a wrapper for the VFS
common kernel_read_file().
Changelog:
- Separated from the kernel modules patch
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
|
|
The kernel_read_file security hook is called prior to reading the file
into memory.
Changelog v4+:
- export security_kernel_read_file()
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
|
|
This patch defines kernel_read_file_from_path(), a wrapper for the VFS
common kernel_read_file().
Changelog:
- revert error msg regression - reported by Sergey Senozhatsky
- Separated from the IMA patch
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"This is unusually large, partly due to the EFI fixes that prevent
accidental deletion of EFI variables through efivarfs that may brick
machines. These fixes are somewhat involved to maintain compatibility
with existing install methods and other usage modes, while trying to
turn off the 'rm -rf' bricking vector.
Other fixes are for large page ioremap()s and for non-temporal
user-memcpy()s"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Fix vmalloc_fault() to handle large pages properly
hpet: Drop stale URLs
x86/uaccess/64: Handle the caching of 4-byte nocache copies properly in __copy_user_nocache()
x86/uaccess/64: Make the __copy_user_nocache() assembly code more readable
lib/ucs2_string: Correct ucs2 -> utf8 conversion
efi: Add pstore variables to the deletion whitelist
efi: Make efivarfs entries immutable by default
efi: Make our variable validation list include the guid
efi: Do variable name validation tests in utf8
efi: Use ucs2_as_utf8 in efivarfs instead of open coding a bad version
lib/ucs2_string: Add ucs2 -> utf8 helper functions
|
|
propagate_one(m) calculates "type" argument for copy_tree() like this:
> if (m->mnt_group_id == last_dest->mnt_group_id) {
> type = CL_MAKE_SHARED;
> } else {
> type = CL_SLAVE;
> if (IS_MNT_SHARED(m))
> type |= CL_MAKE_SHARED;
> }
The "type" argument then governs clone_mnt() behavior with respect to flags
and mnt_master of new mount. When we iterate through a slave group, it is
possible that both current "m" and "last_dest" are not shared (although,
both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison
above erroneously makes new mount shared and sets its mnt_master to
last_source->mnt_master. The patch fixes the problem by handling zero
mnt_group_id-s as though they are unequal.
The similar problem exists in the implementation of "else" clause above
when we have to ascend upward in the master/slave tree by calling:
> last_source = last_source->mnt_master;
> last_dest = last_source->mnt_parent;
proper number of times. The last step is governed by
"n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if
both are zero. The patch fixes this case in the same way as the former one.
[AV: don't open-code an obvious helper...]
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
It forgets kunmap() on a failure exit, but there's really no point keeping
the page kmapped at all - after all, what we are doing is a bunch of memcpy()
into the parts of page, so kmap_atomic()/kunmap_atomic() just around those
memcpy() is enough.
Spotted-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The code could leak xattrs->lock on error.
Problem introduced with 786534b92f3ce68f4 "tmpfs: listxattr should
include POSIX ACL xattrs".
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The user-visible impact of the issue is for example that without this
patch sensors-detect breaks when trying to seek in /dev/cpu/0/cpuid.
'~0ULL' is a 'unsigned long long' that when converted to a loff_t,
which is signed, gets turned into -1. later in vfs_setpos we have
'if (offset > maxsize)', which makes it always return EINVAL.
Fixes: b25472f9b961 ("new helpers: no_seek_end_llseek{,_size}()")
Signed-off-by: Wouter van Kesteren <woutershep@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"Miscellaneous ext4 bug fixes for v4.5"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix crashes in dioread_nolock mode
ext4: fix bh->b_state corruption
ext4: fix memleak in ext4_readdir()
ext4: remove unused parameter "newblock" in convert_initialized_extent()
ext4: don't read blocks from disk after extents being swapped
ext4: fix potential integer overflow
ext4: add a line break for proc mb_groups display
ext4: ioctl: fix erroneous return value
ext4: fix scheduling in atomic on group checksum failure
ext4 crypto: move context consistency check to ext4_file_open()
ext4 crypto: revalidate dentry after adding or removing the key
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
"My for-linus-4.5 branch has a btrfs DIO error passing fix.
I know how much you love DIO, so I'm going to suggest against reading
it. We'll follow up with a patch to drop the error arg from
dio_end_io in the next merge window."
* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix direct IO requests not reporting IO error to user space
|
|
Merge fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: slab: free kmem_cache_node after destroy sysfs file
ipc/shm: handle removed segments gracefully in shm_mmap()
MAINTAINERS: update Kselftest Framework mailing list
devm_memremap_release(): fix memremap'd addr handling
mm/hugetlb.c: fix incorrect proc nr_hugepages value
mm, x86: fix pte_page() crash in gup_pte_range()
fsnotify: turn fsnotify reaper thread into a workqueue job
Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread"
mm: fix regression in remap_file_pages() emulation
thp, dax: do not try to withdraw pgtable from non-anon VMA
|
|
not needed anymore
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
* turn all those list_del(&op->list) into list_del_init()
* don't pick ops that are already given up in control device
->read()/->write_iter().
* have orangefs_clean_interrupted_operation() notice if op is currently
being copied to/from daemon (by said ->read()/->write_iter()) and
wait for that to finish.
* when we are done copying to/from daemon and find that it had been
given up while we were doing that, wake the waiting ..._clean_interrupted_...
As the result, we are guaranteed that orangefs_clean_interrupted_operation(op)
doesn't return until nobody else can see op. Moreover, we don't need to play
with op refcounts anymore.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
... and clean the end of control device ->write_iter() while we are at it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
shouldn't be needed now
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
new waiting-for-slot logics:
* make request for slot wait for bufmap to be set up if it
comes before it's installed *OR* while it's running down
* make closing control device wait for all slots to be freed
* waiting itself rewritten to (open-coded) analogues of wait_event_...
primitives - we would need wait_event_locked() and, pardon an obscenely
long name, wait_event_interruptible_exclusive_timeout_locked().
* we never wait for more than slot_timeout_secs in total and,
if during the wait the daemon goes away, we only allow
ORANGEFS_BUFMAP_WAIT_TIMEOUT_SECS for it to come back.
* (cosmetical) bitmap is used instead of an array of zeroes and ones
* old (and only reached if we are about to corrupt memory) waiting
for daemon restart in service_operation() removed.
[Martin's fixes folded]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
... just hold the spinlock while fetching the field in question.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
* checking that daemon is running (to decide whether we want to limit
the timeout) should be done *after* the damn thing is included into
the list; doing that before means that if the daemon gets shut down
in between, we'll end up waiting indefinitely (== up to kill -9).
* cancels should go into the head of the queue - the sooner they
are picked, the less work daemon has to do and the sooner we get to
free the slot held by aborted operation.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
turn op->waitq into struct completion...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Suggestion from Dan Carpenter.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Make cancels reuse the aborted read/write op, to make sure they do not
fail on lack of memory.
Don't issue a cancel unless the daemon has seen our read/write, has not
replied and isn't being shut down.
If cancel *is* issued, don't wait for it to complete; stash the slot
in there and just have it freed when cancel is finally replied to or
purged (and delay dropping the reference until then, obviously).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
We forgot to set .get_nextdqblk operation in quotactl_ops structure used
by ext4 when quota is using hidden inode thus the operation was not
really supported. Fix the omission.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Competing overwrite DIO in dioread_nolock mode will just overwrite
pointer to io_end in the inode. This may result in data corruption or
extent conversion happening from IO completion interrupt because we
don't properly set buffer_defer_completion() when unlocked DIO races
with locked DIO to unwritten extent.
Since unlocked DIO doesn't need io_end for anything, just avoid
allocating it and corrupting pointer from inode for locked DIO.
A cleaner fix would be to avoid these games with io_end pointer from the
inode but that requires more intrusive changes so we leave that for
later.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4 can update bh->b_state non-atomically in _ext4_get_block() and
ext4_da_get_block_prep(). Usually this is fine since bh is just a
temporary storage for mapping information on stack but in some cases it
can be fully living bh attached to a page. In such case non-atomic
update of bh->b_state can race with an atomic update which then gets
lost. Usually when we are mapping bh and thus updating bh->b_state
non-atomically, nobody else touches the bh and so things work out fine
but there is one case to especially worry about: ext4_finish_bio() uses
BH_Uptodate_Lock on the first bh in the page to synchronize handling of
PageWriteback state. So when blocksize < pagesize, we can be atomically
modifying bh->b_state of a buffer that actually isn't under IO and thus
can race e.g. with delalloc trying to map that buffer. The result is
that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.
Fix the problem by always updating bh->b_state bits atomically.
CC: stable@vger.kernel.org
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We don't require a dedicated thread for fsnotify cleanup. Switch it
over to a workqueue job instead that runs on the system_unbound_wq.
In the interest of not thrashing the queued job too often when there are
a lot of marks being removed, we delay the reaper job slightly when
queueing it, to allow several to gather on the list.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit c510eff6beba ("fsnotify: destroy marks with
call_srcu instead of dedicated thread").
Eryu reported that he was seeing some OOM kills kick in when running a
testcase that adds and removes inotify marks on a file in a tight loop.
The above commit changed the code to use call_srcu to clean up the
marks. While that does (in principle) work, the srcu callback job is
limited to cleaning up entries in small batches and only once per jiffy.
It's easily possible to overwhelm that machinery with too many call_srcu
callbacks, and Eryu's reproduer did just that.
There's also another potential problem with using call_srcu here. While
you can obviously sleep while holding the srcu_read_lock, the callbacks
run under local_bh_disable, so you can't sleep there.
It's possible when putting the last reference to the fsnotify_mark that
we'll end up putting a chain of references including the fsnotify_group,
uid, and associated keys. While I don't see any obvious ways that that
could occurs, it's probably still best to avoid using call_srcu here
after all.
This patch reverts the above patch. A later patch will take a different
approach to eliminated the dedicated thread here.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Cc: Jan Kara <jack@suse.com>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
To differentiate between the kernel_read_file() callers, this patch
defines a new enumeration named kernel_read_file_id and includes the
caller identifier as an argument.
Subsequent patches define READING_KEXEC_IMAGE, READING_KEXEC_INITRAMFS,
READING_FIRMWARE, READING_MODULE, and READING_POLICY.
Changelog v3:
- Replace the IMA specific enumeration with a generic one.
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
|
|
For a while it was looked down upon to directly read files from Linux.
These days there exists a few mechanisms in the kernel that do just
this though to load a file into a local buffer. There are minor but
important checks differences on each. This patch set is the first
attempt at resolving some of these differences.
This patch introduces a common function for reading files from the kernel
with the corresponding security post-read hook and function.
Changelog v4+:
- export security_kernel_post_read_file() - Fengguang Wu
v3:
- additional bounds checking - Luis
v2:
- To simplify patch review, re-ordered patches
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Reviewed-by: Luis R. Rodriguez <mcgrof@suse.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
|
|
The protection key can now be just as important as read/write
permissions on a VMA. We need some debug mechanism to help
figure out if it is in play. smaps seems like a logical
place to expose it.
arch/x86/kernel/setup.c is a bit of a weirdo place to put
this code, but it already had seq_file.h and there was not
a much better existing place to put it.
We also use no #ifdef. If protection keys is .config'd out we
will effectively get the same function as if we used the weak
generic function.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Mark Williamson <mwilliamson@undo-software.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210227.4F8EB3F8@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Commit 7955118eafc4 (quota: Allow Q_GETQUOTA for frozen filesystem)
allowed Q_GETQUOTA call for frozen filesystem. It makes sense on the
first look but zero-day testing has shown that with this change ext4
warns about starting a transaction for frozen filesystem. This happens
because ext4_acquire_dquot() prepares for allocating space for new quota
structure. Although it would be possible to implement Q_GETQUOTA for
ext4 without allocating space for non-existent structures, the matter
further complicates because OCFS2 needs to update on-disk structure use
count when a new cluster node loads quota information from disk. So just
revert the change and forbid Q_GETQUOTA together with Q_GETNEXTQUOTA for
frozen filesystem. Add comment to quotactl_cmd_write() to save us from
repeating this excercise in a few years when I forget again.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
When loading new quota structure from disk, there is a possibility caller
of dqget() will see uninitialized data due to CPU reordering loads or
stores - loads from dquot can be reordered before test of DQ_ACTIVE_B
bit or setting of this bit could be reordered before filling of the
structure. Fix the issue by adding proper memory barriers.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
When starting up linux with btrfs filesystem, I got many memory leak
messages by kmemleak as,
unreferenced object 0xffff880066882000 (size 4096):
comm "modprobe", pid 730, jiffies 4294690024 (age 196.599s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8174d52e>] kmemleak_alloc+0x4e/0xb0
[<ffffffff811d09aa>] kmem_cache_alloc_trace+0xea/0x1e0
[<ffffffffa03620fb>] btrfs_alloc_dummy_fs_info+0x6b/0x2a0 [btrfs]
[<ffffffffa03624fc>] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs]
[<ffffffffa0360aa9>] btrfs_test_free_space_cache+0x39/0xed0 [btrfs]
[<ffffffffa03b5a74>] trace_raw_output_xfs_attr_class+0x54/0xe0 [xfs]
[<ffffffff81002122>] do_one_initcall+0xb2/0x1f0
[<ffffffff811765aa>] do_init_module+0x5e/0x1e9
[<ffffffff810fec09>] load_module+0x20a9/0x2690
[<ffffffff810ff439>] SyS_finit_module+0xb9/0xf0
[<ffffffff81757daf>] entry_SYSCALL_64_fastpath+0x12/0x76
[<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff8800573f8000 (size 10256):
comm "modprobe", pid 730, jiffies 4294690185 (age 196.460s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8174d52e>] kmemleak_alloc+0x4e/0xb0
[<ffffffff8119ca6e>] kmalloc_order+0x5e/0x70
[<ffffffff8119caa4>] kmalloc_order_trace+0x24/0x90
[<ffffffffa03620b3>] btrfs_alloc_dummy_fs_info+0x23/0x2a0 [btrfs]
[<ffffffffa03624fc>] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs]
[<ffffffffa036603d>] run_test+0xfd/0x320 [btrfs]
[<ffffffffa0366f34>] btrfs_test_free_space_tree+0x94/0xee [btrfs]
[<ffffffffa03b5aab>] trace_raw_output_xfs_attr_class+0x8b/0xe0 [xfs]
[<ffffffff81002122>] do_one_initcall+0xb2/0x1f0
[<ffffffff811765aa>] do_init_module+0x5e/0x1e9
[<ffffffff810fec09>] load_module+0x20a9/0x2690
[<ffffffff810ff439>] SyS_finit_module+0xb9/0xf0
[<ffffffff81757daf>] entry_SYSCALL_64_fastpath+0x12/0x76
[<ffffffffffffffff>] 0xffffffffffffffff
This patch lets btrfs using fs_info stored in btrfs_root for
block group cache directly without allocating a new one.
Fixes: d0bd456074 ("Btrfs: add fragment=* debug mount option")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs failed in xfstests btrfs/080 with -o nodatacow.
Can be reproduced by following script:
DEV=/dev/vdg
MNT=/mnt/tmp
umount $DEV &>/dev/null
mkfs.btrfs -f $DEV
mount -o nodatacow $DEV $MNT
dd if=/dev/zero of=$MNT/test bs=1 count=2048 &
btrfs subvolume snapshot -r $MNT $MNT/test_snap &
wait
--
We can see dd failed on NO_SPACE.
Reason:
__btrfs_buffered_write should run cow write when no_cow impossible,
and current code is designed with above logic.
But check_can_nocow() have 2 type of return value(0 and <0) on
can_not_no_cow, and current code only continue write on first case,
the second case happened in doing subvolume.
Fix:
Continue write when check_can_nocow() return 0 and <0.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
|
|
Cleanup.
kmem_cache_destroy has support NULL argument checking,
so drop the double null testing before calling it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We were getting build warning about:
fs/btrfs/extent-tree.c:7021:34: warning: ‘used_bg’ may be used
uninitialized in this function
It is not a valid warning as used_bg is never used uninitilized since
locked is initially false so we can never be in the section where
'used_bg' is used. But gcc is not able to understand that and we can
initialize it while declaring to silence the warning.
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We use the private member of extent_state to store the failrec and play
pointless pointer games.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_fs_time() instead.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The kernel provides a swap() that does the same thing as this code.
Signed-off-by: Dave Jones <dsj@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
While running btrfs_mksubvol(), d_really_is_positive() is called twice.
First in btrfs_mksubvol() and second inside btrfs_may_create(). So I
remove the first one.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Simplify expression in btrfs_calc_trans_metadata_size().
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We will sometimes start background flushing the various enospc related things
(delayed nodes, delalloc, etc) if we are getting close to reserving all of our
available space. We don't want to do this however when we are actually using
this space as it causes unneeded thrashing. We currently try to do this by
checking bytes_used >= thresh, but bytes_used is only part of the equation, we
need to use bytes_reserved as well as this represents space that is very likely
to become bytes_used in the future.
My tracing tool will keep count of the number of times we kick off the async
flusher, the following are counts for the entire run of generic/027
No Patch Patch
avg: 5385 5009
median: 5500 4916
We skewed lower than the average with my patch and higher than the average with
the patch, overall it cuts the flushing from anywhere from 5-10%, which in the
case of actual ENOSPC is quite helpful. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are a few places where we add to trans->bytes_reserved but don't have the
corresponding trace point. With these added my tool no longer sees transaction
leaks.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|