Age | Commit message (Collapse) | Author |
|
fs/logfs/dev_bdev.c: In function '__bdev_writeseg':
include/linux/kernel.h:601:17: warning: comparison of distinct pointer types lacks a cast [enabled by default]
(void) (&_min1 == &_min2); \
fs/logfs/dev_bdev.c:84:14: note: in expansion of macro 'min'
max_pages = min(nr_pages, BIO_MAX_PAGES);
fs/logfs/dev_bdev.c: In function 'do_erase':
include/linux/kernel.h:601:17: warning: comparison of distinct pointer types lacks a cast [enabled by default]
(void) (&_min1 == &_min2); \
fs/logfs/dev_bdev.c:174:14: note: in expansion of macro 'min'
max_pages = min(nr_pages, BIO_MAX_PAGES);
Lets use min_t and mention the type.
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The comment here says that it is checking for invalid bits. But, the mask
is *actually* checking to ensure that _any_ valid bit is set, which is
quite different.
Without this check, an unexpected bit could get set on an inotify object.
Since these bits are also interpreted by the fsnotify/dnotify code, there
is the potential for an object to be mishandled inside the kernel. For
instance, can we be sure that setting the dnotify flag FS_DN_RENAME on an
inotify watch is harmless?
Add the actual check which was intended. Retain the existing inotify bits
are being added to the watch. Plus, this is existing behavior which would
be nice to preserve.
I did a quick sniff test that inotify functions and that my
'inotify-tools' package passes 'make check'.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There was a report that my patch:
inotify: actually check for invalid bits in sys_inotify_add_watch()
broke CRIU.
The reason is that CRIU looks up raw flags in /proc/$pid/fdinfo/* to
figure out how to rebuild inotify watches and then passes those flags
directly back in to the inotify API. One of those flags
(FS_EVENT_ON_CHILD) is set in mark->mask, but is not part of the inotify
API. It is used inside the kernel to _implement_ inotify but it is not
and has never been part of the API.
My patch above ensured that we only allow bits which are part of the API
(IN_ALL_EVENTS). This broke CRIU.
FS_EVENT_ON_CHILD is really internal to the kernel. It is set _anyway_ on
all inotify marks. So, CRIU was really just trying to set a bit that was
already set.
This patch hides that bit from fdinfo. CRIU will not see the bit, not try
to set it, and should work as before. We should not have been exposing
this bit in the first place, so this is a good patch independent of the
CRIU problem.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: Andrey Wagin <avagin@gmail.com>
Acked-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Eric Paris <eparis@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem update from James Morris:
"This is mostly maintenance updates across the subsystem, with a
notable update for TPM 2.0, and addition of Jarkko Sakkinen as a
maintainer of that"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (40 commits)
apparmor: clarify CRYPTO dependency
selinux: Use a kmem_cache for allocation struct file_security_struct
selinux: ioctl_has_perm should be static
selinux: use sprintf return value
selinux: use kstrdup() in security_get_bools()
selinux: use kmemdup in security_sid_to_context_core()
selinux: remove pointless cast in selinux_inode_setsecurity()
selinux: introduce security_context_str_to_sid
selinux: do not check open perm on ftruncate call
selinux: change CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE default
KEYS: Merge the type-specific data with the payload data
KEYS: Provide a script to extract a module signature
KEYS: Provide a script to extract the sys cert list from a vmlinux file
keys: Be more consistent in selection of union members used
certs: add .gitignore to stop git nagging about x509_certificate_list
KEYS: use kvfree() in add_key
Smack: limited capability for changing process label
TPM: remove unnecessary little endian conversion
vTPM: support little endian guests
char: Drop owner assignment from i2c_driver
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull userns hardlink capability check fix from Eric Biederman:
"This round just contains a single patch. There has been a lot of
other work this period but it is not quite ready yet, so I am pushing
it until 4.5.
The remaining change by Dirk Steinmetz wich fixes both Gentoo and
Ubuntu containers allows hardlinks if we have the appropriate
capabilities in the user namespace. Security wise it is really a
gimme as the user namespace root can already call setuid become that
user and create the hardlink"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
namei: permit linking with CAP_FOWNER in userns
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore updates from Tony Luck:
"Half dozen small cleanups plus change to allow pstore backend drivers
to be unloaded"
* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
pstore: fix code comment to match code
efi-pstore: fix kernel-doc argument name
pstore: Fix return type of pstore_is_mounted()
pstore: add pstore unregister
pstore: add a helper function pstore_register_kmsg
pstore: add vmalloc error check
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"Most part of the patches include enhancing the stability and
performance of in-memory extent caches feature.
In addition, it introduces several new features and configurable
points:
- F2FS_GOING_DOWN_METAFLUSH ioctl to test power failures
- F2FS_IOC_WRITE_CHECKPOINT ioctl to trigger checkpoint by users
- background_gc=sync mount option to do gc synchronously
- periodic checkpoints
- sysfs entry to control readahead blocks for free nids
And the following bug fixes have been merged.
- fix SSA corruption by collapse/insert_range
- correct a couple of gc behaviors
- fix the results of f2fs_map_blocks
- fix error case handling of volatile/atomic writes"
* tag 'for-f2fs-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (54 commits)
f2fs: fix to skip shrinking extent nodes
f2fs: fix error path of ->symlink
f2fs: fix to clear GCed flag for atomic written page
f2fs: don't need to submit bio on error case
f2fs: fix leakage of inmemory atomic pages
f2fs: refactor __find_rev_next_{zero}_bit
f2fs: support fiemap for inline_data
f2fs: flush dirty data for bmap
f2fs: relocate the tracepoint for background_gc
f2fs crypto: fix racing of accessing encrypted page among
f2fs: export ra_nid_pages to sysfs
f2fs: readahead for free nids building
f2fs: support lower priority asynchronous readahead in ra_meta_pages
f2fs: don't tag REQ_META for temporary non-meta pages
f2fs: add a tracepoint for f2fs_read_data_pages
f2fs: set GFP_NOFS for grab_cache_page
f2fs: fix SSA updates resulting in corruption
Revert "f2fs: do not skip dentry block writes"
f2fs: add F2FS_GOING_DOWN_METAFLUSH to test power-failure
f2fs: merge meta writes as many possible
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm update from David Teigland:
"This includes one simple fix to make posix locks interruptible by
signals in cases where a signal handler is used"
* tag 'dlm-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: make posix locks interruptible
|
|
Pull file locking updates from Jeff Layton:
"The largest series of changes is from Ben who offered up a set to add
a new helper function for setting locks based on the type set in
fl_flags. Dmitry also send in a fix for a potential race that he
found with KTSAN"
* tag 'locks-v4.4-1' of git://git.samba.org/jlayton/linux:
locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait
Move locks API users to locks_lock_inode_wait()
locks: introduce locks_lock_inode_wait()
locks: Use more file_inode and fix a comment
fs: fix data races on inode->i_flctx
locks: change tracepoint for generic_add_lease
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here's the "big" driver core updates for 4.4-rc1. Primarily a bunch
of debugfs updates, with a smattering of minor driver core fixes and
updates as well.
All have been in linux-next for a long time"
* tag 'driver-core-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
debugfs: Add debugfs_create_ulong()
of: to support binding numa node to specified device in devicetree
debugfs: Add read-only/write-only bool file ops
debugfs: Add read-only/write-only size_t file ops
debugfs: Add read-only/write-only x64 file ops
debugfs: Consolidate file mode checks in debugfs_create_*()
Revert "mm: Check if section present during memory block (un)registering"
driver-core: platform: Provide helpers for multi-driver modules
mm: Check if section present during memory block (un)registering
devres: fix a for loop bounds check
CMA: fix CONFIG_CMA_SIZE_MBYTES overflow in 64bit
base/platform: assert that dev_pm_domain callbacks are called unconditionally
sysfs: correctly handle short reads on PREALLOC attrs.
base: soc: siplify ida usage
kobject: move EXPORT_SYMBOL() macros next to corresponding definitions
kobject: explain what kobject's sd field is
debugfs: document that debugfs_remove*() accepts NULL and error values
debugfs: Pass bool pointer to debugfs_create_bool()
ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'
|
|
Pull block integrity updates from Jens Axboe:
""This is the joint work of Dan and Martin, cleaning up and improving
the support for block data integrity"
* 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
block: blk_flush_integrity() for bio-based drivers
block: move blk_integrity to request_queue
block: generic request_queue reference counting
nvme: suspend i/o during runtime blk_integrity_unregister
md: suspend i/o during runtime blk_integrity_unregister
md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
block: Inline blk_integrity in struct gendisk
block: Export integrity data interval size in sysfs
block: Reduce the size of struct blk_integrity
block: Consolidate static integrity profile properties
block: Move integrity kobject to struct gendisk
|
|
Pull core block updates from Jens Axboe:
"This is the core block pull request for 4.4. I've got a few more
topic branches this time around, some of them will layer on top of the
core+drivers changes and will come in a separate round. So not a huge
chunk of changes in this round.
This pull request contains:
- Enable blk-mq page allocation tracking with kmemleak, from Catalin.
- Unused prototype removal in blk-mq from Christoph.
- Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
xchg()'s, from Davidlohr.
- A plug flush fix from Jeff.
- Also from Jeff, a fix that means we don't have to update shared tag
sets at init time unless we do a state change. This cuts down boot
times on thousands of devices a lot with scsi/blk-mq.
- blk-mq waitqueue barrier fix from Kosuke.
- Various fixes from Ming:
- Fixes for segment merging and splitting, and checks, for
the old core and blk-mq.
- Potential blk-mq speedup by marking ctx pending at the end
of a plug insertion batch in blk-mq.
- direct-io no page dirty on kernel direct reads.
- A WRITE_SYNC fix for mpage from Roman"
* 'for-4.4/core' of git://git.kernel.dk/linux-block:
blk-mq: avoid excessive boot delays with large lun counts
blktrace: re-write setting q->blk_trace
blk-mq: mark ctx as pending at batch in flush plug path
blk-mq: fix for trace_block_plug()
block: check bio_mergeable() early before merging
blk-mq: check bio_mergeable() early before merging
block: avoid to merge splitted bio
block: setup bi_phys_segments after splitting
block: fix plug list flushing for nomerge queues
blk-mq: remove unused blk_mq_clone_flush_request prototype
blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
block: kmemleak: Track the page allocations for struct request
|
|
In tracefs' start_creating(), we pin the file system to safely access
its root. When we failed to create a file, we unpin the file system via
failed_creating() to release the mount count and eventually the reference
of the singleton vfsmount.
However, when we run into an error during lookup_one_len() when still
in start_creating(), we only release the parent's mutex but not so the
reference on the mount.
F.e., in securityfs_create_file(), after doing simple_pin_fs() when
lookup_one_len() fails there, we infact do simple_release_fs(). This
seems necessary here as well.
Same issue seen in debugfs due to 190afd81e4a5 ("debugfs: split the
beginning and the end of __create_file() off"), which seemed to got
carried over into tracefs, too. Noticed during code review.
Link: http://lkml.kernel.org/r/68efa86101b778cf7517ed7c6ad573bd69f60ec6.1446672850.git.daniel@iogearbox.net
Fixes: 4282d60689d4 ("tracefs: Add new tracefs file system")
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
"There is only one new feature in this pull for the 4.4 merge window,
most of it is small enhancements, cleanup and bug fixes:
- Add the s390 backend for the software dirty bit tracking. This
adds two new pgtable functions pte_clear_soft_dirty and
pmd_clear_soft_dirty which is why there is a hit to
arch/x86/include/asm/pgtable.h in this pull request.
- A series of cleanup patches for the AP bus, this includes the
removal of the support for two outdated crypto cards (PCICC and
PCICA).
- The irq handling / signaling on buffer full in the runtime
instrumentation code is dropped.
- Some micro optimizations: remove unnecessary memory barriers for a
couple of functions: [smb_]rmb, [smb_]wmb, atomics, bitops, and for
spin_unlock. Use the builtin bswap if available and make
test_and_set_bit_lock more cache friendly.
- Statistics and a tracepoint for the diagnose calls to the
hypervisor.
- The CPU measurement facility support to sample KVM guests is
improved.
- The vector instructions are now always enabled for user space
processes if the hardware has the vector facility. This simplifies
the FPU handling code. The fpu-internal.h header is split into fpu
internals, api and types just like x86.
- Cleanup and improvements for the common I/O layer.
- Rework udelay to solve a problem with kprobe. udelay has busy loop
semantics but still uses an idle processor state for the wait"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (66 commits)
s390: remove runtime instrumentation interrupts
s390/cio: de-duplicate subchannel validation
s390/css: unneeded initialization in for_each_subchannel
s390/Kconfig: use builtin bswap
s390/dasd: fix disconnected device with valid path mask
s390/dasd: fix invalid PAV assignment after suspend/resume
s390/dasd: fix double free in dasd_eckd_read_conf
s390/kernel: fix ptrace peek/poke for floating point registers
s390/cio: move ccw_device_stlck functions
s390/cio: move ccw_device_call_handler
s390/topology: reduce per_cpu() invocations
s390/nmi: reduce size of percpu variable
s390/nmi: fix terminology
s390/nmi: remove casts
s390/nmi: remove pointless error strings
s390: don't store registers on disabled wait anymore
s390: get rid of __set_psw_mask()
s390/fpu: split fpu-internal.h into fpu internals, api, and type headers
s390/dasd: fix list_del corruption after lcu changes
s390/spinlock: remove unneeded serializations at unlock
...
|
|
This patch changes function gfs2_dir_hash_inval so it uses the
i_lock spin_lock to protect the in-core hash table, i_hash_cache.
This will prevent double-frees due to a race between gfs2_evict_inode
and inode invalidation.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molnar:
"The main changes in this cycle were:
- Improvements to expedited grace periods (Paul E McKenney)
- Performance improvements to and locktorture tests for percpu-rwsem
(Oleg Nesterov, Paul E McKenney)
- Torture-test changes (Paul E McKenney, Davidlohr Bueso)
- Documentation updates (Paul E McKenney)
- Miscellaneous fixes (Paul E McKenney, Boqun Feng, Oleg Nesterov,
Patrick Marlier)"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs()
rcu: Better hotplug handling for synchronize_sched_expedited()
rcu: Enable stall warnings for synchronize_rcu_expedited()
rcu: Add tasks to expedited stall-warning messages
rcu: Add online/offline info to expedited stall warning message
rcu: Consolidate expedited CPU selection
rcu: Prepare for consolidating expedited CPU selection
cpu: Remove try_get_online_cpus()
rcu: Stop excluding CPU hotplug in synchronize_sched_expedited()
rcu: Stop silencing lockdep false positive for expedited grace periods
rcu: Switch synchronize_sched_expedited() to IPI
locktorture: Fix module unwind when bad torture_type specified
torture: Forgive non-plural arguments
rcutorture: Fix unused-function warning for torturing_tasks()
rcutorture: Fix module unwind when bad torture_type specified
rcu_sync: Cleanup the CONFIG_PROVE_RCU checks
locking/percpu-rwsem: Clean up the lockdep annotations in percpu_down_read()
locking/percpu-rwsem: Fix the comments outdated by rcu_sync
locking/percpu-rwsem: Make use of the rcu_sync infrastructure
locking/percpu-rwsem: Make percpu_free_rwsem() after kzalloc() safe
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull wchan kernel address hiding from Ingo Molnar:
"This fixes a wchan related information leak in /proc/PID/stat.
There's a bit of an ABI twist to it: instead of setting the wchan
field to 0 (which is our usual technique) we set it conditionally to a
0/1 flag to keep ABI compatibility with older procps versions that
only fetches /proc/PID/wchan (symbolic names) if the absolute wchan
address is nonzero"
* 'core-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/proc, core/debug: Don't expose absolute kernel addresses via wchan
|
|
When decoding GETATTR replies, the client checks the attribute bitmap
for which attributes the server has sent. It misses bits at the word
boundaries, though; fix that.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
The arguments passed around for getacl and setacl xdr encoding, struct
nfs_setaclargs and struct nfs_getaclargs, both contain an array of
pages, an offset into the first page, and the length of the page data.
The offset is unused as it is always zero; remove it.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
As new_valid_dev always returns 1, so !new_valid_dev check is not
needed, remove it.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Replace wait_event_killable with wait_event_interruptible
so that a program waiting for a posix lock can be
interrupted by a signal. With the killable version,
a program was not interruptible by a signal if it
had a signal handler set for it, overriding the default
action of terminating the process.
Signed-off-by: Eric Ren <zren@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
When we are using the no-holes feature, if we punch a hole into a file
range that already contains a hole which overlaps the range we are passing
to fallocate(), we end up removing the extent map that represents the
existing hole without adding a new one. This happens because with the
no-holes feature we do not have explicit extent items to represent holes
and therefore the call to __btrfs_drop_extents(), made from
btrfs_punch_hole(), returns an end offset to the variable drop_end that
is smaller than the end of the range passed to fallocate(), while it
drops all existing extent maps in that range.
Normally having a missing extent map is not a problem, for example for
a readpages() operation we just end up building the extent map by
looking at the fs/subvol tree for a matching extent item (or a lack of
one for implicit holes). However for an fsync that uses the fast path,
which needs to look at the list of modified extent maps, this means
the fsync will not record information about the complete hole we had
before the fallocate() call into the log tree, resulting in a file with
content/layout that does not match what we had neither before nor after
the hole punch operation.
The following test case for fstests reproduces the issue. It fails without
this change because we get a file with a different digest after the fsync
log replay and also with a different extent/hole layout.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/punch
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_xfs_io_command "fpunch"
_require_xfs_io_command "fiemap"
_require_dm_target flakey
_require_metadata_journaling $SCRATCH_DEV
# This test was motivated by an issue found in btrfs when the btrfs
# no-holes feature is enabled (introduced in kernel 3.14). So enable
# the feature if the fs being tested is btrfs.
if [ $FSTYP == "btrfs" ]; then
_require_btrfs_fs_feature "no_holes"
_require_btrfs_mkfs_feature "no-holes"
MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
fi
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create out test file with some data and then fsync it.
# We do the fsync only to make sure the last fsync we do in this test
# triggers the fast code path of btrfs' fsync implementation, a
# condition necessary to trigger the bug btrfs had.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 128K" \
-c "fsync" \
$SCRATCH_MNT/foobar | _filter_xfs_io
# Now punch a hole against the range [96K, 128K[.
$XFS_IO_PROG -c "fpunch 96K 32K" $SCRATCH_MNT/foobar
# Punch another hole against a range that overlaps the previous range
# and ends beyond eof.
$XFS_IO_PROG -c "fpunch 64K 128K" $SCRATCH_MNT/foobar
# Punch another hole against a range that overlaps the first range
# ([96K, 128K[) and ends at eof.
$XFS_IO_PROG -c "fpunch 32K 96K" $SCRATCH_MNT/foobar
# Fsync our file. We want to verify that, after a power failure and
# mounting the filesystem again, the file content reflects all the hole
# punch operations.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
echo "File digest before power failure:"
md5sum $SCRATCH_MNT/foobar | _filter_scratch
echo "Fiemap before power failure:"
$XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
# Silently drop all writes and umount to simulate a crash/power failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again, mount to trigger log replay and validate file
# contents.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
echo "File digest after log replay:"
# Must match the same digest we got before the power failure.
md5sum $SCRATCH_MNT/foobar | _filter_scratch
echo "Fiemap after log replay:"
# Must match the same extent listing we got before the power failure.
$XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
_unmount_flakey
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
and nodesize set to 64k), the following call trace is observed,
WARNING: at /root/repos/linux/fs/btrfs/locking.c:253
Modules linked in:
CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d #54
task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000
NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000
REGS: c0000000f600a820 TRAP: 0700 Not tainted (4.3.0-rc5-13676-ga5e681d)
MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI> CR: 24444884 XER: 00000000
CFAR: c0000000004a3b30 SOFTE: 1
GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0
GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005
GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030
GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800
GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049
GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000
GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000
GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0
NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0
LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
Call Trace:
[c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable)
[c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
[c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00
[c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120
[c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620
[c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0
[c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420
[c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030
[c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250
[c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590
[c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780
[c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250
[c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00
[c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0
[c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570
[c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0
[c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30
[c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0
[c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30
[c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0
[c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0
[c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0
[c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100
[c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80
[c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68
Instruction dump:
fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040
812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c
The above call trace is seen even on x86_64; albeit very rarely and that too
with nodesize set to 64k and with nospace_cache mount option being used.
The reason for the above call trace is,
btrfs_remove_chunk
check_system_chunk
Allocate chunk if required
For each physical stripe on underlying device,
btrfs_free_dev_extent
...
Take lock on Device tree's root node
btrfs_cow_block("dev tree's root node");
btrfs_reserve_extent
find_free_extent
index = BTRFS_RAID_DUP;
have_caching_bg = false;
When in LOOP_CACHING_NOWAIT state, Assume we find a block group
which is being cached; Hence have_caching_bg is set to true
When repeating the search for the next RAID index, we set
have_caching_bg to false.
Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly
skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we
allocate a chunk and try to add entries corresponding to the chunk's physical
stripe into the device tree. When doing so the task deadlocks itself waiting
for the blocking lock on the root node of the device tree.
This commit fixes the issue by introducing a new local variable to help
indicate as to whether a block group of any RAID type is being cached.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Even with quota disabled, generic/127 will trigger a kernel warning by
underflow data space info.
The bug is caused by buffered write, which in case of short copy, the
start parameter for btrfs_delalloc_release_space() is wrong, and
round_up/down() in btrfs_delalloc_release() extents the range to page
aligned, decreasing one more page than expected.
This patch will fix it by passing correct start.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
NFS: NFSoRDMA Client Side Changes
In addition to a variety of bugfixes, these patches are mostly geared at
enabling both swap and backchannel support to the NFS over RDMA client.
Signed-off-by: Anna Schumake <Anna.Schumaker@Netapp.com>
|
|
Fix code comment about kmsg_dump register so it matches the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
|
|
Forechannel transports get their own "bc_up" method to create an
endpoint for the backchannel service.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[Anna Schumaker: Add forward declaration of struct net to xprt.h]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
For loosely coupled pNFS/flexfiles systems, there is often no advantage
at all in going through the MDS for I/O, since the MDS is subject to
the same limitations as all other clients when talking to DSes. If a
DS is unresponsive, I/O through the MDS will fail.
For such systems, the only scalable solution is to have the pNFS clients
retry doing pNFS, and so the protocol now provides a flag that allows
the pNFS server to signal this.
If LAYOUTGET returns FF_FLAGS_NO_IO_THRU_MDS, then we should assume that
the MDS wants the client to retry using these devices, even if they were
previously marked as being unavailable. To do so, we add a helper,
ff_layout_mark_devices_valid() that will be called from layoutget.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If the pNFS/flexfiles file is mirrored, and a read to one mirror fails,
then we should bump the mirror index, so that we retry to a different
mirror. Once we've iterated through all mirrors and all failed, we can
return the layout and issue a new LAYOUTGET.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
It turns out that at least some versions of glibc end up reading
/proc/meminfo at every single startup, because glibc wants to know the
amount of memory the machine has. And while that's arguably insane,
it's just how things are.
And it turns out that it's not all that expensive most of the time, but
the vmalloc information statistics (amount of virtual memory used in the
vmalloc space, and the biggest remaining chunk) can be rather expensive
to compute.
The 'get_vmalloc_info()' function actually showed up on my profiles as
4% of the CPU usage of "make test" in the git source repository, because
the git tests are lots of very short-lived shell-scripts etc.
It turns out that apparently this same silly vmalloc info gathering
shows up on the facebook servers too, according to Dave Jones. So it's
not just "make test" for git.
We had two patches to just cache the information (one by me, one by
Ingo) to mitigate this issue, but the whole vmalloc information of of
rather dubious value to begin with, and people who *actually* want to
know what the situation is wrt the vmalloc area should just look at the
much more complete /proc/vmallocinfo instead.
In fact, according to my testing - and perhaps more importantly,
according to that big search engine in the sky: Google - there is
nothing out there that actually cares about those two expensive fields:
VmallocUsed and VmallocChunk.
So let's try to just remove them entirely. Actually, this just removes
the computation and reports the numbers as zero for now, just to try to
be minimally intrusive.
If this breaks anything, we'll obviously have to re-introduce the code
to compute this all and add the caching patches on top. But if given
the option, I'd really prefer to just remove this bad idea entirely
rather than add even more code to work around our historical mistake
that likely nobody really cares about.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Merge file descriptor allocation speedup.
Eric Dumazet has a test-case for a fairly common network deamon load
pattern: openign and closing a lot of sockets that each have very little
work done on them. It turns out that in that case, the cost of just
finding the correct file descriptor number can be a dominating factor.
We've long had a trivial optimization for allocating file descriptors
sequentially, but that optimization ends up being not very effective
when other file descriptors are being closed concurrently, and the fd
patterns are not some simple FIFO pattern. In such cases we ended up
spending a lot of time just scanning the bitmap of open file descriptors
in order to find the next file descriptor number to open.
This trivial patch-series mitigates that by simply introducing a
second-level bitmap of which words in the first bitmap are already fully
allocated. That cuts down the cost of scanning by an order of magnitude
in some pathological (but realistic) cases.
The second patch is an even more trivial patch to avoid unnecessarily
dirtying the cacheline for the close-on-exec bit array that normally
ends up being all empty.
* fs-file-descriptor-optimization:
vfs: conditionally clear close-on-exec flag
vfs: Fix pathological performance case for __alloc_fd()
|
|
We clear the close-on-exec flag when opening and closing files, and the
bit was almost always already clear before. Avoid dirtying the
cacheline if the clearning isn't necessary. That avoids unnecessary
cacheline dirtying and bouncing in multi-socket environments.
Eric Dumazet has a file descriptor benchmark that goes 4% faster from
this on his two-socket machine. It's probably partly superlinear
improvement due to getting slightly less spinlock contention on the
file_lock spinlock due to less work in the critical section.
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Al Viro points out that:
> > * [Linux-specific aside] our __alloc_fd() can degrade quite badly
> > with some use patterns. The cacheline pingpong in the bitmap is probably
> > inevitable, unless we accept considerably heavier memory footprint,
> > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> > to trigger - close(3);open(...); will have the next open() after that
> > scanning the entire in-use bitmap.
And Eric Dumazet has a somewhat realistic multithreaded microbenchmark
that opens and closes a lot of sockets with minimal work per socket.
This patch largely fixes it. We keep a 2nd-level bitmap of the open
file bitmaps, showing which words are already full. So then we can
traverse that second-level bitmap to efficiently skip already allocated
file descriptors.
On his benchmark, this improves performance by up to an order of
magnitude, by avoiding the excessive open file bitmap scanning.
Tested-and-acked-by: Eric Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs bug fixes from Miklos Szeredi:
"This contains fixes for bugs that appeared in earlier kernels (all are
marked for -stable)"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: free lower_mnt array in ovl_put_super
ovl: free stack of paths in ovl_fill_super
ovl: fix open in stacked overlay
ovl: fix dentry reference leak
ovl: use O_LARGEFILE in ovl_copy_up()
|
|
As new_valid_dev always returns 1, so !new_valid_dev check is not
needed, remove it.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Commit e66cf161 replaced the gl_spin spinlock in struct gfs2_glock with a
gl_lockref lockref and defined gl_spin as gl_lockref.lock (the spinlock in
gl_lockref). Remove that define to make the references to gl_lockref.lock more
obvious.
Signed-off-by: Andreas Gruenbacher <andreas.gruenbacher@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
bdi_split_work_to_wbs()
bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
to walk @bdi->wb_list. To set up the initial iteration
condition, it uses list_entry_rcu() to calculate the entry
pointer corresponding to the list head; however, this isn't an
actual RCU dereference and using list_entry_rcu() for it ended
up breaking a proposed list_entry_rcu() change because it was
feeding an non-lvalue pointer into the macro.
Don't use the RCU variant for simple pointer offsetting. Use
list_entry() instead.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Patrick Marlier <patrick.marlier@gmail.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pranith kumar <bobby.prani@gmail.com>
Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Attempting to hardlink to an unsafe file (e.g. a setuid binary) from
within an unprivileged user namespace fails, even if CAP_FOWNER is held
within the namespace. This may cause various failures, such as a gentoo
installation within a lxc container failing to build and install specific
packages.
This change permits hardlinking of files owned by mapped uids, if
CAP_FOWNER is held for that namespace. Furthermore, it improves consistency
by using the existing inode_owner_or_capable(), which is aware of
namespaced capabilities as of 23adbe12ef7d3 ("fs,userns: Change
inode_capable to capable_wrt_inode_uidgid").
Signed-off-by: Dirk Steinmetz <public@rsjtdrjgfuzkfg.com>
This is hitting us in Ubuntu during some dpkg upgrades in containers.
When upgrading a file dpkg creates a hard link to the old file to back
it up before overwriting it. When packages upgrade suid files owned by a
non-root user the link isn't permitted, and the package upgrade fails.
This patch fixes our problem.
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
|
When rebasing my patchset, I forgot to pick up a cleanup patch to remove
old hotfix in 4.2 release.
Witouth the cleanup, it will screw up new qgroup reserve framework and
always cause minus reserved number.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Between btrfs_allocerved_file_extent() and
btrfs_add_delayed_qgroup_reserve(), there is a window that delayed_refs
are run and delayed ref head maybe freed before
btrfs_add_delayed_qgroup_reserve().
This will cause btrfs_dad_delayed_qgroup_reserve() to return -ENOENT,
and cause transaction to be aborted.
This patch will record qgroup reserve space info into delayed_ref_head
at btrfs_add_delayed_ref(), to eliminate the race window.
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
cleaner_kthread() kthread calls try_to_freeze() at the beginning of every
cleanup attempt. This operation can't ever succeed though, as the kthread
hasn't marked itself as freezable.
Before (hopefully eventually) kthread freezing gets converted to fileystem
freezing, we'd rather mark cleaner_kthread() freezable (as my
understanding is that it can generate filesystem I/O during suspend).
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Ancient qgroup code call memcpy() on a extent buffer and use it for leaf
iteration.
As extent buffer contains lock, pointers to pages, it's never sane to do
such copy.
The following bug may be caused by this insane operation:
[92098.841309] general protection fault: 0000 [#1] SMP
[92098.841338] Modules linked in: ...
[92098.841814] CPU: 1 PID: 24655 Comm: kworker/u4:12 Not tainted
4.3.0-rc1 #1
[92098.841868] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper
[btrfs]
[92098.842261] Call Trace:
[92098.842277] [<ffffffffc035a5d8>] ? read_extent_buffer+0xb8/0x110
[btrfs]
[92098.842304] [<ffffffffc0396d00>] ? btrfs_find_all_roots+0x60/0x70
[btrfs]
[92098.842329] [<ffffffffc039af3d>]
btrfs_qgroup_rescan_worker+0x28d/0x5a0 [btrfs]
Where btrfs_qgroup_rescan_worker+0x28d is btrfs_disk_key_to_cpu(),
called in reading key from the copied extent_buffer.
This patch will use btrfs_clone_extent_buffer() to a better copy of
extent buffer to deal such case.
Reported-by: Stephane Lesimple <stephane_btrfs@lesimple.fr>
Suggested-by: Filipe Manana <fdmanana@kernel.org>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Enable the extended 'limit' syntax (a range), the new 'stripes' and
extended 'usage' syntax (a range) filters in the filters mask. The patch
comes separate and not within the series that introduced the new filters
because the patch adding the mask was merged in a late rc. The
integration branch was based on an older rc and could not merge the
patch due to the missing changes.
Prerequisities:
* btrfs: check unsupported filters in balance arguments
* btrfs: extend balance filter limit to take minimum and maximum
* btrfs: add balance filter for stripes
* btrfs: extend balance filter usage to take minimum and maximum
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Similar to the 'limit' filter, we can enhance the 'usage' filter to
accept a range. The change is backward compatible, the range is applied
only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag.
We don't have a usecase yet, the current syntax has been sufficient. The
enhancement should provide parity with other range-like filters.
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Balance block groups which have the given number of stripes, defined by
a range min..max. This is useful to selectively rebalance only chunks
that do not span enough devices, applies to RAID0/10/5/6.
Signed-off-by: Gabríel Arthúr Pétursson <gabriel@system.is>
[ renamed bargs members, added to the UAPI, wrote the changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
The 'limit' filter is underdesigned, it should have been a range for
[min,max], with some relaxed semantics when one of the bounds is
missing. Besides that, using a full u64 for a single value is a waste of
bytes.
Let's fix both by extending the use of the u64 bytes for the [min,max]
range. This can be done in a backward compatible way, the range will be
interpreted only if the appropriate flag is set
(BTRFS_BALANCE_ARGS_LIMIT_RANGE).
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
The code for btrfs inode-resolve has never worked properly for
files with enough hard links to trigger extrefs. It was trying to
get the leaf out of a path after freeing the path:
btrfs_release_path(path);
leaf = path->nodes[0];
item_size = btrfs_item_size_nr(leaf, slot);
The fix here is to use the extent buffer we cloned just a little higher
up to avoid deadlocks caused by using the leaf in the path.
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v3.7+
cc: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
We don't verify that all the balance filter arguments supplemented by
the flags are actually known to the kernel. Thus we let it silently pass
and do nothing.
At the moment this means only the 'limit' filter, but we're going to add
a few more soon so it's better to have that fixed. Also in older stable
kernels so that it works with newer userspace tools.
Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
On NOMMU archs, the FDPIC ELF loader sets up the usable brk range to
overlap with all but the last PAGE_SIZE bytes of the stack. This leads
to catastrophic memory reuse/corruption if brk is used. Fix by setting
the brk area to zero size to disable its use.
Signed-off-by: Rich Felker <dalias@libc.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Ungerer <gerg@uclinux.org>
|
|
In the kernel 4.2 merge window we had a big changes to the implementation
of delayed references and qgroups which made the no_quota field of delayed
references not used anymore. More specifically the no_quota field is not
used anymore as of:
commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
Leaving the no_quota field actually prevents delayed references from
getting merged, which in turn cause the following BUG_ON(), at
fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
static int run_delayed_tree_ref(...)
{
(...)
BUG_ON(node->ref_mod != 1);
(...)
}
This happens on a scenario like the following:
1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref2 is incompatible
due to Ref2->no_quota != Ref3->no_quota.
4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref3 is incompatible
due to Ref3->no_quota != Ref4->no_quota.
5) We run delayed references, trigger merging of delayed references,
through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
all other conditions are satisfied too. So Ref1 gets a ref_mod
value of 2.
7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
all other conditions are satisfied too. So Ref2 gets a ref_mod
value of 2.
8) Ref1 and Ref2 aren't merged, because they have different values
for their no_quota field.
9) Delayed reference Ref1 is picked for running (select_delayed_ref()
always prefers references with an action == BTRFS_ADD_DELAYED_REF).
So run_delayed_tree_ref() is called for Ref1 which triggers the
BUG_ON because Ref1->red_mod != 1 (equals 2).
So fix this by removing the no_quota field, as it's not used anymore as
of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented
qgroup mechanism.").
The use of no_quota was also buggy in at least two places:
1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
no_quota to 0 instead of 1 when the following condition was true:
is_fstree(ref_root) || !fs_info->quota_enabled
2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
reset a node's no_quota when the condition "!is_fstree(root_objectid)
|| !root->fs_info->quota_enabled" was true but we did it only in
an unused local stack variable, that is, we never reset the no_quota
value in the node itself.
This fixes the remainder of problems several people have been having when
running delayed references, mostly while a balance is running in parallel,
on a 4.2+ kernel.
Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).
Also, this fixes deadlock issue when using the clone ioctl with qgroups
enabled, as reported by Elias Probst in the mailing list. The deadlock
happens because after calling btrfs_insert_empty_item we have our path
holding a write lock on a leaf of the fs/subvol tree and then before
releasing the path we called check_ref() which did backref walking, when
qgroups are enabled, and tried to read lock the same leaf. The trace for
this case is the following:
INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
(...)
Call Trace:
[<ffffffff86999201>] schedule+0x74/0x83
[<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
[<ffffffff86137ed7>] ? wait_woken+0x74/0x74
[<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
[<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
[<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
[<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
[<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
[<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
[<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
[<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
[<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
[<ffffffff863e852e>] check_ref+0x64/0xc4
[<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
[<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
[<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
[<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
(...)
The problem goes away by eleminating check_ref(), which no longer is
needed as its purpose was to get a value for the no_quota field of
a delayed reference (this patch removes the no_quota field as mentioned
earlier).
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Elias Probst <mail@eliasprobst.eu>
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
|