Age | Commit message (Collapse) | Author |
|
Pull xfs fixes from Darrick Wong:
- fix a race between overlapping copy on write aio
- fix cow fork swapping when we defragment reflinked files
* tag 'xfs-4.14-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: handle racy AIO in xfs_reflink_end_cow
xfs: always swap the cow forks when swapping extents
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two more fixes for bugs introduced in 4.13.
The sector_t problem with 32bit architecture and !LBDAF config seems
serious but the number of affected deployments is hopefully low.
The clashing status bits could lead to a confusing in-memory state of
the whole-filesystem operations if used with the quota override sysfs
knob"
* 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix overlap of fs_info::flags values
btrfs: avoid overflow when sector_t is 32 bit
|
|
Pull ceph fixes from Ilya Dryomov:
"Two fixups for CephFS snapshot-handling patches in -rc1"
* tag 'ceph-for-4.14-rc4' of git://github.com/ceph/ceph-client:
ceph: fix __choose_mds() for LSSNAP request
ceph: properly queue cap snap for newly created snap realm
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix a regression in 4.14 and one in 4.13. The latter is a case when
Docker is doing something it really shouldn't and gets away with it.
We now print a warning instead of erroring out.
There are also fixes to several error paths"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix regression caused by exclusive upper/work dir protection
ovl: fix missing unlock_rename() in ovl_do_copy_up()
ovl: fix dentry leak in ovl_indexdir_cleanup()
ovl: fix dput() of ERR_PTR in ovl_cleanup_index()
ovl: fix error value printed in ovl_lookup_index()
ovl: fix may_write_real() for overlayfs directories
|
|
Commit 34b1744c91cc ("nfsd4: define ->op_release for compound ops")
defined a couple ->op_release functions and run them if necessary.
But there's a problem with that is that it reused
nfsd4_secinfo_release() as the op_release of OP_SECINFO_NO_NAME, and
caused a leak on struct nfsd4_secinfo_no_name in
nfsd4_encode_secinfo_no_name(), because there's no .si_exp field in
struct nfsd4_secinfo_no_name.
I found this because I was unable to umount an ext4 partition after
exporting it via NFS & run fsstress on the nfs mount. A simplified
reproducer would be:
# mount a local-fs device at /mnt/test, and export it via NFS with
# fsid=0 export option (this is required)
mount /dev/sda5 /mnt/test
echo "/mnt/test *(rw,no_root_squash,fsid=0)" >> /etc/exports
service nfs restart
# locally mount the nfs export with all default, note that I have
# nfsv4.1 configured as the default nfs version, because of the
# fsid export option, v4 mount would fail and fall back to v3
mount localhost:/mnt/test /mnt/nfs
# try to umount the underlying device, but got EBUSY
umount /mnt/nfs
service nfs stop
umount /mnt/test <=== EBUSY here
Fixed it by defining a separate nfsd4_secinfo_no_name_release()
function as the op_release method of OP_SECINFO_NO_NAME that
releases the correct nfsd4_secinfo_no_name structure.
Fixes: 34b1744c91cc ("nfsd4: define ->op_release for compound ops")
Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Using the ARRAY_SIZE macro improves the readability of the code.
Found with Coccinelle with the following semantic patch:
@r depends on (org || report)@
type T;
T[] E;
position p;
@@
(
(sizeof(E)@p /sizeof(*E))
|
(sizeof(E)@p /sizeof(E[...]))
|
(sizeof(E)@p /sizeof(T))
)
Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Enforcing exclusive ownership on upper/work dirs caused a docker
regression: https://github.com/moby/moby/issues/34672.
Euan spotted the regression and pointed to the offending commit.
Vivek has brought the regression to my attention and provided this
reproducer:
Terminal 1:
mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
merged/
Terminal 2:
unshare -m
Terminal 1:
umount merged
mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
merged/
mount: /root/overlay-testing/merged: none already mounted or mount point
busy
To fix the regression, I replaced the error with an alarming warning.
With index feature enabled, mount does fail, but logs a suggestion to
override exclusive dir protection by disabling index.
Note that index=off mount does take the inuse locks, so a concurrent
index=off will issue the warning and a concurrent index=on mount will fail.
Documentation was updated to reflect this change.
Fixes: 2cac0c00a6cd ("ovl: get exclusive ownership on upper/work dirs")
Cc: <stable@vger.kernel.org> # v4.13
Reported-by: Euan Kemp <euank@euank.com>
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Use the ovl_lock_rename_workdir() helper which requires
unlock_rename() only on lock success.
Fixes: ("fd210b7d67ee ovl: move copy up lock out")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
index dentry was not released when breaking out of the loop
due to index verification error.
Fixes: 415543d5c64f ("ovl: cleanup bad and stale index entries on mount")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Fixes: caf70cb2ba5d ("ovl: cleanup orphan index entries")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Fixes: 359f392ca53e ("ovl: lookup index entry for copy up origin")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Overlayfs directory file_inode() is the overlay inode whether the real
inode is upper or lower.
This fixes a regression in xfstest generic/158.
Fixes: 7c6893e3c9ab ("ovl: don't allow writing ioctl on lower layer")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Drop the arguments from the macro and adjust all callers with the
following script:
perl -pi -e 's/DEFINE_TIMER\((.*), 0, 0\);/DEFINE_TIMER($1);/g;' \
$(git grep DEFINE_TIMER | cut -d: -f1 | sort -u | grep -v timer.h)
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # for m68k parts
Acked-by: Guenter Roeck <linux@roeck-us.net> # for watchdog parts
Acked-by: David S. Miller <davem@davemloft.net> # for networking parts
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Kalle Valo <kvalo@codeaurora.org> # for wireless parts
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: linux-mips@linux-mips.org
Cc: Petr Mladek <pmladek@suse.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Kalle Valo <kvalo@qca.qualcomm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux1394-devel@lists.sourceforge.net
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: linux-s390@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ursula Braun <ubraun@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Harish Patil <harish.patil@cavium.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Michael Reed <mdr@sgi.com>
Cc: Manish Chopra <manish.chopra@cavium.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-pm@vger.kernel.org
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mark Gross <mark.gross@intel.com>
Cc: linux-watchdog@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: netdev@vger.kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Link: https://lkml.kernel.org/r/1507159627-127660-11-git-send-email-keescook@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Instead of granting client's full requests until we hit our DRC size
limit and then failing CREATE_SESSIONs (and hence mounts) completely,
start granting clients smaller slot tables as we approach the limit.
The factor chosen here is pretty much arbitrary.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
An NFSv4.1+ client negotiates the size of its duplicate reply cache size
in the initial CREATE_SESSION request. The server preallocates the
memory for the duplicate reply cache to ensure that we'll never fail to
record the response to a nonidempotent operation.
To prevent a few CREATE_SESSIONs from consuming all of memory we set an
upper limit based on nr_free_buffer_pages(). 1/2^10 has been too
limiting in practice; 1/2^7 is still less than one percent.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
These checks should have already be done centrally in
nfsd4_proc_compound, the checks in each individual operation are
unnecessary.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Since we can now use a lock stateid or a delegation stateid, that
differs from the context stateid, we need to change the test in
nfs4_layoutget_handle_exception() to take this into account.
This fixes an infinite layoutget loop in the NFS client whereby
it keeps retrying the initial layoutget using the same broken
stateid.
Fixes: 70d2f7b1ea19b ("pNFS: Use the standard I/O stateid when...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Handle start-all writeback like we do periodic or kupdate
style writeback - by marking the bdi_writeback as needing a full
flush, and simply waking the thread. This eliminates the need to
allocate and queue a specific work item just for this purpose.
After this change, we truly only ever have one of them running at
any point in time. We mark the need to start all flushes, and the
writeback thread will clear it once it has processed the request.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Merge misc fixes from Andrew Morton:
"A lot of stuff, sorry about that. A week on a beach, then a bunch of
time catching up then more time letting it bake in -next. Shan't do
that again!"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (51 commits)
include/linux/fs.h: fix comment about struct address_space
checkpatch: fix ignoring cover-letter logic
m32r: fix build failure
lib/ratelimit.c: use deferred printk() version
kernel/params.c: improve STANDARD_PARAM_DEF readability
kernel/params.c: fix an overflow in param_attr_show
kernel/params.c: fix the maximum length in param_get_string
mm/memory_hotplug: define find_{smallest|biggest}_section_pfn as unsigned long
mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function
kernel/kcmp.c: drop branch leftover typo
memremap: add scheduling point to devm_memremap_pages
mm, page_alloc: add scheduling point to memmap_init_zone
mm, memory_hotplug: add scheduling point to __add_pages
lib/idr.c: fix comment for idr_replace()
mm: memcontrol: use vmalloc fallback for large kmem memcg arrays
kernel/sysctl.c: remove duplicate UINT_MAX check on do_proc_douintvec_conv()
include/linux/bitfield.h: remove 32bit from FIELD_GET comment block
lib/lz4: make arrays static const, reduces object code size
exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array
exec: binfmt_misc: fix race between load_misc_binary() and kill_node()
...
|
|
Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
we should change the value.
First, BTRFS_FS_EXCL_OP was set to 14.
commit 171938e52807 ("btrfs: track exclusive filesystem operation in flags")
Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.
commit f29efe292198 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
As a result, the value 14 overlapped, by accident.
This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16,
the flags are internal.
Fixes: f29efe292198 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minimize the change, update only BTRFS_FS_EXCL_OP ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
introduces a regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.
In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).
I added a cast to u64 to avoid overflow.
Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
security_inode_getsecurity() provides the text string value
of a security attribute. It does not provide a "secctx".
The code in xattr_getsecurity() that calls security_inode_getsecurity()
and then calls security_release_secctx() happened to work because
SElinux and Smack treat the attribute and the secctx the same way.
It fails for cap_inode_getsecurity(), because that module has no
secctx that ever needs releasing. It turns out that Smack is the
one that's doing things wrong by not allocating memory when instructed
to do so by the "alloc" parameter.
The fix is simple enough. Change the security_release_secctx() to
kfree() because it isn't a secctx being returned by
security_inode_getsecurity(). Change Smack to allocate the string when
told to do so.
Note: this also fixes memory leaks for LSMs which implement
inode_getsecurity but not release_secctx, such as capabilities.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: stable@vger.kernel.org
Signed-off-by: James Morris <james.l.morris@oracle.com>
|
|
If we got two AIO writes into a COW area the second one might not have any
COW extents left to convert. Handle that case gracefully instead of
triggering an assert or accessing beyond the bounds of the extent list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Since the CoW fork exists as a secondary data structure to the data
fork, we must always swap cow forks during swapext. We also need to
swap the extent counts and reset the cowblocks tags.
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
After the previous change "fmt" can't go away, we can kill
iname/iname_addr and use fmt->interpreter.
Link: http://lkml.kernel.org/r/20170922143653.GA17232@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
load_misc_binary() makes a local copy of fmt->interpreter under
entries_lock to avoid the race with kill_node() but this is not enough;
the whole Node can be freed after we drop entries_lock, not only the
->interpreter string.
Add dget/dput(fmt->dentry) to ensure bm_evict_inode() can't destroy/free
this Node.
Link: http://lkml.kernel.org/r/20170922143650.GA17227@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Cc: <tdhooge@llnl.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If MISC_FMT_OPEN_FILE flag is set e->interp_file must be valid or we
have a bug which should not be silently ignored.
Link: http://lkml.kernel.org/r/20170922143647.GA17222@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
bm_evict_inode()
To ensure that load_misc_binary() can't use the partially destroyed
Node, see also the next patch.
The current logic looks wrong in any case, once we close interp_file it
doesn't make any sense to delay kfree(inode->i_private), this Node is no
longer valid. Even if the MISC_FMT_OPEN_FILE/interp_file checks were
not racy (they are), load_misc_binary() should not try to reopen
->interpreter if MISC_FMT_OPEN_FILE is set but ->interp_file is NULL.
And I can't understand why do we use filp_close(), not fput().
Link: http://lkml.kernel.org/r/20170922143644.GA17216@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
kill_node() nullifies/checks Node->dentry to avoid double free. This
complicates the next changes and this is very confusing:
- we do not need to check dentry != NULL under entries_lock,
kill_node() is always called under inode_lock(d_inode(root)) and we
rely on this inode_lock() anyway, without this lock the
MISC_FMT_OPEN_FILE cleanup could race with itself.
- if kill_inode() was already called and ->dentry == NULL we should not
even try to close e->interp_file.
We can change bm_entry_write() to simply check !list_empty(list) before
kill_node. Again, we rely on inode_lock(), in particular it saves us
from the race with bm_status_write(), another caller of kill_node().
Link: http://lkml.kernel.org/r/20170922143641.GA17210@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "exec: binfmt_misc: fix use-after-free, kill
iname[BINPRM_BUF_SIZE]".
It looks like this code was always wrong, then commit 948b701a607f
("binfmt_misc: add persistent opened binary handler for containers")
added more problems.
This patch (of 6):
load_script() can simply use i_name instead, it points into bprm->buf[]
and nobody can change this memory until we call prepare_binprm().
The only complication is that we need to also change the signature of
bprm_change_interp() but this change looks good too.
While at it, do whitespace/style cleanups.
NOTE: the real motivation for this change is that people want to
increase BINPRM_BUF_SIZE, we need to change load_misc_binary() too but
this looks more complicated because afaics it is very buggy.
Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Travis Gummels <tgummels@redhat.com>
Cc: Ben Woodard <woodard@redhat.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When reading the event from the uffd, we put it on a temporary
fork_event list to detect if we can still access it after releasing and
retaking the event_wqh.lock.
If fork aborts and removes the event from the fork_event all is fine as
long as we're still in the userfault read context and fork_event head is
still alive.
We've to put the event allocated in the fork kernel stack, back from
fork_event list-head to the event_wqh head, before returning from
userfaultfd_ctx_read, because the fork_event head lifetime is limited to
the userfaultfd_ctx_read stack lifetime.
Forgetting to move the event back to its event_wqh place then results in
__remove_wait_queue(&ctx->event_wqh, &ewq->wq); in
userfaultfd_event_wait_completion to remove it from a head that has been
already freed from the reader stack.
This could only happen if resolve_userfault_fork failed (for example if
there are no file descriptors available to allocate the fork uffd). If
it succeeded it was put back correctly.
Furthermore, after find_userfault_evt receives a fork event, the forked
userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
be released by the fork thread as soon as the event_wqh.lock is
released. Taking a reference on the fork_nctx before dropping the lock
prevents an use after free in resolve_userfault_fork().
If the fork side aborted and it already released everything, we still
try to succeed resolve_userfault_fork(), if possible.
Fixes: 893e26e61d04eac9 ("userfaultfd: non-cooperative: Add fork() event")
Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
As Ju Hyung Park reported:
"When 'fstrim' is called for manual trim, a BUG() can be triggered
randomly with this patch.
I'm seeing this issue on both x86 Desktop and arm64 Android phone.
On x86 Desktop, this was caused during Ubuntu boot-up. I have a
cronjob installed which calls 'fstrim -v /' during boot. On arm64
Android, this was caused during GC looping with 1ms gc_min_sleep_time
& gc_max_sleep_time."
Root cause of this issue is that f2fs_wait_discard_bios can only be
used by f2fs_put_super, because during put_super there must be no
other referrers, so it can ignore discard entry's reference count
when removing the entry, otherwise in other caller we will hit bug_on
in __remove_discard_cmd as there may be other issuer added reference
count in discard entry.
Thread A Thread B
- issue_discard_thread
- f2fs_ioc_fitrim
- f2fs_trim_fs
- f2fs_wait_discard_bios
- __issue_discard_cmd
- __submit_discard_cmd
- __wait_discard_cmd
- dc->ref++
- __wait_one_discard_bio
- __wait_discard_cmd
- __remove_discard_cmd
- f2fs_bug_on(sbi, dc->ref)
Fixes: 969d1b180d987c2be02de890d0fff0f66a0e80de
Reported-by: Ju Hyung Park <qkrwngud825@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When someone calls wakeup_flusher_threads() or
wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
pages in the system (or on that bdi). If we are tight on memory, we
can get tons of these queued from kswapd/vmscan. This causes (at
least) two problems:
1) We consume a ton of memory just allocating writeback work items.
We've seen as much as 600 million of these writeback work items
pending. That's a lot of memory to pointlessly hold hostage,
while the box is under memory pressure.
2) We spend so much time processing these work items, that we
introduce a softlockup in writeback processing. This is because
each of the writeback work items don't end up doing any work (it's
hard when you have millions of identical ones coming in to the
flush machinery), so we just sit in a tight loop pulling work
items and deleting/freeing them.
Fix this by adding a 'start_all' bit to the writeback structure, and
set that when someone attempts to flush all dirty pages. The bit is
cleared when we start writeback on that work item. If the bit is
already set when we attempt to queue !nr_pages writeback, then we
simply ignore it.
This provides us one full flush in flight, with one pending as well,
and makes for more efficient handling of this type of writeback.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now that we have no external callers of wb_start_writeback(), we
can shuffle the passing in of 'nr_pages'. Everybody passes in 0
at this point, so just kill the argument and move the dirty
count retrieval to that function.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We don't have any callers outside of fs-writeback.c anymore,
make it private.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Similar to wakeup_flusher_threads(), except that we only wake
up the flusher threads on the specified backing device.
No functional changes in this patch.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
All the callers pass in 'true' for range_cyclic, so kill the
argument.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We're writing back the full range of dirty pages on the devices,
there's no point in making this special and not do normal range
cyclic writeback.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Everybody is passing in 0 now, let's get rid of the argument.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since the previous commit removed any case where grow_buffers()
would return failure due to memory allocations, we can safely
remove the case where we have to call free_more_memory() in
this function.
Since this is also the last user of free_more_memory(), kill
it off completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We currently use it for find_or_create_page(), which means that it
cannot fail. Ensure we also pass in 'retry == true' to
alloc_page_buffers(), which also ensure that it cannot fail.
After this, there are no failure cases in grow_dev_page() that
occur because of a failed memory allocation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Instead of adding weird retry logic in that function, utilize
__GFP_NOFAIL to ensure that the vm takes care of handling any
potential retries appropriately. This means we don't have to
call free_more_memory() from here.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
into devel-stable
This series provides the needed changes to suport the ELF_FDPIC binary
format on ARM. Both MMU and non-MMU systems are supported. This format
has many advantages over the BFLT format used on MMU-less systems, such
as being real ELF that can be parsed by standard tools, can support
shared dynamic libs, etc.
|
|
previous commit 5d37ca14 "ceph: send LSSNAP request to auth mds
of directory inode" is buggy. It makes __choose_mds() choose mds
base on hash of '.snap' dentry.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
commit 3ae0bebc "ceph: queue cap snap only when snap realm's
context changes" introduced a regression: we may not call
queue_realm_cap_snaps() for newly created snap realm. This
regression allows unflushed snapshot data to be overwritten.
Link: http://tracker.ceph.com/issues/21483
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Check for a NULL dsaddr in filelayout_free_lseg() before calling
nfs4_fl_put_deviceid(). This fixes the following oops:
[ 1967.645207] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 1967.646010] IP: [<ffffffffc06d6aea>] nfs4_put_deviceid_node+0xa/0x90 [nfsv4]
[ 1967.646010] PGD c08bc067 PUD 915d3067 PMD 0
[ 1967.753036] Oops: 0000 [#1] SMP
[ 1967.753036] Modules linked in: nfs_layout_nfsv41_files ext4 mbcache jbd2 loop rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache amd64_edac_mod ipmi_ssif edac_mce_amd edac_core kvm_amd sg kvm ipmi_si ipmi_devintf irqbypass pcspkr k8temp ipmi_msghandler i2c_piix4 shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mptsas ttm scsi_transport_sas mptscsih drm mptbase serio_raw i2c_core bnx2 dm_mirror dm_region_hash dm_log dm_mod
[ 1967.790031] CPU: 2 PID: 1370 Comm: ls Not tainted 3.10.0-709.el7.test.bz1463784.x86_64 #1
[ 1967.790031] Hardware name: IBM BladeCenter LS21 -[7971AC1]-/Server Blade, BIOS -[BAE155AUS-1.10]- 06/03/2009
[ 1967.790031] task: ffff8800c42a3f40 ti: ffff8800c4064000 task.ti: ffff8800c4064000
[ 1967.790031] RIP: 0010:[<ffffffffc06d6aea>] [<ffffffffc06d6aea>] nfs4_put_deviceid_node+0xa/0x90 [nfsv4]
[ 1967.790031] RSP: 0000:ffff8800c4067978 EFLAGS: 00010246
[ 1967.790031] RAX: ffffffffc062f000 RBX: ffff8801d468a540 RCX: dead000000000200
[ 1967.790031] RDX: ffff8800c40679f8 RSI: ffff8800c4067a0c RDI: 0000000000000000
[ 1967.790031] RBP: ffff8800c4067980 R08: ffff8801d468a540 R09: 0000000000000000
[ 1967.790031] R10: 0000000000000000 R11: ffffffffffffffff R12: ffff8801d468a540
[ 1967.790031] R13: ffff8800c40679f8 R14: ffff8801d5645300 R15: ffff880126f15ff0
[ 1967.790031] FS: 00007f11053c9800(0000) GS:ffff88012bd00000(0000) knlGS:0000000000000000
[ 1967.790031] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1967.790031] CR2: 0000000000000030 CR3: 0000000094b55000 CR4: 00000000000007e0
[ 1967.790031] Stack:
[ 1967.790031] ffff8801d468a540 ffff8800c4067990 ffffffffc062d2fe ffff8800c40679b0
[ 1967.790031] ffffffffc062b5b4 ffff8800c40679f8 ffff8801d468a540 ffff8800c40679d8
[ 1967.790031] ffffffffc06d39af ffff8800c40679f8 ffff880126f16078 0000000000000001
[ 1967.790031] Call Trace:
[ 1967.790031] [<ffffffffc062d2fe>] nfs4_fl_put_deviceid+0xe/0x10 [nfs_layout_nfsv41_files]
[ 1967.790031] [<ffffffffc062b5b4>] filelayout_free_lseg+0x24/0x90 [nfs_layout_nfsv41_files]
[ 1967.790031] [<ffffffffc06d39af>] pnfs_free_lseg_list+0x5f/0x80 [nfsv4]
[ 1967.790031] [<ffffffffc06d5a67>] _pnfs_return_layout+0x157/0x270 [nfsv4]
[ 1967.790031] [<ffffffffc06c17dd>] nfs4_evict_inode+0x4d/0x70 [nfsv4]
[ 1967.790031] [<ffffffff8121de19>] evict+0xa9/0x180
[ 1967.790031] [<ffffffff8121e729>] iput+0xf9/0x190
[ 1967.790031] [<ffffffffc0652cea>] nfs_dentry_iput+0x3a/0x50 [nfs]
[ 1967.790031] [<ffffffff8121ab4f>] shrink_dentry_list+0x20f/0x490
[ 1967.790031] [<ffffffff8121b018>] d_invalidate+0xd8/0x150
[ 1967.790031] [<ffffffffc065446b>] nfs_readdir_page_filler+0x40b/0x600 [nfs]
[ 1967.790031] [<ffffffffc0654bbd>] nfs_readdir_xdr_to_array+0x20d/0x3b0 [nfs]
[ 1967.790031] [<ffffffff811f3482>] ? __mem_cgroup_commit_charge+0xe2/0x2f0
[ 1967.790031] [<ffffffff81183208>] ? __add_to_page_cache_locked+0x48/0x170
[ 1967.790031] [<ffffffffc0654d60>] ? nfs_readdir_xdr_to_array+0x3b0/0x3b0 [nfs]
[ 1967.790031] [<ffffffffc0654d82>] nfs_readdir_filler+0x22/0x90 [nfs]
[ 1967.790031] [<ffffffff8118351f>] do_read_cache_page+0x7f/0x190
[ 1967.790031] [<ffffffff81215d30>] ? fillonedir+0xe0/0xe0
[ 1967.790031] [<ffffffff8118366c>] read_cache_page+0x1c/0x30
[ 1967.790031] [<ffffffffc0654f9b>] nfs_readdir+0x1ab/0x6b0 [nfs]
[ 1967.790031] [<ffffffffc06bd1c0>] ? nfs4_xdr_dec_layoutget+0x270/0x270 [nfsv4]
[ 1967.790031] [<ffffffff81215d30>] ? fillonedir+0xe0/0xe0
[ 1967.790031] [<ffffffff81215c20>] vfs_readdir+0xb0/0xe0
[ 1967.790031] [<ffffffff81216045>] SyS_getdents+0x95/0x120
[ 1967.790031] [<ffffffff816b9449>] system_call_fastpath+0x16/0x1b
[ 1967.790031] Code: 90 31 d2 48 89 d0 5d c3 85 f6 74 f5 8d 4e 01 89 f0 f0 0f b1 0f 39 f0 74 e2 89 c6 eb eb 0f 1f 40 00 66 66 66 66 90 55 48 89 e5 53 <48> 8b 47 30 48 89 fb a8 04 74 3b 8b 57 60 83 fa 02 74 19 8d 4a
[ 1967.790031] RIP [<ffffffffc06d6aea>] nfs4_put_deviceid_node+0xa/0x90 [nfsv4]
[ 1967.790031] RSP <ffff8800c4067978>
[ 1967.790031] CR2: 0000000000000030
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Fixes: 1ebf98012792 ("NFS/filelayout: Fix racy setting of fl->dsaddr...")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Michael Sterrett reports a NULL pointer dereference on NFSv3 mounts when
CONFIG_NFS_V4 is not set because the NFS UOC rpc_wait_queue has not been
initialized. Move the initialization of the queue out of the CONFIG_NFS_V4
conditional setion.
Fixes: 7d6ddf88c4db ("NFS: Add an iocounter wait function for async RPC tasks")
Cc: stable@vger.kernel.org # 4.11+
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
nfs_idmap_get_desc() can't actually return zero. But if it did then
we would return ERR_PTR(0) which is NULL and the caller,
nfs_idmap_get_key(), doesn't expect that so it leads to a NULL pointer
dereference.
I've cleaned this up by changing the "<=" to "<" so it's more clear that
we don't return ERR_PTR(0).
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
The units of RPC_MAX_AUTH_SIZE is bytes, not 4-byte words. This causes
the client to request a larger-than-necessary session replay slot size.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Now that we no longer try to reserve metadata blocks for delayed
allocations (which tended to overestimate the required number of
blocks significantly), we really don't need retry allocations when the
disk is very full as aggressively any more.
The only time when it makes sense to retry an allocation is if we have
freshly deleted blocks that will only become available after a
transaction commit. And if we lose that race, it's not worth it to
try more than once.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|