path: root/fs
Age | Commit message | Author
2020-05-13 | io_uring: polled fixed file must go through free iteration | Jens Axboe
When we changed the file registration handling, it became important to iterate the bulk request freeing list for fixed files as well, or we miss dropping the fixed file reference. If not, we're leaking references, and we'll get a kworker stuck waiting for file references to disappear. This also means we can remove the special casing of fixed vs non-fixed files, we need to iterate for both and we can just rely on __io_req_aux_free() doing io_put_file() instead of doing it manually. Fixes: 055895537302 ("io_uring: refactor file register/unregister/update handling") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13 | fs: Introduce DCACHE_DONTCACHE | Ira Weiny
DCACHE_DONTCACHE indicates a dentry should not be cached on final dput(). Also add a helper function to mark DCACHE_DONTCACHE on all dentries pointing to a specific inode when that inode is being set I_DONTCACHE. This facilitates dropping dentry references to inodes sooner which require eviction to swap S_DAX mode. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-13 | fs: Lift XFS_IDONTCACHE to the VFS layer | Ira Weiny
DAX effective mode (S_DAX) changes require inode eviction. XFS has an advisory flag (XFS_IDONTCACHE) to prevent caching of the inode if no other additional references are taken. We lift this flag to the VFS layer and change the behavior slightly by allowing the flag to remain even if multiple references are taken. This will expedite the eviction of inodes to change S_DAX. Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-13 | fanotify: don't write with size under sizeof(response) | Fabian Frederick
fanotify_write() only clamped the copy_from_user() size to sizeof(response) for larger values. This patch rejects all smaller values, as suggested by Amir Goldstein, and sets the copy size to the response size unconditionally. Link: https://lore.kernel.org/r/20200512181921.405973-1-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
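The size check described in this commit can be sketched in userspace C. This is a minimal illustration under stated assumptions, not the kernel code: `struct response` and `checked_copy_size()` are hypothetical stand-ins for the fanotify response structure and for the logic in fanotify_write().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical stand-in for the fixed-size permission response. */
struct response {
	int32_t fd;
	uint32_t verdict;
};

/*
 * Return how many bytes to copy from the caller's buffer, or -EINVAL if
 * the buffer is too small to hold one whole response. Larger writes are
 * truncated to exactly one response, so a partial structure is never
 * copied.
 */
ssize_t checked_copy_size(size_t count)
{
	if (count < sizeof(struct response))
		return -EINVAL;
	return (ssize_t)sizeof(struct response);
}
```

With this shape, a write of 4096 bytes is treated as a write of one response, while a write of fewer bytes than one response fails instead of being processed with partly uninitialized data.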
2020-05-13 | fsnotify: Remove proc_fs.h include | Fabian Frederick
proc_fs.h was already included in fdinfo.h Link: https://lore.kernel.org/r/20200512181906.405927-1-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-13 | fanotify: remove reference to fill_event_metadata() | Fabian Frederick
fill_event_metadata() was removed in commit bb2f7b4542c7 ("fanotify: open code fill_event_metadata()") Link: https://lore.kernel.org/r/20200512181836.405879-1-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-13 | fsnotify: add mutex destroy | Fabian Frederick
Call mutex_destroy() before freeing notification group. This only adds some additional debug checks when mutex debugging is enabled but still it may be useful. Link: https://lore.kernel.org/r/20200512181803.405832-1-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-13 | fanotify: prefix should_merge() | Fabian Frederick
Prefix function with fanotify_ like others. Link: https://lore.kernel.org/r/20200512181715.405728-1-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-13 | NFS/pnfs: Don't use RPC_TASK_CRED_NOREF with pnfs | Trond Myklebust
When we're doing pnfs then the credential being used for the RPC call is not necessarily the same as the one used in the open context, so don't use RPC_TASK_CRED_NOREF. Fixes: 612965072020 ("NFSv4: Avoid referencing the cred unnecessarily during NFSv4 I/O") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-05-13 | nsproxy: attach to namespaces via pidfds | Christian Brauner
For quite a while we have been thinking about using pidfds to attach to namespaces. This patchset has existed for about a year already but we've wanted to wait to see how the general api would be received and adopted. Now that more and more programs in userspace have started using pidfds for process management it's time to send this one out. This patch makes it possible to use pidfds to attach to the namespaces of another process, i.e. they can be passed as the first argument to the setns() syscall. When only a single namespace type is specified the semantics are equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However, when a pidfd is passed, multiple namespace flags can be specified in the second setns() argument and setns() will attach the caller to all the specified namespaces all at once or to none of them. Specifying 0 is not valid together with a pidfd. Here are just two obvious examples: setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET); setns(pidfd, CLONE_NEWUSER); Allowing to also attach subsets of namespaces supports various use-cases where callers setns to a subset of namespaces to retain privilege, perform an action and then re-attach another subset of namespaces. If the need arises, as Eric suggested, we can extend this patchset to assume even more context than just attaching all namespaces. His suggestion specifically was about assuming the process' root directory when setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just keep it flexible in terms of supporting subsets of namespaces but let's wait until we have users asking for even more context to be assumed. At that point we can add an extension. 
The obvious example where this is useful is a standard container manager interacting with a running container: pushing and pulling files or directories, injecting mounts, attaching/execing any kind of process, managing network devices; all these operations require attaching to all or at least multiple namespaces at the same time. Given that nowadays most containers are spawned with all namespaces enabled we're currently looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns> nsfds, another 7 to actually perform the namespace switch. With time namespaces we're looking at about 16 syscalls. (We could amortize the first 7 or 8 syscalls for opening the nsfds by stashing them in each container's monitor process but that would mean we need to send around those file descriptors through unix sockets every time we want to interact with the container or keep on-disk state. Even in scenarios where a caller wants to join a particular namespace in a particular order callers still profit from batching other namespaces. That mostly applies to the user namespace but all container runtimes I found join the user namespace first no matter if it privileges or deprivileges the container similar to how unshare behaves.) With pidfds this becomes a single syscall no matter how many namespaces are supposed to be attached to. A decently designed, large-scale container manager usually isn't the parent of any of the containers it spawns so the containers don't die when it crashes or needs to update or reinitialize. This means that interacting with containers through pids is inherently racy for the manager, especially on systems where the maximum pid number is not significantly bumped. This is even more problematic since we often spawn and manage thousands or tens of thousands of containers. Interacting with a container through a pid thus can become risky quite quickly. 
Especially since we allow for an administrator to enable advanced features such as syscall interception where we're performing syscalls in lieu of the container. In all of those cases we use pidfds if they are available and we pass them around as stable references. Using them to setns() to the target process' namespaces is as reliable as using nsfds. Either the target process is already dead and we get ESRCH or we manage to attach to its namespaces but we can't accidentally attach to another process' namespaces. So pidfds lend themselves to be used with this api. The other main advantage is that with this change the pidfd becomes the only relevant token for most container interactions and it's the only token we need to create and send around. Apart from significantly reducing the number of syscalls from double digit to single digit, which is a decent reason post-spectre/meltdown, this also allows switching to a set of namespaces atomically, i.e. either attaching to all the specified namespaces succeeds or we fail. If we fail we haven't changed a single namespace. There are currently three namespaces that can fail (other than for ENOMEM which really is not very interesting since we then have other problems anyway) for non-trivial reasons: user, mount, and pid namespaces. We can fail to attach to a pid namespace if it is not our current active pid namespace or a descendant of it. We can fail to attach to a user namespace because we are multi-threaded or because our current mount namespace shares filesystem state with other tasks, or because we're trying to setns() to the same user namespace, i.e. the target task has the same user namespace as we do. We can fail to attach to a mount namespace because it shares filesystem state with other tasks or because we fail to look up the new root for the new mount namespace. In most non-pathological scenarios these issues can be somewhat mitigated. 
But there are cases where we're half-attached to some namespace and failing to attach to another one. I've talked about some of these problems during the hallway track (something only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles in 2018(?). Even if all these issues could be avoided with super careful userspace coding it would be nicer to have this done in-kernel. Pidfds seem to lend themselves nicely for this. The other neat thing about this is that setns() becomes an actual counterpart to the namespace bits of unshare(). Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Jann Horn <jannh@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com
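From userspace, the single-call attach described in this commit message might look roughly like the sketch below. The helper names `open_pidfd()` and `join_namespaces()` are hypothetical, and the sketch uses raw syscall(2) invocations so that it does not depend on libc wrappers; the local rejection of a zero mask simply mirrors the statement that 0 is not valid together with a pidfd.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sched.h>        /* CLONE_NEW* flags for callers */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a pidfd for @pid; returns -1 with errno set on failure. */
int open_pidfd(pid_t pid)
{
#ifdef SYS_pidfd_open
	return (int)syscall(SYS_pidfd_open, pid, 0U);
#else
	errno = ENOSYS; /* headers too old to know pidfd_open */
	return -1;
#endif
}

/*
 * Attach the caller to every namespace of the pidfd's process that is
 * named in @nstypes, e.g. CLONE_NEWNS | CLONE_NEWNET. According to the
 * commit message the switch is all-or-nothing. A zero mask is rejected
 * up front because it is not valid together with a pidfd.
 */
int join_namespaces(int pidfd, int nstypes)
{
	if (nstypes == 0) {
		errno = EINVAL;
		return -1;
	}
	return (int)syscall(SYS_setns, pidfd, nstypes);
}
```

A container manager would call `open_pidfd()` once for the container's init process and then pass the result to `join_namespaces()` with the desired flag mask. Actually switching namespaces needs the usual privileges and a kernel that includes this change.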
2020-05-13 | ovl: return required buffer size for file handles | Lubos Dolezel
Overlayfs doesn't work well with the fanotify mechanism. Fanotify first probes for the required buffer size for the file handle, but overlayfs currently bails out without passing the size back. That results in errors in the kernel log, such as: [527944.485384] overlayfs: failed to encode file handle (/, err=-75, buflen=0, len=29, type=1) [527944.485386] fanotify: failed to encode fid (fsid=ae521e68.a434d95f, type=255, bytes=0, err=-2) Signed-off-by: Lubos Dolezel <lubos@dolezel.info> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-13 | ovl: sync dirty data when remounting to ro mode | Chengguang Xu
sync_filesystem() does not sync dirty data for readonly filesystem during umount, so before changing to readonly filesystem we should sync dirty data for data integrity. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-13 | ovl: whiteout inode sharing | Chengguang Xu
Share a single inode among different whiteout files to save inodes and speed up delete operations. If EMLINK is encountered when linking a shared whiteout, create a new one. In case of any other error, disable sharing for this super block. Note: ofs->whiteout is protected by inode lock on workdir. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-13 | ovl: inherit SB_NOSEC flag from upperdir | Jeffle Xu
Since the stacking of regular file operations [1], the overlayfs edition of write_iter() is called when writing regular files. Since then, xattr lookup is needed on every write since file_remove_privs() is called from ovl_write_iter(), which would become the performance bottleneck when writing small chunks of data. In my test case, file_remove_privs() would consume ~15% CPU when running fstime of unixbench (the workload is repeatedly writing 1 KB to the same file) [2]. Inherit the SB_NOSEC flag from upperdir. Since then xattr lookup would be done only once on the first write. Unixbench fstime gets a ~20% performance gain with this patch. [1] https://lore.kernel.org/lkml/20180606150905.GC9426@magnolia/T/ [2] https://www.spinics.net/lists/linux-unionfs/msg07153.html Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-13 | ovl: skip overlayfs superblocks at global sync | Konstantin Khlebnikov
Stacked filesystems like overlayfs have no writeback of their own, but they have to forward syncfs() requests to the backend to keep data integrity. During a global sync() each overlayfs instance calls the ->sync_fs() method for its backend, although the backend itself is in the global list of superblocks too. As a result one sync() syscall could write one superblock several times and send multiple disk barriers. This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that. Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
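The effect of such a skip flag can be illustrated with a small userspace model. This is only a sketch: `struct superblock`, `SKIP_SYNC` and `sync_all()` are hypothetical names standing in for the kernel's superblock list, SB_I_SKIP_SYNC and the global sync walk.

```c
#include <stddef.h>

#define SKIP_SYNC 0x1u /* hypothetical stand-in for SB_I_SKIP_SYNC */

struct superblock {
	unsigned int iflags;     /* internal flags */
	unsigned int sync_count; /* how many times this sb was synced */
};

/*
 * Model of a global sync: visit every superblock, but skip those that
 * are flagged, because their backing filesystem appears in the same
 * list and is synced on its own. Returns the number synced.
 */
size_t sync_all(struct superblock *sbs, size_t n)
{
	size_t synced = 0;

	for (size_t i = 0; i < n; i++) {
		if (sbs[i].iflags & SKIP_SYNC)
			continue;
		sbs[i].sync_count++;
		synced++;
	}
	return synced;
}
```

In this model a stacked filesystem carries the flag, so one pass over the list touches each backing superblock exactly once.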
2020-05-13 | ovl: index dir act as work dir | Amir Goldstein
With index=on, let index dir act as the work dir for copy up and cleanups. This will help implementing whiteout inode sharing. We still create the "work" dir on mount regardless of index=on and it is used to test the features supported by upper fs. One reason is that before the feature tests, we do not know if index could be enabled or not. The reason we do not use "index" directory also as workdir with index=off is because the existence of the "index" directory acts as a simple persistent signal that index was enabled on this filesystem and tools may want to use that signal. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-13 | ovl: prepare to copy up without workdir | Amir Goldstein
With index=on, we copy up lower hardlinks to work dir and move them into index dir. Fix locking to allow work dir and index dir to be the same directory. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-13 | ovl: cleanup non-empty directories in ovl_indexdir_cleanup() | Amir Goldstein
Teach ovl_indexdir_cleanup() to remove temp directories containing whiteouts to prepare for using index dir instead of work dir for removing merge directories. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-13 | ovl: resolve more conflicting mount options | Amir Goldstein
Similar to the way that a conflict between metacopy=on,redirect_dir=off is resolved, also resolve conflicts between nfs_export=on,index=off and nfs_export=on,metacopy=on. An explicit mount option wins over a default config value. Both explicit mount options result in an error. Without this change the xfstests group overlay/exportfs are skipped if metacopy is enabled by default. Reported-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
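The resolution rule for nfs_export=on,index=off can be sketched as a small decision function. This is a simplified model, not the overlayfs code: `struct opt` and `resolve_nfs_export_index()` are hypothetical, and enabling index when neither option was given explicitly is an assumption of this sketch that the commit message does not spell out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one boolean option and where its value came from. */
struct opt {
	bool on;          /* effective value */
	bool is_explicit; /* true if given as a mount option, false if a default */
};

/*
 * Resolve a conflict between nfs_export=on and index=off:
 *  - both given explicitly: error;
 *  - explicit index=off: it wins, nfs_export is turned off;
 *  - otherwise: index is turned on (assumed also when both are defaults).
 * Returns 0 on success or -EINVAL for conflicting explicit options.
 */
int resolve_nfs_export_index(struct opt *nfs_export, struct opt *index)
{
	if (!nfs_export->on || index->on)
		return 0; /* nothing to resolve */

	if (nfs_export->is_explicit && index->is_explicit)
		return -EINVAL;

	if (index->is_explicit)
		nfs_export->on = false;
	else
		index->on = true;
	return 0;
}
```

The same pattern would apply to the nfs_export=on,metacopy=on pair, with the direction of the adjustment chosen to match which option was explicit.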
2020-05-13 | ovl: potential crash in ovl_fid_to_fh() | Dan Carpenter
The "buflen" value comes from the user and there is a potential that it could be zero. In do_handle_to_path() we know that "handle->handle_bytes" is non-zero and we do: handle_dwords = handle->handle_bytes >> 2; So values 1-3 become zero. Then in ovl_fh_to_dentry() we do: int len = fh_len << 2; So now len is in the "0,4-128" range and a multiple of 4. But if "buflen" is zero it will try to copy negative bytes when we do the memcpy in ovl_fid_to_fh(). memcpy(&fh->fb, fid, buflen - OVL_FH_WIRE_OFFSET); And that will lead to a crash. Thanks to Amir Goldstein for his help with this patch. Fixes: cbe7fba8edfc ("ovl: make sure that real fid is 32bit aligned in memory") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Cc: <stable@vger.kernel.org> # v5.5 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
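The arithmetic behind this bug can be reproduced in a few lines of userspace C. The sketch below is illustrative only: `WIRE_OFFSET` is a hypothetical stand-in for OVL_FH_WIRE_OFFSET (its value here is an assumption), and `roundtrip_len()` and `copy_payload()` are made-up helpers that mirror the shift round trip and a guarded version of the memcpy.

```c
#include <stddef.h>
#include <string.h>

#define WIRE_OFFSET 3 /* hypothetical stand-in for OVL_FH_WIRE_OFFSET */

/* Length seen by the filesystem after the bytes -> dwords -> bytes trip. */
int roundtrip_len(int handle_bytes)
{
	int handle_dwords = handle_bytes >> 2; /* 1..3 bytes become 0 dwords */

	return handle_dwords << 2;
}

/*
 * Copy the first (buflen - WIRE_OFFSET) bytes of @fid into @dst.
 * Returns the number of bytes copied, or -1 if @buflen is too small,
 * so the subtraction can never go negative and reach memcpy() as a
 * huge size_t.
 */
int copy_payload(void *dst, size_t dst_size, const unsigned char *fid,
		 int buflen)
{
	size_t n;

	if (buflen <= WIRE_OFFSET)
		return -1;
	n = (size_t)(buflen - WIRE_OFFSET);
	if (n > dst_size)
		return -1;
	memcpy(dst, fid, n);
	return (int)n;
}
```

A user-supplied length of 1 to 3 bytes passes a non-zero check but arrives as 0 after the shifts, which is why the guard has to sit next to the subtraction rather than at the syscall boundary.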
2020-05-12 | zonefs: use REQ_OP_ZONE_APPEND for sync DIO | Johannes Thumshirn
Synchronous direct I/O to a sequential write only zone can be issued using the new REQ_OP_ZONE_APPEND request operation. As dispatching multiple BIOs can potentially result in reordering, we cannot support asynchronous IO via this interface. We also can only dispatch up to queue_max_zone_append_sectors() via the new zone-append method and have to return a short write back to user-space in case an IO larger than queue_max_zone_append_sectors() has been issued. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-12 | block: add blk_io_schedule() for avoiding task hung in sync dio | Ming Lei
Sync dio could be big, or may take a long time in discard or in case of IO failure. We have already prevented hung-task warnings in submit_bio_wait() and blk_execute_rq(), so apply the same trick to prevent them in sync dio. Add a blk_io_schedule() helper that uses io_schedule_timeout() to prevent the hung-task warning. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Cc: Salman Qazi <sqazi@google.com> Cc: Jesse Barnes <jsbarnes@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-12 | fs-verity: remove unnecessary extern keywords | Eric Biggers
Remove the unnecessary 'extern' keywords from function declarations. This makes it so that we don't have a mix of both styles, so it won't be ambiguous what to use in new fs-verity patches. This also makes the code shorter and matches the 'checkpatch --strict' expectation. Link: https://lore.kernel.org/r/20200511192118.71427-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 | fs-verity: fix all kerneldoc warnings | Eric Biggers
Fix all kerneldoc warnings in fs/verity/ and include/linux/fsverity.h. Most of these were due to missing documentation for function parameters. Detected with: scripts/kernel-doc -v -none fs/verity/*.{c,h} include/linux/fsverity.h This cleanup makes it possible to check new patches for kerneldoc warnings without having to filter out all the existing ones. Link: https://lore.kernel.org/r/20200511192118.71427-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 | fscrypt: remove unnecessary extern keywords | Eric Biggers
Remove the unnecessary 'extern' keywords from function declarations. This makes it so that we don't have a mix of both styles, so it won't be ambiguous what to use in new fscrypt patches. This also makes the code shorter and matches the 'checkpatch --strict' expectation. Link: https://lore.kernel.org/r/20200511191358.53096-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 | fscrypt: fix all kerneldoc warnings | Eric Biggers
Fix all kerneldoc warnings in fs/crypto/ and include/linux/fscrypt.h. Most of these were due to missing documentation for function parameters. Detected with: scripts/kernel-doc -v -none fs/crypto/*.{c,h} include/linux/fscrypt.h This cleanup makes it possible to check new patches for kerneldoc warnings without having to filter out all the existing ones. For consistency, also adjust some function "brief descriptions" to include the parentheses and to wrap at 80 characters. (The latter matches the checkpatch expectation.) Link: https://lore.kernel.org/r/20200511191358.53096-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-12 | dlm: remove BUG() before panic() | Arnd Bergmann
Building a kernel with clang sometimes fails with an objtool error in dlm: fs/dlm/lock.o: warning: objtool: revert_lock_pc()+0xbd: can't find jump dest instruction at .text+0xd7fc The problem is that BUG() never returns and the compiler knows that anything after it is unreachable, however the panic still emits some code that does not get fully eliminated. Having both BUG() and panic() is really pointless as the BUG() kills the current process and the subsequent panic() never hits. In most cases, we probably don't really want either and should replace the DLM_ASSERT() statements with WARN_ON(), as has been done for some of them. Remove the BUG() here so the user at least sees the panic message and we can reliably build randconfig kernels. Fixes: e7fd41792fc0 ("[DLM] The core of the DLM for GFS2/CLVM") Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: clang-built-linux@googlegroups.com Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Teigland <teigland@redhat.com>
2020-05-12 | dlm: Switch to using wait_event() | Ross Lagerwall
We saw an issue in a production server on a customer deployment where DLM 4.0.7 gets "stuck" and is unable to join new lockspaces. There is no useful response for the dlm in do_event() if wait_event_interruptible() is interrupted, so switch to wait_event(). Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: David Teigland <teigland@redhat.com>
2020-05-12 | fs:dlm:remove unneeded semicolon in rcom.c | Wu Bo
Fix the following coccicheck warning: fs/dlm/rcom.c:566:2-3: Unneeded semicolon Signed-off-by: Wu Bo <wubo40@huawei.com> Signed-off-by: David Teigland <teigland@redhat.com>
2020-05-12 | dlm: user: Replace zero-length array with flexible-array member | Gustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David Teigland <teigland@redhat.com>
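As a small illustration of the conversion this commit describes, the sketch below declares a structure with a C99 flexible array member and allocates it with a trailing payload. `struct msg` and `msg_alloc()` are hypothetical names used only for this example and do not come from the dlm code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical message type; the flexible array member must come last. */
struct msg {
	size_t len;
	char data[]; /* C99 flexible array member (previously: char data[0];) */
};

/*
 * Allocate a message with room for @len payload bytes. sizeof(struct msg)
 * does not count the flexible array, so the allocation size is computed
 * the same way as with a zero-length array.
 */
struct msg *msg_alloc(const char *payload, size_t len)
{
	struct msg *m = malloc(sizeof(*m) + len);

	if (!m)
		return NULL;
	m->len = len;
	memcpy(m->data, payload, len);
	return m;
}
```

Moving `data` above `len` in the declaration would be rejected by the compiler, which is the safety property the commit message refers to.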
2020-05-12 | dlm: dlm_internal: Replace zero-length array with flexible-array member | Gustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David Teigland <teigland@redhat.com>
2020-05-12 | Merge tag 'gfs2-v5.7-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 | Linus Torvalds
Pull gfs2 fixes from Andreas Gruenbacher:
"Various gfs2 fixes.
Fixes for bugs prior to v5.7:
- Fix random block reads when reading fragmented journals (v5.2)
- Fix a possible random memory access in gfs2_walk_metadata (v5.3)
Fixes for v5.7:
- Fix several overlooked gfs2_qa_get / gfs2_qa_put imbalances
- Fix several bugs in the new filesystem withdraw logic"
* tag 'gfs2-v5.7-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  Revert "gfs2: Don't demote a glock until its revokes are written"
  gfs2: If go_sync returns error, withdraw but skip invalidate
  gfs2: Grab glock reference sooner in gfs2_add_revoke
  gfs2: don't call quota_unhold if quotas are not locked
  gfs2: move privileged user check to gfs2_quota_lock_check
  gfs2: remove check for quotas on in gfs2_quota_check
  gfs2: Change BUG_ON to an assert_withdraw in gfs2_quota_change
  gfs2: Fix problems regarding gfs2_qa_get and _put
  gfs2: More gfs2_find_jhead fixes
  gfs2: Another gfs2_walk_metadata fix
  gfs2: Fix use-after-free in gfs2_logd after withdraw
  gfs2: Fix BUG during unmount after file system withdraw
  gfs2: Fix error exit in do_xmote
  gfs2: fix withdraw sequence deadlock
2020-05-12 | pstore: Refactor pstorefs record list removal | Kees Cook
The "unlink" handling should perform list removal (which can also make sure records don't get double-erased), and the "evict" handling should be responsible only for memory freeing. Link: https://lore.kernel.org/lkml/20200506152114.50375-8-keescook@chromium.org/ Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 | pstore: Add proper unregister lock checking | Kees Cook
The pstore backend lock wasn't being used during pstore_unregister(). Add sanity check and locking. Link: https://lore.kernel.org/lkml/20200506152114.50375-7-keescook@chromium.org/ Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 | pstore: Convert "records_list" locking to mutex | Kees Cook
The pstorefs internal list lock doesn't need to be a spinlock and will create problems when trying to access the list in the subsequent patch that will walk the pstorefs records during pstore_unregister(). Change this to a mutex to avoid may_sleep() warnings when unregistering devices. Link: https://lore.kernel.org/lkml/20200506152114.50375-6-keescook@chromium.org/ Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 | pstore: Rename "allpstore" to "records_list" | Kees Cook
The name "allpstore" doesn't carry much meaning, so rename it to what it actually is: the list of all records present in the filesystem. The lock is also renamed accordingly. Link: https://lore.kernel.org/lkml/20200506152114.50375-5-keescook@chromium.org/ Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 | pstore: Convert "psinfo" locking to mutex | Kees Cook
Currently pstore can only have a single backend attached at a time, and it tracks the active backend via "psinfo", under a lock. The locking for this does not need to be a spinlock, and in order to avoid may_sleep() issues during future changes to pstore_unregister(), switch to a mutex instead. Link: https://lore.kernel.org/lkml/20200506152114.50375-4-keescook@chromium.org/ Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 | pstore: Rename "pstore_lock" to "psinfo_lock" | Kees Cook
The name "pstore_lock" sounds very global, but it is only supposed to be used for managing changes to "psinfo", so rename it accordingly. Link: https://lore.kernel.org/lkml/20200506152114.50375-3-keescook@chromium.org/ Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-12 | pstore: Drop useless try_module_get() for backend | Kees Cook
There is no reason to be doing a module get/put in pstore_register(), since the module calling pstore_register() cannot be unloaded since it hasn't finished its initialization. Remove it so there is no confusion about how registration ordering works. Link: https://lore.kernel.org/lkml/20200506152114.50375-2-keescook@chromium.org/ Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-11 | f2fs: compress: fix zstd data corruption | Chao Yu
During zstd compression, ZSTD_endStream() may return a non-zero value because the destination buffer is full while compressed data still remains in the intermediate buffer. This means that the zstd algorithm cannot save at least one block of space, so just write back the raw data instead of the compressed data. This fixes data corruption when decompressing incompletely stored compressed data. Fixes: 50cfa66f0de0 ("f2fs: compress: support zstd compress algorithm") Signed-off-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 | f2fs: add compressed/gc data read IO stat | Chao Yu
in order to account data read IOs more accurately. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 | f2fs: fix potential use-after-free issue | Chao Yu
In error path of f2fs_read_multi_pages(), it should let last referrer release decompress io context memory, otherwise, other referrer will cause use-after-free issue. Fixes: 4c8ff7095bef ("f2fs: support data compression") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 | f2fs: compress: don't handle non-compressed data in workqueue | Chao Yu
If bio has no compressed data, we don't need to handle end_io work in workqueue, instead, it should just let interrupter handle it directly to speed up IO response. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 | f2fs: remove redundant assignment to variable err | Colin Ian King
The variable err is being assigned with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 | f2fs: refactor resize_fs to avoid meta updates in progress | Jaegeuk Kim
Sahitya raised an issue: - prevent meta updates while checkpoint is in progress allocate_segment_for_resize() can cause metapage updates if it requires to change the current node/data segments for resizing. Stop these meta updates when there is a checkpoint already in progress to prevent inconsistent CP data. Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 | f2fs: use round_up to enhance calculation | Chao Yu
.i_cluster_size should be power of 2, so we can use round_up() instead of roundup() to enhance the calculation. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
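The difference between the two rounding helpers can be shown with a userspace sketch. The functions `roundup_div()` and `round_up_pow2()` are hypothetical names modeled on the idea behind the kernel's roundup() and round_up() macros; they are not copies of the kernel definitions.

```c
#include <stdint.h>

/* Generic version: works for any y > 0 but needs a division. */
uint32_t roundup_div(uint32_t x, uint32_t y)
{
	return ((x + (y - 1)) / y) * y;
}

/*
 * Power-of-two version: y must be a power of two. Setting all bits
 * below y in (x - 1) and adding one reaches the next multiple of y
 * using only bit operations.
 */
uint32_t round_up_pow2(uint32_t x, uint32_t y)
{
	return ((x - 1) | (y - 1)) + 1;
}
```

For a power-of-two `y` both helpers agree on inputs that do not overflow, so switching is safe exactly when the cluster size is guaranteed to be a power of two, as the commit message states.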
2020-05-11 | f2fs: introduce F2FS_IOC_RESERVE_COMPRESS_BLOCKS | Chao Yu
This patch introduces a new ioctl to roll back all compressed inode status: - add reserved blocks in dnode blocks - increase i_compr_blocks, i_blocks, total_valid_block_count - remove immutable flag. Then the compressed inode can be restored to support overwrite functionality again. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 | f2fs: Avoid double lock for cp_rwsem during checkpoint | Sayali Lokhande
There could be a scenario where f2fs_sync_node_pages gets called during checkpoint, which in turn tries to flush inline data and calls iput(). This results in deadlock as iput() tries to hold cp_rwsem, which is already held at the beginning by checkpoint->block_operations().
Call stack:
Thread A                               Thread B
f2fs_write_checkpoint()
- block_operations(sbi)
 - f2fs_lock_all(sbi);
  - down_write(&sbi->cp_rwsem);
                                       - open()
                                        - igrab()
                                       - write() write inline data
                                       - unlink()
- f2fs_sync_node_pages()
 - if (is_inline_node(page))
  - flush_inline_data()
   - ilookup()
     page = f2fs_pagecache_get_page()
     if (!page)
      goto iput_out;
     iput_out:
                                       - close()
                                       - iput()
     iput(inode);
     - f2fs_evict_inode()
      - f2fs_truncate_blocks()
       - f2fs_lock_op()
        - down_read(&sbi->cp_rwsem);
Fixes: 2049d4fcb057 ("f2fs: avoid multiple node page writes due to inline_data") Signed-off-by: Sayali Lokhande <sayalil@codeaurora.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 | f2fs: report delalloc reserve as non-free in statfs for project quota | Konstantin Khlebnikov
This reserved space isn't committed yet but cannot be used for allocations. For userspace it has no difference from used space. See the same fix in ext4 commit f06925c73942 ("ext4: report delalloc reserve as non-free in statfs for project quota"). Fixes: ddc34e328d06 ("f2fs: introduce f2fs_statfs_project") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-11 | f2fs: Fix wrong stub helper update_sit_info | YueHaibing
update_sit_info should be f2fs_update_sit_info, otherwise the build fails when CONFIG_F2FS_STAT_FS is not set. Fixes: fc7100ea2a52 ("f2fs: Add f2fs stats to sysfs") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>