summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-04-02open_last_lookups(): consolidate fsnotify_create() callsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02take post-lookup part of do_last() out of loopAl Viro
now we can have open_last_lookups() directly from the loop in path_openat() - the rest of do_last() never returns a symlink to follow, so we can bloody well leave the loop first. Rename the rest of that thing from do_last() to do_open() and make it return an int. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02link_path_walk(): sample parent's i_uid and i_mode for the last componentAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02__nd_alloc_stack(): make it return boolAl Viro
... and adjust the caller (reserve_stack()). Rename to nd_alloc_stack(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02reserve_stack(): switch to __nd_alloc_stack()Al Viro
expand the call of nd_alloc_stack() into it (and don't recheck the depth on the second call) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02pick_link(): take reserving space on stack into a new helperAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02pick_link(): more straightforward handling of allocation failuresAl Viro
pick_link() needs to push onto stack; we start with using two-element array embedded into struct nameidata and the first time we need more than that we switch to separately allocated array. Allocation can fail, of course, and handling of that would be simple enough - we need to drop 'link' and bugger off. However, the things get more complicated in RCU mode. There we must do GFP_ATOMIC allocation. If that fails, we try to switch to non-RCU mode and repeat the allocation. To switch to non-RCU mode we need to grab references to 'link' and to everything in nameidata. The latter done by unlazy_walk(); the former - legitimize_path(). 'link' must go first - after unlazy_walk() we are out of RCU-critical period and it's too late to call legitimize_path() since the references in link->mnt and link->dentry might be pointing to freed and reused memory. So we do legitimize_path(), then unlazy_walk(). And that's where it gets too subtle: what to do if the former fails? We MUST do path_put(link) to avoid leaks. And we can't do that under rcu_read_lock(). Solution in mainline was to empty then nameidata manually, drop out of RCU mode and then do put_path(). In effect, we open-code the things eventual terminate_walk() would've done on error in RCU mode. That looks badly out of place and confusing. We could add a comment along the lines of the explanation above, but... there's a simpler solution. Call unlazy_walk() even if legitimaze_path() fails. It will take us out of RCU mode, so we'll be able to do path_put(link). Yes, it will do unnecessary work - attempt to grab references on the stuff in nameidata, only to have them dropped as soon as we return the error to upper layer and get terminate_walk() called there. So what? We are thoroughly off the fast path by that point - we had GFP_ATOMIC allocation fail, we had ->d_seq or mount_lock mismatch and we are about to try walking the same path from scratch in non-RCU mode. Which will need to do the same allocation, this time with GFP_KERNEL, so it will be able to apply memory pressure for blocking stuff. Compared to that the cost of several lockref_get_not_dead() is noise. And the logics become much easier to understand that way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02fold path_to_nameidata() into its only remaining callerAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02pick_link(): pass it struct path already with normal refcounting rulesAl Viro
step_into() tries to avoid grabbing and dropping mount references on the steps that do not involve crossing mountpoints (which is obviously the majority of cases). So it uses a local struct path with unusual refcounting rules - path.mnt is pinned if and only if it's not equal to nd->path.mnt. We used to have similar beasts all over the place and we had quite a few bugs crop up in their handling - it's easy to get confused when changing e.g. cleanup on failure exits (or adding a new check, etc.) Now that's mostly gone - the step_into() instance (which is what we need them for) is the only one left. It is exposed to mount traversal and it's (shortly) seen by pick_link(). Since pick_link() needs to store it in link stack, where the normal rules apply, it has to make sure that mount is pinned regardless of nd->path.mnt value. That's done on all calls of pick_link() and very early in those. Let's do that in the caller (step_into()) instead - that way the fewer places need to be aware of such struct path instances. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02fs/namei.c: kill follow_mount()Al Viro
The only remaining caller (path_pts()) should be using follow_down() anyway. And clean path_pts() a bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02non-RCU analogue of the previous commitAl Viro
new helper: choose_mountpoint(). Wrapper around choose_mountpoint_rcu(), similar to lookup_mnt() vs. __lookup_mnt(). follow_dotdot() switched to it. Now we don't grab mount_lock exclusive anymore; note that the primitive used non-RCU mount traversals in other direction (lookup_mnt()) doesn't bother with that either - it uses mount_lock seqcount instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02helper for mount rootwards traversalAl Viro
The loops in follow_dotdot{_rcu()} are doing the same thing: we have a mount and we want to find out how far up the chain of mounts do we need to go. We follow the chain of mount until we find one that is not directly overmounting the root of another mount. If such a mount is found, we want the location it's mounted upon. If we run out of chain (i.e. get to a mount that is not mounted on anything else) or run into process' root, we report failure. On success, we want (in RCU case) d_seq of resulting location sampled or (in non-RCU case) references to that location acquired. This commit introduces such primitive for RCU case and switches follow_dotdot_rcu() to it; non-RCU case will be go in the next commit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02follow_dotdot(): be lazy about changing nd->pathAl Viro
Change nd->path only after the loop is done and only in case we hadn't ended up finding ourselves in root. Same for NO_XDEV check. That separates the "check how far back do we need to go through the mount stack" logics from the rest of .. traversal. NOTE: path_get/path_put introduced here are temporary. They will go away later in the series. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02follow_dotdot_rcu(): be lazy about changing nd->pathAl Viro
Change nd->path only after the loop is done and only in case we hadn't ended up finding ourselves in root. Same for NO_XDEV check. Don't recheck mount_lock on each step either. That separates the "check how far back do we need to go through the mount stack" logics from the rest of .. traversal. Note that the sequence for d_seq/d_inode here is * sample mount_lock seqcount ... * sample d_seq * fetch d_inode * verify mount_lock seqcount The last step makes sure that d_inode value we'd got matches d_seq - it dentry is guaranteed to have been a mountpoint through the entire thing, so its d_inode must have been stable. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02follow_dotdot{,_rcu}(): massage loopsAl Viro
The logics in both of them is the same: while true if in process' root // uncommon break if *not* in mount root // normal case find the parent return if at absolute root // very uncommon break move to underlying mountpoint report that we are in root Pull the common path out of the loop: if in process' root // uncommon goto in_root if unlikely(in mount root) while true if at absolute root goto in_root move to underlying mountpoint if in process' root goto in_root if in mount root break; find the parent // we are not in mount root return in_root: report that we are in root The reason for that transformation is that we get to keep the common path straight *and* get a separate block for "move through underlying mountpoints", which will allow to sanitize NO_XDEV handling there. What's more, the pared-down loops will be easier to deal with - in particular, non-RCU case has no need to grab mount_lock and rewriting it to the form that wouldn't do that is a non-trivial change. Better do that with less stuff getting in the way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02lift all calls of step_into() out of follow_dotdot/follow_dotdot_rcuAl Viro
lift step_into() into handle_dots() (where they merge with each other); have follow_... return dentry and pass inode/seq to the caller. [braino fix folded; kudos to Qian Cai <cai@lca.pw> for reporting it] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-01Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Fix out-of-sync IVs in self-test for IPsec AEAD algorithms Algorithms: - Use formally verified implementation of x86/curve25519 Drivers: - Enhance hwrng support in caam - Use crypto_engine for skcipher/aead/rsa/hash in caam - Add Xilinx AES driver - Add uacce driver - Register zip engine to uacce in hisilicon - Add support for OCTEON TX CPT engine in marvell" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (162 commits) crypto: af_alg - bool type cosmetics crypto: arm[64]/poly1305 - add artifact to .gitignore files crypto: caam - limit single JD RNG output to maximum of 16 bytes crypto: caam - enable prediction resistance in HRWNG bus: fsl-mc: add api to retrieve mc version crypto: caam - invalidate entropy register during RNG initialization crypto: caam - check if RNG job failed crypto: caam - simplify RNG implementation crypto: caam - drop global context pointer and init_done crypto: caam - use struct hwrng's .init for initialization crypto: caam - allocate RNG instantiation descriptor with GFP_DMA crypto: ccree - remove duplicated include from cc_aead.c crypto: chelsio - remove set but not used variable 'adap' crypto: marvell - enable OcteonTX cpt options for build crypto: marvell - add the Virtual Function driver for CPT crypto: marvell - add support for OCTEON TX CPT engine crypto: marvell - create common Kconfig and Makefile for Marvell crypto: arm/neon - memzero_explicit aes-cbc key crypto: bcm - Use scnprintf() for avoiding potential buffer overflow crypto: atmel-i2c - Fix wakeup fail ...
2020-04-01ext4: save all error info in save_error_info() and drop ext4_set_errno()Theodore Ts'o
Using a separate function, ext4_set_errno() to set the errno is problematic because it doesn't do the right thing once s_last_error_errorcode is non-zero. It's also less racy to set all of the error information all at once. (Also, as a bonus, it shrinks code size slightly.) Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu Fixes: 878520ac45f9 ("ext4: save the error code which triggered...") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-01NFS: Try to join page groups before an O_DIRECT retransmissionTrond Myklebust
If we have to retransmit requests, try to join their page groups first. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-01NFS: Refactor nfs_lock_and_join_requests()Trond Myklebust
Refactor nfs_lock_and_join_requests() in order to separate out the subrequest merging into its own function nfs_lock_and_join_group() that can be used by O_DIRECT. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-01NFS: Reverse the submission order of requests in __nfs_pageio_add_request()Trond Myklebust
If we have to split the request up into subrequests, we have to submit the request pointed to by the function call parameter last, in case there is an error or other issue that causes us to exit before the last request is submitted. The reason is that the caller is expected to perform cleanup in those cases. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-01NFS: Clean up nfs_lock_and_join_requests()Trond Myklebust
Clean up nfs_lock_and_join_requests() to simplify the calculation of the range covered by the page group, taking into account the presence of mirrors. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-01NFS: Remove the redundant function nfs_pgio_has_mirroring()Trond Myklebust
We need to trust that desc->pg_mirror_idx is set correctly, whether or not mirroring is enabled. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-01NFS: Fix memory leaks in nfs_pageio_stop_mirroring()Trond Myklebust
If we just set the mirror count to 1 without first clearing out the mirrors, we can leak queued up requests. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-01NFS: Fix a request reference leak in nfs_direct_write_clear_reqs()Trond Myklebust
nfs_direct_write_scan_commit_list() will lock the request and bump the reference count, but we also need to account for the reference that was taken when we initially added the request to the commit list. Fixes: fb5f7f20cdb9 ("NFS: commit errors should be fatal") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-01NFS: Fix use-after-free issues in nfs_pageio_add_request()Trond Myklebust
We need to ensure that we create the mirror requests before calling nfs_pageio_add_request_mirror() on the request we are adding. Otherwise, we can end up with a use-after-free if the call to nfs_pageio_add_request_mirror() triggers I/O. Fixes: c917cfaf9bbe ("NFS: Fix up NFS I/O subrequest creation") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-01NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()Trond Myklebust
When a subrequest is being detached from the subgroup, we want to ensure that it is not holding the group lock, or in the process of waiting for the group lock. Fixes: 5b2b5187fa85 ("NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-01signal: Extend exec_id to 64bitsEric W. Biederman
Replace the 32bit exec_id with a 64bit exec_id to make it impossible to wrap the exec_id counter. With care an attacker can cause exec_id wrap and send arbitrary signals to a newly exec'd parent. This bypasses the signal sending checks if the parent changes their credentials during exec. The severity of this problem can been seen that in my limited testing of a 32bit exec_id it can take as little as 19s to exec 65536 times. Which means that it can take as little as 14 days to wrap a 32bit exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7 days. Even my slower timing is in the uptime of a typical server. Which means self_exec_id is simply a speed bump today, and if exec gets noticably faster self_exec_id won't even be a speed bump. Extending self_exec_id to 64bits introduces a problem on 32bit architectures where reading self_exec_id is no longer atomic and can take two read instructions. Which means that is is possible to hit a window where the read value of exec_id does not match the written value. So with very lucky timing after this change this still remains expoiltable. I have updated the update of exec_id on exec to use WRITE_ONCE and the read of exec_id in do_notify_parent to use READ_ONCE to make it clear that there is no locking between these two locations. Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl Fixes: 2.3.23pre2 Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-01NFS: Fix a page leak in nfs_destroy_unlinked_subrequests()Trond Myklebust
When we detach a subrequest from the list, we must also release the reference it holds to the parent. Fixes: 5b2b5187fa85 ("NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-01io_uring: add missing finish_wait() in io_sq_thread()Hillf Danton
Add it to pair with prepare_to_wait() in an attempt to avoid anything weird in the field. Fixes: b41e98524e42 ("io_uring: add per-task callback handler") Reported-by: syzbot+0c3370f235b74b3cfd97@syzkaller.appspotmail.com Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds
Pull networking updates from David Miller: "Highlights: 1) Fix the iwlwifi regression, from Johannes Berg. 2) Support BSS coloring and 802.11 encapsulation offloading in hardware, from John Crispin. 3) Fix some potential Spectre issues in qtnfmac, from Sergey Matyukevich. 4) Add TTL decrement action to openvswitch, from Matteo Croce. 5) Allow paralleization through flow_action setup by not taking the RTNL mutex, from Vlad Buslov. 6) A lot of zero-length array to flexible-array conversions, from Gustavo A. R. Silva. 7) Align XDP statistics names across several drivers for consistency, from Lorenzo Bianconi. 8) Add various pieces of infrastructure for offloading conntrack, and make use of it in mlx5 driver, from Paul Blakey. 9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki. 10) Lots of parallelization improvements during configuration changes in mlxsw driver, from Ido Schimmel. 11) Add support to devlink for generic packet traps, which report packets dropped during ACL processing. And use them in mlxsw driver. From Jiri Pirko. 12) Support bcmgenet on ACPI, from Jeremy Linton. 13) Make BPF compatible with RT, from Thomas Gleixnet, Alexei Starovoitov, and your's truly. 14) Support XDP meta-data in virtio_net, from Yuya Kusakabe. 15) Fix sysfs permissions when network devices change namespaces, from Christian Brauner. 16) Add a flags element to ethtool_ops so that drivers can more simply indicate which coalescing parameters they actually support, and therefore the generic layer can validate the user's ethtool request. Use this in all drivers, from Jakub Kicinski. 17) Offload FIFO qdisc in mlxsw, from Petr Machata. 18) Support UDP sockets in sockmap, from Lorenz Bauer. 19) Fix stretch ACK bugs in several TCP congestion control modules, from Pengcheng Yang. 20) Support virtual functiosn in octeontx2 driver, from Tomasz Duszynski. 21) Add region operations for devlink and use it in ice driver to dump NVM contents, from Jacob Keller. 22) Add support for hw offload of MACSEC, from Antoine Tenart. 23) Add support for BPF programs that can be attached to LSM hooks, from KP Singh. 24) Support for multiple paths, path managers, and counters in MPTCP. From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti, and others. 25) More progress on adding the netlink interface to ethtool, from Michal Kubecek" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits) net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline cxgb4/chcr: nic-tls stats in ethtool net: dsa: fix oops while probing Marvell DSA switches net/bpfilter: remove superfluous testing message net: macb: Fix handling of fixed-link node net: dsa: ksz: Select KSZ protocol tag netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write net: stmmac: add EHL 2.5Gbps PCI info and PCI ID net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID net: stmmac: create dwmac-intel.c to contain all Intel platform net: dsa: bcm_sf2: Support specifying VLAN tag egress rule net: dsa: bcm_sf2: Add support for matching VLAN TCI net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT net: dsa: bcm_sf2: Disable learning for ASP port net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge net: dsa: b53: Prevent tagged VLAN on port 7 for 7278 net: dsa: b53: Restore VLAN entries upon (re)configuration net: dsa: bcm_sf2: Fix overflow checks hv_netvsc: Remove unnecessary round_up for recv_completion_cnt ...
2020-03-31Merge tag 'selinux-pr-20200330' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull SELinux updates from Paul Moore: "We've got twenty SELinux patches for the v5.7 merge window, the highlights are below: - Deprecate setting /sys/fs/selinux/checkreqprot to 1. This flag was originally created to deal with legacy userspace and the READ_IMPLIES_EXEC personality flag. We changed the default from 1 to 0 back in Linux v4.4 and now we are taking the next step of deprecating it, at some point in the future we will take the final step of rejecting 1. - Allow kernfs symlinks to inherit the SELinux label of the parent directory. In order to preserve backwards compatibility this is protected by the genfs_seclabel_symlinks SELinux policy capability. - Optimize how we store filename transitions in the kernel, resulting in some significant improvements to policy load times. - Do a better job calculating our internal hash table sizes which resulted in additional policy load improvements and likely general SELinux performance improvements as well. - Remove the unused initial SIDs (labels) and improve how we handle initial SIDs. - Enable per-file labeling for the bpf filesystem. - Ensure that we properly label NFS v4.2 filesystems to avoid a temporary unlabeled condition. - Add some missing XFS quota command types to the SELinux quota access controls. - Fix a problem where we were not updating the seq_file position index correctly in selinuxfs. - We consolidate some duplicated code into helper functions. - A number of list to array conversions. - Update Stephen Smalley's email address in MAINTAINERS" * tag 'selinux-pr-20200330' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: selinux: clean up indentation issue with assignment statement NFS: Ensure security label is set for root inode MAINTAINERS: Update my email address selinux: avtab_init() and cond_policydb_init() return void selinux: clean up error path in policydb_init() selinux: remove unused initial SIDs and improve handling selinux: reduce the use of hard-coded hash sizes selinux: Add xfs quota command types selinux: optimize storage of filename transitions selinux: factor out loop body from filename_trans_read() security: selinux: allow per-file labeling for bpffs selinux: generalize evaluate_cond_node() selinux: convert cond_expr to array selinux: convert cond_av_list to array selinux: convert cond_list to array selinux: sel_avc_get_stat_idx should increase position index selinux: allow kernfs symlinks to inherit parent directory context selinux: simplify evaluate_cond_node() Documentation,selinux: deprecate setting checkreqprot to 1 selinux: move status variables out of selinux_ss
2020-03-31Merge tag '5.7-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs updates from Steve French: "First part of cifs/smb3 changes for merge window (others are still being tested). Various RDMA (smbdirect) fixes, addition of SMB3.1.1 POSIX support in readdir, 3 fixes for stable, and a fix for flock. Summary: New feature: - SMB3.1.1 POSIX support in readdir Fixes: - various RDMA (smbdirect) fixes - fix for flock - fallocate fix - some improved mount warnings - two timestamp related fixes - reconnect fix - three fixes for stable" * tag '5.7-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6: (28 commits) cifs: update internal module version number cifs: Allocate encryption header through kmalloc cifs: smbd: Check and extend sender credits in interrupt context cifs: smbd: Calculate the correct maximum packet size for segmented SMBDirect send/receive smb3: use SMB2_SIGNATURE_SIZE define CIFS: Fix bug which the return value by asynchronous read is error CIFS: check new file size when extending file by fallocate SMB3: Minor cleanup of protocol definitions SMB3: Additional compression structures SMB3: Add new compression flags cifs: smb2pdu.h: Replace zero-length array with flexible-array member cifs: clear PF_MEMALLOC before exiting demultiplex thread cifs: cifspdu.h: Replace zero-length array with flexible-array member CIFS: Warn less noisily on default mount fs/cifs: fix gcc warning in sid_to_id cifs: allow unlock flock and OFD lock across fork cifs: do d_move in rename cifs: add SMB2_open() arg to return POSIX data cifs: plumb smb2 POSIX dir enumeration cifs: add smb2 POSIX info level ...
2020-03-31Merge tag 'gfs2-for-5.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Bob Peterson: "We've got a lot of patches (39) for this merge window. Most of these patches are related to corruption that occurs when journals are replayed. For example: 1. A node fails while writing to the file system. 2. Other nodes use the metadata that was once used by the failed node. 3. When the node returns to the cluster, its journal is replayed, but the older metadata blocks overwrite the changes from step 2. Summary: - Fixed the recovery sequence to prevent corruption during journal replay. - Many bug fixes found during recovery testing. - New improved file system withdraw sequence. - Fixed how resource group buffers are managed. - Fixed how metadata revokes are tracked and written. - Improve processing of IO errors hit by daemons like logd and quotad. - Improved error checking in metadata writes. - Fixed how qadata quota data structures are managed" * tag 'gfs2-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (39 commits) gfs2: Fix oversight in gfs2_ail1_flush gfs2: change from write to read lock for sd_log_flush_lock in journal replay gfs2: instrumentation wrt ail1 stuck gfs2: don't lock sd_log_flush_lock in try_rgrp_unlink gfs2: Remove unnecessary gfs2_qa_{get,put} pairs gfs2: Split gfs2_rsqa_delete into gfs2_rs_delete and gfs2_qa_put gfs2: Change inode qa_data to allow multiple users gfs2: eliminate gfs2_rsqa_alloc in favor of gfs2_qa_alloc gfs2: Switch to list_{first,last}_entry gfs2: Clean up inode initialization and teardown gfs2: Additional information when gfs2_ail1_flush withdraws gfs2: leaf_dealloc needs to allocate one more revoke gfs2: allow journal replay to hold sd_log_flush_lock gfs2: don't allow releasepage to free bd still used for revokes gfs2: flesh out delayed withdraw for gfs2_log_flush gfs2: Do proper error checking for go_sync family of glops functions gfs2: Don't demote a glock until its revokes are written gfs2: drain the ail2 list after io errors gfs2: Withdraw in gfs2_ail1_flush if write_cache_pages fails gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty ...
2020-03-31Merge tag 'for-5.7-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "A number of core changes that make things work better in general, code is simpler and cleaner. Core changes: - per-inode file extent tree, for in memory tracking of contiguous extent ranges to make sure i_size adjustments are accurate - tree root structures are protected by reference counts, replacing SRCU that did not cover some cases - leak detector for tree root structures - per-transaction pinned extent tracking - buffer heads are replaced by bios for super block access - speedup of extent back reference resolution, on an example test scenario the runtime of send went down from a hour to minutes - factor out locking scheme used for subvolume writer and NOCOW exclusion, abstracted as DREW lock, double reader-writer exclusion (allow either readers or writers) - cleanup and abstract extent allocation policies, preparation for zoned device support - make reflink/clone_range work on inline extents - add more cancellation point for relocation, improves long response from 'balance cancel' - add page migration callback for data pages - switch to guid for uuids, with additional cleanups of the interface - make ranged full fsyncs more efficient - removal of obsolete ioctl flag BTRFS_SUBVOL_CREATE_ASYNC - remove b-tree readahead from delayed refs paths, avoiding seek and read unnecessary blocks Features: - v2 of ioctl to delete subvolumes, allowing to delete by id and more future extensions Fixes: - fix qgroup rescan worker that could block umount - fix crash during unmount due to race with delayed inode workers - fix dellaloc flushing logic that could create unnecessary chunks under heavy load - fix missing file extent item for hole after ranged fsync - several fixes in relocation error handling Other: - more documentation of relocation, device replace, space reservations - many random cleanups" * tag 'for-5.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (210 commits) btrfs: fix missing semaphore unlock in btrfs_sync_file btrfs: use nofs allocations for running delayed items btrfs: sysfs: Use scnprintf() instead of snprintf() btrfs: do not resolve backrefs for roots that are being deleted btrfs: track reloc roots based on their commit root bytenr btrfs: restart relocate_tree_blocks properly btrfs: reloc: reorder reservation before root selection btrfs: do not readahead in build_backref_tree btrfs: do not use readahead for running delayed refs btrfs: Remove async_transid from btrfs_mksubvol/create_subvol/create_snapshot btrfs: Remove transid argument from btrfs_ioctl_snap_create_transid btrfs: Remove BTRFS_SUBVOL_CREATE_ASYNC support btrfs: kill the subvol_srcu btrfs: make btrfs_cleanup_fs_roots use the radix tree lock btrfs: don't take an extra root ref at allocation time btrfs: hold a ref on the root on the dead roots list btrfs: make inodes hold a ref on their roots btrfs: move the root freeing stuff into btrfs_put_root btrfs: move ino_cache_inode dropping out of btrfs_free_fs_root btrfs: make the extent buffer leak check per fs info ...
2020-03-31Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds
Pull fscrypt updates from Eric Biggers: "Add an ioctl FS_IOC_GET_ENCRYPTION_NONCE which retrieves a file's encryption nonce. This makes it easier to write automated tests which verify that fscrypt is doing the encryption correctly" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: ubifs: wire up FS_IOC_GET_ENCRYPTION_NONCE f2fs: wire up FS_IOC_GET_ENCRYPTION_NONCE ext4: wire up FS_IOC_GET_ENCRYPTION_NONCE fscrypt: add FS_IOC_GET_ENCRYPTION_NONCE ioctl
2020-03-31xfs: remove redundant variable assignment in xfs_symlink()Kaixu Xia
The variables 'udqp' and 'gdqp' have been initialized, so remove redundant variable assignment in xfs_symlink(). Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Dave Chinner <dchinner@redhat.com Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-31xfs: ratelimit inode flush on buffered write ENOSPCDarrick J. Wong
A customer reported rcu stalls and softlockup warnings on a computer with many CPU cores and many many more IO threads trying to write to a filesystem that is totally out of space. Subsequent analysis pointed to the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb, which causes a lot of wb_writeback_work to be queued. The writeback worker spends so much time trying to wake the many many threads waiting for writeback completion that it trips the softlockup detector, and (in this case) the system automatically reboots. In addition, they complain that the lengthy xfs_flush_inodes scan traps all of those threads in uninterruptible sleep, which hampers their ability to kill the program or do anything else to escape the situation. If there's thousands of threads trying to write to files on a full filesystem, each of those threads will start separate copies of the inode flush scan. This is kind of pointless since we only need one scan, so rate limit the inode flush. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-03-31io_uring: refactor file register/unregister/update handlingXiaoguang Wang
While diving into io_uring fileset register/unregister/update codes, we found one bug in the fileset update handling. io_uring fileset update use a percpu_ref variable to check whether we can put the previously registered file, only when the refcnt of the perfcpu_ref variable reaches zero, can we safely put these files. But this doesn't work so well. If applications always issue requests continually, this perfcpu_ref will never have an chance to reach zero, and it'll always be in atomic mode, also will defeat the gains introduced by fileset register/unresiger/update feature, which are used to reduce the atomic operation overhead of fput/fget. To fix this issue, while applications do IORING_REGISTER_FILES or IORING_REGISTER_FILES_UPDATE operations, we allocate a new percpu_ref and kill the old percpu_ref, new requests will use the new percpu_ref. Once all previous old requests complete, old percpu_refs will be dropped and registered files will be put safely. Link: https://lore.kernel.org/io-uring/5a8dac33-4ca2-4847-b091-f7dcd3ad0ff3@linux.alibaba.com/T/#t Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-30f2fs: compress: add .{init,destroy}_decompress_ctx callbackChao Yu
Add below two callback interfaces in struct f2fs_compress_ops: int (*init_decompress_ctx)(struct decompress_io_ctx *dic); void (*destroy_decompress_ctx)(struct decompress_io_ctx *dic); Which will be used by zstd compress algorithm later. In addition, this patch adds callback function pointer check, so that specified algorithm can avoid defining unneeded functions. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: compress: fix to call missing destroy_compress_ctx()Chao Yu
Otherwise, it will cause memory leak. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: change default compression algorithmChao Yu
Use LZ4 as default compression algorithm, as compared to LZO, it shows almost the same compression ratio and much better decompression speed. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: clean up {cic,dic}.ref handlingChao Yu
{cic,dic}.ref should be initialized to number of compressed pages, let's initialize it directly rather than doing w/ f2fs_set_compressed_page(). Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: fix to use f2fs_readpage_limit() in f2fs_read_multi_pages()Chao Yu
Multipage read flow should consider fsverity, so it needs to use f2fs_readpage_limit() instead of i_size_read() to check EOF condition. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: xattr.h: Make stub helpers inlineYueHaibing
Fix gcc warnings: In file included from fs/f2fs/dir.c:15:0: fs/f2fs/xattr.h:157:13: warning: 'f2fs_destroy_xattr_caches' defined but not used [-Wunused-function] static void f2fs_destroy_xattr_caches(struct f2fs_sb_info *sbi) { } ^~~~~~~~~~~~~~~~~~~~~~~~~ fs/f2fs/xattr.h:156:12: warning: 'f2fs_init_xattr_caches' defined but not used [-Wunused-function] static int f2fs_init_xattr_caches(struct f2fs_sb_info *sbi) { return 0; } Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: a999150f4fe3 ("f2fs: use kmem_cache pool during inline xattr lookups") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: fix to avoid double unlockChao Yu
On image that has verity and compression feature, if compressed pages and non-compressed pages are mixed in one bio, we may double unlock non-compressed page in below flow: - f2fs_post_read_work - f2fs_decompress_work - f2fs_decompress_bio - __read_end_io - unlock_page - fsverity_enqueue_verify_work - f2fs_verity_work - f2fs_verify_bio - unlock_page So it should skip handling non-compressed page in f2fs_decompress_work() if verity is on. Besides, add missing dec_page_count() in f2fs_verify_bio(). Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: fix potential .flags overflow on 32bit architectureChao Yu
f2fs_inode_info.flags is unsigned long variable, it has 32 bits in 32bit architecture, since we introduced FI_MMAP_FILE flag when we support data compression, we may access memory cross the border of .flags field, corrupting .i_sem field, result in below deadlock. To fix this issue, let's expand .flags as an array to grab enough space to store new flags. Call Trace: __schedule+0x8d0/0x13fc ? mark_held_locks+0xac/0x100 schedule+0xcc/0x260 rwsem_down_write_slowpath+0x3ab/0x65d down_write+0xc7/0xe0 f2fs_drop_nlink+0x3d/0x600 [f2fs] f2fs_delete_inline_entry+0x300/0x440 [f2fs] f2fs_delete_entry+0x3a1/0x7f0 [f2fs] f2fs_unlink+0x500/0x790 [f2fs] vfs_unlink+0x211/0x490 do_unlinkat+0x483/0x520 sys_unlink+0x4a/0x70 do_fast_syscall_32+0x12b/0x683 entry_SYSENTER_32+0xaa/0x102 Fixes: 4c8ff7095bef ("f2fs: support data compression") Tested-by: Ondrej Jirman <megous@megous.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: fix NULL pointer dereference in f2fs_verity_work()Chao Yu
If both compression and fsverity feature is on, generic/572 will report below NULL pointer dereference bug. BUG: kernel NULL pointer dereference, address: 0000000000000018 RIP: 0010:f2fs_verity_work+0x60/0x90 [f2fs] #PF: supervisor read access in kernel mode Workqueue: fsverity_read_queue f2fs_verity_work [f2fs] RIP: 0010:f2fs_verity_work+0x60/0x90 [f2fs] Call Trace: process_one_work+0x16c/0x3f0 worker_thread+0x4c/0x440 ? rescuer_thread+0x350/0x350 kthread+0xf8/0x130 ? kthread_unpark+0x70/0x70 ret_from_fork+0x35/0x40 There are two issue in f2fs_verity_work(): - it needs to traverse and verify all pages in bio. - if pages in bio belong to non-compressed cluster, accessing decompress IO context stored in page private will cause NULL pointer dereference. Fix them. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: fix to clear PG_error if fsverity failedChao Yu
In f2fs_decompress_end_io(), we should clear PG_error flag before page unlock, otherwise reread will fail due to the flag as described in commit fb7d70db305a ("f2fs: clear PageError on the read path"). Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: don't call fscrypt_get_encryption_info() explicitly in f2fs_tmpfile()Chao Yu
In f2fs_tmpfile(), parent inode's encryption info is only used when inheriting encryption context to its child inode, however, we have already called fscrypt_get_encryption_info() in fscrypt_inherit_context() to get the encryption info, so just removing unneeded one in f2fs_tmpfile(). Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>