path: root/fs
Age Commit message Author
2020-06-01Merge tag 'pstore-v5.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: "Fixes and new features for pstore. This is a pretty big set of changes (relative to past pstore pulls), but it has been in -next for a while. The biggest change here is the ability to support a block device as a pstore backend, which has been desired for a while. A lot of additional fixes and refactorings are also included, mostly in support of the new features. - refactor pstore locking for safer module unloading (Kees Cook) - remove orphaned records from pstorefs when backend unloaded (Kees Cook) - refactor dump_oops parameter into max_reason (Pavel Tatashin) - introduce pstore/zone for common code for contiguous storage (WeiXiong Liao) - introduce pstore/blk for block device backend (WeiXiong Liao) - introduce mtd backend (WeiXiong Liao)" * tag 'pstore-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (35 commits) mtd: Support kmsg dumper based on pstore/blk pstore/blk: Introduce "best_effort" mode pstore/blk: Support non-block storage devices pstore/blk: Provide way to query pstore configuration pstore/zone: Provide way to skip "broken" zone for MTD devices Documentation: Add details for pstore/blk pstore/zone,blk: Add ftrace frontend support pstore/zone,blk: Add console frontend support pstore/zone,blk: Add support for pmsg frontend pstore/blk: Introduce backend for block devices pstore/zone: Introduce common layer to manage storage zones ramoops: Add "max-reason" optional field to ramoops DT node pstore/ram: Introduce max_reason and convert dump_oops pstore/platform: Pass max_reason to kmesg dump printk: Introduce kmsg_dump_reason_str() printk: honor the max_reason field in kmsg_dumper printk: Collapse shutdown types into a single dump reason pstore/ftrace: Provide ftrace log merging routine pstore/ram: Refactor ftrace buffer merging pstore/ram: Refactor DT size parsing ...
2020-06-01Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Introduce crypto_shash_tfm_digest() and use it wherever possible. - Fix use-after-free and race in crypto_spawn_alg. - Add support for parallel and batch requests to crypto_engine. Algorithms: - Update jitter RNG for SP800-90B compliance. - Always use jitter RNG as seed in drbg. Drivers: - Add Arm CryptoCell driver cctrng. - Add support for SEV-ES to the PSP driver in ccp" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (114 commits) crypto: hisilicon - fix driver compatibility issue with different versions of devices crypto: engine - do not requeue in case of fatal error crypto: cavium/nitrox - Fix a typo in a comment crypto: hisilicon/qm - change debugfs file name from qm_regs to regs crypto: hisilicon/qm - add DebugFS for xQC and xQE dump crypto: hisilicon/zip - add debugfs for Hisilicon ZIP crypto: hisilicon/hpre - add debugfs for Hisilicon HPRE crypto: hisilicon/sec2 - add debugfs for Hisilicon SEC crypto: hisilicon/qm - add debugfs to the QM state machine crypto: hisilicon/qm - add debugfs for QM crypto: stm32/crc32 - protect from concurrent accesses crypto: stm32/crc32 - don't sleep in runtime pm crypto: stm32/crc32 - fix multi-instance crypto: stm32/crc32 - fix run-time self test issue. crypto: stm32/crc32 - fix ext4 chksum BUG_ON() crypto: hisilicon/zip - Use temporary sqe when doing work crypto: hisilicon - add device error report through abnormal irq crypto: hisilicon - remove codes of directly report device errors through MSI crypto: hisilicon - QM memory management optimization crypto: hisilicon - unify initial value assignment into QM ...
2020-06-01sh: remove sh5 supportArnd Bergmann
sh5 never became a product and has probably never really worked. Remove it by recursively deleting all associated Kconfig options and all corresponding files. Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Rich Felker <dalias@libc.org>
2020-06-01Merge branches 'pm-core' and 'pm-sleep'Rafael J. Wysocki
* pm-core: PM: runtime: Replace pm_runtime_callbacks_present() PM: runtime: clk: Fix clk_pm_runtime_get() error path PM: runtime: Make clear what we do when conditions are wrong in rpm_suspend() * pm-sleep: PM: hibernate: Restrict writes to the resume device PM: hibernate: Split off snapshot dev option PM: hibernate: Incorporate concurrency handling PM: sleep: Helpful edits for devices.rst documentation Documentation: PM: sleep: Update driver flags documentation PM: sleep: core: Rename DPM_FLAG_LEAVE_SUSPENDED PM: sleep: core: Rename DPM_FLAG_NEVER_SKIP PM: sleep: core: Rename dev_pm_smart_suspend_and_suspended() PM: sleep: core: Rename dev_pm_may_skip_resume() PM: sleep: core: Rework the power.may_skip_resume handling PM: sleep: core: Do not skip callbacks in the resume phase PM: sleep: core: Fold functions into their callers PM: sleep: core: Simplify the SMART_SUSPEND flag handling
2020-06-01ceph: skip checking caps when session reconnecting and releasing reqsXiubo Li
It makes no sense to check the caps when reconnecting to the MDS. The async dirop caps will be put by their _cb() function, so checking them when releasing the requests makes no sense either. URL: https://tracker.ceph.com/issues/45635 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: make sure mdsc->mutex is nested in s->s_mutex to fix dead lockXiubo Li
send_mds_reconnect takes the s_mutex while the mdsc->mutex is already held. That inverts the locking order documented in mds_client.h. Drop the mdsc->mutex, acquire the s_mutex and then reacquire the mdsc->mutex to prevent a deadlock. URL: https://tracker.ceph.com/issues/45609 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
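Sketch (not part of the commit): the drop/acquire/reacquire sequence described above, written with userspace pthread mutexes. The struct and function names are hypothetical stand-ins for mdsc->mutex and s->s_mutex, and return values are ignored for brevity.

```c
#include <pthread.h>

/* Hypothetical stand-ins for the two kernel mutexes. */
struct demo_locks {
    pthread_mutex_t mdsc_mutex; /* inner lock */
    pthread_mutex_t s_mutex;    /* outer lock: must be taken first */
};

/*
 * Enter with mdsc_mutex held. Taking s_mutex on top of it would invert
 * the lock order, so drop the inner lock, take the outer lock, then
 * take the inner lock again. Returns with both held, correctly nested.
 */
static void demo_relock_for_reconnect(struct demo_locks *l)
{
    pthread_mutex_unlock(&l->mdsc_mutex);
    pthread_mutex_lock(&l->s_mutex);
    pthread_mutex_lock(&l->mdsc_mutex);
}
```

Anything read under the inner lock before it was dropped may have changed by the time it is retaken, so a caller of such a helper would normally revalidate that state.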
2020-06-01ceph: don't return -ESTALE if there's still an open fileLuis Henriques
Similarly to commit 03f219041fdb ("ceph: check i_nlink while converting a file handle to dentry"), this fixes another corner case with name_to_handle_at/open_by_handle_at. The issue has been detected by xfstest generic/467, when doing: - name_to_handle_at("/cephfs/myfile") - open("/cephfs/myfile") - unlink("/cephfs/myfile") - sync; sync; - drop caches - open_by_handle_at() The call to open_by_handle_at should not fail because the file hasn't been deleted yet (only unlinked) and we do have a valid handle to it. -ESTALE shall be returned only if i_nlink is 0 *and* i_count is 1. This patch also makes sure we have LINK caps before checking i_nlink. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
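Sketch (not part of the commit): the -ESTALE rule above as a small predicate. `demo_inode` and `demo_handle_status` are hypothetical names rather than the fs/ceph code, and the LINK caps check is left out.

```c
#include <errno.h>

/* Hypothetical, simplified view of the two inode fields the rule uses. */
struct demo_inode {
    unsigned int i_nlink; /* number of hard links */
    int i_count;          /* in-memory reference count */
};

/*
 * Return -ESTALE only when the file is unlinked (i_nlink == 0) *and*
 * nothing but our own reference is left (i_count == 1). An unlinked
 * file that is still open elsewhere stays reachable through its handle.
 */
static int demo_handle_status(const struct demo_inode *inode)
{
    if (inode->i_nlink == 0 && inode->i_count == 1)
        return -ESTALE;
    return 0;
}
```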
2020-06-01ceph: allow rename operation under different quota realmsLuis Henriques
Returning -EXDEV when trying to 'mv' files/directories from different quota realms results in copy+unlink operations instead of the faster CEPH_MDS_OP_RENAME. This will occur even when there aren't any quotas set in the destination directory, or if there's enough space left for the new file(s). This patch adds a new helper function to be called on rename operations which will allow these operations if they can be executed. This patch mimics userland fuse client commit b8954e5734b3 ("client: optimize rename operation under different quota root"). Since ceph_quota_is_same_realm() is now called only from this new helper, make it static. URL: https://tracker.ceph.com/issues/44791 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: normalize 'delta' parameter usage in check_quota_exceededLuis Henriques
Function check_quota_exceeded() uses the delta parameter only for the QUOTA_CHECK_MAX_BYTES_OP operation. Using this parameter for MAX_FILES as well makes the code cleaner and will be required to support cross-quota-tree renames. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: ceph_kick_flushing_caps needs the s_mutexJeff Layton
The mdsc->cap_dirty_lock is not held while walking the list in ceph_kick_flushing_caps, which is not safe. ceph_early_kick_flushing_caps does something similar, but the s_mutex is held while it's called and I think that guards against changes to the list. Ensure we hold the s_mutex when calling ceph_kick_flushing_caps, and add some clarifying comments. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: request expedited service on session's last cap flushJeff Layton
When flushing a lot of caps to the MDS's at once (e.g. for syncfs), we can end up waiting a substantial amount of time for MDS replies, due to the fact that it may delay some of them so that it can batch them up together in a single journal transaction. This can lead to stalls when calling sync or syncfs. What we'd really like to do is request expedited service on the _last_ cap we're flushing back to the server. If the CHECK_CAPS_FLUSH flag is set on the request and the current inode was the last one on the session->s_cap_dirty list, then mark the request with CEPH_CLIENT_CAPS_SYNC. Note that this heuristic is not perfect. New inodes can race onto the list after we've started flushing, but it does seem to fix some common use cases. URL: https://tracker.ceph.com/issues/44744 Reported-by: Jan Fajerski <jfajerski@suse.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
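Sketch (not part of the commit): the "last cap on the list" heuristic reduced to a flag computation. The flag values and the function name are hypothetical; only the condition follows the description above.

```c
#include <stdbool.h>

#define DEMO_CHECK_CAPS_FLUSH 0x1u /* hypothetical value */
#define DEMO_CLIENT_CAPS_SYNC 0x1u /* hypothetical value */

/*
 * Compute the flags for one cap flush message. The sync flag is set
 * only when the caller asked for a flush and this inode is the last
 * entry left on the session's dirty list.
 */
static unsigned int demo_cap_msg_flags(unsigned int check_flags,
                                       bool last_on_dirty_list)
{
    unsigned int msg_flags = 0;

    if ((check_flags & DEMO_CHECK_CAPS_FLUSH) && last_on_dirty_list)
        msg_flags |= DEMO_CLIENT_CAPS_SYNC;
    return msg_flags;
}
```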
2020-06-01ceph: convert mdsc->cap_dirty to a per-session listJeff Layton
This is a per-sb list now, but that makes it difficult to tell when the cap is the last dirty one associated with the session. Switch this to be a per-session list, but continue using the mdsc->cap_dirty_lock to protect the lists. This list is only ever walked in ceph_flush_dirty_caps, so change that to walk the sessions array and then flush the caps for inodes on each session's list. If the auth cap ever changes while the inode has dirty caps, then move the inode to the appropriate session for the new auth_cap. Also, ensure that we never remove an auth cap while the inode is still on the s_cap_dirty list. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: reset i_requested_max_size if file write is not wantedYan, Zheng
A write can get stuck waiting for a larger max_size in the following sequence of events: - client opens a file and writes to position 'A' (larger than the unit of max size increment) - client closes the file handle and updates wanted caps (not wanting file write caps) - client opens and truncates the file, then writes to position 'A' again. At the 1st event, the client sets the inode's requested_max_size to 'A'. At the 2nd event, the MDS removes the client's writable range, but the client does not reset requested_max_size. At the 3rd event, the client does not request max size because requested_max_size is already larger than 'A'. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
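Sketch (not part of the commit): a simplified model of the three events. The struct, both function names and the exact comparison are assumptions made for illustration; the point is only that the stale request must be forgotten when write caps are no longer wanted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced per-inode state. */
struct demo_inode_info {
    uint64_t max_size;           /* writable range granted by the MDS */
    uint64_t requested_max_size; /* largest size already asked for */
};

/* Ask for a larger max_size only if it was not already requested. */
static bool demo_should_request_max_size(struct demo_inode_info *ci,
                                         uint64_t endoff)
{
    if (endoff <= ci->max_size || endoff <= ci->requested_max_size)
        return false;
    ci->requested_max_size = endoff;
    return true;
}

/* Write caps are no longer wanted: the writable range goes away. */
static void demo_drop_write_wanted(struct demo_inode_info *ci)
{
    ci->max_size = 0;
    ci->requested_max_size = 0; /* without this reset, event 3 never asks */
}
```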
2020-06-01ceph: throw a warning if we destroy session with mutex still lockedJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: fix potential race in ceph_check_capsJeff Layton
Nothing ensures that session will still be valid by the time we dereference the pointer. Take and put a reference. In principle, we should always be able to get a reference here, but throw a warning if that's ever not the case. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
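Sketch (not part of the commit): the take-and-put pattern in portable C11 atomics. The names are hypothetical and this is not the kernel's refcount API; "get" refuses to revive an object whose count has already reached zero, which is the case the commit wants a warning for.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_session {
    atomic_int refcount; /* 0 means the session is being torn down */
};

/* Take a reference unless the count has already dropped to zero. */
static bool demo_session_get(struct demo_session *s)
{
    int old = atomic_load(&s->refcount);

    while (old > 0) {
        /* On failure 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool demo_session_put(struct demo_session *s)
{
    return atomic_fetch_sub(&s->refcount, 1) == 1;
}
```

A caller would dereference the session only between a successful get and the matching put.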
2020-06-01ceph: document what protects i_dirty_item and i_flushing_itemJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: don't take i_ceph_lock in handle_cap_importJeff Layton
Just take it before calling it. This means we have to do a couple of minor in-memory operations under the spinlock now, but those shouldn't be an issue. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: don't release i_ceph_lock in handle_cap_truncJeff Layton
There's no reason to do this here. Just have the caller handle it. Also, add a lockdep assertion. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: add comments for handle_cap_flush_ack logicJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: split up __finish_cap_flushJeff Layton
This function takes a mdsc argument or ci argument, but if both are passed in, it ignores the ci arg. Fortunately, nothing does that, but there's no good reason to have the same function handle both cases. Also, get rid of some branches and just use |= to set the wake_* vals. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: reorganize __send_cap for less spinlock abuseJeff Layton
Get rid of the __releases annotation by breaking it up into two functions: __prep_cap which is done under the spinlock and __send_cap that is done outside it. Add new fields to cap_msg_args for the wake boolean and old_xattr_buf pointer. Nothing checks the return value from __send_cap, so make it void return. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: add metadata perf metric supportXiubo Li
Add a new "r_ended" field to struct ceph_mds_request and use that to maintain the average latency of MDS requests. URL: https://tracker.ceph.com/issues/43215 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01ceph: add read/write latency metric supportXiubo Li
Calculate the latency for OSD read requests. Add a new r_end_stamp field to struct ceph_osd_request that will hold the time at which the reply was received. Use that to calculate the RTT for each call, and divide the sum of those by the number of calls to get the average RTT. Keep a tally of RTT for OSD writes and number of calls to track average latency of OSD writes. URL: https://tracker.ceph.com/issues/43215 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
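Sketch (not part of the commit): the running tally and the sum-over-count average described above. The struct layout, the nanosecond unit and the function names are assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical running tally for one request type (reads or writes). */
struct demo_latency {
    uint64_t total_calls; /* number of completed requests */
    uint64_t rtt_sum_ns;  /* sum of per-request round-trip times */
};

/* Record one request from its start and reply timestamps (in ns). */
static void demo_latency_add(struct demo_latency *m,
                             uint64_t start_ns, uint64_t end_ns)
{
    m->total_calls++;
    m->rtt_sum_ns += end_ns - start_ns;
}

/* Average RTT in ns; 0 when nothing has been recorded yet. */
static uint64_t demo_latency_avg(const struct demo_latency *m)
{
    return m->total_calls ? m->rtt_sum_ns / m->total_calls : 0;
}
```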
2020-06-01ceph: add caps perf metric for each superblockXiubo Li
Count hits and misses in the caps cache. If the client has all of the necessary caps when a task needs references, then it's counted as a hit. Any other situation is a miss. URL: https://tracker.ceph.com/issues/43215 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
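Sketch (not part of the commit): the hit/miss rule as a bitmask test. The counter struct and function name are hypothetical; caps are modelled as plain bits.

```c
#include <stdint.h>

/* Hypothetical per-superblock counters. */
struct demo_caps_metric {
    uint64_t hit;
    uint64_t miss;
};

/* A hit requires every needed cap bit to be held; anything else is a miss. */
static void demo_caps_account(struct demo_caps_metric *m,
                              unsigned int held, unsigned int needed)
{
    if ((held & needed) == needed)
        m->hit++;
    else
        m->miss++;
}
```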
2020-06-01ceph: add dentry lease metric supportXiubo Li
For dentry leases, only count the hit/miss info triggered from the vfs calls. For the cases like request reply handling and ceph_trim_dentries, ignore them. For now, these are only viewable using debugfs. Future patches will allow the client to send the stats to the MDS. The output looks like: item total miss hit ------------------------------------------------- d_lease 11 7 141 URL: https://tracker.ceph.com/issues/43215 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01cifs: minor fix to two debug messagesSteve French
Joe Perches pointed out that we were missing a newline at the end of two debug messages Reported-by: Joe Perches <joe@perches.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01cifs: Standardize logging outputJoe Perches
Use pr_fmt to standardize all logging for fs/cifs. Some logging output had no CIFS: specific prefix. Now all output has one of three prefixes: o CIFS: o CIFS: VFS: o Root-CIFS: Miscellanea: o Convert printks to pr_<level> o Neaten macro definitions o Remove embedded CIFS: prefixes from formats o Convert "illegal" to "invalid" o Coalesce formats o Add missing '\n' format terminations o Consolidate multiple cifs_dbg continuations into single calls o More consistent use of upper case first word output logging o Multiline statement argument alignment and wrapping Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01smb3: Add new parm "nodelete"Steve French
In order to handle workloads where it is important to make sure that a buggy app did not delete content on the drive, the new mount option "nodelete" allows standard permission checks on the server to work, but prevents on the client any attempts to unlink a file or delete a directory on that mount point. This can be helpful when running a little understood app on a network mount that contains important content that should not be deleted. Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-06-01cifs: move some variables off the stack in smb2_ioctl_query_infoRonnie Sahlberg
Move some large data structures off the stack and into dynamically allocated memory in the function smb2_ioctl_query_info Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01cifs: reduce stack use in smb2_compound_opRonnie Sahlberg
Move a lot of structures and arrays off the stack and into a dynamically allocated structure instead. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01cifs: get rid of unused parameter in reconn_setup_dfs_targets()Paulo Alcantara
The target iterator parameter "it" is not used in reconn_setup_dfs_targets(), so just remove it. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01cifs: handle hostnames that resolve to same ip in failoverPaulo Alcantara
In order to support reconnect to hostnames that resolve to same ip address, besides relying on the currently set hostname to match DFS targets, attempt to resolve the targets and then match their addresses with the reconnected server ip address. For instance, if we have two hostnames "FOO" and "BAR", and both resolve to the same ip address, we would be able to handle failover in DFS paths like \\FOO\dfs\link1 -> [ \BAZ\share2 (*), \BAR\share1 ] \\FOO\dfs\link2 -> [ \BAZ\share2 (*), \FOO\share1 ] so when "BAZ" is no longer accessible, link1 and link2 would get reconnected despite having different target hostnames. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
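Sketch (not part of the commit): matching a DFS target first by name and then by resolved address. The lookup table stands in for a real DNS query, and the addresses, names and exact-case comparison are all assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct demo_dns_entry {
    const char *host;
    const char *ip;
};

/* Hypothetical static "DNS": FOO and BAR share one address. */
static const struct demo_dns_entry demo_dns[] = {
    { "FOO", "192.0.2.10" },
    { "BAR", "192.0.2.10" },
    { "BAZ", "192.0.2.20" },
};

static const char *demo_resolve(const char *host)
{
    for (size_t i = 0; i < sizeof(demo_dns) / sizeof(demo_dns[0]); i++)
        if (strcmp(demo_dns[i].host, host) == 0)
            return demo_dns[i].ip;
    return NULL;
}

/* Match on the hostname first, then fall back to the resolved address. */
static bool demo_target_matches(const char *target_host,
                                const char *server_host,
                                const char *server_ip)
{
    const char *ip;

    if (strcmp(target_host, server_host) == 0)
        return true;
    ip = demo_resolve(target_host);
    return ip != NULL && strcmp(ip, server_ip) == 0;
}
```

With this table, a server known as "FOO" at 192.0.2.10 matches a target named "BAR" but not one named "BAZ".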
2020-06-01cifs: set up next DFS target before generic_ip_connect()Paulo Alcantara
If we mount a very specific DFS link \\FS0.FOO.COM\dfs\link -> \FS0\share1, \FS1\share2 where its target list contains NB names ("FS0" & "FS1") rather than FQDN ones ("FS0.FOO.COM" & "FS1.FOO.COM"), we end up connecting to \FOO\share1 but server->hostname will have "FOO.COM". The reason is that both "FS0" and "FS0.FOO.COM" resolve to the same IP address and they share the same TCP server connection, but "FS0.FOO.COM" was the first hostname set -- which is OK. However, if the echo thread times out and we still have a good connection to "FS0", in cifs_reconnect() rc = generic_ip_connect(server) -> success if (rc) { ... reconn_inval_dfs_target(server, cifs_sb, &tgt_list, &tgt_it); ... } ... it successfully reconnects to "FS0" server but does not set up next DFS target - which should be the same target server "\FS0\share1" - and server->hostname remains set to "FS0.FOO.COM" rather than "FS0", as reconn_inval_dfs_target() would have it set to "FS0" if called earlier. Finally, in __smb2_reconnect(), the reconnect of tcons would fail because tcon->ses->server->hostname (FS0.FOO.COM) does not match DFS target's hostname (FS0). Fix that by calling reconn_inval_dfs_target() before generic_ip_connect() so server->hostname will get updated correctly prior to reconnecting its tcons in __smb2_reconnect(). With "cifs: handle hostnames that resolve to same ip in failover" patch - The above problem would not occur. - We could save a DNS query to find out that they both resolve to the same ip address. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01cifs: remove redundant initialization of variable rcColin Ian King
The variable rc is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01cifs: handle "nolease" option for vers=1.0Kenneth D'souza
The "nolease" mount option is only supported for SMB2+ mounts. Fail with appropriate error message if vers=1.0 option is passed. Signed-off-by: Kenneth D'souza <kdsouza@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-05-31pstore/blk: Introduce "best_effort" modeKees Cook
In order to use arbitrary block devices as a pstore backend, provide a new module param named "best_effort", which will allow using any block device, even if it has not provided a panic_write callback. Link: https://lore.kernel.org/lkml/20200511233229.27745-12-keescook@chromium.org/ Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-31pstore/blk: Support non-block storage devicesWeiXiong Liao
Add support for non-block devices (e.g. MTD). A non-block driver calls pstore_blk_register_device() to register itself. In addition, pstore/zone is updated to handle non-block devices, where an erase must be done before a write. Without this, there is no way to remove records stored to an MTD. Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com> Link: https://lore.kernel.org/lkml/20200511233229.27745-10-keescook@chromium.org/ Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-31pstore/blk: Provide way to query pstore configurationWeiXiong Liao
In order to configure itself, the MTD backend needs to be able to query the current pstore configuration. Introduce pstore_blk_get_config() for this purpose. Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com> Link: https://lore.kernel.org/lkml/20200511233229.27745-9-keescook@chromium.org/ Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-31pstore/zone: Provide way to skip "broken" zone for MTD devicesWeiXiong Liao
One requirement to support MTD devices in pstore/zone is having a way to declare certain regions as broken. Add this support to pstore/zone. The MTD driver should return -ENOMSG when encountering a bad region, which tells pstore/zone to skip and try the next one. Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com> Link: https://lore.kernel.org/lkml/20200511233229.27745-8-keescook@chromium.org/ Co-developed-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: //lore.kernel.org/lkml/20200512173801.222666-1-colin.king@canonical.com Signed-off-by: Kees Cook <keescook@chromium.org>
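Sketch (not part of the commit): skipping zones whose write reports -ENOMSG. The callback type, the example backend and the -ENOSPC result when every zone is bad are assumptions for illustration, not the pstore/zone interface.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-zone write: 0 on success, -ENOMSG for a bad region. */
typedef int (*demo_zone_write_fn)(size_t zone, const char *buf, size_t len);

/* Example backend for illustration: zone 1 is bad, the rest work. */
static int demo_backend_write(size_t zone, const char *buf, size_t len)
{
    (void)buf;
    (void)len;
    return zone == 1 ? -ENOMSG : 0;
}

/*
 * Try each zone once, starting at 'start' and wrapping around. A zone
 * that returns -ENOMSG is skipped; any other error is passed up.
 * Returns the index of the zone written, or a negative errno.
 */
static int demo_write_skip_broken(demo_zone_write_fn write_zone,
                                  size_t nzones, size_t start,
                                  const char *buf, size_t len)
{
    for (size_t i = 0; i < nzones; i++) {
        size_t zone = (start + i) % nzones;
        int ret = write_zone(zone, buf, len);

        if (ret == -ENOMSG)
            continue; /* broken zone: try the next one */
        if (ret < 0)
            return ret;
        return (int)zone;
    }
    return -ENOSPC; /* assumption: no usable zone left */
}
```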
2020-05-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
xdp_umem.c had overlapping changes between the 64-bit math fix for the calculation of npgs and the removal of the zerocopy memory type which got rid of the chunk_size_nohdr member. The mlx5 Kconfig conflict is a case where we just take the net-next copy of the Kconfig entry dependency as it takes on the ESWITCH dependency by one level of indirection which is what the 'net' conflicting change is trying to ensure. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-31afs: Rename struct afs_fs_cursor to afs_operationDavid Howells
As a prelude to implementing asynchronous fileserver operations in the afs filesystem, rename struct afs_fs_cursor to afs_operation. This struct is going to form the core of the operation management and is going to acquire more members later. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31afs: Remove the error argument from afs_protocol_error()David Howells
Remove the error argument from afs_protocol_error() as it's always -EBADMSG. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31afs: Set error flag rather than return error from file status decodeDavid Howells
Set a flag in the call struct to indicate an unmarshalling error rather than return and handle an error from the decoding of file statuses. This flag is checked on a successful return from the delivery function. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31afs: Make callback processing more efficient.David Howells
afs_vol_interest objects represent the volume IDs currently being accessed from a fileserver. These hold lists of afs_cb_interest objects that represent the superblocks using that volume ID on that server. When a callback notification from the server telling of a modification by another client arrives, the volume ID specified in the notification is looked up in the server's afs_vol_interest list. Through the afs_cb_interest list, the relevant superblocks can be iterated over and the specific inode looked up and marked in each one. Make the following efficiency improvements: (1) Hold rcu_read_lock() over the entire processing rather than locking it each time. (2) Do all the callbacks for each vid together rather than individually. Each volume then only needs to be looked up once. (3) afs_vol_interest objects are now stored in an rb_tree rather than a flat list to reduce the lookup step count. (4) afs_vol_interest lookup is now done with RCU, but because it's in an rb_tree which may rotate under us, a seqlock is used so that if it changes during the walk, we repeat the walk with a lock held. With this and the preceding patch which adds RCU-based lookups in the inode cache, target volumes/vnodes can be taken without the need to take any locks, except on the target itself. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31afs: Show more information in /proc/net/afs/serversDavid Howells
Show more information in /proc/net/afs/servers to make it easier to see what's going on with the server probing. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31afs: Actively poll fileservers to maintain NAT or firewall openingsDavid Howells
When an AFS client accesses a file, it receives a limited-duration callback promise that the server will notify it if another client changes a file. This callback duration can be a few hours in length. If a client mounts a volume and then an application prevents it from being unmounted, say by chdir'ing into it, but then does nothing for some time, the rxrpc_peer record will expire and rxrpc-level keepalive will cease. If there is NAT or a firewall between the client and the server, the route back for the server may close after a comparatively short duration, meaning that attempts by the server to notify the client may then bounce. The client, however, may (so far as it knows) still have a valid unexpired promise and will then rely on its cached data and will not see changes made on the server by a third party until it incidentally rechecks the status or the promise needs renewal. To deal with this, the client needs to regularly probe the server. This has two effects: firstly, it keeps a route open back for the server, and secondly, it causes the server to disgorge any notifications that got queued up because they couldn't be sent. Fix this by adding a mechanism to emit regular probes. Two levels of probing are made available: Under normal circumstances the 'slow' queue will be used for a fileserver - this just probes the preferred address once every 5 mins or so; however, if the server fails to respond to any probes, the server will shift to the 'fast' queue from which all its interfaces will be probed every 30s. When it finally responds, the record will switch back to the slow queue. Further notes: (1) Probing is now no longer driven from the fileserver rotation algorithm. (2) Probes are dispatched to all interfaces on a fileserver when an afs_server object is set up to record it. (3) The afs_server object is removed from the probe queues when we start to probe it. afs_is_probing_server() returns true if it's not listed - ie. it's undergoing probing. 
(4) The afs_server object is added back on to the probe queue when the final outstanding probe completes, but the probed_at time is set when we're about to launch a probe so that it's not dependent on the probe duration. (5) The timer and the work item added for this must be handed a count on net->servers_outstanding, which they hand on or release. This makes sure that network namespace cleanup waits for them. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Reported-by: Dave Botsch <botsch@cnf.cornell.edu> Signed-off-by: David Howells <dhowells@redhat.com>
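Sketch (not part of the commit): the slow/fast queue choice as a tiny state function. The enum, constants and function names are hypothetical; the intervals are the "5 mins or so" and "30s" figures quoted above.

```c
#include <stdbool.h>

enum demo_probe_queue { DEMO_PROBE_SLOW, DEMO_PROBE_FAST };

#define DEMO_SLOW_INTERVAL_S (5 * 60) /* preferred address only */
#define DEMO_FAST_INTERVAL_S 30       /* all interfaces */

/* A server that answered goes to the slow queue, otherwise to the fast one. */
static enum demo_probe_queue demo_next_queue(bool responded)
{
    return responded ? DEMO_PROBE_SLOW : DEMO_PROBE_FAST;
}

/* Seconds to wait before the next probe round for a given queue. */
static int demo_probe_interval(enum demo_probe_queue q)
{
    return q == DEMO_PROBE_FAST ? DEMO_FAST_INTERVAL_S : DEMO_SLOW_INTERVAL_S;
}
```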
2020-05-31afs: Split the usage count on struct afs_serverDavid Howells
Split the usage count on the afs_server struct to have an active count that registers who's actually using it separately from the reference count on the object. This allows a future patch to dispatch polling probes without advancing the "unuse" time into the future each time we emit a probe, which would otherwise prevent unused server records from expiring. Included in this: (1) The latter part of afs_destroy_server() in which the RCU destruction of afs_server objects is invoked and the outstanding server count is decremented is split out into __afs_put_server(). (2) afs_put_server() now calls __afs_put_server() rather than setting the management timer. (3) The calls begun by afs_fs_give_up_all_callbacks() and afs_fs_get_capabilities() can now take a ref on the server record, so afs_destroy_server() can just drop its ref and needn't wait for the completion of these calls. They'll put the ref when they're done. (4) Because of (3), afs_fs_probe_done() no longer needs to wake up afs_destroy_server() with server->probe_outstanding. (5) afs_gc_servers can be simplified. It only needs to check if server->active is 0 rather than playing games with the refcount. (6) afs_manage_servers() can propose a server for gc if usage == 0 rather than if ref == 1. The gc is effected by (5). Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31afs: Use the serverUnique field in the UVLDB record to reduce rpc opsDavid Howells
The U-version VLDB volume record retrieved by the VL.GetEntryByNameU rpc op carries a change counter (the serverUnique field) for each fileserver listed in the record as backing that volume. This is incremented whenever the registration details for a fileserver change (such as its address list). Note that the same value will be seen in all UVLDB records that refer to that fileserver. This should be checked before calling the VL server to re-query the address list for a fileserver. If it's the same, there's no point doing the query. Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
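Sketch (not part of the commit): comparing the change counter before re-querying the address list. The struct, field and function names are hypothetical; only the comparison reflects the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached state for one fileserver. */
struct demo_server {
    uint32_t addr_version; /* serverUnique seen at the last address query */
};

/*
 * Return true if the address list should be fetched again, i.e. the
 * counter in the fresh volume record differs from the cached one.
 */
static bool demo_need_addr_requery(const struct demo_server *srv,
                                   uint32_t server_unique)
{
    return srv->addr_version != server_unique;
}
```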
2020-05-31afs: Always include dir in bulk status fetch from afs_do_lookup()David Howells
When a lookup is done in an AFS directory, the filesystem will speculatively fetch up to 49 other statuses for files in the same directory, turning them into inodes or updating inodes that already exist. However, occasionally, a callback break might go missing due to NAT timing out, but the afs filesystem doesn't then realise that the directory is not up to date. Alleviate this by using one of the status slots to check the directory in which the lookup is being done. Reported-by: Dave Botsch <botsch@cnf.cornell.edu> Suggested-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31vfs, afs, ext4: Make the inode hash table RCU searchableDavid Howells
Make the inode hash table RCU searchable so that searches that want to access or modify an inode without taking a ref on that inode can do so without taking the inode hash table lock. The main thing this requires is some RCU annotation on the list manipulation operations. Inodes are already freed by RCU in most cases. Users of this interface must take care as the inode may be still under construction or may be being torn down around them. There are at least three instances where this can be of use: (1) Testing whether the inode number iunique() is going to return is currently unique (the iunique_lock is still held). (2) Ext4 date stamp updating. (3) AFS callback breaking. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> cc: linux-ext4@vger.kernel.org cc: linux-afs@lists.infradead.org