summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2019-04-25NFS: Add a mount option "softerr" to allow clients to see ETIMEDOUT errorsTrond Myklebust
Add a mount option that exposes the ETIMEDOUT errors that occur during soft timeouts to the application. This allows aware applications to distinguish between server disk IO errors and client timeout errors. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25NFS: Consider ETIMEDOUT to be a fatal errorTrond Myklebust
When we introduce the 'softerr' mount option, we will see the RPC layer returning ETIMEDOUT errors if the server is unresponsive. We want to consider those errors to be fatal on par with the EIO errors that are returned by ordinary 'soft' timeouts.. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25SUNRPC: Add function rpc_sleep_on_timeout()Trond Myklebust
Clean up the RPC task sleep interfaces by replacing the task->tk_timeout 'hidden parameter' to rpc_sleep_on() with a new function that takes an absolute timeout. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25SUNRPC: Remove unused argument 'action' from rpc_sleep_on_priority()Trond Myklebust
None of the callers set the 'action' argument, so let's just remove it. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25SUNRPC: Fix up task signallingTrond Myklebust
The RPC_TASK_KILLED flag should really not be set from another context because it can clobber data in the struct task when task->tk_flags is changed non-atomically. Let's therefore swap out RPC_TASK_KILLED with an atomic flag, and add a function to set that flag and safely wake up the task. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25ext4: Support case-insensitive file name lookupsGabriel Krisman Bertazi
This patch implements the actual support for case-insensitive file name lookups in ext4, based on the feature bit and the encoding stored in the superblock. A filesystem that has the casefold feature set is able to configure directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups to succeed in that directory in a case-insensitive fashion, i.e: match a directory entry even if the name used by userspace is not a byte per byte match with the disk name, but is an equivalent case-insensitive version of the Unicode string. This operation is called a case-insensitive file name lookup. The feature is configured as an inode attribute applied to directories and inherited by its children. This attribute can only be enabled on empty directories for filesystems that support the encoding feature, thus preventing collision of file names that only differ by case. * dcache handling: For a +F directory, Ext4 only stores the first equivalent name dentry used in the dcache. This is done to prevent unintentional duplication of dentries in the dcache, while also allowing the VFS code to quickly find the right entry in the cache despite which equivalent string was used in a previous lookup, without having to resort to ->lookup(). d_hash() of casefolded directories is implemented as the hash of the casefolded string, such that we always have a well-known bucket for all the equivalencies of the same string. d_compare() uses the utf8_strncasecmp() infrastructure, which handles the comparison of equivalent, same case, names as well. For now, negative lookups are not inserted in the dcache, since they would need to be invalidated anyway, because we can't trust missing file dentries. This is bad for performance but requires some leveraging of the vfs layer to fix. We can live without that for now, and so does everyone else. * on-disk data: Despite using a specific version of the name as the internal representation within the dcache, the name stored and fetched from the disk is a byte-per-byte match with what the user requested, making this implementation 'name-preserving'. i.e. no actual information is lost when writing to storage. DX is supported by modifying the hashes used in +F directories to make them case/encoding-aware. The new disk hashes are calculated as the hash of the full casefolded string, instead of the string directly. This allows us to efficiently search for file names in the htree without requiring the user to provide an exact name. * Dealing with invalid sequences: By default, when a invalid UTF-8 sequence is identified, ext4 will treat it as an opaque byte sequence, ignoring the encoding and reverting to the old behavior for that unique file. This means that case-insensitive file name lookup will not work only for that file. An optional bit can be set in the superblock telling the filesystem code and userspace tools to enforce the encoding. When that optional bit is set, any attempt to create a file name using an invalid UTF-8 sequence will fail and return an error to userspace. * Normalization algorithm: The UTF-8 algorithms used to compare strings in ext4 is implemented lives in fs/unicode, and is based on a previous version developed by SGI. It implements the Canonical decomposition (NFD) algorithm described by the Unicode specification 12.1, or higher, combined with the elimination of ignorable code points (NFDi) and full case-folding (CF) as documented in fs/unicode/utf8_norm.c. NFD seems to be the best normalization method for EXT4 because: - It has a lower cost than NFC/NFKC (which requires decomposing to NFD as an intermediary step) - It doesn't eliminate important semantic meaning like compatibility decompositions. Although: - This implementation is not completely linguistic accurate, because different languages have conflicting rules, which would require the specialization of the filesystem to a given locale, which brings all sorts of problems for removable media and for users who use more than one language. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25ext4: include charset encoding information in the superblockGabriel Krisman Bertazi
Support for encoding is considered an incompatible feature, since it has potential to create collisions of file names in existing filesystems. If the feature flag is not enabled, the entire filesystem will operate on opaque byte sequences, respecting the original behavior. The s_encoding field stores a magic number indicating the encoding format and version used globally by file and directory names in the filesystem. The s_encoding_flags defines policies for using the charset encoding, like how to handle invalid sequences. The magic number is mapped to the exact charset table, but the mapping is specific to ext4. Since we don't have any commitment to support old encodings, the only encoding I am supporting right now is utf8-12.1.0. The current implementation prevents the user from enabling encoding and per-directory encryption on the same filesystem at the same time. The incompatibility between these features lies in how we do efficient directory searches when we cannot be sure the encryption of the user provided fname will match the actual hash stored in the disk without decrypting every directory entry, because of normalization cases. My quickest solution is to simply block the concurrent use of these features for now, and enable it later, once we have a better solution. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25unicode: update unicode database unicode version 12.1.0Gabriel Krisman Bertazi
Regenerate utf8data.h based on the latest UCD files and run tests against the latest version. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25unicode: introduce test module for normalized utf8 implementationGabriel Krisman Bertazi
This implements a in-kernel sanity test module for the utf8 normalization core. At probe time, it will run basic sequences through the utf8n core, to identify problems will equivalent sequences and normalization/casefold code. This is supposed to be useful for regression testing when adding support for a new version of utf8 to linux. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25unicode: implement higher level API for string handlingGabriel Krisman Bertazi
This patch integrates the utf8n patches with some higher level API to perform UTF-8 string comparison, normalization and casefolding operations. Implemented is a variation of NFD, and casefold is performed by doing full casefold on top of NFD. These algorithms are based on the core implemented by Olaf Weber from SGI. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25unicode: reduce the size of utf8data[]Olaf Weber
Remove the Hangul decompositions from the utf8data trie, and do algorithmic decomposition to calculate them on the fly. To store the decomposition the caller of utf8lookup()/utf8nlookup() must provide a 12-byte buffer, which is used to synthesize a leaf with the decomposition. This significantly reduces the size of the utf8data[] array. Changes made by Gabriel: Rebase to mainline Fix checkpatch errors Extract robustness fixes and merge back to original mkutf8data.c patch Regenerate utf8data.h Signed-off-by: Olaf Weber <olaf@sgi.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25Merge tag 'ceph-for-5.1-rc7' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "dentry name handling fixes from Jeff and a memory leak fix from Zheng. Both are old issues, marked for stable" * tag 'ceph-for-5.1-rc7' of git://github.com/ceph/ceph-client: ceph: fix ci->i_head_snapc leak ceph: handle the case where a dentry has been renamed on outstanding req ceph: ensure d_name stability in ceph_dentry_hash() ceph: only use d_name directly when parent is locked
2019-04-25unicode: introduce code for UTF-8 normalizationOlaf Weber
Supporting functions for UTF-8 normalization are in utf8norm.c with the header utf8norm.h. Two normalization forms are supported: nfdi and nfdicf. nfdi: - Apply unicode normalization form NFD. - Remove any Default_Ignorable_Code_Point. nfdicf: - Apply unicode normalization form NFD. - Remove any Default_Ignorable_Code_Point. - Apply a full casefold (C + F). For the purposes of the code, a string is valid UTF-8 if: - The values encoded are 0x1..0x10FFFF. - The surrogate codepoints 0xD800..0xDFFFF are not encoded. - The shortest possible encoding is used for all values. The supporting functions work on null-terminated strings (utf8 prefix) and on length-limited strings (utf8n prefix). From the original SGI patch and for conformity with coding standards, the utf8data_t typedef was dropped, since it was just masking the struct keyword. On other occasions, namely utf8leaf_t and utf8trie_t, I decided to keep it, since they are simple pointers to memory buffers, and using uchars here wouldn't provide any more meaningful information. From the original submission, we also converted from the compatibility form to canonical. Changes made by Gabriel: Rebase to Mainline Fix up checkpatch.pl warnings Drop typedefs move out of libxfs Convert from NFKD to NFD Signed-off-by: Olaf Weber <olaf@sgi.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25unicode: introduce UTF-8 character databaseGabriel Krisman Bertazi
The decomposition and casefolding of UTF-8 characters are described in a prefix tree in utf8data.h, which is a generate from the Unicode Character Database (UCD), published by the Unicode Consortium, and should not be edited by hand. The structures in utf8data.h are meant to be used for lookup operations by the unicode subsystem, when decoding a utf-8 string. mkutf8data.c is the source for a program that generates utf8data.h. It was written by Olaf Weber from SGI and originally proposed to be merged into Linux in 2014. The original proposal performed the compatibility decomposition, NFKD, but the current version was modified by me to do canonical decomposition, NFD, as suggested by the community. The changes from the original submission are: * Rebase to mainline. * Fix out-of-tree-build. * Update makefile to build 11.0.0 ucd files. * drop references to xfs. * Convert NFKD to NFD. * Merge back robustness fixes from original patch. Requested by Dave Chinner. The original submission is archived at: <https://linux-xfs.oss.sgi.narkive.com/Xx10wjVY/rfc-unicode-utf-8-support-for-xfs> The utf8data.h file can be regenerated using the instructions in fs/unicode/README.utf8data. - Notes on the update from 8.0.0 to 11.0: The structure of the ucd files and special cases have not experienced any changes between versions 8.0.0 and 11.0.0. 8.0.0 saw the addition of Cherokee LC characters, which is an interesting case for case-folding. The update is accompanied by new tests on the test_ucd module to catch specific cases. No changes to mkutf8data script were required for the updates. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25ext4: actually request zeroing of inode table after growKirill Tkhai
It is never possible, that number of block groups decreases, since only online grow is supported. But after a growing occured, we have to zero inode tables for just created new block groups. Fixes: 19c5246d2516 ("ext4: add new online resize interface") Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
2019-04-25ext4: cond_resched in work-heavy group loopsKhazhismel Kumykov
Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2019-04-25ext4: fix use-after-free race with debug_want_extra_isizeBarret Rhoden
When remounting with debug_want_extra_isize, we were not performing the same checks that we do during a normal mount. That allowed us to set a value for s_want_extra_isize that reached outside the s_inode_size. Fixes: e2b911c53584 ("ext4: clean up feature test macros with predicate functions") Reported-by: syzbot+f584efa0ac7213c226b7@syzkaller.appspotmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2019-04-25ext4: avoid drop reference to iloc.bh twicePan Bian
The reference to iloc.bh has been dropped in ext4_mark_iloc_dirty. However, the reference is dropped again if error occurs during ext4_handle_dirty_metadata, which may result in use-after-free bugs. Fixes: fb265c9cb49e("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases") Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
2019-04-25fs/quota: erase unused but set variable warningJiang Biao
Local variable *reserved* of remove_dquot_ref() is only used if define CONFIG_QUOTA_DEBUG, but not ebraced in CONFIG_QUOTA_DEBUG macro, which leads to unused-but-set-variable warning when compiling. This patch ebrace it into CONFIG_QUOTA_DEBUG macro like what is done in add_dquot_ref(). Signed-off-by: Jiang Biao <benbjiang@tencent.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-04-25afs: Provide mount-time configurable byte-range file locking emulationDavid Howells
Provide byte-range file locking emulation that can be configured at mount time to one of four modes: (1) flock=local. Locking is done locally only and no reference is made to the server. (2) flock=openafs. Byte-range locking is done locally only; whole-file locking is done with reference to the server. Whole-file locks cannot be upgraded unless the client holds an exclusive lock. (3) flock=strict. Byte-range and whole-file locking both require a sufficient whole-file lock on the server. (4) flock=write. As strict, but the client always gets an exclusive whole-file lock on the server. Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Add more tracepointsDavid Howells
Add four more tracepoints: (1) afs_make_fs_call1 - Split from afs_make_fs_call but takes a filename to log also. (2) afs_make_fs_call2 - Like the above but takes two filenames to log. (3) afs_lookup - Log the result of doing a successful lookup, including a negative result (fid 0:0). (4) afs_get_tree - Log the set up of a volume for mounting. It also extends the name buffer on the afs_edit_dir tracepoint to 24 chars and puts quotes around the filename in the text representation. Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Implement sillyrename for unlink and renameDavid Howells
Implement sillyrename for AFS unlink and rename, using the NFS variant implementation as a basis. Note that the asynchronous file locking extender/releaser has to be notified with a state change to stop it complaining if there's a race between that and the actual file deletion. A tracepoint, afs_silly_rename, is also added to note the silly rename and the cleanup. The afs_edit_dir tracepoint is given some extra reason indicators and the afs_flock_ev tracepoint is given a silly-delete file lock cancellation indicator. Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Add directory reload tracepointDavid Howells
Add a tracepoint (afs_reload_dir) to indicate when a directory is being reloaded. Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Handle lock rpc ops failing on a file that got deletedDavid Howells
Holding a file lock on an AFS file does not prevent it from being deleted on the server, so we need to handle an error resulting from that when we try setting, extending or releasing a lock. Fix this by adding a "deleted" lock state and cancelling the lock extension process for that file and aborting all waiters for the lock. Fixes: 0fafdc9f888b ("afs: Fix file locking") Reported-by: Jonathan Billings <jsbillin@umich.edu> Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Improve dir check failure reportsDavid Howells
Improve the content of directory check failure reports from: kAFS: afs_dir_check_page(6d57): bad magic 1/2 is 0000 to dump more information about the individual blocks in a directory page. Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Add file locking tracepointsDavid Howells
Add two tracepoints for monitoring AFS file locking. Firstly, add one that follows the operational part: echo 1 >/sys/kernel/debug/tracing/events/afs/afs_flock_op/enable And add a second that more follows the event-driven part: echo 1 >/sys/kernel/debug/tracing/events/afs/afs_flock_ev/enable Individual file_lock structs seen by afs are tagged with debugging IDs that are displayed in the trace log to make it easier to see what's going on, especially as setting the first lock always seems to involve copying the file_lock twice. Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Further fix file lockingDavid Howells
Further fix the file locking in the afs filesystem client in a number of ways, including: (1) Don't submit the operation to obtain a lock from the server in a work queue context, but rather do it in the process context of whoever issued the requesting system call. (2) The owner of the file_lock struct at the front of the pending_locks queue now owns right to talk to the server. (3) Write locks can be instantly granted if they don't overlap with any other locks *and* we have a write lock on the server. (4) In the event of an authentication/permission error, all other matching pending locks requests are also immediately aborted. (5) Properly use VFS core locks_lock_file_wait() to distribute the server lock amongst local client locks, including waiting for the lock to become available. Test with: sqlite3 /afs/.../scratch/billings.sqlite <<EOF CREATE TABLE hosts ( hostname varchar(80), shorthost varchar(80), room varchar(30), building varchar(30), PRIMARY KEY(shorthost) ); EOF With the version of sqlite3 that I have, this should fail consistently with EAGAIN, whether or not the program is straced (which introduces some delays between lock syscalls). Fixes: 0fafdc9f888b ("afs: Fix file locking") Reported-by: Jonathan Billings <jsbillin@umich.edu> Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Fix AFS file locking to allow fine grained locksDavid Howells
Fix AFS file locking to allow fine grained locks as some applications, such as firefox, won't work if they can't take such locks on certain state files - thereby preventing the use of kAFS to distribute a home directory. Note that this cannot be made completely functional as the protocol only has provision for whole-file locks, so there exists the possibility of a process deadlocking itself by getting a partial read-lock on a file first and then trying to get a non-overlapping write-lock - but we got the server's read lock with the first lock, so we're now stuck. OpenAFS solves this by just granting any partial-range lock directly without consulting the server - and hoping there's no remote collision. I want to implement that in a separate patch and it requires a bit more thought. Fixes: 8d6c554126b8 ("AFS: implement file locking") Reported-by: Jonathan Billings <jsbillings@jsbillings.org> Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Calculate lock extend timer from set/extend reply receptionDavid Howells
Record the timestamp on the first reply DATA packet received in response to a set- or extend-lock operation, then use this to calculate the time remaining till the lock expires rather than using whatever time the requesting process wakes up and finishes processing the operation as a base. Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Split wait from afs_make_call()David Howells
Split the call to afs_wait_for_call_to_complete() from afs_make_call() to make it easier to handle asynchronous calls and to make it easier to convert a synchronous call to an asynchronous one in future, for instance when someone tries to interrupt an operation by pressing Ctrl-C. Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25btrfs: Switch memory allocations in async csum calculation path to kvmallocNikolay Borisov
Recent multi-page biovec rework allowed creation of bios that can span large regions - up to 128 megabytes in the case of btrfs. OTOH btrfs' submission path currently allocates a contiguous array to store the checksums for every bio submitted. This means we can request up to (128mb / BTRFS_SECTOR_SIZE) * 4 bytes + 32bytes of memory from kmalloc. On busy systems with possibly fragmented memory said kmalloc can fail which will trigger BUG_ON due to improper error handling IO submission context in btrfs. Until error handling is improved or bios in btrfs limited to a more manageable size (e.g. 1m) let's use kvmalloc to fallback to vmalloc for such large allocations. There is no hard requirement that the memory allocated for checksums during IO submission has to be contiguous, but this is a simple fix that does not require several non-contiguous allocations. For small writes this is unlikely to have any visible effect since kmalloc will still satisfy allocation requests as usual. For larger requests the code will just fallback to vmalloc. We've performed evaluation on several workload types and there was no significant difference kmalloc vs kvmalloc. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-25quota: fix wrong indentationChengguang Xu
We need to check return code only when calling ->read_dqblk(), so fix it properly. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-04-25debugfs: update documented return values of debugfs helpersRonald Tschalär
Since commit ff9fb72bc077 ("debugfs: return error values, not NULL") these helper functions do not return NULL anymore (with the exception of debugfs_create_u32_array()). Fixes: ff9fb72bc077 ("debugfs: return error values, not NULL") Signed-off-by: Ronald Tschalär <ronald@innovation.ch> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-25crypto: shash - remove shash_desc::flagsEric Biggers
The flags field in 'struct shash_desc' never actually does anything. The only ostensibly supported flag is CRYPTO_TFM_REQ_MAY_SLEEP. However, no shash algorithm ever sleeps, making this flag a no-op. With this being the case, inevitably some users who can't sleep wrongly pass MAY_SLEEP. These would all need to be fixed if any shash algorithm actually started sleeping. For example, the shash_ahash_*() functions, which wrap a shash algorithm with the ahash API, pass through MAY_SLEEP from the ahash API to the shash API. However, the shash functions are called under kmap_atomic(), so actually they're assumed to never sleep. Even if it turns out that some users do need preemption points while hashing large buffers, we could easily provide a helper function crypto_shash_update_large() which divides the data into smaller chunks and calls crypto_shash_update() and cond_resched() for each chunk. It's not necessary to have a flag in 'struct shash_desc', nor is it necessary to make individual shash algorithms aware of this at all. Therefore, remove shash_desc::flags, and document that the crypto_shash_*() functions can be called from any context. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-04-24cifs: fix page reference leak with readv/writevJérôme Glisse
CIFS can leak pages reference gotten through GUP (get_user_pages*() through iov_iter_get_pages()). This happen if cifs_send_async_read() or cifs_write_from_iter() calls fail from within __cifs_readv() and __cifs_writev() respectively. This patch move page unreference to cifs_aio_ctx_release() which will happens on all code paths this is all simpler to follow for correctness. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Steve French <sfrench@samba.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-04-24cifs: do not attempt cifs operation on smb2+ rename errorFrank Sorenson
A path-based rename returning EBUSY will incorrectly try opening the file with a cifs (NT Create AndX) operation on an smb2+ mount, which causes the server to force a session close. If the mount is smb2+, skip the fallback. Signed-off-by: Frank Sorenson <sorenson@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2019-04-24cifs: fix memory leak in SMB2_readRonnie Sahlberg
Commit 088aaf17aa79300cab14dbee2569c58cfafd7d6e introduced a leak where if SMB2_read() returned an error we would return without freeing the request buffer. Cc: Stable <stable@vger.kernel.org> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-04-24fuse: Add ioctl flag for x32 compat ioctlIan Abbott
Currently, a CUSE server running on a 64-bit kernel can tell when an ioctl request comes from a process running a 32-bit ABI, but cannot tell whether the requesting process is using legacy IA32 emulation or x32 ABI. In particular, the server does not know the size of the client process's `time_t` type. For 64-bit kernels, the `FUSE_IOCTL_COMPAT` and `FUSE_IOCTL_32BIT` flags are currently set in the ioctl input request (`struct fuse_ioctl_in` member `flags`) for a 32-bit requesting process. This patch defines a new flag `FUSE_IOCTL_COMPAT_X32` and sets it if the 32-bit requesting process is using the x32 ABI. This allows the server process to distinguish between requests coming from client processes using IA32 emulation or the x32 ABI and so infer the size of the client process's `time_t` type and any other IA32/x32 differences. Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24fuse: Convert fusectl to use the new mount APIDavid Howells
Convert the fusectl filesystem to the new internal mount API as the old one will be obsoleted and removed. This allows greater flexibility in communication of mount parameters between userspace, the VFS and the filesystem. See Documentation/filesystems/mount_api.txt for more information. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24fuse: document fuse_fsync_in.fsync_flagsAlan Somers
The FUSE_FSYNC_DATASYNC flag was introduced by commit b6aeadeda22a ("[PATCH] FUSE - file operations") as a magic number. No new values have been added to fsync_flags since. Signed-off-by: Alan Somers <asomers@FreeBSD.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24fuse: Add FOPEN_STREAM to use stream_open()Kirill Smelkov
Starting from commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") files opened even via nonseekable_open gate read and write via lock and do not allow them to be run simultaneously. This can create read vs write deadlock if a filesystem is trying to implement a socket-like file which is intended to be simultaneously used for both read and write from filesystem client. See commit 10dce8af3422 ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") for details and e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes to /proc/xen/xenbus") for a similar deadlock example on /proc/xen/xenbus. To avoid such deadlock it was tempting to adjust fuse_finish_open to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the opened handler is having stream-like semantics; does not use file position and thus the kernel is free to issue simultaneous read and write request on opened file handle. This patch together with stream_open() should be added to stable kernels starting from v3.14+. This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all kernel versions. This should work because fuse_finish_open ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24fuse: require /dev/fuse reads to have enough buffer capacityKirill Smelkov
A FUSE filesystem server queues /dev/fuse sys_read calls to get filesystem requests to handle. It does not know in advance what would be that request as it can be anything that client issues - LOOKUP, READ, WRITE, ... Many requests are short and retrieve data from the filesystem. However WRITE and NOTIFY_REPLY write data into filesystem. Before getting into operation phase, FUSE filesystem server and kernel client negotiate what should be the maximum write size the client will ever issue. After negotiation the contract in between server/client is that the filesystem server then should queue /dev/fuse sys_read calls with enough buffer capacity to receive any client request - WRITE in particular, while FUSE client should not, in particular, send WRITE requests with > negotiated max_write payload. FUSE client in kernel and libfuse historically reserve 4K for request header. This way the contract is that filesystem server should queue sys_reads with 4K+max_write buffer. If the filesystem server does not follow this contract, what can happen is that fuse_dev_do_read will see that request size is > buffer size, and then it will return EIO to client who issued the request but won't indicate in any way that there is a problem to filesystem server. This can be hard to diagnose because for some requests, e.g. for NOTIFY_REPLY which mimics WRITE, there is no client thread that is waiting for request completion and that EIO goes nowhere, while on filesystem server side things look like the kernel is not replying back after successful NOTIFY_RETRIEVE request made by the server. We can make the problem easy to diagnose if we indicate via error return to filesystem server when it is violating the contract. This should not practically cause problems because if a filesystem server is using shorter buffer, writes to it were already very likely to cause EIO, and if the filesystem is read-only it should be too following FUSE_MIN_READ_BUFFER minimum buffer size. Please see [1] for context where the problem of stuck filesystem was hit for real (because kernel client was incorrectly sending more than max_write data with NOTIFY_REPLY; see also previous patch), how the situation was traced and for more involving patch that did not make it into the tree. [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2 Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Cc: Han-Wen Nienhuys <hanwen@google.com> Cc: Jakob Unterwurzacher <jakobunt@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24fuse: retrieve: cap requested size to negotiated max_writeKirill Smelkov
FUSE filesystem server and kernel client negotiate during initialization phase, what should be the maximum write size the client will ever issue. Correspondingly the filesystem server then queues sys_read calls to read requests with buffer capacity large enough to carry request header + that max_write bytes. A filesystem server is free to set its max_write in anywhere in the range between [1*page, fc->max_pages*page]. In particular go-fuse[2] sets max_write by default as 64K, wheres default fc->max_pages corresponds to 128K. Libfuse also allows users to configure max_write, but by default presets it to possible maximum. If max_write is < fc->max_pages*page, and in NOTIFY_RETRIEVE handler we allow to retrieve more than max_write bytes, corresponding prepared NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the filesystem server, in full correspondence with server/client contract, will be only queuing sys_read with ~max_write buffer capacity, and fuse_dev_do_read throws away requests that cannot fit into server request buffer. In turn the filesystem server could get stuck waiting indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK which is understood by clients as that NOTIFY_REPLY was queued and will be sent back. Cap requested size to negotiate max_write to avoid the problem. This aligns with the way NOTIFY_RETRIEVE handler works, which already unconditionally caps requested retrieve size to fuse_conn->max_pages. This way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than was originally requested. Please see [1] for context where the problem of stuck filesystem was hit for real, how the situation was traced and for more involving patch that did not make it into the tree. [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2 [2] https://github.com/hanwen/go-fuse Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Cc: Han-Wen Nienhuys <hanwen@google.com> Cc: Jakob Unterwurzacher <jakobunt@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24fuse: allow filesystems to have precise control over data cacheKirill Smelkov
On networked filesystems file data can be changed externally. FUSE provides notification messages for filesystem to inform kernel that metadata or data region of a file needs to be invalidated in local page cache. That provides the basis for filesystem implementations to invalidate kernel cache explicitly based on observed filesystem-specific events. FUSE has also "automatic" invalidation mode(*) when the kernel automatically invalidates data cache of a file if it sees mtime change. It also automatically invalidates whole data cache of a file if it sees file size being changed. The automatic mode has corresponding capability - FUSE_AUTO_INVAL_DATA. However, due to probably historical reason, that capability controls only whether mtime change should be resulting in automatic invalidation or not. A change in file size always results in invalidating whole data cache of a file irregardless of whether FUSE_AUTO_INVAL_DATA was negotiated(+). The filesystem I write[1] represents data arrays stored in networked database as local files suitable for mmap. It is read-only filesystem - changes to data are committed externally via database interfaces and the filesystem only glues data into contiguous file streams suitable for mmap and traditional array processing. The files are big - starting from hundreds gigabytes and more. The files change regularly, and frequently by data being appended to their end. The size of files thus changes frequently. If a file was accessed locally and some part of its data got into page cache, we want that data to stay cached unless there is memory pressure, or unless corresponding part of the file was actually changed. However current FUSE behaviour - when it sees file size change - is to invalidate the whole file. The data cache of the file is thus completely lost even on small size change, and despite that the filesystem server is careful to accurately translate database changes into FUSE invalidation messages to kernel. Let's fix it: if a filesystem, through new FUSE_EXPLICIT_INVAL_DATA capability, indicates to kernel that it is fully responsible for data cache invalidation, then the kernel won't invalidate files data cache on size change and only truncate that cache to new size in case the size decreased. (*) see 72d0d248ca "fuse: add FUSE_AUTO_INVAL_DATA init flag", eed2179efe "fuse: invalidate inode mapping if mtime changes" (+) in writeback mode the kernel does not invalidate data cache on file size change, but neither it allows the filesystem to set the size due to external event (see 8373200b12 "fuse: Trust kernel i_size only") [1] https://lab.nexedi.com/kirr/wendelin.core/blob/a50f1d9f/wcfs/wcfs.go#L20 Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24fuse: convert printk -> pr_*Kirill Smelkov
Functions, like pr_err, are a more modern variant of printing compared to printk. They could be used to denoise sources by using needed level in the print function name, and by automatically inserting per-driver / function / ... print prefix as defined by pr_fmt macro. pr_* are also said to be used in Documentation/process/coding-style.rst and more recent code - for example overlayfs - uses them instead of printk. Convert CUSE and FUSE to use the new pr_* functions. CUSE output stays completely unchanged, while FUSE output is amended a bit for "trying to steal weird page" warning - the second line now comes also with "fuse:" prefix. I hope it is ok. Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24fuse: honor RLIMIT_FSIZE in fuse_file_fallocateLiu Bo
fstests generic/228 reported this failure that fuse fallocate does not honor what 'ulimit -f' has set. This adds the necessary inode_newsize_ok() check. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Fixes: 05ba1f082300 ("fuse: add FALLOCATE operation") Cc: <stable@vger.kernel.org> # v3.5 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24fuse: fix writepages on 32bitMiklos Szeredi
Writepage requests were cropped to i_size & 0xffffffff, which meant that mmaped writes to any file larger than 4G might be silently discarded. Fix by storing the file size in a properly sized variable (loff_t instead of size_t). Reported-by: Antonio SJ Musumeci <trapexit@spawn.link> Fixes: 6eaf4782eb09 ("fuse: writepages: crop secondary requests") Cc: <stable@vger.kernel.org> # v3.13 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24locks: move checks from locks_free_lock() to locks_release_private()NeilBrown
Code that allocates locks using locks_alloc_lock() will free it using locks_free_lock(), and will benefit from the BUG_ON() consistency checks therein. However some code (nfsd and lockd) allocate a lock embedded in some other data structure, and so free the lock themselves after calling locks_release_private(). This path does not benefit from the consistency checks. To help catch future errors, move the BUG_ON() checks to locks_release_private() - which locks_free_lock() already calls. This ensures that all users for locks will find out if the lock isn't detached properly before being free. Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-24nfsd: fh_drop_write in nfsd_unlinkJ. Bruce Fields
fh_want_write() can now be called twice, but I'm also fixing up the callers not to do that. Other cases include setattr and create. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-24nfsd: allow fh_want_write to be called twiceJ. Bruce Fields
A fuzzer recently triggered lockdep warnings about potential sb_writers deadlocks caused by fh_want_write(). Looks like we aren't careful to pair each fh_want_write() with an fh_drop_write(). It's not normally a problem since fh_put() will call fh_drop_write() for us. And was OK for NFSv3 where we'd do one operation that might call fh_want_write(), and then put the filehandle. But an NFSv4 protocol fuzzer can do weird things like call unlink twice in a compound, and then we get into trouble. I'm a little worried about this approach of just leaving everything to fh_put(). But I think there are probably a lot of fh_want_write()/fh_drop_write() imbalances so for now I think we need it to be more forgiving. Signed-off-by: J. Bruce Fields <bfields@redhat.com>