path: root/fs
Age | Commit message | Author
2019-08-12 | fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl | Eric Biggers
Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY. This ioctl adds an encryption key to the filesystem's fscrypt keyring ->s_master_keys, making any files encrypted with that key appear "unlocked".

Why we need this
~~~~~~~~~~~~~~~~

The main problem is that the "locked/unlocked" (ciphertext/plaintext) status of encrypted files is global, but the fscrypt keys are not. fscrypt only looks for keys in the keyring(s) the process accessing the filesystem is subscribed to: the thread keyring, process keyring, and session keyring, where the session keyring may contain the user keyring.

Therefore, userspace has to put fscrypt keys in the keyrings for individual users or sessions. But this means that when a process with a different keyring tries to access encrypted files, whether they appear "unlocked" or not is nondeterministic. This is because it depends on whether the files are currently present in the inode cache.

Fixing this by consistently providing each process its own view of the filesystem depending on whether it has the key or not isn't feasible due to how the VFS caches work. Furthermore, while sometimes users expect this behavior, it is misguided for two reasons. First, it would be an OS-level access control mechanism largely redundant with existing access control mechanisms such as UNIX file permissions, ACLs, LSMs, etc. Encryption is actually for protecting the data at rest.

Second, almost all users of fscrypt actually do need the keys to be global. The largest users of fscrypt, Android and Chromium OS, achieve this by having PID 1 create a "session keyring" that is inherited by every process. This works, but it isn't scalable because it prevents session keyrings from being used for any other purpose.

On general-purpose Linux distros, the 'fscrypt' userspace tool [1] can't similarly abuse the session keyring, so to make 'sudo' work on all systems it has to link all the user keyrings into root's user keyring [2]. This is ugly and raises security concerns. Moreover it can't make the keys available to system services, such as sshd trying to access the user's '~/.ssh' directory (see [3], [4]) or NetworkManager trying to read certificates from the user's home directory (see [5]); or to Docker containers (see [6], [7]).

By having an API to add a key to the *filesystem* we'll be able to fix the above bugs, remove userspace workarounds, and clearly express the intended semantics: the locked/unlocked status of an encrypted directory is global, and encryption is orthogonal to OS-level access control.

Why not use the add_key() syscall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We use an ioctl for this API rather than the existing add_key() system call because the ioctl gives us the flexibility needed to implement fscrypt-specific semantics that will be introduced in later patches:

- Supporting key removal with the semantics such that the secret is removed immediately and any unused inodes using the key are evicted; also, the eviction of any in-use inodes can be retried.

- Calculating a key-dependent cryptographic identifier and returning it to userspace.

- Allowing keys to be added and removed by non-root users, but only keys for v2 encryption policies; and to prevent denial-of-service attacks, users can only remove keys they themselves have added, and a key is only really removed after all users who added it have removed it.

Trying to shoehorn these semantics into the keyrings syscalls would be very difficult, whereas the ioctls make things much easier.

However, to reuse code the implementation still uses the keyrings service internally. Thus we get lockless RCU-mode key lookups without having to re-implement it, and the keys automatically show up in /proc/keys for debugging purposes.

References:

    [1] https://github.com/google/fscrypt
    [2] https://goo.gl/55cCrI#heading=h.vf09isp98isb
    [3] https://github.com/google/fscrypt/issues/111#issuecomment-444347939
    [4] https://github.com/google/fscrypt/issues/116
    [5] https://bugs.launchpad.net/ubuntu/+source/fscrypt/+bug/1770715
    [6] https://github.com/google/fscrypt/issues/128
    [7] https://askubuntu.com/questions/1130306/cannot-run-docker-on-an-encrypted-filesystem

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
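As a rough userspace illustration of the ioctl described in this commit, the sketch below adds a raw key for a v1 (descriptor-based) policy. It assumes kernel headers recent enough to ship <linux/fscrypt.h> with FS_IOC_ADD_ENCRYPTION_KEY and struct fscrypt_add_key_arg; the helper name add_fscrypt_key() and its parameters are invented for this example and are not taken from the patch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fscrypt.h>

/*
 * Hypothetical helper (not part of the patch): add a raw master key to the
 * keyring of the filesystem that 'fd' lives on.
 * Returns 0 on success, or -1 with errno set.
 */
int add_fscrypt_key(int fd,
		    const uint8_t descriptor[FSCRYPT_KEY_DESCRIPTOR_SIZE],
		    const uint8_t *raw, size_t raw_size)
{
	struct fscrypt_add_key_arg *arg;
	int ret, saved_errno;

	/* The raw key bytes trail the fixed-size header; calloc() keeps
	 * every reserved field zeroed. */
	arg = calloc(1, sizeof(*arg) + raw_size);
	if (!arg) {
		errno = ENOMEM;
		return -1;
	}

	arg->key_spec.type = FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR;
	memcpy(arg->key_spec.u.descriptor, descriptor,
	       FSCRYPT_KEY_DESCRIPTOR_SIZE);
	arg->raw_size = (uint32_t)raw_size;
	memcpy(arg->raw, raw, raw_size);

	ret = ioctl(fd, FS_IOC_ADD_ENCRYPTION_KEY, arg);
	saved_errno = errno;

	/* Best-effort wipe of the key copy; a real tool would use a wipe
	 * routine that the compiler is not allowed to elide. */
	memset(arg, 0, sizeof(*arg) + raw_size);
	free(arg);

	errno = saved_errno;
	return ret;
}
```

The file descriptor can refer to any file or directory on the target filesystem, typically its mount point. Since the commit message only promises non-root access for v2 policies in later patches, a caller of this sketch should expect to need root.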
2019-08-12 | fscrypt: rename keyinfo.c to keysetup.c | Eric Biggers
Rename keyinfo.c to keysetup.c since this better describes what the file does (sets up the key), and it matches the new file keysetup_v1.c. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 | fscrypt: move v1 policy key setup to keysetup_v1.c | Eric Biggers
In preparation for introducing v2 encryption policies which will find and derive encryption keys differently from the current v1 encryption policies, move the v1 policy-specific key setup code from keyinfo.c into keysetup_v1.c. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 | fscrypt: refactor key setup code in preparation for v2 policies | Eric Biggers
Do some more refactoring of the key setup code, in preparation for introducing a filesystem-level keyring and v2 encryption policies:

- Now that ci_inode exists, don't pass around the inode unnecessarily.

- Define a function setup_file_encryption_key() which handles the crypto key setup given an under-construction fscrypt_info. Don't pass the fscrypt_context, since everything is in the fscrypt_info. [This will be extended for v2 policies and the fs-level keyring.]

- Define a function fscrypt_set_derived_key() which sets the per-file key, without depending on anything specific to v1 policies. [This will also be used for v2 policies.]

- Define a function fscrypt_setup_v1_file_key() which takes the raw master key, thus separating finding the key from using it. [This will also be used if the key is found in the fs-level keyring.]

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 | fscrypt: rename fscrypt_master_key to fscrypt_direct_key | Eric Biggers
In preparation for introducing a filesystem-level keyring which will contain fscrypt master keys, rename the existing 'struct fscrypt_master_key' to 'struct fscrypt_direct_key'. This is the structure in the existing table of master keys that's maintained to deduplicate the crypto transforms for v1 DIRECT_KEY policies. I've chosen to keep this table as-is rather than make it automagically add/remove the keys to/from the filesystem-level keyring, since that would add a lot of extra complexity to the filesystem-level keyring. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 | fscrypt: add ->ci_inode to fscrypt_info | Eric Biggers
Add an inode back-pointer to 'struct fscrypt_info', such that inode->i_crypt_info->ci_inode == inode.

This will be useful for:

1. Evicting the inodes when a fscrypt key is removed, since we'll track the inodes using a given key by linking their fscrypt_infos together, rather than the inodes directly. This avoids bloating 'struct inode' with a new list_head.

2. Simplifying the per-file key setup, since the inode pointer won't have to be passed around everywhere just in case something goes wrong and it's needed for fscrypt_warn().

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 | fscrypt: use FSCRYPT_* definitions, not FS_* | Eric Biggers
Update fs/crypto/ to use the new names for the UAPI constants rather than the old names, then make the old definitions conditional on !__KERNEL__. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 | fscrypt: use ENOPKG when crypto API support missing | Eric Biggers
Return ENOPKG rather than ENOENT when trying to open a file that's encrypted using algorithms not available in the kernel's crypto API. This avoids an ambiguity, since ENOENT is also returned when the file doesn't exist. Note: this is the same approach I'm taking for fs-verity. Signed-off-by: Eric Biggers <ebiggers@google.com>
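From userspace, the distinction this commit introduces could be consumed roughly as follows. This is only a sketch: describe_open_error() is a hypothetical helper, and the message strings are made up for illustration.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: turn the errno left behind by a failed open() of an
 * encrypted file into a message. With this commit, a missing crypto
 * algorithm is reported as ENOPKG and no longer collides with ENOENT.
 */
const char *describe_open_error(int err)
{
	switch (err) {
	case ENOENT:
		return "file does not exist";
	case ENOPKG:
		return "encryption algorithm not available in the kernel crypto API";
	default:
		return strerror(err);
	}
}
```

A caller would invoke it as describe_open_error(errno) right after open() returns -1.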
2019-08-12 | fscrypt: improve warnings for missing crypto API support | Eric Biggers
Users of fscrypt with non-default algorithms will encounter an error like the following if they fail to include the needed algorithms into the crypto API when configuring the kernel (as per the documentation):

    Error allocating 'adiantum(xchacha12,aes)' transform: -2

This requires that the user figure out what the "-2" error means. Make it more friendly by printing a warning like the following instead:

    Missing crypto API support for Adiantum (API name: "adiantum(xchacha12,aes)")

Also upgrade the log level for *other* errors to KERN_ERR.

Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 | fscrypt: improve warning messages for unsupported encryption contexts | Eric Biggers
When fs/crypto/ encounters an inode with an invalid encryption context, currently it prints a warning if the pair of encryption modes are unrecognized, but it's silent if there are other problems such as unsupported context size, format, or flags. To help people debug such situations, add more warning messages. Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 | fscrypt: make fscrypt_msg() take inode instead of super_block | Eric Biggers
Most of the warning and error messages in fs/crypto/ are for situations related to a specific inode, not merely to a super_block. So to make things easier, make fscrypt_msg() take an inode rather than a super_block, and make it print the inode number. Note: This is the same approach I'm taking for fsverity_msg(). Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 | fscrypt: clean up base64 encoding/decoding | Eric Biggers
Some minor cleanups for the code that base64 encodes and decodes encrypted filenames and long name digests:

- Rename "digest_{encode,decode}()" => "base64_{encode,decode}()" since they are used for filenames too, not just for long name digests.
- Replace 'while' loops with more conventional 'for' loops.
- Use 'u8' for binary data. Keep 'char' for string data.
- Fully constify the lookup table (pointer was not const).
- Improve comment.

No actual change in behavior.

Signed-off-by: Eric Biggers <ebiggers@google.com>
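The conventions listed above ('for' loops, 'u8' for binary data, 'char' for strings, a fully const lookup table) can be sketched in a standalone encoder. This is not the fs/crypto/ code: it assumes the standard base64 alphabet and bit order with no padding, whereas the kernel's table and bit layout may differ, and the u8 typedef is a userspace stand-in.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t u8;	/* userspace stand-in for the kernel's u8 */

/* Fully const: neither the pointer nor the characters can be modified.
 * Assumption: the standard base64 alphabet, not necessarily fscrypt's. */
static const char *const lookup_table =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/*
 * Encode 'len' bytes of binary data from 'src' into the string 'dst',
 * without padding. 'dst' needs room for (len * 4 + 2) / 3 characters plus
 * a terminating NUL. Returns the number of characters written, not
 * counting the NUL.
 */
size_t base64_encode(const u8 *src, size_t len, char *dst)
{
	uint32_t ac = 0;	/* bit accumulator */
	int bits = 0;		/* number of not-yet-emitted bits in 'ac' */
	char *cp = dst;
	size_t i;

	for (i = 0; i < len; i++) {
		ac = (ac << 8) | src[i];
		bits += 8;
		for (; bits >= 6; bits -= 6)
			*cp++ = lookup_table[(ac >> (bits - 6)) & 0x3f];
	}
	if (bits)	/* flush the last 2 or 4 bits, left-aligned */
		*cp++ = lookup_table[(ac << (6 - bits)) & 0x3f];
	*cp = '\0';
	return (size_t)(cp - dst);
}
```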
2019-08-12 | fscrypt: remove loadable module related code | Eric Biggers
Since commit 643fa9612bf1 ("fscrypt: remove filesystem specific build config option"), fs/crypto/ can no longer be built as a loadable module. Thus it no longer needs a module_exit function, nor a MODULE_LICENSE. So remove them, and change module_init to late_initcall. Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 | fanotify, inotify, dnotify, security: add security hook for fs notifications | Aaron Goidel
As of now, setting watches on filesystem objects has, at most, applied a check for read access to the inode, and in the case of fanotify, requires CAP_SYS_ADMIN. No specific security hook or permission check has been provided to control the setting of watches. Using any of inotify, dnotify, or fanotify, it is possible to observe, not only write-like operations, but even read access to a file. Modeling the watch as being merely a read from the file is insufficient for the needs of SELinux. This is due to the fact that read access should not necessarily imply access to information about when another process reads from a file. Furthermore, fanotify watches grant more power to an application in the form of permission events. While notification events are solely, unidirectional (i.e. they only pass information to the receiving application), permission events are blocking. Permission events make a request to the receiving application which will then reply with a decision as to whether or not that action may be completed. This causes the issue of the watching application having the ability to exercise control over the triggering process. Without drawing a distinction within the permission check, the ability to read would imply the greater ability to control an application. Additionally, mount and superblock watches apply to all files within the same mount or superblock. Read access to one file should not necessarily imply the ability to watch all files accessed within a given mount or superblock. In order to solve these issues, a new LSM hook is implemented and has been placed within the system calls for marking filesystem objects with inotify, fanotify, and dnotify watches. These calls to the hook are placed at the point at which the target path has been resolved and are provided with the path struct, the mask of requested notification events, and the type of object on which the mark is being set (inode, superblock, or mount). 
The mask and obj_type have already been translated into common FS_* values shared by the entirety of the fs notification infrastructure. The path struct is passed rather than just the inode so that the mount is available, particularly for mount watches. This also allows for use of the hook by pathname-based security modules. However, since the hook is intended for use even by inode based security modules, it is not placed under the CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security modules would need to enable all of the path hooks, even though they do not use any of them. This only provides a hook at the point of setting a watch, and presumes that permission to set a particular watch implies the ability to receive all notification about that object which match the mask. This is all that is required for SELinux. If other security modules require additional hooks or infrastructure to control delivery of notification, these can be added by them. It does not make sense for us to propose hooks for which we have no implementation. The understanding that all notifications received by the requesting application are all strictly of a type for which the application has been granted permission shows that this implementation is sufficient in its coverage. Security modules wishing to provide complete control over fanotify must also implement a security_file_open hook that validates that the access requested by the watching application is authorized. Fanotify has the issue that it returns a file descriptor with the file mode specified during fanotify_init() to the watching process on event. This is already covered by the LSM security_file_open hook if the security module implements checking of the requested file mode there. Otherwise, a watching process can obtain escalated access to a file for which it has not been authorized. 
The selinux_path_notify hook implementation works by adding five new file permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm (descriptions about which will follow), and one new filesystem permission: watch (which is applied to superblock checks). The hook then decides which subset of these permissions must be held by the requesting application based on the contents of the provided mask and the obj_type. The selinux_file_open hook already checks the requested file mode and therefore ensures that a watching process cannot escalate its access through fanotify. The watch, watch_mount, and watch_sb permissions are the baseline permissions for setting a watch on an object and each are a requirement for any watch to be set on a file, mount, or superblock respectively. It should be noted that having either of the other two permissions (watch_reads and watch_with_perm) does not imply the watch, watch_mount, or watch_sb permission. Superblock watches further require the filesystem watch permission to the superblock. As there is no labeled object in view for mounts, there is no specific check for mount watches beyond watch_mount to the inode. Such a check could be added in the future, if a suitable labeled object existed representing the mount. The watch_reads permission is required to receive notifications from read-exclusive events on filesystem objects. These events include accessing a file for the purpose of reading and closing a file which has been opened read-only. This distinction has been drawn in order to provide a direct indication in the policy for this otherwise not obvious capability. Read access to a file should not necessarily imply the ability to observe read events on a file. Finally, watch_with_perm only applies to fanotify masks since it is the only way to set a mask which allows for the blocking, permission event. This permission is needed for any watch which is of this type. 
Though fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit trust to root, which we do not do, and does not support least privilege.

Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
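The way the hook derives the required permissions from the object type and event mask, as described in this message, can be sketched in plain C. Every identifier and bit value below (the EX_* names and ex_required_perms()) is a hypothetical stand-in chosen for this example; the real hook works on the kernel's FS_* mask values and SELinux permission definitions, and superblock watches additionally need the filesystem watch permission, which this sketch does not model.

```c
#include <stdint.h>

/* Hypothetical object types a mark can be set on. */
enum ex_obj_type { EX_OBJ_INODE, EX_OBJ_MOUNT, EX_OBJ_SB };

/* Hypothetical event bits in the requested mask. */
#define EX_EV_ACCESS		0x01u	/* file was read */
#define EX_EV_MODIFY		0x02u	/* file was written */
#define EX_EV_CLOSE_NOWRITE	0x04u	/* read-only file was closed */
#define EX_EV_OPEN_PERM		0x08u	/* blocking permission event */

/* Hypothetical permission bits, named after the new file permissions. */
#define EX_PERM_WATCH		0x01u
#define EX_PERM_WATCH_MOUNT	0x02u
#define EX_PERM_WATCH_SB	0x04u
#define EX_PERM_WATCH_READS	0x08u
#define EX_PERM_WATCH_WITH_PERM	0x10u

/*
 * Compute the set of permissions the requesting application must hold.
 * Returns 0 and fills '*perms' on success, -1 for an unknown object type.
 */
int ex_required_perms(enum ex_obj_type type, uint32_t mask, uint32_t *perms)
{
	uint32_t p;

	/* Baseline permission, one per kind of watched object. */
	switch (type) {
	case EX_OBJ_INODE:
		p = EX_PERM_WATCH;
		break;
	case EX_OBJ_MOUNT:
		p = EX_PERM_WATCH_MOUNT;
		break;
	case EX_OBJ_SB:
		p = EX_PERM_WATCH_SB;
		break;
	default:
		return -1;
	}

	/* Read-exclusive events need watch_reads on top of the baseline. */
	if (mask & (EX_EV_ACCESS | EX_EV_CLOSE_NOWRITE))
		p |= EX_PERM_WATCH_READS;

	/* Blocking permission events need watch_with_perm as well. */
	if (mask & EX_EV_OPEN_PERM)
		p |= EX_PERM_WATCH_WITH_PERM;

	*perms = p;
	return 0;
}
```

Note that the two extra permissions are only ever added to a baseline permission, mirroring the statement that watch_reads and watch_with_perm do not imply watch, watch_mount, or watch_sb.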
2019-08-12 | ext4: set error return correctly when ext4_htree_store_dirent fails | Colin Ian King
Currently when the call to ext4_htree_store_dirent fails, the error return variable 'ret' is not being set to the error code; the variable 'count' is set instead, hence the error code is not being returned. Fix this by assigning the error return code to ret.

Addresses-Coverity: ("Unused value")
Fixes: 8af0f0822797 ("ext4: fix readdir error in the case of inline_data+dir_index")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-12 | ext4: drop legacy pre-1970 encoding workaround | Theodore Ts'o
Originally, support for expanded timestamps had a bug in that pre-1970 times were erroneously encoded as being in the 24th century. This was fixed in commit a4dad1ae24f8 ("ext4: Fix handling of extended tv_sec") which landed in 4.4. Starting with 4.4, pre-1970 timestamps were correctly encoded, but for backwards compatibility those incorrectly encoded timestamps were mapped back to the pre-1970 dates.

Given that the backwards compatibility workaround has been around for 4 years, and given that running e2fsck from e2fsprogs 1.43.2 and later (which has been released for 3 years) will offer to fix these timestamps, it's past time to drop the legacy workaround from the kernel.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
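For readers unfamiliar with the expanded timestamp layout, here is a minimal userspace sketch of the correct encoding once the workaround is gone. It assumes the usual ext4 layout: a signed 32-bit seconds field plus an "extra" field whose low 2 bits extend the epoch and whose upper 30 bits hold nanoseconds. The ex_* names and EX_EPOCH_* macros are invented for this example and do not match the kernel's helpers.

```c
#include <stdint.h>

#define EX_EPOCH_BITS	2
#define EX_EPOCH_MASK	((1u << EX_EPOCH_BITS) - 1)

/* Interpret a 32-bit on-disk seconds field as a signed value. */
static int64_t ex_sign_extend32(uint32_t v)
{
	return v < 0x80000000u ? (int64_t)v : (int64_t)v - 0x100000000LL;
}

/* Low 32 bits of the seconds, as stored in the base field. */
uint32_t ex_encode_base(int64_t sec)
{
	return (uint32_t)sec;
}

/* Extra field: epoch bits in the low 2 bits, nanoseconds above them. */
uint32_t ex_encode_extra(int64_t sec, uint32_t nsec)
{
	/* How many multiples of 2^32 separate 'sec' from its signed low half. */
	int64_t diff = sec - ex_sign_extend32((uint32_t)sec);
	uint32_t epoch = (uint32_t)(diff / 0x100000000LL) & EX_EPOCH_MASK;

	return epoch | (nsec << EX_EPOCH_BITS);
}

/* Rebuild the 64-bit seconds value from the two on-disk fields. */
int64_t ex_decode_sec(uint32_t base, uint32_t extra)
{
	return ex_sign_extend32(base) +
	       ((int64_t)(extra & EX_EPOCH_MASK) << 32);
}

/* Nanoseconds live in the upper 30 bits of the extra field. */
uint32_t ex_decode_nsec(uint32_t extra)
{
	return extra >> EX_EPOCH_BITS;
}
```

With this scheme a pre-1970 time such as -1 is stored with epoch bits 0 and decodes back to -1, so no special-case mapping is needed on the read side.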
2019-08-12 | xfs: don't crash on null attr fork xfs_bmapi_read | Darrick J. Wong
Zorro Lang reported a crash in generic/475 if we try to inactivate a corrupt inode with a NULL attr fork (stack trace shortened somewhat):

RIP: 0010:xfs_bmapi_read+0x311/0xb00 [xfs]
RSP: 0018:ffff888047f9ed68 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff888047f9f038 RCX: 1ffffffff5f99f51
RDX: 0000000000000002 RSI: 0000000000000008 RDI: 0000000000000012
RBP: ffff888002a41f00 R08: ffffed10005483f0 R09: ffffed10005483ef
R10: ffffed10005483ef R11: ffff888002a41f7f R12: 0000000000000004
R13: ffffe8fff53b5768 R14: 0000000000000005 R15: 0000000000000001
FS:  00007f11d44b5b80(0000) GS:ffff888114200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000ef6000 CR3: 000000002e176003 CR4: 00000000001606e0
Call Trace:
 xfs_dabuf_map.constprop.18+0x696/0xe50 [xfs]
 xfs_da_read_buf+0xf5/0x2c0 [xfs]
 xfs_da3_node_read+0x1d/0x230 [xfs]
 xfs_attr_inactive+0x3cc/0x5e0 [xfs]
 xfs_inactive+0x4c8/0x5b0 [xfs]
 xfs_fs_destroy_inode+0x31b/0x8e0 [xfs]
 destroy_inode+0xbc/0x190
 xfs_bulkstat_one_int+0xa8c/0x1200 [xfs]
 xfs_bulkstat_one+0x16/0x20 [xfs]
 xfs_bulkstat+0x6fa/0xf20 [xfs]
 xfs_ioc_bulkstat+0x182/0x2b0 [xfs]
 xfs_file_ioctl+0xee0/0x12a0 [xfs]
 do_vfs_ioctl+0x193/0x1000
 ksys_ioctl+0x60/0x90
 __x64_sys_ioctl+0x6f/0xb0
 do_syscall_64+0x9f/0x4d0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f11d39a3e5b

The "obvious" cause is that the attr ifork is null despite the inode claiming an attr fork having at least one extent, but it's not so obvious why we ended up with an inode in that state.

Reported-by: Zorro Lang <zlang@redhat.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204031
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2019-08-12 | xfs: remove more ondisk directory corruption asserts | Darrick J. Wong
Continue our game of replacing ASSERTs for corrupt ondisk metadata with EFSCORRUPTED returns. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2019-08-12 | Merge 5.3-rc4 into driver-core-next | Greg Kroah-Hartman
We need the driver core fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-11 | ext4: add new ioctl EXT4_IOC_GET_ES_CACHE | Theodore Ts'o
For debugging reasons, it's useful to know the contents of the extent cache. Since the extent cache contains much of what is in the fiemap ioctl, use an fiemap-style interface to return this information. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11 | ext4: add a new ioctl EXT4_IOC_GETSTATE | Theodore Ts'o
The new ioctl EXT4_IOC_GETSTATE returns some of the dynamic state of an ext4 inode for debugging purposes. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11 | ext4: add a new ioctl EXT4_IOC_CLEAR_ES_CACHE | Theodore Ts'o
The new ioctl EXT4_IOC_CLEAR_ES_CACHE will force an inode's extent status cache to be cleared out. This is intended for use for debugging. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11 | jbd2: flush_descriptor(): Do not decrease buffer head's ref count | Chandan Rajendra
When executing generic/388 on a ppc64le machine, we notice the following call trace:

VFS: brelse: Trying to free free buffer
WARNING: CPU: 0 PID: 6637 at /root/repos/linux/fs/buffer.c:1195 __brelse+0x84/0xc0

Call Trace:
 __brelse+0x80/0xc0 (unreliable)
 invalidate_bh_lru+0x78/0xc0
 on_each_cpu_mask+0xa8/0x130
 on_each_cpu_cond_mask+0x130/0x170
 invalidate_bh_lrus+0x44/0x60
 invalidate_bdev+0x38/0x70
 ext4_put_super+0x294/0x560
 generic_shutdown_super+0xb0/0x170
 kill_block_super+0x38/0xb0
 deactivate_locked_super+0xa4/0xf0
 cleanup_mnt+0x164/0x1d0
 task_work_run+0x110/0x160
 do_notify_resume+0x414/0x460
 ret_from_except_lite+0x70/0x74

The warning happens because flush_descriptor() drops a bh reference it does not own. The bh reference acquired by jbd2_journal_get_descriptor_buffer() is owned by the log_bufs list and gets released when this list is processed. The reference for doing IO is only acquired in write_dirty_buffer() later in flush_descriptor().

Reported-by: Harish Sriram <harish@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11 | ext4: remove unnecessary error check | Shi Siyuan
Remove the unnecessary error check in ext4_file_write_iter(), because the same check is performed later in the call chain: ext4_write_checks() -> generic_write_checks().

Change-Id: I7b0ab27f693a50765c15b5eaa3f4e7c38f42e01e
Signed-off-by: shisiyuan <shisiyuan@xiaomi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11 | ext4: fix warning when turn on dioread_nolock and inline_data | yangerkun
mkfs.ext4 -O inline_data /dev/vdb
mount -o dioread_nolock /dev/vdb /mnt
echo "some inline data..." >> /mnt/test-file
echo "some inline data..." >> /mnt/test-file
sync

The above script will trigger "WARN_ON(!io_end->handle && sbi->s_journal)" because ext4_should_dioread_nolock() returns false for a file with inline data. Move the check to a place after we have already removed the inline data and prepared the inode to write normal pages.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11 | Merge tag 'dax-fixes-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm | Linus Torvalds
Pull dax fixes from Dan Williams:
 "A filesystem-dax and device-dax fix for v5.3.

  The filesystem-dax fix is tagged for stable as the implementation has been mistakenly throwing away all cow pages on any truncate or hole punch operation as part of the solution to coordinate device-dma vs truncate to dax pages.

  The device-dax change fixes up a regression this cycle from the introduction of a common 'internal per-cpu-ref' implementation.

  Summary:

   - Fix dax_layout_busy_page() to not discard private cow pages of fs/dax private mappings.

   - Update the memremap_pages core to properly cleanup on behalf of internal reference-count users like device-dax"

* tag 'dax-fixes-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  mm/memremap: Fix reuse of pgmap instances with internal references
  dax: dax_layout_busy_page() should not unmap cow pages
2019-08-10 | Merge tag 'gfs2-v5.3-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 | Linus Torvalds
Pull gfs2 fix from Andreas Gruenbacher:
 "Fix incorrect lseek / fiemap results"

* tag 'gfs2-v5.3-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: gfs2_walk_metadata fix
2019-08-09 | Merge tag 'for-linus-20190809' of git://git.kernel.dk/linux-block | Linus Torvalds
Pull block fixes from Jens Axboe:

 - Revert of a bcache patch that caused an oops for some (Coly)

 - ata rb532 unused warning fix (Gustavo)

 - AoE kernel crash fix (He)

 - Error handling fixup for blkdev_get() (Jan)

 - libata read/write translation and SFF PIO fix (me)

 - Use after free and error handling fix for O_DIRECT fragments. There's still a nowait + sync oddity in there, we'll nail that start next week. If all else fails, I'll queue a revert of the NOWAIT change. (me)

 - Loop GFP_KERNEL -> GFP_NOIO deadlock fix (Mikulas)

 - Two BFQ regression fixes that caused crashes (Paolo)

* tag 'for-linus-20190809' of git://git.kernel.dk/linux-block:
  bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()"
  loop: set PF_MEMALLOC_NOIO for the worker thread
  bdev: Fixup error handling in blkdev_get()
  block, bfq: handle NULL return value by bfq_init_rq()
  block, bfq: move update of waker and woken list to queue freeing
  block, bfq: reset last_completed_rq_bfqq if the pointed queue is freed
  block: aoe: Fix kernel crash due to atomic sleep when exiting
  libata: add SG safety checks in SFF pio transfers
  libata: have ata_scsi_rw_xlat() fail invalid passthrough requests
  block: fix O_DIRECT error handling for bio fragments
  ata: rb532_cf: Fix unused variable warning in rb532_pata_driver_probe
2019-08-09 | gfs2: Minor gfs2_alloc_inode cleanup | Andreas Gruenbacher
In gfs2_alloc_inode, when kmem_cache_alloc cannot allocate a new object, return NULL immediately. The code currently relies on the fact that i_inode is the first member in struct gfs2_inode and so ip and &ip->i_inode evaluate to the same address, but that isn't immediately obvious. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
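The pattern behind this cleanup can be shown with a small standalone sketch. The struct and function names (ex_vfs_inode, ex_fs_inode, ex_alloc_inode()) are hypothetical stand-ins, and calloc() takes the place of kmem_cache_alloc(); the point is only that the explicit NULL return no longer depends on the embedded inode sitting at offset 0.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the generic VFS inode. */
struct ex_vfs_inode {
	unsigned long i_ino;
};

/* Hypothetical stand-in for a filesystem inode that embeds the VFS inode
 * as its first member, so both share the same address. */
struct ex_fs_inode {
	struct ex_vfs_inode i_inode;
	unsigned long i_flags;
};

/*
 * Allocate a filesystem inode and hand back the embedded VFS inode.
 * 'simulate_failure' forces the allocation-failure path for demonstration.
 */
struct ex_vfs_inode *ex_alloc_inode(int simulate_failure)
{
	struct ex_fs_inode *ip;

	ip = simulate_failure ? NULL : calloc(1, sizeof(*ip));
	if (!ip)
		return NULL;	/* explicit, independent of member layout */

	ip->i_flags = 0;
	return &ip->i_inode;
}
```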
2019-08-09 | gfs2: implement gfs2_block_zero_range using iomap_zero_range | Christoph Hellwig
iomap handles all the nitty-gritty details of zeroing a file range for us, so use the proper helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2019-08-09 | gfs2: Add support for IOMAP_ZERO | Andreas Gruenbacher
Add support for the IOMAP_ZERO iomap operation so that iomap_zero_range will work as expected. In the IOMAP_ZERO case, the caller of iomap_zero_range is responsible for taking an exclusive glock on the inode, so we need no additional locking in gfs2_iomap_begin. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2019-08-09 | gfs2: gfs2_iomap_begin cleanup | Andreas Gruenbacher
Following commit d0a22a4b03b8 ("gfs2: Fix iomap write page reclaim deadlock"), gfs2_iomap_begin and gfs2_iomap_begin_write can be further cleaned up and the split between those two functions can be improved. With suggestions from Christoph Hellwig. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2019-08-09 | gfs2: gfs2_walk_metadata fix | Andreas Gruenbacher
It turns out that the current version of gfs2_metadata_walker suffers from multiple problems that can cause gfs2_hole_size to report an incorrect size. This will confuse fiemap as well as lseek with the SEEK_DATA flag. Fix that by changing gfs2_hole_walker to compute the metapath to the first data block after the hole (if any), and compute the hole size based on that. Fixes xfstest generic/490. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Cc: stable@vger.kernel.org # v4.18+
2019-08-09 | fs/core/vmcore: Move sev_active() reference to x86 arch code | Thiago Jung Bauermann
Secure Encrypted Virtualization is an x86-specific feature, so it shouldn't appear in generic kernel code because it forces non-x86 architectures to define the sev_active() function, which doesn't make a lot of sense. To solve this problem, add an x86 elfcorehdr_read() function to override the generic weak implementation. To do that, it's necessary to make read_from_oldmem() public so that it can be used outside of vmcore.c. Also, remove the export for sev_active() since it's only used in files that won't be built as modules. Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lianbo Jiang <lijiang@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190806044919.10622-6-bauerman@linux.ibm.com
2019-08-08 | Merge tag 'nfs-for-5.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs | Linus Torvalds
Pull NFS client fixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:

   - NFSv4: Ensure we check the return value of update_open_stateid() so we correctly track active open state.

   - NFSv4: Fix for delegation state recovery to ensure we recover all open modes that are active.

   - NFSv4: Fix an Oops in nfs4_do_setattr

  Fixes:

   - NFS: Fix regression whereby fscache errors are appearing on 'nofsc' mounts

   - NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()

   - NFSv4: Fix a credential refcount leak in nfs41_check_delegation_stateid

   - pNFS: Report errors from the call to nfs4_select_rw_stateid()

   - NFSv4: Various other delegation and open stateid recovery fixes

   - NFSv4: Fix state recovery behaviour when server connection times out"

* tag 'nfs-for-5.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Ensure state recovery handles ETIMEDOUT correctly
  NFS: Fix regression whereby fscache errors are appearing on 'nofsc' mounts
  NFSv4: Fix an Oops in nfs4_do_setattr
  NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()
  NFSv4: Check the return value of update_open_stateid()
  NFSv4.1: Only reap expired delegations
  NFSv4.1: Fix open stateid recovery
  NFSv4: Report the error from nfs4_select_rw_stateid()
  NFSv4: When recovering state fails with EAGAIN, retry the same recovery
  NFSv4: Print an error in the syslog when state is marked as irrecoverable
  NFSv4: Fix delegation state recovery
  NFSv4: Fix a credential refcount leak in nfs41_check_delegation_stateid
2019-08-08 | Merge tag '5.3-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 | Linus Torvalds
Pull cifs fixes from Steve French:
 "Six small SMB3 fixes, two for stable"

* tag '5.3-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  SMB3: Kernel oops mounting a encryptData share with CONFIG_DEBUG_VIRTUAL
  smb3: update TODO list of missing features
  smb3: send CAP_DFS capability during session setup
  SMB3: Fix potential memory leak when processing compound chain
  SMB3: Fix deadlock in validate negotiate hits reconnect
  cifs: fix rmmod regression in cifs.ko caused by force_sig changes
2019-08-08 | bdev: Fixup error handling in blkdev_get() | Jan Kara
Commit 89e524c04fa9 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD") converted blkdev_get() to use the new helpers for finishing claiming of a block device. However the conversion botched the error handling in blkdev_get() and thus the bdev has been marked as held even in case __blkdev_get() returned error. This led to occasional warnings with block/001 test from blktests like:

kernel: WARNING: CPU: 5 PID: 907 at fs/block_dev.c:1899 __blkdev_put+0x396/0x3a0

Correct the error handling.

CC: stable@vger.kernel.org
Fixes: 89e524c04fa9 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-07fs/handle.c - fix up kerneldocValdis Klētnieks
When building with W=1, we get some kerneldoc warnings: CC fs/fhandle.o fs/fhandle.c:259: warning: Function parameter or member 'flags' not described in 'sys_open_by_handle_at' fs/fhandle.c:259: warning: Excess function parameter 'flag' description in 'sys_open_by_handle_at' Fix the typo that caused it. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-07block: fix O_DIRECT error handling for bio fragmentsJens Axboe
0eb6ddfb865c tried to fix this up, but introduced a use-after-free of dio. Additionally, we still had an issue with error handling, as reported by Darrick: "I noticed a regression in xfs/747 (an unreleased xfstest for the xfs_scrub media scanning feature) on 5.3-rc3. I'll condense that down to a simpler reproducer: error-test: 0 209 linear 8:48 0 error-test: 209 1 error error-test: 210 6446894 linear 8:48 210 Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO for sector 209 and to pass the io to the scsi device everywhere else. On 5.3-rc3, performing a directio pread of this range with a < 1M buffer (in other words, a request for fewer than MAX_BIO_PAGES bytes) yields EIO like you'd expect: pread64(3, 0x7f880e1c7000, 1048576, 0) = -1 EIO (Input/output error) pread: Input/output error +++ exited with 0 +++ But doing it with a larger buffer succeeds(!): pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880 read 1146880/1146880 bytes at offset 0 1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec) +++ exited with 0 +++ (Note that the part of the buffer corresponding to the dm-error area is uninitialized) On 5.3-rc2, both commands would fail with EIO like you'd expect. The only change between rc2 and rc3 is commit 0eb6ddfb865c ("block: Fix __blkdev_direct_IO() for bio fragments"). AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB) and a second one for the 96k after that." Fix this by noting that it's always safe to dereference dio if we get BLK_QC_T_EAGAIN returned, as end_io hasn't been run for that case. So we can safely increment the dio size before calling submit_bio(), and then decrement it on failure (not that it really matters, as the bio and dio are going away). For error handling, return to the original method of just using 'ret' for tracking the error, and the size tracking in dio->size. 
Fixes: 0eb6ddfb865c ("block: Fix __blkdev_direct_IO() for bio fragments") Fixes: 6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO") Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
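The fragment arithmetic in the report above can be checked with a small userspace model. This is only a sketch, not kernel code: the 4 KiB page size, the 256-page per-bio limit, and the helper names `split_into_bios` and `direct_read_status` are assumptions introduced for illustration. It shows how a 1120K request splits into a 1MB fragment plus a 96k fragment, and the behaviour the fix is meant to restore (an error in any fragment fails the whole read).

```python
import errno

PAGE_SIZE = 4096      # assumption: 4 KiB pages
BIO_MAX_PAGES = 256   # assumption: per-bio page limit, giving 1 MiB per bio


def split_into_bios(total_bytes, max_bio_bytes=PAGE_SIZE * BIO_MAX_PAGES):
    """Return the byte sizes of the fragments a request is split into."""
    sizes = []
    remaining = total_bytes
    while remaining > 0:
        chunk = min(remaining, max_bio_bytes)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


def direct_read_status(bio_sizes, failed):
    """Model of the intended result: -EIO if any fragment index is in
    `failed`, otherwise the total number of bytes read."""
    if any(index in failed for index in range(len(bio_sizes))):
        return -errno.EIO
    return sum(bio_sizes)


fragments = split_into_bios(1146880)       # the 1120K buffer from the report
print(fragments)                           # sizes of the two fragments
print(direct_read_status(fragments, {0}))  # first fragment covers sector 209
```

In this model the failing sector lands in the first fragment, so the read reports an error even though the second fragment completes.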
2019-08-07NFSv4: Ensure state recovery handles ETIMEDOUT correctlyTrond Myklebust
Ensure that the state recovery code handles ETIMEDOUT correctly, and also that we set RPC_TASK_TIMEOUT when recovering open state. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-07btrfs: trim: Check the range passed into to prevent overflowQu Wenruo
Normally range->len is set to its default value (U64_MAX). When it is not the default, we should check whether the range overflows and, if it does, return -EINVAL before doing anything. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
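The check described above can be modelled in userspace as follows. This is a sketch under stated assumptions, not the btrfs implementation: `check_trim_range` is a hypothetical name, and Python integers are unbounded, so the 64-bit limit is applied explicitly instead of relying on wraparound.

```python
import errno

U64_MAX = 2**64 - 1  # default value of range->len according to the commit


def check_trim_range(start, length):
    """Return 0 if the trim range is acceptable, -EINVAL if start + length
    would not fit in an unsigned 64-bit value."""
    if not (0 <= start <= U64_MAX and 0 <= length <= U64_MAX):
        raise ValueError("start and length must be unsigned 64-bit values")
    if length == U64_MAX:
        # Default length: no overflow check is needed.
        return 0
    if start + length > U64_MAX:
        return -errno.EINVAL
    return 0


print(check_trim_range(0, U64_MAX))
print(check_trim_range(U64_MAX - 10, 11))
```

The point of rejecting the range up front is that no work is done on a request whose end offset cannot be represented.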
2019-08-07Btrfs: fix sysfs warning and missing raid sysfs directoriesFilipe Manana
In the 5.3 merge window, commit 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups"), we started using the member "defaults_groups" for the kobject type "btrfs_raid_ktype". That leads to a series of warnings when running some test cases of fstests, such as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those warnings are like the following: [116648.059212] kernfs: can not remove 'total_bytes', no directory [116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1 (...) [116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282 [116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000 [116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8 [116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001 [116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120 [116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100 [116648.077143] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000 [116648.077852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0 [116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [116648.080585] Call Trace: [116648.081262] remove_files+0x31/0x70 [116648.081929] sysfs_remove_group+0x38/0x80 [116648.082596] sysfs_remove_groups+0x34/0x70 [116648.083258] kobject_del+0x20/0x60 [116648.083933] btrfs_free_block_groups+0x405/0x430 [btrfs] [116648.084608] close_ctree+0x19a/0x380 [btrfs] [116648.085278] generic_shutdown_super+0x6c/0x110 [116648.085951] kill_anon_super+0xe/0x30 [116648.086621] 
btrfs_kill_super+0x12/0xa0 [btrfs] [116648.087289] deactivate_locked_super+0x3a/0x70 [116648.087956] cleanup_mnt+0xb4/0x160 [116648.088620] task_work_run+0x7e/0xc0 [116648.089285] exit_to_usermode_loop+0xfa/0x100 [116648.089933] do_syscall_64+0x1cb/0x220 [116648.090567] entry_SYSCALL_64_after_hwframe+0x49/0xbe [116648.091197] RIP: 0033:0x7f9cdc073b37 (...) [116648.100046] ---[ end trace 22e24db328ccadf8 ]--- [116648.100618] ------------[ cut here ]------------ [116648.101175] kernfs: can not remove 'used_bytes', no directory [116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1 (...) [116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282 [116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000 [116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8 [116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001 [116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120 [116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100 [116648.113268] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000 [116648.113939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0 [116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [116648.116649] Call Trace: [116648.117326] remove_files+0x31/0x70 [116648.117997] sysfs_remove_group+0x38/0x80 [116648.118671] sysfs_remove_groups+0x34/0x70 [116648.119342] kobject_del+0x20/0x60 [116648.120022] btrfs_free_block_groups+0x405/0x430 [btrfs] [116648.120707] close_ctree+0x19a/0x380 [btrfs] [116648.121396] 
generic_shutdown_super+0x6c/0x110 [116648.122057] kill_anon_super+0xe/0x30 [116648.122702] btrfs_kill_super+0x12/0xa0 [btrfs] [116648.123335] deactivate_locked_super+0x3a/0x70 [116648.123961] cleanup_mnt+0xb4/0x160 [116648.124586] task_work_run+0x7e/0xc0 [116648.125210] exit_to_usermode_loop+0xfa/0x100 [116648.125830] do_syscall_64+0x1cb/0x220 [116648.126463] entry_SYSCALL_64_after_hwframe+0x49/0xbe [116648.127080] RIP: 0033:0x7f9cdc073b37 (...) [116648.135923] ---[ end trace 22e24db328ccadf9 ]--- These happen because, during the unmount path, we call kobject_del() for raid kobjects that are not fully initialized, meaning that we set their ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set their parent kobject, which is done through btrfs_add_raid_kobjects(). We have this split raid kobject setup since commit 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation") in order to avoid triggering reclaim in contexts where we can not (either we are holding a transaction handle or some lock required by the transaction commit path), so that we do the calls to kobject_add(), which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects() in contexts where it is safe to trigger reclaim. That change expected that a new raid kobject can only be created either when mounting the filesystem or after raid profile conversion through the relocation path. However, a new raid kobject can be created in at least two other cases: 1) During device replace (or scrub) after adding a device to the filesystem. The replace procedure (and scrub) calls btrfs_inc_block_group_ro(), which can allocate a new block group with a new raid profile (because we now have more devices). This can be triggered by test cases btrfs/027 and btrfs/176. 2) During a degraded mount through any write path. This can be triggered by test case btrfs/124. 
Fixing this by adding extra calls to btrfs_add_raid_kobjects() not only makes things more complex and fragile, but can also introduce deadlocks with reclaim in the following ways: 1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or anywhere in the replace/scrub path will cause a deadlock with reclaim because if reclaim happens and a transaction commit is triggered, the transaction commit path will block at btrfs_scrub_pause(). 2) During degraded mounts it is essentially impossible to figure out where to add extra calls to btrfs_add_raid_kobjects(), because allocation of a block group with a new raid profile can happen anywhere, which means we can't safely figure out which contexts are safe for reclaim, as we can either hold a transaction handle or some lock needed by the transaction commit path. So it is too complex and error prone to have this split setup of raid kobjects. So fix the issue by consolidating the setup of the kobjects in a single place, at link_block_group(), and set up a nofs context there in order to prevent reclaim being triggered by the memory allocations done through the call chain of kobject_add(). Besides fixing the sysfs warnings during kobject_del(), this also ensures the sysfs directories for the new raid profiles end up created and visible to users (a bug that existed before the 5.3 commit 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")). Fixes: 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation") Fixes: 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-08-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: "Yeah I should have sent a pull request last week, so there is a lot more here than usual: 1) Fix memory leak in ebtables compat code, from Wenwen Wang. 2) Several kTLS bug fixes from Jakub Kicinski (circular close on disconnect etc.) 3) Force slave speed check on link state recovery in bonding 802.3ad mode, from Thomas Falcon. 4) Clear RX descriptor bits before assigning buffers to them in stmmac, from Jose Abreu. 5) Several missing of_node_put() calls, mostly wrt. for_each_*() OF loops, from Nishka Dasgupta. 6) Double kfree_skb() in peak_usb can driver, from Stephane Grosjean. 7) Need to hold sock across skb->destructor invocation, from Cong Wang. 8) IP header length needs to be validated in ipip tunnel xmit, from Haishuang Yan. 9) Use after free in ip6 tunnel driver, also from Haishuang Yan. 10) Do not use MSI interrupts on r8169 chips before RTL8168d, from Heiner Kallweit. 11) Upon bridge device init failure, we need to delete the local fdb. From Nikolay Aleksandrov. 12) Handle errors from of_get_mac_address() properly in stmmac, from Martin Blumenstingl. 13) Handle concurrent rename vs. dump in netfilter ipset, from Jozsef Kadlecsik. 14) Setting NETIF_F_LLTX on mac80211 causes complete breakage with some devices, so revert. From Johannes Berg. 15) Fix deadlock in rxrpc, from David Howells. 16) Fix Kconfig deps of enetc driver, we must have PHYLIB. From Yue Haibing. 17) Fix mvpp2 crash on module removal, from Matteo Croce. 18) Fix race in genphy_update_link, from Heiner Kallweit. 19) bpf_xdp_adjust_head() stopped working with generic XDP when we fixed generic XDP to support stacked devices properly, fix from Jesper Dangaard Brouer. 20) Unbalanced RCU locking in rt6_update_exception_stamp_rt(), from David Ahern. 
21) Several memory leaks in new sja1105 driver, from Vladimir Oltean" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (214 commits) net: dsa: sja1105: Fix memory leak on meta state machine error path net: dsa: sja1105: Fix memory leak on meta state machine normal path net: dsa: sja1105: Really fix panic on unregistering PTP clock net: dsa: sja1105: Use the LOCKEDS bit for SJA1105 E/T as well net: dsa: sja1105: Fix broken learning with vlan_filtering disabled net: dsa: qca8k: Add of_node_put() in qca8k_setup_mdio_bus() net: sched: sample: allow accessing psample_group with rtnl net: sched: police: allow accessing police->params with rtnl net: hisilicon: Fix dma_map_single failed on arm64 net: hisilicon: fix hip04-xmit never return TX_BUSY net: hisilicon: make hip04_tx_reclaim non-reentrant tc-testing: updated vlan action tests with batch create/delete net sched: update vlan action for batched events operations net: stmmac: tc: Do not return a fragment entry net: stmmac: Fix issues when number of Queues >= 4 net: stmmac: xgmac: Fix XGMAC selftests be2net: disable bh with spin_lock in be_process_mcc net: cxgb3_main: Fix a resource leak in a error path in 'init_one()' net: ethernet: sun4i-emac: Support phy-handle property for finding PHYs net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER ...
2019-08-05SMB3: Kernel oops mounting a encryptData share with CONFIG_DEBUG_VIRTUALSebastien Tisserant
Fix kernel oops when mounting a encryptData CIFS share with CONFIG_DEBUG_VIRTUAL Signed-off-by: Sebastien Tisserant <stisserant@wallix.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-08-05smb3: send CAP_DFS capability during session setupSteve French
We had a report of a server which did not do a DFS referral because the session setup Capabilities field was set to 0 (unlike negotiate protocol where we set CAP_DFS). Better to send it in the session setup capabilities as well (this also more closely matches Windows client behavior). Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org>
2019-08-05SMB3: Fix potential memory leak when processing compound chainPavel Shilovsky
When a reconnect happens in the middle of processing a compound chain the code leaks a buffer from the memory pool. Fix this by properly checking for a return code and freeing buffers in case of error. Also maintain a buf variable to be equal to either smallbuf or bigbuf depending on a response buffer size while parsing a chain and when returning to the caller. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-08-05SMB3: Fix deadlock in validate negotiate hits reconnectPavel Shilovsky
Currently we skip the SMB2_TREE_CONNECT command when checking during reconnect because Tree Connect happens when establishing an SMB session. For the SMB 3.0 protocol version the code also calls validate negotiate, which results in an SMB2_IOCTL command being sent over the wire. This may deadlock on trying to acquire a mutex when checking for reconnect. Fix this by skipping the SMB2_IOCTL command when doing the reconnect check. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: Stable <stable@vger.kernel.org>
2019-08-05dax: dax_layout_busy_page() should not unmap cow pagesVivek Goyal
Vivek: "As of now dax_layout_busy_page() calls unmap_mapping_range() with last argument as 1, which says to unmap even cow pages. I am wondering who needs to get rid of cow pages as well. I noticed one interesting side effect of this. I mounted xfs with -o dax, mmaped a file with MAP_PRIVATE and wrote some data to a page, which created a cow page. Then I called fallocate() on that file to zero a page of the file. fallocate() called dax_layout_busy_page(), which unmapped cow pages as well, and then I tried to read back the data I wrote and what I got was old data from persistent memory. I lost the data I had written. This read basically resulted in a new fault and read back the data from persistent memory. This sounds wrong. Are there any users which need to unmap cow pages as well? If not, I am proposing changing it to not unmap cow pages. I noticed this while writing virtio_fs code, where I tried to reclaim a memory range and that corrupted the executable I was running from virtio-fs, and the program got a segment violation." Dan: "In fact the unmap_mapping_range() in this path is only to synchronize against get_user_pages_fast() and force it to call back into the filesystem to re-establish the mapping. COW pages should be left untouched by dax_layout_busy_page()." Cc: <stable@vger.kernel.org> Fixes: 5fac7408d828 ("mm, fs, dax: handle layout changes to pinned dax mappings") Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/20190802192956.GA3032@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
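The expectation behind this report is the ordinary MAP_PRIVATE contract: data written through a private mapping stays visible to the process that wrote it and never reaches the file. The sketch below demonstrates that contract on a regular temporary file. It does not use DAX or fallocate() and does not reproduce the bug; the helper name `private_write_demo` is hypothetical, and the code assumes a Unix system where Python's `mmap` module exposes `MAP_PRIVATE`.

```python
import mmap
import os
import tempfile


def private_write_demo():
    """Write through a MAP_PRIVATE mapping and return a tuple of
    (bytes seen through the mapping, bytes stored in the file)."""
    page = mmap.PAGESIZE
    with tempfile.TemporaryFile() as f:
        f.write(b"old!" + b"\0" * (page - 4))
        f.flush()
        mapping = mmap.mmap(f.fileno(), page, flags=mmap.MAP_PRIVATE,
                            prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            mapping[0:4] = b"new!"  # the write creates a private (cow) copy
            seen = bytes(mapping[0:4])
            on_disk = os.pread(f.fileno(), 4, 0)
        finally:
            mapping.close()
    return seen, on_disk


seen, on_disk = private_write_demo()
print(seen, on_disk)
```

In the scenario Vivek describes, the read through the mapping returned the file's old contents instead, which is why unmapping cow pages was judged wrong.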
2019-08-05rdma: Enable ib_alloc_cq to spread work over a device's comp_vectorsChuck Lever
Send and Receive completion is handled on a single CPU selected at the time each Completion Queue is allocated. Typically this is when an initiator instantiates an RDMA transport, or when a target accepts an RDMA connection. Some ULPs cannot open a connection per CPU to spread completion workload across available CPUs and MSI vectors. For such ULPs, provide an API that allows the RDMA core to select a completion vector based on the device's complement of available comp_vecs. ULPs that invoke ib_alloc_cq() with only comp_vector 0 are converted to use the new API so that their completion workloads interfere less with each other. Suggested-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Cc: <linux-cifs@vger.kernel.org> Cc: <v9fs-developer@lists.sourceforge.net> Link: https://lore.kernel.org/r/20190729171923.13428.52555.stgit@manet.1015granger.net Signed-off-by: Doug Ledford <dledford@redhat.com>
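The commit message does not say how the RDMA core chooses among the available completion vectors. As an illustration of the general idea only, the sketch below uses a simple round-robin policy. The policy and the `VectorPicker` class are assumptions for this example and should not be read as the RDMA core's implementation.

```python
import itertools


class VectorPicker:
    """Hypothetical round-robin chooser over a device's completion vectors."""

    def __init__(self, num_comp_vectors):
        if num_comp_vectors < 1:
            raise ValueError("a device needs at least one completion vector")
        self.num_comp_vectors = num_comp_vectors
        self._counter = itertools.count()

    def next_vector(self):
        """Return the vector index to use for the next completion queue."""
        return next(self._counter) % self.num_comp_vectors


picker = VectorPicker(4)
print([picker.next_vector() for _ in range(6)])
```

Compared with every caller passing comp_vector 0, any policy of this kind places successive completion queues on different vectors, which is the effect the commit is after.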
2019-08-04cifs: fix rmmod regression in cifs.ko caused by force_sig changesSteve French
Fixes: 72abe3bcf091 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig") The global change from force_sig caused module unloading of cifs.ko to fail (since the cifsd process could not be killed, "rmmod cifs" now would always fail) Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: Eric W. Biederman <ebiederm@xmission.com>