Age | Commit message (Collapse) | Author |
|
Allow the FS_IOC_ADD_ENCRYPTION_KEY and FS_IOC_REMOVE_ENCRYPTION_KEY
ioctls to be used by non-root users to add and remove encryption keys
from the filesystem-level crypto keyrings, subject to limitations.
Motivation: while privileged fscrypt key management is sufficient for
some users (e.g. Android and Chromium OS, where a privileged process
manages all keys), the old API by design also allows non-root users to
set up and use encrypted directories, and we don't want to regress on
that. Especially, we don't want to force users to continue using the
old API, running into the visibility mismatch between files and keyrings
and being unable to "lock" encrypted directories.
Intuitively, the ioctls have to be privileged since they manipulate
filesystem-level state. However, it's actually safe to make them
unprivileged if we very carefully enforce some specific limitations.
First, each key must be identified by a cryptographic hash so that a
user can't add the wrong key for another user's files. For v2
encryption policies, we use the key_identifier for this. v1 policies
don't have this, so managing keys for them remains privileged.
Second, each key a user adds is charged to their quota for the keyrings
service. Thus, a user can't exhaust memory by adding a huge number of
keys. By default each non-root user is allowed up to 200 keys; this can
be changed using the existing sysctl 'kernel.keys.maxkeys'.
Third, if multiple users add the same key, we keep track of those users
of the key (of which there remains a single copy), and won't really
remove the key, i.e. "lock" the encrypted files, until all those users
have removed it. This prevents denial of service attacks that would be
possible under simpler schemes, such allowing the first user who added a
key to remove it -- since that could be a malicious user who has
compromised the key. Of course, encryption keys should be kept secret,
but the idea is that using encryption should never be *less* secure than
not using encryption, even if your key was compromised.
We tolerate that a user will be unable to really remove a key, i.e.
unable to "lock" their encrypted files, if another user has added the
same key. But in a sense, this is actually a good thing because it will
avoid providing a false notion of security where a key appears to have
been removed when actually it's still in memory, available to any
attacker who compromises the operating system kernel.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Add a new fscrypt policy version, "v2". It has the following changes
from the original policy version, which we call "v1" (*):
- Master keys (the user-provided encryption keys) are only ever used as
input to HKDF-SHA512. This is more flexible and less error-prone, and
it avoids the quirks and limitations of the AES-128-ECB based KDF.
Three classes of cryptographically isolated subkeys are defined:
- Per-file keys, like used in v1 policies except for the new KDF.
- Per-mode keys. These implement the semantics of the DIRECT_KEY
flag, which for v1 policies made the master key be used directly.
These are also planned to be used for inline encryption when
support for it is added.
- Key identifiers (see below).
- Each master key is identified by a 16-byte master_key_identifier,
which is derived from the key itself using HKDF-SHA512. This prevents
users from associating the wrong key with an encrypted file or
directory. This was easily possible with v1 policies, which
identified the key by an arbitrary 8-byte master_key_descriptor.
- The key must be provided in the filesystem-level keyring, not in a
process-subscribed keyring.
The following UAPI additions are made:
- The existing ioctl FS_IOC_SET_ENCRYPTION_POLICY can now be passed a
fscrypt_policy_v2 to set a v2 encryption policy. It's disambiguated
from fscrypt_policy/fscrypt_policy_v1 by the version code prefix.
- A new ioctl FS_IOC_GET_ENCRYPTION_POLICY_EX is added. It allows
getting the v1 or v2 encryption policy of an encrypted file or
directory. The existing FS_IOC_GET_ENCRYPTION_POLICY ioctl could not
be used because it did not have a way for userspace to indicate which
policy structure is expected. The new ioctl includes a size field, so
it is extensible to future fscrypt policy versions.
- The ioctls FS_IOC_ADD_ENCRYPTION_KEY, FS_IOC_REMOVE_ENCRYPTION_KEY,
and FS_IOC_GET_ENCRYPTION_KEY_STATUS now support managing keys for v2
encryption policies. Such keys are kept logically separate from keys
for v1 encryption policies, and are identified by 'identifier' rather
than by 'descriptor'. The 'identifier' need not be provided when
adding a key, since the kernel will calculate it anyway.
This patch temporarily keeps adding/removing v2 policy keys behind the
same permission check done for adding/removing v1 policy keys:
capable(CAP_SYS_ADMIN). However, the next patch will carefully take
advantage of the cryptographically secure master_key_identifier to allow
non-root users to add/remove v2 policy keys, thus providing a full
replacement for v1 policies.
(*) Actually, in the API fscrypt_policy::version is 0 while on-disk
fscrypt_context::format is 1. But I believe it makes the most sense
to advance both to '2' to have them be in sync, and to consider the
numbering to start at 1 except for the API quirk.
Reviewed-by: Paul Crowley <paulcrowley@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Add an implementation of HKDF (RFC 5869) to fscrypt, for the purpose of
deriving additional key material from the fscrypt master keys for v2
encryption policies. HKDF is a key derivation function built on top of
HMAC. We choose SHA-512 for the underlying unkeyed hash, and use an
"hmac(sha512)" transform allocated from the crypto API.
We'll be using this to replace the AES-ECB based KDF currently used to
derive the per-file encryption keys. While the AES-ECB based KDF is
believed to meet the original security requirements, it is nonstandard
and has problems that don't exist in modern KDFs such as HKDF:
1. It's reversible. Given a derived key and nonce, an attacker can
easily compute the master key. This is okay if the master key and
derived keys are equally hard to compromise, but now we'd like to be
more robust against threats such as a derived key being compromised
through a timing attack, or a derived key for an in-use file being
compromised after the master key has already been removed.
2. It doesn't evenly distribute the entropy from the master key; each 16
input bytes only affects the corresponding 16 output bytes.
3. It isn't easily extensible to deriving other values or keys, such as
a public hash for securely identifying the key, or per-mode keys.
Per-mode keys will be immediately useful for Adiantum encryption, for
which fscrypt currently uses the master key directly, introducing
unnecessary usage constraints. Per-mode keys will also be useful for
hardware inline encryption, which is currently being worked on.
HKDF solves all the above problems.
Reviewed-by: Paul Crowley <paulcrowley@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Add a new fscrypt ioctl, FS_IOC_GET_ENCRYPTION_KEY_STATUS. Given a key
specified by 'struct fscrypt_key_specifier' (the same way a key is
specified for the other fscrypt key management ioctls), it returns
status information in a 'struct fscrypt_get_key_status_arg'.
The main motivation for this is that applications need to be able to
check whether an encrypted directory is "unlocked" or not, so that they
can add the key if it is not, and avoid adding the key (which may
involve prompting the user for a passphrase) if it already is.
It's possible to use some workarounds such as checking whether opening a
regular file fails with ENOKEY, or checking whether the filenames "look
like gibberish" or not. However, no workaround is usable in all cases.
Like the other key management ioctls, the keyrings syscalls may seem at
first to be a good fit for this. Unfortunately, they are not. Even if
we exposed the keyring ID of the ->s_master_keys keyring and gave
everyone Search permission on it (note: currently the keyrings
permission system would also allow everyone to "invalidate" the keyring
too), the fscrypt keys have an additional state that doesn't map cleanly
to the keyrings API: the secret can be removed, but we can be still
tracking the files that were using the key, and the removal can be
re-attempted or the secret added again.
After later patches, some applications will also need a way to determine
whether a key was added by the current user vs. by some other user.
Reserved fields are included in fscrypt_get_key_status_arg for this and
other future extensions.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Add a new fscrypt ioctl, FS_IOC_REMOVE_ENCRYPTION_KEY. This ioctl
removes an encryption key that was added by FS_IOC_ADD_ENCRYPTION_KEY.
It wipes the secret key itself, then "locks" the encrypted files and
directories that had been unlocked using that key -- implemented by
evicting the relevant dentries and inodes from the VFS caches.
The problem this solves is that many fscrypt users want the ability to
remove encryption keys, causing the corresponding encrypted directories
to appear "locked" (presented in ciphertext form) again. Moreover,
users want removing an encryption key to *really* remove it, in the
sense that the removed keys cannot be recovered even if kernel memory is
compromised, e.g. by the exploit of a kernel security vulnerability or
by a physical attack. This is desirable after a user logs out of the
system, for example. In many cases users even already assume this to be
the case and are surprised to hear when it's not.
It is not sufficient to simply unlink the master key from the keyring
(or to revoke or invalidate it), since the actual encryption transform
objects are still pinned in memory by their inodes. Therefore, to
really remove a key we must also evict the relevant inodes.
Currently one workaround is to run 'sync && echo 2 >
/proc/sys/vm/drop_caches'. But, that evicts all unused inodes in the
system rather than just the inodes associated with the key being
removed, causing severe performance problems. Moreover, it requires
root privileges, so regular users can't "lock" their encrypted files.
Another workaround, used in Chromium OS kernels, is to add a new
VFS-level ioctl FS_IOC_DROP_CACHE which is a more restricted version of
drop_caches that operates on a single super_block. It does:
shrink_dcache_sb(sb);
invalidate_inodes(sb, false);
But it's still a hack. Yet, the major users of filesystem encryption
want this feature badly enough that they are actually using these hacks.
To properly solve the problem, start maintaining a list of the inodes
which have been "unlocked" using each master key. Originally this
wasn't possible because the kernel didn't keep track of in-use master
keys at all. But, with the ->s_master_keys keyring it is now possible.
Then, add an ioctl FS_IOC_REMOVE_ENCRYPTION_KEY. It finds the specified
master key in ->s_master_keys, then wipes the secret key itself, which
prevents any additional inodes from being unlocked with the key. Then,
it syncs the filesystem and evicts the inodes in the key's list. The
normal inode eviction code will free and wipe the per-file keys (in
->i_crypt_info). Note that freeing ->i_crypt_info without evicting the
inodes was also considered, but would have been racy.
Some inodes may still be in use when a master key is removed, and we
can't simply revoke random file descriptors, mmap's, etc. Thus, the
ioctl simply skips in-use inodes, and returns -EBUSY to indicate that
some inodes weren't evicted. The master key *secret* is still removed,
but the fscrypt_master_key struct remains to keep track of the remaining
inodes. Userspace can then retry the ioctl to evict the remaining
inodes. Alternatively, if userspace adds the key again, the refreshed
secret will be associated with the existing list of inodes so they
remain correctly tracked for future key removals.
The ioctl doesn't wipe pagecache pages. Thus, we tolerate that after a
kernel compromise some portions of plaintext file contents may still be
recoverable from memory. This can be solved by enabling page poisoning
system-wide, which security conscious users may choose to do. But it's
very difficult to solve otherwise, e.g. note that plaintext file
contents may have been read in other places than pagecache pages.
Like FS_IOC_ADD_ENCRYPTION_KEY, FS_IOC_REMOVE_ENCRYPTION_KEY is
initially restricted to privileged users only. This is sufficient for
some use cases, but not all. A later patch will relax this restriction,
but it will require introducing key hashes, among other changes.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY. This ioctl adds an
encryption key to the filesystem's fscrypt keyring ->s_master_keys,
making any files encrypted with that key appear "unlocked".
Why we need this
~~~~~~~~~~~~~~~~
The main problem is that the "locked/unlocked" (ciphertext/plaintext)
status of encrypted files is global, but the fscrypt keys are not.
fscrypt only looks for keys in the keyring(s) the process accessing the
filesystem is subscribed to: the thread keyring, process keyring, and
session keyring, where the session keyring may contain the user keyring.
Therefore, userspace has to put fscrypt keys in the keyrings for
individual users or sessions. But this means that when a process with a
different keyring tries to access encrypted files, whether they appear
"unlocked" or not is nondeterministic. This is because it depends on
whether the files are currently present in the inode cache.
Fixing this by consistently providing each process its own view of the
filesystem depending on whether it has the key or not isn't feasible due
to how the VFS caches work. Furthermore, while sometimes users expect
this behavior, it is misguided for two reasons. First, it would be an
OS-level access control mechanism largely redundant with existing access
control mechanisms such as UNIX file permissions, ACLs, LSMs, etc.
Encryption is actually for protecting the data at rest.
Second, almost all users of fscrypt actually do need the keys to be
global. The largest users of fscrypt, Android and Chromium OS, achieve
this by having PID 1 create a "session keyring" that is inherited by
every process. This works, but it isn't scalable because it prevents
session keyrings from being used for any other purpose.
On general-purpose Linux distros, the 'fscrypt' userspace tool [1] can't
similarly abuse the session keyring, so to make 'sudo' work on all
systems it has to link all the user keyrings into root's user keyring
[2]. This is ugly and raises security concerns. Moreover it can't make
the keys available to system services, such as sshd trying to access the
user's '~/.ssh' directory (see [3], [4]) or NetworkManager trying to
read certificates from the user's home directory (see [5]); or to Docker
containers (see [6], [7]).
By having an API to add a key to the *filesystem* we'll be able to fix
the above bugs, remove userspace workarounds, and clearly express the
intended semantics: the locked/unlocked status of an encrypted directory
is global, and encryption is orthogonal to OS-level access control.
Why not use the add_key() syscall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We use an ioctl for this API rather than the existing add_key() system
call because the ioctl gives us the flexibility needed to implement
fscrypt-specific semantics that will be introduced in later patches:
- Supporting key removal with the semantics such that the secret is
removed immediately and any unused inodes using the key are evicted;
also, the eviction of any in-use inodes can be retried.
- Calculating a key-dependent cryptographic identifier and returning it
to userspace.
- Allowing keys to be added and removed by non-root users, but only keys
for v2 encryption policies; and to prevent denial-of-service attacks,
users can only remove keys they themselves have added, and a key is
only really removed after all users who added it have removed it.
Trying to shoehorn these semantics into the keyrings syscalls would be
very difficult, whereas the ioctls make things much easier.
However, to reuse code the implementation still uses the keyrings
service internally. Thus we get lockless RCU-mode key lookups without
having to re-implement it, and the keys automatically show up in
/proc/keys for debugging purposes.
References:
[1] https://github.com/google/fscrypt
[2] https://goo.gl/55cCrI#heading=h.vf09isp98isb
[3] https://github.com/google/fscrypt/issues/111#issuecomment-444347939
[4] https://github.com/google/fscrypt/issues/116
[5] https://bugs.launchpad.net/ubuntu/+source/fscrypt/+bug/1770715
[6] https://github.com/google/fscrypt/issues/128
[7] https://askubuntu.com/questions/1130306/cannot-run-docker-on-an-encrypted-filesystem
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Rename keyinfo.c to keysetup.c since this better describes what the file
does (sets up the key), and it matches the new file keysetup_v1.c.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
In preparation for introducing v2 encryption policies which will find
and derive encryption keys differently from the current v1 encryption
policies, move the v1 policy-specific key setup code from keyinfo.c into
keysetup_v1.c.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Do some more refactoring of the key setup code, in preparation for
introducing a filesystem-level keyring and v2 encryption policies:
- Now that ci_inode exists, don't pass around the inode unnecessarily.
- Define a function setup_file_encryption_key() which handles the crypto
key setup given an under-construction fscrypt_info. Don't pass the
fscrypt_context, since everything is in the fscrypt_info.
[This will be extended for v2 policies and the fs-level keyring.]
- Define a function fscrypt_set_derived_key() which sets the per-file
key, without depending on anything specific to v1 policies.
[This will also be used for v2 policies.]
- Define a function fscrypt_setup_v1_file_key() which takes the raw
master key, thus separating finding the key from using it.
[This will also be used if the key is found in the fs-level keyring.]
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
In preparation for introducing a filesystem-level keyring which will
contain fscrypt master keys, rename the existing 'struct
fscrypt_master_key' to 'struct fscrypt_direct_key'. This is the
structure in the existing table of master keys that's maintained to
deduplicate the crypto transforms for v1 DIRECT_KEY policies.
I've chosen to keep this table as-is rather than make it automagically
add/remove the keys to/from the filesystem-level keyring, since that
would add a lot of extra complexity to the filesystem-level keyring.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Add an inode back-pointer to 'struct fscrypt_info', such that
inode->i_crypt_info->ci_inode == inode.
This will be useful for:
1. Evicting the inodes when a fscrypt key is removed, since we'll track
the inodes using a given key by linking their fscrypt_infos together,
rather than the inodes directly. This avoids bloating 'struct inode'
with a new list_head.
2. Simplifying the per-file key setup, since the inode pointer won't
have to be passed around everywhere just in case something goes wrong
and it's needed for fscrypt_warn().
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Update fs/crypto/ to use the new names for the UAPI constants rather
than the old names, then make the old definitions conditional on
!__KERNEL__.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Return ENOPKG rather than ENOENT when trying to open a file that's
encrypted using algorithms not available in the kernel's crypto API.
This avoids an ambiguity, since ENOENT is also returned when the file
doesn't exist.
Note: this is the same approach I'm taking for fs-verity.
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Users of fscrypt with non-default algorithms will encounter an error
like the following if they fail to include the needed algorithms into
the crypto API when configuring the kernel (as per the documentation):
Error allocating 'adiantum(xchacha12,aes)' transform: -2
This requires that the user figure out what the "-2" error means.
Make it more friendly by printing a warning like the following instead:
Missing crypto API support for Adiantum (API name: "adiantum(xchacha12,aes)")
Also upgrade the log level for *other* errors to KERN_ERR.
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
When fs/crypto/ encounters an inode with an invalid encryption context,
currently it prints a warning if the pair of encryption modes are
unrecognized, but it's silent if there are other problems such as
unsupported context size, format, or flags. To help people debug such
situations, add more warning messages.
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Most of the warning and error messages in fs/crypto/ are for situations
related to a specific inode, not merely to a super_block. So to make
things easier, make fscrypt_msg() take an inode rather than a
super_block, and make it print the inode number.
Note: This is the same approach I'm taking for fsverity_msg().
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Some minor cleanups for the code that base64 encodes and decodes
encrypted filenames and long name digests:
- Rename "digest_{encode,decode}()" => "base64_{encode,decode}()" since
they are used for filenames too, not just for long name digests.
- Replace 'while' loops with more conventional 'for' loops.
- Use 'u8' for binary data. Keep 'char' for string data.
- Fully constify the lookup table (pointer was not const).
- Improve comment.
No actual change in behavior.
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Since commit 643fa9612bf1 ("fscrypt: remove filesystem specific build
config option"), fs/crypto/ can no longer be built as a loadable module.
Thus it no longer needs a module_exit function, nor a MODULE_LICENSE.
So remove them, and change module_init to late_initcall.
Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
As of now, setting watches on filesystem objects has, at most, applied a
check for read access to the inode, and in the case of fanotify, requires
CAP_SYS_ADMIN. No specific security hook or permission check has been
provided to control the setting of watches. Using any of inotify, dnotify,
or fanotify, it is possible to observe, not only write-like operations, but
even read access to a file. Modeling the watch as being merely a read from
the file is insufficient for the needs of SELinux. This is due to the fact
that read access should not necessarily imply access to information about
when another process reads from a file. Furthermore, fanotify watches grant
more power to an application in the form of permission events. While
notification events are solely, unidirectional (i.e. they only pass
information to the receiving application), permission events are blocking.
Permission events make a request to the receiving application which will
then reply with a decision as to whether or not that action may be
completed. This causes the issue of the watching application having the
ability to exercise control over the triggering process. Without drawing a
distinction within the permission check, the ability to read would imply
the greater ability to control an application. Additionally, mount and
superblock watches apply to all files within the same mount or superblock.
Read access to one file should not necessarily imply the ability to watch
all files accessed within a given mount or superblock.
In order to solve these issues, a new LSM hook is implemented and has been
placed within the system calls for marking filesystem objects with inotify,
fanotify, and dnotify watches. These calls to the hook are placed at the
point at which the target path has been resolved and are provided with the
path struct, the mask of requested notification events, and the type of
object on which the mark is being set (inode, superblock, or mount). The
mask and obj_type have already been translated into common FS_* values
shared by the entirety of the fs notification infrastructure. The path
struct is passed rather than just the inode so that the mount is available,
particularly for mount watches. This also allows for use of the hook by
pathname-based security modules. However, since the hook is intended for
use even by inode based security modules, it is not placed under the
CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security
modules would need to enable all of the path hooks, even though they do not
use any of them.
This only provides a hook at the point of setting a watch, and presumes
that permission to set a particular watch implies the ability to receive
all notification about that object which match the mask. This is all that
is required for SELinux. If other security modules require additional hooks
or infrastructure to control delivery of notification, these can be added
by them. It does not make sense for us to propose hooks for which we have
no implementation. The understanding that all notifications received by the
requesting application are all strictly of a type for which the application
has been granted permission shows that this implementation is sufficient in
its coverage.
Security modules wishing to provide complete control over fanotify must
also implement a security_file_open hook that validates that the access
requested by the watching application is authorized. Fanotify has the issue
that it returns a file descriptor with the file mode specified during
fanotify_init() to the watching process on event. This is already covered
by the LSM security_file_open hook if the security module implements
checking of the requested file mode there. Otherwise, a watching process
can obtain escalated access to a file for which it has not been authorized.
The selinux_path_notify hook implementation works by adding five new file
permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm
(descriptions about which will follow), and one new filesystem permission:
watch (which is applied to superblock checks). The hook then decides which
subset of these permissions must be held by the requesting application
based on the contents of the provided mask and the obj_type. The
selinux_file_open hook already checks the requested file mode and therefore
ensures that a watching process cannot escalate its access through
fanotify.
The watch, watch_mount, and watch_sb permissions are the baseline
permissions for setting a watch on an object and each are a requirement for
any watch to be set on a file, mount, or superblock respectively. It should
be noted that having either of the other two permissions (watch_reads and
watch_with_perm) does not imply the watch, watch_mount, or watch_sb
permission. Superblock watches further require the filesystem watch
permission to the superblock. As there is no labeled object in view for
mounts, there is no specific check for mount watches beyond watch_mount to
the inode. Such a check could be added in the future, if a suitable labeled
object existed representing the mount.
The watch_reads permission is required to receive notifications from
read-exclusive events on filesystem objects. These events include accessing
a file for the purpose of reading and closing a file which has been opened
read-only. This distinction has been drawn in order to provide a direct
indication in the policy for this otherwise not obvious capability. Read
access to a file should not necessarily imply the ability to observe read
events on a file.
Finally, watch_with_perm only applies to fanotify masks since it is the
only way to set a mask which allows for the blocking, permission event.
This permission is needed for any watch which is of this type. Though
fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit
trust to root, which we do not do, and does not support least privilege.
Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
Currently when the call to ext4_htree_store_dirent fails the error return
variable 'ret' is is not being set to the error code and variable count is
instead, hence the error code is not being returned. Fix this by assigning
ret to the error return code.
Addresses-Coverity: ("Unused value")
Fixes: 8af0f0822797 ("ext4: fix readdir error in the case of inline_data+dir_index")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Originally, support for expanded timestamps had a bug in that pre-1970
times were erroneously encoded as being in the the 24th century. This
was fixed in commit a4dad1ae24f8 ("ext4: Fix handling of extended
tv_sec") which landed in 4.4. Starting with 4.4, pre-1970 timestamps
were correctly encoded, but for backwards compatibility those
incorrectly encoded timestamps were mapped back to the pre-1970 dates.
Given that backwards compatibility workaround has been around for 4
years, and given that running e2fsck from e2fsprogs 1.43.2 and later
will offer to fix these timestamps (which has been released for 3
years), it's past time to drop the legacy workaround from the kernel.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Zorro Lang reported a crash in generic/475 if we try to inactivate a
corrupt inode with a NULL attr fork (stack trace shortened somewhat):
RIP: 0010:xfs_bmapi_read+0x311/0xb00 [xfs]
RSP: 0018:ffff888047f9ed68 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff888047f9f038 RCX: 1ffffffff5f99f51
RDX: 0000000000000002 RSI: 0000000000000008 RDI: 0000000000000012
RBP: ffff888002a41f00 R08: ffffed10005483f0 R09: ffffed10005483ef
R10: ffffed10005483ef R11: ffff888002a41f7f R12: 0000000000000004
R13: ffffe8fff53b5768 R14: 0000000000000005 R15: 0000000000000001
FS: 00007f11d44b5b80(0000) GS:ffff888114200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000ef6000 CR3: 000000002e176003 CR4: 00000000001606e0
Call Trace:
xfs_dabuf_map.constprop.18+0x696/0xe50 [xfs]
xfs_da_read_buf+0xf5/0x2c0 [xfs]
xfs_da3_node_read+0x1d/0x230 [xfs]
xfs_attr_inactive+0x3cc/0x5e0 [xfs]
xfs_inactive+0x4c8/0x5b0 [xfs]
xfs_fs_destroy_inode+0x31b/0x8e0 [xfs]
destroy_inode+0xbc/0x190
xfs_bulkstat_one_int+0xa8c/0x1200 [xfs]
xfs_bulkstat_one+0x16/0x20 [xfs]
xfs_bulkstat+0x6fa/0xf20 [xfs]
xfs_ioc_bulkstat+0x182/0x2b0 [xfs]
xfs_file_ioctl+0xee0/0x12a0 [xfs]
do_vfs_ioctl+0x193/0x1000
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x6f/0xb0
do_syscall_64+0x9f/0x4d0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f11d39a3e5b
The "obvious" cause is that the attr ifork is null despite the inode
claiming an attr fork having at least one extent, but it's not so
obvious why we ended up with an inode in that state.
Reported-by: Zorro Lang <zlang@redhat.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204031
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
|
|
Continue our game of replacing ASSERTs for corrupt ondisk metadata with
EFSCORRUPTED returns.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
|
|
We need the driver core fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
For debugging reasons, it's useful to know the contents of the extent
cache. Since the extent cache contains much of what is in the fiemap
ioctl, use an fiemap-style interface to return this information.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The new ioctl EXT4_IOC_GETSTATE returns some of the dynamic state of
an ext4 inode for debugging purposes.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The new ioctl EXT4_IOC_CLEAR_ES_CACHE will force an inode's extent
status cache to be cleared out. This is intended for use for
debugging.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When executing generic/388 on a ppc64le machine, we notice the following
call trace,
VFS: brelse: Trying to free free buffer
WARNING: CPU: 0 PID: 6637 at /root/repos/linux/fs/buffer.c:1195 __brelse+0x84/0xc0
Call Trace:
__brelse+0x80/0xc0 (unreliable)
invalidate_bh_lru+0x78/0xc0
on_each_cpu_mask+0xa8/0x130
on_each_cpu_cond_mask+0x130/0x170
invalidate_bh_lrus+0x44/0x60
invalidate_bdev+0x38/0x70
ext4_put_super+0x294/0x560
generic_shutdown_super+0xb0/0x170
kill_block_super+0x38/0xb0
deactivate_locked_super+0xa4/0xf0
cleanup_mnt+0x164/0x1d0
task_work_run+0x110/0x160
do_notify_resume+0x414/0x460
ret_from_except_lite+0x70/0x74
The warning happens because flush_descriptor() drops bh reference it
does not own. The bh reference acquired by
jbd2_journal_get_descriptor_buffer() is owned by the log_bufs list and
gets released when this list is processed. The reference for doing IO is
only acquired in write_dirty_buffer() later in flush_descriptor().
Reported-by: Harish Sriram <harish@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Remove unnecessary error check in ext4_file_write_iter(),
because this check will be done in upcoming later function --
ext4_write_checks() -> generic_write_checks()
Change-Id: I7b0ab27f693a50765c15b5eaa3f4e7c38f42e01e
Signed-off-by: shisiyuan <shisiyuan@xiaomi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
mkfs.ext4 -O inline_data /dev/vdb
mount -o dioread_nolock /dev/vdb /mnt
echo "some inline data..." >> /mnt/test-file
echo "some inline data..." >> /mnt/test-file
sync
The above script will trigger "WARN_ON(!io_end->handle && sbi->s_journal)"
because ext4_should_dioread_nolock() returns false for a file with inline
data. Move the check to a place after we have already removed the inline
data and prepared inode to write normal pages.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fixes from Dan Williams:
"A filesystem-dax and device-dax fix for v5.3.
The filesystem-dax fix is tagged for stable as the implementation has
been mistakenly throwing away all cow pages on any truncate or hole
punch operation as part of the solution to coordinate device-dma vs
truncate to dax pages.
The device-dax change fixes up a regression this cycle from the
introduction of a common 'internal per-cpu-ref' implementation.
Summary:
- Fix dax_layout_busy_page() to not discard private cow pages of
fs/dax private mappings.
- Update the memremap_pages core to properly cleanup on behalf of
internal reference-count users like device-dax"
* tag 'dax-fixes-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
mm/memremap: Fix reuse of pgmap instances with internal references
dax: dax_layout_busy_page() should not unmap cow pages
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Andreas Gruenbacher:
"Fix incorrect lseek / fiemap results"
* tag 'gfs2-v5.3-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: gfs2_walk_metadata fix
|
|
Pull block fixes from Jens Axboe:
- Revert of a bcache patch that caused an oops for some (Coly)
- ata rb532 unused warning fix (Gustavo)
- AoE kernel crash fix (He)
- Error handling fixup for blkdev_get() (Jan)
- libata read/write translation and SFF PIO fix (me)
- Use after free and error handling fix for O_DIRECT fragments. There's
still a nowait + sync oddity in there, we'll nail that start next
week. If all else fails, I'll queue a revert of the NOWAIT change.
(me)
- Loop GFP_KERNEL -> GFP_NOIO deadlock fix (Mikulas)
- Two BFQ regression fixes that caused crashes (Paolo)
* tag 'for-linus-20190809' of git://git.kernel.dk/linux-block:
bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()"
loop: set PF_MEMALLOC_NOIO for the worker thread
bdev: Fixup error handling in blkdev_get()
block, bfq: handle NULL return value by bfq_init_rq()
block, bfq: move update of waker and woken list to queue freeing
block, bfq: reset last_completed_rq_bfqq if the pointed queue is freed
block: aoe: Fix kernel crash due to atomic sleep when exiting
libata: add SG safety checks in SFF pio transfers
libata: have ata_scsi_rw_xlat() fail invalid passthrough requests
block: fix O_DIRECT error handling for bio fragments
ata: rb532_cf: Fix unused variable warning in rb532_pata_driver_probe
|
|
In gfs2_alloc_inode, when kmem_cache_alloc cannot allocate a new object, return
NULL immediately. The code currently relies on the fact that i_inode is the
first member in struct gfs2_inode and so ip and &ip->i_inode evaluate to the
same address, but that isn't immediately obvious.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
|
|
iomap handles all the nitty-gritty details of zeroing a file
range for us, so use the proper helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
|
|
Add support for the IOMAP_ZERO iomap operation so that iomap_zero_range will
work as expected. In the IOMAP_ZERO case, the caller of iomap_zero_range is
responsible for taking an exclusive glock on the inode, so we need no
additional locking in gfs2_iomap_begin.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
|
|
Following commit d0a22a4b03b8 ("gfs2: Fix iomap write page reclaim deadlock"),
gfs2_iomap_begin and gfs2_iomap_begin_write can be further cleaned up and the
split between those two functions can be improved.
With suggestions from Christoph Hellwig.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
|
|
It turns out that the current version of gfs2_metadata_walker suffers
from multiple problems that can cause gfs2_hole_size to report an
incorrect size. This will confuse fiemap as well as lseek with the
SEEK_DATA flag.
Fix that by changing gfs2_hole_walker to compute the metapath to the
first data block after the hole (if any), and compute the hole size
based on that.
Fixes xfstest generic/490.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Cc: stable@vger.kernel.org # v4.18+
|
|
Secure Encrypted Virtualization is an x86-specific feature, so it shouldn't
appear in generic kernel code because it forces non-x86 architectures to
define the sev_active() function, which doesn't make a lot of sense.
To solve this problem, add an x86 elfcorehdr_read() function to override
the generic weak implementation. To do that, it's necessary to make
read_from_oldmem() public so that it can be used outside of vmcore.c.
Also, remove the export for sev_active() since it's only used in files that
won't be built as modules.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190806044919.10622-6-bauerman@linux.ibm.com
|
|
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- NFSv4: Ensure we check the return value of update_open_stateid() so
we correctly track active open state.
- NFSv4: Fix for delegation state recovery to ensure we recover all
open modes that are active.
- NFSv4: Fix an Oops in nfs4_do_setattr
Fixes:
- NFS: Fix regression whereby fscache errors are appearing on 'nofsc'
mounts
- NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()
- NFSv4: Fix a credential refcount leak in nfs41_check_delegation_stateid
- pNFS: Report errors from the call to nfs4_select_rw_stateid()
- NFSv4: Various other delegation and open stateid recovery fixes
- NFSv4: Fix state recovery behaviour when server connection times
out"
* tag 'nfs-for-5.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Ensure state recovery handles ETIMEDOUT correctly
NFS: Fix regression whereby fscache errors are appearing on 'nofsc' mounts
NFSv4: Fix an Oops in nfs4_do_setattr
NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()
NFSv4: Check the return value of update_open_stateid()
NFSv4.1: Only reap expired delegations
NFSv4.1: Fix open stateid recovery
NFSv4: Report the error from nfs4_select_rw_stateid()
NFSv4: When recovering state fails with EAGAIN, retry the same recovery
NFSv4: Print an error in the syslog when state is marked as irrecoverable
NFSv4: Fix delegation state recovery
NFSv4: Fix a credential refcount leak in nfs41_check_delegation_stateid
|
|
Pull cifs fixes from Steve French:
"Six small SMB3 fixes, two for stable"
* tag '5.3-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
SMB3: Kernel oops mounting a encryptData share with CONFIG_DEBUG_VIRTUAL
smb3: update TODO list of missing features
smb3: send CAP_DFS capability during session setup
SMB3: Fix potential memory leak when processing compound chain
SMB3: Fix deadlock in validate negotiate hits reconnect
cifs: fix rmmod regression in cifs.ko caused by force_sig changes
|
|
Commit 89e524c04fa9 ("loop: Fix mount(2) failure due to race with
LOOP_SET_FD") converted blkdev_get() to use the new helpers for
finishing claiming of a block device. However the conversion botched the
error handling in blkdev_get() and thus the bdev has been marked as held
even in case __blkdev_get() returned error. This led to occasional
warnings with block/001 test from blktests like:
kernel: WARNING: CPU: 5 PID: 907 at fs/block_dev.c:1899 __blkdev_put+0x396/0x3a0
Correct the error handling.
CC: stable@vger.kernel.org
Fixes: 89e524c04fa9 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When building with W=1, we get some kerneldoc warnings:
CC fs/fhandle.o
fs/fhandle.c:259: warning: Function parameter or member 'flags' not described in 'sys_open_by_handle_at'
fs/fhandle.c:259: warning: Excess function parameter 'flag' description in 'sys_open_by_handle_at'
Fix the typo that caused it.
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
0eb6ddfb865c tried to fix this up, but introduced a use-after-free
of dio. Additionally, we still had an issue with error handling,
as reported by Darrick:
"I noticed a regression in xfs/747 (an unreleased xfstest for the
xfs_scrub media scanning feature) on 5.3-rc3. I'll condense that down
to a simpler reproducer:
error-test: 0 209 linear 8:48 0
error-test: 209 1 error
error-test: 210 6446894 linear 8:48 210
Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
for sector 209 and to pass the io to the scsi device everywhere else.
On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
(in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
EIO like you'd expect:
pread64(3, 0x7f880e1c7000, 1048576, 0) = -1 EIO (Input/output error)
pread: Input/output error
+++ exited with 0 +++
But doing it with a larger buffer succeeds(!):
pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
read 1146880/1146880 bytes at offset 0
1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
+++ exited with 0 +++
(Note that the part of the buffer corresponding to the dm-error area is
uninitialized)
On 5.3-rc2, both commands would fail with EIO like you'd expect. The
only change between rc2 and rc3 is commit 0eb6ddfb865c ("block: Fix
__blkdev_direct_IO() for bio fragments").
AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
and a second one for the 96k after that."
Fix this by noting that it's always safe to dereference dio if we get
BLK_QC_T_EAGAIN returned, as end_io hasn't been run for that case. So
we can safely increment the dio size before calling submit_bio(), and
then decrement it on failure (not that it really matters, as the bio
and dio are going away).
For error handling, return to the original method of just using 'ret'
for tracking the error, and the size tracking in dio->size.
Fixes: 0eb6ddfb865c ("block: Fix __blkdev_direct_IO() for bio fragments")
Fixes: 6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Ensure that the state recovery code handles ETIMEDOUT correctly,
and also that we set RPC_TASK_TIMEOUT when recovering open state.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Normally the range->len is set to default value (U64_MAX), but when it's
not default value, we should check if the range overflows.
And if it overflows, return -EINVAL before doing anything.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the 5.3 merge window, commit 7c7e301406d0a9 ("btrfs: sysfs: Replace
default_attrs in ktypes with groups"), we started using the member
"defaults_groups" for the kobject type "btrfs_raid_ktype". That leads
to a series of warnings when running some test cases of fstests, such
as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those
warnings are like the following:
[116648.059212] kernfs: can not remove 'total_bytes', no directory
[116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1
(...)
[116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
[116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000
[116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
[116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001
[116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
[116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
[116648.077143] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
[116648.077852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
[116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[116648.080585] Call Trace:
[116648.081262] remove_files+0x31/0x70
[116648.081929] sysfs_remove_group+0x38/0x80
[116648.082596] sysfs_remove_groups+0x34/0x70
[116648.083258] kobject_del+0x20/0x60
[116648.083933] btrfs_free_block_groups+0x405/0x430 [btrfs]
[116648.084608] close_ctree+0x19a/0x380 [btrfs]
[116648.085278] generic_shutdown_super+0x6c/0x110
[116648.085951] kill_anon_super+0xe/0x30
[116648.086621] btrfs_kill_super+0x12/0xa0 [btrfs]
[116648.087289] deactivate_locked_super+0x3a/0x70
[116648.087956] cleanup_mnt+0xb4/0x160
[116648.088620] task_work_run+0x7e/0xc0
[116648.089285] exit_to_usermode_loop+0xfa/0x100
[116648.089933] do_syscall_64+0x1cb/0x220
[116648.090567] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[116648.091197] RIP: 0033:0x7f9cdc073b37
(...)
[116648.100046] ---[ end trace 22e24db328ccadf8 ]---
[116648.100618] ------------[ cut here ]------------
[116648.101175] kernfs: can not remove 'used_bytes', no directory
[116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1
(...)
[116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
[116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000
[116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
[116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001
[116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
[116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
[116648.113268] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
[116648.113939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
[116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[116648.116649] Call Trace:
[116648.117326] remove_files+0x31/0x70
[116648.117997] sysfs_remove_group+0x38/0x80
[116648.118671] sysfs_remove_groups+0x34/0x70
[116648.119342] kobject_del+0x20/0x60
[116648.120022] btrfs_free_block_groups+0x405/0x430 [btrfs]
[116648.120707] close_ctree+0x19a/0x380 [btrfs]
[116648.121396] generic_shutdown_super+0x6c/0x110
[116648.122057] kill_anon_super+0xe/0x30
[116648.122702] btrfs_kill_super+0x12/0xa0 [btrfs]
[116648.123335] deactivate_locked_super+0x3a/0x70
[116648.123961] cleanup_mnt+0xb4/0x160
[116648.124586] task_work_run+0x7e/0xc0
[116648.125210] exit_to_usermode_loop+0xfa/0x100
[116648.125830] do_syscall_64+0x1cb/0x220
[116648.126463] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[116648.127080] RIP: 0033:0x7f9cdc073b37
(...)
[116648.135923] ---[ end trace 22e24db328ccadf9 ]---
These happen because, during the unmount path, we call kobject_del() for
raid kobjects that are not fully initialized, meaning that we set their
ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set
their parent kobject, which is done through btrfs_add_raid_kobjects().
We have this split raid kobject setup since commit 75cb379d263521
("btrfs: defer adding raid type kobject until after chunk relocation") in
order to avoid triggering reclaim during contextes where we can not
(either we are holding a transaction handle or some lock required by
the transaction commit path), so that we do the calls to kobject_add(),
which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects()
in contextes where it is safe to trigger reclaim. That change expected
that a new raid kobject can only be created either when mounting the
filesystem or after raid profile conversion through the relocation path.
However, we can have new raid kobject created in other two cases at least:
1) During device replace (or scrub) after adding a device a to the
filesystem. The replace procedure (and scrub) do calls to
btrfs_inc_block_group_ro() which can allocate a new block group
with a new raid profile (because we now have more devices). This
can be triggered by test cases btrfs/027 and btrfs/176.
2) During a degraded mount trough any write path. This can be triggered
by test case btrfs/124.
Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only
makes things more complex and fragile, can also introduce deadlocks with
reclaim the following way:
1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or
anywhere in the replace/scrub path will cause a deadlock with reclaim
because if reclaim happens and a transaction commit is triggered,
the transaction commit path will block at btrfs_scrub_pause().
2) During degraded mounts it is essentially impossible to figure out where
to add extra calls to btrfs_add_raid_kobjects(), because allocation of
a block group with a new raid profile can happen anywhere, which means
we can't safely figure out which contextes are safe for reclaim, as
we can either hold a transaction handle or some lock needed by the
transaction commit path.
So it is too complex and error prone to have this split setup of raid
kobjects. So fix the issue by consolidating the setup of the kobjects in a
single place, at link_block_group(), and setup a nofs context there in
order to prevent reclaim being triggered by the memory allocations done
through the call chain of kobject_add().
Besides fixing the sysfs warnings during kobject_del(), this also ensures
the sysfs directories for the new raid profiles end up created and visible
to users (a bug that existed before the 5.3 commit 7c7e301406d0a9
("btrfs: sysfs: Replace default_attrs in ktypes with groups")).
Fixes: 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation")
Fixes: 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pull networking fixes from David Miller:
"Yeah I should have sent a pull request last week, so there is a lot
more here than usual:
1) Fix memory leak in ebtables compat code, from Wenwen Wang.
2) Several kTLS bug fixes from Jakub Kicinski (circular close on
disconnect etc.)
3) Force slave speed check on link state recovery in bonding 802.3ad
mode, from Thomas Falcon.
4) Clear RX descriptor bits before assigning buffers to them in
stmmac, from Jose Abreu.
5) Several missing of_node_put() calls, mostly wrt. for_each_*() OF
loops, from Nishka Dasgupta.
6) Double kfree_skb() in peak_usb can driver, from Stephane Grosjean.
7) Need to hold sock across skb->destructor invocation, from Cong
Wang.
8) IP header length needs to be validated in ipip tunnel xmit, from
Haishuang Yan.
9) Use after free in ip6 tunnel driver, also from Haishuang Yan.
10) Do not use MSI interrupts on r8169 chips before RTL8168d, from
Heiner Kallweit.
11) Upon bridge device init failure, we need to delete the local fdb.
From Nikolay Aleksandrov.
12) Handle erros from of_get_mac_address() properly in stmmac, from
Martin Blumenstingl.
13) Handle concurrent rename vs. dump in netfilter ipset, from Jozsef
Kadlecsik.
14) Setting NETIF_F_LLTX on mac80211 causes complete breakage with
some devices, so revert. From Johannes Berg.
15) Fix deadlock in rxrpc, from David Howells.
16) Fix Kconfig deps of enetc driver, we must have PHYLIB. From Yue
Haibing.
17) Fix mvpp2 crash on module removal, from Matteo Croce.
18) Fix race in genphy_update_link, from Heiner Kallweit.
19) bpf_xdp_adjust_head() stopped working with generic XDP when we
fixes generic XDP to support stacked devices properly, fix from
Jesper Dangaard Brouer.
20) Unbalanced RCU locking in rt6_update_exception_stamp_rt(), from
David Ahern.
21) Several memory leaks in new sja1105 driver, from Vladimir Oltean"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (214 commits)
net: dsa: sja1105: Fix memory leak on meta state machine error path
net: dsa: sja1105: Fix memory leak on meta state machine normal path
net: dsa: sja1105: Really fix panic on unregistering PTP clock
net: dsa: sja1105: Use the LOCKEDS bit for SJA1105 E/T as well
net: dsa: sja1105: Fix broken learning with vlan_filtering disabled
net: dsa: qca8k: Add of_node_put() in qca8k_setup_mdio_bus()
net: sched: sample: allow accessing psample_group with rtnl
net: sched: police: allow accessing police->params with rtnl
net: hisilicon: Fix dma_map_single failed on arm64
net: hisilicon: fix hip04-xmit never return TX_BUSY
net: hisilicon: make hip04_tx_reclaim non-reentrant
tc-testing: updated vlan action tests with batch create/delete
net sched: update vlan action for batched events operations
net: stmmac: tc: Do not return a fragment entry
net: stmmac: Fix issues when number of Queues >= 4
net: stmmac: xgmac: Fix XGMAC selftests
be2net: disable bh with spin_lock in be_process_mcc
net: cxgb3_main: Fix a resource leak in a error path in 'init_one()'
net: ethernet: sun4i-emac: Support phy-handle property for finding PHYs
net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
...
|
|
Fix kernel oops when mounting a encryptData CIFS share with
CONFIG_DEBUG_VIRTUAL
Signed-off-by: Sebastien Tisserant <stisserant@wallix.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
We had a report of a server which did not do a DFS referral
because the session setup Capabilities field was set to 0
(unlike negotiate protocol where we set CAP_DFS). Better to
send it session setup in the capabilities as well (this also
more closely matches Windows client behavior).
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
|