Age | Commit message (Collapse) | Author |
|
Making something exportable takes more than providing ->s_export_ops.
In particular, ->lookup() *MUST* use d_splice_alias() instead of
d_add().
Reading Documentation/filesystems/nfs/Exporting would've been a good idea;
as it is, exporting AFFS is badly (and exploitably) broken.
Partially-Fixes: ed4433d72394 "fs/affs: make affs exportable"
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
we unlock the directory hash too early - if we are looking at secondary
link and primary (in another directory) gets removed just as we unlock,
we could have the old primary moved in place of the secondary, leaving
us to look into freed entry (and leaving our dentry with ->d_fsdata
pointing to a freed entry).
Cc: stable@vger.kernel.org # 2.4.4+
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Merge speculative store buffer bypass fixes from Thomas Gleixner:
- rework of the SPEC_CTRL MSR management to accomodate the new fancy
SSBD (Speculative Store Bypass Disable) bit handling.
- the CPU bug and sysfs infrastructure for the exciting new Speculative
Store Bypass 'feature'.
- support for disabling SSB via LS_CFG MSR on AMD CPUs including
Hyperthread synchronization on ZEN.
- PRCTL support for dynamic runtime control of SSB
- SECCOMP integration to automatically disable SSB for sandboxed
processes with a filter flag for opt-out.
- KVM integration to allow guests fiddling with SSBD including the new
software MSR VIRT_SPEC_CTRL to handle the LS_CFG based oddities on
AMD.
- BPF protection against SSB
.. this is just the core and x86 side, other architecture support will
come separately.
* 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
bpf: Prevent memory disambiguation attack
x86/bugs: Rename SSBD_NO to SSB_NO
KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD
x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG
x86/bugs: Rework spec_ctrl base and mask logic
x86/bugs: Remove x86_spec_ctrl_set()
x86/bugs: Expose x86_spec_ctrl_base directly
x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}
x86/speculation: Rework speculative_store_bypass_update()
x86/speculation: Add virtualized speculative store bypass disable support
x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL
x86/speculation: Handle HT correctly on AMD
x86/cpufeatures: Add FEATURE_ZEN
x86/cpufeatures: Disentangle SSBD enumeration
x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS
x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP
KVM: SVM: Move spec control call after restore of GS
x86/cpu: Make alternative_msr_write work for 32-bit code
x86/bugs: Fix the parameters alignment and missing void
x86/bugs: Make cpu_show_common() static
...
|
|
Fix below build warning:
WARNING: vmlinux.o(.text+0x422bb8): Section mismatch in reference from
the function vmcore_add_device_dump() to the function
.init.text:get_vmcore_size.constprop.5()
The function vmcore_add_device_dump() references
the function __init get_vmcore_size.constprop.5().
This is often because vmcore_add_device_dump lacks a __init
annotation or the annotation of get_vmcore_size.constprop.5 is wrong.
Fixes: 7efe48df8a3d ("vmcore: append device dumps to vmcore as elf notes")
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reserved space isn't committed yet but cannot be used for allocations.
For userspace it has no difference from used space. XFS already does this.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Fixes: 689c958cbe6b ("ext4: add project quota support")
|
|
Signed-off-by: Sean Fu <fxinrong@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The kmem_cache_destroy() function already checks for null pointers, so
we can remove the check at the call site.
This patch also sets jbd2_handle_cache and jbd2_inode_cache to be NULL
after freeing them in jbd2_journal_destroy_handle_cache().
Signed-off-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
See following dmesg output with jbd2 debug enabled:
...(start_this_handle, 313): New handle 00000000c88d6ceb going live.
...(start_this_handle, 383): Handle 00000000c88d6ceb given 53 credits (total 53, free 32681)
...(do_get_write_access, 838): journal_head 0000000002856fc0, force_copy 0
...(jbd2_journal_cancel_revoke, 421): journal_head 0000000002856fc0, cancelling revoke
We have an extra line with every messages, this is a waste of buffer,
we can fix it by removing "\n" in the caller or remove it in
the __jbd2_debug(), i checked every jbd2_debug() passed '\n' explicitly.
To avoid more lines, let's remove it inside __jbd2_debug().
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
Log the crypto algorithm driver name for each fscrypt encryption mode on
its first use, also showing a friendly name for the mode.
This will help people determine whether the expected implementations are
being used. In some cases we've seen people do benchmarks and reject
using encryption for performance reasons, when in fact they used a much
slower implementation of AES-XTS than was possible on the hardware. It
can make an enormous difference; e.g., AES-XTS on ARM is about 10x
faster with the crypto extensions (AES instructions) than without.
This also makes it more obvious which modes are being used, now that
fscrypt supports multiple combinations of modes.
Example messages (with default modes, on x86_64):
[ 35.492057] fscrypt: AES-256-CTS-CBC using implementation "cts(cbc-aes-aesni)"
[ 35.492171] fscrypt: AES-256-XTS using implementation "xts-aes-aesni"
Note: algorithms can be dynamically added to the crypto API, which can
result in different implementations being used at different times. But
this is rare; for most users, showing the first will be good enough.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
fscrypt currently only supports AES encryption. However, many low-end
mobile devices have older CPUs that don't have AES instructions, e.g.
the ARMv8 Cryptography Extensions. Currently, user data on such devices
is not encrypted at rest because AES is too slow, even when the NEON
bit-sliced implementation of AES is used. Unfortunately, it is
infeasible to encrypt these devices at all when AES is the only option.
Therefore, this patch updates fscrypt to support the Speck block cipher,
which was recently added to the crypto API. The C implementation of
Speck is not especially fast, but Speck can be implemented very
efficiently with general-purpose vector instructions, e.g. ARM NEON.
For example, on an ARMv7 processor, we measured the NEON-accelerated
Speck128/256-XTS at 69 MB/s for both encryption and decryption, while
AES-256-XTS with the NEON bit-sliced implementation was only 22 MB/s
encryption and 19 MB/s decryption.
There are multiple variants of Speck. This patch only adds support for
Speck128/256, which is the variant with a 128-bit block size and 256-bit
key size -- the same as AES-256. This is believed to be the most secure
variant of Speck, and it's only about 6% slower than Speck128/128.
Speck64/128 would be at least 20% faster because it has 20% rounds, and
it can be even faster on CPUs that can't efficiently do the 64-bit
operations needed for Speck128. However, Speck64's 64-bit block size is
not preferred security-wise. ARM NEON also supports the needed 64-bit
operations even on 32-bit CPUs, resulting in Speck128 being fast enough
for our targeted use cases so far.
The chosen modes of operation are XTS for contents and CTS-CBC for
filenames. These are the same modes of operation that fscrypt defaults
to for AES. Note that as with the other fscrypt modes, Speck will not
be used unless userspace chooses to use it. Nor are any of the existing
modes (which are all AES-based) being removed, of course.
We intentionally don't make CONFIG_FS_ENCRYPTION select
CONFIG_CRYPTO_SPECK, so people will have to enable Speck support
themselves if they need it. This is because we shouldn't bloat the
FS_ENCRYPTION dependencies with every new cipher, especially ones that
aren't recommended for most users. Moreover, CRYPTO_SPECK is just the
generic implementation, which won't be fast enough for many users; in
practice, they'll need to enable CRYPTO_SPECK_NEON to get acceptable
performance.
More details about our choice of Speck can be found in our patches that
added Speck to the crypto API, and the follow-on discussion threads.
We're planning a publication that explains the choice in more detail.
But briefly, we can't use ChaCha20 as we previously proposed, since it
would be insecure to use a stream cipher in this context, with potential
IV reuse during writes on f2fs and/or on wear-leveling flash storage.
We also evaluated many other lightweight and/or ARX-based block ciphers
such as Chaskey-LTS, RC5, LEA, CHAM, Threefish, RC6, NOEKEON, SPARX, and
XTEA. However, all had disadvantages vs. Speck, such as insufficient
performance with NEON, much less published cryptanalysis, or an
insufficient security level. Various design choices in Speck make it
perform better with NEON than competing ciphers while still having a
security margin similar to AES, and in the case of Speck128 also the
same available security levels. Unfortunately, Speck does have some
political baggage attached -- it's an NSA designed cipher, and was
rejected from an ISO standard (though for context, as far as I know none
of the above-mentioned alternatives are ISO standards either).
Nevertheless, we believe it is a good solution to the problem from a
technical perspective.
Certain algorithms constructed from ChaCha or the ChaCha permutation,
such as MEM (Masked Even-Mansour) or HPolyC, may also meet our
performance requirements. However, these are new constructions that
need more time to receive the cryptographic review and acceptance needed
to be confident in their security. HPolyC hasn't been published yet,
and we are concerned that MEM makes stronger assumptions about the
underlying permutation than the ChaCha stream cipher does. In contrast,
the XTS mode of operation is relatively well accepted, and Speck has
over 70 cryptanalysis papers. Of course, these ChaCha-based algorithms
can still be added later if they become ready.
The best known attack on Speck128/256 is a differential cryptanalysis
attack on 25 of 34 rounds with 2^253 time complexity and 2^125 chosen
plaintexts, i.e. only marginally faster than brute force. There is no
known attack on the full 34 rounds.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently the key derivation function in fscrypt uses the master key
length as the amount of output key material to derive. This works, but
it means we can waste time deriving more key material than is actually
used, e.g. most commonly, deriving 64 bytes for directories which only
take a 32-byte AES-256-CTS-CBC key. It also forces us to validate that
the master key length is a multiple of AES_BLOCK_SIZE, which wouldn't
otherwise be necessary.
Fix it to only derive the needed length key.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Refactor the confusingly-named function 'validate_user_key()' into a new
function 'find_and_derive_key()' which first finds the keyring key, then
does the key derivation. Among other benefits this avoids the strange
behavior we had previously where if key derivation failed for some
reason, then we would fall back to the alternate key prefix. Now, we'll
only fall back to the alternate key prefix if a valid key isn't found.
This patch also improves the warning messages that are logged when the
keyring key's payload is invalid.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Use a common function for fscrypt warning and error messages so that all
the messages are consistently ratelimited, include the "fscrypt:"
prefix, and include the filesystem name if applicable.
Also fix up a few of the log messages to be more descriptive.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
With one exception, the internal key size constants such as
FS_AES_256_XTS_KEY_SIZE are only used for the 'available_modes' array,
where they really only serve to obfuscate what the values are. Also
some of the constants are unused, and the key sizes tend to be in the
names of the algorithms anyway. In the past these values were also
misused, e.g. we used to have FS_AES_256_XTS_KEY_SIZE in places that
technically should have been FS_MAX_KEY_SIZE.
The exception is that FS_AES_128_ECB_KEY_SIZE is used for key
derivation. But it's more appropriate to use
FS_KEY_DERIVATION_NONCE_SIZE for that instead.
Thus, just put the sizes directly in the 'available_modes' array.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We're passing 'key_type_logon' to request_key(), so the found key is
guaranteed to be of type "logon". Thus, there is no reason to check
later that the key is really a "logon" key.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now ->max_namelen() is only called to limit the filename length when
adding NUL padding, and only for real filenames -- not symlink targets.
It also didn't give the correct length for symlink targets anyway since
it forgot to subtract 'sizeof(struct fscrypt_symlink_data)'.
Thus, change ->max_namelen from a function to a simple 'unsigned int'
that gives the filesystem's maximum filename length.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
fname_decrypt() is validating that the encrypted filename is nonempty.
However, earlier a stronger precondition was already enforced: the
encrypted filename must be at least 16 (FS_CRYPTO_BLOCK_SIZE) bytes.
Drop the redundant check for an empty filename.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
fname_decrypt() returns an error if the input filename is longer than
the inode's ->max_namelen() as given by the filesystem. But, this
doesn't actually make sense because the filesystem provided the input
filename in the first place, where it was subject to the filesystem's
limits. And fname_decrypt() has no internal limit itself.
Thus, remove this unnecessary check.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In fscrypt_setup_filename(), remove the unnecessary check for
fscrypt_get_encryption_info() returning EOPNOTSUPP. There's no reason
to handle this error differently from any other. I think there may have
been some confusion because the "notsupp" version of
fscrypt_get_encryption_info() returns EOPNOTSUPP -- but that's not
applicable from inside fs/crypto/.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
fscrypt is clearing the flags on the crypto_skcipher it allocates for
each inode. But, this is unnecessary and may cause problems in the
future because it will even clear flags that are meant to be internal to
the crypto API, e.g. CRYPTO_TFM_NEED_KEY.
Remove the unnecessary flag clearing.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
skcipher_request_alloc() can only fail due to lack of memory, and in
that case the memory allocator will have already printed a detailed
error message. Thus, remove the redundant error messages from fscrypt.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
crypto_alloc_skcipher() returns an ERR_PTR() on failure, not NULL.
Remove the unnecessary check for NULL.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now that all filesystems have been converted to use
fscrypt_prepare_lookup(), we can remove the fscrypt_set_d_op() and
fscrypt_set_encrypted_dentry() functions as well as un-export
fscrypt_d_ops.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Improve fscrypt read performance by switching the decryption workqueue
from bound to unbound. With the bound workqueue, when multiple bios
completed on the same CPU, they were decrypted on that same CPU. But
with the unbound queue, they are now decrypted in parallel on any CPU.
Although fscrypt read performance can be tough to measure due to the
many sources of variation, this change is most beneficial when
decryption is slow, e.g. on CPUs without AES instructions. For example,
I timed tarring up encrypted directories on f2fs. On x86 with AES-NI
instructions disabled, the unbound workqueue improved performance by
about 25-35%, using 1 to NUM_CPUs jobs with 4 or 8 CPUs available. But
with AES-NI enabled, performance was unchanged to within ~2%.
I also did the same test on a quad-core ARM CPU using xts-speck128-neon
encryption. There performance was usually about 10% better with the
unbound workqueue, bringing it closer to the unencrypted speed.
The unbound workqueue may be worse in some cases due to worse locality,
but I think it's still the better default. dm-crypt uses an unbound
workqueue by default too, so this change makes fscrypt match.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We've accumulated some fixes during the last week, some of them were
in the works for a longer time but there are some newer ones too.
Most of the fixes have a reproducer and fix user visible problems,
also candidates for stable kernels. They IMHO qualify for a late rc,
though I did not expect that many"
* tag 'for-4.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix crash when trying to resume balance without the resume flag
btrfs: Fix delalloc inodes invalidation during transaction abort
btrfs: Split btrfs_del_delalloc_inode into 2 functions
btrfs: fix reading stale metadata blocks after degraded raid1 mounts
btrfs: property: Set incompat flag if lzo/zstd compression is set
Btrfs: fix duplicate extents after fsync of file with prealloc extents
Btrfs: fix xattr loss after power failure
Btrfs: send, fix invalid access to commit roots due to concurrent snapshotting
|
|
syzbot is reporting ODEBUG messages at hfsplus_fill_super() [1]. This
is because hfsplus_fill_super() forgot to call cancel_delayed_work_sync().
As far as I can see, it is hfsplus_mark_mdb_dirty() from
hfsplus_new_inode() in hfsplus_fill_super() that calls
queue_delayed_work(). Therefore, I assume that hfsplus_new_inode() does
not fail if queue_delayed_work() was called, and the out_put_hidden_dir
label is the appropriate location to call cancel_delayed_work_sync().
[1] https://syzkaller.appspot.com/bug?id=a66f45e96fdbeb76b796bf46eb25ea878c42a6c9
Link: http://lkml.kernel.org/r/964a8b27-cd69-357c-fe78-76b066056201@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+4f2e5f086147d543ab03@syzkaller.appspotmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There can be a lot of workqueue workers and they all show up with the
cryptic kworker/* names making it difficult to understand which is
doing what and how they came to be.
# ps -ef | grep kworker
root 4 2 0 Feb25 ? 00:00:00 [kworker/0:0H]
root 6 2 0 Feb25 ? 00:00:00 [kworker/u112:0]
root 19 2 0 Feb25 ? 00:00:00 [kworker/1:0H]
root 25 2 0 Feb25 ? 00:00:00 [kworker/2:0H]
root 31 2 0 Feb25 ? 00:00:00 [kworker/3:0H]
...
This patch makes workqueue workers report the latest workqueue it was
executing for through /proc/PID/{comm,stat,status}. The extra
information is appended to the kthread name with intervening '+' if
currently executing, otherwise '-'.
# cat /proc/25/comm
kworker/2:0-events_power_efficient
# cat /proc/25/stat
25 (kworker/2:0-events_power_efficient) I 2 0 0 0 -1 69238880 0 0...
# grep Name /proc/25/status
Name: kworker/2:0-events_power_efficient
Unfortunately, ps(1) truncates comm to 15 characters,
# ps 25
PID TTY STAT TIME COMMAND
25 ? I 0:00 [kworker/2:0-eve]
making it a lot less useful; however, this should be an easy fix from
ps(1) side.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Craig Small <csmall@enc.com.au>
|
|
proc shows task->comm in three places - comm, stat, status - and each
is fetching and formatting task->comm slighly differently. This patch
renames task_name() to proc_task_name(), makes it more generic, and
updates all three paths to use it.
This will enable expanding comm reporting for workqueue workers.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Before changing the arguments of the functions fsnotify_add_mark()
and fsnotify_add_mark_locked(), convert most callers to use a wrapper.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Use fsnotify_foreach_obj_type macros to generalize the code that filters
events by marks mask and ignored_mask.
This is going to be used for adding mark of super block object type.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Use fsnotify_foreach_obj_type macros to generalize the code that filters
events by marks mask and ignored_mask.
This is going to be used for adding mark of super block object type.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Make some code that handles marks of object types inode and vfsmount
generic, so it can handle other object types.
Introduce fsnotify_foreach_obj_type macro to iterate marks by object type
and fsnotify_iter_{should|set}_report_type macros to set/test report_mask.
This is going to be used for adding mark of another object type
(super block mark).
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Introduce helpers fsnotify_iter_select_report_types() and
fsnotify_iter_next() to abstract the inode/vfsmount marks merged
list iteration.
This is a preparation patch before generalizing mark list
iteration to more mark object types (i.e. super block marks).
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
inode_mark and vfsmount_mark arguments are passed to handle_event()
operation as function arguments as well as on iter_info struct.
The difference is that iter_info struct may contain marks that should
not be handled and are represented as NULL arguments to inode_mark or
vfsmount_mark.
Instead of passing the inode_mark and vfsmount_mark arguments, add
a report_mask member to iter_info struct to indicate which marks should
be handled, versus marks that should only be kept alive during user
wait.
This change is going to be used for passing more mark types
with handle_event() (i.e. super block marks).
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
An fsnotify_mark_connector is referencing a single type of object
(either inode or vfsmount). Instead of storing a type mask in
connector->flags, store a single type id in connector->type to
identify the type of object.
When a connector object is detached from the object, its type is set
to FSNOTIFY_OBJ_TYPE_DETACHED and this object is not going to be
reused.
The function fsnotify_clear_marks_by_group() is the only place where
type mask was used, so use type flags instead of type id to this
function.
This change is going to be more convenient when adding a new object
type (super block).
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Provide two extra functions, proc_create_net_data_write() and
proc_create_net_single_write() that act like their non-write versions but
also set a write method in the proc_dir_entry struct.
An internal simple write function is provided that will copy its buffer and
hand it to the pde->write() method if available (or give an error if not).
The buffer may be modified by the write method.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Rearrange fs/afs/proc.c to get rid of all the remaining predeclarations.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Rearrange fs/afs/proc.c to move the show routines up to the top of each
block so the order is show, iteration, ops, file ops, fops.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Rearrange fs/afs/proc.c by moving fops and open functions down so as to
remove predeclarations.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
In fs/afs/proc.c, move functions that create and remove /proc files to the
end of the source file as a first stage in getting rid of all the forward
declarations.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Use path_equal() to detect whether we're already in root.
Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
We have some very odd semantics for reading the command line through
/proc, because we allow people to rewrite their own command line pretty
much at will, and things get positively funky when you extend your
command line past the point that used to be the end of the command line,
and is now in the environment variable area.
But our weird semantics doesn't mean that we should write weird and
complex code to handle them.
So re-write get_mm_cmdline() to be much simpler, and much more explicit
about what it is actually doing and why. And avoid the extra check for
"is there a NUL character at the end of the command line where I expect
one to be", by simply making the NUL character handling be part of the
normal "once you hit the end of the command line, stop at the first NUL
character" logic.
It's quite possible that we should stop the crazy "walk into
environment" entirely, but happily it's not really the usual case.
NOTE! We tried to really simplify and limit our odd cmdline parsing some
time ago, but people complained. See commit c2c0bb44620d ("proc: fix
PAGE_SIZE limit of /proc/$PID/cmdline") for details about why we have
this complexity.
Cc: Tejun Heo <tj@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is a pure refactoring of the function, preparing for some further
cleanups. The thing was pretty illegible, and the core functionality
still is, but now the core loop is a bit more isolated from the thing
that goes on around it.
This was "inspired" by the confluence of kworker workqueue name cleanups
by Tejun, currently scheduled for 4.18, and commit 7f7ccc2ccc2e ("proc:
do not access cmdline nor environ from file-backed areas").
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
proc_pid_cmdline_read() and environ_read() directly access the target
process' VM to retrieve the command line and environment. If this
process remaps these areas onto a file via mmap(), the requesting
process may experience various issues such as extra delays if the
underlying device is slow to respond.
Let's simply refuse to access file-backed areas in these functions.
For this we add a new FOLL_ANON gup flag that is passed to all calls
to access_remote_vm(). The code already takes care of such failures
(including unmapped areas). Accesses via /proc/pid/mem were not
changed though.
This was assigned CVE-2018-1120.
Note for stable backports: the patch may apply to kernels prior to 4.11
but silently miss one location; it must be checked that no call to
access_remote_vm() keeps zero as the last argument.
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
generic_swapfile_activate() doesn't allow holes, so we should be
consistent here. This is also a bit safer: if the user creates a
swapfile with, say, truncate -s $SIZE followed by mkswap, they should
really get an error and not much less swap space than they expected.
swapon(8) will error out before calling swapon(2) if the file has holes,
anyways.
Fixes: 9d93388b0afe ("iomap: add a swapfile activation function")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance()
only, which isn't called during the remount. So when resuming from
the paused balance we hit the bug:
kernel: kernel BUG at fs/btrfs/volumes.c:3890!
::
kernel: balance_kthread+0x51/0x60 [btrfs]
kernel: kthread+0x111/0x130
::
kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8
Reproducer:
On a mounted filesystem:
btrfs balance start --full-balance /btrfs
btrfs balance pause /btrfs
mount -o remount,ro /dev/sdb /btrfs
mount -o remount,rw /dev/sdb /btrfs
To fix this set the BTRFS_BALANCE_RESUME flag in
btrfs_resume_balance_async().
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When a transaction is aborted btrfs_cleanup_transaction is called to
cleanup all the various in-flight bits and pieces which migth be
active. One of those is delalloc inodes - inodes which have dirty
pages which haven't been persisted yet. Currently the process of
freeing such delalloc inodes in exceptional circumstances such as
transaction abort boiled down to calling btrfs_invalidate_inodes whose
sole job is to invalidate the dentries for all inodes related to a
root. This is in fact wrong and insufficient since such delalloc inodes
will likely have pending pages or ordered-extents and will be linked to
the sb->s_inode_list. This means that unmounting a btrfs instance with
an aborted transaction could potentially lead inodes/their pages
visible to the system long after their superblock has been freed. This
in turn leads to a "use-after-free" situation once page shrink is
triggered. This situation could be simulated by running generic/019
which would cause such inodes to be left hanging, followed by
generic/176 which causes memory pressure and page eviction which lead
to touching the freed super block instance. This situation is
additionally detected by the unmount code of VFS with the following
message:
"VFS: Busy inodes after unmount of Self-destruct in 5 seconds. Have a nice day..."
Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
in free_fs_root for the same reason.
This patch aims to rectify the sitaution by doing the following:
1. Change btrfs_destroy_delalloc_inodes so that it calls
invalidate_inode_pages2 for every inode on the delalloc list, this
ensures that all the pages of the inode are released. This function
boils down to calling btrfs_releasepage. During test I observed cases
where inodes on the delalloc list were having an i_count of 0, so this
necessitates using igrab to be sure we are working on a non-freed inode.
2. Since calling btrfs_releasepage might queue delayed iputs move the
call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
calling run_delayed_iputs for the last time. This is necessary to ensure
that delayed iputs are run.
Note: this patch is tagged for 4.14 stable but the fix applies to older
versions too but needs to be backported manually due to conflicts.
CC: stable@vger.kernel.org # 4.14.x: 2b8773313494: btrfs: Split btrfs_del_delalloc_inode into 2 functions
CC: stable@vger.kernel.org # 4.14.x
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment to igrab ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is in preparation of fixing delalloc inodes leakage on transaction
abort. Also export the new function.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.
btrfs_search_slot()
read_block_for_search() # hold parent and its lock, go to read child
btrfs_release_path()
read_tree_block() # read child
Unfortunately, the parent lock got released before reading child, so
commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
used 0 as parent transid to read the child block. It forces
read_tree_block() not to check if parent transid is different with the
generation id of the child that it reads out from disk.
A simple PoC is included in btrfs/124,
0. A two-disk raid1 btrfs,
1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
2. Mount this filesystem and put it in use, after a while, device tree's
root got COW but block A hasn't been allocated/overwritten yet.
3. Umount it and reload the btrfs module to remove both disks from the
global @fs_devices list.
4. mount -odegraded dev1 and write some data, so now block A is allocated
to be a leaf in checksum tree. Note that only dev1 has the latest
metadata of this filesystem.
5. Umount it and mount it again normally (with both disks), since raid1
can pick up one disk by the writer task's pid, if btrfs_search_slot()
needs to read block A, dev2 which does NOT have the latest metadata
might be read for block A, then we got a stale block A.
6. As parent transid is not checked, block A is marked as uptodate and
put into the extent buffer cache, so the future search won't bother
to read disk again, which means it'll make changes on this stale
one and make it dirty and flush it onto disk.
To avoid the problem, parent transid needs to be passed to
read_tree_block().
In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.
This patch needs to be slightly adapted for stable kernels, the
&first_key parameter added to read_tree_block() is from 4.16+
(581c1760415c4). The fix is to replace 0 by 'gen'.
Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
|