summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-01-07random: early initialization of ChaCha constantsDominik Brodowski
Previously, the ChaCha constants for the primary pool were only initialized in crng_initialize_primary(), called by rand_initialize(). However, some randomness is actually extracted from the primary pool beforehand, e.g. by kmem_cache_create(). Therefore, statically initialize the ChaCha constants for the primary pool. Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: <linux-crypto@vger.kernel.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07random: use IS_ENABLED(CONFIG_NUMA) instead of ifdefsJason A. Donenfeld
Rather than an awkward combination of ifdefs and __maybe_unused, we can ensure more source gets parsed, regardless of the configuration, by using IS_ENABLED for the CONFIG_NUMA conditional code. This makes things cleaner and easier to follow. I've confirmed that on !CONFIG_NUMA, we don't wind up with excess code by accident; the generated object file is the same. Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07random: harmonize "crng init done" messagesDominik Brodowski
We print out "crng init done" for !TRUST_CPU, so we should also print out the same for TRUST_CPU. Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07random: mix bootloader randomness into poolJason A. Donenfeld
If we're trusting bootloader randomness, crng_fast_load() is called by add_hwgenerator_randomness(), which sets us to crng_init==1. However, usually it is only called once for an initial 64-byte push, so bootloader entropy will not mix any bytes into the input pool. So it's conceivable that crng_init==1 when crng_initialize_primary() is called later, but then the input pool is empty. When that happens, the crng state key will be overwritten with extracted output from the empty input pool. That's bad. In contrast, if we're not trusting bootloader randomness, we call crng_slow_load() *and* we call mix_pool_bytes(), so that later crng_initialize_primary() isn't drawing on nothing. In order to prevent crng_initialize_primary() from extracting an empty pool, have the trusted bootloader case mirror that of the untrusted bootloader case, mixing the input into the pool. [linux@dominikbrodowski.net: rewrite commit message] Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07random: do not throw away excess input to crng_fast_loadJason A. Donenfeld
When crng_fast_load() is called by add_hwgenerator_randomness(), we currently will advance to crng_init==1 once we've acquired 64 bytes, and then throw away the rest of the buffer. Usually, that is not a problem: When add_hwgenerator_randomness() gets called via EFI or DT during setup_arch(), there won't be any IRQ randomness. Therefore, the 64 bytes passed by EFI exactly matches what is needed to advance to crng_init==1. Usually, DT seems to pass 64 bytes as well -- with one notable exception being kexec, which hands over 128 bytes of entropy to the kexec'd kernel. In that case, we'll advance to crng_init==1 once 64 of those bytes are consumed by crng_fast_load(), but won't continue onward feeding in bytes to progress to crng_init==2. This commit fixes the issue by feeding any leftover bytes into the next phase in add_hwgenerator_randomness(). [linux@dominikbrodowski.net: rewrite commit message] Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07random: do not re-init if crng_reseed completes before primary initJason A. Donenfeld
If the bootloader supplies sufficient material and crng_reseed() is called very early on, but not too early that wqs aren't available yet, then we might transition to crng_init==2 before rand_initialize()'s call to crng_initialize_primary() made. Then, when crng_initialize_primary() is called, if we're trusting the CPU's RDRAND instructions, we'll needlessly reinitialize the RNG and emit a message about it. This is mostly harmless, as numa_crng_init() will allocate and then free what it just allocated, and excessive calls to invalidate_batched_entropy() aren't so harmful. But it is funky and the extra message is confusing, so avoid the re-initialization all together by checking for crng_init < 2 in crng_initialize_primary(), just as we already do in crng_reseed(). Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07random: fix crash on multiple early calls to add_bootloader_randomness()Dominik Brodowski
Currently, if CONFIG_RANDOM_TRUST_BOOTLOADER is enabled, multiple calls to add_bootloader_randomness() are broken and can cause a NULL pointer dereference, as noted by Ivan T. Ivanov. This is not only a hypothetical problem, as qemu on arm64 may provide bootloader entropy via EFI and via devicetree. On the first call to add_hwgenerator_randomness(), crng_fast_load() is executed, and if the seed is long enough, crng_init will be set to 1. On subsequent calls to add_bootloader_randomness() and then to add_hwgenerator_randomness(), crng_fast_load() will be skipped. Instead, wait_event_interruptible() and then credit_entropy_bits() will be called. If the entropy count for that second seed is large enough, that proceeds to crng_reseed(). However, both wait_event_interruptible() and crng_reseed() depends (at least in numa_crng_init()) on workqueues. Therefore, test whether system_wq is already initialized, which is a sufficient indicator that workqueue_init_early() has progressed far enough. If we wind up hitting the !system_wq case, we later want to do what would have been done there when wqs are up, so set a flag, and do that work later from the rand_initialize() call. Reported-by: Ivan T. Ivanov <iivanov@suse.de> Fixes: 18b915ac6b0a ("efi/random: Treat EFI_RNG_PROTOCOL output as bootloader randomness") Cc: stable@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> [Jason: added crng_need_done state and related logic.] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07random: do not sign extend bytes for rotation when mixingJason A. Donenfeld
By using `char` instead of `unsigned char`, certain platforms will sign extend the byte when `w = rol32(*bytes++, input_rotate)` is called, meaning that bit 7 is overrepresented when mixing. This isn't a real problem (unless the mixer itself is already broken) since it's still invertible, but it's not quite correct either. Fix this by using an explicit unsigned type. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07random: use BLAKE2s instead of SHA1 in extractionJason A. Donenfeld
This commit addresses one of the lower hanging fruits of the RNG: its usage of SHA1. BLAKE2s is generally faster, and certainly more secure, than SHA1, which has [1] been [2] really [3] very [4] broken [5]. Additionally, the current construction in the RNG doesn't use the full SHA1 function, as specified, and allows overwriting the IV with RDRAND output in an undocumented way, even in the case when RDRAND isn't set to "trusted", which means potential malicious IV choices. And its short length means that keeping only half of it secret when feeding back into the mixer gives us only 2^80 bits of forward secrecy. In other words, not only is the choice of hash function dated, but the use of it isn't really great either. This commit aims to fix both of these issues while also keeping the general structure and semantics as close to the original as possible. Specifically: a) Rather than overwriting the hash IV with RDRAND, we put it into BLAKE2's documented "salt" and "personal" fields, which were specifically created for this type of usage. b) Since this function feeds the full hash result back into the entropy collector, we only return from it half the length of the hash, just as it was done before. This increases the construction's forward secrecy from 2^80 to a much more comfortable 2^128. c) Rather than using the raw "sha1_transform" function alone, we instead use the full proper BLAKE2s function, with finalization. This also has the advantage of supplying 16 bytes at a time rather than SHA1's 10 bytes, which, in addition to having a faster compression function to begin with, means faster extraction in general. On an Intel i7-11850H, this commit makes initial seeding around 131% faster. BLAKE2s itself has the nice property of internally being based on the ChaCha permutation, which the RNG is already using for expansion, so there shouldn't be any issue with newness, funkiness, or surprising CPU behavior, since it's based on something already in use. [1] https://eprint.iacr.org/2005/010.pdf [2] https://www.iacr.org/archive/crypto2005/36210017/36210017.pdf [3] https://eprint.iacr.org/2015/967.pdf [4] https://shattered.io/static/shattered.pdf [5] https://www.usenix.org/system/files/sec20-leurent.pdf Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07lib/crypto: blake2s: include as built-inJason A. Donenfeld
In preparation for using blake2s in the RNG, we change the way that it is wired-in to the build system. Instead of using ifdefs to select the right symbol, we use weak symbols. And because ARM doesn't need the generic implementation, we make the generic one default only if an arch library doesn't need it already, and then have arch libraries that do need it opt-in. So that the arch libraries can remain tristate rather than bool, we then split the shash part from the glue code. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: linux-kbuild@vger.kernel.org Cc: linux-crypto@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07random: fix data race on crng init timeEric Biggers
_extract_crng() does plain loads of crng->init_time and crng_global_init_time, which causes undefined behavior if crng_reseed() and RNDRESEEDCRNG modify these corrently. Use READ_ONCE() and WRITE_ONCE() to make the behavior defined. Don't fix the race on crng->init_time by protecting it with crng->lock, since it's not a problem for duplicate reseedings to occur. I.e., the lockless access with READ_ONCE() is fine. Fixes: d848e5f8e1eb ("random: add new ioctl RNDRESEEDCRNG") Fixes: e192be9d9a30 ("random: replace non-blocking pool with a Chacha20-based CRNG") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07random: fix data race on crng_node_poolEric Biggers
extract_crng() and crng_backtrack_protect() load crng_node_pool with a plain load, which causes undefined behavior if do_numa_crng_init() modifies it concurrently. Fix this by using READ_ONCE(). Note: as per the previous discussion https://lore.kernel.org/lkml/20211219025139.31085-1-ebiggers@kernel.org/T/#u, READ_ONCE() is believed to be sufficient here, and it was requested that it be used here instead of smp_load_acquire(). Also change do_numa_crng_init() to set crng_node_pool using cmpxchg_release() instead of mb() + cmpxchg(), as the former is sufficient here but is more lightweight. Fixes: 1e7f583af67b ("random: make /dev/urandom scalable for silly userspace programs") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07irq: remove unused flags argument from __handle_irq_event_percpu()Sebastian Andrzej Siewior
The __IRQF_TIMER bit from the flags argument was used in add_interrupt_randomness() to distinguish the timer interrupt from other interrupts. This is no longer the case. Remove the flags argument from __handle_irq_event_percpu(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07random: remove unused irq_flags argument from add_interrupt_randomness()Sebastian Andrzej Siewior
Since commit ee3e00e9e7101 ("random: use registers from interrupted code for CPU's w/o a cycle counter") the irq_flags argument is no longer used. Remove unused irq_flags. Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wei Liu <wei.liu@kernel.org> Cc: linux-hyperv@vger.kernel.org Cc: x86@kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07random: document add_hwgenerator_randomness() with other input functionsMark Brown
The section at the top of random.c which documents the input functions available does not document add_hwgenerator_randomness() which might lead a reader to overlook it. Add a brief note about it. Signed-off-by: Mark Brown <broonie@kernel.org> [Jason: reorganize position of function in doc comment and also document add_bootloader_randomness() while we're at it.] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07MAINTAINERS: add git tree for random.cJason A. Donenfeld
This is handy not just for humans, but also so that the 0-day bot can automatically test posted mailing list patches against the right tree. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-06bpf/selftests: Test bpf_d_path on rdonly_mem.Hao Luo
The second parameter of bpf_d_path() can only accept writable memories. Rdonly_mem obtained from bpf_per_cpu_ptr() can not be passed into bpf_d_path for modification. This patch adds a selftest to verify this behavior. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220106205525.2116218-1-haoluo@google.com
2022-01-06libbpf: Add documentation for bpf_map batch operationsGrant Seltzer
This adds documention for: - bpf_map_delete_batch() - bpf_map_lookup_batch() - bpf_map_lookup_and_delete_batch() - bpf_map_update_batch() This also updates the public API for the `keys` parameter of `bpf_map_delete_batch()`, and both the `keys` and `values` parameters of `bpf_map_update_batch()` to be constants. Signed-off-by: Grant Seltzer <grantseltzer@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220106201304.112675-1-grantseltzer@gmail.com
2022-01-06Merge tag 'trace-v5.16-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Three minor tracing fixes: - Fix missing prototypes in sample module for direct functions - Fix check of valid buffer in get_trace_buf() - Fix annotations of percpu pointers" * tag 'trace-v5.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Tag trace_percpu_buffer as a percpu pointer tracing: Fix check for trace_percpu_buffer validity in get_trace_buf() ftrace/samples: Add missing prototypes direct functions
2022-01-06cgroup/rstat: check updated_next only for rootWei Yang
After commit dc26532aed0a ("cgroup: rstat: punt root-level optimization to individual controllers"), each rstat on updated_children list has its ->updated_next not NULL. This means we can remove the check on ->updated_next, if we make sure the subtree from @root is on list, which could be done by checking updated_next for root. tj: Coding style fixes. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3Andrii Nakryiko
PT_REGS*() macro on some architectures force-cast struct pt_regs to other types (user_pt_regs, etc) and might drop volatile modifiers, if any. Volatile isn't really required as pt_regs value isn't supposed to change during the BPF program run, so this is correct behavior. But progs/loop3.c relies on that volatile modifier to ensure that loop is preserved. Fix loop3.c by declaring i and sum variables as volatile instead. It preserves the loop and makes the test pass on all architectures (including s390x which is currently broken). Fixes: 3cc31d794097 ("libbpf: Normalize PT_REGS_xxx() macro definitions") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220106205156.955373-1-andrii@kernel.org
2022-01-06cgroup: rstat: explicitly put loop variant in whileWei Yang
Instead of do while unconditionally, let's put the loop variant in while. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06selftests: cgroup: Test open-time cgroup namespace usage for migration checksTejun Heo
When a task is writing to an fd opened by a different task, the perm check should use the cgroup namespace of the latter task. Add a test for it. Tested-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06selftests: cgroup: Test open-time credential usage for migration checksTejun Heo
When a task is writing to an fd opened by a different task, the perm check should use the credentials of the latter task. Add a test for it. Tested-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644Tejun Heo
0644 is an odd perm to create a cgroup which is a directory. Use the regular 0755 instead. This is necessary for euid switching test case. Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06cgroup: Use open-time cgroup namespace for process migration perm checksTejun Heo
cgroup process migration permission checks are performed at write time as whether a given operation is allowed or not is dependent on the content of the write - the PID. This currently uses current's cgroup namespace which is a potential security weakness as it may allow scenarios where a less privileged process tricks a more privileged one into writing into a fd that it created. This patch makes cgroup remember the cgroup namespace at the time of open and uses it for migration permission checks instad of current's. Note that this only applies to cgroup2 as cgroup1 doesn't have namespace support. This also fixes a use-after-free bug on cgroupns reported in https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com Note that backporting this fix also requires the preceding patch. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reported-by: syzbot+50f5cf33a284ce738b62@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option") Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06cgroup: Allocate cgroup_file_ctx for kernfs_open_file->privTejun Heo
of->priv is currently used by each interface file implementation to store private information. This patch collects the current two private data usages into struct cgroup_file_ctx which is allocated and freed by the common path. This allows generic private data which applies to multiple files, which will be used to in the following patch. Note that cgroup_procs iterator is now embedded as procs.iter in the new cgroup_file_ctx so that it doesn't need to be allocated and freed separately. v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in cgroup_file_ctx as suggested by Linus. v3: Michal pointed out that cgroup1's procs pidlist uses of->priv too. Converted. Didn't change to embedded allocation as cgroup1 pidlists get stored for caching. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Michal Koutný <mkoutny@suse.com>
2022-01-06cgroup: Use open-time credentials for process migraton perm checksTejun Heo
cgroup process migration permission checks are performed at write time as whether a given operation is allowed or not is dependent on the content of the write - the PID. This currently uses current's credentials which is a potential security weakness as it may allow scenarios where a less privileged process tricks a more privileged one into writing into a fd that it created. This patch makes both cgroup2 and cgroup1 process migration interfaces to use the credentials saved at the time of open (file->f_cred) instead of current's. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Fixes: 187fe84067bd ("cgroup: require write perm on common ancestor when moving processes on the default hierarchy") Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-07Merge tag 'amd-drm-fixes-5.16-2021-12-31' of ↵Dave Airlie
ssh://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-5.16-2021-12-31: amdgpu: - Suspend/resume fix - Restore runtime pm behavior with efifb Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20211231143825.11479-1-alexander.deucher@amd.com
2022-01-06efi: use default_groups in kobj_typeGreg Kroah-Hartman
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the firmware efi sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Ard Biesheuvel <ardb@kernel.org> Cc: linux-efi@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2022-01-06efi/libstub: measure loaded initrd info into the TPMIlias Apalodimas
In an effort to ensure the initrd observed and used by the OS is the same one that was meant to be loaded, which is difficult to guarantee otherwise, let's measure the initrd if the EFI stub and specifically the newly introduced LOAD_FILE2 protocol was used. Modify the initrd loading sequence so that the contents of the initrd are measured into PCR9. Note that the patch is currently using EV_EVENT_TAG to create the eventlog entry instead of EV_IPL. According to the TCP PC Client specification this is used for PCRs defined for OS and application usage. Co-developed-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/r/20211119114745.1560453-5-ilias.apalodimas@linaro.org [ardb: add braces to initializer of tagged_event_data] Link: https://github.com/ClangBuiltLinux/linux/issues/1547 Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2022-01-06xfs: warn about inodes with project id of -1Darrick J. Wong
Inodes aren't supposed to have a project id of -1U (aka 4294967295) but the kernel hasn't always validated FSSETXATTR correctly. Flag this as something for the sysadmin to check out. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06xfs: hold quota inode ILOCK_EXCL until the end of dqallocDarrick J. Wong
Online fsck depends on callers holding ILOCK_EXCL from the time they decide to update a block mapping until after they've updated the reverse mapping records to guarantee the stability of both mapping records. Unfortunately, the quota code drops ILOCK_EXCL at the first transaction roll in the dquot allocation process, which breaks that assertion. This leads to sporadic failures in the online rmap repair code if the repair code grabs the AGF after bmapi_write maps a new block into the quota file's data fork but before it can finish the deferred rmap update. Fix this by rewriting the function to hold the ILOCK until after the transaction commit like all other bmap updates do, and get rid of the dqread wrapper that does nothing but complicate the codebase. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06xfs: Remove redundant assignment of mpJiapeng Chong
mp is being initialized to log->l_mp but this is never read as record is overwritten later on. Remove the redundant assignment. Cleans up the following clang-analyzer warning: fs/xfs/xfs_log_recover.c:3543:20: warning: Value stored to 'mp' during its initialization is never read [clang-analyzer-deadcode.DeadStores]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-06xfs: reduce kvmalloc overhead for CIL shadow buffersDave Chinner
Oh, let me count the ways that the kvmalloc API sucks dog eggs. The problem is when we are logging lots of large objects, we hit kvmalloc really damn hard with costly order allocations, and behaviour utterly sucks: - 49.73% xlog_cil_commit - 31.62% kvmalloc_node - 29.96% __kmalloc_node - 29.38% kmalloc_large_node - 29.33% __alloc_pages - 24.33% __alloc_pages_slowpath.constprop.0 - 18.35% __alloc_pages_direct_compact - 17.39% try_to_compact_pages - compact_zone_order - 15.26% compact_zone 5.29% __pageblock_pfn_to_page 3.71% PageHuge - 1.44% isolate_migratepages_block 0.71% set_pfnblock_flags_mask 1.11% get_pfnblock_flags_mask - 0.81% get_page_from_freelist - 0.59% _raw_spin_lock_irqsave - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 3.24% try_to_free_pages - 3.14% shrink_node - 2.94% shrink_slab.constprop.0 - 0.89% super_cache_count - 0.66% xfs_fs_nr_cached_objects - 0.65% xfs_reclaim_inodes_count 0.55% xfs_perag_get_tag 0.58% kfree_rcu_shrink_count - 2.09% get_page_from_freelist - 1.03% _raw_spin_lock_irqsave - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 4.88% get_page_from_freelist - 3.66% _raw_spin_lock_irqsave - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 1.63% __vmalloc_node - __vmalloc_node_range - 1.10% __alloc_pages_bulk - 0.93% __alloc_pages - 0.92% get_page_from_freelist - 0.89% rmqueue_bulk - 0.69% _raw_spin_lock - do_raw_spin_lock __pv_queued_spin_lock_slowpath 13.73% memcpy_erms - 2.22% kvfree On this workload, that's almost a dozen CPUs all trying to compact and reclaim memory inside kvmalloc_node at the same time. Yet it is regularly falling back to vmalloc despite all that compaction, page and shrinker reclaim that direct reclaim is doing. Copying all the metadata is taking far less CPU time than allocating the storage! Direct reclaim should be considered extremely harmful. This is a high frequency, high throughput, CPU usage and latency sensitive allocation. We've got memory there, and we're using kvmalloc to allow memory allocation to avoid doing lots of work to try to do contiguous allocations. Except it still does *lots of costly work* that is unnecessary. Worse: the only way to avoid the slowpath page allocation trying to do compaction on costly allocations is to turn off direct reclaim (i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags). Unfortunately, the stupid kvmalloc API then says "oh, this isn't a GFP_KERNEL allocation context, so you only get kmalloc!". This cuts off the vmalloc fallback, and this leads to almost instant OOM problems which ends up in filesystems deadlocks, shutdowns and/or kernel crashes. I want some basic kvmalloc behaviour: - kmalloc for a contiguous range with fail fast semantics - no compaction direct reclaim if the allocation enters the slow path. - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails The really, really stupid part about this is these kvmalloc() calls are run under memalloc_nofs task context, so all the allocations are always reduced to GFP_NOFS regardless of the fact that kvmalloc requires GFP_KERNEL to be passed in. IOWs, we're already telling kvmalloc to behave differently to the gfp flags we pass in, but it still won't allow vmalloc to be run with anything other than GFP_KERNEL. So, this patch open codes the kvmalloc() in the commit path to have the above described behaviour. The result is we more than halve the CPU time spend doing kvmalloc() in this path and transaction commits with 64kB objects in them more than doubles. i.e. we get ~5x reduction in CPU usage per costly-sized kvmalloc() invocation and the profile looks like this: - 37.60% xlog_cil_commit 16.01% memcpy_erms - 8.45% __kmalloc - 8.04% kmalloc_order_trace - 8.03% kmalloc_order - 7.93% alloc_pages - 7.90% __alloc_pages - 4.05% __alloc_pages_slowpath.constprop.0 - 2.18% get_page_from_freelist - 1.77% wake_all_kswapds .... - __wake_up_common_lock - 0.94% _raw_spin_lock_irqsave - 3.72% get_page_from_freelist - 2.43% _raw_spin_lock_irqsave - 5.72% vmalloc - 5.72% __vmalloc_node_range - 4.81% __get_vm_area_node.constprop.0 - 3.26% alloc_vmap_area - 2.52% _raw_spin_lock - 1.46% _raw_spin_lock 0.56% __alloc_pages_bulk - 4.66% kvfree - 3.25% vfree - __vfree - 3.23% __vunmap - 1.95% remove_vm_area - 1.06% free_vmap_area_noflush - 0.82% _raw_spin_lock - 0.68% _raw_spin_lock - 0.92% _raw_spin_lock - 1.40% kfree - 1.36% __free_pages - 1.35% __free_pages_ok - 1.02% _raw_spin_lock_irqsave It's worth noting that over 50% of the CPU time spent allocating these shadow buffers is now spent on spinlocks. So the shadow buffer allocation overhead is greatly reduced by getting rid of direct reclaim from kmalloc, and could probably be made even less costly if vmalloc() didn't use global spinlocks to protect it's structures. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-06xfs: sysfs: use default_groups in kobj_typeGreg Kroah-Hartman
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the xfs sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: linux-xfs@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-06spi: dt-bindings: mediatek,spi-mtk-nor: Fix example 'interrupts' propertyRob Herring
A phandle for 'interrupts' value is wrong and should be one or more numbers. Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20220106182518.1435497-9-robh@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
2022-01-06ice: Use bitmap_free() to free bitmapChristophe JAILLET
kfree() and bitmap_free() are the same. But using the latter is more consistent when freeing memory allocated with bitmap_zalloc(). Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-01-06ice: Optimize a few bitmap operationsChristophe JAILLET
When a bitmap is local to a function, it is safe to use the non-atomic __[set|clear]_bit(). No concurrent accesses can occur. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-01-06ice: Slightly simply ice_find_free_recp_res_idxChristophe JAILLET
The 'possible_idx' bitmap is set just after it is zeroed, so we can save the first step. The 'free_idx' bitmap is used only at the end of the function as the result of a bitmap xor operation. So there is no need to explicitly zero it before. So, slightly simply the code and remove 2 useless 'bitmap_zero()' call Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-01-06ice: improve switchdev's slow-pathWojciech Drewek
In current switchdev implementation, every VF PR is assigned to individual ring on switchdev ctrl VSI. For slow-path traffic, there is a mapping VF->ring done in software based on src_vsi value (by calling ice_eswitch_get_target_netdev function). With this change, HW solution is introduced which is more efficient. For each VF, src MAC (VF's MAC) filter will be created, which forwards packets to the corresponding switchdev ctrl VSI queue based on src MAC address. This filter has to be removed and then replayed in case of resetting one VF. Keep information about this rule in repr->mac_rule, thanks to that we know which rule has to be removed and replayed for a given VF. In case of CORE/GLOBAL all rules are removed automatically. We have to take care of readding them. This is done by ice_replay_vsi_adv_rule. When driver leaves switchdev mode, remove all advanced rules from switchdev ctrl VSI. This is done by ice_rem_adv_rule_for_vsi. Flag repr->rule_added is needed because in some cases reset might be triggered before VF sends request to add MAC. Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-01-06x86, sched: Fix undefined reference to init_freq_invariance_cppc() build errorHuang Rui
The init_freq_invariance_cppc function is implemented in smpboot and depends on CONFIG_SMP. MODPOST vmlinux.symvers MODINFO modules.builtin.modinfo GEN modules.builtin LD .tmp_vmlinux.kallsyms1 ld: drivers/acpi/cppc_acpi.o: in function `acpi_cppc_processor_probe': /home/ray/brahma3/linux/drivers/acpi/cppc_acpi.c:819: undefined reference to `init_freq_invariance_cppc' make: *** [Makefile:1161: vmlinux] Error 1 See https://lore.kernel.org/lkml/484af487-7511-647e-5c5b-33d4429acdec@infradead.org/. Fixes: 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Huang Rui <ray.huang@amd.com> [ rjw: Subject edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-01-06cpufreq: amd-pstate: Fix Kconfig dependencies for AMD P-StateHuang Rui
The AMD P-State driver is based on ACPI CPPC function, so ACPI should be dependence of this driver in the kernel config. In file included from ../drivers/cpufreq/amd-pstate.c:40:0: ../include/acpi/processor.h:226:2: error: unknown type name ‘phys_cpuid_t’ phys_cpuid_t phys_id; /* CPU hardware ID such as APIC ID for x86 */ ^~~~~~~~~~~~ ../include/acpi/processor.h:355:1: error: unknown type name ‘phys_cpuid_t’; did you mean ‘phys_addr_t’? phys_cpuid_t acpi_get_phys_id(acpi_handle, int type, u32 acpi_id); ^~~~~~~~~~~~ phys_addr_t CC drivers/rtc/rtc-rv3029c2.o ../include/acpi/processor.h:356:1: error: unknown type name ‘phys_cpuid_t’; did you mean ‘phys_addr_t’? phys_cpuid_t acpi_map_madt_entry(u32 acpi_id); ^~~~~~~~~~~~ phys_addr_t ../include/acpi/processor.h:357:20: error: unknown type name ‘phys_cpuid_t’; did you mean ‘phys_addr_t’? int acpi_map_cpuid(phys_cpuid_t phys_id, u32 acpi_id); ^~~~~~~~~~~~ phys_addr_t See https://lore.kernel.org/lkml/20e286d4-25d7-fb6e-31a1-4349c805aae3@infradead.org/. Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Huang Rui <ray.huang@amd.com> [ rjw: Subject edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-01-06cpufreq: amd-pstate: Fix struct amd_cpudata kernel-doc commentYang Li
Add the description of @req and @boost_supported in struct amd_cpudata kernel-doc comment to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. drivers/cpufreq/amd-pstate.c:104: warning: Function parameter or member 'req' not described in 'amd_cpudata' drivers/cpufreq/amd-pstate.c:104: warning: Function parameter or member 'boost_supported' not described in 'amd_cpudata' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Acked-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-01-06ice: replay advanced rules after resetVictor Raj
ice_replay_vsi_adv_rule will replay advanced rules for a given VSI. Exit this function when list of rules for given recipe is empty. Do not add rule when given vsi_handle does not match vsi_handle from the rule info. Use ICE_MAX_NUM_RECIPES instead of ICE_SW_LKUP_LAST in order to find advanced rules as well. Signed-off-by: Victor Raj <victor.raj@intel.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-01-06spi: qcom: geni: handle timeout for gpi modeVinod Koul
We missed adding handle_err for gpi mode, so add a new function spi_geni_handle_err() which would call handle_fifo_timeout() or newly added handle_gpi_timeout() based on mode Fixes: b59c122484ec ("spi: spi-geni-qcom: Add support for GPI dma") Reported-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Vinod Koul <vkoul@kernel.org> Link: https://lore.kernel.org/r/20220103071118.27220-2-vkoul@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
2022-01-06spi: qcom: geni: set the error code for gpi transferVinod Koul
Before we invoke spi_finalize_current_transfer() in spi_gsi_callback_result() we should set the spi->cur_msg->status as appropriate (0 for success, error otherwise). The helps to return error on transfer and not wait till it timesout on error Fixes: b59c122484ec ("spi: spi-geni-qcom: Add support for GPI dma") Signed-off-by: Vinod Koul <vkoul@kernel.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20220103071118.27220-1-vkoul@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
2022-01-06HID: magicmouse: Fix an error handling path in magicmouse_probe()Christophe JAILLET
If the timer introduced by the commit below is started, then it must be deleted in the error handling of the probe. Otherwise it would trigger once the driver is no more. Fixes: 0b91b4e4dae6 ("HID: magicmouse: Report battery level over USB") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Tested-by: José Expósito <jose.exposito89@gmail.com> Reported-by: <syzbot+a437546ec71b04dfb5ac@syzkaller.appspotmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2022-01-06HID: address kernel-doc warningsLukas Bulwahn
The command ./scripts/kernel-doc -none include/linux/hid.h reports: include/linux/hid.h:818: warning: cannot understand function prototype: 'struct hid_ll_driver ' include/linux/hid.h:1135: warning: expecting prototype for hid_may_wakeup(). Prototype was for hid_hw_may_wakeup() instead Address those kernel-doc warnings. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2022-01-06HID: intel-ish-hid: ishtp-fw-loader: Fix a kernel-doc formatting issueYang Li
This function had kernel-doc that not used a hash to separate the function name from the one line description. The warning was found by running scripts/kernel-doc, which is caused by using 'make W=1'. drivers/hid/intel-ish-hid/ishtp-fw-loader.c:271: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>