Age | Commit message (Collapse) | Author |
|
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not write to the free software foundation inc
59 temple place suite 330 boston ma 021110 1307 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 84 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190524100844.756442981@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Remove the unnecessary constant CRYPTO_ALG_TYPE_DIGEST, which has the
same value as CRYPTO_ALG_TYPE_HASH.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
In the PHYLINK_DEV operation type, the PHYLINK infrastructure can work
without an attached net_device. For printing usecases, instead, a struct
device * should be passed to PHYLINK using the phylink_config structure.
Also, netif_carrier_* calls ar guarded by the presence of a valid
net_device. When using the PHYLINK_DEV operation type, we cannot check
link status using the netif_carrier_ok() API so instead, keep an
internal state of the MAC and call mac_link_{down,up} only when the link
changed.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The phylink_config structure will encapsulate a pointer to a struct
device and the operation type requested for this instance of PHYLINK.
This patch does not make any functional changes, it just transitions the
PHYLINK internals and all its users to the new API.
A pointer to a phylink_config structure will be passed to
phylink_create() instead of the net_device directly. Also, the same
phylink_config pointer will be passed back to all phylink_mac_ops
callbacks instead of the net_device. Using this mechanism, a PHYLINK
user can get the original net_device using a structure such as
'to_net_dev(config->dev)' or directly the structure containing the
phylink_config using a container_of call.
At the moment, only the PHYLINK_NETDEV is defined as a valid operation
type for PHYLINK. In this mode, a valid reference to a struct device
linked to the original net_device should be passed to PHYLINK through
the phylink_config structure.
This API changes is mainly driven by the necessity of adding a new
operation type in PHYLINK that disconnects the phy_device from the
net_device and also works when the net_device is lacking.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A PF_EXITING task can stay associated with an offline css. If such
task calls task_get_css(), it can get stuck indefinitely. This can be
triggered by BSD process accounting which writes to a file with
PF_EXITING set when racing against memcg disable as in the backtrace
at the end.
After this change, task_get_css() may return a css which was already
offline when the function was called. None of the existing users are
affected by this change.
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched detected stalls on CPUs/tasks:
...
NMI backtrace for cpu 0
...
Call Trace:
<IRQ>
dump_stack+0x46/0x68
nmi_cpu_backtrace.cold.2+0x13/0x57
nmi_trigger_cpumask_backtrace+0xba/0xca
rcu_dump_cpu_stacks+0x9e/0xce
rcu_check_callbacks.cold.74+0x2af/0x433
update_process_times+0x28/0x60
tick_sched_timer+0x34/0x70
__hrtimer_run_queues+0xee/0x250
hrtimer_interrupt+0xf4/0x210
smp_apic_timer_interrupt+0x56/0x110
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:balance_dirty_pages_ratelimited+0x28f/0x3d0
...
btrfs_file_write_iter+0x31b/0x563
__vfs_write+0xfa/0x140
__kernel_write+0x4f/0x100
do_acct_process+0x495/0x580
acct_process+0xb9/0xdb
do_exit+0x748/0xa00
do_group_exit+0x3a/0xa0
get_signal+0x254/0x560
do_signal+0x23/0x5c0
exit_to_usermode_loop+0x5d/0xa0
prepare_exit_to_usermode+0x53/0x80
retint_user+0x8/0x8
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.2+
Fixes: ec438699a9ae ("cgroup, block: implement task_get_css() and use it in bio_associate_current()")
|
|
This patch moves common code between rht_ptr and rht_ptr_exclusive
into __rht_ptr. It also adds a new helper rht_ptr_rcu exclusively
for the RCU case. This way rht_ptr becomes a lock-only construct
so we can use the lighter rcu_dereference_protected primitive.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
force_sig_info always delivers to the current task and the signal
parameter always matches info.si_signo. So remove those parameters to
make it a simpler less error prone interface, and to make it clear
that none of the callers are doing anything clever.
This guarantees that force_sig_info will not grow any new buggy
callers that attempt to call force_sig on a non-current task, or that
pass an signal number that does not match info.si_signo.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
As synchronous exceptions really only make sense against the current
task (otherwise how are you synchronous) remove the task parameter
from from force_sig_fault to make it explicit that is what is going
on.
The two known exceptions that deliver a synchronous exception to a
stopped ptraced task have already been changed to
force_sig_fault_to_task.
The callers have been changed with the following emacs regular expression
(with obvious variations on the architectures that take more arguments)
to avoid typos:
force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
->
force_sig_fault(\1,\2,\3)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
current
In preparation for removing the task parameter from force_sig_fault
introduce force_sig_fault_to_task and use it for the two cases where
it matters.
On mips force_fcr31_sig calls force_sig_fault and is called on either
the current task, or a task that is suspended and is being switched to
by the scheduler. This is safe because the task being switched to by
the scheduler is guaranteed to be suspended. This ensures that
task->sighand is stable while the signal is delivered to it.
On parisc user_enable_single_step calls force_sig_fault and is in turn
called by ptrace_request. The function ptrace_request always calls
user_enable_single_step on a child that is stopped for tracing. The
child being traced and not reaped ensures that child->sighand is not
NULL, and that the child will not change child->sighand.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
Now that we don't have __rcu markers on the bpf_prog_array helpers,
let's use proper rcu_dereference_protected to obtain array pointer
under mutex.
We also don't need __rcu annotations on cgroup_bpf.inactive since
it's not read/updated concurrently.
v4:
* drop cgroup_rcu_xyz wrappers and use rcu APIs directly; presumably
should be more clear to understand which mutex/refcount protects
each particular place
v3:
* amend cgroup_rcu_dereference to include percpu_ref_is_dying;
cgroup_bpf is now reference counted and we don't hold cgroup_mutex
anymore in cgroup_bpf_release
v2:
* replace xchg with rcu_swap_protected
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Drop __rcu annotations and rcu read sections from bpf_prog_array
helper functions. They are not needed since all existing callers
call those helpers from the rcu update side while holding a mutex.
This guarantees that use-after-free could not happen.
In the next patches I'll fix the callers with missing
rcu_dereference_protected to make sparse/lockdep happy, the proper
way to use these helpers is:
struct bpf_prog_array __rcu *progs = ...;
struct bpf_prog_array *p;
mutex_lock(&mtx);
p = rcu_dereference_protected(progs, lockdep_is_held(&mtx));
bpf_prog_array_length(p);
bpf_prog_array_copy_to_user(p, ...);
bpf_prog_array_delete_safe(p, ...);
bpf_prog_array_copy_info(p, ...);
bpf_prog_array_copy(p, ...);
bpf_prog_array_free(p);
mutex_unlock(&mtx);
No functional changes! rcu_dereference_protected with lockdep_is_held
should catch any cases where we update prog array without a mutex
(I've looked at existing call sites and I think we hold a mutex
everywhere).
Motivation is to fix sparse warnings:
kernel/bpf/core.c:1803:9: warning: incorrect type in argument 1 (different address spaces)
kernel/bpf/core.c:1803:9: expected struct callback_head *head
kernel/bpf/core.c:1803:9: got struct callback_head [noderef] <asn:4> *
kernel/bpf/core.c:1877:44: warning: incorrect type in initializer (different address spaces)
kernel/bpf/core.c:1877:44: expected struct bpf_prog_array_item *item
kernel/bpf/core.c:1877:44: got struct bpf_prog_array_item [noderef] <asn:4> *
kernel/bpf/core.c:1901:26: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1901:26: expected struct bpf_prog_array_item *existing
kernel/bpf/core.c:1901:26: got struct bpf_prog_array_item [noderef] <asn:4> *
kernel/bpf/core.c:1935:26: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1935:26: expected struct bpf_prog_array_item *[assigned] existing
kernel/bpf/core.c:1935:26: got struct bpf_prog_array_item [noderef] <asn:4> *
v2:
* remove comment about potential race; that can't happen
because all callers are in rcu-update section
Cc: Roman Gushchin <guro@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Rename fscrypt_decrypt_page() to fscrypt_decrypt_pagecache_blocks() and
redefine its behavior to decrypt all filesystem blocks in the given
region of the given page, rather than assuming that the region consists
of just one filesystem block. Also remove the 'inode' and 'lblk_num'
parameters, since they can be retrieved from the page as it's already
assumed to be a pagecache page.
This is in preparation for allowing encryption on ext4 filesystems with
blocksize != PAGE_SIZE.
This is based on work by Chandan Rajendra.
Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Currently fscrypt_decrypt_page() does one of two logically distinct
things depending on whether FS_CFLG_OWN_PAGES is set in the filesystem's
fscrypt_operations: decrypt a pagecache page in-place, or decrypt a
filesystem block in-place in any page. Currently these happen to share
the same implementation, but this conflates the notion of blocks and
pages. It also makes it so that all callers have to provide inode and
lblk_num, when fscrypt could determine these itself for pagecache pages.
Therefore, move the FS_CFLG_OWN_PAGES behavior into a new function
fscrypt_decrypt_block_inplace(). This mirrors
fscrypt_encrypt_block_inplace().
This is in preparation for allowing encryption on ext4 filesystems with
blocksize != PAGE_SIZE.
Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Rename fscrypt_encrypt_page() to fscrypt_encrypt_pagecache_blocks() and
redefine its behavior to encrypt all filesystem blocks from the given
region of the given page, rather than assuming that the region consists
of just one filesystem block. Also remove the 'inode' and 'lblk_num'
parameters, since they can be retrieved from the page as it's already
assumed to be a pagecache page.
This is in preparation for allowing encryption on ext4 filesystems with
blocksize != PAGE_SIZE.
This is based on work by Chandan Rajendra.
Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
fscrypt_encrypt_page() behaves very differently depending on whether the
filesystem set FS_CFLG_OWN_PAGES in its fscrypt_operations. This makes
the function difficult to understand and document. It also makes it so
that all callers have to provide inode and lblk_num, when fscrypt could
determine these itself for pagecache pages.
Therefore, move the FS_CFLG_OWN_PAGES behavior into a new function
fscrypt_encrypt_block_inplace().
This is in preparation for allowing encryption on ext4 filesystems with
blocksize != PAGE_SIZE.
Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Now that fscrypt_ctx is not used for writes, remove the 'w' fields.
Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Currently, bounce page handling for writes to encrypted files is
unnecessarily complicated. A fscrypt_ctx is allocated along with each
bounce page, page_private(bounce_page) points to this fscrypt_ctx, and
fscrypt_ctx::w::control_page points to the original pagecache page.
However, because writes don't use the fscrypt_ctx for anything else,
there's no reason why page_private(bounce_page) can't just point to the
original pagecache page directly.
Therefore, this patch makes this change. In the process, it also cleans
up the API exposed to filesystems that allows testing whether a page is
a bounce page, getting the pagecache page from a bounce page, and
freeing a bounce page.
Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Currently the lifetime of bpf programs attached to a cgroup is bound
to the lifetime of the cgroup itself. It means that if a user
forgets (or intentionally avoids) to detach a bpf program before
removing the cgroup, it will stay attached up to the release of the
cgroup. Since the cgroup can stay in the dying state (the state
between being rmdir()'ed and being released) for a very long time, it
leads to a waste of memory. Also, it blocks a possibility to implement
the memcg-based memory accounting for bpf objects, because a circular
reference dependency will occur. Charged memory pages are pinning the
corresponding memory cgroup, and if the memory cgroup is pinning
the attached bpf program, nothing will be ever released.
A dying cgroup can not contain any processes, so the only chance for
an attached bpf program to be executed is a live socket associated
with the cgroup. So in order to release all bpf data early, let's
count associated sockets using a new percpu refcounter. On cgroup
removal the counter is transitioned to the atomic mode, and as soon
as it reaches 0, all bpf programs are detached.
Because cgroup_bpf_release() can block, it can't be called from
the percpu ref counter callback directly, so instead an asynchronous
work is scheduled.
The reference counter is not socket specific, and can be used for any
other types of programs, which can be executed from a cgroup-bpf hook
outside of the process context, had such a need arise in the future.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: jolsa@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Proc filesystem has special locking rules for various files. Thus
fanotify which opens files on event delivery can easily deadlock
against another process that waits for fanotify permission event to be
handled. Since permission events on /proc have doubtful value anyway,
just disallow them.
Link: https://lore.kernel.org/linux-fsdevel/20190320131642.GE9485@quack2.suse.cz/
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Currently, the inter-stutter interval is the same as the stutter duration,
that is, whatever number of jiffies is passed into torture_stutter_init().
This has worked well for quite some time, but the addition of
forward-progress testing to rcutorture can delay processes for several
seconds, which can triple the time that they are stuttered.
This commit therefore adds a second argument to torture_stutter_init()
that specifies the inter-stutter interval. While locktorture preserves
the current behavior, rcutorture uses the RCU CPU stall warning interval
to provide a wider inter-stutter interval.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
|
|
With this patch rcu_sync has a single state variable and the transition rules
become really simple:
GP_IDLE - owned by the first rcu_sync_enter() which moves it to
GP_ENTER - owned by rcu-callback which moves it to
GP_PASSED - owned by the last rcu_sync_exit() which moves it to
GP_EXIT - and this is the only "nontrivial" state.
rcu-callback moves it back to GP_IDLE unless another enter()
comes before a GP pass.
If rcu-callback is invoked before the next rcu_sync_exit() it
must see gp_count incremented by that enter() and set GP_PASSED.
Otherwise, if the next rcu_sync_exit() wins the race, it will
move it to
GP_REPLAY - owned by rcu-callback which moves it to GP_EXIT
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ paulmck: While here, apply READ_ONCE() and WRITE_ONCE() to ->gp_state. ]
[ paulmck: Tweaks to make htmldocs happy. (Reported by kbuild test robot.) ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
|
|
cgroup_threadgroup_rwsem
Turn DEFINE_STATIC_PERCPU_RWSEM() into __DEFINE_PERCPU_RWSEM() with the
additional "is_static" argument to introduce DEFINE_PERCPU_RWSEM().
Change cgroup.c to use DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
|
|
Now that the RCU flavors have been consolidated, rcu_sync_type makes no
sense because none of internal update functions aside from .held() depend
on gp_type. This commit therefore removes this field and consolidates
the relevant code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ paulmck: Added RCU and RCU-bh checks to rcu_sync_is_idle(). ]
[ paulmck: And applied subsequent feedback from Oleg Nesterov. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
|
|
Since commit title ("srcu: Allocate per-CPU data for DEFINE_SRCU() in
modules"), modules that call DEFINE_{STATIC,}SRCU will have a new array
of srcu_struct pointers, which is used by srcu code to initialize and
clean up these structures and save valuable per-cpu reserved space.
There is no reason for this array of pointers to be writable, and can
cause security or other hidden bugs. Mark these are read-only after the
module init has completed.
Tested with the following diff to ensure array not writable:
(diff is a bit reduced to avoid patch command getting confused)
a/kernel/module.c
b/kernel/module.c
-3506,6 +3506,14 static noinline int do_init_module [snip]
rcu_assign_pointer(mod->kallsyms, &mod->core_kallsyms);
#endif
module_enable_ro(mod, true);
+
+ if (mod->srcu_struct_ptrs) {
+ // Check if srcu_struct_ptrs access is possible
+ char x = *(char *)mod->srcu_struct_ptrs;
+ *(char *)mod->srcu_struct_ptrs = 0;
+ *(char *)mod->srcu_struct_ptrs = x;
+ }
+
mod_tree_remove_init(mod);
disable_ro_nx(&mod->init_layout);
module_arch_freeing_init(mod);
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: paulmck@linux.vnet.ibm.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: rcu@vger.kernel.org
Cc: kernel-hardening@lists.openwall.com
Cc: kernel-team@android.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
|
|
Adding DEFINE_SRCU() or DEFINE_STATIC_SRCU() to a loadable module requires
that the size of the reserved region be increased, which is not something
we want to be doing all that often. One approach would be to require
that loadable modules define an srcu_struct and invoke init_srcu_struct()
from their module_init function and cleanup_srcu_struct() from their
module_exit function. However, this is more than a bit user unfriendly.
This commit therefore creates an ___srcu_struct_ptrs linker section,
and pointers to srcu_struct structures created by DEFINE_SRCU() and
DEFINE_STATIC_SRCU() within a module are placed into that module's
___srcu_struct_ptrs section. The required init_srcu_struct() and
cleanup_srcu_struct() functions are then automatically invoked as needed
when that module is loaded and unloaded, thus allowing modules to continue
to use DEFINE_SRCU() and DEFINE_STATIC_SRCU() while avoiding the need
to increase the size of the reserved region.
Many of the algorithms and some of the code was cheerfully cherry-picked
from other code making use of linker sections, perhaps most notably from
tracepoints. All bugs are nevertheless the sole property of the author.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
[ paulmck: Use __section() and use "default" in srcu_module_notify()'s
"switch" statement as suggested by Joel Fernandes. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
This commit makes the kfree_rcu() macro's semantics be consistent
with the likes of kfree() by adding a check for NULL pointers, so
that kfree_rcu(NULL, ...) is a no-op.
Reported-by: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Reviewed-by: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
|
|
ACPI permits arbitrary producer->consumer interrupt links to be
described in AML, which means a topology such as the following
is perfectly legal:
Device (EXIU) {
Name (_HID, "SCX0008")
Name (_UID, Zero)
Name (_CRS, ResourceTemplate () {
...
})
}
Device (GPIO) {
Name (_HID, "SCX0007")
Name (_UID, Zero)
Name (_CRS, ResourceTemplate () {
Memory32Fixed (ReadWrite, SYNQUACER_GPIO_BASE, SYNQUACER_GPIO_SIZE)
Interrupt (ResourceConsumer, Edge, ActiveHigh, ExclusiveAndWake, 0, "\\_SB.EXIU") {
7,
}
})
...
}
The EXIU in this example is the external interrupt unit as can be found
on Socionext SynQuacer based platforms, which converts a block of 32 SPIs
from arbitrary polarity/trigger into level-high, with a separate set
of config/mask/unmask/clear controls.
The existing DT based driver in drivers/irqchip/irq-sni-exiu.c models
this as a hierarchical domain stacked on top of the GIC's irqdomain.
Since the GIC is modeled as a DT node as well, obtaining a reference
to this irqdomain is easily done by going through the parent link.
On ACPI systems, however, the GIC is not modeled as an object in the
namespace, and so device objects cannot refer to it directly. So in
order to obtain the irqdomain reference when driving the EXIU in ACPI
mode, we need a helper that implicitly grabs the default domain as the
parent of the hierarchy for interrupts allocated out of the global GSI
pool.
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Fix 'acccess' to 'access'.
Signed-off-by: Weitao Hou <houweitaoo@gmail.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Previously, get_valid_domain_for_dev() is used to retrieve the
DMA domain which has been attached to the device or allocate one
if no domain has been attached yet. As we have delegated the DMA
domain management to upper layer, this function is used purely to
allocate a private DMA domain if the default domain doesn't work
for ths device. Cleanup the code for readability.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
This patch drops support for I2C devices with 10 bit addressing. When I2C
device with 10 bit address is defined in DT, I3C master registration fails.
Address space for I2C devices has been reduced and ->i2c_funcs() hook has been
removed.
Because this patch series dropped support for 10 bit I2C devices, support is
also dropped in Cadence I3C master driver and Synopsys DesignWare I3C master
driver.
Signed-off-by: Przemyslaw Gaj <pgaj@cadence.com>
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
There is nothing really arm64 specific in the iommu_dma_ops
implementation, so move it to dma-iommu.c and keep a lot of symbols
self-contained. Note the implementation does depend on the
DMA_DIRECT_REMAP infrastructure for now, so we'll have to make the
DMA_IOMMU support depend on it, but this will be relaxed soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
We now have a arch_dma_prep_coherent architecture hook that is used
for the generic DMA remap allocator, and we should use the same
interface for the dma-iommu code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
No need for a __KERNEL__ guard outside uapi and add a missing comment
describing the #else cpp statement. Last but not least include
<linux/errno.h> instead of the asm version, which is frowned upon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Normally during iommu probing a device, a default doamin will
be allocated and attached to the device. The domain type of
the default domain is statically defined, which results in a
situation where the allocated default domain isn't suitable
for the device due to some limitations. We already have API
iommu_request_dm_for_dev() to replace a DMA domain with an
identity one. This adds iommu_request_dma_domain_for_dev()
to request a dma domain if an allocated identity domain isn't
suitable for the device in question.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
All of the callers pass current into force_sig_mceer so remove the
task parameter to make this obvious.
This also makes it clear that force_sig_mceerr passes current
into force_sig_info.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
All of the remaining callers pass current into force_sig so
remove the task parameter to make this obvious and to make
misuse more difficult in the future.
This also makes it clear force_sig passes current into force_sig_info.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
The function force_sigsegv is always called on the current task
so passing in current is redundant and not passing in current
makes this fact obvious.
This also makes it clear force_sigsegv always calls force_sig
on the current task.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
A scalable mode DMAR table walk would involve looking at bits in each stage
of walk, like,
1. Is PASID enabled in the context entry?
2. What's the size of PASID directory?
3. Is the PASID directory entry present?
4. Is the PASID table entry present?
5. Number of PASID table entries?
Hence, add these macros that will later be used during this walk.
Apart from adding new macros, move existing macros (like
pasid_pde_is_present(), get_pasid_table_from_pde() and pasid_supported())
to appropriate header files so that they could be reused.
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
If a PCI driver leaves the device handled by it in D0 and calls
pci_save_state() on the device in its ->suspend() or ->suspend_late()
callback, it can expect the device to stay in D0 over the whole
s2idle cycle. However, that may not be the case if there is a
spurious wakeup while the system is suspended, because in that case
pci_pm_suspend_noirq() will run again after pci_pm_resume_noirq()
which calls pci_restore_state(), via pci_pm_default_resume_early(),
so state_saved is cleared and the second iteration of
pci_pm_suspend_noirq() will invoke pci_prepare_to_sleep() which
may change the power state of the device.
To avoid that, add a new internal flag, skip_bus_pm, that will be set
by pci_pm_suspend_noirq() when it runs for the first time during the
given system suspend-resume cycle if the state of the device has
been saved already and the device is still in D0. Setting that flag
will cause the next iterations of pci_pm_suspend_noirq() to set
state_saved for pci_pm_resume_noirq(), so that it always restores the
device state from the originally saved data, and avoid calling
pci_prepare_to_sleep() for the device.
Fixes: 33e4f80ee69b ("ACPI / PM: Ignore spurious SCI wakeups from suspend-to-idle")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
|
|
On systems with ACPI platform firmware the last stage of hibernation
is analogous to system suspend to S3 (suspend-to-RAM), so it should
be handled analogously. In particular, pm_suspend_via_firmware()
should return 'true' in that stage to let the callers of it know that
control will be passed to the platform firmware going forward, so
pm_set_suspend_via_firmware() needs to be called then in analogy with
acpi_suspend_begin().
However, the platform hibernation ->begin() callback is invoked
during the "freeze" transition (before creating a snapshot image of
system memory) as well as during the "hibernate" transition which is
the last stage of it and pm_set_suspend_via_firmware() should be
invoked by that callback in the latter stage only.
In order to implement that redefine the hibernation ->begin()
callback to take a pm_message_t argument to indicate which stage
of hibernation is taking place and rework acpi_hibernation_begin()
and acpi_hibernation_begin_old() to take it into account as needed.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Add iWARP engine affinity setting for supporting iWARP over 100g.
iWARP cannot be distinguished by the LLH from L2, hence the
engine division will affect L2 as well. For this reason we add
a parameter to devlink to determine the engine division.
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When initializing status blocks use the affined hwfn
instead of the leading one for RDMA / Storage
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ethtool ops get_rxfh_context and set_rxfh_context are used to create,
remove and access parameters associated to RSS contexts, in a similar
fashion to get_rxfh and set_rxfh.
Add a small descritopn of these callbacks in the struct ethtool_ops doc.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In rcu_rrupt_from_idle, we want to check if it is called from within an
interrupt, but want to do such checking only for debug builds. lockdep
already tracks when we enter an interrupt. Let us expose it as an
assertion macro so it can be used to assert this.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: kernel-team@android.com
Cc: rcu@vger.kernel.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
|
|
When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
kthreads, a full and unconditional wakeup is required to initiate RCU
core processing. In contrast, when RCU core processing is carried
out by RCU_SOFTIRQ, a raise_softirq() suffices. Of course, there are
situations where raise_softirq() does a full wakeup, but these do not
occur with normal usage of rcu_read_unlock().
The reason that full wakeups can be problematic is that the scheduler
sometimes invokes rcu_read_unlock() with its pi or rq locks held,
which can of course result in deadlock in CONFIG_PREEMPT=y kernels when
rcu_read_unlock() invokes the scheduler. Scheduler invocations can happen
in the following situations: (1) The just-ended reader has been subjected
to RCU priority boosting, in which case rcu_read_unlock() must deboost,
(2) Interrupts were disabled across the call to rcu_read_unlock(), so
the quiescent state must be deferred, requiring a wakeup of the rcuc
kthread corresponding to the current CPU.
Now, the scheduler may hold one of its locks across rcu_read_unlock()
only if preemption has been disabled across the entire RCU read-side
critical section, which in the days prior to RCU flavor consolidation
meant that rcu_read_unlock() never needed to do wakeups. However, this
is no longer the case for any but the first rcu_read_unlock() following a
condition (e.g., preempted RCU reader) requiring special rcu_read_unlock()
attention. For example, an RCU read-side critical section might be
preempted, but preemption might be disabled across the rcu_read_unlock().
The rcu_read_unlock() must defer the quiescent state, and therefore
leaves the task queued on its leaf rcu_node structure. If a scheduler
interrupt occurs, the scheduler might well invoke rcu_read_unlock() with
one of its locks held. However, the preempted task is still queued, so
rcu_read_unlock() will attempt to defer the quiescent state once more.
When RCU core processing is carried out by RCU_SOFTIRQ, this works just
fine: The raise_softirq() function simply sets a bit in a per-CPU mask
and the RCU core processing will be undertaken upon return from interrupt.
Not so when RCU core processing is carried out by the rcuc kthread: In this
case, the required wakeup can result in deadlock.
The initial solution to this problem was to use set_tsk_need_resched() and
set_preempt_need_resched() to force a future context switch, which allows
rcu_preempt_note_context_switch() to report the deferred quiescent state
to RCU's core processing. Unfortunately for expedited grace periods,
there can be a significant delay between the call for a context switch
and the actual context switch.
This commit therefore introduces a ->deferred_qs flag to the task_struct
structure's rcu_special structure. This flag is initially false, and
is set to true by the first call to rcu_read_unlock() requiring special
attention, then finally reset back to false when the quiescent state is
finally reported. Then rcu_read_unlock() attempts full wakeups only when
->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
special attention. Note that a chain of RCU readers linked by some other
sort of reader may find that a later rcu_read_unlock() is once again able
to do a full wakeup, courtesy of an intervening preemption:
rcu_read_lock();
/* preempted */
local_irq_disable();
rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
rcu_read_lock();
local_irq_enable();
preempt_disable()
rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
rcu_read_lock();
preempt_enable();
/* preempted, >deferred_qs reset. */
local_irq_disable();
rcu_read_unlock(); /* Can again do full wakeup, sets ->deferred_qs. */
Such linked RCU readers do not yet seem to appear in the Linux kernel, and
it is probably best if they don't. However, RCU needs to handle them, and
some variations on this theme could make even raise_softirq() unsafe due to
the possibility of its doing a full wakeup. This commit therefore also
avoids invoking raise_softirq() when the ->deferred_qs set flag is set.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
- Fix a regression that disabled device-mapper dax support
- Remove unnecessary hardened-user-copy overhead (>30%) for dax
read(2)/write(2).
- Fix some compilation warnings.
* tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
dax: Arrange for dax_supported check to span multiple devices
libnvdimm: Fix compilation warnings with W=1
|
|
After previous patches, verifier will mark a insn if it really needs zero
extension on dst_reg.
It is then for back-ends to decide how to use such information to eliminate
unnecessary zero extension code-gen during JIT compilation.
One approach is verifier insert explicit zero extension for those insns
that need zero extension in a generic way, JIT back-ends then do not
generate zero extension for sub-register write at default.
However, only those back-ends which do not have hardware zero extension
want this optimization. Back-ends like x86_64 and AArch64 have hardware
zero extension support that the insertion should be disabled.
This patch introduces new target hook "bpf_jit_needs_zext" which returns
false at default, meaning verifier zero extension insertion is disabled at
default. A back-end could override this hook to return true if it doesn't
have hardware support and want verifier insert zero extension explicitly.
Offload targets do not use this native target hook, instead, they could
get the optimization results using bpf_prog_offload_ops.finalize.
NOTE: arches could have diversified features, it is possible for one arch
to have hardware zero extension support for some sub-register write insns
but not for all. For example, PowerPC, SPARC have zero extended loads, but
not for alu32. So when verifier zero extension insertion enabled, these JIT
back-ends need to peephole insns to remove those zero extension inserted
for insn that actually has hardware zero extension support. The peephole
could be as simple as looking the next insn, if it is a special zero
extension insn then it is safe to eliminate it if the current insn has
hardware zero extension support.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The encoding for this new variant is based on BPF_X format. "imm" field was
0 only, now it could be 1 which means doing zero extension unconditionally
.code = BPF_ALU | BPF_MOV | BPF_X
.dst_reg = DST
.src_reg = SRC
.imm = 1
We use this new form for doing zero extension for which verifier will
guarantee SRC == DST.
Implications on JIT back-ends when doing code-gen for
BPF_ALU | BPF_MOV | BPF_X:
1. No change if hardware already does zero extension unconditionally for
sub-register write.
2. Otherwise, when seeing imm == 1, just generate insns to clear high
32-bit. No need to generate insns for the move because when imm == 1,
dst_reg is the same as src_reg at the moment.
Interpreter doesn't need change as well. It is doing unconditionally zero
extension for mov32 already.
One helper macro BPF_ZEXT_REG is added to help creating zero extension
insn using this new mov32 variant.
One helper function insn_is_zext is added for checking one insn is an
zero extension on dst. This will be widely used by a few JIT back-ends in
later patches in this set.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
eBPF ISA specification requires high 32-bit cleared when low 32-bit
sub-register is written. This applies to destination register of ALU32 etc.
JIT back-ends must guarantee this semantic when doing code-gen. x86_64 and
AArch64 ISA has the same semantics, so the corresponding JIT back-end
doesn't need to do extra work.
However, 32-bit arches (arm, x86, nfp etc.) and some other 64-bit arches
(PowerPC, SPARC etc) need to do explicit zero extension to meet this
requirement, otherwise code like the following will fail.
u64_value = (u64) u32_value
... other uses of u64_value
This is because compiler could exploit the semantic described above and
save those zero extensions for extending u32_value to u64_value, these JIT
back-ends are expected to guarantee this through inserting extra zero
extensions which however could be a significant increase on the code size.
Some benchmarks show there could be ~40% sub-register writes out of total
insns, meaning at least ~40% extra code-gen.
One observation is these extra zero extensions are not always necessary.
Take above code snippet for example, it is possible u32_value will never be
casted into a u64, the value of high 32-bit of u32_value then could be
ignored and extra zero extension could be eliminated.
This patch implements this idea, insns defining sub-registers will be
marked when the high 32-bit of the defined sub-register matters. For
those unmarked insns, it is safe to eliminate high 32-bit clearnace for
them.
Algo:
- Split read flags into READ32 and READ64.
- Record index of insn that does sub-register write. Keep the index inside
reg state and update it during verifier insn walking.
- A full register read on a sub-register marks its definition insn as
needing zero extension on dst register.
A new sub-register write overrides the old one.
- When propagating read64 during path pruning, also mark any insn defining
a sub-register that is read in the pruned path as full-register.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|