Age | Commit message (Collapse) | Author |
|
SD_DEGENERATE_GROUPS_MASK is only useful for sched/topology.c, but still
gets defined for anyone who imports topology.h, leading to a flurry of
unused variable warnings.
Move it out of the header and place it next to the SD degeneration
functions in sched/topology.c.
Fixes: 4ee4ea443a5d ("sched/topology: Introduce SD metaflag for flags needing > 1 groups")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200825133216.9163-2-valentin.schneider@arm.com
|
|
Defining an array in a header imported all over the place clearly is a daft
idea, that still didn't stop me from doing it.
Leave a declaration of sd_flag_debug in topology.h and move its definition
to sched/debug.c.
Fixes: b6e862f38672 ("sched/topology: Define and assign sched_domain flag metadata")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200825133216.9163-1-valentin.schneider@arm.com
|
|
The bits PF_IO_WORKER and PF_WQ_WORKER are tested together in
sched_submit_work() which is considered to be a hot path.
If the two bits cross the 8 or 16 bit boundary then most architecture
require multiple load instructions in order to create the constant
value. Also, such a value can not be encoded within the compare opcode.
By moving the bit definition within the same block, the compiler can
create/use one immediate value.
For some reason gcc-10 on ARM64 requires both bits to be next to each
other in order to issue "tst reg, val; bne label". Otherwise the result
is "mov reg1, val; tst reg, reg1; bne label".
Move PF_VCPU out of the way so that PF_IO_WORKER can be next to
PF_WQ_WORKER.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200819195505.y3fxk72sotnrkczi@linutronix.de
|
|
Problem:
raw_local_irq_save(); // software state on
local_irq_save(); // software state off
...
local_irq_restore(); // software state still off, because we don't enable IRQs
raw_local_irq_restore(); // software state still off, *whoopsie*
existing instances:
- lock_acquire()
raw_local_irq_save()
__lock_acquire()
arch_spin_lock(&graph_lock)
pv_wait() := kvm_wait() (same or worse for Xen/HyperV)
local_irq_save()
- trace_clock_global()
raw_local_irq_save()
arch_spin_lock()
pv_wait() := kvm_wait()
local_irq_save()
- apic_retrigger_irq()
raw_local_irq_save()
apic->send_IPI() := default_send_IPI_single_phys()
local_irq_save()
Possible solutions:
A) make it work by enabling the tracing inside raw_*()
B) make it work by keeping tracing disabled inside raw_*()
C) call it broken and clean it up now
Now, given that the only reason to use the raw_* variant is because you don't
want tracing. Therefore A) seems like a weird option (although it can be done).
C) is tempting, but OTOH it ends up converting a _lot_ of code to raw just
because there is one raw user, this strips the validation/tracing off for all
the other users.
So we pick B) and declare any code that ends up doing:
raw_local_irq_save()
local_irq_save()
lockdep_assert_irqs_disabled();
broken. AFAICT this problem has existed forever, the only reason it came
up is because commit: 859d069ee1dd ("lockdep: Prepare for NMI IRQ
state tracking") changed IRQ tracing vs lockdep recursion and the
first instance is fairly common, the other cases hardly ever happen.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[rewrote changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200723105615.1268126-1-npiggin@gmail.com
|
|
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.546087214@infradead.org
|
|
This allows moving the leave_mm() call into generic code before
rcu_idle_enter(). Gets rid of more trace_*_rcuidle() users.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.369441600@infradead.org
|
|
Sven reported that commit a21ee6055c30 ("lockdep: Change
hardirq{s_enabled,_context} to per-cpu variables") caused trouble on
s390 because their this_cpu_*() primitives disable preemption which
then lands back tracing.
On the one hand, per-cpu ops should use preempt_*able_notrace() and
raw_local_irq_*(), on the other hand, we can trivialy use raw_cpu_*()
ops for this.
Fixes: a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context} to per-cpu variables")
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200821085348.192346882@infradead.org
|
|
is_idle_task() may be used from noinstr functions such as
irqentry_enter(). Since the compiler is free to not inline regular
inline functions, switch to using __always_inline.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200820172046.GA177701@elver.google.com
|
|
Backmerge requested by Tomi for a fix to omap inconsistent
locking state issue, and because we need at least v5.9-rc2 now.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
|
|
The function consume_skb is only meaningful when tracing is enabled.
This patch makes it conditional on CONFIG_TRACEPOINTS.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Previous to the change to distinguish read-write accesses, when
CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=y is set, KCSAN would consider
the non-atomic bitops as atomic. We want to partially revert to this
behaviour, but with one important distinction: report racing
modifications, since lost bits due to non-atomicity are certainly
possible.
Given the operations here only modify a single bit, assuming
non-atomicity of the writer is sufficient may be reasonable for certain
usage (and follows the permissible nature of the "assume plain writes
atomic" rule). In other words:
1. We want non-atomic read-modify-write races to be reported;
this is accomplished by kcsan_check_read(), where any
concurrent write (atomic or not) will generate a report.
2. We do not want to report races with marked readers, but -do-
want to report races with unmarked readers; this is
accomplished by the instrument_write() ("assume atomic
write" with Kconfig option set).
With the above rules, when KCSAN_ASSUME_PLAIN_WRITES_ATOMIC is selected,
it is hoped that KCSAN's reporting behaviour is better aligned with
current expected permissible usage for non-atomic bitops.
Note that, a side-effect of not telling KCSAN that the accesses are
read-writes, is that this information is not displayed in the access
summary in the report. It is, however, visible in inline-expanded stack
traces. For now, it does not make sense to introduce yet another special
case to KCSAN's runtime, only to cater to the case here.
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Use instrument_atomic_read_write() for atomic RMW ops.
Cc: Will Deacon <will@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: <linux-arch@vger.kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Use the new instrument_read_write() where appropriate.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Introduce read-write instrumentation hooks, to more precisely denote an
operation's behaviour.
KCSAN is able to distinguish compound instrumentation, and with the new
instrumentation we then benefit from improved reporting. More
importantly, read-write compound operations should not implicitly be
treated as atomic, if they aren't actually atomic.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Add support for compounded read-write instrumentation if supported by
the compiler. Adds the necessary instrumentation functions, and a new
type which is used to generate a more descriptive report.
Furthermore, such compounded memory access instrumentation is excluded
from the "assume aligned writes up to word size are atomic" rule,
because we cannot assume that the compiler emits code that is atomic for
compound ops.
LLVM/Clang added support for the feature in:
https://github.com/llvm/llvm-project/commit/785d41a261d136b64ab6c15c5d35f2adc5ad53e3
The new instrumentation is emitted for sets of memory accesses in the
same basic block to the same address with at least one read appearing
before a write. These typically result from compound operations such as
++, --, +=, -=, |=, &=, etc. but also equivalent forms such as "var =
var + 1". Where the compiler determines that it is equivalent to emit a
call to a single __tsan_read_write instead of separate __tsan_read and
__tsan_write, we can then benefit from improved performance and better
reporting for such access patterns.
The new reports now show that the ops are both reads and writes, for
example:
read-write to 0xffffffff90548a38 of 8 bytes by task 143 on cpu 3:
test_kernel_rmw_array+0x45/0xa0
access_thread+0x71/0xb0
kthread+0x21e/0x240
ret_from_fork+0x22/0x30
read-write to 0xffffffff90548a38 of 8 bytes by task 144 on cpu 2:
test_kernel_rmw_array+0x45/0xa0
access_thread+0x71/0xb0
kthread+0x21e/0x240
ret_from_fork+0x22/0x30
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
changed ndisc_ifinfo_sysctl_change to take a kernel pointer. Adjust its
prototype in net/ndisc.h as well to fix the following sparse warning:
net/ipv6/ndisc.c:1838:5: error: symbol 'ndisc_ifinfo_sysctl_change' redeclared with different type (incompatible argument 3 (different address spaces)):
net/ipv6/ndisc.c:1838:5: int extern [addressable] [signed] [toplevel] ndisc_ifinfo_sysctl_change( ... )
net/ipv6/ndisc.c: note: in included file (through include/net/ipv6.h):
./include/net/ndisc.h:496:5: note: previously declared as:
./include/net/ndisc.h:496:5: int extern [addressable] [signed] [toplevel] ndisc_ifinfo_sysctl_change( ... )
net/ipv6/ndisc.c: note: in included file (through include/net/ip6_route.h):
Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Don't flag SCTP heartbeat as invalid for re-used connections,
from Florian Westphal.
2) Bogus overlap report due to rbtree tree rotations, from Stefano Brivio.
3) Detect partial overlap with start end point match, also from Stefano.
4) Skip netlink dump of NFTA_SET_USERDATA is unset.
5) Incorrect nft_list_attributes enumeration definition.
6) Missing zeroing before memcpy to destination register, also
from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Avoid -Wunused-const-variable warnings for "make W=1".
Reported-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
|
|
This reverts commit 5c9fa16e8abd342ce04dc830c1ebb2a03abf6c05.
Since PROT_SAO can still be useful for certain classes of software,
reintroduce it. Concerns about guest migration for LPARs using SAO
will be addressed next.
Signed-off-by: Shawn Anastasio <shawn@anastas.io>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200821185558.35561-2-shawn@anastas.io
|
|
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.
[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
|
Pull networking fixes from David Miller:
"Nothing earth shattering here, lots of small fixes (f.e. missing RCU
protection, bad ref counting, missing memset(), etc.) all over the
place:
1) Use get_file_rcu() in task_file iterator, from Yonghong Song.
2) There are two ways to set remote source MAC addresses in macvlan
driver, but only one of which validates things properly. Fix this.
From Alvin Šipraga.
3) Missing of_node_put() in gianfar probing, from Sumera
Priyadarsini.
4) Preserve device wanted feature bits across multiple netlink
ethtool requests, from Maxim Mikityanskiy.
5) Fix rcu_sched stall in task and task_file bpf iterators, from
Yonghong Song.
6) Avoid reset after device destroy in ena driver, from Shay
Agroskin.
7) Missing memset() in netlink policy export reallocation path, from
Johannes Berg.
8) Fix info leak in __smc_diag_dump(), from Peilin Ye.
9) Decapsulate ECN properly for ipv6 in ipv4 tunnels, from Mark
Tomlinson.
10) Fix number of data stream negotiation in SCTP, from David Laight.
11) Fix double free in connection tracker action module, from Alaa
Hleihel.
12) Don't allow empty NHA_GROUP attributes, from Nikolay Aleksandrov"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (46 commits)
net: nexthop: don't allow empty NHA_GROUP
bpf: Fix two typos in uapi/linux/bpf.h
net: dsa: b53: check for timeout
tipc: call rcu_read_lock() in tipc_aead_encrypt_done()
net/sched: act_ct: Fix skb double-free in tcf_ct_handle_fragments() error flow
net: sctp: Fix negotiation of the number of data streams.
dt-bindings: net: renesas, ether: Improve schema validation
gre6: Fix reception with IP6_TNL_F_RCV_DSCP_COPY
hv_netvsc: Fix the queue_mapping in netvsc_vf_xmit()
hv_netvsc: Remove "unlikely" from netvsc_select_queue
bpf: selftests: global_funcs: Check err_str before strstr
bpf: xdp: Fix XDP mode when no mode flags specified
selftests/bpf: Remove test_align leftovers
tools/resolve_btfids: Fix sections with wrong alignment
net/smc: Prevent kernel-infoleak in __smc_diag_dump()
sfc: fix build warnings on 32-bit
net: phy: mscc: Fix a couple of spelling mistakes "spcified" -> "specified"
libbpf: Fix map index used in error message
net: gemini: Fix missing free_netdev() in error path of gemini_ethernet_port_probe()
net: atlantic: Use readx_poll_timeout() for large timeout
...
|
|
printk wants to store various timestamps (MONOTONIC, REALTIME, BOOTTIME) to
make correlation of dmesg from several systems easier.
Provide an interface to retrieve all three timestamps in one go.
There are some caveats:
1) Boot time and late sleep time injection
Boot time is a racy access on 32bit systems if the sleep time injection
happens late during resume and not in timekeeping_resume(). That could be
avoided by expanding struct tk_read_base with boot offset for 32bit and
adding more overhead to the update. As this is a hard to observe once per
resume event which can be filtered with reasonable effort using the
accurate mono/real timestamps, it's probably not worth the trouble.
Aside of that it might be possible on 32 and 64 bit to observe the
following when the sleep time injection happens late:
CPU 0 CPU 1
timekeeping_resume()
ktime_get_fast_timestamps()
mono, real = __ktime_get_real_fast()
inject_sleep_time()
update boot offset
boot = mono + bootoffset;
That means that boot time already has the sleep time adjustment, but
real time does not. On the next readout both are in sync again.
Preventing this for 64bit is not really feasible without destroying the
careful cache layout of the timekeeper because the sequence count and
struct tk_read_base would then need two cache lines instead of one.
2) Suspend/resume timestamps
Access to the time keeper clock source is disabled accross the innermost
steps of suspend/resume. The accessors still work, but the timestamps
are frozen until time keeping is resumed which happens very early.
For regular suspend/resume there is no observable difference vs. sched
clock, but it might affect some of the nasty low level debug printks.
OTOH, access to sched clock is not guaranteed accross suspend/resume on
all systems either so it depends on the hardware in use.
If that turns out to be a real problem then this could be mitigated by
using sched clock in a similar way as during early boot. But it's not as
trivial as on early boot because it needs some careful protection
against the clock monotonic timestamp jumping backwards on resume.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200814115512.159981360@linutronix.de
|
|
This patch adds further attributes to the event. These attributes are
helpful to understand the context of the message and can be used
to filter the events.
There are three common items. Source context, target context and tclass.
There are also items from the outcome of operation performed.
An event is similar to:
<...>-1309 [002] .... 6346.691689: selinux_audited:
requested=0x4000000 denied=0x4000000 audited=0x4000000
result=-13
scontext=system_u:system_r:cupsd_t:s0-s0:c0.c1023
tcontext=system_u:object_r:bin_t:s0 tclass=file
With systems where many denials are occurring, it is useful to apply a
filter. The filtering is a set of logic that is inserted with
the filter file. Example:
echo "tclass==\"file\" " > events/avc/selinux_audited/filter
This adds that we only get tclass=file.
The trace can also have extra properties. Adding the user stack
can be done with
echo 1 > options/userstacktrace
Now the output will be
runcon-1365 [003] .... 6960.955530: selinux_audited:
requested=0x4000000 denied=0x4000000 audited=0x4000000
result=-13
scontext=system_u:system_r:cupsd_t:s0-s0:c0.c1023
tcontext=system_u:object_r:bin_t:s0 tclass=file
runcon-1365 [003] .... 6960.955560: <user stack trace>
=> <00007f325b4ce45b>
=> <00005607093efa57>
Signed-off-by: Peter Enderborg <peter.enderborg@sony.com>
Reviewed-by: Thiébaud Weksteen <tweek@google.com>
Acked-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
The audit data currently captures which process and which target
is responsible for a denial. There is no data on where exactly in the
process that call occurred. Debugging can be made easier by being able to
reconstruct the unified kernel and userland stack traces [1]. Add a
tracepoint on the SELinux denials which can then be used by userland
(i.e. perf).
Although this patch could manually be added by each OS developer to
trouble shoot a denial, adding it to the kernel streamlines the
developers workflow.
It is possible to use perf for monitoring the event:
# perf record -e avc:selinux_audited -g -a
^C
# perf report -g
[...]
6.40% 6.40% audited=800000 tclass=4
|
__libc_start_main
|
|--4.60%--__GI___ioctl
| entry_SYSCALL_64
| do_syscall_64
| __x64_sys_ioctl
| ksys_ioctl
| binder_ioctl
| binder_set_nice
| can_nice
| capable
| security_capable
| cred_has_capability.isra.0
| slow_avc_audit
| common_lsm_audit
| avc_audit_post_callback
| avc_audit_post_callback
|
It is also possible to use the ftrace interface:
# echo 1 > /sys/kernel/debug/tracing/events/avc/selinux_audited/enable
# cat /sys/kernel/debug/tracing/trace
tracer: nop
entries-in-buffer/entries-written: 1/1 #P:8
[...]
dmesg-3624 [001] 13072.325358: selinux_denied: audited=800000 tclass=4
The tclass value can be mapped to a class by searching
security/selinux/flask.h. The audited value is a bit field of the
permissions described in security/selinux/av_permissions.h for the
corresponding class.
[1] https://source.android.com/devices/tech/debug/native_stack_dump
Signed-off-by: Thiébaud Weksteen <tweek@google.com>
Suggested-by: Joel Fernandes <joelaf@google.com>
Reviewed-by: Peter Enderborg <peter.enderborg@sony.com>
Acked-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
Alexei Starovoitov says:
====================
pull-request: bpf 2020-08-21
The following pull-request contains BPF updates for your *net* tree.
We've added 11 non-merge commits during the last 5 day(s) which contain
a total of 12 files changed, 78 insertions(+), 24 deletions(-).
The main changes are:
1) three fixes in BPF task iterator logic, from Yonghong.
2) fix for compressed dwarf sections in vmlinux, from Jiri.
3) fix xdp attach regression, from Andrii.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- The CLINT driver has been split in two: one to handle the M-mode
CLINT (memory mapped and used on NOMMU systems) and one to handle the
S-mode CLINT (via SBI).
- The addition of SiFive's drivers to rv32_defconfig
* tag 'riscv-for-linus-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Add SiFive drivers to rv32_defconfig
dt-bindings: timer: Add CLINT bindings
RISC-V: Remove CLINT related code from timer and arch
clocksource/drivers: Add CLINT timer driver
RISC-V: Add mechanism to provide custom IPI operations
|
|
Also remove trailing whitespaces in bpf_skb_get_tunnel_key example code.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200821133642.18870-1-tklauser@distanz.ch
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Improvements to ext4's block allocator performance for very large file
systems, especially when the file system or files which are highly
fragmented. There is a new mount option, prefetch_block_bitmaps which
will pull in the block bitmaps and set up the in-memory buddy bitmaps
when the file system is initially mounted.
Beyond that, a lot of bug fixes and cleanups. In particular, a number
of changes to make ext4 more robust in the face of write errors or
file system corruptions"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits)
ext4: limit the length of per-inode prealloc list
ext4: reorganize if statement of ext4_mb_release_context()
ext4: add mb_debug logging when there are lost chunks
ext4: Fix comment typo "the the".
jbd2: clean up checksum verification in do_one_pass()
ext4: change to use fallthrough macro
ext4: remove unused parameter of ext4_generic_delete_entry function
mballoc: replace seq_printf with seq_puts
ext4: optimize the implementation of ext4_mb_good_group()
ext4: delete invalid comments near ext4_mb_check_limits()
ext4: fix typos in ext4_mb_regular_allocator() comment
ext4: fix checking of directory entry validity for inline directories
fs: prevent BUG_ON in submit_bh_wbc()
ext4: correctly restore system zone info when remount fails
ext4: handle add_system_zone() failure in ext4_setup_system_zone()
ext4: fold ext4_data_block_valid_rcu() into the caller
ext4: check journal inode extents more carefully
ext4: don't allow overlapping system zones
ext4: handle error of ext4_setup_system_zone() on remount
ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp()
...
|
|
Following bug was reported via irc:
nft list ruleset
set knock_candidates_ipv4 {
type ipv4_addr . inet_service
size 65535
elements = { 127.0.0.1 . 123,
127.0.0.1 . 123 }
}
..
udp dport 123 add @knock_candidates_ipv4 { ip saddr . 123 }
udp dport 123 add @knock_candidates_ipv4 { ip saddr . udp dport }
It should not have been possible to add a duplicate set entry.
After some debugging it turned out that the problem is the immediate
value (123) in the second-to-last rule.
Concatenations use 32bit registers, i.e. the elements are 8 bytes each,
not 6 and it turns out the kernel inserted
inet firewall @knock_candidates_ipv4
element 0100007f ffff7b00 : 0 [end]
element 0100007f 00007b00 : 0 [end]
Note the non-zero upper bits of the first element. It turns out that
nft_immediate doesn't zero the destination register, but this is needed
when the length isn't a multiple of 4.
Furthermore, the zeroing in nft_payload is broken. We can't use
[len / 4] = 0 -- if len is a multiple of 4, index is off by one.
Skip zeroing in this case and use a conditional instead of (len -1) / 4.
Fixes: 49499c3e6e18 ("netfilter: nf_tables: switch registers to 32 bit addressing")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This should be NFTA_LIST_UNSPEC instead of NFTA_LIST_UNPEC, all other
similar attribute definitions are postfixed with _UNSPEC.
Fixes: 96518518cc41 ("netfilter: add nftables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
arm64 requires a vcpu fd (KVM_HAS_DEVICE_ATTR vcpu ioctl) to probe
support for steal-time. However this is unnecessary, as only a KVM
fd is required, and it complicates userspace (userspace may prefer
delaying vcpu creation until after feature probing). Introduce a cap
that can be checked instead. While x86 can already probe steal-time
support with a kvm fd (KVM_GET_SUPPORTED_CPUID), we add the cap there
too for consistency.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Steven Price <steven.price@arm.com>
Link: https://lore.kernel.org/r/20200804170604.42662-7-drjones@redhat.com
|
|
When updating the stolen time we should always read the current
stolen time from the user provided memory, not from a kernel
cache. If we use a cache then we'll end up resetting stolen time
to zero on the first update after migration.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200804170604.42662-5-drjones@redhat.com
|
|
We can use typeof() to avoid the need for the type input.
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200804170604.42662-4-drjones@redhat.com
|
|
Revert "crypto: hash - Add real ahash walk interface"
This reverts commit 75ecb231ff45b54afa9f4ec9137965c3c00868f4.
The callers of the functions in this commit were removed in ab8085c130ed
Remove these unused calls.
Fixes: ab8085c130ed ("crypto: x86 - remove SHA multibuffer routines and mcryptd")
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
All callers of these primitives will
* discard anything we might've copied in case of error
* ignore the csum value in case of error
* always pass 0xffffffff as the initial sum, so the
resulting csum value (in case of success, that is) will never be 0.
That suggest the following calling conventions:
* don't pass err_ptr - just return 0 on error.
* don't bother with zeroing destination, etc. in case of error
* don't pass the initial sum - just use 0xffffffff.
This commit does the minimal conversion in the instances of csum_and_copy_...();
the changes of actual asm code behind them are done later in the series.
Note that this asm code is often shared with csum_partial_copy_nocheck();
the difference is that csum_partial_copy_nocheck() passes 0 for initial
sum while csum_and_copy_..._user() pass 0xffffffff. Fortunately, we are
free to pass 0xffffffff in all cases and subsequent patches will use that
freedom without any special comments.
A part that could be split off: parisc and uml/i386 claimed to have
csum_and_copy_to_user() instances of their own, but those were identical
to the generic one, so we simply drop them. Not sure if it's worth
a separate commit...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
It's always 0. Note that we theoretically could use ~0U as well -
result will be the same modulo 0xffff, _if_ the damn thing did the
right thing for any value of initial sum; later we'll make use of
that when convenient.
However, unlike csum_and_copy_..._user(), there are instances that
did not work for arbitrary initial sums; c6x is one such.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
quite a few architectures have the same csum_partial_copy_nocheck() -
simply memcpy() the data and then return the csum of the copy.
hexagon, parisc, ia64, s390, um: explicitly spelled out that way.
arc, arm64, csky, h8300, m68k/nommu, microblaze, mips/GENERIC_CSUM, nds32,
nios2, openrisc, riscv, unicore32: end up picking the same thing spelled
out in lib/checksum.h (with varying amounts of perversions along the way).
everybody else (alpha, arm, c6x, m68k/mmu, mips/!GENERIC_CSUM, powerpc,
sh, sparc, x86, xtensa) have non-generic variants. For all except c6x
the declaration is in their asm/checksum.h. c6x uses the wrapper
from asm-generic/checksum.h that would normally lead to the lib/checksum.h
instance, but in case of c6x we end up using an asm function from arch/c6x
instead.
Screw that mess - have architectures with private instances define
_HAVE_ARCH_CSUM_AND_COPY in their asm/checksum.h and have the default
one right in net/checksum.h conditional on _HAVE_ARCH_CSUM_AND_COPY
*not* defined.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
it's always 0
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
We add a separate CLINT timer driver for Linux RISC-V M-mode (i.e.
RISC-V NoMMU kernel).
The CLINT MMIO device provides three things:
1. 64bit free running counter register
2. 64bit per-CPU time compare registers
3. 32bit per-CPU inter-processor interrupt registers
Unlike other timer devices, CLINT provides IPI registers along with
timer registers. To use CLINT IPI registers, the CLINT timer driver
provides IPI related callbacks to arch/riscv.
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Tested-by: Emil Renner Berhing <kernel@esmil.dk>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Atish Patra <atish.patra@wdc.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Pull dma-mapping fixes from Christoph Hellwig:
"Fix more fallout from the dma-pool changes (Nicolas Saenz Julienne,
me)"
* tag 'dma-mapping-5.9-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-pool: Only allocate from CMA when in same memory zone
dma-pool: fix coherent pool allocations for IOMMU mappings
|
|
Fix rxrpc_kernel_get_srtt() to indicate the validity of the returned
smoothed RTT. If we haven't had any valid samples yet, the SRTT isn't
useful.
Fixes: c410bf01933e ("rxrpc: Fix the excessive initial retransmission timeout")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
The Rx protocol has a mechanism to help generate RTT samples that works by
a client transmitting a REQUESTED-type ACK when it receives a DATA packet
that has the REQUEST_ACK flag set.
The peer, however, may interpose other ACKs before transmitting the
REQUESTED-ACK, as can be seen in the following trace excerpt:
rxrpc_tx_data: c=00000044 DATA d0b5ece8:00000001 00000001 q=00000001 fl=07
rxrpc_rx_ack: c=00000044 00000001 PNG r=00000000 f=00000002 p=00000000 n=0
rxrpc_rx_ack: c=00000044 00000002 REQ r=00000001 f=00000002 p=00000001 n=0
...
DATA packet 1 (q=xx) has REQUEST_ACK set (bit 1 of fl=xx). The incoming
ping (labelled PNG) hard-acks the request DATA packet (f=xx exceeds the
sequence number of the DATA packet), causing it to be discarded from the Tx
ring. The ACK that was requested (labelled REQ, r=xx references the serial
of the DATA packet) comes after the ping, but the sk_buff holding the
timestamp has gone and the RTT sample is lost.
This is particularly noticeable on RPC calls used to probe the service
offered by the peer. A lot of peers end up with an unknown RTT because we
only ever sent a single RPC. This confuses the server rotation algorithm.
Fix this by caching the information about the outgoing packet in RTT
calculations in the rxrpc_call struct rather than looking in the Tx ring.
A four-deep buffer is maintained and both REQUEST_ACK-flagged DATA and
PING-ACK transmissions are recorded in there. When the appropriate
response ACK is received, the buffer is checked for a match and, if found,
an RTT sample is recorded.
If a received ACK refers to a packet with a later serial number than an
entry in the cache, that entry is presumed lost and the entry is made
available to record a new transmission.
ACKs types other than REQUESTED-type and PING-type cause any matching
sample to be cancelled as they don't necessarily represent a useful
measurement.
If there's no space in the buffer on ping/data transmission, the sample
base is discarded.
Fixes: 50235c4b5a2f ("rxrpc: Obtain RTT data by requesting ACKs on DATA packets")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
If an sctp connection gets re-used, heartbeats are flagged as invalid
because their vtag doesn't match.
Handle this in a similar way as TCP conntrack when it suspects that the
endpoints and conntrack are out-of-sync.
When a HEARTBEAT request fails its vtag validation, flag this in the
conntrack state and accept the packet.
When a HEARTBEAT_ACK is received with an invalid vtag in the reverse
direction after we allowed such a HEARTBEAT through, assume we are
out-of-sync and re-set the vtag info.
v2: remove left-over snippet from an older incarnation that moved
new_state/old_state assignments, thats not needed so keep that
as-is.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The header file algapi.h includes skbuff.h unnecessarily since
all we need is a forward declaration for struct sk_buff. This
patch removes that inclusion.
Unfortunately skbuff.h pulls in a lot of things and drivers over
the years have come to rely on it so this patch adds a lot of
missing inclusions that result from this.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
In the scenario of writing sparse files, the per-inode prealloc list may
be very long, resulting in high overhead for ext4_mb_use_preallocated().
To circumvent this problem, we limit the maximum length of per-inode
prealloc list to 512 and allow users to modify it.
After patching, we observed that the sys ratio of cpu has dropped, and
the system throughput has increased significantly. We created a process
to write the sparse file, and the running time of the process on the
fixed kernel was significantly reduced, as follows:
Running time on unfixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m2.051s
user 0m0.008s
sys 0m2.026s
Running time on fixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m0.471s
user 0m0.004s
sys 0m0.395s
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch moves crypto_yield into internal.h as it's only used
by internal code such as skcipher. It also adds a missing inclusion
of sched.h which is required for cond_resched.
The header files in internal.h have been cleaned up to remove some
ancient junk and add some more specific inclusions.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
resctrl/core.c defines get_cache_id() for use in its cpu-hotplug
callbacks. This gets the id attribute of the cache at the corresponding
level of a CPU.
Later rework means this private function needs to be shared. Move
it to the header file.
The name conflicts with a different definition in intel_cacheinfo.c,
name it get_cpu_cacheinfo_id() to show its relation with
get_cpu_cacheinfo().
Now this is visible on other architectures, check the id attribute
has actually been set.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20200708163929.2783-11-james.morse@arm.com
|
|
There would be no point in preserving a sched_domain with a single group
just because it has this flag set. Add it to SD_DEGENERATE_GROUPS_MASK.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-17-valentin.schneider@arm.com
|
|
A sched_domain can only have overlapping sched_groups if it has more than
one group.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-16-valentin.schneider@arm.com
|
|
Being a load-balancing flag, it requires 2+ groups to have any effect.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-15-valentin.schneider@arm.com
|