Age | Commit message (Collapse) | Author |
|
In order to have a separate address space for text poking, we need to
duplicate init_mm early during start_kernel(). This, however, introduces
a problem since uprobes functions are called from dup_mmap(), but
uprobes is still not initialized in this early stage.
Since uprobes initialization is necassary for fork, and since all the
dependant initialization has been done when fork is initialized (percpu
and vmalloc), move uprobes initialization to fork_init(). It does not
seem uprobes introduces any security problem for the poking_mm.
Crash and burn if uprobes initialization fails, similarly to other early
initializations. Change the init_probes() name to probes_init() to match
other early initialization functions name convention.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: ard.biesheuvel@linaro.org
Cc: deneen.t.dock@intel.com
Cc: kernel-hardening@lists.openwall.com
Cc: kristen@linux.intel.com
Cc: linux_dti@icloud.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190426232303.28381-6-nadav.amit@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Prevent user watchpoints from mistakenly firing while the temporary mm
is being used. As the addresses of the temporary mm might overlap those
of the user-process, this is necessary to prevent wrong signals or worse
things from happening.
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-5-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Using a dedicated page-table for temporary PTEs prevents other cores
from using - even speculatively - these PTEs, thereby providing two
benefits:
(1) Security hardening: an attacker that gains kernel memory writing
abilities cannot easily overwrite sensitive data.
(2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in
remote page-tables.
To do so a temporary mm_struct can be used. Mappings which are private
for this mm can be set in the userspace part of the address-space.
During the whole time in which the temporary mm is loaded, interrupts
must be disabled.
The first use-case for temporary mm struct, which will follow, is for
poking the kernel text.
[ Commit message was written by Nadav Amit ]
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-4-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
There is no apparent reason not to use text_poke_early() during
early-init, since no patching of code that might be on the stack is done
and only a single core is running.
This is required for the next patches that would set a temporary mm for
text poking, and this mm is only initialized after some static-keys are
enabled/disabled.
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-3-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When using a temporary mm, bpf_probe_write_user() should not be able to
write to user memory, since user memory addresses may be used to map
kernel memory. Detect these cases and fail bpf_probe_write_user() in
such cases.
Suggested-by: Jann Horn <jannh@google.com>
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-24-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
x86 has an nmi_uaccess_okay(), but other architectures do not.
Arch-independent code might need to know whether access to user
addresses is ok in an NMI context or in other code whose execution
context is unknown. Specifically, this function is needed for
bpf_probe_write_user().
Add a default implementation of nmi_uaccess_okay() for architectures
that do not have such a function.
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-23-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
text_mutex is currently expected to be held before text_poke() is
called, but kgdb does not take the mutex, and instead *supposedly*
ensures the lock is not taken and will not be acquired by any other core
while text_poke() is running.
The reason for the "supposedly" comment is that it is not entirely clear
that this would be the case if gdb_do_roundup is zero.
Create two wrapper functions, text_poke() and text_poke_kgdb(), which do
or do not run the lockdep assertion respectively.
While we are at it, change the return code of text_poke() to something
meaningful. One day, callers might actually respect it and the existing
BUG_ON() when patching fails could be removed. For kgdb, the return
value can actually be used.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9222f606506c ("x86/alternatives: Lockdep-enforce text_mutex in text_poke*()")
Link: https://lkml.kernel.org/r/20190426001143.4983-2-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
With the introduction of the config option that allows to enable kuser
helpers, it is now possible to reduce TASK_SIZE_32 when these are
disabled and 64K pages are enabled. This extends the compliance with
the section 6.5.8 of the C standard (C99).
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
When executing clock_gettime(), either in the vDSO or via a system call,
we need to ensure that the read of the counter register occurs within
the seqlock reader critical section. This ensures that updates to the
clocksource parameters (e.g. the multiplier) are consistent with the
counter value and therefore avoids the situation where time appears to
go backwards across multiple reads.
Extend the vDSO logic so that the seqlock critical section covers the
read of the counter register as well as accesses to the data page. Since
reads of the counter system registers are not ordered by memory barrier
instructions, introduce dependency ordering from the counter read to a
subsequent memory access so that the seqlock memory barriers apply to
the counter access in both the vDSO and the system call paths.
Cc: <stable@vger.kernel.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lore.kernel.org/linux-arm-kernel/alpine.DEB.2.21.1902081950260.1662@nanos.tec.linutronix.de/
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
Currently the error return path from kobject_init_and_add() is not
followed by a call to kobject_put() - which means we are leaking the
kobject.
Fix it by adding a call to kobject_put() in the error path of
kobject_init_and_add().
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Tobin C. Harding <tobin@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm into pm-cpufreq
Pull ARM cpufreq drivers changes for v5.2 from Viresh Kumar:
"This pull request contains:
- Fix for possible object reference leak for few drivers (Wen Yang).
- Fix for armada frequency calculation (Gregory).
- Code cleanup in maple driver (Viresh).
This contains some non-ARM bits as well this time as the patches were
picked up from a series."
* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
cpufreq: armada-37xx: fix frequency calculation for opp
cpufreq: maple: Remove redundant code from maple_cpufreq_init()
cpufreq: ppc_cbe: fix possible object reference leak
cpufreq: pmac32: fix possible object reference leak
cpufreq/pasemi: fix possible object reference leak
cpufreq: maple: fix possible object reference leak
cpufreq: kirkwood: fix possible object reference leak
cpufreq: imx6q: fix possible object reference leak
cpufreq: ap806: fix possible object reference leak
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm into pm-opp
Pull operating performance points (OPP) framework changes for v5.2
from Viresh Kumar:
"This pull request contains:
- New helper in OPP core to find best matching frequency for a voltage
value."
* 'opp/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
OPP: Introduce dev_pm_opp_find_freq_ceil_by_volt()
|
|
Currently the error return path from kobject_init_and_add() is not
followed by a call to kobject_put() - which means we are leaking
the kobject.
Fix it by adding a call to kobject_put() in the error path of
kobject_init_and_add().
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20190430001144.24890-1-tobin@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
syzbot was able to catch a use-after-free read in pid_nr_ns() [1]
ip6fl_seq_show() seems to use RCU protection, dereferencing fl->owner.pid
but fl_free() releases fl->owner.pid before rcu grace period is started.
[1]
BUG: KASAN: use-after-free in pid_nr_ns+0x128/0x140 kernel/pid.c:407
Read of size 4 at addr ffff888094012a04 by task syz-executor.0/18087
CPU: 0 PID: 18087 Comm: syz-executor.0 Not tainted 5.1.0-rc6+ #89
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
__asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:131
pid_nr_ns+0x128/0x140 kernel/pid.c:407
ip6fl_seq_show+0x2f8/0x4f0 net/ipv6/ip6_flowlabel.c:794
seq_read+0xad3/0x1130 fs/seq_file.c:268
proc_reg_read+0x1fe/0x2c0 fs/proc/inode.c:227
do_loop_readv_writev fs/read_write.c:701 [inline]
do_loop_readv_writev fs/read_write.c:688 [inline]
do_iter_read+0x4a9/0x660 fs/read_write.c:922
vfs_readv+0xf0/0x160 fs/read_write.c:984
kernel_readv fs/splice.c:358 [inline]
default_file_splice_read+0x475/0x890 fs/splice.c:413
do_splice_to+0x12a/0x190 fs/splice.c:876
splice_direct_to_actor+0x2d2/0x970 fs/splice.c:953
do_splice_direct+0x1da/0x2a0 fs/splice.c:1062
do_sendfile+0x597/0xd00 fs/read_write.c:1443
__do_sys_sendfile64 fs/read_write.c:1498 [inline]
__se_sys_sendfile64 fs/read_write.c:1490 [inline]
__x64_sys_sendfile64+0x15a/0x220 fs/read_write.c:1490
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x458da9
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f300d24bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 0000000000458da9
RDX: 00000000200000c0 RSI: 0000000000000008 RDI: 0000000000000007
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000005a R11: 0000000000000246 R12: 00007f300d24c6d4
R13: 00000000004c5fa3 R14: 00000000004da748 R15: 00000000ffffffff
Allocated by task 17543:
save_stack+0x45/0xd0 mm/kasan/common.c:75
set_track mm/kasan/common.c:87 [inline]
__kasan_kmalloc mm/kasan/common.c:497 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470
kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:505
slab_post_alloc_hook mm/slab.h:437 [inline]
slab_alloc mm/slab.c:3393 [inline]
kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3555
alloc_pid+0x55/0x8f0 kernel/pid.c:168
copy_process.part.0+0x3b08/0x7980 kernel/fork.c:1932
copy_process kernel/fork.c:1709 [inline]
_do_fork+0x257/0xfd0 kernel/fork.c:2226
__do_sys_clone kernel/fork.c:2333 [inline]
__se_sys_clone kernel/fork.c:2327 [inline]
__x64_sys_clone+0xbf/0x150 kernel/fork.c:2327
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 7789:
save_stack+0x45/0xd0 mm/kasan/common.c:75
set_track mm/kasan/common.c:87 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:459
kasan_slab_free+0xe/0x10 mm/kasan/common.c:467
__cache_free mm/slab.c:3499 [inline]
kmem_cache_free+0x86/0x260 mm/slab.c:3765
put_pid.part.0+0x111/0x150 kernel/pid.c:111
put_pid+0x20/0x30 kernel/pid.c:105
fl_free+0xbe/0xe0 net/ipv6/ip6_flowlabel.c:102
ip6_fl_gc+0x295/0x3e0 net/ipv6/ip6_flowlabel.c:152
call_timer_fn+0x190/0x720 kernel/time/timer.c:1325
expire_timers kernel/time/timer.c:1362 [inline]
__run_timers kernel/time/timer.c:1681 [inline]
__run_timers kernel/time/timer.c:1649 [inline]
run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694
__do_softirq+0x266/0x95a kernel/softirq.c:293
The buggy address belongs to the object at ffff888094012a00
which belongs to the cache pid_2 of size 88
The buggy address is located 4 bytes inside of
88-byte region [ffff888094012a00, ffff888094012a58)
The buggy address belongs to the page:
page:ffffea0002500480 count:1 mapcount:0 mapping:ffff88809a483080 index:0xffff888094012980
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea00018a3508 ffffea0002524a88 ffff88809a483080
raw: ffff888094012980 ffff888094012000 000000010000001b 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888094012900: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
ffff888094012980: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
>ffff888094012a00: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
^
ffff888094012a80: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
ffff888094012b00: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
Fixes: 4f82f45730c6 ("net ip6 flowlabel: Make owner a union of struct pid * and kuid_t")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
sending dest unreach
When there is no route to an IPv6 dest addr, skb_dst(skb) points
to loopback dev in the case of that the IP6CB(skb)->iif is
enslaved to a vrf. This causes Ip6InNoRoutes to be incremented on the
loopback dev. This also causes the lookup to fail on icmpv6_send() and
the dest unreachable to not sent and Ip6OutNoRoutes gets incremented on
the loopback dev.
To reproduce:
* Gateway configuration:
ip link add dev vrf_258 type vrf table 258
ip link set dev enp0s9 master vrf_258
ip addr add 66:1/64 dev enp0s9
ip -6 route add unreachable default metric 8192 table 258
sysctl -w net.ipv6.conf.all.forwarding=1
sysctl -w net.ipv6.conf.enp0s9.forwarding=1
* Sender configuration:
ip addr add 66::2/64 dev enp0s9
ip -6 route add default via 66::1
and ping 67::1 for example from the sender.
Fix this by counting on the original netdev and reset the skb dst to
force a fresh lookup.
v2: Fix typo of destination address in the repro steps.
v3: Simplify the loopback check (per David Ahern) and use reverse
Christmas tree format (per David Miller).
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Richard and Bruno both reported that my commit added a bug,
and Bruno was able to determine the problem came when a segment
wih a FIN packet was coalesced to a prior one in tcp backlog queue.
It turns out the header prediction in tcp_rcv_established()
looks back to TCP headers in the packet, not in the metadata
(aka TCP_SKB_CB(skb)->tcp_flags)
The fast path in tcp_rcv_established() is not supposed to
handle a FIN flag (it does not call tcp_fin())
Therefore we need to make sure to propagate the FIN flag,
so that the coalesced packet does not go through the fast path,
the same than a GRO packet carrying a FIN flag.
While we are at it, make sure we do not coalesce packets with
RST or SYN, or if they do not have ACK set.
Many thanks to Richard and Bruno for pinpointing the bad commit,
and to Richard for providing a first version of the fix.
Fixes: 4f693b55c3d2 ("tcp: implement coalescing on backlog queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Reported-by: Bruno Prémont <bonbons@sysophe.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A request for a flowlabel fails in process or user exclusive mode must
fail if the caller pid or uid does not match. Invert the test.
Previously, the test was unsafe wrt PID recycling, but indeed tested
for inequality: fl1->owner != fl->owner
Fixes: 4f82f45730c68 ("net ip6 flowlabel: Make owner a union of struct pid* and kuid_t")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan
Stefan Schmidt says:
====================
ieee802154 for net 2019-04-25
An update from ieee802154 for your *net* tree.
Another fix from Kangjie Lu to ensure better checking regmap updates in the
mcr20a driver. Nothing else I have pending for the final release.
If there are any problems let me know.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp fixes from Kees Cook:
"Syzbot found a use-after-free bug in seccomp due to flags that should
not be allowed to be used together.
Tycho fixed this, I updated the self-tests, and the syzkaller PoC has
been running for several days without triggering KASan (before this
fix, it would reproduce). These patches have also been in -next for
almost a week, just to be sure.
- Add logic for making some seccomp flags exclusive (Tycho)
- Update selftests for exclusivity testing (Kees)"
* tag 'seccomp-v5.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
seccomp: Make NEW_LISTENER and TSYNC flags exclusive
selftests/seccomp: Prepare for exclusive seccomp flags
|
|
name(pdev->name)"
This reverts commit edb16da34b084c66763f29bee42b4e6bb33c3d66 as it
breaks existing systems as reported by Krzysztof.
Reported-by: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Venkata Narendra Kumar Gutta <vnkgutta@codeaurora.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
When diagnosing a slowdown of generic/224 I noticed we were not doing
anything when calling into shrink_delalloc(). This is because all
writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes
counter is 0, which short circuits most of the work inside of
shrink_delalloc(). However O_DIRECT writes still consume metadata
resources and generate ordered extents, which we can still wait on.
Fix this by tracking outstanding DIO write bytes, and use this as well
as the delalloc bytes counter to decide if we need to lookup and wait on
any ordered extents. If we have more DIO writes than delalloc bytes
we'll go ahead and wait on any ordered extents regardless of our flush
state as flushing delalloc is likely to not gain us anything.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ use dio instead of odirect in identifiers ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since now the trans argument is never NULL in btrfs_set_prop we don't
have to check. So delete it and use btrfs_setxattr that makes use of
that.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The last consumer of btrfs_set_prop_trans() was taken away by the patch
("btrfs: start transaction in xattr_handler_set_prop") so now this
function can be deleted.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs specific extended attributes on the inode are set using
btrfs_xattr_handler_set_prop(), and the required transaction for this
update is started by btrfs_setxattr(). For better visibility of the
transaction start and end, do this in btrfs_xattr_handler_set_prop().
For which this patch copied code of btrfs_setxattr() as it is in the
original, which needs proper error handling.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There isn't real use of making struct inode::i_mode a local copy, it
saves a dereference one time, not much. Just use it directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_inode_flags_to_fsflags() is copied into @old_fsflags and used only
once. Instead used it directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of updating the binode::flags directly, update a local copy, and
then at the point of no error, store copy it to the binode::flags.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The patch ("btrfs: start transaction in btrfs_ioctl_setflags()") used
btrfs_set_prop() instead of btrfs_set_prop_trans() by which now the
inode::i_flags update functions such as
btrfs_sync_inode_flags_to_i_flags() and btrfs_update_inode() is called
in btrfs_ioctl_setflags() instead of
btrfs_set_prop_trans()->btrfs_setxattr() as earlier. So the
inode::i_flags remains unmodified until the thread has checked all the
conditions. So drop the saved inode::i_flags in out_i_flags.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Inode attribute can be set through the FS_IOC_SETFLAGS ioctl. This
flags also includes compression attribute for which we would set/reset
the compression extended attribute. While doing this there is a bit of
duplicate code, the following things happens twice:
- start/end_transaction
- inode_inc_iversion()
- current_time update to inode->i_ctime
- and btrfs_update_inode()
These are updated both at btrfs_ioctl_setflags() and btrfs_set_props()
as well. This patch merges these two duplicate codes at
btrfs_ioctl_setflags().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Make btrfs_set_prop() a non-static function, so that it can be called
from btrfs_ioctl_setflags(). We need btrfs_set_prop() instead of
btrfs_set_prop_trans() so that we can use the transaction which is
already started in the current thread.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In preparation to merge multiple transactions when setting the
compression flags, split btrfs_set_props() validation part outside of
it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have a long standing problem with reversed keys that's detected by
btrfs_set_item_key_safe. This is hard to reproduce so we'd like to
capture more information for later analysis.
Let's dump the leaf content before triggering BUG_ON() so that we can
have some clue on what's going wrong. The output of tree locks should
help us to debug such problem.
Sample stacktrace:
generic/522 [00:07:05]
[26946.113381] run fstests generic/522 at 2019-04-16 00:07:05
[27161.474720] kernel BUG at fs/btrfs/ctree.c:3192!
[27161.475923] invalid opcode: 0000 [#1] PREEMPT SMP
[27161.477167] CPU: 0 PID: 15676 Comm: fsx Tainted: G W 5.1.0-rc5-default+ #562
[27161.478932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
[27161.481099] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs]
[27161.485369] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286
[27161.486464] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000
[27161.487929] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7
[27161.489289] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000
[27161.490807] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70
[27161.492305] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000
[27161.493845] FS: 00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000
[27161.495608] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[27161.496717] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0
[27161.498100] Call Trace:
[27161.498771] __btrfs_drop_extents+0x6ec/0xdf0 [btrfs]
[27161.499872] btrfs_log_changed_extents.isra.26+0x3a2/0x9e0 [btrfs]
[27161.501114] btrfs_log_inode+0x7ff/0xdc0 [btrfs]
[27161.502114] ? __mutex_unlock_slowpath+0x4b/0x2b0
[27161.503172] btrfs_log_inode_parent+0x237/0x9c0 [btrfs]
[27161.504348] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[27161.505374] btrfs_sync_file+0x1b7/0x480 [btrfs]
[27161.506371] __x64_sys_msync+0x180/0x210
[27161.507208] do_syscall_64+0x54/0x180
[27161.507932] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[27161.508839] RIP: 0033:0x7f8ea5aa9c61
[27161.512616] RSP: 002b:00007ffea2a06498 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
[27161.514161] RAX: ffffffffffffffda RBX: 000000000002a938 RCX: 00007f8ea5aa9c61
[27161.515376] RDX: 0000000000000004 RSI: 000000000001c9b2 RDI: 00007f8ea578d000
[27161.516572] RBP: 000000000001c07a R08: fffffffffffffff8 R09: 000000000002a000
[27161.517883] R10: 00007f8ea57a99b2 R11: 0000000000000246 R12: 0000000000000938
[27161.519080] R13: 00007f8ea578d000 R14: 000000000001c9b2 R15: 0000000000000000
[27161.520281] Modules linked in: btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq loop [last unloaded: scsi_debug]
[27161.522272] ---[ end trace d5afec7ccac6a252 ]---
[27161.523111] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs]
[27161.527253] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286
[27161.528192] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000
[27161.529392] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7
[27161.530607] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000
[27161.531802] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70
[27161.533018] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000
[27161.534405] FS: 00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000
[27161.536048] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[27161.537210] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Allowing error injection for btrfs_check_leaf_full() and
btrfs_check_node() is useful to test the failure path of btrfs write
time tree check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit 41bd6067692382 ("Btrfs: fix fsync of files with multiple hard links
in new directories") introduced a path that makes fsync fallback to a full
transaction commit in order to avoid losing hard links and new ancestors
of the fsynced inode. That path is triggered only when the inode has more
than one hard link and either has a new hard link created in the current
transaction or the inode was evicted and reloaded in the current
transaction.
That path ends up getting triggered very often (hundreds of times) during
the course of pgbench benchmarks, resulting in performance drops of about
20%.
This change restores the performance by not triggering the full transaction
commit in those cases, and instead iterate the fs/subvolume tree in search
of all possible new ancestors, for all hard links, to log them.
Reported-by: Zhao Yuhu <zyuhu@suse.com>
Tested-by: James Wang <jnwang@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expection is
due to two different reasons:
1) When it was introduced, no operations were allowed to modifiy read-only
subvolumes/snapshots (including defrag for example).
2) It keeps send from having an impact on other filesystem operations.
Namely send does not need to keep locks on the trees nor needs to hold on
to transaction handles and delay transaction commits. This ends up being
a consequence of the former reason.
However the deduplication feature was introduced later (on September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).
That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currenly being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].
The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl return -EIO with the following messages in dmesg/syslog:
[1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
[1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:
[1631742.130080] ------------[ cut here ]------------
[1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
[1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
[1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
[1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
[1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
[1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
[1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
[1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
[1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
[1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
[1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1631742.141272] Call Trace:
[1631742.141826] ? do_raw_spin_unlock+0x49/0xc0
[1631742.142390] tree_advance+0x173/0x1d0 [btrfs]
[1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs]
[1631742.143533] ? process_extent+0x1070/0x1070 [btrfs]
[1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0
[1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[1631742.146179] ? account_entity_enqueue+0xd3/0x100
[1631742.146662] ? reweight_entity+0x154/0x1a0
[1631742.147135] ? update_curr+0x20/0x2a0
[1631742.147593] ? check_preempt_wakeup+0x103/0x250
[1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0
[1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[1631742.148942] do_vfs_ioctl+0xa2/0x6f0
[1631742.149361] ? __fget+0x113/0x200
[1631742.149767] ksys_ioctl+0x70/0x80
[1631742.150159] __x64_sys_ioctl+0x16/0x20
[1631742.150543] do_syscall_64+0x60/0x1b0
[1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[1631742.151326] RIP: 0033:0x7f4cd9f5add7
(...)
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
[1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
[1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
[1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
[1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
(...)
[1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).
Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an exlicit warning in dmesg/syslog.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When we set a subvolume to read-only mode we do not flush dellaloc for any
of its inodes (except if the filesystem is mounted with -o flushoncommit),
since it does not affect correctness for any subsequent operations - except
for a future send operation. The send operation will not be able to see the
delalloc data since the respective file extent items, inode item updates,
backreferences, etc, have not hit yet the subvolume and extent trees.
Effectively this means data loss, since the send stream will not contain
any data from existing delalloc. Another problem from this is that if the
writeback starts and finishes while the send operation is in progress, we
have the subvolume tree being being modified concurrently which can result
in send failing unexpectedly with EIO or hitting runtime errors, assertion
failures or hitting BUG_ONs, etc.
Simple reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ btrfs subvolume create /mnt/sv
$ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo
$ btrfs property set /mnt/sv ro true
$ btrfs send -f /tmp/send.stream /mnt/sv
$ od -t x1 -A d /mnt/sv/foo
0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
*
0110592
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive -f /tmp/send.stream /mnt
$ echo $?
0
$ od -t x1 -A d /mnt/sv/foo
0000000
# ---> empty file
Since this a problem that affects send only, fix it in send by flushing
dellaloc for all the roots used by the send operation before send starts
to process the commit roots.
This is a problem that affects send since it was introduced (commit
31db9f7c23fbf7 ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"))
but backporting it to older kernels has some dependencies:
- For kernels between 3.19 and 4.20, it depends on commit 3cd24c698004d2
("btrfs: use tagged writepage to mitigate livelock of snapshot") because
the function btrfs_start_delalloc_snapshot() does not exist before that
commit. So one has to either pick that commit or replace the calls to
btrfs_start_delalloc_snapshot() in this patch with calls to
btrfs_start_delalloc_inodes().
- For kernels older than 3.19 it also requires commit e5fa8f865b3324
("Btrfs: ensure send always works on roots without orphans") because
it depends on the function ensure_commit_roots_uptodate() which that
commits introduced.
- No dependencies for 5.0+ kernels.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During fiemap, for regular extents (non inline) we need to check if they
are shared and if they are, set the shared bit. Checking if an extent is
shared requires checking the delayed references of the currently running
transaction, since some reference might have not yet hit the extent tree
and be only in the in-memory delayed references.
However we were using a transaction join for this, which creates a new
transaction when there is no transaction currently running. That means
that two more potential failures can happen: creating the transaction and
committing it. Further, if no write activity is currently happening in the
system, and fiemap calls keep being done, we end up creating and
committing transactions that do nothing.
In some extreme cases this can result in the commit of the transaction
created by fiemap to fail with ENOSPC when updating the root item of a
subvolume tree because a join does not reserve any space, leading to a
trace like the following:
heisenberg kernel: ------------[ cut here ]------------
heisenberg kernel: BTRFS: Transaction aborted (error -28)
heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
(...)
heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
(...)
heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
heisenberg kernel: FS: 0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
heisenberg kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
heisenberg kernel: Call Trace:
heisenberg kernel: commit_fs_roots+0x166/0x1d0 [btrfs]
heisenberg kernel: ? _cond_resched+0x15/0x30
heisenberg kernel: ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
heisenberg kernel: btrfs_commit_transaction+0x2bd/0x870 [btrfs]
heisenberg kernel: ? start_transaction+0x9d/0x3f0 [btrfs]
heisenberg kernel: transaction_kthread+0x147/0x180 [btrfs]
heisenberg kernel: ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
heisenberg kernel: kthread+0x112/0x130
heisenberg kernel: ? kthread_bind+0x30/0x30
heisenberg kernel: ret_from_fork+0x35/0x40
heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
Since fiemap (and btrfs_check_shared()) is a read-only operation, do not do
a transaction join to avoid the overhead of creating a new transaction (if
there is currently no running transaction) and introducing a potential
point of failure when the new transaction gets committed, instead use a
transaction attach to grab a handle for the currently running transaction
if any.
Reported-by: Christoph Anton Mitterer <calestyo@scientia.net>
Link: https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/
Fixes: afce772e87c36c ("btrfs: fix check_shared for fiemap ioctl")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since reloc tree doesn't contribute to qgroup numbers, just skip them.
This should catch the final cause of unnecessary data ref processing
when running balance of metadata with qgroups on.
The 4G data 16 snapshots test (*) should explain it pretty well:
| delayed subtree | refactor delayed ref | this patch
---------------------------------------------------------------------
relocated | 22653 | 22673 | 22744
qgroup dirty | 122792 | 48360 | 70
time | 24.494 | 11.606 | 3.944
Finally, we're at the stage where qgroup + metadata balance cost no
obvious overhead.
Test environment:
Test VM:
- vRAM 8G
- vCPU 8
- block dev vitrio-blk, 'unsafe' cache mode
- host block 850evo
Test workload:
- Copy 4G data from /usr/ to one subvolume
- Create 16 snapshots of that subvolume, and modify 3 files in each
snapshot
- Enable quota, rescan
- Time "btrfs balance start -m"
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long
parameter list and the confusing @owner parameter.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|