Age | Commit message (Collapse) | Author |
|
When hclge_bind_ring_with_vector() fails,
hclge_map_unmap_ring_to_vf_vector() returns the error
directly, so nobody will free the memory allocated by
hclge_get_ring_chain_from_mbx().
So hclge_free_vector_ring_chain() should be called no matter
hclge_bind_ring_with_vector() fails or not.
Fixes: 84e095d64ed9 ("net: hns3: Change PF to add ring-vect binding & resetQ to mailbox")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
hns3_uninit_phy() should be called before checking
HNS3_NIC_STATE_INITED flags, otherwise when this checking fails,
there is nobody to call hns3_uninit_phy().
Fixes: c8a8045b2d0a ("net: hns3: Fix NULL deref when unloading driver")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When unloading driver, the reset task should not be scheduled
anymore. If disable IRQ before cancel ongoing reset task,
the IRQ may be re-enabled by the reset task.
This patch uses HCLGE_STATE_REMOVING/HCLGEVF_STATE_REMOVING
flag to indicate that the driver is unloading, and we should
stop new coming reset service to be scheduled, otherwise,
reset service will access some resource which has been freed
by unloading.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When reset happens, the hardware reset should begin after the
driver has finished its preparatory work, otherwise it may cause
some hardware error.
Before Hardware's reset, it will wait for the driver to write
bit HCLGE_NIC_CMQ_ENABLE of register HCLGE_NIC_CSQ_DEPTH_REG
to 1, while the driver finishes its preparatory work will do that.
BTW, since some cases this register will be cleared, so it needs
some sync time before driver's writing.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
hclgevf_init_client_instance() is a little bloated and there is
some duplicated code. This patch adds some cleanup for it.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
hclge_init_client_instance() is a little bloated and there is
some duplicated code. This patch adds some cleanup for it.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
registered
When VF NIC client's init_instance() succeeds, it means this client
has been registered successfully, so we use HCLGEVF_STATE_NIC_REGISTERED
to indicate that. And before calling VF NIC client's uninit_instance(),
we clear this state.
So any operation of VF NIC client from HCLGEVF is not allowed if this
state is not set.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
registered
When PF ROCE client's init_instance() succeeds, it means this client
has been registered successfully, so we use HCLGE_STATE_ROCE_REGISTERED
to indicate that. And before calling PF ROCE client's uninit_instance(),
we clear this state.
So any operation of the ROCE client from HCLGE is not allowed if this
state is not set.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
registered
When PF NIC client's init_instance() succeeds, it means this client
has been registered successfully, so we use HCLGE_STATE_NIC_REGISTERED
to indicate that. And before calling PF NIC client's uninit_instance(),
we clear this state.
So any operation of PF NIC client from HCLGE is not allowed if this
state is not set.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch prints firmware statistics information.
debugfs command:
echo dump m7 info > cmd
estuary:/dbg/hns3/0000:7d:00.0$ echo dump m7 info > cmd
[ 172.577240] hns3 0000:7d:00.0: 0x00000000 0x00000000 0x00000000
[ 172.583471] hns3 0000:7d:00.0: 0x00000000 0x00000000 0x00000000
[ 172.589552] hns3 0000:7d:00.0: 0x00000030 0x00000000 0x00000000
[ 172.595632] hns3 0000:7d:00.0: 0x00000000 0x00000000 0x00000000
estuary:/dbg/hns3/0000:7d:00.0$
Signed-off-by: Zhongzhu Liu <liuzhongzhu@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
According to hardware user menual, the GRO_SIZE is 14 bits width,
the HNS3_RXD_GRO_SIZE_M is 10 bits width now, which may cause
hardware GRO received packet error problem.
Fixes: a6d53b97a2e7 ("net: hns3: Adds GRO params to SKB for the stack")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The ifdef condition of function hclge_add_fd_entry_by_arfs() is
unnecessary. It may cause compile warning when CONFIG_RFS_ACCEL
is not chosen. This patch fixes it by removing the ifdef condition.
Fixes: d93ed94fbeaf ("net: hns3: add aRFS support for PF")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Both sysfs-bus-mdio and sysfs-class-net-phydev contain the same
duplication information. There is not currently any MDIO bus specific
attribute, but there are PHY device (struct phy_device) specific
attributes. Use the more precise description from sysfs-bus-mdio and
carry that over to sysfs-class-net-phydev.
Fixes: 86f22d04dfb5 ("net: sysfs: Document PHY device sysfs attributes")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If llc_mac_hdr_init() returns an error, we must drop the skb
since no llc_build_and_send_ui_pkt() caller will take care of this.
BUG: memory leak
unreferenced object 0xffff8881202b6800 (size 2048):
comm "syz-executor907", pid 7074, jiffies 4294943781 (age 8.590s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
1a 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
backtrace:
[<00000000e25b5abe>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
[<00000000e25b5abe>] slab_post_alloc_hook mm/slab.h:439 [inline]
[<00000000e25b5abe>] slab_alloc mm/slab.c:3326 [inline]
[<00000000e25b5abe>] __do_kmalloc mm/slab.c:3658 [inline]
[<00000000e25b5abe>] __kmalloc+0x161/0x2c0 mm/slab.c:3669
[<00000000a1ae188a>] kmalloc include/linux/slab.h:552 [inline]
[<00000000a1ae188a>] sk_prot_alloc+0xd6/0x170 net/core/sock.c:1608
[<00000000ded25bbe>] sk_alloc+0x35/0x2f0 net/core/sock.c:1662
[<000000002ecae075>] llc_sk_alloc+0x35/0x170 net/llc/llc_conn.c:950
[<00000000551f7c47>] llc_ui_create+0x7b/0x140 net/llc/af_llc.c:173
[<0000000029027f0e>] __sock_create+0x164/0x250 net/socket.c:1430
[<000000008bdec225>] sock_create net/socket.c:1481 [inline]
[<000000008bdec225>] __sys_socket+0x69/0x110 net/socket.c:1523
[<00000000b6439228>] __do_sys_socket net/socket.c:1532 [inline]
[<00000000b6439228>] __se_sys_socket net/socket.c:1530 [inline]
[<00000000b6439228>] __x64_sys_socket+0x1e/0x30 net/socket.c:1530
[<00000000cec820c1>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
[<000000000c32554f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
BUG: memory leak
unreferenced object 0xffff88811d750d00 (size 224):
comm "syz-executor907", pid 7074, jiffies 4294943781 (age 8.600s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 f0 0c 24 81 88 ff ff 00 68 2b 20 81 88 ff ff ...$.....h+ ....
backtrace:
[<0000000053026172>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
[<0000000053026172>] slab_post_alloc_hook mm/slab.h:439 [inline]
[<0000000053026172>] slab_alloc_node mm/slab.c:3269 [inline]
[<0000000053026172>] kmem_cache_alloc_node+0x153/0x2a0 mm/slab.c:3579
[<00000000fa8f3c30>] __alloc_skb+0x6e/0x210 net/core/skbuff.c:198
[<00000000d96fdafb>] alloc_skb include/linux/skbuff.h:1058 [inline]
[<00000000d96fdafb>] alloc_skb_with_frags+0x5f/0x250 net/core/skbuff.c:5327
[<000000000a34a2e7>] sock_alloc_send_pskb+0x269/0x2a0 net/core/sock.c:2225
[<00000000ee39999b>] sock_alloc_send_skb+0x32/0x40 net/core/sock.c:2242
[<00000000e034d810>] llc_ui_sendmsg+0x10a/0x540 net/llc/af_llc.c:933
[<00000000c0bc8445>] sock_sendmsg_nosec net/socket.c:652 [inline]
[<00000000c0bc8445>] sock_sendmsg+0x54/0x70 net/socket.c:671
[<000000003b687167>] __sys_sendto+0x148/0x1f0 net/socket.c:1964
[<00000000922d78d9>] __do_sys_sendto net/socket.c:1976 [inline]
[<00000000922d78d9>] __se_sys_sendto net/socket.c:1972 [inline]
[<00000000922d78d9>] __x64_sys_sendto+0x2a/0x30 net/socket.c:1972
[<00000000cec820c1>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
[<000000000c32554f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
in set_rx_mode, __dev_mc_sync and netdev_for_each_mc_addr will
repeatedly set the multicast mac address. so we delete this loop.
Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Eric Dumazet says:
====================
inet: frags: followup to 'inet-frags-avoid-possible-races-at-netns-dismantle'
Latest patch series ('inet-frags-avoid-possible-races-at-netns-dismantle')
brought another syzbot report shown in the third patch changelog.
While fixing the issue, I had to call inet_frags_fini() later
in IPv6 and ilowpan.
Also I believe a completion is needed to ensure proper dismantle
at module removal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As caught by syzbot [1], the rcu grace period that is respected
before fqdir_rwork_fn() proceeds and frees fqdir is not enough
to prevent inet_frag_destroy_rcu() being run after the freeing.
We need a proper rcu_barrier() synchronization to replace
the one we had in inet_frags_fini()
We also have to fix a potential problem at module removal :
inet_frags_fini() needs to make sure that all queued work queues
(fqdir_rwork_fn) have completed, otherwise we might
call kmem_cache_destroy() too soon and get another use-after-free.
[1]
BUG: KASAN: use-after-free in inet_frag_destroy_rcu+0xd9/0xe0 net/ipv4/inet_fragment.c:201
Read of size 8 at addr ffff88806ed47a18 by task swapper/1/0
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.2.0-rc1+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
__kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
kasan_report+0x12/0x20 mm/kasan/common.c:614
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
inet_frag_destroy_rcu+0xd9/0xe0 net/ipv4/inet_fragment.c:201
__rcu_reclaim kernel/rcu/rcu.h:222 [inline]
rcu_do_batch kernel/rcu/tree.c:2092 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2310 [inline]
rcu_core+0xba5/0x1500 kernel/rcu/tree.c:2291
__do_softirq+0x25c/0x94c kernel/softirq.c:293
invoke_softirq kernel/softirq.c:374 [inline]
irq_exit+0x180/0x1d0 kernel/softirq.c:414
exiting_irq arch/x86/include/asm/apic.h:536 [inline]
smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
</IRQ>
RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
Code: ff ff 48 89 df e8 f2 95 8c fa eb 82 e9 07 00 00 00 0f 00 2d e4 45 4b 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d d4 45 4b 00 fb f4 <c3> 90 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 8e 18 42 fa e8 99
RSP: 0018:ffff8880a98e7d78 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff1164e11 RBX: ffff8880a98d4340 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffff8880a98d4bbc
RBP: ffff8880a98e7da8 R08: ffff8880a98d4340 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff88b27078 R14: 0000000000000001 R15: 0000000000000000
arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:571
default_idle_call+0x36/0x90 kernel/sched/idle.c:94
cpuidle_idle_call kernel/sched/idle.c:154 [inline]
do_idle+0x377/0x560 kernel/sched/idle.c:263
cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:354
start_secondary+0x34e/0x4c0 arch/x86/kernel/smpboot.c:267
secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
Allocated by task 8877:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_kmalloc mm/kasan/common.c:489 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
kmem_cache_alloc_trace+0x151/0x750 mm/slab.c:3555
kmalloc include/linux/slab.h:547 [inline]
kzalloc include/linux/slab.h:742 [inline]
fqdir_init include/net/inet_frag.h:115 [inline]
ipv6_frags_init_net+0x48/0x460 net/ipv6/reassembly.c:513
ops_init+0xb3/0x410 net/core/net_namespace.c:130
setup_net+0x2d3/0x740 net/core/net_namespace.c:316
copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
ksys_unshare+0x440/0x980 kernel/fork.c:2692
__do_sys_unshare kernel/fork.c:2760 [inline]
__se_sys_unshare kernel/fork.c:2758 [inline]
__x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 17:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
__cache_free mm/slab.c:3432 [inline]
kfree+0xcf/0x220 mm/slab.c:3755
fqdir_rwork_fn+0x33/0x40 net/ipv4/inet_fragment.c:154
process_one_work+0x989/0x1790 kernel/workqueue.c:2269
worker_thread+0x98/0xe40 kernel/workqueue.c:2415
kthread+0x354/0x420 kernel/kthread.c:255
ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
The buggy address belongs to the object at ffff88806ed47a00
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 24 bytes inside of
512-byte region [ffff88806ed47a00, ffff88806ed47c00)
The buggy address belongs to the page:
page:ffffea0001bb51c0 refcount:1 mapcount:0 mapping:ffff8880aa400940 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea000282a788 ffffea0001bb53c8 ffff8880aa400940
raw: 0000000000000000 ffff88806ed47000 0000000100000006 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88806ed47900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88806ed47980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88806ed47a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88806ed47a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88806ed47b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 3c8fc8782044 ("inet: frags: rework rhashtable dismantle")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Both IPv6 and 6lowpan are calling inet_frags_fini() too soon.
inet_frags_fini() is dismantling a kmem_cache, that might be needed
later when unregister_pernet_subsys() eventually has to remove
frags queues from hash tables and free them.
This fixes potential use-after-free, and is a prereq for the following patch.
Fixes: d4ad4d22e7ac ("inet: frags: use kmem_cache for inet_frag_queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
fqdir_init() is not fast path and is getting bigger.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Test the IPv6 flowlabel control and datapath interfaces:
Acquire and release the right to use flowlabels with socket option
IPV6_FLOWLABEL_MGR.
Then configure flowlabels on send and read them on recv with cmsg
IPV6_FLOWINFO. Also verify auto-flowlabel if not explicitly set.
This helped identify the issue fixed in commit 95c169251bf73 ("ipv6:
invert flowlabel sharing check in process and user mode")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the pmtu_vti6_link_change_mtu test, both local and remote addresses
for the vti6 tunnel are assigned to the same address given to the dummy
interface that we use as encapsulating device with a known MTU.
This works as long as the dummy interface is actually selected, via
rt6_lookup(), as encapsulating device. But if the remote address of the
tunnel is a local address too, the loopback interface could also be
selected, and there's nothing wrong with it.
This is what some older -stable kernels do (3.18.z, at least), and
nothing prevents us from subtly changing FIB implementation to revert
back to that behaviour in the future.
Define an IPv6 prefix instead, and use two separate addresses as local
and remote for vti6, so that the encapsulating device can't be a
loopback interface.
Reported-by: Xiumei Mu <xmu@redhat.com>
Fixes: 1fad59ea1c34 ("selftests: pmtu: Add pmtu_vti6_link_change_mtu test")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add support to configure multiple prioritized TX traffic
classes with mqprio.
Configure one BD ring per TC for the moment, one netdev
queue per TC.
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Neil Armstrong says:
====================
net: stmmac: dwmac-meson: update with SPDX Licence identifier
Update the SPDX Licence identifier for the Amlogic Meson6 and Meson8 dwmac
glue drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The default flow control settings for the i225 device is both
'rx' and 'tx' pause frames. There is no depend on the NVM value.
This patch comes to fix this and clean up the driver code.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
This change adds flow control settings. This is required to
enable the legacy flow control support.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Driver does not want to keep packets in Tx queue when link is lost.
But present code only reset NIC to flush them, but does not prevent
queuing new packets. Moreover reset sequence itself could generate
new packets via netconsole and NIC falls into endless reset loop.
This patch wakes Tx queue only when NIC is ready to send packets.
This is proper fix for problem addressed by commit 0f9e980bf5ee
("e1000e: fix cyclic resets at link up with active tx").
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Tested-by: Joseph Yasi <joe.yasi@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
This reverts commit 0f9e980bf5ee1a97e2e401c846b2af989eb21c61.
That change cased false-positive warning about hardware hang:
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
TDH <0>
TDT <1>
next_to_use <1>
next_to_clean <0>
buffer_info[next_to_clean]:
time_stamp <fffba7a7>
next_to_watch <0>
jiffies <fffbb140>
next_to_watch.status <0>
MAC Status <40080080>
PHY Status <7949>
PHY 1000BASE-T Status <0>
PHY Extended Status <3000>
PCI Status <10>
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Besides warning everything works fine.
Original issue will be fixed property in following patch.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Joseph Yasi <joe.yasi@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=203175
Tested-by: Joseph Yasi <joe.yasi@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Enables a resend request after the completion timeout workaround is not
relevant for i225 device. This patch is clean code relevant this
workaround.
Minor cosmetic fixes, replace the 'spaces' with 'tabs'
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Few function pointers from phy_operations structure were unused.
This patch cleans those.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Collision threshold and threshold's shift has been defined twice.
This patch comes to fix that.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
This patch fixes the following warning:
drivers/net/ethernet/intel/igb/e1000_82575.c: In function ‘igb_get_invariants_82575’:
drivers/net/ethernet/intel/igb/e1000_82575.c:636:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (igb_sgmii_uses_mdio_82575(hw)) {
^
drivers/net/ethernet/intel/igb/e1000_82575.c:642:2: note: here
case E1000_CTRL_EXT_LINK_MODE_PCIE_SERDES:
^~~~
Warning level 3 was used: -Wimplicit-fallthrough=3
Notice that, in this particular case, the code comment is modified
in accordance with what GCC is expecting to find.
This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.
Signed-off-by: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
This patch fixes the following warning:
drivers/net/ethernet/intel/igb/igb_main.c: In function ‘__igb_notify_dca’:
drivers/net/ethernet/intel/igb/igb_main.c:6694:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (dca_add_requester(dev) == 0) {
^
drivers/net/ethernet/intel/igb/igb_main.c:6701:2: note: here
case DCA_PROVIDER_REMOVE:
^~~~
Warning level 3 was used: -Wimplicit-fallthrough=3
Notice that, in this particular case, the code comment is modified
in accordance with what GCC is expecting to find.
This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.
Signed-off-by: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Failed in read the HW register is very serious for igb/igc driver,
as its hw_addr will be set to NULL and cause the adapter be seen as
"REMOVED".
We saw the error only a few times in the MTBF test for suspend/resume,
but can hardly get any useful info to debug.
Adding WARN() so that we can get the necessary information about
where and how it happens, and use it for root causing and fixing
this "PCIe link lost issue"
This affects igb, igc.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Acked-by: Sasha Neftin <sasha.neftin@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
In case of errors, predicate_parse() goes to the out_free label
to free memory and to return an error code.
However, predicate_parse() does not free the predicates of the
temporary prog_stack array, thence leaking them.
Link: http://lkml.kernel.org/r/20190528154338.29976-1-tomasbortoli@gmail.com
Cc: stable@vger.kernel.org
Fixes: 80765597bc587 ("tracing: Rewrite filter logic to be simpler and faster")
Reported-by: syzbot+6b8e0fb820e570c59e19@syzkaller.appspotmail.com
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
[ Added protection around freeing prog_stack[i].pred ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
The keygen extracted fields are used as input for the hash that
determines the incoming frames distribution. Adding IPSEC SPI so
different IPSEC flows can be distributed to different CPUs.
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make sure only the portals for the online CPUs are used.
Without this change, there are issues when someone boots with
maxcpus=n, with n < actual number of cores available as frames
either received or corresponding to the transmit confirmation
path would be offered for dequeue to the offline CPU portals,
getting lost.
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make sure we don't use an out-of-bound index for the per-port RSS
context array.
As of today, the global context creation in mvpp22_rss_context_create
will prevent us from reaching this case, but we should still make sure
we are using a sane value anyway.
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix below issues in err code path of probe:
1. we don't need to unregister_netdev() because the netdev isn't
registered.
2. when register_netdev() fails, we also need to destroy bm pool for
HWBM case.
Fixes: dc35a10f68d3 ("net: mvneta: bm: add support for hardware buffer management")
Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If the subdriver defers probe, do not show an error message. It's
perfectly fine for this error to occur since the driver will get another
chance to probe after some time and will usually succeed after all of
the resources that it requires have been registered.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When syncing the log, the final phase of a fsync operation, we need to
either create a log root's item or update the existing item in the log
tree of log roots, and that depends on the current value of the log
root's log_transid - if it's 1 we need to create the log root item,
otherwise it must exist already and we update it. Since there is no
synchronization between updating the log_transid and checking it for
deciding whether the log root's item needs to be created or updated, we
end up with a tiny race window that results in attempts to update the
item to fail because the item was not yet created:
CPU 1 CPU 2
btrfs_sync_log()
lock root->log_mutex
set log root's log_transid to 1
unlock root->log_mutex
btrfs_sync_log()
lock root->log_mutex
sets log root's
log_transid to 2
unlock root->log_mutex
update_log_root()
sees log root's log_transid
with a value of 2
calls btrfs_update_root(),
which fails with -EUCLEAN
and causes transaction abort
Until recently the race lead to a BUG_ON at btrfs_update_root(), but after
the recent commit 7ac1e464c4d47 ("btrfs: Don't panic when we can't find a
root key") we just abort the current transaction.
A sample trace of the BUG_ON() on a SLE12 kernel:
------------[ cut here ]------------
kernel BUG at ../fs/btrfs/root-tree.c:157!
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=2048 NUMA pSeries
(...)
Supported: Yes, External
CPU: 78 PID: 76303 Comm: rtas_errd Tainted: G X 4.4.156-94.57-default #1
task: c00000ffa906d010 ti: c00000ff42b08000 task.ti: c00000ff42b08000
NIP: d000000036ae5cdc LR: d000000036ae5cd8 CTR: 0000000000000000
REGS: c00000ff42b0b860 TRAP: 0700 Tainted: G X (4.4.156-94.57-default)
MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 22444484 XER: 20000000
CFAR: d000000036aba66c SOFTE: 1
GPR00: d000000036ae5cd8 c00000ff42b0bae0 d000000036bda220 0000000000000054
GPR04: 0000000000000001 0000000000000000 c00007ffff8d37c8 0000000000000000
GPR08: c000000000e19c00 0000000000000000 0000000000000000 3736343438312079
GPR12: 3930373337303434 c000000007a3a800 00000000007fffff 0000000000000023
GPR16: c00000ffa9d26028 c00000ffa9d261f8 0000000000000010 c00000ffa9d2ab28
GPR20: c00000ff42b0bc48 0000000000000001 c00000ff9f0d9888 0000000000000001
GPR24: c00000ffa9d26000 c00000ffa9d261e8 c00000ffa9d2a800 c00000ff9f0d9888
GPR28: c00000ffa9d26028 c00000ffa9d2aa98 0000000000000001 c00000ffa98f5b20
NIP [d000000036ae5cdc] btrfs_update_root+0x25c/0x4e0 [btrfs]
LR [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs]
Call Trace:
[c00000ff42b0bae0] [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] (unreliable)
[c00000ff42b0bba0] [d000000036b53610] btrfs_sync_log+0x2d0/0xc60 [btrfs]
[c00000ff42b0bce0] [d000000036b1785c] btrfs_sync_file+0x44c/0x4e0 [btrfs]
[c00000ff42b0bd80] [c00000000032e300] vfs_fsync_range+0x70/0x120
[c00000ff42b0bdd0] [c00000000032e44c] do_fsync+0x5c/0xb0
[c00000ff42b0be10] [c00000000032e8dc] SyS_fdatasync+0x2c/0x40
[c00000ff42b0be30] [c000000000009488] system_call+0x3c/0x100
Instruction dump:
7f43d378 4bffebb9 60000000 88d90008 3d220000 e8b90000 3b390009 e87a01f0
e8898e08 e8f90000 4bfd48e5 60000000 <0fe00000> e95b0060 39200004 394a0ea0
---[ end trace 8f2dc8f919cabab8 ]---
So fix this by doing the check of log_transid and updating or creating the
log root's item while holding the root's log_mutex.
Fixes: 7237f1833601d ("Btrfs: fix tree logs parallel sync")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When replaying a log that contains a new file or directory name that needs
to be added to its parent directory, we end up updating the mtime and the
ctime of the parent directory to the current time after we have set their
values to the correct ones (set at fsync time), efectivelly losing them.
Sample reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir
$ touch /mnt/dir/file
# fsync of the directory is optional, not needed
$ xfs_io -c fsync /mnt/dir
$ xfs_io -c fsync /mnt/dir/file
$ stat -c %Y /mnt/dir
1557856079
<power failure>
$ sleep 3
$ mount /dev/sdb /mnt
$ stat -c %Y /mnt/dir
1557856082
--> should have been 1557856079, the mtime is updated to the current
time when replaying the log
Fix this by not updating the mtime and ctime to the current time at
btrfs_add_link() when we are replaying a log tree.
This could be triggered by my recent fsync fuzz tester for fstests, for
which an fstests patch exists titled "fstests: generic, fsync fuzz tester
with fsstress".
Fixes: e02119d5a7b43 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
While logging an inode we follow its ancestors and for each one we mark
it as logged in the current transaction, even if we have not logged it.
As a consequence if we change an attribute of an ancestor, such as the
UID or GID for example, and then explicitly fsync it, we end up not
logging the inode at all despite returning success to user space, which
results in the attribute being lost if a power failure happens after
the fsync.
Sample reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir
$ chown 6007:6007 /mnt/dir
$ sync
$ chown 9003:9003 /mnt/dir
$ touch /mnt/dir/file
$ xfs_io -c fsync /mnt/dir/file
# fsync our directory after fsync'ing the new file, should persist the
# new values for the uid and gid.
$ xfs_io -c fsync /mnt/dir
<power failure>
$ mount /dev/sdb /mnt
$ stat -c %u:%g /mnt/dir
6007:6007
--> should be 9003:9003, the uid and gid were not persisted, despite
the explicit fsync on the directory prior to the power failure
Fix this by not updating the logged_trans field of ancestor inodes when
logging an inode, since we have not logged them. Let only future calls to
btrfs_log_inode() to mark inodes as logged.
This could be triggered by my recent fsync fuzz tester for fstests, for
which an fstests patch exists titled "fstests: generic, fsync fuzz tester
with fsstress".
Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
dereference
[BUG]
When mounting a fs with reloc tree and has qgroup enabled, it can cause
NULL pointer dereference at mount time:
BUG: kernel NULL pointer dereference, address: 00000000000000a8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:btrfs_qgroup_add_swapped_blocks+0x186/0x300 [btrfs]
Call Trace:
replace_path.isra.23+0x685/0x900 [btrfs]
merge_reloc_root+0x26e/0x5f0 [btrfs]
merge_reloc_roots+0x10a/0x1a0 [btrfs]
btrfs_recover_relocation+0x3cd/0x420 [btrfs]
open_ctree+0x1bc8/0x1ed0 [btrfs]
btrfs_mount_root+0x544/0x680 [btrfs]
legacy_get_tree+0x34/0x60
vfs_get_tree+0x2d/0xf0
fc_mount+0x12/0x40
vfs_kern_mount.part.12+0x61/0xa0
vfs_kern_mount+0x13/0x20
btrfs_mount+0x16f/0x860 [btrfs]
legacy_get_tree+0x34/0x60
vfs_get_tree+0x2d/0xf0
do_mount+0x81f/0xac0
ksys_mount+0xbf/0xe0
__x64_sys_mount+0x25/0x30
do_syscall_64+0x65/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
[CAUSE]
In btrfs_recover_relocation(), we don't have enough info to determine
which block group we're relocating, but only to merge existing reloc
trees.
Thus in btrfs_recover_relocation(), rc->block_group is NULL.
btrfs_qgroup_add_swapped_blocks() hasn't taken this into consideration,
and causes a NULL pointer dereference.
The bug is introduced by commit 3d0174f78e72 ("btrfs: qgroup: Only trace
data extents in leaves if we're relocating data block group"), and
later qgroup refactoring still keeps this optimization.
[FIX]
Thankfully in the context of btrfs_recover_relocation(), there is no
other progress can modify tree blocks, thus those swapped tree blocks
pair will never affect qgroup numbers, no matter whatever we set for
block->trace_leaf.
So we only need to check if @bg is NULL before accessing @bg->flags.
Reported-by: Juan Erbes <jerbes@gmail.com>
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1134806
Fixes: 3d0174f78e72 ("btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group")
CC: stable@vger.kernel.org # 4.20+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When a fs has orphan reloc tree along with unfinished balance:
...
item 16 key (TREE_RELOC ROOT_ITEM FS_TREE) itemoff 12090 itemsize 439
generation 12 root_dirid 256 bytenr 300400640 level 1 refs 0 <<<
lastsnap 8 byte_limit 0 bytes_used 1359872 flags 0x0(none)
uuid 7c48d938-33a3-4aae-ab19-6e5c9d406e46
item 17 key (BALANCE TEMPORARY_ITEM 0) itemoff 11642 itemsize 448
temporary item objectid BALANCE offset 0
balance status flags 14
Then at mount time, we can hit the following kernel BUG_ON():
BTRFS info (device dm-3): relocating block group 298844160 flags metadata|dup
------------[ cut here ]------------
kernel BUG at fs/btrfs/relocation.c:1413!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 897 Comm: btrfs-balance Tainted: G O 5.2.0-rc1-custom #15
RIP: 0010:create_reloc_root+0x1eb/0x200 [btrfs]
Call Trace:
btrfs_init_reloc_root+0x96/0xb0 [btrfs]
record_root_in_trans+0xb2/0xe0 [btrfs]
btrfs_record_root_in_trans+0x55/0x70 [btrfs]
select_reloc_root+0x7e/0x230 [btrfs]
do_relocation+0xc4/0x620 [btrfs]
relocate_tree_blocks+0x592/0x6a0 [btrfs]
relocate_block_group+0x47b/0x5d0 [btrfs]
btrfs_relocate_block_group+0x183/0x2f0 [btrfs]
btrfs_relocate_chunk+0x4e/0xe0 [btrfs]
btrfs_balance+0x864/0xfa0 [btrfs]
balance_kthread+0x3b/0x50 [btrfs]
kthread+0x123/0x140
ret_from_fork+0x27/0x50
[CAUSE]
In btrfs, reloc trees are used to record swapped tree blocks during
balance.
Reloc tree either get merged (replace old tree blocks of its parent
subvolume) in next transaction if its ref is 1 (fresh).
Or is already merged and will be cleaned up if its ref is 0 (orphan).
After commit d2311e698578 ("btrfs: relocation: Delay reloc tree deletion
after merge_reloc_roots"), reloc tree cleanup is delayed until one block
group is balanced.
Since fresh reloc roots are recorded during merge, as long as there
is no power loss, those orphan reloc roots converted from fresh ones are
handled without problem.
However when power loss happens, orphan reloc roots can be recorded
on-disk, thus at next mount time, we will have orphan reloc roots from
on-disk data directly, and ignored by clean_dirty_subvols() routine.
Then when background balance starts to balance another block group, and
needs to create new reloc root for the same root, btrfs_insert_item()
returns -EEXIST, and trigger that BUG_ON().
[FIX]
For orphan reloc roots, also queue them to rc->dirty_subvol_roots, so
all reloc roots no matter orphan or not, can be cleaned up properly and
avoid above BUG_ON().
And to cooperate with above change, clean_dirty_subvols() will check if
the queued root is a reloc root or a subvol root.
For a subvol root, do the old work, and for a orphan reloc root, clean it
up.
Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.1
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When doing an incremental send we can now issue clone operations with a
source range that ends at the source's file eof and with a destination
range that ends at an offset smaller then the destination's file eof.
If the eof of the source file is not aligned to the sector size of the
filesystem, the receiver will get a -EINVAL error when trying to do the
operation or, on older kernels, silently corrupt the destination file.
The corruption happens on kernels without commit ac765f83f1397646
("Btrfs: fix data corruption due to cloning of eof block"), while the
failure to clone happens on kernels with that commit.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xb1 0 2M" /mnt/sdb/foo
$ xfs_io -f -c "pwrite -S 0xc7 0 2M" /mnt/sdb/bar
$ xfs_io -f -c "pwrite -S 0x4d 0 2M" /mnt/sdb/baz
$ xfs_io -f -c "pwrite -S 0xe2 0 2M" /mnt/sdb/zoo
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/base.send /mnt/sdb/base
$ xfs_io -c "reflink /mnt/sdb/bar 1560K 500K 100K" /mnt/sdb/bar
$ xfs_io -c "reflink /mnt/sdb/bar 1560K 0 100K" /mnt/sdb/zoo
$ xfs_io -c "truncate 550K" /mnt/sdb/bar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -f /tmp/incr.send -p /mnt/sdb/base /mnt/sdb/incr
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/base.send /mnt/sdc
$ btrfs receive -vv -f /tmp/incr.send /mnt/sdc
(...)
truncate bar size=563200
utimes bar
clone zoo - source=bar source offset=512000 offset=0 length=51200
ERROR: failed to clone extents to zoo
Invalid argument
The failure happens because the clone source range ends at the eof of file
bar, 563200, which is not aligned to the filesystems sector size (4Kb in
this case), and the destination range ends at offset 0 + 51200, which is
less then the size of the file zoo (2Mb).
So fix this by detecting such case and instead of issuing a clone
operation for the whole range, do a clone operation for smaller range
that is sector size aligned followed by a write operation for the block
containing the eof. Here we will always be pessimistic and assume the
destination filesystem of the send stream has the largest possible sector
size (64Kb), since we have no way of determining it.
This fixes a recent regression introduced in kernel 5.2-rc1.
Fixes: 040ee6120cb6706 ("Btrfs: send, improve clone range")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When using the no-holes feature, if we have a file with prealloc extents
with a start offset beyond the file's eof, doing an incremental send can
cause corruption of the file due to incorrect hole detection. Such case
requires that the prealloc extent(s) exist in both the parent and send
snapshots, and that a hole is punched into the file that covers all its
extents that do not cross the eof boundary.
Example reproducer:
$ mkfs.btrfs -f -O no-holes /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar
$ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/base.snap /mnt/sdb/base
$ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr
$ md5sum /mnt/sdb/incr/foobar
816df6f64deba63b029ca19d880ee10a /mnt/sdb/incr/foobar
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/base.snap /mnt/sdc
$ btrfs receive -f /tmp/incr.snap /mnt/sdc
$ md5sum /mnt/sdc/incr/foobar
cf2ef71f4a9e90c2f6013ba3b2257ed2 /mnt/sdc/incr/foobar
--> Different checksum, because the prealloc extent beyond the
file's eof confused the hole detection code and it assumed
a hole starting at offset 0 and ending at the offset of the
prealloc extent (1200Kb) instead of ending at the offset
500Kb (the file's size).
Fix this by ensuring we never cross the file's size when issuing the
write operations for a hole.
Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
CC: stable@vger.kernel.org # 3.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The btrfs zstd workspace manager uses a background timer to reclaim not
recently used workspaces. I used spin_lock() from this context which
should have been caught with lockdep, but was not. This deadlock was
reported in bugzilla. The fix is to switch the zstd wsm lock to use
spin_lock_bh() from the softirq context.
This happened quite relibably on ppc64, unlike on other architectures.
[ 313.402874] ================================
[ 313.402875] WARNING: inconsistent lock state
[ 313.402879] 5.1.0-rc7 #1 Not tainted
[ 313.402880] --------------------------------
[ 313.402882] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[ 313.402885] swapper/5/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
[ 313.402888] 0000000080d1120c (&(&wsm.lock)->rlock){+.?.}, at: .zstd_reclaim_timer_fn+0x40/0x230
[ 313.402895] {SOFTIRQ-ON-W} state was registered at:
[ 313.402899] .lock_acquire+0xd0/0x240
[ 313.402903] ._raw_spin_lock+0x34/0x60
[ 313.402906] .zstd_get_workspace+0xd0/0x360
[ 313.402908] .end_compressed_bio_read+0x3b8/0x540
[ 313.402911] .bio_endio+0x174/0x2c0
[ 313.402914] .end_workqueue_fn+0x4c/0x70
[ 313.402917] .normal_work_helper+0x138/0x7e0
[ 313.402920] .process_one_work+0x324/0x790
[ 313.402922] .worker_thread+0x68/0x570
[ 313.402925] .kthread+0x19c/0x1b0
[ 313.402928] .ret_from_kernel_thread+0x58/0x78
[ 313.402930] irq event stamp: 2629216
[ 313.402933] hardirqs last enabled at (2629216): [<c0000000009da738>] ._raw_spin_unlock_irq+0x38/0x60
[ 313.402936] hardirqs last disabled at (2629215): [<c0000000009da4c4>] ._raw_spin_lock_irq+0x24/0x70
[ 313.402939] softirqs last enabled at (2629212): [<c0000000000af9fc>] .irq_enter+0x8c/0xd0
[ 313.402942] softirqs last disabled at (2629213): [<c0000000000afb58>] .irq_exit+0x118/0x170
[ 313.402944]
other info that might help us debug this:
[ 313.402945] Possible unsafe locking scenario:
[ 313.402947] CPU0
[ 313.402948] ----
[ 313.402949] lock(&(&wsm.lock)->rlock);
[ 313.402951] <Interrupt>
[ 313.402952] lock(&(&wsm.lock)->rlock);
[ 313.402954]
*** DEADLOCK ***
[ 313.402957] 1 lock held by swapper/5/0:
[ 313.402958] #0: 000000004b612042 ((&wsm.timer)){+.-.}, at: .call_timer_fn+0x0/0x3c0
[ 313.402963]
stack backtrace:
[ 313.402967] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.1.0-rc7 #1
[ 313.402968] Call Trace:
[ 313.402972] [c0000007fa262e70] [c0000000009b3294] .dump_stack+0xe0/0x15c (unreliable)
[ 313.402975] [c0000007fa262f10] [c000000000125548] .print_usage_bug+0x348/0x390
[ 313.402978] [c0000007fa262fd0] [c000000000125cb4] .mark_lock+0x724/0x930
[ 313.402981] [c0000007fa263080] [c000000000126c20] .__lock_acquire+0xc90/0x16a0
[ 313.402984] [c0000007fa2631b0] [c000000000128040] .lock_acquire+0xd0/0x240
[ 313.402987] [c0000007fa263280] [c0000000009da2b4] ._raw_spin_lock+0x34/0x60
[ 313.402990] [c0000007fa263300] [c00000000054b0b0] .zstd_reclaim_timer_fn+0x40/0x230
[ 313.402993] [c0000007fa2633d0] [c000000000158b38] .call_timer_fn+0xc8/0x3c0
[ 313.402996] [c0000007fa2634a0] [c000000000158f74] .expire_timers+0x144/0x260
[ 313.402999] [c0000007fa263550] [c000000000159178] .run_timer_softirq+0xe8/0x230
[ 313.403002] [c0000007fa263680] [c0000000009db288] .__do_softirq+0x188/0x5d4
[ 313.403004] [c0000007fa263790] [c0000000000afb58] .irq_exit+0x118/0x170
[ 313.403008] [c0000007fa263800] [c000000000028d88] .timer_interrupt+0x158/0x430
[ 313.403012] [c0000007fa2638b0] [c0000000000091d4] decrementer_common+0x134/0x140
[ 313.403017] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
LR = .arch_local_irq_restore.part.0+0x68/0x80
[ 313.403020] [c0000007fa263bb0] [c00000000001a3ac] .arch_local_irq_restore.part.0+0x2c/0x80 (unreliable)
[ 313.403024] [c0000007fa263c30] [c0000000007bbbcc] .cpuidle_enter_state+0xec/0x670
[ 313.403027] [c0000007fa263d00] [c0000000000f5130] .call_cpuidle+0x40/0x90
[ 313.403031] [c0000007fa263d70] [c0000000000f554c] .do_idle+0x2dc/0x3a0
[ 313.403034] [c0000007fa263e30] [c0000000000f59ac] .cpu_startup_entry+0x2c/0x30
[ 313.403037] [c0000007fa263ea0] [c000000000045674] .start_secondary+0x644/0x650
[ 313.403041] [c0000007fa263f90] [c00000000000ad5c] start_secondary_prolog+0x10/0x14
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203517
Fixes: 3f93aef535c8 ("btrfs: add zstd compression level support")
CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Recent FITRIM work, namely bbbf7243d62d ("btrfs: combine device update
operations during transaction commit") combined the way certain
operations are recoded in a transaction. As a result an ASSERT was added
in dev_replace_finish to ensure the new code works correctly.
Unfortunately I got reports that it's possible to trigger the assert,
meaning that during a device replace it's possible to have an unfinished
chunk allocation on the source device.
This is supposed to be prevented by the fact that a transaction is
committed before finishing the replace oepration and alter acquiring the
chunk mutex. This is not sufficient since by the time the transaction is
committed and the chunk mutex acquired it's possible to allocate a chunk
depending on the workload being executed on the replaced device. This
bug has been present ever since device replace was introduced but there
was never code which checks for it.
The correct way to fix is to ensure that there is no pending device
modification operation when the chunk mutex is acquire and if there is
repeat transaction commit. Unfortunately it's not possible to just
exclude the source device from btrfs_fs_devices::dev_alloc_list since
this causes ENOSPC to be hit in transaction commit.
Fixing that in another way would need to add special cases to handle the
last writes and forbid new ones. The looped transaction fix is more
obvious, and can be easily backported. The runtime of dev-replace is
long so there's no noticeable delay caused by that.
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 391cd9df81ac ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|