Age | Commit message (Collapse) | Author |
|
This patch sync include/uapi/linux/bpf.h to
tools/include/uapi/linux/
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
This patch refactors the ARRAY_SIZE macro to bpf_util.h.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
the earlier patch. "bpf_run_sk_reuseport()" will return -ECONNREFUSED
when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
handle this case and return NULL immediately instead of continuing the
sk search from its hashtable.
It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
BPF_PROG_TYPE_SK_REUSEPORT. The "sk_reuseport_attach_bpf()" will check
if the attaching bpf prog is in the new SK_REUSEPORT or the existing
SOCKET_FILTER type and then check different things accordingly.
One level of "__reuseport_attach_prog()" call is removed. The
"sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
back to "reuseport_attach_prog()" in sock_reuseport.c. sock_reuseport.c
seems to have more knowledge on those test requirements than filter.c.
In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
the old_prog (if any) is also directly freed instead of returning the
old_prog to the caller and asking the caller to free.
The sysctl_optmem_max check is moved back to the
"sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
As of other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
bounded by the usual "bpf_prog_charge_memlock()" during load time
instead of bounded by both bpf_prog_charge_memlock and sysctl_optmem_max.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other
non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
to store the bpf context instead of using the skb->cb[48].
At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp). At this
point, it is not always clear where the bpf context can be appended
in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting
aside the difference between ipv4-vs-ipv6 and udp-vs-tcp. It is not
clear if the lower layer is only ipv4 and ipv6 in the future and
will it not touch the cb[] again before transiting to the upper
layer.
For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
instead of IP[6]CB and it may still modify the cb[] after calling
the udp[46]_lib_lookup_skb(). Because of the above reason, if
sk->cb is used for the bpf ctx, saving-and-restoring is needed
and likely the whole 48 bytes cb[] has to be saved and restored.
Instead of saving, setting and restoring the cb[], this patch opts
to create a new "struct sk_reuseport_kern" and setting the needed
values in there.
The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol
specific usage at this point and it is also inline with the current
sock_reuseport.c implementation (i.e. no protocol specific requirement).
In "struct sk_reuseport_md", this patch exposes data/data_end/len
with semantic similar to other existing usages. Together
with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
the bpf prog can peek anywhere in the skb. The "bind_inany" tells
the bpf prog that the reuseport group is bind-ed to a local
INANY address which cannot be learned from skb.
The new "bind_inany" is added to "struct sock_reuseport" which will be
used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
to avoid repeating the "bind INANY" test on
"sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can
only be properly initialized when a "sk->sk_reuseport" enabled sk is
adding to a hashtable (i.e. during "reuseport_alloc()" and
"reuseport_add_sock()").
The new "sk_select_reuseport()" is the main helper that the
bpf prog will use to select a SO_REUSEPORT sk. It is the only function
that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in
the earlier patch, the validity of a selected sk is checked in
run time in "sk_select_reuseport()". Doing the check in
verification time is difficult and inflexible (consider the map-in-map
use case). The runtime check is to compare the selected sk's reuseport_id
with the reuseport_id that we want. This helper will return -EXXX if the
selected sk cannot serve the incoming request (e.g. reuseport_id
not match). The bpf prog can decide if it wants to do SK_DROP as its
discretion.
When the bpf prog returns SK_PASS, the kernel will check if a
valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
If it does , it will use the selected sk. If not, the kernel
will select one from "reuse->socks[]" (as before this patch).
The SK_DROP and SK_PASS handling logic will be in the next patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision. In this case, decide which
SO_REUSEPORT sk to serve the incoming request.
By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map. That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).
For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map. The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information. This connection info contact/API stays within the
userspace.
Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process. The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds. [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]
During map_update_elem(),
Only SO_REUSEPORT sk (i.e. which has already been added
to a reuse->socks[]) can be used. That means a SO_REUSEPORT sk that is
"bind()" for UDP or "bind()+listen()" for TCP. These conditions are
ensured in "reuseport_array_update_check()".
A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map). SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage in the BPF side.
When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also. It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
The map_update()/map_delete() has to be in-sync with the
"reuse->socks[]". Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also. Care has
been taken to ensure the lock is only acquired when the
adding sk passes some strict tests. and
freeing the map does not require the reuseport_lock.
The reuseport_array will also support lookup from the syscall
side. It will return a sock_gen_cookie(). The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).
The lookup cookie is 64bits but it goes against the logical userspace
expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
It may catch user in surprise if we enforce value_size=8 while
userspace still pass a 32bits fd during update. Supporting different
value_size between lookup and update seems unintuitive also.
We also need to consider what if other existing fd based maps want
to return 64bits value from syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, and
assuming user will usually use value_size=4. The syscall's lookup
will return ENOSPC on value_size=4. It will will only
return 64bits value from sock_gen_cookie() when user consciously
choose value_size=8 (as a signal that lookup is desired) which then
requires a 64bits value in both lookup and update.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
A later patch will introduce a BPF_MAP_TYPE_REUSEPORT_ARRAY which
allows a SO_REUSEPORT sk to be added to a bpf map. When a sk
is removed from reuse->socks[], it also needs to be removed from
the bpf map. Also, when adding a sk to a bpf map, the bpf
map needs to ensure it is indeed in a reuse->socks[].
Hence, reuseport_lock is needed by the bpf map to ensure its
map_update_elem() and map_delete_elem() operations are in-sync with
the reuse->socks[]. The BPF_MAP_TYPE_REUSEPORT_ARRAY map will only
acquire the reuseport_lock after ensuring the adding sk is already
in a reuseport group (i.e. reuse->socks[]). The map_lookup_elem()
will be lockless.
This patch also adds an ID to sock_reuseport. A later patch
will introduce BPF_PROG_TYPE_SK_REUSEPORT which allows
a bpf prog to select a sk from a bpf map. It is inflexible to
statically enforce a bpf map can only contain the sk belonging to
a particular reuse->socks[] (i.e. same IP:PORT) during the bpf
verification time. For example, think about the the map-in-map situation
where the inner map can be dynamically changed in runtime and the outer
map may have inner maps belonging to different reuseport groups.
Hence, when the bpf prog (in the new BPF_PROG_TYPE_SK_REUSEPORT
type) selects a sk, this selected sk has to be checked to ensure it
belongs to the requesting reuseport group (i.e. the group serving
that IP:PORT).
The "sk->sk_reuseport_cb" pointer cannot be used for this checking
purpose because the pointer value will change after reuseport_grow().
Instead of saving all checking conditions like the ones
preced calling "reuseport_add_sock()" and compare them everytime a
bpf_prog is run, a 32bits ID is introduced to survive the
reuseport_grow(). The ID is only acquired if any of the
reuse->socks[] is added to the newly introduced
"BPF_MAP_TYPE_REUSEPORT_ARRAY" map.
If "BPF_MAP_TYPE_REUSEPORT_ARRAY" is not used, the changes in this
patch is a no-op.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Although the actual cookie check "__cookie_v[46]_check()" does
not involve sk specific info, it checks whether the sk has recent
synq overflow event in "tcp_synq_no_recent_overflow()". The
tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
when it has sent out a syncookie (through "tcp_synq_overflow()").
The above per sk "recent synq overflow event timestamp" works well
for non SO_REUSEPORT use case. However, it may cause random
connection request reject/discard when SO_REUSEPORT is used with
syncookie because it fails the "tcp_synq_no_recent_overflow()"
test.
When SO_REUSEPORT is used, it usually has multiple listening
socks serving TCP connection requests destinated to the same local IP:PORT.
There are cases that the TCP-ACK-COOKIE may not be received
by the same sk that sent out the syncookie. For example,
if reuse->socks[] began with {sk0, sk1},
1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
was updated.
2) the reuse->socks[] became {sk1, sk2} later. e.g. sk0 was first closed
and then sk2 was added. Here, sk2 does not have ts_recent_stamp set.
There are other ordering that will trigger the similar situation
below but the idea is the same.
3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
"tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
all syncookies sent by sk1 will be handled (and rejected)
by sk2 while sk1 is still alive.
The userspace may create and remove listening SO_REUSEPORT sockets
as it sees fit. E.g. Adding new thread (and SO_REUSEPORT sock) to handle
incoming requests, old process stopping and new process starting...etc.
With or without SO_ATTACH_REUSEPORT_[CB]BPF,
the sockets leaving and joining a reuseport group makes picking
the same sk to check the syncookie very difficult (if not impossible).
The later patches will allow bpf prog more flexibility in deciding
where a sk should be located in a bpf map and selecting a particular
SO_REUSEPORT sock as it sees fit. e.g. Without closing any sock,
replace the whole bpf reuseport_array in one map_update() by using
map-in-map. Getting the syncookie check working smoothly across
socks in the same "reuse->socks[]" is important.
A partial solution is to set the newly added sk's ts_recent_stamp
to the max ts_recent_stamp of a reuseport group but that will require
to iterate through reuse->socks[] OR
pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
joining a reuseport group. However, neither of them will solve the
existing sk getting moved around the reuse->socks[] and that
sk may not have ts_recent_stamp updated, unlikely under continuous
synflood but not impossible.
This patch opts to treat the reuseport group as a whole when
considering the last synq overflow timestamp since
they are serving the same IP:PORT from the userspace
(and BPF program) perspective.
"synq_overflow_ts" is added to "struct sock_reuseport".
The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
will update/check reuse->synq_overflow_ts if the sk is
in a reuseport group. Similar to the reuseport decision in
__inet_lookup_listener(), both sk->sk_reuseport and
sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
Update on "synq_overflow_ts" happens at roughly once
every second.
A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
No meaningful performance change is observed. Before and
after the change is ~9Mpps in IPv4.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
We really, really don't want to be encouraging people to use
cifs (the dialect) since it is insecure, so to avoid confusion
we want to move them to names which include 'smb3' instead of
'cifs' - so this simply creates an alias for the pseudo-xattrs
e.g. can now do:
getfattr -n user.smb3.creationtime /mnt1/file
and
getfattr -n user.smb3.dosattrib /mnt1/file
and
getfattr -n system.smb3_acl /mnt1/file
instead of forcing you to use the string 'cifs' in
these (e.g. getfattr -n system.cifs_acl /mnt1/file)
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Let's reset i_gc_failures to zero when we unset pinned state for file.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
syzbot found the following crash on:
HEAD commit: d9bd94c0bcaa Add linux-next specific files for 20180801
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1001189c400000
kernel config: https://syzkaller.appspot.com/x/.config?x=cc8964ea4d04518c
dashboard link: https://syzkaller.appspot.com/bug?extid=c966a82db0b14aa37e81
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
Unfortunately, I don't have any reproducer for this crash yet.
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+c966a82db0b14aa37e81@syzkaller.appspotmail.com
loop7: rw=12288, want=8200, limit=20
netlink: 65342 bytes leftover after parsing attributes in process `syz-executor4'.
openvswitch: netlink: Message has 8 unknown bytes.
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
CPU: 1 PID: 7615 Comm: syz-executor7 Not tainted 4.18.0-rc7-next-20180801+ #29
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:188 [inline]
RIP: 0010:compound_head include/linux/page-flags.h:142 [inline]
RIP: 0010:PageLocked include/linux/page-flags.h:272 [inline]
RIP: 0010:f2fs_put_page fs/f2fs/f2fs.h:2011 [inline]
RIP: 0010:validate_checkpoint+0x66d/0xec0 fs/f2fs/checkpoint.c:835
Code: e8 58 05 7f fe 4c 8d 6b 80 4d 8d 74 24 08 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48 c1 ea 03 c6 04 02 00 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 f4 06 00 00 4c 89 ea 4d 8b 7c 24 08 48 b8 00 00
RSP: 0018:ffff8801937cebe8 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffff8801937cef30 RCX: ffffc90006035000
RDX: 0000000000000000 RSI: ffffffff82fd9658 RDI: 0000000000000005
RBP: ffff8801937cef58 R08: ffff8801ab254700 R09: fffff94000d9e026
R10: fffff94000d9e026 R11: ffffea0006cf0137 R12: fffffffffffffffb
R13: ffff8801937ceeb0 R14: 0000000000000003 R15: ffff880193419b40
FS: 00007f36a61d5700(0000) GS:ffff8801db100000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc04ff93000 CR3: 00000001d0562000 CR4: 00000000001426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
f2fs_get_valid_checkpoint+0x436/0x1ec0 fs/f2fs/checkpoint.c:860
f2fs_fill_super+0x2d42/0x8110 fs/f2fs/super.c:2883
mount_bdev+0x314/0x3e0 fs/super.c:1344
f2fs_mount+0x3c/0x50 fs/f2fs/super.c:3133
legacy_get_tree+0x131/0x460 fs/fs_context.c:729
vfs_get_tree+0x1cb/0x5c0 fs/super.c:1743
do_new_mount fs/namespace.c:2603 [inline]
do_mount+0x6f2/0x1e20 fs/namespace.c:2927
ksys_mount+0x12d/0x140 fs/namespace.c:3143
__do_sys_mount fs/namespace.c:3157 [inline]
__se_sys_mount fs/namespace.c:3154 [inline]
__x64_sys_mount+0xbe/0x150 fs/namespace.c:3154
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x45943a
Code: b8 a6 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 bd 8a fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 9a 8a fb ff c3 66 0f 1f 84 00 00 00 00 00
RSP: 002b:00007f36a61d4a88 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f36a61d4b30 RCX: 000000000045943a
RDX: 00007f36a61d4ad0 RSI: 0000000020000100 RDI: 00007f36a61d4af0
RBP: 0000000020000100 R08: 00007f36a61d4b30 R09: 00007f36a61d4ad0
R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000013
R13: 0000000000000000 R14: 00000000004c8ea0 R15: 0000000000000000
Modules linked in:
Dumping ftrace buffer:
(ftrace buffer empty)
---[ end trace bd8550c129352286 ]---
RIP: 0010:__read_once_size include/linux/compiler.h:188 [inline]
RIP: 0010:compound_head include/linux/page-flags.h:142 [inline]
RIP: 0010:PageLocked include/linux/page-flags.h:272 [inline]
RIP: 0010:f2fs_put_page fs/f2fs/f2fs.h:2011 [inline]
RIP: 0010:validate_checkpoint+0x66d/0xec0 fs/f2fs/checkpoint.c:835
Code: e8 58 05 7f fe 4c 8d 6b 80 4d 8d 74 24 08 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48 c1 ea 03 c6 04 02 00 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 f4 06 00 00 4c 89 ea 4d 8b 7c 24 08 48 b8 00 00
RSP: 0018:ffff8801937cebe8 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffff8801937cef30 RCX: ffffc90006035000
RDX: 0000000000000000 RSI: ffffffff82fd9658 RDI: 0000000000000005
netlink: 65342 bytes leftover after parsing attributes in process `syz-executor4'.
RBP: ffff8801937cef58 R08: ffff8801ab254700 R09: fffff94000d9e026
openvswitch: netlink: Message has 8 unknown bytes.
R10: fffff94000d9e026 R11: ffffea0006cf0137 R12: fffffffffffffffb
R13: ffff8801937ceeb0 R14: 0000000000000003 R15: ffff880193419b40
FS: 00007f36a61d5700(0000) GS:ffff8801db100000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc04ff93000 CR3: 00000001d0562000 CR4: 00000000001426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
In validate_checkpoint(), if we failed to call get_checkpoint_version(), we
will pass returned invalid page pointer into f2fs_put_page, cause accessing
invalid memory, this patch tries to handle error path correctly to fix this
issue.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
f2fs recovery flow is relying on dnode block link list, it means fsynced
file recovery depends on previous dnode's persistence in the list, so
during fsync() we should wait on all regular inode's dnode writebacked
before issuing flush.
By this way, we can avoid dnode block list being broken by out-of-order
IO submission due to IO scheduler or driver.
Sheng Yong helps to do the test with this patch:
Target:/data (f2fs, -)
64MB / 32768KB / 4KB / 8
1 / PERSIST / Index
Base:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 867.82 204.15 41440.03 41370.54 680.8 1025.94 1031.08
2 871.87 205.87 41370.3 40275.2 791.14 1065.84 1101.7
3 866.52 205.69 41795.67 40596.16 694.69 1037.16 1031.48
Avg 868.7366667 205.2366667 41535.33333 40747.3 722.21 1042.98 1054.753333
After:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 798.81 202.5 41143 40613.87 602.71 838.08 913.83
2 805.79 206.47 40297.2 41291.46 604.44 840.75 924.27
3 814.83 206.17 41209.57 40453.62 602.85 834.66 927.91
Avg 806.4766667 205.0466667 40883.25667 40786.31667 603.3333333 837.83 922.0033333
Patched/Original:
0.928332713 0.999074239 0.984300676 1.000957528 0.835398753 0.803303994 0.874141189
It looks like atomic write will suffer performance regression.
I suspect that the criminal is that we forcing to wait all dnode being in
storage cache before we issue PREFLUSH+FUA.
BTW, will commit ("f2fs: don't need to wait for node writes for atomic write")
cause the problem: we will lose data of last transaction after SPO, even if
atomic write return no error:
- atomic_open();
- write() P1, P2, P3;
- atomic_commit();
- writeback data: P1, P2, P3;
- writeback node: N1, N2, N3; <--- If N1, N2 is not writebacked, N3 with fsync_mark is
writebacked, In SPOR, we won't find N3 since node chain is broken, turns out that losing
last transaction.
- preflush + fua;
- power-cut
If we don't wait dnode writeback for atomic_write:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 779.91 206.03 41621.5 40333.16 716.9 1038.21 1034.85
2 848.51 204.35 40082.44 39486.17 791.83 1119.96 1083.77
3 772.12 206.27 41335.25 41599.65 723.29 1055.07 971.92
Avg 800.18 205.55 41013.06333 40472.99333 744.0066667 1071.08 1030.18
Patched/Original:
0.92108464 1.001526693 0.987425886 0.993268102 1.030180511 1.026942031 0.976702294
SQLite's performance recovers.
Jaegeuk:
"Practically, I don't see db corruption becase of this. We can excuse to lose
the last transaction."
Finally, we decide to keep original implementation of atomic write interface
sematics that we don't wait all dnode writeback before preflush+fua submission.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Return statements in functions returning bool should use true or false
instead of an integer value.
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
After fuzzing, cp_pack_start_sum could be corrupted, so current log's
summary info should be wrong due to loading incorrect summary block.
Then, if segment's type in current log is exceeded NR_CURSEG_TYPE, it
can lead accessing invalid dirty_i->dirty_segmap bitmap finally.
Add sanity check for cp_pack_start_sum to fix this issue.
https://bugzilla.kernel.org/show_bug.cgi?id=200419
- Reproduce
- Kernel message (f2fs-dev w/ KASAN)
[ 3117.578432] F2FS-fs (loop0): Invalid log blocks per segment (8)
[ 3117.578445] F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th superblock
[ 3117.581364] F2FS-fs (loop0): invalid crc_offset: 30716
[ 3117.583564] WARNING: CPU: 1 PID: 1225 at fs/f2fs/checkpoint.c:90 __get_meta_page+0x448/0x4b0
[ 3117.583570] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 3117.584014] CPU: 1 PID: 1225 Comm: mount Not tainted 4.17.0+ #1
[ 3117.584017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.584022] RIP: 0010:__get_meta_page+0x448/0x4b0
[ 3117.584023] Code: 00 49 8d bc 24 84 00 00 00 e8 74 54 da ff 41 83 8c 24 84 00 00 00 08 4c 89 f6 4c 89 ef e8 c0 d9 95 00 48 89 ef e8 18 e3 00 00 <0f> 0b f0 80 4d 48 04 e9 0f fe ff ff 0f 0b 48 89 c7 48 89 04 24 e8
[ 3117.584072] RSP: 0018:ffff88018eb678c0 EFLAGS: 00010286
[ 3117.584082] RAX: ffff88018f0a6a78 RBX: ffffea0007a46600 RCX: ffffffff9314d1b2
[ 3117.584085] RDX: ffffffff00000001 RSI: 0000000000000000 RDI: ffff88018f0a6a98
[ 3117.584087] RBP: ffff88018ebe9980 R08: 0000000000000002 R09: 0000000000000001
[ 3117.584090] R10: 0000000000000001 R11: ffffed00326e4450 R12: ffff880193722200
[ 3117.584092] R13: ffff88018ebe9afc R14: 0000000000000206 R15: ffff88018eb67900
[ 3117.584096] FS: 00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3117.584098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3117.584101] CR2: 00000000016f21b8 CR3: 0000000191c22000 CR4: 00000000000006e0
[ 3117.584112] Call Trace:
[ 3117.584121] ? f2fs_set_meta_page_dirty+0x150/0x150
[ 3117.584127] ? f2fs_build_segment_manager+0xbf9/0x3190
[ 3117.584133] ? f2fs_npages_for_summary_flush+0x75/0x120
[ 3117.584145] f2fs_build_segment_manager+0xda8/0x3190
[ 3117.584151] ? f2fs_get_valid_checkpoint+0x298/0xa00
[ 3117.584156] ? f2fs_flush_sit_entries+0x10e0/0x10e0
[ 3117.584184] ? map_id_range_down+0x17c/0x1b0
[ 3117.584188] ? __put_user_ns+0x30/0x30
[ 3117.584206] ? find_next_bit+0x53/0x90
[ 3117.584237] ? cpumask_next+0x16/0x20
[ 3117.584249] f2fs_fill_super+0x1948/0x2b40
[ 3117.584258] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584279] ? sget_userns+0x65e/0x690
[ 3117.584296] ? set_blocksize+0x88/0x130
[ 3117.584302] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584305] mount_bdev+0x1c0/0x200
[ 3117.584310] mount_fs+0x5c/0x190
[ 3117.584320] vfs_kern_mount+0x64/0x190
[ 3117.584330] do_mount+0x2e4/0x1450
[ 3117.584343] ? lockref_put_return+0x130/0x130
[ 3117.584347] ? copy_mount_string+0x20/0x20
[ 3117.584357] ? kasan_unpoison_shadow+0x31/0x40
[ 3117.584362] ? kasan_kmalloc+0xa6/0xd0
[ 3117.584373] ? memcg_kmem_put_cache+0x16/0x90
[ 3117.584377] ? __kmalloc_track_caller+0x196/0x210
[ 3117.584383] ? _copy_from_user+0x61/0x90
[ 3117.584396] ? memdup_user+0x3e/0x60
[ 3117.584401] ksys_mount+0x7e/0xd0
[ 3117.584405] __x64_sys_mount+0x62/0x70
[ 3117.584427] do_syscall_64+0x73/0x160
[ 3117.584440] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.584455] RIP: 0033:0x7f5693f14b9a
[ 3117.584456] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.584505] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.584510] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.584512] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.584514] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.584516] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.584519] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.584523] ---[ end trace a8e0d899985faf31 ]---
[ 3117.685663] F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=2, run fsck to fix.
[ 3117.685673] F2FS-fs (loop0): recover_data: ino = 2 (i_size: recover) recovered = 1, err = 0
[ 3117.685707] ==================================================================
[ 3117.685955] BUG: KASAN: slab-out-of-bounds in __remove_dirty_segment+0xdd/0x1e0
[ 3117.686175] Read of size 8 at addr ffff88018f0a63d0 by task mount/1225
[ 3117.686477] CPU: 0 PID: 1225 Comm: mount Tainted: G W 4.17.0+ #1
[ 3117.686481] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.686483] Call Trace:
[ 3117.686494] dump_stack+0x71/0xab
[ 3117.686512] print_address_description+0x6b/0x290
[ 3117.686517] kasan_report+0x28e/0x390
[ 3117.686522] ? __remove_dirty_segment+0xdd/0x1e0
[ 3117.686527] __remove_dirty_segment+0xdd/0x1e0
[ 3117.686532] locate_dirty_segment+0x189/0x190
[ 3117.686538] f2fs_allocate_new_segments+0xa9/0xe0
[ 3117.686543] recover_data+0x703/0x2c20
[ 3117.686547] ? f2fs_recover_fsync_data+0x48f/0xd50
[ 3117.686553] ? ksys_mount+0x7e/0xd0
[ 3117.686564] ? policy_nodemask+0x1a/0x90
[ 3117.686567] ? policy_node+0x56/0x70
[ 3117.686571] ? add_fsync_inode+0xf0/0xf0
[ 3117.686592] ? blk_finish_plug+0x44/0x60
[ 3117.686597] ? f2fs_ra_meta_pages+0x38b/0x5e0
[ 3117.686602] ? find_inode_fast+0xac/0xc0
[ 3117.686606] ? f2fs_is_valid_blkaddr+0x320/0x320
[ 3117.686618] ? __radix_tree_lookup+0x150/0x150
[ 3117.686633] ? dqget+0x670/0x670
[ 3117.686648] ? pagecache_get_page+0x29/0x410
[ 3117.686656] ? kmem_cache_alloc+0x176/0x1e0
[ 3117.686660] ? f2fs_is_valid_blkaddr+0x11d/0x320
[ 3117.686664] f2fs_recover_fsync_data+0xc23/0xd50
[ 3117.686670] ? f2fs_space_for_roll_forward+0x60/0x60
[ 3117.686674] ? rb_insert_color+0x323/0x3d0
[ 3117.686678] ? f2fs_recover_orphan_inodes+0xa5/0x700
[ 3117.686683] ? proc_register+0x153/0x1d0
[ 3117.686686] ? f2fs_remove_orphan_inode+0x10/0x10
[ 3117.686695] ? f2fs_attr_store+0x50/0x50
[ 3117.686700] ? proc_create_single_data+0x52/0x60
[ 3117.686707] f2fs_fill_super+0x1d06/0x2b40
[ 3117.686728] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.686735] ? sget_userns+0x65e/0x690
[ 3117.686740] ? set_blocksize+0x88/0x130
[ 3117.686745] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.686748] mount_bdev+0x1c0/0x200
[ 3117.686753] mount_fs+0x5c/0x190
[ 3117.686758] vfs_kern_mount+0x64/0x190
[ 3117.686762] do_mount+0x2e4/0x1450
[ 3117.686769] ? lockref_put_return+0x130/0x130
[ 3117.686773] ? copy_mount_string+0x20/0x20
[ 3117.686777] ? kasan_unpoison_shadow+0x31/0x40
[ 3117.686780] ? kasan_kmalloc+0xa6/0xd0
[ 3117.686786] ? memcg_kmem_put_cache+0x16/0x90
[ 3117.686790] ? __kmalloc_track_caller+0x196/0x210
[ 3117.686795] ? _copy_from_user+0x61/0x90
[ 3117.686801] ? memdup_user+0x3e/0x60
[ 3117.686804] ksys_mount+0x7e/0xd0
[ 3117.686809] __x64_sys_mount+0x62/0x70
[ 3117.686816] do_syscall_64+0x73/0x160
[ 3117.686824] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.686829] RIP: 0033:0x7f5693f14b9a
[ 3117.686830] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.686887] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.686892] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.686894] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.686896] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.686899] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.686901] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.687005] Allocated by task 1225:
[ 3117.687152] kasan_kmalloc+0xa6/0xd0
[ 3117.687157] kmem_cache_alloc_trace+0xfd/0x200
[ 3117.687161] f2fs_build_segment_manager+0x2d09/0x3190
[ 3117.687165] f2fs_fill_super+0x1948/0x2b40
[ 3117.687168] mount_bdev+0x1c0/0x200
[ 3117.687171] mount_fs+0x5c/0x190
[ 3117.687174] vfs_kern_mount+0x64/0x190
[ 3117.687177] do_mount+0x2e4/0x1450
[ 3117.687180] ksys_mount+0x7e/0xd0
[ 3117.687182] __x64_sys_mount+0x62/0x70
[ 3117.687186] do_syscall_64+0x73/0x160
[ 3117.687190] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.687285] Freed by task 19:
[ 3117.687412] __kasan_slab_free+0x137/0x190
[ 3117.687416] kfree+0x8b/0x1b0
[ 3117.687460] ttm_bo_man_put_node+0x61/0x80 [ttm]
[ 3117.687476] ttm_bo_cleanup_refs+0x15f/0x250 [ttm]
[ 3117.687492] ttm_bo_delayed_delete+0x2f0/0x300 [ttm]
[ 3117.687507] ttm_bo_delayed_workqueue+0x17/0x50 [ttm]
[ 3117.687528] process_one_work+0x2f9/0x740
[ 3117.687531] worker_thread+0x78/0x6b0
[ 3117.687541] kthread+0x177/0x1c0
[ 3117.687545] ret_from_fork+0x35/0x40
[ 3117.687638] The buggy address belongs to the object at ffff88018f0a6300
which belongs to the cache kmalloc-192 of size 192
[ 3117.688014] The buggy address is located 16 bytes to the right of
192-byte region [ffff88018f0a6300, ffff88018f0a63c0)
[ 3117.688382] The buggy address belongs to the page:
[ 3117.688554] page:ffffea00063c2980 count:1 mapcount:0 mapping:ffff8801f3403180 index:0x0
[ 3117.688788] flags: 0x17fff8000000100(slab)
[ 3117.688944] raw: 017fff8000000100 ffffea00063c2840 0000000e0000000e ffff8801f3403180
[ 3117.689166] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
[ 3117.689386] page dumped because: kasan: bad access detected
[ 3117.689653] Memory state around the buggy address:
[ 3117.689816] ffff88018f0a6280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 3117.690027] ffff88018f0a6300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3117.690239] >ffff88018f0a6380: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3117.690448] ^
[ 3117.690644] ffff88018f0a6400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3117.690868] ffff88018f0a6480: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3117.691077] ==================================================================
[ 3117.691290] Disabling lock debugging due to kernel taint
[ 3117.693893] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 3117.694120] PGD 80000001f01bc067 P4D 80000001f01bc067 PUD 1d9638067 PMD 0
[ 3117.694338] Oops: 0002 [#1] SMP KASAN PTI
[ 3117.694490] CPU: 1 PID: 1225 Comm: mount Tainted: G B W 4.17.0+ #1
[ 3117.694703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.695073] RIP: 0010:__remove_dirty_segment+0xe2/0x1e0
[ 3117.695246] Code: c4 48 89 c7 e8 cf bb d7 ff 45 0f b6 24 24 41 83 e4 3f 44 88 64 24 07 41 83 e4 3f 4a 8d 7c e3 08 e8 b3 bc d7 ff 4a 8b 4c e3 08 <f0> 4c 0f b3 29 0f 82 94 00 00 00 48 8d bd 20 04 00 00 e8 97 bb d7
[ 3117.695793] RSP: 0018:ffff88018eb67638 EFLAGS: 00010292
[ 3117.695969] RAX: 0000000000000000 RBX: ffff88018f0a6300 RCX: 0000000000000000
[ 3117.696182] RDX: 0000000000000000 RSI: 0000000000000297 RDI: 0000000000000297
[ 3117.696391] RBP: ffff88018ebe9980 R08: ffffed003e743ebb R09: ffffed003e743ebb
[ 3117.696604] R10: 0000000000000001 R11: ffffed003e743eba R12: 0000000000000019
[ 3117.696813] R13: 0000000000000014 R14: 0000000000000320 R15: ffff88018ebe99e0
[ 3117.697032] FS: 00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3117.697280] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3117.702357] CR2: 00007fe89bb1a000 CR3: 0000000191c22000 CR4: 00000000000006e0
[ 3117.707235] Call Trace:
[ 3117.712077] locate_dirty_segment+0x189/0x190
[ 3117.716891] f2fs_allocate_new_segments+0xa9/0xe0
[ 3117.721617] recover_data+0x703/0x2c20
[ 3117.726316] ? f2fs_recover_fsync_data+0x48f/0xd50
[ 3117.730957] ? ksys_mount+0x7e/0xd0
[ 3117.735573] ? policy_nodemask+0x1a/0x90
[ 3117.740198] ? policy_node+0x56/0x70
[ 3117.744829] ? add_fsync_inode+0xf0/0xf0
[ 3117.749487] ? blk_finish_plug+0x44/0x60
[ 3117.754152] ? f2fs_ra_meta_pages+0x38b/0x5e0
[ 3117.758831] ? find_inode_fast+0xac/0xc0
[ 3117.763448] ? f2fs_is_valid_blkaddr+0x320/0x320
[ 3117.768046] ? __radix_tree_lookup+0x150/0x150
[ 3117.772603] ? dqget+0x670/0x670
[ 3117.777159] ? pagecache_get_page+0x29/0x410
[ 3117.781648] ? kmem_cache_alloc+0x176/0x1e0
[ 3117.786067] ? f2fs_is_valid_blkaddr+0x11d/0x320
[ 3117.790476] f2fs_recover_fsync_data+0xc23/0xd50
[ 3117.794790] ? f2fs_space_for_roll_forward+0x60/0x60
[ 3117.799086] ? rb_insert_color+0x323/0x3d0
[ 3117.803304] ? f2fs_recover_orphan_inodes+0xa5/0x700
[ 3117.807563] ? proc_register+0x153/0x1d0
[ 3117.811766] ? f2fs_remove_orphan_inode+0x10/0x10
[ 3117.815947] ? f2fs_attr_store+0x50/0x50
[ 3117.820087] ? proc_create_single_data+0x52/0x60
[ 3117.824262] f2fs_fill_super+0x1d06/0x2b40
[ 3117.828367] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.832432] ? sget_userns+0x65e/0x690
[ 3117.836500] ? set_blocksize+0x88/0x130
[ 3117.840501] ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.844420] mount_bdev+0x1c0/0x200
[ 3117.848275] mount_fs+0x5c/0x190
[ 3117.852053] vfs_kern_mount+0x64/0x190
[ 3117.855810] do_mount+0x2e4/0x1450
[ 3117.859441] ? lockref_put_return+0x130/0x130
[ 3117.862996] ? copy_mount_string+0x20/0x20
[ 3117.866417] ? kasan_unpoison_shadow+0x31/0x40
[ 3117.869719] ? kasan_kmalloc+0xa6/0xd0
[ 3117.872948] ? memcg_kmem_put_cache+0x16/0x90
[ 3117.876121] ? __kmalloc_track_caller+0x196/0x210
[ 3117.879333] ? _copy_from_user+0x61/0x90
[ 3117.882467] ? memdup_user+0x3e/0x60
[ 3117.885604] ksys_mount+0x7e/0xd0
[ 3117.888700] __x64_sys_mount+0x62/0x70
[ 3117.891742] do_syscall_64+0x73/0x160
[ 3117.894692] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.897669] RIP: 0033:0x7f5693f14b9a
[ 3117.900563] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.906922] RSP: 002b:00007fff27346488 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 3117.910159] RAX: ffffffffffffffda RBX: 00000000016e2030 RCX: 00007f5693f14b9a
[ 3117.913469] RDX: 00000000016e2210 RSI: 00000000016e3f30 RDI: 00000000016ee040
[ 3117.916764] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[ 3117.920071] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 00000000016ee040
[ 3117.923393] R13: 00000000016e2210 R14: 0000000000000000 R15: 0000000000000003
[ 3117.926680] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 3117.949979] CR2: 0000000000000000
[ 3117.954283] ---[ end trace a8e0d899985faf32 ]---
[ 3117.958575] RIP: 0010:__remove_dirty_segment+0xe2/0x1e0
[ 3117.962810] Code: c4 48 89 c7 e8 cf bb d7 ff 45 0f b6 24 24 41 83 e4 3f 44 88 64 24 07 41 83 e4 3f 4a 8d 7c e3 08 e8 b3 bc d7 ff 4a 8b 4c e3 08 <f0> 4c 0f b3 29 0f 82 94 00 00 00 48 8d bd 20 04 00 00 e8 97 bb d7
[ 3117.971789] RSP: 0018:ffff88018eb67638 EFLAGS: 00010292
[ 3117.976333] RAX: 0000000000000000 RBX: ffff88018f0a6300 RCX: 0000000000000000
[ 3117.980926] RDX: 0000000000000000 RSI: 0000000000000297 RDI: 0000000000000297
[ 3117.985497] RBP: ffff88018ebe9980 R08: ffffed003e743ebb R09: ffffed003e743ebb
[ 3117.990098] R10: 0000000000000001 R11: ffffed003e743eba R12: 0000000000000019
[ 3117.994761] R13: 0000000000000014 R14: 0000000000000320 R15: ffff88018ebe99e0
[ 3117.999392] FS: 00007f5694636840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 3118.004096] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3118.008816] CR2: 00007fe89bb1a000 CR3: 0000000191c22000 CR4: 00000000000006e0
- Location
https://elixir.bootlin.com/linux/v4.18-rc3/source/fs/f2fs/segment.c#L775
if (test_and_clear_bit(segno, dirty_i->dirty_segmap[t]))
dirty_i->nr_dirty[t]--;
Here dirty_i->dirty_segmap[t] can be NULL which leads to crash in test_and_clear_bit()
Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
There is a subtle race condition to invoke f2fs_bug_on() in shutdown tests. I've
confirmed that the last checkpoint is preserved in consistent state, so it'd be
fine to just return error at this moment.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
PG_checked flag will be set on data page during GC, later, we can
recognize such page by the flag and migrate page to cold segment.
But previously, we don't clear this flag when invalidating data page,
after page redirtying, we will write it into wrong log.
Let's clear PG_checked flag in set_page_dirty() to avoid this.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Trivial fix to spelling mistake in dev_err message and comment
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
This change validates the physical encoder before it
is dereferenced.
Signed-off-by: Jeykumar Sankaran <jsanka@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Add support for the A6XX family of Adreno GPUs. The biggest addition
is the GMU (Graphics Management Unit) which takes over most of the
power management of the GPU itself but in a ironic twist of fate
needs a goodly amount of management itself. Add support for the
A6XX core code, the GMU and the HFI (hardware firmware interface)
queue that the CPU uses to communicate with the GMU.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Resync generated headers to pull in a6xx registers.
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Failure to load firmware is the primary reason to fail adreno_load_gpu().
Try to load it first before going into the hardware initialization code and
unwinding it. This is important for a6xx because the GMU gets loaded from
the runtime power code and it is more costly to fail in that path because
of missing firmware.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Add a helper function to parse the clock names and set up
the bulk data so we can take advantage of the bulk clock
functions instead of rolling our own. This is added
as a helper function so the upcoming a6xx GMU code can
also take advantage of it.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
|
|
Fix typos, line length, grammar, punctuation, and capitalization
in console.txt.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Antonino A. Daplas <adaplas@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
|
Update ioctl-number.txt for ioctl's that are defined in
<media/v4l2-subdev.h>.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
|
This small commit replaces gendered pronouns for neutral ones.
Signed-off-by: Fox Foster <fox@tardis.ed.ac.uk>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
|
Memory in the bundle is valuable, do not waste it holding an 8 byte
pointer for the rare case of writing to a PTR_OUT. We can compute the
pointer by storing a small 1 byte array offset and the base address of the
uattr memory in the bundle private memory.
This also means we can access the kernel's copy of the ib_uverbs_attr, so
drop the copy of flags as well.
Since the uattr base should be private bundle information this also
de-inlines the already too big uverbs_copy_to inline and moves
create_udata into uverbs_ioctl.c so they can see the private struct
definition.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
|
|
This already existed as the anonymous 'ctx' structure, but this was not
really a useful form. Hoist this struct into bundle_priv and rework the
internal things to use it instead.
Move a bunch of the processing internal state into the priv and reduce the
excessive use of function arguments.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Currently the struct uverbs_obj_type stored in the ib_uobject is part of
the .rodata segment of the module that defines the object. This is a
problem if drivers define new uapi objects as we will be left with a
dangling pointer after device disassociation.
Switch the uverbs_obj_type for struct uverbs_api_object, which is
allocated memory that is part of the uverbs_api and is guaranteed to
always exist. Further this moves the 'type_class' into this memory which
means access to the IDR/FD function pointers is also guaranteed. Drivers
cannot define new types.
This makes it safe to continue to use all uobjects, including driver
defined ones, after disassociation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This radix tree datastructure is intended to replace the 'hash' structure
used today for parsing ioctl methods during system calls. This first
commit introduces the structure and builds it from the existing .rodata
descriptions.
The so-called hash arrangement is actually a 5 level open coded radix tree.
This new version uses a 3 level radix tree built using the radix tree
library.
Overall this is much less code and much easier to build as the radix tree
API allows for dynamic modification during the building. There is a small
memory penalty to pay for this, but since the radix tree is allocated on
a per device basis, a few kb of RAM seems immaterial considering the
gained simplicity.
The radix tree is similar to the existing tree, but also has a 'attr_bkey'
concept, which is a small value'd index for each method attribute. This is
used to simplify and improve performance of everything in the next
patches.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
|
|
There is no reason for drivers to do this, the core code should take of
everything. The drivers will provide their information from rodata to
describe their modifications to the core's base uapi specification.
The core uses this to build up the runtime uapi for each device.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Add LED identification support for liquidio TP copperhead cards.
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fixes: 5e7baf0fcb2a ("qed/qede: Multi CoS support.")
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The function mlxsw_core_driver_put only traverse mlxsw_core_driver_list
to find the matched mlxsw_driver,but never used it.
So it can be removed safely.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The mvneta Ethernet driver is used on a few different Marvell SoCs.
Some SoCs have per cpu interrupts for Ethernet events, the driver uses
a per CPU napi structure for this case. Some SoCs such as armada 3700
have a single interrupt for Ethernet events, the driver uses a global
napi structure for this case.
Current mvneta_config_rss() always operates the per cpu napi structure.
Fix it by operating a global napi for "single interrupt" case, and per
cpu napi structure for remaining cases.
Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Fixes: 2636ac3cc2b4 ("net: mvneta: Add network support for Armada 3700 SoC")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
With SMC-D z/OS sends a test link signal every 10 seconds. Linux is
supposed to answer, otherwise the SMC-D connection breaks.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Heiner Kallweit says:
====================
r8169: smaller improvements
This series includes smaller improvements, no functional change
intended.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We don't have to configure the max jumbo frame size per chip
(sub-)version. It can be easily determined based on the chip family.
And new members of the RTL8168 family (if there are any) should be
automatically covered.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We don't have to configure the csum function per chip (sub-)version.
The distinction is simple, versions RTL8102e and from RTL8168c onwards
support csum_v2.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Simplify the interrupt handler a little and make it better readable.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The asm headers shouldn't be included directly. asm/irq.h is
implicitly included by linux/interrupt.h, and instead of
asm/io.h include linux/io.h.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The version number hasn't changed for ages and in general I doubt it
provides any benefit. The message in rtl_init_one() may even be
misleading because it's printed also if something fails in probe.
Therefore let's remove the version information.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:
====================
pull request: bluetooth-next 2018-08-10
Here's one more (most likely last) bluetooth-next pull request for the
4.19 kernel.
- Added support for MediaTek serial Bluetooth devices
- Initial skeleton for controller-side address resolution support
- Fix BT_HCIUART_RTL related Kconfig dependencies
- A few other minor fixes/cleanups
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This was tested on actual hardware and found to work fine, but currently
the official specifications of this chip could not be obtained to
confirm the numbers.
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
The DIO connector on the WAFER-945GSE is interfaced to GPIO ports
on the ITE IT8718F Super I/O chipset. From the datasheet of ITE IT8718F,
the GPIO interface is identical to IT8728, so just add it
to the same case as the other chip.
Signed-off-by: Ivan Podovalov <ipodovalov@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
Add a check for unused gpios to avoid chip->request() call to client
driver for unused gpios.
Signed-off-by: Biju Das <biju.das@bp.renesas.com>
Reviewed-by: Fabrizio Castro <fabrizio.castro@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
This is a GPIO driver so include only <linux/gpio/driver.h>.
Drop the use of GPIOF_* flags: these are for consumers, not
drivers. Just return 0/1.
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thierry Reding <treding@nvidia.com>
Reviewed-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
The bgpio_init() takes one of two arguments to specify a register
to set the direction of the GPIO line: either dirout that
indicates that a 1 in the bit in that register sets the
corresponding line to output, or dirin which indicates that
a 1 in the bit in that register sets the corresponding line to
input. Conversely setting the bit to 0 on these will turn the
line into input and output respectively. One of these can
be defined but not both.
This means that a platform that sets a bit to 1 for output
only defines dirout and a platform that sets a bit to 0 for
output only defines dirin. In short this defines the polarity
of the direction register.
Both can also be left as NULL meaning the GPIO chip is either
input only or output only.
Tomer Maimon discovered that for get/set chips (those where the
get and set registers are defined but no separate clear register,
and specifying BGPIOF_READ_OUTPUT_REG_SET so that we say we
want to read the output value from the SET register)
we are unconditionally reading the value from the SET register
when the direction bit is 1 and from the DAT register when the
direction bit is 0, not taking the direction bit polarity into
account.
It would be expected that when the direction bit is inverted
(dirin is defined but not dirout) we read the current value from
the DAT register when the bit is 1 and from the SET register
when the bit is 0.
Currently only some versions of ATH79, brcmstb, some versions of
CLP711x, GE, IOP and Loongson use the dirin mode (a 1 in the
register means input). They are unaffected because
BGPIOF_READ_OUTPUT_REG_SET is not set on any of them. (They
do not read back the SET register to figure out the output
value.) So this is no regression with current drivers.
However the behaviour is wrong and does not work with Tomer's
new driver where he needs to use the BGIOF_READ_OUTPUT_REG_SET.
This fixes the above issue by:
- Instead of defining separate functions for the inverted case,
set up a flag in the gpio_chip that indicates that the
direction is inverted.
- Remove the special inverted functions for setting
input/output and getting the direction, rely on the flag
instead.
- Respect this flag in bgpio_get_set() and
bgpio_get_set_multiple()
Reported-by: Tomer Maimon <tmaimon77@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
This is a GPIO driver so use only <linux/gpio/driver.h>.
Acked-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
This is harmless, but "val" isn't necessarily initialized if
abx500_get_register_interruptible() fails. I've re-arranged the code to
just return an error code in that situation.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
There is no check that allocation in axp20x_funcs_groups_from_mask
is successful.
The patch adds corresponding check and return values.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Anton Vasilyev <vasilyev@ispras.ru>
Acked-by: Chen-Yu Tsai <wens@csie.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
This is a GPIO driver so include only <linux/gpio/driver.h>.
Cc: Richard Röjfors <richard.rojfors@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|