Age | Commit message (Collapse) | Author |
|
BPF end-user on Cilium slack-channel (Carlo Carraro) wants to use
bpf_fib_lookup for doing MTU-check, but *prior* to extending packet size,
by adjusting fib_params 'tot_len' with the packet length plus the expected
encap size. (Just like the bpf_check_mtu helper supports). He discovered
that for SKB ctx the param->tot_len was not used, instead skb->len was used
(via MTU check in is_skb_forwardable() that checks against netdev MTU).
Fix this by using fib_params 'tot_len' for MTU check. If not provided (e.g.
zero) then keep existing TC behaviour intact. Notice that 'tot_len' for MTU
check is done like XDP code-path, which checks against FIB-dst MTU.
V16:
- Revert V13 optimization, 2nd lookup is against egress/resulting netdev
V13:
- Only do ifindex lookup one time, calling dev_get_by_index_rcu().
V10:
- Use same method as XDP for 'tot_len' MTU check
Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status")
Reported-by: Carlo Carraro <colrack@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/161287789444.790810.15247494756551413508.stgit@firesoul
|
|
Multiple BPF-helpers that can manipulate/increase the size of the SKB uses
__bpf_skb_max_len() as the max-length. This function limit size against
the current net_device MTU (skb->dev->mtu).
When a BPF-prog grow the packet size, then it should not be limited to the
MTU. The MTU is a transmit limitation, and software receiving this packet
should be allowed to increase the size. Further more, current MTU check in
__bpf_skb_max_len uses the MTU from ingress/current net_device, which in
case of redirects uses the wrong net_device.
This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit
is elsewhere in the system. Jesper's testing[1] showed it was not possible
to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting
factor is the define KMALLOC_MAX_CACHE_SIZE which is 8192 for
SLUB-allocator (CONFIG_SLUB) in-case PAGE_SIZE is 4096. This define is
in-effect due to this being called from softirq context see code
__gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed
that frames above 16KiB can cause NICs to reset (but not crash). Keep this
sanity limit at this level as memory layer can differ based on kernel
config.
[1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/161287788936.790810.2937823995775097177.stgit@firesoul
|
|
Recently noticed that when mod32 with a known src reg of 0 is performed,
then the dst register is 32-bit truncated in verifier:
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
0: (b7) r0 = 0
1: R0_w=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
1: (b7) r1 = -1
2: R0_w=inv0 R1_w=inv-1 R10=fp0
2: (b4) w2 = -1
3: R0_w=inv0 R1_w=inv-1 R2_w=inv4294967295 R10=fp0
3: (9c) w1 %= w0
4: R0_w=inv0 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0
4: (b7) r0 = 1
5: R0_w=inv1 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0
5: (1d) if r1 == r2 goto pc+1
R0_w=inv1 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0
6: R0_w=inv1 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0
6: (b7) r0 = 2
7: R0_w=inv2 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0
7: (95) exit
7: R0=inv1 R1=inv(id=0,umin_value=4294967295,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2=inv4294967295 R10=fp0
7: (95) exit
However, as a runtime result, we get 2 instead of 1, meaning the dst
register does not contain (u32)-1 in this case. The reason is fairly
straight forward given the 0 test leaves the dst register as-is:
# ./bpftool p d x i 23
0: (b7) r0 = 0
1: (b7) r1 = -1
2: (b4) w2 = -1
3: (16) if w0 == 0x0 goto pc+1
4: (9c) w1 %= w0
5: (b7) r0 = 1
6: (1d) if r1 == r2 goto pc+1
7: (b7) r0 = 2
8: (95) exit
This was originally not an issue given the dst register was marked as
completely unknown (aka 64 bit unknown). However, after 468f6eafa6c4
("bpf: fix 32-bit ALU op verification") the verifier casts the register
output to 32 bit, and hence it becomes 32 bit unknown. Note that for
the case where the src register is unknown, the dst register is marked
64 bit unknown. After the fix, the register is truncated by the runtime
and the test passes:
# ./bpftool p d x i 23
0: (b7) r0 = 0
1: (b7) r1 = -1
2: (b4) w2 = -1
3: (16) if w0 == 0x0 goto pc+2
4: (9c) w1 %= w0
5: (05) goto pc+1
6: (bc) w1 = w1
7: (b7) r0 = 1
8: (1d) if r1 == r2 goto pc+1
9: (b7) r0 = 2
10: (95) exit
Semantics also match with {R,W}x mod{64,32} 0 -> {R,W}x. Invalid div
has always been {R,W}x div{64,32} 0 -> 0. Rewrites are as follows:
mod32: mod64:
(16) if w0 == 0x0 goto pc+2 (15) if r0 == 0x0 goto pc+1
(9c) w1 %= w0 (9f) r1 %= r0
(05) goto pc+1
(bc) w1 = w1
Fixes: 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
The devmap bulk queue is allocated with GFP_ATOMIC and the allocation
may fail if there is no available space in existing percpu pool.
Since commit 75ccae62cb8d42 ("xdp: Move devmap bulk queue into struct net_device")
moved the bulk queue allocation to NETDEV_REGISTER callback, whose context
is allowed to sleep, use GFP_KERNEL instead of GFP_ATOMIC to let percpu
allocator extend the pool when needed and avoid possible failure of netdev
registration.
As the required alignment is natural, we can simply use alloc_percpu().
Fixes: 75ccae62cb8d42 ("xdp: Move devmap bulk queue into struct net_device")
Signed-off-by: Jun'ichi Nomura <junichi.nomura@nec.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210209082451.GA44021@jeru.linux.bs1.fc.nec.co.jp
|
|
Pull cifs fixes from Steve French:
"Four small smb3 fixes to the new mount API (including a particularly
important one for DFS links).
These were found in testing this week of additional DFS scenarios, and
a user testing of an apache container problem"
* tag '5.11-rc7-smb3-github' of git://github.com/smfrench/smb3-kernel:
cifs: Set CIFS_MOUNT_USE_PREFIX_PATH flag on setting cifs_sb->prepath.
cifs: In the new mount api we get the full devname as source=
cifs: do not disable noperm if multiuser mount option is not provided
cifs: fix dfs-links
|
|
Let's allow mounting readonly partition. We're able to recovery later once we
have it as read-write back.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Perf failed to add a kretprobe event with debuginfo of vmlinux which is
compiled by gcc with -fpatchable-function-entry option enabled. The
same issue with kernel module.
Issue:
# perf probe -v 'kernel_clone%return $retval'
......
Writing event: r:probe/kernel_clone__return _text+599624 $retval
Failed to write event: Invalid argument
Error: Failed to add events. Reason: Invalid argument (Code: -22)
# cat /sys/kernel/debug/tracing/error_log
[156.75] trace_kprobe: error: Retprobe address must be an function entry
Command: r:probe/kernel_clone__return _text+599624 $retval
^
# llvm-dwarfdump vmlinux |grep -A 10 -w 0x00df2c2b
0x00df2c2b: DW_TAG_subprogram
DW_AT_external (true)
DW_AT_name ("kernel_clone")
DW_AT_decl_file ("/home/code/linux-next/kernel/fork.c")
DW_AT_decl_line (2423)
DW_AT_decl_column (0x07)
DW_AT_prototyped (true)
DW_AT_type (0x00dcd492 "pid_t")
DW_AT_low_pc (0xffff800010092648)
DW_AT_high_pc (0xffff800010092b9c)
DW_AT_frame_base (DW_OP_call_frame_cfa)
# cat /proc/kallsyms |grep kernel_clone
ffff800010092640 T kernel_clone
# readelf -s vmlinux |grep -i kernel_clone
183173: ffff800010092640 1372 FUNC GLOBAL DEFAULT 2 kernel_clone
# objdump -d vmlinux |grep -A 10 -w \<kernel_clone\>:
ffff800010092640 <kernel_clone>:
ffff800010092640: d503201f nop
ffff800010092644: d503201f nop
ffff800010092648: d503233f paciasp
ffff80001009264c: a9b87bfd stp x29, x30, [sp, #-128]!
ffff800010092650: 910003fd mov x29, sp
ffff800010092654: a90153f3 stp x19, x20, [sp, #16]
The entry address of kernel_clone converted by debuginfo is _text+599624
(0x92648), which is consistent with the value of DW_AT_low_pc attribute.
But the symbolic address of kernel_clone from /proc/kallsyms is
ffff800010092640.
This issue is found on arm64, -fpatchable-function-entry=2 is enabled when
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y;
Just as objdump displayed the assembler contents of kernel_clone,
GCC generate 2 NOPs at the beginning of each function.
kprobe_on_func_entry detects that (_text+599624) is not the entry address
of the function, which leads to the failure of adding kretprobe event.
kprobe_on_func_entry
->_kprobe_addr
->kallsyms_lookup_size_offset
->arch_kprobe_on_func_entry // FALSE
The cause of the issue is that the first instruction in the compile unit
indicated by DW_AT_low_pc does not include NOPs.
This issue exists in all gcc versions that support
-fpatchable-function-entry option.
I have reported it to the GCC community:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98776
Currently arm64 and PA-RISC may enable fpatchable-function-entry option.
The kernel compiled with clang does not have this issue.
FIX:
This GCC issue only cause the registration failure of the kretprobe event
which doesn't need debuginfo. So, stop using debuginfo for retprobe.
map will be used to query the probe function address.
Signed-off-by: Jianlin Lv <Jianlin.Lv@arm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: clang-built-linux@googlegroups.com
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sumanth Korikkar <sumanthk@linux.ibm.com>
Link: http://lore.kernel.org/lkml/20210210062646.2377995-1-Jianlin.Lv@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Commit 15d83c4d7cef ("bpf: Allow loading of a bpf_iter program")
cached btf_id in struct bpf_iter_target_info so later on
if it can be checked cheaply compared to checking registered names.
syzbot found a bug that uninitialized value may occur to
bpf_iter_target_info->btf_id. This is because we allocated
bpf_iter_target_info structure with kmalloc and never initialized
field btf_id afterwards. This uninitialized btf_id is typically
compared to a u32 bpf program func proto btf_id, and the chance
of being equal is extremely slim.
This patch fixed the issue by using kzalloc which will also
prevent future likely instances due to adding new fields.
Fixes: 15d83c4d7cef ("bpf: Allow loading of a bpf_iter program")
Reported-by: syzbot+580f4f2a272e452d55cb@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210212005926.2875002-1-yhs@fb.com
|
|
BCM4908 uses fixed partitions layout but function of some partitions may
vary. Some devices use multiple firmware partitions and those partitions
should be marked to let system discover their purpose.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Single partition binding is quite common and may be:
1. Used by multiple parsers
2. Extended for more specific cases
Move it to separated file to avoid code duplication.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
The first time dso__load() was called on a PE file it always returned -1
error. This caused the first call to map__find_symbol() to always fail
on a PE file so the first sample from each PE file always had symbol
<unknown>. Subsequent samples succeed however because the DSO is already
loaded.
This fixes dso__load() to return 0 when successfully loading a DSO with
libbfd.
Fixes: eac9a4342e5447ca ("perf symbols: Try reading the symbol table with libbfd")
Signed-off-by: Nicholas Fraser <nfraser@codeweavers.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Huw Davies <huw@codeweavers.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kim Phillips <kim.phillips@amd.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Remi Bernon <rbernon@codeweavers.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Tommi Rantala <tommi.t.rantala@nokia.com>
Cc: Ulrich Czekalla <uczekalla@codeweavers.com>
Link: http://lore.kernel.org/lkml/1671b43b-09c3-1911-dbf8-7f030242fbf7@codeweavers.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
dso__load_bfd_symbols() attempts to load a DSO at its original path,
then closes it and loads the file in the debug cache. This is incorrect.
It should ignore the original file and work with only the debug cache.
The original file may have changed or may not even exist, for example if
the debug cache has been transferred to another machine via "perf
archive".
This fix makes it only load the file in the debug cache.
Further notes from Nicholas:
dso__load_bfd_symbols() is called in a loop from dso__load() for a variety
of paths. These are generated by the various DSO_BINARY_TYPEs in the
binary_type_symtab list at the top of util/symbol.c. In each case the
debugfile passed to dso__load_bfd_symbols() is the path to try.
One of those iterations (the first one I believe) passes the original path
as the debugfile. If the file still exists at the original path, this is
the one that ends up being used in case the debugcache was deleted or the
PE file doesn't have a build-id.
A later iteration (BUILD_ID_CACHE) passes debugfile as the file in the
debugcache if it has a build-id. Even if the file was previously loaded at
its original path, (if I understand correctly) this load will override it
so the debugcache file ends up being used.
Committer notes:
So if it fails to find in the cache, it will eventually hope for the
best and look at the path in the local filesystem, which in many cases
is enough.
At some point we need to switch from this "hope for the best" approach
to one that warns the user that there is no guarantee, if no buildid is
present, that just by looking at the pathname the symbolisation will
work.
Signed-off-by: Nicholas Fraser <nfraser@codeweavers.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Huw Davies <huw@codeweavers.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kim Phillips <kim.phillips@amd.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Remi Bernon <rbernon@codeweavers.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Tommi Rantala <tommi.t.rantala@nokia.com>
Cc: Ulrich Czekalla <uczekalla@codeweavers.com>
Link: http://lore.kernel.org/lkml/e58e1237-94ab-e1c9-a7b9-473531906954@codeweavers.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Huazhong Tan says:
====================
net: hns3: some cleanups for -next
To improve code readability and maintainability, the series
refactor out some bloated functions in the HNS3 ethernet driver.
change log:
V2: remove an unused variable in #5
previous version:
V1: https://patchwork.kernel.org/project/netdevbpf/cover/1612943005-59416-1-git-send-email-tanhuazhong@huawei.com/
====================
Acked-by: Jakub Kicinski <kuba@kernel.org>
|
|
hclge_rm_vport_all_mac_table() is bloated, so split it into
separate functions for readability and maintainability.
Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To make it more readable and maintainable, split
hclgevf_set_rss_tuple() into two parts.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To make it more readable and maintainable, split
hclge_set_rss_tuple() into two parts.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
hclgevf_cmd_send() is bloated, so split it into separate
functions for readability and maintainability.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
hclge_cmd_send() is bloated, so split it into separate
functions for readability and maintainability.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
hclge_dbg_dump_qos_buf_cfg() is bloated, so split it into
separate functions for readability and maintainability.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To improve code readability and maintainability, separate
the flow type parsing part and the converting part from
bloated hclgevf_get_rss_tuple().
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To improve code readability and maintainability, separate
the flow type parsing part and the converting part from
bloated hclge_get_rss_tuple().
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To improve code readability and maintainability, separate
the command handling part and the status parsing part from
bloated hclge_set_vf_vlan_common().
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use common ipv6_addr_any() to determine if an addr is ipv6 any addr.
Signed-off-by: Jiaran Zhang <zhangjiaran@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As more commands are added, hns3_dbg_cmd_write() is going to
get more bloated, so move the part about command check into
a separate function.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To improve code readability and maintainability, refactor
hclgevf_cmd_convert_err_code() with an array of imp_errcode
and common_errno mapping, instead of a bloated switch/case.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To improve code readability and maintainability, refactor
hclge_cmd_convert_err_code() with an array of imp_errcode
and common_errno mapping, instead of a bloated switch/case.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is what I see after compiling the kernel:
# bpf-next...bpf-next/master
?? tools/bpf/resolve_btfids/libbpf/
Fixes: fc6b48f692f8 ("tools/resolve_btfids: Build libbpf and libsubcmd in separate directories")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210212010053.668700-1-sdf@google.com
|
|
Add support for Intel's eASIC N5X platform. The clock manager driver for
the N5X is very similar to the Agilex platform, we can re-use most of
the Agilex clock driver.
This patch makes the necessary changes for the driver to differentiate
between the Agilex and the N5X platforms.
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Link: https://lore.kernel.org/r/20210212143059.478554-2-dinguyen@kernel.org
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
Document the Agilex clock bindings, and add the clock header file. The
clock header is an enumeration of all the different clocks on the eASIC
N5X platform.
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Link: https://lore.kernel.org/r/20210212143059.478554-1-dinguyen@kernel.org
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
By default, kernel threads have init_fs and init_files assigned. In the
past, this has triggered security problems, as commands that don't ask
for (and hence don't get assigned) fs/files from the originating task
can then attempt path resolution etc with access to parts of the system
they should not be able to.
Rather than add checks in the fs code for misuse, just set these to
NULL. If we do attempt to use them, then the resulting code will oops
rather than provide access to something that it should not permit.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Song Liu says:
====================
This set introduces bpf_iter for task_vma, which can be used to generate
information similar to /proc/pid/maps. Patch 4/4 adds an example that
mimics /proc/pid/maps.
Current /proc/<pid>/maps and /proc/<pid>/smaps provide information of
vma's of a process. However, these information are not flexible enough to
cover all use cases. For example, if a vma cover mixed 2MB pages and 4kB
pages (x86_64), there is no easy way to tell which address ranges are
backed by 2MB pages. task_vma solves the problem by enabling the user to
generate customize information based on the vma (and vma->vm_mm,
vma->vm_file, etc.).
Changes v6 => v7:
1. Let BPF iter program use bpf_d_path without specifying sleepable.
(Alexei)
Changes v5 => v6:
1. Add more comments for task_vma_seq_get_next() to explain the logic
of find_vma() calls. (Alexei)
2. Skip vma found by find_vma() when both vm_start and vm_end matches
prev_vm_[start|end]. Previous versions only compares vm_start.
IOW, if vma of [4k, 8k] is replaced by [4k, 12k] after relocking
mmap_lock, v5 will skip the new vma, while v6 will process it.
Changes v4 => v5:
1. Fix a refcount leak on task_struct. (Yonghong)
2. Fix the selftest. (Yonghong)
Changes v3 => v4:
1. Avoid skipping vma by assigning invalid prev_vm_start in
task_vma_seq_stop(). (Yonghong)
2. Move "again" label in task_vma_seq_get_next() save a check. (Yonghong)
Changes v2 => v3:
1. Rewrite 1/4 so that we hold mmap_lock while calling BPF program. This
enables the BPF program to access the real vma with BTF. (Alexei)
2. Fix the logic when the control is returned to user space. (Yonghong)
3. Revise commit log and cover letter. (Yonghong)
Changes v1 => v2:
1. Small fixes in task_iter.c and the selftests. (Yonghong)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The test dumps information similar to /proc/pid/maps. The first line of
the output is compared against the /proc file to make sure they match.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210212183107.50963-4-songliubraving@fb.com
|
|
task_file and task_vma iter programs have access to file->f_path. Enable
bpf_d_path to print paths of these file.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210212183107.50963-3-songliubraving@fb.com
|
|
Introduce task_vma bpf_iter to print memory information of a process. It
can be used to print customized information similar to /proc/<pid>/maps.
Current /proc/<pid>/maps and /proc/<pid>/smaps provide information of
vma's of a process. However, these information are not flexible enough to
cover all use cases. For example, if a vma cover mixed 2MB pages and 4kB
pages (x86_64), there is no easy way to tell which address ranges are
backed by 2MB pages. task_vma solves the problem by enabling the user to
generate customize information based on the vma (and vma->vm_mm,
vma->vm_file, etc.).
To access the vma safely in the BPF program, task_vma iterator holds
target mmap_lock while calling the BPF program. If the mmap_lock is
contended, task_vma unlocks mmap_lock between iterations to unblock the
writer(s). This lock contention avoidance mechanism is similar to the one
used in show_smaps_rollup().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com
|
|
KASAN reports a BUG when download file in jffs2 filesystem.It is
because when dstlen == 1, cpage_out will write array out of bounds.
Actually, data will not be compressed in jffs2_zlib_compress() if
data's length less than 4.
[ 393.799778] BUG: KASAN: slab-out-of-bounds in jffs2_rtime_compress+0x214/0x2f0 at addr ffff800062e3b281
[ 393.809166] Write of size 1 by task tftp/2918
[ 393.813526] CPU: 3 PID: 2918 Comm: tftp Tainted: G B 4.9.115-rt93-EMBSYS-CGEL-6.1.R6-dirty #1
[ 393.823173] Hardware name: LS1043A RDB Board (DT)
[ 393.827870] Call trace:
[ 393.830322] [<ffff20000808c700>] dump_backtrace+0x0/0x2f0
[ 393.835721] [<ffff20000808ca04>] show_stack+0x14/0x20
[ 393.840774] [<ffff2000086ef700>] dump_stack+0x90/0xb0
[ 393.845829] [<ffff20000827b19c>] kasan_object_err+0x24/0x80
[ 393.851402] [<ffff20000827b404>] kasan_report_error+0x1b4/0x4d8
[ 393.857323] [<ffff20000827bae8>] kasan_report+0x38/0x40
[ 393.862548] [<ffff200008279d44>] __asan_store1+0x4c/0x58
[ 393.867859] [<ffff2000084ce2ec>] jffs2_rtime_compress+0x214/0x2f0
[ 393.873955] [<ffff2000084bb3b0>] jffs2_selected_compress+0x178/0x2a0
[ 393.880308] [<ffff2000084bb530>] jffs2_compress+0x58/0x478
[ 393.885796] [<ffff2000084c5b34>] jffs2_write_inode_range+0x13c/0x450
[ 393.892150] [<ffff2000084be0b8>] jffs2_write_end+0x2a8/0x4a0
[ 393.897811] [<ffff2000081f3008>] generic_perform_write+0x1c0/0x280
[ 393.903990] [<ffff2000081f5074>] __generic_file_write_iter+0x1c4/0x228
[ 393.910517] [<ffff2000081f5210>] generic_file_write_iter+0x138/0x288
[ 393.916870] [<ffff20000829ec1c>] __vfs_write+0x1b4/0x238
[ 393.922181] [<ffff20000829ff00>] vfs_write+0xd0/0x238
[ 393.927232] [<ffff2000082a1ba8>] SyS_write+0xa0/0x110
[ 393.932283] [<ffff20000808429c>] __sys_trace_return+0x0/0x4
[ 393.937851] Object at ffff800062e3b280, in cache kmalloc-64 size: 64
[ 393.944197] Allocated:
[ 393.946552] PID = 2918
[ 393.948913] save_stack_trace_tsk+0x0/0x220
[ 393.953096] save_stack_trace+0x18/0x20
[ 393.956932] kasan_kmalloc+0xd8/0x188
[ 393.960594] __kmalloc+0x144/0x238
[ 393.963994] jffs2_selected_compress+0x48/0x2a0
[ 393.968524] jffs2_compress+0x58/0x478
[ 393.972273] jffs2_write_inode_range+0x13c/0x450
[ 393.976889] jffs2_write_end+0x2a8/0x4a0
[ 393.980810] generic_perform_write+0x1c0/0x280
[ 393.985251] __generic_file_write_iter+0x1c4/0x228
[ 393.990040] generic_file_write_iter+0x138/0x288
[ 393.994655] __vfs_write+0x1b4/0x238
[ 393.998228] vfs_write+0xd0/0x238
[ 394.001543] SyS_write+0xa0/0x110
[ 394.004856] __sys_trace_return+0x0/0x4
[ 394.008684] Freed:
[ 394.010691] PID = 2918
[ 394.013051] save_stack_trace_tsk+0x0/0x220
[ 394.017233] save_stack_trace+0x18/0x20
[ 394.021069] kasan_slab_free+0x88/0x188
[ 394.024902] kfree+0x6c/0x1d8
[ 394.027868] jffs2_sum_write_sumnode+0x2c4/0x880
[ 394.032486] jffs2_do_reserve_space+0x198/0x598
[ 394.037016] jffs2_reserve_space+0x3f8/0x4d8
[ 394.041286] jffs2_write_inode_range+0xf0/0x450
[ 394.045816] jffs2_write_end+0x2a8/0x4a0
[ 394.049737] generic_perform_write+0x1c0/0x280
[ 394.054179] __generic_file_write_iter+0x1c4/0x228
[ 394.058968] generic_file_write_iter+0x138/0x288
[ 394.063583] __vfs_write+0x1b4/0x238
[ 394.067157] vfs_write+0xd0/0x238
[ 394.070470] SyS_write+0xa0/0x110
[ 394.073783] __sys_trace_return+0x0/0x4
[ 394.077612] Memory state around the buggy address:
[ 394.082404] ffff800062e3b180: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[ 394.089623] ffff800062e3b200: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[ 394.096842] >ffff800062e3b280: 01 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 394.104056] ^
[ 394.107283] ffff800062e3b300: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 394.114502] ffff800062e3b380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 394.121718] ==================================================================
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
An inode is allowed to have ubifs_xattr_max_cnt() xattrs, so we must
complain only when an inode has more xattrs, having exactly
ubifs_xattr_max_cnt() xattrs is fine.
With this the maximum number of xattrs can be created without hitting
the "has too many xattrs" warning when removing it.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
An earlier commit moved out some functions to not be inlined by gcc, but
after some other rework to remove one of those, clang started inlining
the other one and ran into the same problem as gcc did before:
fs/ubifs/replay.c:1174:5: error: stack frame size of 1152 bytes in function 'ubifs_replay_journal' [-Werror,-Wframe-larger-than=]
Mark the function as noinline_for_stack to ensure it doesn't happen
again.
Fixes: f80df3851246 ("ubifs: use crypto_shash_tfm_digest()")
Fixes: eb66eff6636d ("ubifs: replay: Fix high stack usage")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
When crypto_shash_digestsize() fails, c->hmac_tfm
has not been freed before returning, which leads
to memleak.
Fixes: 49525e5eecca5 ("ubifs: Add helper functions for authentication support")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
clang static analysis reports this problem
fs/jffs2/summary.c:794:31: warning: Use of memory after it is freed
c->summary->sum_list_head = temp->u.next;
^~~~~~~~~~~~
In jffs2_sum_write_data(), in a loop summary data is handles a node at
a time. When it has written out the node it is removed the summary list,
and the node is deleted. In the corner case when a
JFFS2_FEATURE_RWCOMPAT_COPY is seen, a call is made to
jffs2_sum_disable_collecting(). jffs2_sum_disable_collecting() deletes
the whole list which conflicts with the loop's deleting the list by parts.
To preserve the old behavior of stopping the write midway, bail out of
the loop after disabling summary collection.
Fixes: 6171586a7ae5 ("[JFFS2] Correct handling of JFFS2_FEATURE_RWCOMPAT_COPY nodes.")
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
The parameter of kfree function is NULL, so kfree code is useless, delete it.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
data_size is already checked against zero when vol_type matches
UBI_VID_STATIC. Remove the following dead code.
Signed-off-by: Jubin Zhong <zhongjubin@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
This patch is to store operation type in packet structure.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: James Clark <james.clark@arm.com>
Tested-by: James Clark <james.clark@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Al Grant <al.grant@arm.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wei Li <liwei391@huawei.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: James Clark <james.clark@arm.com>
Link: https://lore.kernel.org/r/20210211133856.2137-3-james.clark@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
This patch is to store virtual and physical memory addresses in packet,
which will be used for memory samples.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: James Clark <james.clark@arm.com>
Tested-by: James Clark <james.clark@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Al Grant <al.grant@arm.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wei Li <liwei391@huawei.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210211133856.2137-2-james.clark@arm.com
Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
This will get the (no-op) definition of irq_canonicalize()
which some code might want. We could define that ourselves,
but it seems like we'd likely want generic extensions in
the future, if any.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
This may be needed for size_t if something doesn't get
it included elsewhere before including <asm/io.h>, so
add the include.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Add a pseudo RTC that simply is able to send an alarm signal
waking up the system at a given time in the future.
Since apparently timerfd_create() FDs don't support SIGIO, we
use the sigio-creating helper thread, which just learned to do
suspend/resume properly in the previous patch.
For time-travel mode, OTOH, just add an event at the specified
time in the future, and that's already sufficient to wake up
the system at that point in time since suspend will just be in
an "endless wait".
For s2idle support also call pm_system_wakeup().
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
This patch is to enable sample type PERF_SAMPLE_DATA_SRC for Arm SPE in
the perf data, when output the tracing data, it tells tools that it
contains data source in the memory event.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: James Clark <james.clark@arm.com>
Tested-by: James Clark <james.clark@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Al Grant <al.grant@arm.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wei Li <liwei391@huawei.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210211133856.2137-1-james.clark@arm.com
Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
This mostly reverts the old commit 3963333fe676 ("uml: cover stubs
with a VMA") which had added a VMA to the existing PTEs. However,
there's no real reason to have the PTEs in the first place and the
VMA cannot be 'fixed' in place, which leads to bugs that userspace
could try to unmap them and be forcefully killed, or such. Also,
there's a bit of an ugly hole in userspace's address space.
Simplify all this: just install the stub code/page at the top of
the (inner) address space, i.e. put it just above TASK_SIZE. The
pages are simply hard-coded to be mapped in the userspace process
we use to implement an mm context, and they're out of reach of the
inner mmap/munmap/mprotect etc. since they're above TASK_SIZE.
Getting rid of the VMA also makes vma_merge() no longer hit one of
the VM_WARN_ON()s there because we installed a VMA while the code
assumes the stack VMA is the first one.
It also removes a lockdep warning about mmap_sem usage since we no
longer have uml_setup_stubs() and thus no longer need to do any
manipulation that would require mmap_sem in activate_mm().
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Minor cleanup.
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20210211183914.4093187-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The userspace stacks mostly have a stack (and in the case of the
syscall stub we can just set their stack pointer) that points to
the location of the stub data page already.
Rework the stubs to use the stack pointer to derive the start of
the data page, rather than requiring it to be hard-coded.
In the clone stub, also integrate the int3 into the stack remap,
since we really must not use the stack while we remap it.
This prepares for putting the stub at a variable location that's
not part of the normal address space of the userspace processes
running inside the UML machine.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|