summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-12-16Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf 2021-12-16 We've added 15 non-merge commits during the last 7 day(s) which contain a total of 12 files changed, 434 insertions(+), 30 deletions(-). The main changes are: 1) Fix incorrect verifier state pruning behavior for <8B register spill/fill, from Paul Chaignon. 2) Fix x86-64 JIT's extable handling for fentry/fexit when return pointer is an ERR_PTR(), from Alexei Starovoitov. 3) Fix 3 different possibilities that BPF verifier missed where unprivileged could leak kernel addresses, from Daniel Borkmann. 4) Fix xsk's poll behavior under need_wakeup flag, from Magnus Karlsson. 5) Fix an oob-write in test_verifier due to a missed MAX_NR_MAPS bump, from Kumar Kartikeya Dwivedi. 6) Fix a race in test_btf_skc_cls_ingress selftest, from Martin KaFai Lau. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, selftests: Fix racing issue in btf_skc_cls_ingress test selftest/bpf: Add a test that reads various addresses. bpf: Fix extable address check. bpf: Fix extable fixup offset. bpf, selftests: Add test case trying to taint map value pointer bpf: Make 32->64 bounds propagation slightly more robust bpf: Fix signed bounds propagation after mov32 bpf, selftests: Update test case for atomic cmpxchg on r0 with pointer bpf: Fix kernel address leakage in atomic cmpxchg's r0 aux reg bpf, selftests: Add test case for atomic fetch on spilled pointer bpf: Fix kernel address leakage in atomic fetch selftests/bpf: Fix OOB write in test_verifier xsk: Do not sleep in poll() when need_wakeup set selftests/bpf: Tests for state pruning with u32 spill/fill bpf: Fix incorrect state pruning for <8B spill/fill ==================== Link: https://lore.kernel.org/r/20211216210005.13815-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-16bpf, selftests: Fix racing issue in btf_skc_cls_ingress testMartin KaFai Lau
The libbpf CI reported occasional failure in btf_skc_cls_ingress: test_syncookie:FAIL:Unexpected syncookie states gen_cookie:80326634 recv_cookie:0 bpf prog error at line 97 "error at line 97" means the bpf prog cannot find the listening socket when the final ack is received. It then skipped processing the syncookie in the final ack which then led to "recv_cookie:0". The problem is the userspace program did not do accept() and went ahead to close(listen_fd) before the kernel (and the bpf prog) had a chance to process the final ack. The fix is to add accept() call so that the userspace will wait for the kernel to finish processing the final ack first before close()-ing everything. Fixes: 9a856cae2217 ("bpf: selftest: Add test_btf_skc_cls_ingress") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211216191630.466151-1-kafai@fb.com
2021-12-16selftest/bpf: Add a test that reads various addresses.Alexei Starovoitov
Add a function to bpf_testmod that returns invalid kernel and user addresses. Then attach an fexit program to that function that tries to read memory through these addresses. This logic checks that bpf_probe_read_kernel and BPF_PROBE_MEM logic is sane. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2021-12-16bpf: Fix extable address check.Alexei Starovoitov
The verifier checks that PTR_TO_BTF_ID pointer is either valid or NULL, but it cannot distinguish IS_ERR pointer from valid one. When offset is added to IS_ERR pointer it may become small positive value which is a user address that is not handled by extable logic and has to be checked for at the runtime. Tighten BPF_PROBE_MEM pointer check code to prevent this case. Fixes: 4c5de127598e ("bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.") Reported-by: Lorenzo Fontana <lorenzo.fontana@elastic.co> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2021-12-16bpf: Fix extable fixup offset.Alexei Starovoitov
The prog - start_of_ldx is the offset before the faulting ldx to the location after it, so this will be used to adjust pt_regs->ip for jumping over it and continuing, and with old temp it would have been fixed up to the wrong offset, causing crash. Fixes: 4c5de127598e ("bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2021-12-16Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fix from Stephen Boyd: "A single fix for the clk framework that needed some more bake time in linux-next. The problem is that two clks being registered at the same time can lead to a busted clk tree if the parent isn't fully registered by the time the child finds the parent. We rejigger the place where we mark the parent as fully registered so that the child can't find the parent until things are proper" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: Don't parent clks until the parent is fully registered
2021-12-16bpf, selftests: Add test case trying to taint map value pointerDaniel Borkmann
Add a test case which tries to taint map value pointer arithmetic into a unknown scalar with subsequent export through the map. Before fix: # ./test_verifier 1186 #1186/u map access: trying to leak tained dst reg FAIL Unexpected success to load! verification time 24 usec stack depth 8 processed 15 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 1 #1186/p map access: trying to leak tained dst reg FAIL Unexpected success to load! verification time 8 usec stack depth 8 processed 15 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 1 Summary: 0 PASSED, 0 SKIPPED, 2 FAILED After fix: # ./test_verifier 1186 #1186/u map access: trying to leak tained dst reg OK #1186/p map access: trying to leak tained dst reg OK Summary: 2 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-12-16bpf: Make 32->64 bounds propagation slightly more robustDaniel Borkmann
Make the bounds propagation in __reg_assign_32_into_64() slightly more robust and readable by aligning it similarly as we did back in the __reg_combine_64_into_32() counterpart. Meaning, only propagate or pessimize them as a smin/smax pair. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-12-16bpf: Fix signed bounds propagation after mov32Daniel Borkmann
For the case where both s32_{min,max}_value bounds are positive, the __reg_assign_32_into_64() directly propagates them to their 64 bit counterparts, otherwise it pessimises them into [0,u32_max] universe and tries to refine them later on by learning through the tnum as per comment in mentioned function. However, that does not always happen, for example, in mov32 operation we call zext_32_to_64(dst_reg) which invokes the __reg_assign_32_into_64() as is without subsequent bounds update as elsewhere thus no refinement based on tnum takes place. Thus, not calling into the __update_reg_bounds() / __reg_deduce_bounds() / __reg_bound_offset() triplet as we do, for example, in case of ALU ops via adjust_scalar_min_max_vals(), will lead to more pessimistic bounds when dumping the full register state: Before fix: 0: (b4) w0 = -1 1: R0_w=invP4294967295 (id=0,imm=ffffffff, smin_value=4294967295,smax_value=4294967295, umin_value=4294967295,umax_value=4294967295, var_off=(0xffffffff; 0x0), s32_min_value=-1,s32_max_value=-1, u32_min_value=-1,u32_max_value=-1) 1: (bc) w0 = w0 2: R0_w=invP4294967295 (id=0,imm=ffffffff, smin_value=0,smax_value=4294967295, umin_value=4294967295,umax_value=4294967295, var_off=(0xffffffff; 0x0), s32_min_value=-1,s32_max_value=-1, u32_min_value=-1,u32_max_value=-1) Technically, the smin_value=0 and smax_value=4294967295 bounds are not incorrect, but given the register is still a constant, they break assumptions about const scalars that smin_value == smax_value and umin_value == umax_value. After fix: 0: (b4) w0 = -1 1: R0_w=invP4294967295 (id=0,imm=ffffffff, smin_value=4294967295,smax_value=4294967295, umin_value=4294967295,umax_value=4294967295, var_off=(0xffffffff; 0x0), s32_min_value=-1,s32_max_value=-1, u32_min_value=-1,u32_max_value=-1) 1: (bc) w0 = w0 2: R0_w=invP4294967295 (id=0,imm=ffffffff, smin_value=4294967295,smax_value=4294967295, umin_value=4294967295,umax_value=4294967295, var_off=(0xffffffff; 0x0), s32_min_value=-1,s32_max_value=-1, u32_min_value=-1,u32_max_value=-1) Without the smin_value == smax_value and umin_value == umax_value invariant being intact for const scalars, it is possible to leak out kernel pointers from unprivileged user space if the latter is enabled. For example, when such registers are involved in pointer arithmtics, then adjust_ptr_min_max_vals() will taint the destination register into an unknown scalar, and the latter can be exported and stored e.g. into a BPF map value. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Reported-by: Kuee K1r0a <liulin063@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-12-16Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fix from Catalin Marinas: "Fix missing error code on kexec failure path" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: kexec: Fix missing error code 'ret' warning in load_other_segments()
2021-12-16ksmbd: fix uninitialized symbol 'pntsd_size'Namjae Jeon
No check for if "rc" is an error code for build_sec_desc(). This can cause problems with using uninitialized pntsd_size. Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org # v5.15 Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-12-16ksmbd: fix error code in ndr_read_int32()Dan Carpenter
This is a failure path and it should return -EINVAL instead of success. Otherwise it could result in the caller using uninitialized memory. Fixes: 303fff2b8c77 ("ksmbd: add validation for ndr read/write functions") Cc: stable@vger.kernel.org # v5.15 Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-12-16Merge tag 'for-5.16/dm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix use after free in DM btree remove's rebalance_children() - Fix DM integrity data corruption, introduced during 5.16 merge, due to improper use of bvec_kmap_local() * tag 'for-5.16/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm integrity: fix data corruption due to improper use of bvec_kmap_local dm btree remove: fix use after free in rebalance_children()
2021-12-16arm64: kexec: Fix missing error code 'ret' warning in load_other_segments()Lakshmi Ramasubramanian
Since commit ac10be5cdbfa ("arm64: Use common of_kexec_alloc_and_setup_fdt()"), smatch reports the following warning: arch/arm64/kernel/machine_kexec_file.c:152 load_other_segments() warn: missing error code 'ret' Return code is not set to an error code in load_other_segments() when of_kexec_alloc_and_setup_fdt() call returns a NULL dtb. This results in status success (return code set to 0) being returned from load_other_segments(). Set return code to -EINVAL if of_kexec_alloc_and_setup_fdt() returns NULL dtb. Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: ac10be5cdbfa ("arm64: Use common of_kexec_alloc_and_setup_fdt()") Link: https://lore.kernel.org/r/20211210010121.101823-1-nramas@linux.microsoft.com Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-12-16afs: Fix mmapDavid Howells
Fix afs_add_open_map() to check that the vnode isn't already on the list when it adds it. It's possible that afs_drop_open_mmap() decremented the cb_nr_mmap counter, but hadn't yet got into the locked section to remove it. Also vnode->cb_mmap_link should be initialised, so fix that too. Fixes: 6e0e99d58a65 ("afs: Fix mmap coherency vs 3rd-party changes") Reported-by: kafs-testing+fedora34_64checkkafs-build-300@auristor.com Suggested-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: kafs-testing+fedora34_64checkkafs-build-300@auristor.com cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/686465.1639435380@warthog.procyon.org.uk/ # v1 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-16KVM: arm64: Rework kvm_pgtable initialisationMarc Zyngier
Ganapatrao reported that the kvm_pgtable->mmu pointer is more or less hardcoded to the main S2 mmu structure, while the nested code needs it to point to other instances (as we have one instance per nested context). Rework the initialisation of the kvm_pgtable structure so that this assumtion doesn't hold true anymore. This requires some minor changes to the order in which things are initialised (the mmu->arch pointer being the critical one). Reported-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211129200150.351436-5-maz@kernel.org
2021-12-16sit: do not call ipip6_dev_free() from sit_init_net()Eric Dumazet
ipip6_dev_free is sit dev->priv_destructor, already called by register_netdevice() if something goes wrong. Alternative would be to make ipip6_dev_free() robust against multiple invocations, but other drivers do not implement this strategy. syzbot reported: dst_release underflow WARNING: CPU: 0 PID: 5059 at net/core/dst.c:173 dst_release+0xd8/0xe0 net/core/dst.c:173 Modules linked in: CPU: 1 PID: 5059 Comm: syz-executor.4 Not tainted 5.16.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:dst_release+0xd8/0xe0 net/core/dst.c:173 Code: 4c 89 f2 89 d9 31 c0 5b 41 5e 5d e9 da d5 44 f9 e8 1d 90 5f f9 c6 05 87 48 c6 05 01 48 c7 c7 80 44 99 8b 31 c0 e8 e8 67 29 f9 <0f> 0b eb 85 0f 1f 40 00 53 48 89 fb e8 f7 8f 5f f9 48 83 c3 a8 48 RSP: 0018:ffffc9000aa5faa0 EFLAGS: 00010246 RAX: d6894a925dd15a00 RBX: 00000000ffffffff RCX: 0000000000040000 RDX: ffffc90005e19000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: 0000000000000000 R08: ffffffff816a1f42 R09: ffffed1017344f2c R10: ffffed1017344f2c R11: 0000000000000000 R12: 0000607f462b1358 R13: 1ffffffff1bfd305 R14: ffffe8ffffcb1358 R15: dffffc0000000000 FS: 00007f66c71a2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f88aaed5058 CR3: 0000000023e0f000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> dst_cache_destroy+0x107/0x1e0 net/core/dst_cache.c:160 ipip6_dev_free net/ipv6/sit.c:1414 [inline] sit_init_net+0x229/0x550 net/ipv6/sit.c:1936 ops_init+0x313/0x430 net/core/net_namespace.c:140 setup_net+0x35b/0x9d0 net/core/net_namespace.c:326 copy_net_ns+0x359/0x5c0 net/core/net_namespace.c:470 create_new_namespaces+0x4ce/0xa00 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0x11e/0x180 kernel/nsproxy.c:226 ksys_unshare+0x57d/0xb50 kernel/fork.c:3075 __do_sys_unshare kernel/fork.c:3146 [inline] __se_sys_unshare kernel/fork.c:3144 [inline] __x64_sys_unshare+0x34/0x40 kernel/fork.c:3144 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f66c882ce99 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f66c71a2168 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007f66c893ff60 RCX: 00007f66c882ce99 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000048040200 RBP: 00007f66c8886ff1 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fff6634832f R14: 00007f66c71a2300 R15: 0000000000022000 </TASK> Fixes: cf124db566e6 ("net: Fix inconsistent teardown and release of private netdev state.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211216111741.1387540-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-16net: systemport: Add global locking for descriptor lifecycleFlorian Fainelli
The descriptor list is a shared resource across all of the transmit queues, and the locking mechanism used today only protects concurrency across a given transmit queue between the transmit and reclaiming. This creates an opportunity for the SYSTEMPORT hardware to work on corrupted descriptors if we have multiple producers at once which is the case when using multiple transmit queues. This was particularly noticeable when using multiple flows/transmit queues and it showed up in interesting ways in that UDP packets would get a correct UDP header checksum being calculated over an incorrect packet length. Similarly TCP packets would get an equally correct checksum computed by the hardware over an incorrect packet length. The SYSTEMPORT hardware maintains an internal descriptor list that it re-arranges when the driver produces a new descriptor anytime it writes to the WRITE_PORT_{HI,LO} registers, there is however some delay in the hardware to re-organize its descriptors and it is possible that concurrent TX queues eventually break this internal allocation scheme to the point where the length/status part of the descriptor gets used for an incorrect data buffer. The fix is to impose a global serialization for all TX queues in the short section where we are writing to the WRITE_PORT_{HI,LO} registers which solves the corruption even with multiple concurrent TX queues being used. Fixes: 80105befdb4b ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20211215202450.4086240-1-f.fainelli@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-16net/smc: Prevent smc_release() from long blockingD. Wythe
In nginx/wrk benchmark, there's a hung problem with high probability on case likes that: (client will last several minutes to exit) server: smc_run nginx client: smc_run wrk -c 10000 -t 1 http://server Client hangs with the following backtrace: 0 [ffffa7ce8Of3bbf8] __schedule at ffffffff9f9eOd5f 1 [ffffa7ce8Of3bc88] schedule at ffffffff9f9eløe6 2 [ffffa7ce8Of3bcaO] schedule_timeout at ffffffff9f9e3f3c 3 [ffffa7ce8Of3bd2O] wait_for_common at ffffffff9f9el9de 4 [ffffa7ce8Of3bd8O] __flush_work at ffffffff9fOfeOl3 5 [ffffa7ce8øf3bdfO] smc_release at ffffffffcO697d24 [smc] 6 [ffffa7ce8Of3be2O] __sock_release at ffffffff9f8O2e2d 7 [ffffa7ce8Of3be4ø] sock_close at ffffffff9f8ø2ebl 8 [ffffa7ce8øf3be48] __fput at ffffffff9f334f93 9 [ffffa7ce8Of3be78] task_work_run at ffffffff9flOlff5 10 [ffffa7ce8Of3beaO] do_exit at ffffffff9fOe5Ol2 11 [ffffa7ce8Of3bflO] do_group_exit at ffffffff9fOe592a 12 [ffffa7ce8Of3bf38] __x64_sys_exit_group at ffffffff9fOe5994 13 [ffffa7ce8Of3bf4O] do_syscall_64 at ffffffff9f9d4373 14 [ffffa7ce8Of3bfsO] entry_SYSCALL_64_after_hwframe at ffffffff9fa0007c This issue dues to flush_work(), which is used to wait for smc_connect_work() to finish in smc_release(). Once lots of smc_connect_work() was pending or all executing work dangling, smc_release() has to block until one worker comes to free, which is equivalent to wait another smc_connnect_work() to finish. In order to fix this, There are two changes: 1. For those idle smc_connect_work(), cancel it from the workqueue; for executing smc_connect_work(), waiting for it to finish. For that purpose, replace flush_work() with cancel_work_sync(). 2. Since smc_connect() hold a reference for passive closing, if smc_connect_work() has been cancelled, release the reference. Fixes: 24ac3a08e658 ("net/smc: rebuild nonblocking connect") Reported-by: Tony Lu <tonylu@linux.alibaba.com> Tested-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Link: https://lore.kernel.org/r/1639571361-101128-1-git-send-email-alibuda@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-16optee: Suppress false positive kmemleak report in optee_handle_rpc()Xiaolei Wang
We observed the following kmemleak report: unreferenced object 0xffff000007904500 (size 128): comm "swapper/0", pid 1, jiffies 4294892671 (age 44.036s) hex dump (first 32 bytes): 00 47 90 07 00 00 ff ff 60 00 c0 ff 00 00 00 00 .G......`....... 60 00 80 13 00 80 ff ff a0 00 00 00 00 00 00 00 `............... backtrace: [<000000004c12b1c7>] kmem_cache_alloc+0x1ac/0x2f4 [<000000005d23eb4f>] tee_shm_alloc+0x78/0x230 [<00000000794dd22c>] optee_handle_rpc+0x60/0x6f0 [<00000000d9f7c52d>] optee_do_call_with_arg+0x17c/0x1dc [<00000000c35884da>] optee_open_session+0x128/0x1ec [<000000001748f2ff>] tee_client_open_session+0x28/0x40 [<00000000aecb5389>] optee_enumerate_devices+0x84/0x2a0 [<000000003df18bf1>] optee_probe+0x674/0x6cc [<000000003a4a534a>] platform_drv_probe+0x54/0xb0 [<000000000c51ce7d>] really_probe+0xe4/0x4d0 [<000000002f04c865>] driver_probe_device+0x58/0xc0 [<00000000b485397d>] device_driver_attach+0xc0/0xd0 [<00000000c835f0df>] __driver_attach+0x84/0x124 [<000000008e5a429c>] bus_for_each_dev+0x70/0xc0 [<000000001735e8a8>] driver_attach+0x24/0x30 [<000000006d94b04f>] bus_add_driver+0x104/0x1ec This is not a memory leak because we pass the share memory pointer to secure world and would get it from secure world before releasing it. Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org>
2021-12-16tee: optee: Fix incorrect page free bugSumit Garg
Pointer to the allocated pages (struct page *page) has already progressed towards the end of allocation. It is incorrect to perform __free_pages(page, order) using this pointer as we would free any arbitrary pages. Fix this by stop modifying the page pointer. Fixes: ec185dd3ab25 ("optee: Fix memory leak when failing to register shm pages") Cc: stable@vger.kernel.org Reported-by: Patrik Lantz <patrik.lantz@axis.com> Signed-off-by: Sumit Garg <sumit.garg@linaro.org> Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com> Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org>
2021-12-16Merge tag 'tegra-for-5.16-soc-fixes' of ↵Arnd Bergmann
git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux into arm/fixes soc/tegra: Fixes for v5.16-rc6 This contains a single build fix without which ARM allmodconfig builds are broken if -Werror is enabled. * tag 'tegra-for-5.16-soc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux: soc/tegra: fuse: Fix bitwise vs. logical OR warning Link: https://lore.kernel.org/r/20211215162618.3568474-1-thierry.reding@gmail.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-12-16netfilter: ctnetlink: remove expired entries firstFlorian Westphal
When dumping conntrack table to userspace via ctnetlink, check if the ct has already expired before doing any of the 'skip' checks. This expires dead entries faster. /proc handler also removes outdated entries first. Reported-by: Vitaly Zuevsky <vzuevsky@ns1.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-16Merge branch kvm-arm64/pkvm-hyp-sharing into kvmarm-master/nextMarc Zyngier
* kvm-arm64/pkvm-hyp-sharing: : . : Series from Quentin Perret, implementing HYP page share/unshare: : : This series implements an unshare hypercall at EL2 in nVHE : protected mode, and makes use of it to unmmap guest-specific : data-structures from EL2 stage-1 during guest tear-down. : Crucially, the implementation of the share and unshare : routines use page refcounts in the host kernel to avoid : accidentally unmapping data-structures that overlap a common : page. : [...] : . KVM: arm64: pkvm: Unshare guest structs during teardown KVM: arm64: Expose unshare hypercall to the host KVM: arm64: Implement do_unshare() helper for unsharing memory KVM: arm64: Implement __pkvm_host_share_hyp() using do_share() KVM: arm64: Implement do_share() helper for sharing memory KVM: arm64: Introduce wrappers for host and hyp spin lock accessors KVM: arm64: Extend pkvm_page_state enumeration to handle absent pages KVM: arm64: pkvm: Refcount the pages shared with EL2 KVM: arm64: Introduce kvm_share_hyp() KVM: arm64: Implement kvm_pgtable_hyp_unmap() at EL2 KVM: arm64: Hook up ->page_count() for hypervisor stage-1 page-table KVM: arm64: Fixup hyp stage-1 refcount KVM: arm64: Refcount hyp stage-1 pgtable pages KVM: arm64: Provide {get,put}_page() stubs for early hyp allocator Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-12-16KVM: arm64: pkvm: Unshare guest structs during teardownQuentin Perret
Make use of the newly introduced unshare hypercall during guest teardown to unmap guest-related data structures from the hyp stage-1. Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-15-qperret@google.com
2021-12-16KVM: arm64: Expose unshare hypercall to the hostWill Deacon
Introduce an unshare hypercall which can be used to unmap memory from the hypervisor stage-1 in nVHE protected mode. This will be useful to update the EL2 ownership state of pages during guest teardown, and avoids keeping dangling mappings to unreferenced portions of memory. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-14-qperret@google.com
2021-12-16KVM: arm64: Implement do_unshare() helper for unsharing memoryWill Deacon
Tearing down a previously shared memory region results in the borrower losing access to the underlying pages and returning them to the "owned" state in the owner. Implement a do_unshare() helper, along the same lines as do_share(), to provide this functionality for the host-to-hyp case. Reviewed-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-13-qperret@google.com
2021-12-16KVM: arm64: Implement __pkvm_host_share_hyp() using do_share()Will Deacon
__pkvm_host_share_hyp() shares memory between the host and the hypervisor so implement it as an invocation of the new do_share() mechanism. Note that double-sharing is no longer permitted (as this allows us to reduce the number of page-table walks significantly), but is thankfully no longer relied upon by the host. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-12-qperret@google.com
2021-12-16KVM: arm64: Implement do_share() helper for sharing memoryWill Deacon
By default, protected KVM isolates memory pages so that they are accessible only to their owner: be it the host kernel, the hypervisor at EL2 or (in future) the guest. Establishing shared-memory regions between these components therefore involves a transition for each page so that the owner can share memory with a borrower under a certain set of permissions. Introduce a do_share() helper for safely sharing a memory region between two components. Currently, only host-to-hyp sharing is implemented, but the code is easily extended to handle other combinations and the permission checks for each component are reusable. Reviewed-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-11-qperret@google.com
2021-12-16KVM: arm64: Introduce wrappers for host and hyp spin lock accessorsWill Deacon
In preparation for adding additional locked sections for manipulating page-tables at EL2, introduce some simple wrappers around the host and hypervisor locks so that it's a bit easier to read and bit more difficult to take the wrong lock (or even take them in the wrong order). Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-10-qperret@google.com
2021-12-16KVM: arm64: Extend pkvm_page_state enumeration to handle absent pagesWill Deacon
Explicitly name the combination of SW0 | SW1 as reserved in the pte and introduce a new PKVM_NOPAGE meta-state which, although not directly stored in the software bits of the pte, can be used to represent an entry for which there is no underlying page. This is distinct from an invalid pte, as stage-2 identity mappings for the host are created lazily and so an invalid pte there is the same as a valid mapping for the purposes of ownership information. This state will be used for permission checking during page transitions in later patches. Reviewed-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-9-qperret@google.com
2021-12-16KVM: arm64: pkvm: Refcount the pages shared with EL2Quentin Perret
In order to simplify the page tracking infrastructure at EL2 in nVHE protected mode, move the responsibility of refcounting pages that are shared multiple times on the host. In order to do so, let's create a red-black tree tracking all the PFNs that have been shared, along with a refcount. Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-8-qperret@google.com
2021-12-16KVM: arm64: Introduce kvm_share_hyp()Quentin Perret
The create_hyp_mappings() function can currently be called at any point in time. However, its behaviour in protected mode changes widely depending on when it is being called. Prior to KVM init, it is used to create the temporary page-table used to bring-up the hypervisor, and later on it is transparently turned into a 'share' hypercall when the kernel has lost control over the hypervisor stage-1. In order to prepare the ground for also unsharing pages with the hypervisor during guest teardown, introduce a kvm_share_hyp() function to make it clear in which places a share hypercall should be expected, as we will soon need a matching unshare hypercall in all those places. Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-7-qperret@google.com
2021-12-16KVM: arm64: Implement kvm_pgtable_hyp_unmap() at EL2Will Deacon
Implement kvm_pgtable_hyp_unmap() which can be used to remove hypervisor stage-1 mappings at EL2. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-6-qperret@google.com
2021-12-16KVM: arm64: Hook up ->page_count() for hypervisor stage-1 page-tableWill Deacon
kvm_pgtable_hyp_unmap() relies on the ->page_count() function callback being provided by the memory-management operations for the page-table. Wire up this callback for the hypervisor stage-1 page-table. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-5-qperret@google.com
2021-12-16KVM: arm64: Fixup hyp stage-1 refcountQuentin Perret
In nVHE-protected mode, the hyp stage-1 page-table refcount is broken due to the lack of refcount support in the early allocator. Fix-up the refcount in the finalize walker, once the 'hyp_vmemmap' is up and running. Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-4-qperret@google.com
2021-12-16KVM: arm64: Refcount hyp stage-1 pgtable pagesQuentin Perret
To prepare the ground for allowing hyp stage-1 mappings to be removed at run-time, update the KVM page-table code to maintain a correct refcount using the ->{get,put}_page() function callbacks. Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-3-qperret@google.com
2021-12-16KVM: arm64: Provide {get,put}_page() stubs for early hyp allocatorQuentin Perret
In nVHE protected mode, the EL2 code uses a temporary allocator during boot while re-creating its stage-1 page-table. Unfortunately, the hyp_vmmemap is not ready to use at this stage, so refcounting pages is not possible. That is not currently a problem because hyp stage-1 mappings are never removed, which implies refcounting of page-table pages is unnecessary. In preparation for allowing hypervisor stage-1 mappings to be removed, provide stub implementations for {get,put}_page() in the early allocator. Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211215161232.1480836-2-qperret@google.com
2021-12-16Merge branch kvm-arm64/vgic-fixes-5.17 into kvmarm-master/nextMarc Zyngier
* kvm-arm64/vgic-fixes-5.17: : . : A few vgic fixes: : - Harden vgic-v3 error handling paths against signed vs unsigned : comparison that will happen once the xarray-based vcpus are in : - Demote userspace-triggered console output to kvm_debug() : . KVM: arm64: vgic: Demote userspace-triggered console prints to kvm_debug() KVM: arm64: vgic-v3: Fix vcpu index comparison Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-12-16net: Fix double 0x prefix print in SKB dumpGal Pressman
When printing netdev features %pNF already takes care of the 0x prefix, remove the explicit one. Fixes: 6413139dfc64 ("skbuff: increase verbosity when dumping skb data") Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16virtio_net: fix rx_drops stat for small pktsWenliang Wang
We found the stat of rx drops for small pkts does not increment when build_skb fail, it's not coherent with other mode's rx drops stat. Signed-off-by: Wenliang Wang <wangwenliang.1995@bytedance.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16dsa: mv88e6xxx: fix debug print for SPEED_UNFORCEDAndrey Eremeev
Debug print uses invalid check to detect if speed is unforced: (speed != SPEED_UNFORCED) should be used instead of (!speed). Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Andrey Eremeev <Axtone4all@yandex.ru> Fixes: 96a2b40c7bd3 ("net: dsa: mv88e6xxx: add port's MAC speed setter") Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16sfc_ef100: potential dereference of null pointerJiasheng Jiang
The return value of kmalloc() needs to be checked. To avoid use in efx_nic_update_stats() in case of the failure of alloc. Fixes: b593b6f1b492 ("sfc_ef100: statistics gathering") Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16KVM: arm64: vgic: Demote userspace-triggered console prints to kvm_debug()Marc Zyngier
Running the KVM selftests results in these messages being dumped in the kernel console: [ 188.051073] kvm [469]: VGIC redist and dist frames overlap [ 188.056820] kvm [469]: VGIC redist and dist frames overlap [ 188.076199] kvm [469]: VGIC redist and dist frames overlap Being amle to trigger this from userspace is definitely not on, so demote these warnings to kvm_debug(). Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211216104507.1482017-1-maz@kernel.org
2021-12-16net: stmmac: dwmac-rk: fix oob read in rk_gmac_setupJohn Keeping
KASAN reports an out-of-bounds read in rk_gmac_setup on the line: while (ops->regs[i]) { This happens for most platforms since the regs flexible array member is empty, so the memory after the ops structure is being read here. It seems that mostly this happens to contain zero anyway, so we get lucky and everything still works. To avoid adding redundant data to nearly all the ops structures, add a new flag to indicate whether the regs field is valid and avoid this loop when it is not. Fixes: 3bb3d6b1c195 ("net: stmmac: Add RK3566/RK3568 SoC support") Signed-off-by: John Keeping <john@metanate.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16KVM: arm64: vgic-v3: Fix vcpu index comparisonMarc Zyngier
When handling an error at the point where we try and register all the redistributors, we unregister all the previously registered frames by counting down from the failing index. However, the way the code is written relies on that index being a signed value. Which won't be true once we switch to an xarray-based vcpu set. Since this code is pretty awkward the first place, and that the failure mode is hard to spot, rewrite this loop to iterate over the vcpus upwards rather than downwards. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211216104526.1482124-1-maz@kernel.org
2021-12-16Merge branch '1GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2021-12-15 This series contains updates to igb, igbvf, igc and ixgbe drivers. Karen moves checks for invalid VF MAC filters to occur earlier for igb. Letu Ren fixes a double free issue in igbvf probe. Sasha fixes incorrect min value being used when calculating for max for igc. Robert Schlabbach adds documentation on enabling NBASE-T support for ixgbe. Cyril Novikov adds missing initialization of MDIO bus speed for ixgbe. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16arm64: dts: lx2160a: fix scl-gpios property nameZhang Ying-22455
Fix the typo in the property name. Fixes: d548c217c6a3c ("arm64: dts: add QorIQ LX2160A SoC support") Signed-off-by: Zhang Ying <ying.zhang22455@nxp.com> Signed-off-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-12-16tee: handle lookup of shm with reference count 0Jens Wiklander
Since the tee subsystem does not keep a strong reference to its idle shared memory buffers, it races with other threads that try to destroy a shared memory through a close of its dma-buf fd or by unmapping the memory. In tee_shm_get_from_id() when a lookup in teedev->idr has been successful, it is possible that the tee_shm is in the dma-buf teardown path, but that path is blocked by the teedev mutex. Since we don't have an API to tell if the tee_shm is in the dma-buf teardown path or not we must find another way of detecting this condition. Fix this by doing the reference counting directly on the tee_shm using a new refcount_t refcount field. dma-buf is replaced by using anon_inode_getfd() instead, this separates the life-cycle of the underlying file from the tee_shm. tee_shm_put() is updated to hold the mutex when decreasing the refcount to 0 and then remove the tee_shm from teedev->idr before releasing the mutex. This means that the tee_shm can never be found unless it has a refcount larger than 0. Fixes: 967c9cca2cc5 ("tee: generic TEE subsystem") Cc: stable@vger.kernel.org Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Lars Persson <larper@axis.com> Reviewed-by: Sumit Garg <sumit.garg@linaro.org> Reported-by: Patrik Lantz <patrik.lantz@axis.com> Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org>
2021-12-16xen/netback: don't queue unlimited number of packagesJuergen Gross
In case a guest isn't consuming incoming network traffic as fast as it is coming in, xen-netback is buffering network packages in unlimited numbers today. This can result in host OOM situations. Commit f48da8b14d04ca8 ("xen-netback: fix unlimited guest Rx internal queue and carrier flapping") meant to introduce a mechanism to limit the amount of buffered data by stopping the Tx queue when reaching the data limit, but this doesn't work for cases like UDP. When hitting the limit don't queue further SKBs, but drop them instead. In order to be able to tell Rx packages have been dropped increment the rx_dropped statistics counter in this case. It should be noted that the old solution to continue queueing SKBs had the additional problem of an overflow of the 32-bit rx_queue_len value would result in intermittent Tx queue enabling. This is part of XSA-392 Fixes: f48da8b14d04ca8 ("xen-netback: fix unlimited guest Rx internal queue and carrier flapping") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>