summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-11-15mm,thp: recheck each page before collapsing file THPSong Liu
In collapse_file(), for !is_shmem case, current check cannot guarantee the locked page is up-to-date. Specifically, xas_unlock_irq() should not be called before lock_page() and get_page(); and it is necessary to recheck PageUptodate() after locking the page. With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text may contain corrupted data. This is because khugepaged mistakenly collapses some not up-to-date sub pages into a huge page, and assumes the huge page is up-to-date. This will NOT corrupt data in the disk, because the page is read-only and never written back. Fix this by properly checking PageUptodate() after locking the page. This check replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);". Also, move PageDirty() check after locking the page. Current khugepaged should not try to collapse dirty file THP, because it is limited to read-only .text. The only case we hit a dirty page here is when the page hasn't been written since write. Bail out and retry when this happens. syzbot reported bug on previous version of this patch. Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Song Liu <songliubraving@fb.com> Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15mm: slub: really fix slab walking for init_on_freeLaura Abbott
Commit 1b7e816fc80e ("mm: slub: Fix slab walking for init_on_free") fixed one problem with the slab walking but missed a key detail: When walking the list, the head and tail pointers need to be updated since we end up reversing the list as a result. Without doing this, bulk free is broken. One way this is exposed is a NULL pointer with slub_debug=F: ============================================================================= BUG skbuff_head_cache (Tainted: G T): Object already free ----------------------------------------------------------------------------- INFO: Slab 0x000000000d2d2f8f objects=16 used=3 fp=0x0000000064309071 flags=0x3fff00000000201 BUG: kernel NULL pointer dereference, address: 0000000000000000 Oops: 0000 [#1] PREEMPT SMP PTI RIP: 0010:print_trailer+0x70/0x1d5 Call Trace: <IRQ> free_debug_processing.cold.37+0xc9/0x149 __slab_free+0x22a/0x3d0 kmem_cache_free_bulk+0x415/0x420 __kfree_skb_flush+0x30/0x40 net_rx_action+0x2dd/0x480 __do_softirq+0xf0/0x246 irq_exit+0x93/0xb0 do_IRQ+0xa0/0x110 common_interrupt+0xf/0xf </IRQ> Given we're now almost identical to the existing debugging code which correctly walks the list, combine with that. Link: https://lkml.kernel.org/r/20191104170303.GA50361@gandi.net Link: http://lkml.kernel.org/r/20191106222208.26815-1-labbott@redhat.com Fixes: 1b7e816fc80e ("mm: slub: Fix slab walking for init_on_free") Signed-off-by: Laura Abbott <labbott@redhat.com> Reported-by: Thibaut Sautereau <thibaut.sautereau@clip-os.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Alexander Potapenko <glider@google.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <clipos@ssi.gouv.fr> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup()Roman Gushchin
An exiting task might belong to an offline cgroup. In this case an attempt to grab a cgroup reference from the task can end up with an infinite loop in hugetlb_cgroup_charge_cgroup(), because neither the cgroup will become online, neither the task will be migrated to a live cgroup. Fix this by switching over to css_tryget(). As css_tryget_online() can't guarantee that the cgroup won't go offline, in most cases the check doesn't make sense. In this particular case users of hugetlb_cgroup_charge_cgroup() are not affected by this change. A similar problem is described by commit 18fa84a2db0e ("cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()"). Link: http://lkml.kernel.org/r/20191106225131.3543616-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()Roman Gushchin
We've encountered a rcu stall in get_mem_cgroup_from_mm(): rcu: INFO: rcu_sched self-detected stall on CPU rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017 (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33 <...> RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90 <...> __memcg_kmem_charge+0x55/0x140 __alloc_pages_nodemask+0x267/0x320 pipe_write+0x1ad/0x400 new_sync_write+0x127/0x1c0 __kernel_write+0x4f/0xf0 dump_emit+0x91/0xc0 writenote+0xa0/0xc0 elf_core_dump+0x11af/0x1430 do_coredump+0xc65/0xee0 get_signal+0x132/0x7c0 do_signal+0x36/0x640 exit_to_usermode_loop+0x61/0xd0 do_syscall_64+0xd4/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The problem is caused by an exiting task which is associated with an offline memcg. We're iterating over and over in the do {} while (!css_tryget_online()) loop, but obviously the memcg won't become online and the exiting task won't be migrated to a live memcg. Let's fix it by switching from css_tryget_online() to css_tryget(). As css_tryget_online() cannot guarantee that the memcg won't go offline, the check is usually useless, except some rare cases when for example it determines if something should be presented to a user. A similar problem is described by commit 18fa84a2db0e ("cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()"). Johannes: : The bug aside, it doesn't matter whether the cgroup is online for the : callers. It used to matter when offlining needed to evacuate all charges : from the memcg, and so needed to prevent new ones from showing up, but we : don't care now. Link: http://lkml.kernel.org/r/20191106225131.3543616-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Shakeel Butt <shakeeb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15lib/xz: fix XZ_DYNALLOC to avoid useless memory reallocationsLasse Collin
s->dict.allocated was initialized to 0 but never set after a successful allocation, thus the code always thought that the dictionary buffer has to be reallocated. Link: http://lkml.kernel.org/r/20191104185107.3b6330df@tukaani.org Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Reported-by: Yu Sun <yusun2@cisco.com> Acked-by: Daniel Walker <danielwa@cisco.com> Cc: "Yixia Si (yisi)" <yisi@cisco.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15mm: fix trying to reclaim unevictable lru page when calling madvise_pageoutzhong jiang
Recently, I hit the following issue when running upstream. kernel BUG at mm/vmscan.c:1521! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1 RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521 Call Trace: reclaim_pages+0x499/0x800 mm/vmscan.c:2188 madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453 walk_pmd_range mm/pagewalk.c:53 [inline] walk_pud_range mm/pagewalk.c:112 [inline] walk_p4d_range mm/pagewalk.c:139 [inline] walk_pgd_range mm/pagewalk.c:166 [inline] __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261 walk_page_range+0x179/0x310 mm/pagewalk.c:349 madvise_pageout_page_range mm/madvise.c:506 [inline] madvise_pageout+0x1f0/0x330 mm/madvise.c:542 madvise_vma mm/madvise.c:931 [inline] __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113 do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe madvise_pageout() accesses the specified range of the vma and isolates them, then runs shrink_page_list() to reclaim its memory. But it also isolates the unevictable pages to reclaim. Hence, we can catch the cases in shrink_page_list(). The root cause is that we scan the page tables instead of specific LRU list. and so we need to filter out the unevictable lru pages from our end. Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT") Signed-off-by: zhong jiang <zhongjiang@huawei.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15mm: mempolicy: fix the wrong return value and potential pages leak of mbindYang Shi
Commit d883544515aa ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") fixed the return value of mbind() for a couple of corner cases. But, it altered the errno for some other cases, for example, mbind() should return -EFAULT when part or all of the memory range specified by nodemask and maxnode points outside your accessible address space, or there was an unmapped hole in the specified memory range specified by addr and len. Fix this by preserving the errno returned by queue_pages_range(). And, the pagelist may be not empty even though queue_pages_range() returns error, put the pages back to LRU since mbind_range() is not called to really apply the policy so those pages should not be migrated, this is also the old behavior before the problematic commit. Link: http://lkml.kernel.org/r/1572454731-3925-1-git-send-email-yang.shi@linux.alibaba.com Fixes: d883544515aa ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reported-by: Li Xinhai <lixinhai.lxh@gmail.com> Reviewed-by: Li Xinhai <lixinhai.lxh@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> [4.19 and 5.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15Input: synaptics - enable RMI mode for X1 Extreme 2nd GenerationLyude Paul
Just got one of these for debugging some unrelated issues, and noticed that Lenovo seems to have gone back to using RMI4 over smbus with Synaptics touchpads on some of their new systems, particularly this one. So, let's enable RMI mode for the X1 Extreme 2nd Generation. Signed-off-by: Lyude Paul <lyude@redhat.com> Link: https://lore.kernel.org/r/20191115221814.31903-1-lyude@redhat.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2019-11-15Merge branch 'bpf-trampoline'Daniel Borkmann
Alexei Starovoitov says: ==================== Introduce BPF trampoline that works as a bridge between kernel functions, BPF programs and other BPF programs. The first use case is fentry/fexit BPF programs that are roughly equivalent to kprobe/kretprobe. Unlike k[ret]probe there is practically zero overhead to call a set of BPF programs before or after kernel function. The second use case is heavily influenced by pain points in XDP development. BPF trampoline allows attaching similar fentry/fexit BPF program to any networking BPF program. It's now possible to see packets on input and output of any XDP, TC, lwt, cgroup programs without disturbing them. This greatly helps BPF-based network troubleshooting. The third use case of BPF trampoline will be explored in the follow up patches. The BPF trampoline will be used to dynamicly link BPF programs. It's more generic mechanism than array and link list of programs used in tracing, networking, cgroups. In many cases it can be used as a replacement for bpf_tail_call-based program chaining. See [1] for long term design discussion. v3 -> v4: - Included Peter's "86/alternatives: Teach text_poke_bp() to emulate instructions" as a first patch. If it changes between now and merge window, I'll rebease to newer version. The patch is necessary to do s/text_poke/text_poke_bp/ in patch 3 to fix the race. - In patch 4 fixed bpf_trampoline creation race spotted by Andrii. - Added patch 15 that annotates prog->kern bpf context types. It made patches 16 and 17 cleaner and more generic. - Addressed Andrii's feedback in other patches. v2 -> v3: - Addressed Song's and Andrii's comments - Fixed few minor bugs discovered while testing - Added one more libbpf patch v1 -> v2: - Addressed Andrii's comments - Added more test for fentry/fexit to kernel functions. Including stress test for maximum number of progs per trampoline. - Fixed a race btf_resolve_helper_id() - Added a patch to compare BTF types of functions arguments with actual types. - Added support for attaching BPF program to another BPF program via trampoline - Converted to use text_poke() API. That's the only viable mechanism to implement BPF-to-BPF attach. BPF-to-kernel attach can be refactored to use register_ftrace_direct() whenever it's available. [1] https://lore.kernel.org/bpf/20191112025112.bhzmrrh2pr76ssnh@ast-mbp.dhcp.thefacebook.com/ ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-11-15selftests/bpf: Add a test for attaching BPF prog to another BPF prog and subprogAlexei Starovoitov
Add a test that attaches one FEXIT program to main sched_cls networking program and two other FEXIT programs to subprograms. All three tracing programs access return values and skb->len of networking program and subprograms. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-21-ast@kernel.org
2019-11-15selftests/bpf: Extend test_pkt_access testAlexei Starovoitov
The test_pkt_access.o is used by multiple tests. Fix its section name so that program type can be automatically detected by libbpf and make it call other subprograms with skb argument. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-20-ast@kernel.org
2019-11-15libbpf: Add support for attaching BPF programs to other BPF programsAlexei Starovoitov
Extend libbpf api to pass attach_prog_fd into bpf_object__open. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-19-ast@kernel.org
2019-11-15bpf: Support attaching tracing BPF program to other BPF programsAlexei Starovoitov
Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any type including their subprograms. This feature allows snooping on input and output packets in XDP, TC programs including their return values. In order to do that the verifier needs to track types not only of vmlinux, but types of other BPF programs as well. The verifier also needs to translate uapi/linux/bpf.h types used by networking programs into kernel internal BTF types used by FENTRY/FEXIT BPF programs. In some cases LLVM optimizations can remove arguments from BPF subprograms without adjusting BTF info that LLVM backend knows. When BTF info disagrees with actual types that the verifiers sees the BPF trampoline has to fallback to conservative and treat all arguments as u64. The FENTRY/FEXIT program can still attach to such subprograms, but it won't be able to recognize pointer types like 'struct sk_buff *' and it won't be able to pass them to bpf_skb_output() for dumping packets to user space. The FENTRY/FEXIT program would need to use bpf_probe_read_kernel() instead. The BPF_PROG_LOAD command is extended with attach_prog_fd field. When it's set to zero the attach_btf_id is one vmlinux BTF type ids. When attach_prog_fd points to previously loaded BPF program the attach_btf_id is BTF type id of main function or one of its subprograms. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org
2019-11-15bpf: Compare BTF types of functions arguments with actual typesAlexei Starovoitov
Make the verifier check that BTF types of function arguments match actual types passed into top-level BPF program and into BPF-to-BPF calls. If types match such BPF programs and sub-programs will have full support of BPF trampoline. If types mismatch the trampoline has to be conservative. It has to save/restore five program arguments and assume 64-bit scalars. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-17-ast@kernel.org
2019-11-15netfilter: nf_tables: add nft_unregister_flowtable_hook()Pablo Neira Ayuso
Unbind flowtable callback if hook is unregistered. This patch is implicitly fixing the error path of nf_tables_newflowtable() and nft_flowtable_event(). Fixes: 8bb69f3b2918 ("netfilter: nf_tables: add flowtable offload control plane") Reported-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15netfilter: nf_tables: check if bind callback fails and unbind if hook ↵wenxu
registration fails Undo the callback binding before unregistering the existing hooks. This should also check for error of the bind setup call. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15netfilter: nf_tables_offload: undo updates if transaction failsPablo Neira Ayuso
The nft_flow_rule_offload_commit() function might fail after several successful commands, thus, leaving the hardware filtering policy in inconsistent state. This patch adds nft_flow_rule_offload_abort() function which undoes the updates that have been already processed if one command in this transaction fails. Hence, the hardware ruleset is left as it was before this aborted transaction. The deletion path needs to create the flow_rule object too, in case that an existing rule needs to be re-added from the abort path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15netfilter: nf_tables_offload: release flow_rule on error from commit pathPablo Neira Ayuso
If hardware offload commit path fails, release all flow_rule objects. Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15netfilter: nf_tables_offload: remove reference to flow rule from deletion pathPablo Neira Ayuso
The cookie is sufficient to delete the rule from the hardware. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15netfilter: nf_flow_table: remove unnecessary parameter in flow_offload_fill_dirwenxu
The ct object is already in the flow_offload structure, remove it. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15netfilter: nf_flow_table_offload: Fix check ndo_setup_tc when setup_blockwenxu
It should check the ndo_setup_tc in the nf_flow_table_offload_setup. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15bpf: Annotate context typesAlexei Starovoitov
Annotate BPF program context types with program-side type and kernel-side type. This type information is used by the verifier. btf_get_prog_ctx_type() is used in the later patches to verify that BTF type of ctx in BPF program matches to kernel expected ctx type. For example, the XDP program type is: BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp, struct xdp_md, struct xdp_buff) That means that XDP program should be written as: int xdp_prog(struct xdp_md *ctx) { ... } Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-16-ast@kernel.org
2019-11-15netfilter: Support iif matches in POSTROUTINGPhil Sutter
Instead of generally passing NULL to NF_HOOK_COND() for input device, pass skb->dev which contains input device for routed skbs. Note that iptables (both legacy and nft) reject rules with input interface match from being added to POSTROUTING chains, but nftables allows this. Cc: Eric Garver <eric@garver.life> Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15netfilter: nf_flow_table_offload: add IPv6 supportPablo Neira Ayuso
Add nf_flow_rule_route_ipv6() and use it from the IPv6 and the inet flowtable type definitions. Rename the nf_flow_rule_route() function to nf_flow_rule_route_ipv4(). Adjust maximum number of actions, which now becomes 16 to leave sufficient room for the IPv6 address mangling for NAT. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15netfilter: nf_flow_table_offload: add flow_action_entry_next() and use itPablo Neira Ayuso
This function retrieves a spare action entry from the array of actions. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15netfilter: nft_meta: use 64-bit time arithmeticArnd Bergmann
On 32-bit architectures, get_seconds() returns an unsigned 32-bit time value, which also matches the type used in the nft_meta code. This will not overflow in year 2038 as a time_t would, but it still suffers from the overflow problem later on in year 2106. Change this instance to use the time64_t type consistently and avoid the deprecated get_seconds(). The nft_meta_weekday() calculation potentially gets a little slower on 32-bit architectures, but now it has the same behavior as on 64-bit architectures and does not overflow. Fixes: 63d10e12b00d ("netfilter: nft_meta: support for time matching") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15netfilter: xt_time: use time64_tArnd Bergmann
The current xt_time driver suffers from the y2038 overflow on 32-bit architectures, when the time of day calculations break. Also, on both 32-bit and 64-bit architectures, there is a problem with info->date_start/stop, which is part of the user ABI and overflows in in 2106. Fix the first issue by using time64_t and explicit calls to div_u64() and div_u64_rem(), and document the seconds issue. The explicit 64-bit division is unfortunately slower on 32-bit architectures, but doing it as unsigned lets us use the optimized division-through-multiplication path in most configurations. This should be fine, as the code already does not allow any negative time of day values. Using u32 seconds values consistently would probably also work and be a little more efficient, but that doesn't feel right as it would propagate the y2106 overflow to more place rather than fewer. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15bpf: Fix race in btf_resolve_helper_id()Alexei Starovoitov
btf_resolve_helper_id() caching logic is a bit racy, since under root the verifier can verify several programs in parallel. Fix it with READ/WRITE_ONCE. Fix the type as well, since error is also recorded. Fixes: a7658e1a4164 ("bpf: Check types of arguments passed into helpers") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-15-ast@kernel.org
2019-11-15bpf: Reserve space for BPF trampoline in BPF programsAlexei Starovoitov
BPF trampoline can be made to work with existing 5 bytes of BPF program prologue, but let's add 5 bytes of NOPs to the beginning of every JITed BPF program to make BPF trampoline job easier. They can be removed in the future. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-14-ast@kernel.org
2019-11-15selftests/bpf: Add stress test for maximum number of progsAlexei Starovoitov
Add stress test for maximum number of attached BPF programs per BPF trampoline. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-13-ast@kernel.org
2019-11-15selftests/bpf: Add combined fentry/fexit testAlexei Starovoitov
Add a combined fentry/fexit test. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-12-ast@kernel.org
2019-11-15selftests/bpf: Add fexit tests for BPF trampolineAlexei Starovoitov
Add fexit tests for BPF trampoline that checks kernel functions with up to 6 arguments of different sizes and their return values. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-11-ast@kernel.org
2019-11-15selftests/bpf: Add test for BPF trampolineAlexei Starovoitov
Add sanity test for BPF trampoline that checks kernel functions with up to 6 arguments of different sizes. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-10-ast@kernel.org
2019-11-15bpf: Add kernel test functions for fentry testingAlexei Starovoitov
Add few kernel functions with various number of arguments, their types and sizes for BPF trampoline testing to cover different calling conventions. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-9-ast@kernel.org
2019-11-15selftest/bpf: Simple test for fentry/fexitAlexei Starovoitov
Add simple test for fentry and fexit programs around eth_type_trans. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-8-ast@kernel.org
2019-11-15libbpf: Add support to attach to fentry/fexit tracing progsAlexei Starovoitov
Teach libbpf to recognize tracing programs types and attach them to fentry/fexit. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-7-ast@kernel.org
2019-11-15libbpf: Introduce btf__find_by_name_kind()Alexei Starovoitov
Introduce btf__find_by_name_kind() helper to search BTF by name and kind, since name alone can be ambiguous. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-6-ast@kernel.org
2019-11-15bpf: Introduce BPF trampolineAlexei Starovoitov
Introduce BPF trampoline concept to allow kernel code to call into BPF programs with practically zero overhead. The trampoline generation logic is architecture dependent. It's converting native calling convention into BPF calling convention. BPF ISA is 64-bit (even on 32-bit architectures). The registers R1 to R5 are used to pass arguments into BPF functions. The main BPF program accepts only single argument "ctx" in R1. Whereas CPU native calling convention is different. x86-64 is passing first 6 arguments in registers and the rest on the stack. x86-32 is passing first 3 arguments in registers. sparc64 is passing first 6 in registers. And so on. The trampolines between BPF and kernel already exist. BPF_CALL_x macros in include/linux/filter.h statically compile trampolines from BPF into kernel helpers. They convert up to five u64 arguments into kernel C pointers and integers. On 64-bit architectures this BPF_to_kernel trampolines are nops. On 32-bit architecture they're meaningful. The opposite job kernel_to_BPF trampolines is done by CAST_TO_U64 macros and __bpf_trace_##call() shim functions in include/trace/bpf_probe.h. They convert kernel function arguments into array of u64s that BPF program consumes via R1=ctx pointer. This patch set is doing the same job as __bpf_trace_##call() static trampolines, but dynamically for any kernel function. There are ~22k global kernel functions that are attachable via nop at function entry. The function arguments and types are described in BTF. The job of btf_distill_func_proto() function is to extract useful information from BTF into "function model" that architecture dependent trampoline generators will use to generate assembly code to cast kernel function arguments into array of u64s. For example the kernel function eth_type_trans has two pointers. They will be casted to u64 and stored into stack of generated trampoline. The pointer to that stack space will be passed into BPF program in R1. On x86-64 such generated trampoline will consume 16 bytes of stack and two stores of %rdi and %rsi into stack. The verifier will make sure that only two u64 are accessed read-only by BPF program. The verifier will also recognize the precise type of the pointers being accessed and will not allow typecasting of the pointer to a different type within BPF program. The tracing use case in the datacenter demonstrated that certain key kernel functions have (like tcp_retransmit_skb) have 2 or more kprobes that are always active. Other functions have both kprobe and kretprobe. So it is essential to keep both kernel code and BPF programs executing at maximum speed. Hence generated BPF trampoline is re-generated every time new program is attached or detached to maintain maximum performance. To avoid the high cost of retpoline the attached BPF programs are called directly. __bpf_prog_enter/exit() are used to support per-program execution stats. In the future this logic will be optimized further by adding support for bpf_stats_enabled_key inside generated assembly code. Introduction of preemptible and sleepable BPF programs will completely remove the need to call to __bpf_prog_enter/exit(). Detach of a BPF program from the trampoline should not fail. To avoid memory allocation in detach path the half of the page is used as a reserve and flipped after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly which is enough for BPF tracing use cases. This limit can be increased in the future. BPF_TRACE_FENTRY programs have access to raw kernel function arguments while BPF_TRACE_FEXIT programs have access to kernel return value as well. Often kprobe BPF program remembers function arguments in a map while kretprobe fetches arguments from a map and analyzes them together with return value. BPF_TRACE_FEXIT accelerates this typical use case. Recursion prevention for kprobe BPF programs is done via per-cpu bpf_prog_active counter. In practice that turned out to be a mistake. It caused programs to randomly skip execution. The tracing tools missed results they were looking for. Hence BPF trampoline doesn't provide builtin recursion prevention. It's a job of BPF program itself and will be addressed in the follow up patches. BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases in the future. For example to remove retpoline cost from XDP programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org
2019-11-15bpf: Add bpf_arch_text_poke() helperAlexei Starovoitov
Add bpf_arch_text_poke() helper that is used by BPF trampoline logic to patch nops/calls in kernel text into calls into BPF trampoline and to patch calls/nops inside BPF programs too. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-4-ast@kernel.org
2019-11-15bpf: Refactor x86 JIT into helpersAlexei Starovoitov
Refactor x86 JITing of LDX, STX, CALL instructions into separate helper functions. No functional changes in LDX and STX helpers. There is a minor change in CALL helper. It will populate target address correctly on the first pass of JIT instead of second pass. That won't reduce total number of JIT passes though. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-3-ast@kernel.org
2019-11-15x86/alternatives: Teach text_poke_bp() to emulate instructionsPeter Zijlstra
In preparation for static_call and variable size jump_label support, teach text_poke_bp() to emulate instructions, namely: JMP32, JMP8, CALL, NOP2, NOP_ATOMIC5, INT3 The current text_poke_bp() takes a @handler argument which is used as a jump target when the temporary INT3 is hit by a different CPU. When patching CALL instructions, this doesn't work because we'd miss the PUSH of the return address. Instead, teach poke_int3_handler() to emulate an instruction, typically the instruction we're patching in. This fits almost all text_poke_bp() users, except arch_unoptimize_kprobe() which restores random text, and for that site we have to build an explicit emulate instruction. Tested-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191111132457.529086974@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 8c7eebc10687af45ac8e40ad1bac0cf7893dba9f) Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-11-15bpf, doc: Change right arguments for JIT example codeMao Wenan
The example code for the x86_64 JIT uses the wrong arguments when calling function bar(). Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191114034351.162740-1-maowenan@huawei.com
2019-11-15samples/bpf: Add missing option to xdpsock usageAndre Guedes
Commit 743e568c1586 (samples/bpf: Add a "force" flag to XDP samples) introduced the '-F' option but missed adding it to the usage() and the 'long_option' array. Fixes: 743e568c1586 (samples/bpf: Add a "force" flag to XDP samples) Signed-off-by: Andre Guedes <andre.guedes@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191114162847.221770-2-andre.guedes@intel.com
2019-11-15samples/bpf: Remove duplicate option from xdpsockAndre Guedes
The '-f' option is shown twice in the usage(). This patch removes the outdated version. Signed-off-by: Andre Guedes <andre.guedes@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191114162847.221770-1-andre.guedes@intel.com
2019-11-15s390/bpf: Make sure JIT passes do not increase code sizeIlya Leoshkevich
The upcoming s390 branch length extension patches rely on "passes do not increase code size" property in order to consistently choose between short and long branches. Currently this property does not hold between the first and the second passes for register save/restore sequences, as well as various code fragments that depend on SEEN_* flags. Generate the code during the first pass conservatively: assume register save/restore sequences have the maximum possible length, and that all SEEN_* flags are set. Also refuse to JIT if this happens anyway (e.g. due to a bug), as this might lead to verifier bypass once long branches are introduced. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191114151820.53222-1-iii@linux.ibm.com
2019-11-15bpf: Support doubleword alignment in bpf_jit_binary_allocIlya Leoshkevich
Currently passing alignment greater than 4 to bpf_jit_binary_alloc does not work: in such cases it silently aligns only to 4 bytes. On s390, in order to load a constant from memory in a large (>512k) BPF program, one must use lgrl instruction, whose memory operand must be aligned on an 8-byte boundary. This patch makes it possible to request 8-byte alignment from bpf_jit_binary_alloc, and also makes it issue a warning when an unsupported alignment is requested. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191115123722.58462-1-iii@linux.ibm.com
2019-11-15Merge tag 'for-linus-20191115' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A few fixes that should make it into this release. This contains: - io_uring: - The timeout command assumes sequence == 0 means that we want one completion, but this kind of overloading is unfortunate as it prevents users from doing a pure time based wait. Since this operation was introduced in this cycle, let's correct it now, while we can. (me) - One-liner to fix an issue with dependent links and fixed buffer reads. The actual IO completed fine, but the link got severed since we stored the wrong expected value. (me) - Add TIMEOUT to list of opcodes that don't need a file. (Pavel) - rsxx missing workqueue destry calls. Old bug. (Chuhong) - Fix blk-iocost active list check (Jiufei) - Fix impossible-to-hit overflow merge condition, that still hit some folks very rarely (Junichi) - Fix bfq hang issue from 5.3. This didn't get marked for stable, but will go into stable post this merge (Paolo)" * tag 'for-linus-20191115' of git://git.kernel.dk/linux-block: rsxx: add missed destroy_workqueue calls in remove iocost: check active_list of all the ancestors in iocg_activate() block, bfq: deschedule empty bfq_queues not referred by any process io_uring: ensure registered buffer import returns the IO length io_uring: Fix getting file for timeout block: check bi_size overflow before merge io_uring: make timeout sequence == 0 mean no sequence
2019-11-15i2c: core: fix use after free in of_i2c_notifyWen Yang
We can't use "adap->dev" after it has been freed. Fixes: 5bf4fa7daea6 ("i2c: break out OF support into separate file") Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2019-11-15i2c: acpi: Force bus speed to 400KHz if a Silead touchscreen is presentHans de Goede
Many cheap devices use Silead touchscreen controllers. Testing has shown repeatedly that these touchscreen controllers work fine at 400KHz, but for unknown reasons do not work properly at 100KHz. This has been seen on both ARM and x86 devices using totally different i2c controllers. On some devices the ACPI tables list another device at the same I2C-bus as only being capable of 100KHz, testing has shown that these other devices work fine at 400KHz (as can be expected of any recent I2C hw). This commit makes i2c_acpi_find_bus_speed() always return 400KHz if a Silead touchscreen controller is present, fixing the touchscreen not working on devices which ACPI tables' wrongly list another device on the same bus as only being capable of 100KHz. Specifically this fixes the touchscreen on the Jumper EZpad 6 m4 not working. Reported-by: youling 257 <youling257@gmail.com> Tested-by: youling 257 <youling257@gmail.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> [wsa: rewording warning a little] Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
2019-11-15Merge branch 'ptp-Validate-the-ancillary-ioctl-flags-more-carefully'David S. Miller
Richard Cochran says: ==================== ptp: Validate the ancillary ioctl flags more carefully. The flags passed to the ioctls for periodic output signals and time stamping of external signals were never checked, and thus formed a useless ABI inadvertently. More recently, a version 2 of the ioctls was introduced in order make the flags meaningful. This series tightens up the checks on the new ioctl flags. - Patch 1 ensures at least one edge flag is set for the new ioctl. - Patches 2-7 are Jacob's recent checks, picking up the tags. - Patch 8 introduces a "strict" flag for passing to the drivers when the new ioctl is used. - Patches 9-12 implement the "strict" checking in the drivers. - Patch 13 extends the test program to exercise combinations of flags. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>