summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-07-03bpf, verifier: add additional patterns to evaluate_reg_imm_aluJohn Fastabend
Currently the verifier does not track imm across alu operations when the source register is of unknown type. This adds additional pattern matching to catch this and track imm. We've seen LLVM generating this pattern while working on cilium. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03bpf: extend bpf_trace_printk to support %iJohn Fastabend
Currently, bpf_trace_printk does not support common formatting symbol '%i' however vsprintf does and is what eventually gets called by bpf helper. If users are used to '%i' and currently make use of it, then bpf_trace_printk will just return with error without dumping anything to the trace pipe, so just add support for '%i' to the helper. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03bpf: export whether tail call has jited ownerDaniel Borkmann
We do export through fdinfo already whether a prog is JITed or not, given a program load can fail in case of either prog or tail call map has JITed property, but neither both are JITed or not JITed, we can facilitate error reporting in loaders like iproute2 through exporting owner_jited of tail call map. We already do export owner_prog_type through this facility, so parser can pick up both for comparison. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03bpf: simplify narrower ctx accessDaniel Borkmann
This work tries to make the semantics and code around the narrower ctx access a bit easier to follow. Right now everything is done inside the .is_valid_access(). Offset matching is done differently for read/write types, meaning writes don't support narrower access and thus matching only on offsetof(struct foo, bar) is enough whereas for read case that supports narrower access we must check for offsetof(struct foo, bar) + offsetof(struct foo, bar) + sizeof(<bar>) - 1 for each of the cases. For read cases of individual members that don't support narrower access (like packet pointers or skb->cb[] case which has its own narrow access logic), we check as usual only offsetof(struct foo, bar) like in write case. Then, for the case where narrower access is allowed, we also need to set the aux info for the access. Meaning, ctx_field_size and converted_op_size have to be set. First is the original field size e.g. sizeof(<bar>) as in above example from the user facing ctx, and latter one is the target size after actual rewrite happened, thus for the kernel facing ctx. Also here we need the range match and we need to keep track changing convert_ctx_access() and converted_op_size from is_valid_access() as both are not at the same location. We can simplify the code a bit: check_ctx_access() becomes simpler in that we only store ctx_field_size as a meta data and later in convert_ctx_accesses() we fetch the target_size right from the location where we do convert. Should the verifier be misconfigured we do reject for BPF_WRITE cases or target_size that are not provided. For the subsystems, we always work on ranges in is_valid_access() and add small helpers for ranges and narrow access, convert_ctx_accesses() sets target_size for the relevant instruction. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Cc: Yonghong Song <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01bpf: BPF support for sock_opsLawrence Brakmo
Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding struct that allows BPF programs of this type to access some of the socket's fields (such as IP addresses, ports, etc.). It uses the existing bpf cgroups infrastructure so the programs can be attached per cgroup with full inheritance support. The program will be called at appropriate times to set relevant connections parameters such as buffer sizes, SYN and SYN-ACK RTOs, etc., based on connection information such as IP addresses, port numbers, etc. Alghough there are already 3 mechanisms to set parameters (sysctls, route metrics and setsockopts), this new mechanism provides some distinct advantages. Unlike sysctls, it can set parameters per connection. In contrast to route metrics, it can also use port numbers and information provided by a user level program. In addition, it could set parameters probabilistically for evaluation purposes (i.e. do something different on 10% of the flows and compare results with the other 90% of the flows). Also, in cases where IPv6 addresses contain geographic information, the rules to make changes based on the distance (or RTT) between the hosts are much easier than route metric rules and can be global. Finally, unlike setsockopt, it oes not require application changes and it can be updated easily at any time. Although the bpf cgroup framework already contains a sock related program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type (BPF_PROG_TYPE_SOCK_OPS) beccause the existing type expects to be called only once during the connections's lifetime. In contrast, the new program type will be called multiple times from different places in the network stack code. For example, before sending SYN and SYN-ACKs to set an appropriate timeout, when the connection is established to set congestion control, etc. As a result it has "op" field to specify the type of operation requested. The purpose of this new program type is to simplify setting connection parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is easy to use facebook's internal IPv6 addresses to determine if both hosts of a connection are in the same datacenter. Therefore, it is easy to write a BPF program to choose a small SYN RTO value when both hosts are in the same datacenter. This patch only contains the framework to support the new BPF program type, following patches add the functionality to set various connection parameters. This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS and a new bpf syscall command to load a new program of this type: BPF_PROG_LOAD_SOCKET_OPS. Two new corresponding structs (one for the kernel one for the user/BPF program): /* kernel version */ struct bpf_sock_ops_kern { struct sock *sk; __u32 op; union { __u32 reply; __u32 replylong[4]; }; }; /* user version * Some fields are in network byte order reflecting the sock struct * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to * convert them to host byte order. */ struct bpf_sock_ops { __u32 op; union { __u32 reply; __u32 replylong[4]; }; __u32 family; __u32 remote_ip4; /* In network byte order */ __u32 local_ip4; /* In network byte order */ __u32 remote_ip6[4]; /* In network byte order */ __u32 local_ip6[4]; /* In network byte order */ __u32 remote_port; /* In network byte order */ __u32 local_port; /* In host byte horder */ }; Currently there are two types of ops. The first type expects the BPF program to return a value which is then used by the caller (or a negative value to indicate the operation is not supported). The second type expects state changes to be done by the BPF program, for example through a setsockopt BPF helper function, and they ignore the return value. The reply fields of the bpf_sockt_ops struct are there in case a bpf program needs to return a value larger than an integer. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
A set of overlapping changes in macvlan and the rocker driver, nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Need to access netdev->num_rx_queues behind an accessor in netvsc driver otherwise the build breaks with some configs, from Arnd Bergmann. 2) Add dummy xfrm_dev_event() so that build doesn't fail when CONFIG_XFRM_OFFLOAD is not set. From Hangbin Liu. 3) Don't OOPS when pfkey_msg2xfrm_state() signals an erros, from Dan Carpenter. 4) Fix MCDI command size for filter operations in sfc driver, from Martin Habets. 5) Fix UFO segmenting so that we don't calculate incorrect checksums, from Michal Kubecek. 6) When ipv6 datagram connects fail, reset destination address and port. From Wei Wang. 7) TCP disconnect must reset the cached receive DST, from WANG Cong. 8) Fix sign extension bug on 32-bit in dev_get_stats(), from Eric Dumazet. 9) fman driver has to depend on HAS_DMA, from Madalin Bucur. 10) Fix bpf pointer leak with xadd in verifier, from Daniel Borkmann. 11) Fix negative page counts with GFO, from Michal Kubecek. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits) sfc: fix attempt to translate invalid filter ID net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish() bpf: prevent leaking pointer via xadd on unpriviledged arcnet: com20020-pci: add missing pdev setup in netdev structure arcnet: com20020-pci: fix dev_id calculation arcnet: com20020: remove needless base_addr assignment Trivial fix to spelling mistake in arc_printk message arcnet: change irq handler to lock irqsave rocker: move dereference before free mlxsw: spectrum_router: Fix NULL pointer dereference net: sched: Fix one possible panic when no destroy callback virtio-net: serialize tx routine during reset net: usb: asix88179_178a: Add support for the Belkin B2B128 fsl/fman: add dependency on HAS_DMA net: prevent sign extension in dev_get_stats() tcp: reset sk_rx_dst in tcp_disconnect() net: ipv6: reset daddr and dport in sk if connect() fails bnx2x: Don't log mc removal needlessly bnxt_en: Fix netpoll handling. bnxt_en: Add missing logic to handle TPA end error conditions. ...
2017-06-29bpf: prevent leaking pointer via xadd on unpriviledgedDaniel Borkmann
Leaking kernel addresses on unpriviledged is generally disallowed, for example, verifier rejects the following: 0: (b7) r0 = 0 1: (18) r2 = 0xffff897e82304400 3: (7b) *(u64 *)(r1 +48) = r2 R2 leaks addr into ctx Doing pointer arithmetic on them is also forbidden, so that they don't turn into unknown value and then get leaked out. However, there's xadd as a special case, where we don't check the src reg for being a pointer register, e.g. the following will pass: 0: (b7) r0 = 0 1: (7b) *(u64 *)(r1 +48) = r0 2: (18) r2 = 0xffff897e82304400 ; map 4: (db) lock *(u64 *)(r1 +48) += r2 5: (95) exit We could store the pointer into skb->cb, loose the type context, and then read it out from there again to leak it eventually out of a map value. Or more easily in a different variant, too: 0: (bf) r6 = r1 1: (7a) *(u64 *)(r10 -8) = 0 2: (bf) r2 = r10 3: (07) r2 += -8 4: (18) r1 = 0x0 6: (85) call bpf_map_lookup_elem#1 7: (15) if r0 == 0x0 goto pc+3 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp 8: (b7) r3 = 0 9: (7b) *(u64 *)(r0 +0) = r3 10: (db) lock *(u64 *)(r0 +0) += r6 11: (b7) r0 = 0 12: (95) exit from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp 11: (b7) r0 = 0 12: (95) exit Prevent this by checking xadd src reg for pointer types. Also add a couple of test cases related to this. Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs") Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29bpf: Fix out-of-bound access on interpreters[]Martin KaFai Lau
The index is off-by-one when fp->aux->stack_depth has already been rounded up to 32. In particular, if stack_depth is 512, the index will be 16. The fix is to round_up and then takes -1 instead of round_down. [ 22.318680] ================================================================== [ 22.319745] BUG: KASAN: global-out-of-bounds in bpf_prog_select_runtime+0x48a/0x670 [ 22.320737] Read of size 8 at addr ffffffff82aadae0 by task sockex3/1946 [ 22.321646] [ 22.321858] CPU: 1 PID: 1946 Comm: sockex3 Tainted: G W 4.12.0-rc6-01680-g2ee87db3a287 #22 [ 22.323061] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014 [ 22.324260] Call Trace: [ 22.324612] dump_stack+0x67/0x99 [ 22.325081] print_address_description+0x1e8/0x290 [ 22.325734] ? bpf_prog_select_runtime+0x48a/0x670 [ 22.326360] kasan_report+0x265/0x350 [ 22.326860] __asan_report_load8_noabort+0x19/0x20 [ 22.327484] bpf_prog_select_runtime+0x48a/0x670 [ 22.328109] bpf_prog_load+0x626/0xd40 [ 22.328637] ? __bpf_prog_charge+0xc0/0xc0 [ 22.329222] ? check_nnp_nosuid.isra.61+0x100/0x100 [ 22.329890] ? __might_fault+0xf6/0x1b0 [ 22.330446] ? lock_acquire+0x360/0x360 [ 22.331013] SyS_bpf+0x67c/0x24d0 [ 22.331491] ? trace_hardirqs_on+0xd/0x10 [ 22.332049] ? __getnstimeofday64+0xaf/0x1c0 [ 22.332635] ? bpf_prog_get+0x20/0x20 [ 22.333135] ? __audit_syscall_entry+0x300/0x600 [ 22.333770] ? syscall_trace_enter+0x540/0xdd0 [ 22.334339] ? exit_to_usermode_loop+0xe0/0xe0 [ 22.334950] ? do_syscall_64+0x48/0x410 [ 22.335446] ? bpf_prog_get+0x20/0x20 [ 22.335954] do_syscall_64+0x181/0x410 [ 22.336454] entry_SYSCALL64_slow_path+0x25/0x25 [ 22.337121] RIP: 0033:0x7f263fe81f19 [ 22.337618] RSP: 002b:00007ffd9a3440c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141 [ 22.338619] RAX: ffffffffffffffda RBX: 0000000000aac5fb RCX: 00007f263fe81f19 [ 22.339600] RDX: 0000000000000030 RSI: 00007ffd9a3440d0 RDI: 0000000000000005 [ 22.340470] RBP: 0000000000a9a1e0 R08: 0000000000a9a1e0 R09: 0000009d00000001 [ 22.341430] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000010000 [ 22.342411] R13: 0000000000a9a023 R14: 0000000000000001 R15: 0000000000000003 [ 22.343369] [ 22.343593] The buggy address belongs to the variable: [ 22.344241] interpreters+0x80/0x980 [ 22.344708] [ 22.344908] Memory state around the buggy address: [ 22.345556] ffffffff82aad980: 00 00 00 04 fa fa fa fa 04 fa fa fa fa fa fa fa [ 22.346449] ffffffff82aada00: 00 00 00 00 00 fa fa fa fa fa fa fa 00 00 00 00 [ 22.347361] >ffffffff82aada80: 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa fa [ 22.348301] ^ [ 22.349142] ffffffff82aadb00: 00 01 fa fa fa fa fa fa 00 00 00 00 00 00 00 00 [ 22.350058] ffffffff82aadb80: 00 00 07 fa fa fa fa fa 00 00 05 fa fa fa fa fa [ 22.350984] ================================================================== Fixes: b870aa901f4b ("bpf: use different interpreter depending on required stack size") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29bpf: Add syscall lookup support for fd array and htabMartin KaFai Lau
This patch allows userspace to do BPF_MAP_LOOKUP_ELEM on BPF_MAP_TYPE_PROG_ARRAY, BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS. The lookup returns a prog-id or map-id to the userspace. The userspace can then use the BPF_PROG_GET_FD_BY_ID or BPF_MAP_GET_FD_BY_ID to get a fd. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "A few fixes for timekeeping and timers: - Plug a subtle race due to a missing READ_ONCE() in the timekeeping code where reloading of a pointer results in an inconsistent callback argument being supplied to the clocksource->read function. - Correct the CLOCK_MONOTONIC_RAW sub-nanosecond accounting in the time keeping core code, to prevent a possible discontuity. - Apply a similar fix to the arm64 vdso clock_gettime() implementation - Add missing includes to clocksource drivers, which relied on indirect includes which fails in certain configs. - Use the proper iomem pointer for read/iounmap in a probe function" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arm64/vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting time: Fix clock->read(clock) race around clocksource changes clocksource: Explicitly include linux/clocksource.h when needed clocksource/drivers/arm_arch_timer: Fix read and iounmap of incorrect variable
2017-06-25Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "Three fixlets for perf: - Return the proper error code if aux buffers for a event are not supported. - Calculate the probe offset for inlined functions correctly - Update the Skylake DTLB load/store miss event so it can count 1G TLB entries as well" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf probe: Fix probe definition for inlined functions perf/x86/intel: Add 1G DTLB load/store miss support for SKL perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev)
2017-06-24Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull timer fix from Eric Biederman: "This fixes an issue of confusing injected signals with the signals from posix timers that has existed since posix timers have been in the kernel. This patch is slightly simpler than my earlier version of this patch as I discovered in testing that I had misspelled "#ifdef CONFIG_POSIX_TIMERS". So I deleted that unnecessary test and made setting of resched_timer uncondtional. I have tested this and verified that without this patch there is a nasty hang that is easy to trigger, and with this patch everything works properly" Thomas Gleixner dixit: "It fixes the problem at hand and covers the ptrace case as well, which I missed. Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal: Only reschedule timers on signals timers have sent
2017-06-23bpf: possibly avoid extra masking for narrower load in verifierYonghong Song
Commit 31fd85816dbe ("bpf: permits narrower load from bpf program context fields") permits narrower load for certain ctx fields. The commit however will already generate a masking even if the prog-specific ctx conversion produces the result with narrower size. For example, for __sk_buff->protocol, the ctx conversion loads the data into register with 2-byte load. A narrower 2-byte load should not generate masking. For __sk_buff->vlan_present, the conversion function set the result as either 0 or 1, essentially a byte. The narrower 2-byte or 1-byte load should not generate masking. To avoid unnecessary masking, prog-specific *_is_valid_access now passes converted_op_size back to verifier, which indicates the valid data width after perceived future conversion. Based on this information, verifier is able to avoid unnecessary marking. Since we want more information back from prog-specific *_is_valid_access checking, all of them are packed into one data structure for more clarity. Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Two entries being added at the same time to the IFLA policy table, whilst parallel bug fixes to decnet routing dst handling overlapping with the dst gc removal in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching Pull livepatching fix from Jiri Kosina: "Fix the way how livepatches are being stacked with respect to RCU, from Petr Mladek" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching: livepatch: Fix stacking of patches with respect to RCU
2017-06-21perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev)Hendrik Brueckner
If the event for which an AUX area is about to be allocated, does not support setting up an AUX area, rb_alloc_aux() return -ENOTSUPP. This error condition is being returned unfiltered to the user space, and, for example, the perf tools fails with: failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0x3fff497a1c8, 512)=22) This error can be easily seen with "perf record -m 128,256 -e cpu-clock". The 524 error code maps to -ENOTSUPP (in rb_alloc_aux()). The -ENOTSUPP error code shall be only used within the kernel. So the correct error code would then be -EOPNOTSUPP. With this commit, the perf tool then reports: failed to mmap with 95 (Operation not supported) which is more clear. Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pu Hou <bjhoupu@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas-Mich Richter <tmricht@linux.vnet.ibm.com> Cc: acme@kernel.org Cc: linux-s390@vger.kernel.org Link: http://lkml.kernel.org/r/1497954399-6355-1-git-send-email-brueckner@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20livepatch: Fix stacking of patches with respect to RCUPetr Mladek
rcu_read_(un)lock(), list_*_rcu(), and synchronize_rcu() are used for a secure access and manipulation of the list of patches that modify the same function. In particular, it is the variable func_stack that is accessible from the ftrace handler via struct ftrace_ops and klp_ops. Of course, it synchronizes also some states of the patch on the top of the stack, e.g. func->transition in klp_ftrace_handler. At the same time, this mechanism guards also the manipulation of task->patch_state. It is modified according to the state of the transition and the state of the process. Now, all this works well as long as RCU works well. Sadly livepatching might get into some corner cases when this is not true. For example, RCU is not watching when rcu_read_lock() is taken in idle threads. It is because they might sleep and prevent reaching the grace period for too long. There are ways how to make RCU watching even in idle threads, see rcu_irq_enter(). But there is a small location inside RCU infrastructure when even this does not work. This small problematic location can be detected either before calling rcu_irq_enter() by rcu_irq_enter_disabled() or later by rcu_is_watching(). Sadly, there is no safe way how to handle it. Once we detect that RCU was not watching, we might see inconsistent state of the function stack and the related variables in klp_ftrace_handler(). Then we could do a wrong decision, use an incompatible implementation of the function and break the consistency of the system. We could warn but we could not avoid the damage. Fortunately, ftrace has similar problems and they seem to be solved well there. It uses a heavy weight implementation of some RCU operations. In particular, it replaces: + rcu_read_lock() with preempt_disable_notrace() + rcu_read_unlock() with preempt_enable_notrace() + synchronize_rcu() with schedule_on_each_cpu(sync_work) My understanding is that this is RCU implementation from a stone age. It meets the core RCU requirements but it is rather ineffective. Especially, it does not allow to batch or speed up the synchronize calls. On the other hand, it is very trivial. It allows to safely trace and/or livepatch even the RCU core infrastructure. And the effectiveness is a not a big issue because using ftrace or livepatches on productive systems is a rare operation. The safety is much more important than a negligible extra load. Note that the alternative implementation follows the RCU principles. Therefore, we could and actually must use list_*_rcu() variants when manipulating the func_stack. These functions allow to access the pointers in the right order and with the right barriers. But they do not use any other information that would be set only by rcu_read_lock(). Also note that there are actually two problems solved in ftrace: First, it cares about the consistency of RCU read sections. It is being solved the way as described and used in this patch. Second, ftrace needs to make sure that nobody is inside the dynamic trampoline when it is being freed. For this, it also calls synchronize_rcu_tasks() in preemptive kernel in ftrace_shutdown(). Livepatch has similar problem but it is solved by ftrace for free. klp_ftrace_handler() is a good guy and never sleeps. In addition, it is registered with FTRACE_OPS_FL_DYNAMIC. It causes that unregister_ftrace_function() calls: * schedule_on_each_cpu(ftrace_sync) - always * synchronize_rcu_tasks() - in preemptive kernel The effect is that nobody is neither inside the dynamic trampoline nor inside the ftrace handler after unregister_ftrace_function() returns. [jkosina@suse.cz: reformat changelog, fix comment] Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-06-20time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accountingJohn Stultz
Due to how the MONOTONIC_RAW accumulation logic was handled, there is the potential for a 1ns discontinuity when we do accumulations. This small discontinuity has for the most part gone un-noticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW in their vDSO clock_gettime implementation, we've seen failures with the inconsistency-check test in kselftest. This patch addresses the issue by using the same sub-ns accumulation handling that CLOCK_MONOTONIC uses, which avoids the issue for in-kernel users. Since the ARM64 vDSO implementation has its own clock_gettime calculation logic, this patch reduces the frequency of errors, but failures are still seen. The ARM64 vDSO will need to be updated to include the sub-nanosecond xtime_nsec values in its calculation for this issue to be completely fixed. Signed-off-by: John Stultz <john.stultz@linaro.org> Tested-by: Daniel Mentz <danielmentz@google.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Cc: "stable #4 . 8+" <stable@vger.kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-20time: Fix clock->read(clock) race around clocksource changesJohn Stultz
In tests, which excercise switching of clocksources, a NULL pointer dereference can be observed on AMR64 platforms in the clocksource read() function: u64 clocksource_mmio_readl_down(struct clocksource *c) { return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask; } This is called from the core timekeeping code via: cycle_now = tkr->read(tkr->clock); tkr->read is the cached tkr->clock->read() function pointer. When the clocksource is changed then tkr->clock and tkr->read are updated sequentially. The code above results in a sequential load operation of tkr->read and tkr->clock as well. If the store to tkr->clock hits between the loads of tkr->read and tkr->clock, then the old read() function is called with the new clock pointer. As a consequence the read() function dereferences a different data structure and the resulting 'reg' pointer can point anywhere including NULL. This problem was introduced when the timekeeping code was switched over to use struct tk_read_base. Before that, it was theoretically possible as well when the compiler decided to reload clock in the code sequence: now = tk->clock->read(tk->clock); Add a helper function which avoids the issue by reading tk_read_base->clock once into a local variable clk and then issue the read function via clk->read(clk). This guarantees that the read() function always gets the proper clocksource pointer handed in. Since there is now no use for the tkr.read pointer, this patch also removes it, and to address stopping the fast timekeeper during suspend/resume, it introduces a dummy clocksource to use rather then just a dummy read function. Signed-off-by: John Stultz <john.stultz@linaro.org> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: stable <stable@vger.kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Daniel Mentz <danielmentz@google.com> Link: http://lkml.kernel.org/r/1496965462-20003-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-18Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Three fixlets for timers: - Two hot-fixes for the alarmtimer based posix timers, which prevent a nasty DOS by self rescheduling timers. The proper cleanup of that mess is queued for 4.13 - Make a function static" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick/broadcast: Make tick_broadcast_setup_oneshot() static alarmtimer: Rate limit periodic intervals alarmtimer: Prevent overflow of relative timers
2017-06-18Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "Two small fixes for the schedulre core: - Use the proper switch_mm() variant in idle_task_exit() because that code is not called with interrupts disabled. - Fix a confusing typo in a printk" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off() sched/fair: Fix typo in printk message
2017-06-18Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Thomas Gleixner: "Add a missing resource release to an error path" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Release resources in __setup_irq() error path
2017-06-17signal: Only reschedule timers on signals timers have sentEric W. Biederman
Thomas Gleixner wrote: > The CRIU support added a 'feature' which allows a user space task to send > arbitrary (kernel) signals to itself. The changelog says: > > The kernel prevents sending of siginfo with positive si_code, because > these codes are reserved for kernel. I think we can allow a task to > send such a siginfo to itself. This operation should not be dangerous. > > Quite contrary to that claim, it turns out that it is outright dangerous > for signals with info->si_code == SI_TIMER. The following code sequence in > a user space task allows to crash the kernel: > > id = timer_create(CLOCK_XXX, ..... signo = SIGX); > timer_set(id, ....); > info->si_signo = SIGX; > info->si_code = SI_TIMER: > info->_sifields._timer._tid = id; > info->_sifields._timer._sys_private = 2; > rt_[tg]sigqueueinfo(..., SIGX, info); > sigemptyset(&sigset); > sigaddset(&sigset, SIGX); > rt_sigtimedwait(sigset, info); > > For timers based on CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID this > results in a kernel crash because sigwait() dequeues the signal and the > dequeue code observes: > > info->si_code == SI_TIMER && info->_sifields._timer._sys_private != 0 > > which triggers the following callchain: > > do_schedule_next_timer() -> posix_cpu_timer_schedule() -> arm_timer() > > arm_timer() executes a list_add() on the timer, which is already armed via > the timer_set() syscall. That's a double list add which corrupts the posix > cpu timer list. As a consequence the kernel crashes on the next operation > touching the posix cpu timer list. > > Posix clocks which are internally implemented based on hrtimers are not > affected by this because hrtimer_start() can handle already armed timers > nicely, but it's a reliable way to trigger the WARN_ON() in > hrtimer_forward(), which complains about calling that function on an > already armed timer. This problem has existed since the posix timer code was merged into 2.5.63. A few releases earlier in 2.5.60 ptrace gained the ability to inject not just a signal (which linux has supported since 1.0) but the full siginfo of a signal. The core problem is that the code will reschedule in response to signals getting dequeued not just for signals the timers sent but for other signals that happen to a si_code of SI_TIMER. Avoid this confusion by testing to see if the queued signal was preallocated as all timer signals are preallocated, and so far only the timer code preallocates signals. Move the check for if a timer needs to be rescheduled up into collect_signal where the preallocation check must be performed, and pass the result back to dequeue_signal where the code reschedules timers. This makes it clear why the code cares about preallocated timers. Cc: stable@vger.kernel.org Reported-by: Thomas Gleixner <tglx@linutronix.de> History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Reference: 66dd34ad31e5 ("signal: allow to send any siginfo to itself") Reference: 1669ce53e2ff ("Add PTRACE_GETSIGINFO and PTRACE_SETSIGINFO") Fixes: db8b50ba75f2 ("[PATCH] POSIX clocks & timers") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-06-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
The conflicts were two cases of overlapping changes in batman-adv and the qed driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15Merge branches 'pm-cpufreq', 'pm-cpuidle' and 'pm-devfreq'Rafael J. Wysocki
* pm-cpufreq: cpufreq: conservative: Allow down_threshold to take values from 1 to 10 Revert "cpufreq: schedutil: Reduce frequencies slower" * pm-cpuidle: cpuidle: dt: Add missing 'of_node_put()' * pm-devfreq: PM / devfreq: exynos-ppmu: Staticize event list PM / devfreq: exynos-ppmu: Handle return value of clk_prepare_enable PM / devfreq: exynos-nocp: Handle return value of clk_prepare_enable
2017-06-14bpf: permits narrower load from bpf program context fieldsYonghong Song
Currently, verifier will reject a program if it contains an narrower load from the bpf context structure. For example, __u8 h = __sk_buff->hash, or __u16 p = __sk_buff->protocol __u32 sample_period = bpf_perf_event_data->sample_period which are narrower loads of 4-byte or 8-byte field. This patch solves the issue by: . Introduce a new parameter ctx_field_size to carry the field size of narrower load from prog type specific *__is_valid_access validator back to verifier. . The non-zero ctx_field_size for a memory access indicates (1). underlying prog type specific convert_ctx_accesses supporting non-whole-field access (2). the current insn is a narrower or whole field access. . In verifier, for such loads where load memory size is less than ctx_field_size, verifier transforms it to a full field load followed by proper masking. . Currently, __sk_buff and bpf_perf_event_data->sample_period are supporting narrowing loads. . Narrower stores are still not allowed as typical ctx stores are just normal stores. Because of this change, some tests in verifier will fail and these tests are removed. As a bonus, rename some out of bound __sk_buff->cb access to proper field name and remove two redundant "skb cb oob" tests. Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-13genirq: Release resources in __setup_irq() error pathHeiner Kallweit
In case __irq_set_trigger() fails the resources requested via irq_request_resources() are not released. Add the missing release call into the error handling path. Fixes: c1bacbae8192 ("genirq: Provide irq_request/release_resources chip callbacks") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/655538f5-cb20-a892-ff15-fbd2dd1fa4ec@gmail.com
2017-06-12tick/broadcast: Make tick_broadcast_setup_oneshot() staticStephen Boyd
This function isn't used outside of tick-broadcast.c, so let's mark it static. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Link: http://lkml.kernel.org/r/20170608063603.13276-1-sboyd@codeaurora.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-12Revert "cpufreq: schedutil: Reduce frequencies slower"Rafael J. Wysocki
Revert commit 39b64aa1c007 (cpufreq: schedutil: Reduce frequencies slower) that introduced unintentional changes in behavior leading to adverse effects on some systems. Reported-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-11sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()Andy Lutomirski
idle_task_exit() can be called with IRQs on x86 on and therefore should use switch_mm(), not switch_mm_irqs_off(). This doesn't seem to cause any problems right now, but it will confuse my upcoming TLB flush changes. Nonetheless, I think it should be backported because it's trivial. There won't be any meaningful performance impact because idle_task_exit() is only used when offlining a CPU. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler") Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-11sched/fair: Fix typo in printk messageMarcin Nowakowski
'schedstats' kernel parameter should be set to enable/disable, so correct the printk hint saying that it should be set to 'enable' rather than 'enabled' to enable scheduler tracepoints. Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1496995229-31245-1-git-send-email-marcin.nowakowski@imgtec.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-10bpf: reset id on spilled regs in clear_all_pkt_pointersDaniel Borkmann
Right now, we don't reset the id of spilled registers in case of clear_all_pkt_pointers(). Given pkt_pointers are highly likely to contain an id, do so by reusing __mark_reg_unknown_value(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10bpf: reset id on CONST_IMM transitionDaniel Borkmann
Whenever we set the register to the type CONST_IMM, we currently don't reset the id to 0. id member is not used in CONST_IMM case, so don't let it become stale, where pruning won't be able to match later on. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10bpf: don't check spilled reg state for non-STACK_SPILLed type slotsDaniel Borkmann
spilled_regs[] state is only used for stack slots of type STACK_SPILL, never for STACK_MISC. Right now, in states_equal(), even if we have old and current stack state of type STACK_MISC, we compare spilled_regs[] for that particular offset. Just skip these like we do everywhere else. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10bpf: avoid excessive stack usage for perf_sample_dataDaniel Borkmann
perf_sample_data consumes 386 bytes on stack, reduce excessive stack usage and move it to per cpu buffer. It's allowed due to preemption being disabled for tracing, xdp and tc programs, thus at all times only one program can run on a specific CPU and programs cannot run from interrupt. We similarly also handle bpf_pt_regs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10Merge branch 'smp-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug fix from Ingo Molnar: "An error handling corner case fix" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Drop the device lock on error
2017-06-10Merge branch 'rcu-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU fixes from Ingo Molnar: "Fix an SRCU bug affecting KVM IRQ injection" * 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: srcu: Allow use of Classic SRCU from both process and interrupt context srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
2017-06-10Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "This is mostly tooling fixes, plus an instruction pointer filtering fix. It's more fixes than usual - Arnaldo got back from a longer vacation and there was a backlog" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) perf symbols: Kill dso__build_id_is_kmod() perf symbols: Keep DSO->symtab_type after decompress perf tests: Decompress kernel module before objdump perf tools: Consolidate error path in __open_dso() perf tools: Decompress kernel module when reading DSO data perf annotate: Use dso__decompress_kmodule_path() perf tools: Introduce dso__decompress_kmodule_{fd,path} perf tools: Fix a memory leak in __open_dso() perf annotate: Fix symbolic link of build-id cache perf/core: Drop kernel samples even though :u is specified perf script python: Remove dups in documentation examples perf script python: Updated trace_unhandled() signature perf script python: Fix wrong code snippets in documentation perf script: Fix documentation errors perf script: Fix outdated comment for perf-trace-python perf probe: Fix examples section of documentation perf report: Ensure the perf DSO mapping matches what libdw sees perf report: Include partial stacks unwound with libdw perf annotate: Add missing powerpc triplet perf test: Disable breakpoint signal tests for powerpc ...
2017-06-09Merge branch 'rcu/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into rcu/urgent Pull RCU fix from Paul E. McKenney: " This series enables srcu_read_lock() and srcu_read_unlock() to be used from interrupt handlers, which fixes a bug in KVM's use of SRCU in delivery of interrupts to guest OSes. " Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08Merge tag 'pm-4.12-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These revert one problematic commit related to system sleep and fix one recent intel_pstate regression. Specifics: - Revert a recent commit that attempted to avoid spurious wakeups from suspend-to-idle via ACPI SCI, but introduced regressions on some systems (Rafael Wysocki). We will get back to the problem it tried to address in the next cycle. - Fix a possible division by 0 during intel_pstate initialization due to a missing check (Rafael Wysocki)" * tag 'pm-4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle" cpufreq: intel_pstate: Avoid division by 0 in min_perf_pct_min()
2017-06-09Merge branches 'intel_pstate' and 'pm-sleep'Rafael J. Wysocki
* intel_pstate: cpufreq: intel_pstate: Avoid division by 0 in min_perf_pct_min() * pm-sleep: Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle"
2017-06-08Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk Pull printk fix from Petr Mladek: "This reverts a fix added into 4.12-rc1. It caused the kernel log to be printed on another console when two consoles of the same type were defined, e.g. console=ttyS0 console=ttyS1. This configuration was never supported by kernel itself, but it started to make sense with systemd. In other words, the commit broke userspace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: Revert "printk: fix double printing with earlycon"
2017-06-08srcu: Allow use of Classic SRCU from both process and interrupt contextPaolo Bonzini
Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting down a guest running iperf on a VFIO assigned device. This happens because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt context, while a worker thread does the same inside kvm_set_irq(). If the interrupt happens while the worker thread is executing __srcu_read_lock(), updates to the Classic SRCU ->lock_count[] field or the Tree SRCU ->srcu_lock_count[] field can be lost. The docs say you are not supposed to call srcu_read_lock() and srcu_read_unlock() from irq context, but KVM interrupt injection happens from (host) interrupt context and it would be nice if SRCU supported the use case. KVM is using SRCU here not really for the "sleepable" part, but rather due to its IPI-free fast detection of grace periods. It is therefore not desirable to switch back to RCU, which would effectively revert commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING", 2014-01-16). However, the docs are overly conservative. You can have an SRCU instance only has users in irq context, and you can mix process and irq context as long as process context users disable interrupts. In addition, __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and Classic SRCU. For those two implementations, only srcu_read_lock() is unsafe. When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(), in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments. Therefore it kept __this_cpu_inc(), with preempt_disable/enable in the caller. Tree SRCU however only does one increment, so on most architectures it is more efficient for __srcu_read_lock() to use this_cpu_inc(), and any performance differences appear to be down in the noise. Cc: stable@vger.kernel.org Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING") Reported-by: Linu Cherian <linuc.decode@gmail.com> Suggested-by: Linu Cherian <linuc.decode@gmail.com> Cc: kvm@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08srcu: Allow use of Tiny/Tree SRCU from both process and interrupt contextPaolo Bonzini
Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting down a guest running iperf on a VFIO assigned device. This happens because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt context, while a worker thread does the same inside kvm_set_irq(). If the interrupt happens while the worker thread is executing __srcu_read_lock(), updates to the Classic SRCU ->lock_count[] field or the Tree SRCU ->srcu_lock_count[] field can be lost. The docs say you are not supposed to call srcu_read_lock() and srcu_read_unlock() from irq context, but KVM interrupt injection happens from (host) interrupt context and it would be nice if SRCU supported the use case. KVM is using SRCU here not really for the "sleepable" part, but rather due to its IPI-free fast detection of grace periods. It is therefore not desirable to switch back to RCU, which would effectively revert commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING", 2014-01-16). However, the docs are overly conservative. You can have an SRCU instance only has users in irq context, and you can mix process and irq context as long as process context users disable interrupts. In addition, __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and Classic SRCU. For those two implementations, only srcu_read_lock() is unsafe. When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(), in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments. Therefore it kept __this_cpu_inc(), with preempt_disable/enable in the caller. Tree SRCU however only does one increment, so on most architectures it is more efficient for __srcu_read_lock() to use this_cpu_inc(), and any performance differences appear to be down in the noise. Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on a single variable. Therefore, as Peter Zijlstra pointed out, Tiny SRCU's implementation already supports mixed-context use of srcu_read_lock() and srcu_read_unlock(), at least as long as uses of srcu_read_lock() and srcu_read_unlock() in each handler are nested and paired properly. In other words, it is still illegal to (say) invoke srcu_read_lock() in an interrupt handler and to invoke the matching srcu_read_unlock() in a softirq handler. Therefore, the only change required for Tiny SRCU is to its comments. Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING") Reported-by: Linu Cherian <linuc.decode@gmail.com> Suggested-by: Linu Cherian <linuc.decode@gmail.com> Cc: kvm@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08Revert "printk: fix double printing with earlycon"Petr Mladek
This reverts commit cf39bf58afdaabc0b86f141630fb3fd18190294e. The commit regression to users that define both console=ttyS1 and console=ttyS0 on the command line, see https://lkml.kernel.org/r/20170509082915.GA13236@bistromath.localdomain The kernel log messages always appeared only on one serial port. It is even documented in Documentation/admin-guide/serial-console.rst: "Note that you can only define one console per device type (serial, video)." The above mentioned commit changed the order in which the command line parameters are searched. As a result, the kernel log messages go to the last mentioned ttyS* instead of the first one. We long thought that using two console=ttyS* on the command line did not make sense. But then we realized that console= parameters were handled also by systemd, see http://0pointer.de/blog/projects/serial-console.html "By default systemd will instantiate one serial-getty@.service on the main kernel console, if it is not a virtual terminal." where "[4] If multiple kernel consoles are used simultaneously, the main console is the one listed first in /sys/class/tty/console/active, which is the last one listed on the kernel command line." This puts the original report into another light. The system is running in qemu. The first serial port is used to store the messages into a file. The second one is used to login to the system via a socket. It depends on systemd and the historic kernel behavior. By other words, systemd causes that it makes sense to define both console=ttyS1 console=ttyS0 on the command line. The kernel fix caused regression related to userspace (systemd) and need to be reverted. In addition, it went out that the fix helped only partially. The messages still were duplicated when the boot console was removed early by late_initcall(printk_late_init). Then the entire log was replayed when the same console was registered as a normal one. Link: 20170606160339.GC7604@pathway.suse.cz Cc: Aleksey Makarov <aleksey.makarov@linaro.org> Cc: Sabrina Dubroca <sd@queasysnail.net> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Jiri Slaby <jslaby@suse.com> Cc: Robin Murphy <robin.murphy@arm.com>, Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Nair, Jayachandran" <Jayachandran.Nair@cavium.com> Cc: linux-serial@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reported-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-06-08perf/core: Drop kernel samples even though :u is specifiedJin Yao
When doing sampling, for example: perf record -e cycles:u ... On workloads that do a lot of kernel entry/exits we see kernel samples, even though :u is specified. This is due to skid existing. This might be a security issue because it can leak kernel addresses even though kernel sampling support is disabled. The patch drops the kernel samples if exclude_kernel is specified. For example, test on Haswell desktop: perf record -e cycles:u <mgen> perf report --stdio Before patch applied: 99.77% mgen mgen [.] buf_read 0.20% mgen mgen [.] rand_buf_init 0.01% mgen [kernel.vmlinux] [k] apic_timer_interrupt 0.00% mgen mgen [.] last_free_elem 0.00% mgen libc-2.23.so [.] __random_r 0.00% mgen libc-2.23.so [.] _int_malloc 0.00% mgen mgen [.] rand_array_init 0.00% mgen [kernel.vmlinux] [k] page_fault 0.00% mgen libc-2.23.so [.] __random 0.00% mgen libc-2.23.so [.] __strcasestr 0.00% mgen ld-2.23.so [.] strcmp 0.00% mgen ld-2.23.so [.] _dl_start 0.00% mgen libc-2.23.so [.] sched_setaffinity@@GLIBC_2.3.4 0.00% mgen ld-2.23.so [.] _start We can see kernel symbols apic_timer_interrupt and page_fault. After patch applied: 99.79% mgen mgen [.] buf_read 0.19% mgen mgen [.] rand_buf_init 0.00% mgen libc-2.23.so [.] __random_r 0.00% mgen mgen [.] rand_array_init 0.00% mgen mgen [.] last_free_elem 0.00% mgen libc-2.23.so [.] vfprintf 0.00% mgen libc-2.23.so [.] rand 0.00% mgen libc-2.23.so [.] __random 0.00% mgen libc-2.23.so [.] _int_malloc 0.00% mgen libc-2.23.so [.] _IO_doallocbuf 0.00% mgen ld-2.23.so [.] do_lookup_x 0.00% mgen ld-2.23.so [.] open_verify.constprop.7 0.00% mgen ld-2.23.so [.] _dl_important_hwcaps 0.00% mgen libc-2.23.so [.] sched_setaffinity@@GLIBC_2.3.4 0.00% mgen ld-2.23.so [.] _start There are only userspace symbols. Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: jolsa@kernel.org Cc: kan.liang@intel.com Cc: mark.rutland@arm.com Cc: will.deacon@arm.com Cc: yao.jin@intel.com Link: http://lkml.kernel.org/r/1495706947-3744-1-git-send-email-yao.jin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Just some simple overlapping changes in marvell PHY driver and the DSA core code. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle"Rafael J. Wysocki
Revert commit eed4d47efe95 (ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle) as it turned out to be premature and triggered a number of different issues on various systems. That includes, but is not limited to, premature suspend-to-RAM aborts on Dell XPS 13 (9343) reported by Dominik. The issue the commit in question attempted to address is real and will need to be taken care of going forward, but evidently more work is needed for this purpose. Reported-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-06bpf: cgroup skb progs cannot access ld_abs/indDaniel Borkmann
Commit fb9a307d11d6 ("bpf: Allow CGROUP_SKB eBPF program to access sk_buff") enabled programs of BPF_PROG_TYPE_CGROUP_SKB type to use ld_abs/ind instructions. However, at this point, we cannot use them, since offsets relative to SKF_LL_OFF will end up pointing skb_mac_header(skb) out of bounds since in the egress path it is not yet set at that point in time, but only after __dev_queue_xmit() did a general reset on the mac header. bpf_internal_load_pointer_neg_helper() will then end up reading data from a wrong offset. BPF_PROG_TYPE_CGROUP_SKB programs can use bpf_skb_load_bytes() already to access packet data, which is also more flexible than the insns carried over from cBPF. Fixes: fb9a307d11d6 ("bpf: Allow CGROUP_SKB eBPF program to access sk_buff") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Chenbo Feng <fengc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>