summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2016-11-22x86/kvm: Support the vCPU preemption checkPan Xinhui
Support the vcpu_is_preempted() functionality under KVM. This will enhance lock performance on overcommitted hosts (more runnable vCPUs than physical CPUs in the system) as doing busy waits for preempted vCPUs will hurt system performance far worse than early yielding. struct kvm_steal_time::preempted indicates that if one vCPU is running or not after commit "x86, kvm/x86.c: support vCPU preempted check". unix benchmark result: host: kernel 4.8.1, i5-4570, 4 cpus guest: kernel 4.8.1, 8 vcpus test-case after-patch before-patch Execl Throughput | 18307.9 lps | 11701.6 lps File Copy 1024 bufsize 2000 maxblocks | 1352407.3 KBps | 790418.9 KBps File Copy 256 bufsize 500 maxblocks | 367555.6 KBps | 222867.7 KBps File Copy 4096 bufsize 8000 maxblocks | 3675649.7 KBps | 1780614.4 KBps Pipe Throughput | 11872208.7 lps | 11855628.9 lps Pipe-based Context Switching | 1495126.5 lps | 1490533.9 lps Process Creation | 29881.2 lps | 28572.8 lps Shell Scripts (1 concurrent) | 23224.3 lpm | 22607.4 lpm Shell Scripts (8 concurrent) | 3531.4 lpm | 3211.9 lpm System Call Overhead | 10385653.0 lps | 10419979.0 lps Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: David.Laight@ACULAB.COM Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: benh@kernel.crashing.org Cc: boqun.feng@gmail.com Cc: borntraeger@de.ibm.com Cc: bsingharora@gmail.com Cc: dave@stgolabs.net Cc: jgross@suse.com Cc: kernellwp@gmail.com Cc: konrad.wilk@oracle.com Cc: linuxppc-dev@lists.ozlabs.org Cc: mpe@ellerman.id.au Cc: paulmck@linux.vnet.ibm.com Cc: paulus@samba.org Cc: rkrcmar@redhat.com Cc: virtualization@lists.linux-foundation.org Cc: will.deacon@arm.com Cc: xen-devel-request@lists.xenproject.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-10-git-send-email-xinhui.pan@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22x86/kvm: Support the vCPU preemption checkPan Xinhui
Support the vcpu_is_preempted() functionality under KVM. This will enhance lock performance on overcommitted hosts (more runnable vCPUs than physical CPUs in the system) as doing busy waits for preempted vCPUs will hurt system performance far worse than early yielding. Use struct kvm_steal_time::preempted to indicate that if a vCPU is running or not. Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: David.Laight@ACULAB.COM Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: benh@kernel.crashing.org Cc: boqun.feng@gmail.com Cc: borntraeger@de.ibm.com Cc: bsingharora@gmail.com Cc: dave@stgolabs.net Cc: jgross@suse.com Cc: kernellwp@gmail.com Cc: konrad.wilk@oracle.com Cc: linuxppc-dev@lists.ozlabs.org Cc: mpe@ellerman.id.au Cc: paulmck@linux.vnet.ibm.com Cc: paulus@samba.org Cc: rkrcmar@redhat.com Cc: virtualization@lists.linux-foundation.org Cc: will.deacon@arm.com Cc: xen-devel-request@lists.xenproject.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-9-git-send-email-xinhui.pan@linux.vnet.ibm.com [ Typo fixes. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22kvm: Introduce kvm_write_guest_offset_cached()Pan Xinhui
It allows us to update some status or field of a structure partially. We can also save a kvm_read_guest_cached() call if we just update one fild of the struct regardless of its current value. Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: David.Laight@ACULAB.COM Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: benh@kernel.crashing.org Cc: boqun.feng@gmail.com Cc: borntraeger@de.ibm.com Cc: bsingharora@gmail.com Cc: dave@stgolabs.net Cc: jgross@suse.com Cc: kernellwp@gmail.com Cc: konrad.wilk@oracle.com Cc: linuxppc-dev@lists.ozlabs.org Cc: mpe@ellerman.id.au Cc: paulmck@linux.vnet.ibm.com Cc: paulus@samba.org Cc: rkrcmar@redhat.com Cc: virtualization@lists.linux-foundation.org Cc: will.deacon@arm.com Cc: xen-devel-request@lists.xenproject.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-8-git-send-email-xinhui.pan@linux.vnet.ibm.com [ Typo fixes. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen ↵Pan Xinhui
guests Optimize spinlock and mutex busy-loops by providing a vcpu_is_preempted(cpu) function on KVM and Xen platforms. Extend the pv_lock_ops interface accordingly and implement the callbacks on KVM and Xen. Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [ Translated to English. ] Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: David.Laight@ACULAB.COM Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: benh@kernel.crashing.org Cc: boqun.feng@gmail.com Cc: borntraeger@de.ibm.com Cc: bsingharora@gmail.com Cc: dave@stgolabs.net Cc: jgross@suse.com Cc: kernellwp@gmail.com Cc: konrad.wilk@oracle.com Cc: linuxppc-dev@lists.ozlabs.org Cc: mpe@ellerman.id.au Cc: paulmck@linux.vnet.ibm.com Cc: paulus@samba.org Cc: rkrcmar@redhat.com Cc: virtualization@lists.linux-foundation.org Cc: will.deacon@arm.com Cc: xen-devel-request@lists.xenproject.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-7-git-send-email-xinhui.pan@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22locking/spinlocks, s390: Implement vcpu_is_preempted(cpu)Christian Borntraeger
This implements the s390 version for vcpu_is_preempted(cpu), by reworking the existing smp_vcpu_scheduled() function into arch_vcpu_is_preempted(). We can then also get rid of the local cpu_is_preempted() function by moving the CIF_ENABLED_WAIT test into arch_vcpu_is_preempted(). Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David.Laight@ACULAB.COM Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: benh@kernel.crashing.org Cc: boqun.feng@gmail.com Cc: bsingharora@gmail.com Cc: dave@stgolabs.net Cc: jgross@suse.com Cc: kernellwp@gmail.com Cc: konrad.wilk@oracle.com Cc: linuxppc-dev@lists.ozlabs.org Cc: mpe@ellerman.id.au Cc: paulmck@linux.vnet.ibm.com Cc: paulus@samba.org Cc: pbonzini@redhat.com Cc: rkrcmar@redhat.com Cc: virtualization@lists.linux-foundation.org Cc: will.deacon@arm.com Cc: xen-devel-request@lists.xenproject.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-6-git-send-email-xinhui.pan@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22locking/core, powerpc: Implement vcpu_is_preempted(cpu)Pan Xinhui
Optimize spinlock and mutex busy-loops by providing a vcpu_is_preempted(cpu) function on pSeries. We do not support PowerNV. All this can be achieved by using lppaca->yield_count, which is zero on PowerNV. Suggested-by: Boqun Feng <boqun.feng@gmail.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: David.Laight@ACULAB.COM Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: benh@kernel.crashing.org Cc: borntraeger@de.ibm.com Cc: bsingharora@gmail.com Cc: dave@stgolabs.net Cc: jgross@suse.com Cc: kernellwp@gmail.com Cc: konrad.wilk@oracle.com Cc: linuxppc-dev@lists.ozlabs.org Cc: mpe@ellerman.id.au Cc: paulmck@linux.vnet.ibm.com Cc: paulus@samba.org Cc: pbonzini@redhat.com Cc: rkrcmar@redhat.com Cc: virtualization@lists.linux-foundation.org Cc: will.deacon@arm.com Cc: xen-devel-request@lists.xenproject.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-5-git-send-email-xinhui.pan@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22sched/core: Introduce the vcpu_is_preempted(cpu) interfacePan Xinhui
This patch is the first step to add support to improve lock holder preemption beaviour. vcpu_is_preempted(cpu) does the obvious thing: it tells us whether a vCPU is preempted or not. Defaults to false on architectures that don't support it. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juergen Gross <jgross@suse.com> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [ Translated the changelog to English. ] Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: David.Laight@ACULAB.COM Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: benh@kernel.crashing.org Cc: boqun.feng@gmail.com Cc: bsingharora@gmail.com Cc: dave@stgolabs.net Cc: kernellwp@gmail.com Cc: konrad.wilk@oracle.com Cc: linuxppc-dev@lists.ozlabs.org Cc: mpe@ellerman.id.au Cc: paulmck@linux.vnet.ibm.com Cc: paulus@samba.org Cc: rkrcmar@redhat.com Cc: virtualization@lists.linux-foundation.org Cc: will.deacon@arm.com Cc: xen-devel-request@lists.xenproject.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-2-git-send-email-xinhui.pan@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22perf/x86/intel/uncore: Allow only a single PMU/box within an events groupPeter Zijlstra
Group validation expects all events to be of the same PMU; however is_uncore_pmu() is too wide, it matches _all_ uncore events, even across PMUs. This triggers failure when we group different events from different uncore PMUs, like: perf stat -vv -e '{uncore_cbox_0/config=0x0334/,uncore_qpi_0/event=1/}' -a sleep 1 Fix is_uncore_pmu() by only matching events to the box at hand. Note that generic code; ran after this step; will disallow this mixture of PMU events. Reported-by: Jiri Olsa <jolsa@redhat.com> Tested-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vince@deater.net> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/20161118125354.GQ3117@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22perf/x86/intel: Cure bogus unwind from PEBS entriesPeter Zijlstra
Vince Weaver reported that perf_fuzzer + KASAN detects that PEBS event unwinds sometimes do 'weird' things. In particular, we seemed to be ending up unwinding from random places on the NMI stack. While it was somewhat expected that the event record BP,SP would not match the interrupt BP,SP in that the interrupt is strictly later than the record event, it was overlooked that it could be on an already overwritten stack. Therefore, don't copy the recorded BP,SP over the interrupted BP,SP when we need stack unwinds. Note that its still possible the unwind doesn't full match the actual event, as its entirely possible to have done an (I)RET between record and interrupt, but on average it should still point in the general direction of where the event came from. Also, it's the best we can do, considering. The particular scenario that triggered the bogus NMI stack unwind was a PEBS event with very short period, upon enabling the event at the tail of the PMI handler (FREEZE_ON_PMI is not used), it instantly triggers a record (while still on the NMI stack) which in turn triggers the next PMI. This then causes back-to-back NMIs and we'll try and unwind the stack-frame from the last NMI, which obviously is now overwritten by our own. Analyzed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@gmail.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davej@codemonkey.org.uk <davej@codemonkey.org.uk> Cc: dvyukov@google.com <dvyukov@google.com> Cc: stable@vger.kernel.org Fixes: ca037701a025 ("perf, x86: Add PEBS infrastructure") Link: http://lkml.kernel.org/r/20161117171731.GV3157@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22perf/x86: Restore TASK_SIZE check on frame pointerJohannes Weiner
The following commit: 75925e1ad7f5 ("perf/x86: Optimize stack walk user accesses") ... switched from copy_from_user_nmi() to __copy_from_user_nmi() with a manual access_ok() check. Unfortunately, copy_from_user_nmi() does an explicit check against TASK_SIZE, whereas the access_ok() uses whatever the current address limit of the task is. We are getting NMIs when __probe_kernel_read() has switched to KERNEL_DS, and then see vmalloc faults when we access what looks like pointers into vmalloc space: [] WARNING: CPU: 3 PID: 3685731 at arch/x86/mm/fault.c:435 vmalloc_fault+0x289/0x290 [] CPU: 3 PID: 3685731 Comm: sh Tainted: G W 4.6.0-5_fbk1_223_gdbf0f40 #1 [] Call Trace: [] <NMI> [<ffffffff814717d1>] dump_stack+0x4d/0x6c [] [<ffffffff81076e43>] __warn+0xd3/0xf0 [] [<ffffffff81076f2d>] warn_slowpath_null+0x1d/0x20 [] [<ffffffff8104a899>] vmalloc_fault+0x289/0x290 [] [<ffffffff8104b5a0>] __do_page_fault+0x330/0x490 [] [<ffffffff8104b70c>] do_page_fault+0xc/0x10 [] [<ffffffff81794e82>] page_fault+0x22/0x30 [] [<ffffffff81006280>] ? perf_callchain_user+0x100/0x2a0 [] [<ffffffff8115124f>] get_perf_callchain+0x17f/0x190 [] [<ffffffff811512c7>] perf_callchain+0x67/0x80 [] [<ffffffff8114e750>] perf_prepare_sample+0x2a0/0x370 [] [<ffffffff8114e840>] perf_event_output+0x20/0x60 [] [<ffffffff8114aee7>] ? perf_event_update_userpage+0xc7/0x130 [] [<ffffffff8114ea01>] __perf_event_overflow+0x181/0x1d0 [] [<ffffffff8114f484>] perf_event_overflow+0x14/0x20 [] [<ffffffff8100a6e3>] intel_pmu_handle_irq+0x1d3/0x490 [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] [<ffffffff81197191>] ? vunmap_page_range+0x1a1/0x2f0 [] [<ffffffff811972f1>] ? unmap_kernel_range_noflush+0x11/0x20 [] [<ffffffff814f2056>] ? ghes_copy_tofrom_phys+0x116/0x1f0 [] [<ffffffff81040d1d>] ? x2apic_send_IPI_self+0x1d/0x20 [] [<ffffffff8100411d>] perf_event_nmi_handler+0x2d/0x50 [] [<ffffffff8101ea31>] nmi_handle+0x61/0x110 [] [<ffffffff8101ef94>] default_do_nmi+0x44/0x110 [] [<ffffffff8101f13b>] do_nmi+0xdb/0x150 [] [<ffffffff81795187>] end_repeat_nmi+0x1a/0x1e [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] <<EOE>> <IRQ> [<ffffffff8115d05e>] ? __probe_kernel_read+0x3e/0xa0 Fix this by moving the valid_user_frame() check to before the uaccess that loads the return address and the pointer to the next frame. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Fixes: 75925e1ad7f5 ("perf/x86: Optimize stack walk user accesses") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22sched/autogroup: Do not use autogroup->tg in zombie threadsOleg Nesterov
Exactly because for_each_thread() in autogroup_move_group() can't see it and update its ->sched_task_group before _put() and possibly free(). So the exiting task needs another sched_move_task() before exit_notify() and we need to re-introduce the PF_EXITING (or similar) check removed by the previous change for another reason. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: hartsjc@redhat.com Cc: vbendel@redhat.com Cc: vlovejoy@redhat.com Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task()Oleg Nesterov
The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove it, but see the next patch. However the comment is correct in that autogroup_move_group() must always change task_group() for every thread so the sysctl_ check is very wrong; we can race with cgroups and even sys_setsid() is not safe because a task running with task_group() == ag->tg must participate in refcounting: int main(void) { int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY); assert(sctl > 0); if (fork()) { wait(NULL); // destroy the child's ag/tg pause(); } assert(pwrite(sctl, "1\n", 2, 0) == 2); assert(setsid() > 0); if (fork()) pause(); kill(getppid(), SIGKILL); sleep(1); // The child has gone, the grandchild runs with kref == 1 assert(pwrite(sctl, "0\n", 2, 0) == 2); assert(setsid() > 0); // runs with the freed ag/tg for (;;) sleep(1); return 0; } crashes the kernel. It doesn't really need sleep(1), it doesn't matter if autogroup_move_group() actually frees the task_group or this happens later. Reported-by: Vern Lovejoy <vlovejoy@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: hartsjc@redhat.com Cc: vbendel@redhat.com Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22x86/mce/AMD: Add system physical address translation for AMD Fam17hYazen Ghannam
The Unified Memory Controllers (UMCs) on Fam17h log a normalized address in their MCA_ADDR registers. We need to convert that normalized address to a system physical address in order to support a few facilities: 1) To offline poisoned pages in DRAM proactively in the deferred error handler. 2) To print sysaddr and page info for DRAM ECC errors in EDAC. [ Boris: fixes/cleanups ontop: * hi_addr_offset = 0 - no need for that branch. Stick it all under the HiAddrOffsetEn case. It confines hi_addr_offset's declaration too. * Move variables to the innermost scope they're used at so that we save on stack and not blow it up immediately on function entry. * Do not modify *sys_addr prematurely - we want to not exit early and have modified *sys_addr some, which callers get to see. We either convert to a sys_addr or we don't do anything. And we signal that with the retval of the function. * Rename label out -> out_err - because it is the error path. * No need to pr_err of the conversion failed case: imagine a sparsely-populated machine with UMCs which don't have DIMMs. Callers should look at the retval instead and issue a printk only when really necessary. No need for useless info in dmesg. * s/temp_reg/tmp/ and other variable names shortening => shorter code. * Use BIT() everywhere. * Make error messages more informative. * Small build fix for the !CONFIG_X86_MCE_AMD case. * ... and more minor cleanups. ] Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20161122111133.mjzpvzhf7o7yl2oa@pd.tnic [ Typo fixes. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22powerpc: Fix missing CRCs, add more asm-prototypes.h declarationsNicholas Piggin
After patch 4efca4ed0 ("kbuild: modversions for EXPORT_SYMBOL() for asm"), asm exports can get modversions CRCs generated if they have C definitions in asm-prototypes.h. This patch adds missing definitions for 32 and 64 bit allmodconfig builds. Fixes: 9445aa1a3062 ("ppc: move exports to definitions") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-22genirq/msi: Drop artificial PCI dependencyMarc Zyngier
The generic MSI layer doesn't have any PCI ties anymore, and the build hack should have been removed some time ago. Fixes: d9109698be6e ("genirq: Introduce msi_domain_alloc/free_irqs()") Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Link: http://lkml.kernel.org/r/1479806476-20801-1-git-send-email-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomapArvind Yadav
Free memory mapping, if bcm2835_timer_init is not successful. Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Reviewed-by: Eric Anholt <eric@anholt.net> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2016-11-22crypto: scatterwalk - Remove unnecessary aliasing check in map_and_copyHerbert Xu
The aliasing check in map_and_copy is no longer necessary because the IPsec ESP code no longer provides an IV that points into the actual request data. As this check is now triggering BUG checks due to the vmalloced stack code, I'm removing it. Reported-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-11-22crypto: algif_hash - Fix result clobbering in recvmsgHerbert Xu
Recently an init call was added to hash_recvmsg so as to reset the hash state in case a sendmsg call was never made. Unfortunately this ended up clobbering the result if the previous sendmsg was done with a MSG_MORE flag. This patch fixes it by excluding that case when we make the init call. Fixes: a8348bca2944 ("algif_hash - Fix NULL hash crash with shash") Reported-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-11-22powerpc: Set missing wakeup bit in LPCR on POWER9Benjamin Herrenschmidt
There is a new bit, LPCR_PECE_HVEE (Hypervisor Virtualization Exit Enable), which controls wakeup from STOP states on Hypervisor Virtualization Interrupts (which happen to also be all external interrupts in host or bare metal mode). It needs to be set or we will miss wakeups. Fixes: 9baaef0a22c8 ("powerpc/irq: Add support for HV virtualization interrupts") Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Rename it to HVEE to match the name in the ISA] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull apparmor bugfix from James Morris: "This has a fix for a policy replacement bug that is fairly serious for apache mod_apparmor users, as it results in the wrong policy being applied on an network facing service" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: apparmor: fix change_hat not finding hat after policy replacement
2016-11-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds
Pull sparc fixes from David Miller: 1) With modern networking cards we can run out of 32-bit DMA space, so support 64-bit DMA addressing when possible on sparc64. From Dave Tushar. 2) Some signal frame validation checks are inverted on sparc32, fix from Andreas Larsson. 3) Lockdep tables can get too large in some circumstances on sparc64, add a way to adjust the size a bit. From Babu Moger. 4) Fix NUMA node probing on some sun4v systems, from Thomas Tai. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc: drop duplicate header scatterlist.h lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc sunbmac: Fix compiler warning sunqe: Fix compiler warnings sparc64: Enable 64-bit DMA sparc64: Enable sun4v dma ops to use IOMMU v2 APIs sparc64: Bind PCIe devices to use IOMMU v2 service sparc64: Initialize iommu_map_table and iommu_pool sparc64: Add ATU (new IOMMU) support sparc64: Add FORCE_MAX_ZONEORDER and default to 13 sparc64: fix compile warning section mismatch in find_node() sparc32: Fix inverted invalid_frame_pointer checks on sigreturns sparc64: Fix find_node warning if numa node cannot be found
2016-11-21watchdog: wdat_wdt: Select WATCHDOG_COREMika Westerberg
The WDAT watchdog driver uses functionality provided by the watchdog timer core but it did not select it explicitly. This results following linker error when only WDAT_WDT is enabled in Kconfig: drivers/built-in.o: In function `wdat_wdt_probe': drivers/watchdog/wdat_wdt.c:444: undefined reference to `devm_watchdog_register_device' Fix this by explicitly selecting WATCHDOG_CORE when WDAT watchdog driver is enabled. Fixes: 058dfc767008 (ACPI / watchdog: Add support for WDAT hardware watchdog) Reported-by: Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Clear congestion control state when changing algorithms on an existing socket, from Florian Westphal. 2) Fix register bit values in altr_tse_pcs portion of stmmac driver, from Jia Jie Ho. 3) Fix PTP handling in stammc driver for GMAC4, from Giuseppe CAVALLARO. 4) Fix udplite multicast delivery handling, it ignores the udp_table parameter passed into the lookups, from Pablo Neira Ayuso. 5) Synchronize the space estimated by rtnl_vfinfo_size and the space actually used by rtnl_fill_vfinfo. From Sabrina Dubroca. 6) Fix memory leak in fib_info when splitting nodes, from Alexander Duyck. 7) If a driver does a napi_hash_del() explicitily and not via netif_napi_del(), it must perform RCU synchronization as needed. Fix this in virtio-net and bnxt drivers, from Eric Dumazet. 8) Likewise, it is not necessary to invoke napi_hash_del() is we are also doing neif_napi_del() in the same code path. Remove such calls from be2net and cxgb4 drivers, also from Eric Dumazet. 9) Don't allocate an ID in peernet2id_alloc() if the netns is dead, from WANG Cong. 10) Fix OF node and device struct leaks in of_mdio, from Johan Hovold. 11) We cannot cache routes in ip6_tunnel when using inherited traffic classes, from Paolo Abeni. 12) Fix several crashes and leaks in cpsw driver, from Johan Hovold. 13) Splice operations cannot use freezable blocking calls in AF_UNIX, from WANG Cong. 14) Link dump filtering by master device and kind support added an error in loop index updates during the dump if we actually do filter, fix from Zhang Shengju. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits) tcp: zero ca_priv area when switching cc algorithms net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC tipc: eliminate obsolete socket locking policy description rtnl: fix the loop index update error in rtnl_dump_ifinfo() l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind() net: macb: add check for dma mapping error in start_xmit() rtnetlink: fix FDB size computation netns: fix get_net_ns_by_fd(int pid) typo af_unix: conditionally use freezable blocking calls in read net: ethernet: ti: cpsw: fix fixed-link phy probe deferral net: ethernet: ti: cpsw: add missing sanity check net: ethernet: ti: cpsw: fix secondary-emac probe error path net: ethernet: ti: cpsw: fix of_node and phydev leaks net: ethernet: ti: cpsw: fix deferred probe net: ethernet: ti: cpsw: fix mdio device reference leak net: ethernet: ti: cpsw: fix bad register access in probe error path net: sky2: Fix shutdown crash cfg80211: limit scan results cache size net sched filters: pass netlink message flags in event notification ...
2016-11-21net: ieee802154: constify ieee802154_ops structuresBhumika Goyal
Declare the structure ieee802154_ops as const as it is only passed as an argument to the function ieee802154_alloc_hw. This argument is of type const struct ieee802154_ops *, so ieee80254_ops structures having this property can be declared as const. Done using Coccinelle: @r1 disable optional_qualifier @ identifier i; position p; @@ static struct ieee802154_ops i@p = {...}; @ok1@ identifier r1.i; position p; expression e1; @@ ieee802154_alloc_hw(e1,&i@p) @bad@ position p!={r1.p,ok1.p}; identifier r1.i; @@ i@p @depends on !bad disable optional_qualifier@ identifier r1.i; @@ static +const struct ieee802154_ops i={...}; @depends on !bad disable optional_qualifier@ identifier r1.i; @@ +const struct ieee802154_ops i; The before and after size details of the affected files are: text data bss dec hex filename 8669 1176 16 9861 2685 drivers/net/ieee802154/adf7242.o 8805 1048 16 9869 268d drivers/net/ieee802154/adf7242.o text data bss dec hex filename 7211 2296 32 9539 2543 drivers/net/ieee802154/atusb.o 7339 2160 32 9531 253b drivers/net/ieee802154/atusb.o Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Acked-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21Merge branch 'geneve-lwt-efficiency'David S. Miller
Pravin B Shelar says: ==================== geneve: Use LWT more effectively. Following patch series make use of geneve LWT code path for geneve netdev type of device. This allows us to simplify geneve module without changing any functionality. v2-v3: Rebase against latest net-next. v1-v2: Fix warning reported by kbuild test robot. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21geneve: Optimize geneve device lookup.pravin shelar
Rather than comparing 64-bit tunnel-id, compare tunnel vni which is 24-bit id. This also save conversion from vni to tunnel id on each tunnel packet receive. Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21geneve: Remove redundant socket checks.pravin shelar
Geneve already has check for device socket in route lookup function. So no need to check it in xmit function. Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21geneve: Merge ipv4 and ipv6 geneve_build_skb()pravin shelar
There are minimal difference in building Geneve header between ipv4 and ipv6 geneve tunnels. Following patch refactors code to unify it. Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21geneve: Unify LWT and netdev handling.pravin shelar
Current geneve implementation has two separate cases to handle. 1. netdev xmit 2. LWT xmit. In case of netdev, geneve configuration is stored in various struct geneve_dev members. For example geneve_addr, ttl, tos, label, flags, dst_cache, etc. For LWT ip_tunnel_info is passed to the device in ip_tunnel_info. Following patch uses ip_tunnel_info struct to store almost all of configuration of a geneve netdevice. This allows us to unify most of geneve driver code around ip_tunnel_info struct. This dramatically simplify geneve code, since it does not need to handle two different configuration cases. Removes duplicate code, single code path can handle either type of geneve devices. Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21clk: sunxi-ng: sun6i-a31: Enable PLL-MIPI LDOs when ungating itChen-Yu Tsai
The PLL-MIPI clock is somewhat special as it has its own LDOs which need to be turned on for this PLL to actually work and output a clock signal. Add the 2 LDO enable bits to the gate bits. This fixes issues with the TCON not sending vblank interrupts when the tcon and dot clock are indirectly clocked from the PLL-MIPI clock. Fixes: c6e6c96d8fa6 ("clk: sunxi-ng: Add A31/A31s clocks") Signed-off-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
2016-11-21Merge branch 'tcp-cong-undo_cwnd-mandatory'David S. Miller
Florian Westphal says: ==================== tcp: make undo_cwnd mandatory for congestion modules highspeed, illinois, scalable, veno and yeah congestion control algorithms don't provide a 'cwnd_undo' function. This makes the stack default to a 'reno undo' which doubles cwnd. However, the ssthresh implementation of these algorithms do not halve the slowstart threshold. This causes similar issue as the one fixed for dctcp in ce6dd23329b1e ("dctcp: avoid bogus doubling of cwnd after loss"). In light of this it seems better to remove the fallback and make undo_cwnd mandatory. First patch fixes those spots where reno undo seems incorrect by providing .cwnd_undo functions, second patch removes the fallback. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21tcp: make undo_cwnd mandatory for congestion modulesFlorian Westphal
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh, which un-does reno halving behaviour. It seems more appropriate to let congctl algorithms pair .ssthresh and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it up for all congestion algorithms that used to rely on the fallback. Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21tcp: add cwnd_undo functions to various tcp cc algorithmsFlorian Westphal
congestion control algorithms that do not halve cwnd in their .ssthresh should provide a .cwnd_undo rather than rely on current fallback which assumes reno halving (and thus doubles the cwnd). All of these do 'something else' in their .ssthresh implementation, thus store the cwnd on loss and provide .undo_cwnd to restore it again. A followup patch will remove the fallback and all algorithms will need to provide a .cwnd_undo function. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21Merge branch 'bridge-igmpv3-mldv2-support'David S. Miller
Nikolay Aleksandrov says: ==================== bridge: add support for IGMPv3 and MLDv2 querier This patch-set adds support for IGMPv3 and MLDv2 querier in the bridge. Two new options which can be toggled via netlink and sysfs are added that control the version per-bridge: multicast_igmp_version - default 2, can be set to 3 multicast_mld_version - default 1, can be set to 2 (this option is disabled if CONFIG_IPV6=n) Note that the names do not include "querier", I think that these options can be re-used later as more IGMPv3 support is added to the bridge so we can avoid adding more options to switch between v2 and v3 behaviour. The set uses the already existing br_ip{4,6}_multicast_alloc_query functions and adds the appropriate header based on the chosen version. For the initial support I have removed the compatibility implementation (RFC3376 sec 7.3.1, 7.3.2; RFC3810 sec 8.3.1, 8.3.2), because there are some details that we need to sort out. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21bridge: mcast: add MLDv2 querier supportNikolay Aleksandrov
This patch adds basic support for MLDv2 queries, the default is MLDv1 as before. A new multicast option - multicast_mld_version, adds the ability to change it between 1 and 2 via netlink and sysfs. The MLD option is disabled if CONFIG_IPV6 is disabled. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21bridge: mcast: add IGMPv3 query supportNikolay Aleksandrov
This patch adds basic support for IGMPv3 queries, the default is IGMPv2 as before. A new multicast option - multicast_igmp_version, adds the ability to change it between 2 and 3 via netlink and sysfs. The option struct member is in a 4 byte hole in net_bridge. There also a few minor style adjustments in br_multicast_new_group and br_multicast_add_group. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21tcp: zero ca_priv area when switching cc algorithmsFlorian Westphal
We need to zero out the private data area when application switches connection to different algorithm (TCP_CONGESTION setsockopt). When congestion ops get assigned at connect time everything is already zeroed because sk_alloc uses GFP_ZERO flag. But in the setsockopt case this contains whatever previous cc placed there. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmitGao Feng
The tc could return NET_XMIT_CN as one congestion notification, but it does not mean the packe is lost. Other modules like ipvlan, macvlan, and others treat NET_XMIT_CN as success too. So l2tp_eth_dev_xmit should add the NET_XMIT_CN check. Signed-off-by: Gao Feng <gfree.wind@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21driver: macvlan: Remove duplicated IFF_UP condition check in ↵Gao Feng
macvlan_forward_source The function macvlan_forward_source_one has already checked the flag IFF_UP, so needn't check it outside in macvlan_forward_source too. Signed-off-by: Gao Feng <gfree.wind@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21NFSv4.1: Keep a reference on lock states while checkingBenjamin Coddington
While walking the list of lock_states, keep a reference on each nfs4_lock_state to be checked, otherwise the lock state could be removed while the check performs TEST_STATEID and possible FREE_STATEID. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-21ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoCPeter Robinson
There's not much point, except compile test, enabling the stmmac platform drivers unless the STM32 SoC is enabled. It's not useful without it. Signed-off-by: Peter Robinson <pbrobinson@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21mlx4: avoid unnecessary dirtying of critical fieldsEric Dumazet
While stressing a 40Gbit mlx4 NIC with busy polling, I found false sharing in mlx4 driver that can be easily avoided. This patch brings an additional 7 % performance improvement in UDP_RR workload. 1) If we received no frame during one mlx4_en_process_rx_cq() invocation, no need to call mlx4_cq_set_ci() and/or dirty ring->cons 2) Do not refill rx buffers if we have plenty of them. This avoids false sharing and allows some bulk/batch optimizations. Page allocator and its locks will thank us. Finally, mlx4_en_poll_rx_cq() should not return 0 if it determined cpu handling NIC IRQ should be changed. We should return budget-1 instead, to not fool net_rx_action() and its netdev_budget. v2: keep AVG_PERF_COUNTER(... polled) even if polled is 0 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21bnx2: use READ_ONCE() instead of barrier()Eric Dumazet
barrier() is a big hammer compared to READ_ONCE(), and requires comments explaining what is protected. READ_ONCE() is more precise and compiler should generate better overall code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21udp: avoid one cache line miss in recvmsg()Eric Dumazet
UDP_SKB_CB(skb)->partial_cov is located at offset 66 in skb, requesting a cold cache line being read in cpu cache. We can avoid this cache line miss for UDP sockets, as partial_cov has a meaning only for UDPLite. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21Merge branch 'mlx5-bpf-refcnt-fixes'David S. Miller
Daniel Borkmann says: ==================== Couple of BPF refcount fixes for mlx5 Various mlx5 bugs on eBPF refcount handling found during review. Last patch in series adds a __must_check to BPF helpers to make sure we won't run into it again w/o compiler complaining first. v2 -> v3: - Just reworked patch 2/4 so we don't need bpf_prog_sub(). - Rebased, rest as is. v1 -> v2: - After discussion with Alexei, we agreed upon rebasing the patches against net-next. - Since net-next, I've also added the __must_check to enforce future users to check for errors. - Fixed up commit message #2. - Simplify assignment from patch #1 based on Saeed's feedback on previous set. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21bpf: add __must_check attributes to refcount manipulating helpersDaniel Borkmann
Helpers like bpf_prog_add(), bpf_prog_inc(), bpf_map_inc() can fail with an error, so make sure the caller properly checks their return value and not just ignores it, which could worst-case lead to use after free. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21bpf, mlx5: drop priv->xdp_prog reference on netdev cleanupDaniel Borkmann
mlx5e_xdp_set() is currently the only place where we drop reference on the prog sitting in priv->xdp_prog when it's exchanged by a new one. We also need to make sure that we eventually release that reference, for example, in case the netdev is dismantled, otherwise we leak the program. Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21bpf, mlx5: fix various refcount issues in mlx5e_xdp_setDaniel Borkmann
There are multiple issues in mlx5e_xdp_set(): 1) The batched bpf_prog_add() is currently not checked for errors. When doing so, it should be done at an earlier point in time to makes sure that we cannot fail anymore at the time we want to set the program for each channel. The batched refs short-cut can only be performed when we don't need to perform a reset for changing the rq type and the device was in opened state. In case the device was not in opened state, then the next mlx5e_open_locked() will aquire the refs from the control prog via mlx5e_create_rq(), same when we need to perform a reset. 2) When swapping the priv->xdp_prog, then no extra reference count must be taken since we got that from call path via dev_change_xdp_fd() already. Otherwise, we'd never be able to release the program. Also, bpf_prog_add() without checking the return code could fail. Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21bpf, mlx5: fix mlx5e_create_rq taking reference on progDaniel Borkmann
In mlx5e_create_rq(), when creating a new queue, we call bpf_prog_add() but without checking the return value. bpf_prog_add() can fail since 92117d8443bc ("bpf: fix refcnt overflow"), so we really must check it. Take the reference right when we assign it to the rq from priv->xdp_prog, and just drop the reference on error path. Destruction in mlx5e_destroy_rq() looks good, though. Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>