summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-02-10iommu/arm-smmu: Make use of the iommu_register interfaceJoerg Roedel
Also add the smmu devices to sysfs. Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-02-10iommu: Add iommu_device_set_fwnode() interfaceJoerg Roedel
Allow to store a fwnode in 'struct iommu_device'; Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-02-10iommu: Make iommu_device_link/unlink take a struct iommu_deviceJoerg Roedel
This makes the interface more consistent with iommu_device_sysfs_add/remove. Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-02-10iommu: Add sysfs bindings for struct iommu_deviceJoerg Roedel
There is currently support for iommu sysfs bindings, but those need to be implemented in the IOMMU drivers. Add a more generic version of this by adding a struct device to struct iommu_device and use that for the sysfs bindings. Also convert the AMD and Intel IOMMU driver to make use of it. Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-02-10iommu: Introduce new 'struct iommu_device'Joerg Roedel
This struct represents one hardware iommu in the iommu core code. For now it only has the iommu-ops associated with it, but that will be extended soon. The register/unregister interface is also added, as well as making use of it in the Intel and AMD IOMMU drivers. Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-02-10iommu: Rename struct iommu_deviceJoerg Roedel
The struct is used to link devices to iommu-groups, so 'struct group_device' is a better name. Further this makes the name iommu_device available for a struct representing hardware iommus. Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-02-10iommu: Rename iommu_get_instance()Joerg Roedel
Rename the function to iommu_ops_from_fwnode(), because that is what the function actually does. The new name is much more descriptive about what the function does. Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-02-10arm64: Work around Falkor erratum 1003Christopher Covington
The Qualcomm Datacenter Technologies Falkor v1 CPU may allocate TLB entries using an incorrect ASID when TTBRx_EL1 is being updated. When the erratum is triggered, page table entries using the new translation table base address (BADDR) will be allocated into the TLB using the old ASID. All circumstances leading to the incorrect ASID being cached in the TLB arise when software writes TTBRx_EL1[ASID] and TTBRx_EL1[BADDR], a memory operation is in the process of performing a translation using the specific TTBRx_EL1 being written, and the memory operation uses a translation table descriptor designated as non-global. EL2 and EL3 code changing the EL1&0 ASID is not subject to this erratum because hardware is prohibited from performing translations from an out-of-context translation regime. Consider the following pseudo code. write new BADDR and ASID values to TTBRx_EL1 Replacing the above sequence with the one below will ensure that no TLB entries with an incorrect ASID are used by software. write reserved value to TTBRx_EL1[ASID] ISB write new value to TTBRx_EL1[BADDR] ISB write new value to TTBRx_EL1[ASID] ISB When the above sequence is used, page table entries using the new BADDR value may still be incorrectly allocated into the TLB using the reserved ASID. Yet this will not reduce functionality, since TLB entries incorrectly tagged with the reserved ASID will never be hit by a later instruction. Based on work by Shanker Donthineni <shankerd@codeaurora.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Christopher Covington <cov@codeaurora.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-02-10timerfd: Protect the might cancel mechanism properThomas Gleixner
The handling of the might_cancel queueing is not properly protected, so parallel operations on the file descriptor can race with each other and lead to list corruptions or use after free. Protect the context for these operations with a seperate lock. The wait queue lock cannot be reused for this because that would create a lock inversion scenario vs. the cancel lock. Replacing might_cancel with an atomic (atomic_t or atomic bit) does not help either because it still can race vs. the actual list operation. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "linux-fsdevel@vger.kernel.org" Cc: syzkaller <syzkaller@googlegroups.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701311521430.3457@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10timer_list: Remove useless cast when printingMars Cheng
hrtimer_resolution is already unsigned int, not necessary to cast it when printing. Signed-off-by: Mars Cheng <mars.cheng@mediatek.com> Cc: CC Hwang <cc.hwang@mediatek.com> Cc: wsd_upstream@mediatek.com Cc: Loda Chou <loda.chou@mediatek.com> Cc: Jades Shih <jades.shih@mediatek.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: John Stultz <john.stultz@linaro.org> Cc: My Chuang <my.chuang@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Link: http://lkml.kernel.org/r/1486626615-5879-1-git-send-email-mars.cheng@mediatek.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10time: Remove CONFIG_TIMER_STATSKees Cook
Currently CONFIG_TIMER_STATS exposes process information across namespaces: kernel/time/timer_list.c print_timer(): SEQ_printf(m, ", %s/%d", tmp, timer->start_pid); /proc/timer_list: #11: <0000000000000000>, hrtimer_wakeup, S:01, do_nanosleep, cron/2570 Given that the tracer can give the same information, this patch entirely removes CONFIG_TIMER_STATS. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: linux-doc@vger.kernel.org Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Xing Gao <xgao01@email.wm.edu> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jessica Frazelle <me@jessfraz.com> Cc: kernel-hardening@lists.openwall.com Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Michal Marek <mmarek@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Olof Johansson <olof@lixom.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-api@vger.kernel.org Cc: Arjan van de Ven <arjan@linux.intel.com> Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10x86/mm/ptdump: Fix soft lockup in page table walkerAndrey Ryabinin
CONFIG_KASAN=y needs a lot of virtual memory mapped for its shadow. In that case ptdump_walk_pgd_level_core() takes a lot of time to walk across all page tables and doing this without a rescheduling causes soft lockups: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [swapper/0:1] ... Call Trace: ptdump_walk_pgd_level_core+0x40c/0x550 ptdump_walk_pgd_level_checkwx+0x17/0x20 mark_rodata_ro+0x13b/0x150 kernel_init+0x2f/0x120 ret_from_fork+0x2c/0x40 I guess that this issue might arise even without KASAN on huge machines with several terabytes of RAM. Stick cond_resched() in pgd loop to fix this. Reported-by: Tobias Regnery <tobias.regnery@gmail.com> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: kasan-dev@googlegroups.com Cc: Alexander Potapenko <glider@google.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170210095405.31802-1-aryabinin@virtuozzo.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10debugobjects: Improve variable namingWaiman Long
As suggested by Ingo, the debug_objects_alloc counter is now renamed to debug_objects_allocated with minor twist in comment and debug output. Signed-off-by: Waiman Long <longman@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1486503630-1501-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10x86/tsc: Make the TSC ADJUST sanitizing work for tsc_reliableThomas Gleixner
When the TSC is marked reliable then the synchronization check is skipped, but that also skips the TSC ADJUST sanitizing code. So on a machine with a wreckaged BIOS the TSC deviation between CPUs might go unnoticed. Let the TSC adjust sanitizing code run unconditionally and just skip the expensive synchronization checks when TSC is marked reliable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Olof Johansson <olof@lixom.net> Link: http://lkml.kernel.org/r/20170209151231.491189912@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10x86/tsc: Avoid the large time jump when sanitizing TSC ADJUSTThomas Gleixner
Olof reported that on a machine which has a BIOS wreckaged TSC the timestamps in dmesg are making a large jump because the TSC value is jumping forward after resetting the TSC ADJUST register to a sane value. This can be avoided by calling the TSC ADJUST saniziting function before initializing the per cpu sched clock machinery. That takes the offset into account and avoid the time jump. What cannot be avoided is that the 'Firmware Bug' warnings on the secondary CPUs are printed with the large time offsets because it would be too much effort and ugly hackery to print those warnings into a buffer and emit them after the adjustemt on the starting CPUs. It's a firmware bug and should be fixed in firmware. The weird timestamps are collateral damage and just illustrate the sillyness of the BIOS folks: [ 0.397445] smp: Bringing up secondary CPUs ... [ 0.402100] x86: Booting SMP configuration: [ 0.406343] .... node #0, CPUs: #1 [1265776479.930667] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -2978888639075328 CPU1: -2978888639183101 [1265776479.944664] TSC ADJUST synchronize: Reference CPU0: 0 CPU1: -2978888639183101 [ 0.508119] #2 [1265776480.032346] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -2978888639075328 CPU2: -2978888639183677 [1265776480.044192] TSC ADJUST synchronize: Reference CPU0: 0 CPU2: -2978888639183677 [ 0.607643] #3 [1265776480.131874] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -2978888639075328 CPU3: -2978888639184530 [1265776480.143720] TSC ADJUST synchronize: Reference CPU0: 0 CPU3: -2978888639184530 [ 0.707108] smp: Brought up 1 node, 4 CPUs [ 0.711271] smpboot: Total of 4 processors activated (21698.88 BogoMIPS) Reported-by: Olof Johansson <olof@lixom.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170209151231.411460506@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10tick/nohz: Fix possible missing clock reprog after tick soft restartFrederic Weisbecker
ts->next_tick keeps track of the next tick deadline in order to optimize clock programmation on irq exit and avoid redundant clock device writes. Now if ts->next_tick missed an update, we may spuriously miss a clock reprog later as the nohz code is fooled by an obsolete next_tick value. This is what happens here on a specific path: when we observe an expired timer from the nohz update code on irq exit, we perform a soft tick restart which simply fires the closest possible tick without actually exiting the nohz mode and restoring a periodic state. But we forget to update ts->next_tick accordingly. As a result, after the next tick resulting from such soft tick restart, the nohz code sees a stale value on ts->next_tick which doesn't match the clock deadline that just expired. If that obsolete ts->next_tick value happens to collide with the actual next tick deadline to be scheduled, we may spuriously bypass the clock reprogramming. In the worst case, the tick may never fire again. Fix this with a ts->next_tick reset on soft tick restart. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1486485894-29173-1-git-send-email-fweisbec@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10locking/spinlock/debug: Remove spinlock lockup detection codeWaiman Long
The current spinlock lockup detection code can sometimes produce false positives because of the unfairness of the locking algorithm itself. So the lockup detection code is now removed. Instead, we are relying on the NMI watchdog to detect potential lockup. We won't have lockup detection if the watchdog isn't running. The commented-out read-write lock lockup detection code are also removed. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1486583208-11038-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKSByungchul Park
Bug messages and stack dump for MAX_LOCKDEP_CHAIN_HLOCKS should only be printed once. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1484275324-28192-1-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10perf/core: Allow kernel filters on CPU eventsAlexander Shishkin
While supporting file-based address filters for CPU events requires some extra context switch handling, kernel address filters are easy, since the kernel mapping is preserved across address spaces. It is also useful as it permits tracing scheduling paths of the kernel. This patch allows setting up kernel filters for CPU events. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Will Deacon <will.deacon@arm.com> Cc: vince@deater.net Link: http://lkml.kernel.org/r/20170126094057.13805-4-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10perf/core: Do error out on a kernel filter on an exclude_filter eventAlexander Shishkin
It is currently possible to configure a kernel address filter for a event that excludes kernel from its traces (attr.exclude_kernel==1). While in reality this doesn't make sense, the SET_FILTER ioctl() should return a error in such case, currently it does not. Furthermore, it will still silently discard the filter and any potentially valid filters that came with it. This patch makes the SET_FILTER ioctl() error out in such cases. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Will Deacon <will.deacon@arm.com> Cc: vince@deater.net Link: http://lkml.kernel.org/r/20170126094057.13805-3-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10sched/core: Remove unlikely() annotation from sched_move_task()Steven Rostedt (VMware)
The check for 'running' in sched_move_task() has an unlikely() around it. That is, it is unlikely that the task being moved is running. That use to be true. But with a couple of recent updates, it is now likely that the task will be running. The first change came from ea86cb4b7621 ("sched/cgroup: Fix cpu_cgroup_fork() handling") that moved around the use case of sched_move_task() in do_fork() where the call is now done after the task is woken (hence it is running). The second change came from 8e5bfa8c1f84 ("sched/autogroup: Do not use autogroup->tg in zombie threads") where sched_move_task() is called by the exit path, by the task that is exiting. Hence it too is running. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20170206110426.27ca6426@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10perf/core: Fix crash in perf_event_read()Peter Zijlstra
Alexei had his box explode because doing read() on a package (rapl/uncore) event that isn't currently scheduled in ends up doing an out-of-bounds load. Rework the code to more explicitly deal with event->oncpu being -1. Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Tested-by: Alexei Starovoitov <ast@kernel.org> Tested-by: David Carrillo-Cisneros <davidcc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: eranian@google.com Fixes: d6a2f9035bfc ("perf/core: Introduce PMU_EV_CAP_READ_ACTIVE_PKG") Link: http://lkml.kernel.org/r/20170131102710.GL6515@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10lkdtm: Convert to refcount_t testingKees Cook
Since we'll be using refcount_t instead of atomic_t for refcounting, change the LKDTM tests to reflect the new interface and test conditions. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Hans Liljestrand <ishkamiel@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: dhowells@redhat.com Cc: dwindsor@gmail.com Cc: elena.reshetova@intel.com Cc: gregkh@linuxfoundation.org Cc: h.peter.anvin@intel.com Cc: kernel-hardening@lists.openwall.com Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1486164412-7338-3-git-send-email-keescook@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10kref: Implement 'struct kref' using refcount_tPeter Zijlstra
Use the refcount_t 'atomic' type to implement 'struct kref', this makes kref more robust by bringing saturation semantics. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10refcount_t: Introduce a special purpose refcount typePeter Zijlstra
Provide refcount_t, an atomic_t like primitive built just for refcounting. It provides saturation semantics such that overflow becomes impossible and thereby 'spurious' use-after-free is avoided. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10Merge tag 'perf-core-for-mingo-4.11-20170209' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo: User visible changes: - Add support for parsing Intel uncore vendor event files and add uncore vendor events for the Intel server processors (Haswell, Broadwell, IvyBridge), Xeon Phi (Knights Landing) and Broadwell DE (Andi Kleen) - Support --symfs in 'perf probe' (Uwe Kleine-König) - Add support for generating bpf prologue on the aarch64 architecture (He Kuang) - Show proper hint when SDT event not yet in place via 'perf probe' (Ravi Bangoria) - Take into account symfs setting when reading file build ID (Victor Kamensky) Infrastructure changes: - Map gcc7's '__attribute__ ((fallthrough))', that warns when code associated to case blocks in switches continue into the next case entry, to '__falltrough' and use it where warned by gcc, tested on Fedora Rawhide (Arnaldo Carvalho de Melo) - Fix buffer sizes used with snprintf that could lead to truncation, another warning introduced in gcc7 (Arnaldo Carvalho de Melo) - Robustify do_generate_dynamic_list_file in libtraceevent (David Carrillo-Cisneros) - Use zfree() in more places (Taeung Song) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10ext4: do not use stripe_width if it is not setJan Kara
Avoid using stripe_width for sbi->s_stripe value if it is not actually set. It prevents using the stride for sbi->s_stripe. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-10ext4: fix stripe-unaligned allocationsJan Kara
When a filesystem is created using: mkfs.ext4 -b 4096 -E stride=512 <dev> and we try to allocate 64MB extent, we will end up directly in ext4_mb_complex_scan_group(). This is because the request is detected as power-of-two allocation (so we start in ext4_mb_regular_allocator() with ac_criteria == 0) however the check before ext4_mb_simple_scan_group() refuses the direct buddy scan because the allocation request is too large. Since cr == 0, the check whether we should use ext4_mb_scan_aligned() fails as well and we fall back to ext4_mb_complex_scan_group(). Fix the problem by checking for upper limit on power-of-two requests directly when detecting them. Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-09Merge remote-tracking branch 'mkp-scsi/4.10/scsi-fixes' into fixesJames Bottomley
2017-02-10powerpc/pseries: Fix typo in parameter descriptionWei Yongjun
Fix typo in "hotplug_delay" parameter description. This allows modinfo to match the help text to the parameter. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-09Merge branch 'openvswitch-Conntrack-integration-improvements'David S. Miller
2017-02-09openvswitch: Pack struct sw_flow_key.Jarno Rajahalme
struct sw_flow_key has two 16-bit holes. Move the most matched conntrack match fields there. In some typical cases this reduces the size of the key that needs to be hashed into half and into one cache line. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09openvswitch: Add force commit.Jarno Rajahalme
Stateful network admission policy may allow connections to one direction and reject connections initiated in the other direction. After policy change it is possible that for a new connection an overlapping conntrack entry already exists, where the original direction of the existing connection is opposed to the new connection's initial packet. Most importantly, conntrack state relating to the current packet gets the "reply" designation based on whether the original direction tuple or the reply direction tuple matched. If this "directionality" is wrong w.r.t. to the stateful network admission policy it may happen that packets in neither direction are correctly admitted. This patch adds a new "force commit" option to the OVS conntrack action that checks the original direction of an existing conntrack entry. If that direction is opposed to the current packet, the existing conntrack entry is deleted and a new one is subsequently created in the correct direction. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09openvswitch: Add original direction conntrack tuple to sw_flow_key.Jarno Rajahalme
Add the fields of the conntrack original direction 5-tuple to struct sw_flow_key. The new fields are initially marked as non-existent, and are populated whenever a conntrack action is executed and either finds or generates a conntrack entry. This means that these fields exist for all packets that were not rejected by conntrack as untrackable. The original tuple fields in the sw_flow_key are filled from the original direction tuple of the conntrack entry relating to the current packet, or from the original direction tuple of the master conntrack entry, if the current conntrack entry has a master. Generally, expected connections of connections having an assigned helper (e.g., FTP), have a master conntrack entry. The main purpose of the new conntrack original tuple fields is to allow matching on them for policy decision purposes, with the premise that the admissibility of tracked connections reply packets (as well as original direction packets), and both direction packets of any related connections may be based on ACL rules applying to the master connection's original direction 5-tuple. This also makes it easier to make policy decisions when the actual packet headers might have been transformed by NAT, as the original direction 5-tuple represents the packet headers before any such transformation. When using the original direction 5-tuple the admissibility of return and/or related packets need not be based on the mere existence of a conntrack entry, allowing separation of admission policy from the established conntrack state. While existence of a conntrack entry is required for admission of the return or related packets, policy changes can render connections that were initially admitted to be rejected or dropped afterwards. If the admission of the return and related packets was based on mere conntrack state (e.g., connection being in an established state), a policy change that would make the connection rejected or dropped would need to find and delete all conntrack entries affected by such a change. When using the original direction 5-tuple matching the affected conntrack entries can be allowed to time out instead, as the established state of the connection would not need to be the basis for packet admission any more. It should be noted that the directionality of related connections may be the same or different than that of the master connection, and neither the original direction 5-tuple nor the conntrack state bits carry this information. If needed, the directionality of the master connection can be stored in master's conntrack mark or labels, which are automatically inherited by the expected related connections. The fact that neither ARP nor ND packets are trackable by conntrack allows mutual exclusion between ARP/ND and the new conntrack original tuple fields. Hence, the IP addresses are overlaid in union with ARP and ND fields. This allows the sw_flow_key to not grow much due to this patch, but it also means that we must be careful to never use the new key fields with ARP or ND packets. ARP is easy to distinguish and keep mutually exclusive based on the ethernet type, but ND being an ICMPv6 protocol requires a bit more attention. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09openvswitch: Inherit master's labels.Jarno Rajahalme
We avoid calling into nf_conntrack_in() for expected connections, as that would remove the expectation that we want to stick around until we are ready to commit the connection. Instead, we do a lookup in the expectation table directly. However, after a successful expectation lookup we have set the flow key label field from the master connection, whereas nf_conntrack_in() does not do this. This leads to master's labels being inherited after an expectation lookup, but those labels not being inherited after the corresponding conntrack action with a commit flag. This patch resolves the problem by changing the commit code path to also inherit the master's labels to the expected connection. Resolving this conflict in favor of inheriting the labels allows more information be passed from the master connection to related connections, which would otherwise be much harder if the 32 bits in the connmark are not enough. Labels can still be set explicitly, so this change only affects the default values of the labels in presense of a master connection. Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09openvswitch: Refactor labels initialization.Jarno Rajahalme
Refactoring conntrack labels initialization makes changes in later patches easier to review. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09openvswitch: Simplify labels length logic.Jarno Rajahalme
Since 23014011ba42 ("netfilter: conntrack: support a fixed size of 128 distinct labels"), the size of conntrack labels extension has fixed to 128 bits, so we do not need to check for labels sizes shorter than 128 at run-time. This patch simplifies labels length logic accordingly, but allows the conntrack labels size to be increased in the future without breaking the build. In the event of conntrack labels increasing in size OVS would still be able to deal with the 128 first label bits. Suggested-by: Joe Stringer <joe@ovn.org> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09openvswitch: Unionize ovs_key_ct_label with a u32 array.Jarno Rajahalme
Make the array of labels in struct ovs_key_ct_label an union, adding a u32 array of the same byte size as the existing u8 array. It is faster to loop through the labels 32 bits at the time, which is also the alignment of netlink attributes. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09openvswitch: Do not trigger events for unconfirmed connections.Jarno Rajahalme
Receiving change events before the 'new' event for the connection has been received can be confusing. Avoid triggering change events for setting conntrack mark or labels before the conntrack entry has been confirmed. Fixes: 182e3042e15d ("openvswitch: Allow matching on conntrack mark") Fixes: c2ac66735870 ("openvswitch: Allow matching on conntrack label") Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.Jarno Rajahalme
The conntrack lookup for existing connections fails to invert the packet 5-tuple for NATted packets, and therefore fails to find the existing conntrack entry. Conntrack only stores 5-tuples for incoming packets, and there are various situations where a lookup on a packet that has already been transformed by NAT needs to be made. Looking up an existing conntrack entry upon executing packet received from the userspace is one of them. This patch fixes ovs_ct_find_existing() to invert the packet 5-tuple for the conntrack lookup whenever the packet has already been transformed by conntrack from its input form as evidenced by one of the NAT flags being set in the conntrack state metadata. Fixes: 05752523e565 ("openvswitch: Interface with NAT.") Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09openvswitch: Fix comments for skb->_nfctJarno Rajahalme
Fix comments referring to skb 'nfct' and 'nfctinfo' fields now that they are combined into '_nfct'. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10powerpc/kprobes: Remove kprobe_exceptions_notify()Naveen N. Rao
... as the generic weak variant will do. Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-10kprobes: Introduce weak variant of kprobe_exceptions_notify()Naveen N. Rao
kprobe_exceptions_notify() is not used on some of the architectures such as arm[64] and powerpc anymore. Introduce a weak variant for such architectures. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-09Merge branch 'ena-bug-fixes'David S. Miller
Netanel Belgazal says: ==================== Bug Fixes in ENA driver Changes from V3: * Rebase patchset to master and solve merge conflicts. * Remove redundant bug fix (fix error handling when probe fails) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09net/ena: update driver version to 1.1.2Netanel Belgazal
Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09net/ena: change condition for host attribute configurationNetanel Belgazal
Move the host info config to be the first admin command that is executed. This change require the driver to remove the 'feature check' from host info configuration flow. The check is removed since the supported features bitmask field is retrieved only after calling ENA_ADMIN_DEVICE_ATTRIBUTES admin command. If set host info is not supported an error will be returned by the device. Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09net/ena: change driver's default timeoutsNetanel Belgazal
The timeouts were too agressive and sometimes cause false alarms. Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09net/ena: reduce the severity of ena printoutsNetanel Belgazal
Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09net/ena: use READ_ONCE to access completion descriptorsNetanel Belgazal
Completion descriptors are accessed from the driver and from the device. To avoid reading the old value, use READ_ONCE macro. Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09net/ena: use napi_complete_done() return valueNetanel Belgazal
Do not unamsk interrupts if we are in busy poll mode. Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>