summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-09-10sched/fair: Fix load_balance redo for !imbalanceVincent Guittot
It can happen that load_balance() finds a busiest group and then a busiest rq but the calculated imbalance is in fact 0. In such situation, detach_tasks() returns immediately and lets the flag LBF_ALL_PINNED set. The busiest CPU is then wrongly assumed to have pinned tasks and removed from the load balance mask. then, we redo a load balance without the busiest CPU. This creates wrong load balance situation and generates wrong task migration. If the calculated imbalance is 0, it's useless to try to find a busiest rq as no task will be migrated and we can return immediately. This situation can happen with heterogeneous system or smp system when RT tasks are decreasing the capacity of some CPUs. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: jhugo@codeaurora.org Link: http://lkml.kernel.org/r/1536306664-29827-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Fix scale_rt_capacity() for SMTVincent Guittot
Since commit: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") scale_rt_capacity() returns the remaining capacity and not a scale factor to apply on cpu_capacity_orig. arch_scale_cpu() is directly called by scale_rt_capacity() so we must take the sched_domain argument. Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") Link: http://lkml.kernel.org/r/20180904093626.GA23936@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Fix vruntime_normalized() for remote non-migration wakeupSteve Muckle
When a task which previously ran on a given CPU is remotely queued to wake up on that same CPU, there is a period where the task's state is TASK_WAKING and its vruntime is not normalized. This is not accounted for in vruntime_normalized() which will cause an error in the task's vruntime if it is switched from the fair class during this time. For example if it is boosted to RT priority via rt_mutex_setprio(), rq->min_vruntime will not be subtracted from the task's vruntime but it will be added again when the task returns to the fair class. The task's vruntime will have been erroneously doubled and the effective priority of the task will be reduced. Note this will also lead to inflation of all vruntimes since the doubled vruntime value will become the rq's min_vruntime when other tasks leave the rq. This leads to repeated doubling of the vruntime and priority penalty. Fix this by recognizing a WAKING task's vruntime as normalized only if sched_remote_wakeup is true. This indicates a migration, in which case the vruntime would have been normalized in migrate_task_rq_fair(). Based on a similar patch from John Dias <joaodias@google.com>. Suggested-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Steve Muckle <smuckle@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: John Dias <joaodias@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Miguel de Dios <migueldedios@google.com> Cc: Morten Rasmussen <Morten.Rasmussen@arm.com> Cc: Patrick Bellasi <Patrick.Bellasi@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: kernel-team@android.com Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration") Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/pelt: Fix update_blocked_averages() for RT and DL classesVincent Guittot
update_blocked_averages() is called to periodiccally decay the stalled load of idle CPUs and to sync all loads before running load balance. When cfs rq is idle, it trigs a load balance during pick_next_task_fair() in order to potentially pull tasks and to use this newly idle CPU. This load balance happens whereas prev task from another class has not been put and its utilization updated yet. This may lead to wrongly account running time as idle time for RT or DL classes. Test that no RT or DL task is running when updating their utilization in update_blocked_averages(). We still update RT and DL utilization instead of simply skipping them to make sure that all metrics are synced when used during load balance. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 371bf4273269 ("sched/rt: Add rt_rq utilization tracking") Fixes: 3727e0e16340 ("sched/dl: Add dl_rq utilization tracking") Link: http://lkml.kernel.org/r/1535728975-22799-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/topology: Set correct NUMA topology typeSrikar Dronamraju
With the following commit: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain") the scheduler introduced a new NUMA level. However this leads to the NUMA topology on 2 node systems to not be marked as NUMA_DIRECT anymore. After this commit, it gets reported as NUMA_BACKPLANE, because sched_domains_numa_level is now 2 on 2 node systems. Fix this by allowing setting systems that have up to 2 NUMA levels as NUMA_DIRECT. While here remove code that assumes that level can be 0. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andre Wild <wild@linux.vnet.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org> Fixes: 051f3ca02e46 "Introduce NUMA identity node sched domain" Link: http://lkml.kernel.org/r/1533920419-17410-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/debug: Fix potential deadlock when writing to sched_featuresJiada Wang
The following lockdep report can be triggered by writing to /sys/kernel/debug/sched_features: ====================================================== WARNING: possible circular locking dependency detected 4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18 Not tainted ------------------------------------------------------ sh/3358 is trying to acquire lock: 000000004ad3989d (cpu_hotplug_lock.rw_sem){++++}, at: static_key_enable+0x14/0x30 but task is already holding lock: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&sb->s_type->i_mutex_key#3){+.+.}: lock_acquire+0xb8/0x148 down_write+0xac/0x140 start_creating+0x5c/0x168 debugfs_create_dir+0x18/0x220 opp_debug_register+0x8c/0x120 _add_opp_dev+0x104/0x1f8 dev_pm_opp_get_opp_table+0x174/0x340 _of_add_opp_table_v2+0x110/0x760 dev_pm_opp_of_add_table+0x5c/0x240 dev_pm_opp_of_cpumask_add_table+0x5c/0x100 cpufreq_init+0x160/0x430 cpufreq_online+0x1cc/0xe30 cpufreq_add_dev+0x78/0x198 subsys_interface_register+0x168/0x270 cpufreq_register_driver+0x1c8/0x278 dt_cpufreq_probe+0xdc/0x1b8 platform_drv_probe+0xb4/0x168 driver_probe_device+0x318/0x4b0 __device_attach_driver+0xfc/0x1f0 bus_for_each_drv+0xf8/0x180 __device_attach+0x164/0x200 device_initial_probe+0x10/0x18 bus_probe_device+0x110/0x178 device_add+0x6d8/0x908 platform_device_add+0x138/0x3d8 platform_device_register_full+0x1cc/0x1f8 cpufreq_dt_platdev_init+0x174/0x1bc do_one_initcall+0xb8/0x310 kernel_init_freeable+0x4b8/0x56c kernel_init+0x10/0x138 ret_from_fork+0x10/0x18 -> #2 (opp_table_lock){+.+.}: lock_acquire+0xb8/0x148 __mutex_lock+0x104/0xf50 mutex_lock_nested+0x1c/0x28 _of_add_opp_table_v2+0xb4/0x760 dev_pm_opp_of_add_table+0x5c/0x240 dev_pm_opp_of_cpumask_add_table+0x5c/0x100 cpufreq_init+0x160/0x430 cpufreq_online+0x1cc/0xe30 cpufreq_add_dev+0x78/0x198 subsys_interface_register+0x168/0x270 cpufreq_register_driver+0x1c8/0x278 dt_cpufreq_probe+0xdc/0x1b8 platform_drv_probe+0xb4/0x168 driver_probe_device+0x318/0x4b0 __device_attach_driver+0xfc/0x1f0 bus_for_each_drv+0xf8/0x180 __device_attach+0x164/0x200 device_initial_probe+0x10/0x18 bus_probe_device+0x110/0x178 device_add+0x6d8/0x908 platform_device_add+0x138/0x3d8 platform_device_register_full+0x1cc/0x1f8 cpufreq_dt_platdev_init+0x174/0x1bc do_one_initcall+0xb8/0x310 kernel_init_freeable+0x4b8/0x56c kernel_init+0x10/0x138 ret_from_fork+0x10/0x18 -> #1 (subsys mutex#6){+.+.}: lock_acquire+0xb8/0x148 __mutex_lock+0x104/0xf50 mutex_lock_nested+0x1c/0x28 subsys_interface_register+0xd8/0x270 cpufreq_register_driver+0x1c8/0x278 dt_cpufreq_probe+0xdc/0x1b8 platform_drv_probe+0xb4/0x168 driver_probe_device+0x318/0x4b0 __device_attach_driver+0xfc/0x1f0 bus_for_each_drv+0xf8/0x180 __device_attach+0x164/0x200 device_initial_probe+0x10/0x18 bus_probe_device+0x110/0x178 device_add+0x6d8/0x908 platform_device_add+0x138/0x3d8 platform_device_register_full+0x1cc/0x1f8 cpufreq_dt_platdev_init+0x174/0x1bc do_one_initcall+0xb8/0x310 kernel_init_freeable+0x4b8/0x56c kernel_init+0x10/0x138 ret_from_fork+0x10/0x18 -> #0 (cpu_hotplug_lock.rw_sem){++++}: __lock_acquire+0x203c/0x21d0 lock_acquire+0xb8/0x148 cpus_read_lock+0x58/0x1c8 static_key_enable+0x14/0x30 sched_feat_write+0x314/0x428 full_proxy_write+0xa0/0x138 __vfs_write+0xd8/0x388 vfs_write+0xdc/0x318 ksys_write+0xb4/0x138 sys_write+0xc/0x18 __sys_trace_return+0x0/0x4 other info that might help us debug this: Chain exists of: cpu_hotplug_lock.rw_sem --> opp_table_lock --> &sb->s_type->i_mutex_key#3 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&sb->s_type->i_mutex_key#3); lock(opp_table_lock); lock(&sb->s_type->i_mutex_key#3); lock(cpu_hotplug_lock.rw_sem); *** DEADLOCK *** 2 locks held by sh/3358: #0: 00000000a8c4b363 (sb_writers#10){.+.+}, at: vfs_write+0x238/0x318 #1: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428 stack backtrace: CPU: 5 PID: 3358 Comm: sh Not tainted 4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18 Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT) Call trace: dump_backtrace+0x0/0x288 show_stack+0x14/0x20 dump_stack+0x13c/0x1ac print_circular_bug.isra.10+0x270/0x438 check_prev_add.constprop.16+0x4dc/0xb98 __lock_acquire+0x203c/0x21d0 lock_acquire+0xb8/0x148 cpus_read_lock+0x58/0x1c8 static_key_enable+0x14/0x30 sched_feat_write+0x314/0x428 full_proxy_write+0xa0/0x138 __vfs_write+0xd8/0x388 vfs_write+0xdc/0x318 ksys_write+0xb4/0x138 sys_write+0xc/0x18 __sys_trace_return+0x0/0x4 This is because when loading the cpufreq_dt module we first acquire cpu_hotplug_lock.rw_sem lock, then in cpufreq_init(), we are taking the &sb->s_type->i_mutex_key lock. But when writing to /sys/kernel/debug/sched_features, the cpu_hotplug_lock.rw_sem lock depends on the &sb->s_type->i_mutex_key lock. To fix this bug, reverse the lock acquisition order when writing to sched_features, this way cpu_hotplug_lock.rw_sem no longer depends on &sb->s_type->i_mutex_key. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Jiada Wang <jiada_wang@mentor.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Eugeniu Rosca <erosca@de.adit-jv.com> Cc: George G. Davis <george_davis@mentor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180731121222.26195-1-jiada_wang@mentor.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10staging: erofs: rename superblock flags (MS_xyz -> SB_xyz)Gao Xiang
This patch follows commit 1751e8a6cb93 ("Rename superblock flags (MS_xyz -> SB_xyz)") and after commit ("vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled"), there is no MS_RDONLY and MS_NOATIME at all. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-10staging: erofs: fix 1 warning and 9 checksPavel Zemlyanoy
This patch does not change the logic, it only corrects the formatting and checkpatch checks by braces {} should be used on all arms of this statement, unbalanced braces around else statement and warning by braces {} are not necessary for any arm of this statement. The patch fixes 9 checks of type: "Check: braces {} should be used on all arms of this statement"; "Check: Unbalanced braces around else statement"; and 1 warning of type: "WARNING: braces {} are not necessary for any arm of this statement". Signed-off-by: Pavel Zemlyanoy <zemlyanoy@ispras.ru> Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-10staging: erofs: formatting alignment parenthesisPavel Zemlyanoy
This patch does not change the logic, it only corrects the formatting and checkpatch check by alignment should match open parenthesis. The patch fixes 2 check of type: "Check: Alignment should match open parenthesis". Signed-off-by: Pavel Zemlyanoy <zemlyanoy@ispras.ru> Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-10staging: erofs: formatting add spaces arround '*'Pavel Zemlyanoy
This patch does not change the logic, it only corrects the formatting and checkpatch check by adding spaces around '*'. The patch fixes 1 check of type: "Check: spaces preferred around that '*'". Signed-off-by: Pavel Zemlyanoy <zemlyanoy@ispras.ru> Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-10staging: erofs: formatting spaces around '-'Pavel Zemlyanoy
This patch does not change the logic, it only corrects the formatting and checkpatch checks by adding spaces around '-'. The patch fixes 4 checks of type: "Check: spaces preferred around that '-'". Signed-off-by: Pavel Zemlyanoy <zemlyanoy@ispras.ru> Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-10staging: erofs: formatting fix to NULL comparisonPavel Zemlyanoy
This patch does not change the logic, it only corrects the formatting and checkpatch checks by to NULL comparison. The patch fixes 5 checks of type: "Comparison to NULL could be written". Signed-off-by: Pavel Zemlyanoy <zemlyanoy@ispras.ru> Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-10staging: erofs: formatting fix in unzip_vle_lz4.cPavel Zemlyanoy
This patch does not change the logic, it only corrects the formatting and checkpatch warnings by adding "int" to the unsigned type. The patch fixes 11 warnings of the type: "WARNING: Prefer 'unsigned int' to bare use of 'unsigned'" Signed-off-by: Pavel Zemlyanoy <zemlyanoy@ispras.ru> Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-10locking/mutex: Fix mutex debug call and ww_mutex documentationThomas Hellstrom
The following commit: 08295b3b5bee ("Implement an algorithm choice for Wound-Wait mutexes") introduced a reference in the documentation to a function that was removed in an earlier commit. It also forgot to remove a call to debug_mutex_add_waiter() which is now unconditionally called by __mutex_add_waiter(). Fix those bugs. Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dri-devel@lists.freedesktop.org Fixes: 08295b3b5bee ("Implement an algorithm choice for Wound-Wait mutexes") Link: http://lkml.kernel.org/r/20180903140708.2401-1-thellstrom@vmware.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10perf/UAPI: Clearly mark __PERF_SAMPLE_CALLCHAIN_EARLY as internal usePeter Zijlstra
Vince noted that commit: 6cbc304f2f36 ("perf/x86/intel: Fix unwind errors from PEBS entries (mk-II)") 'leaked' __PERF_SAMPLE_CALLCHAIN_EARLY into the UAPI namespace. And while sys_perf_event_open() will error out if you try to use it, it is exposed. Clearly mark it for internal use only to avoid any confusion. Requested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10perf/x86/intel: Add support/quirk for the MISPREDICT bit on Knights Landing CPUsJacek Tomaka
Problem: perf did not show branch predicted/mispredicted bit in brstack. Output of perf -F brstack for profile collected Before: 0x4fdbcd/0x4fdc03/-/-/-/0 0x45f4c1/0x4fdba0/-/-/-/0 0x45f544/0x45f4bb/-/-/-/0 0x45f555/0x45f53c/-/-/-/0 0x7f66901cc24b/0x45f555/-/-/-/0 0x7f66901cc22e/0x7f66901cc23d/-/-/-/0 0x7f66901cc1ff/0x7f66901cc20f/-/-/-/0 0x7f66901cc1e8/0x7f66901cc1fc/-/-/-/0 After: 0x4fdbcd/0x4fdc03/P/-/-/0 0x45f4c1/0x4fdba0/P/-/-/0 0x45f544/0x45f4bb/P/-/-/0 0x45f555/0x45f53c/P/-/-/0 0x7f66901cc24b/0x45f555/P/-/-/0 0x7f66901cc22e/0x7f66901cc23d/P/-/-/0 0x7f66901cc1ff/0x7f66901cc20f/P/-/-/0 0x7f66901cc1e8/0x7f66901cc1fc/P/-/-/0 Cause: As mentioned in Software Development Manual vol 3, 17.4.8.1, IA32_PERF_CAPABILITIES[5:0] indicates the format of the address that is stored in the LBR stack. Knights Landing reports 1 (LBR_FORMAT_LIP) as its format. Despite that, registers containing FROM address of the branch, do have MISPREDICT bit but because of the format indicated in IA32_PERF_CAPABILITIES[5:0], LBR did not read MISPREDICT bit. Solution: Teach LBR about above Knights Landing quirk and make it read MISPREDICT bit. Signed-off-by: Jacek Tomaka <jacek.tomaka@poczta.fm> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180802013830.10600-1-jacekt@dugeo.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10Revert "staging: erofs: disable compiling temporarile"Gao Xiang
This reverts commit 156c3df8d4db4e693c062978186f44079413d74d. Since XArray and the new mount apis aren't merged in 4.19-rc1 merge window, the BROKEN mark can be reverted directly without any problems. Fixes: 156c3df8d4db ("staging: erofs: disable compiling temporarile") Cc: Matthew Wilcox <willy@infradead.org> Cc: David Howells <dhowells@redhat.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09Linux 4.19-rc3v4.19-rc3Linus Torvalds
2018-09-09ip: frags: fix crash in ip_do_fragment()Taehee Yoo
A kernel crash occurrs when defragmented packet is fragmented in ip_do_fragment(). In defragment routine, skb_orphan() is called and skb->ip_defrag_offset is set. but skb->sk and skb->ip_defrag_offset are same union member. so that frag->sk is not NULL. Hence crash occurrs in skb->sk check routine in ip_do_fragment() when defragmented packet is fragmented. test commands: %iptables -t nat -I POSTROUTING -j MASQUERADE %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000 splat looks like: [ 261.069429] kernel BUG at net/ipv4/ip_output.c:636! [ 261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3 [ 261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600 [ 261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c [ 261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202 [ 261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004 [ 261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8 [ 261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395 [ 261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4 [ 261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000 [ 261.174169] FS: 00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000 [ 261.183012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0 [ 261.198158] Call Trace: [ 261.199018] ? dst_output+0x180/0x180 [ 261.205011] ? save_trace+0x300/0x300 [ 261.209018] ? ip_copy_metadata+0xb00/0xb00 [ 261.213034] ? sched_clock_local+0xd4/0x140 [ 261.218158] ? kill_l4proto+0x120/0x120 [nf_conntrack] [ 261.223014] ? rt_cpu_seq_stop+0x10/0x10 [ 261.227014] ? find_held_lock+0x39/0x1c0 [ 261.233008] ip_finish_output+0x51d/0xb50 [ 261.237006] ? ip_fragment.constprop.56+0x220/0x220 [ 261.243011] ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack] [ 261.250152] ? rcu_is_watching+0x77/0x120 [ 261.255010] ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4] [ 261.261033] ? nf_hook_slow+0xb1/0x160 [ 261.265007] ip_output+0x1c7/0x710 [ 261.269005] ? ip_mc_output+0x13f0/0x13f0 [ 261.273002] ? __local_bh_enable_ip+0xe9/0x1b0 [ 261.278152] ? ip_fragment.constprop.56+0x220/0x220 [ 261.282996] ? nf_hook_slow+0xb1/0x160 [ 261.287007] raw_sendmsg+0x21f9/0x4420 [ 261.291008] ? dst_output+0x180/0x180 [ 261.297003] ? sched_clock_cpu+0x126/0x170 [ 261.301003] ? find_held_lock+0x39/0x1c0 [ 261.306155] ? stop_critical_timings+0x420/0x420 [ 261.311004] ? check_flags.part.36+0x450/0x450 [ 261.315005] ? _raw_spin_unlock_irq+0x29/0x40 [ 261.320995] ? _raw_spin_unlock_irq+0x29/0x40 [ 261.326142] ? cyc2ns_read_end+0x10/0x10 [ 261.330139] ? raw_bind+0x280/0x280 [ 261.334138] ? sched_clock_cpu+0x126/0x170 [ 261.338995] ? check_flags.part.36+0x450/0x450 [ 261.342991] ? __lock_acquire+0x4500/0x4500 [ 261.348994] ? inet_sendmsg+0x11c/0x500 [ 261.352989] ? dst_output+0x180/0x180 [ 261.357012] inet_sendmsg+0x11c/0x500 [ ... ] v2: - clear skb->sk at reassembly routine.(Eric Dumarzet) Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09Merge tag 'perf-urgent-for-mingo-4.19-20180903' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf/urgent fixes from Arnaldo Carvalho de Melo: Kernel: - Modify breakpoint fixes (Jiri Olsa) perf annotate: - Fix parsing aarch64 branch instructions after objdump update (Kim Phillips) - Fix parsing indirect calls in 'perf annotate' (Martin Liška) perf probe: - Ignore SyS symbols irrespective of endianness on PowerPC (Sandipan Das) perf trace: - Fix include path for asm-generic/unistd.h on arm64 (Kim Phillips) Core libraries: - Fix potential null pointer dereference in perf_evsel__new_idx() (Hisao Tanabe) - Use fixed size string for comms instead of scanf("%m"), that is not present in the bionic libc and leads to a crash (Chris Phlipot) - Fix bad memory access in trace info on 32-bit systems, we were reading 8 bytes from a 4-byte long variable when saving the command line in the perf.data file. (Chris Phlipot) Build system: - Streamline bpf examples and headers installation, clarifying some install messages. (Arnaldo Carvalho de Melo) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-09net/tls: Set count of SG entries if sk_alloc_sg returns -ENOSPCVakul Garg
tls_sw_sendmsg() allocates plaintext and encrypted SG entries using function sk_alloc_sg(). In case the number of SG entries hit MAX_SKB_FRAGS, sk_alloc_sg() returns -ENOSPC and sets the variable for current SG index to '0'. This leads to calling of function tls_push_record() with 'sg_encrypted_num_elem = 0' and later causes kernel crash. To fix this, set the number of SG elements to the number of elements in plaintext/encrypted SG arrays in case sk_alloc_sg() returns -ENOSPC. Fixes: 3c4d7559159b ("tls: kernel TLS support") Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09Merge branch 'ena-fixes'David S. Miller
Netanel Belgazal says: ==================== bug fixes for ENA Ethernet driver ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09net: ena: fix incorrect usage of memory barriersNetanel Belgazal
Added memory barriers where they were missing to support multiple architectures, and removed redundant ones. As part of removing the redundant memory barriers and improving performance, we moved to more relaxed versions of memory barriers, as well as to the more relaxed version of writel - writel_relaxed, while maintaining correctness. Signed-off-by: Netanel Belgazal <netanel@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09net: ena: fix missing calls to READ_ONCENetanel Belgazal
Add READ_ONCE calls where necessary (for example when iterating over a memory field that gets updated by the hardware). Signed-off-by: Netanel Belgazal <netanel@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09net: ena: fix missing lock during device destructionNetanel Belgazal
acquire the rtnl_lock during device destruction to avoid using partially destroyed device. ena_remove() shares almost the same logic as ena_destroy_device(), so use ena_destroy_device() and avoid duplications. Signed-off-by: Netanel Belgazal <netanel@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09net: ena: fix potential double ena_destroy_device()Netanel Belgazal
ena_destroy_device() can potentially be called twice. To avoid this, check that the device is running and only then proceed destroying it. Signed-off-by: Netanel Belgazal <netanel@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09net: ena: fix device destruction to gracefully free resourcesNetanel Belgazal
When ena_destroy_device() is called from ena_suspend(), the device is still reachable from the driver. Therefore, the driver can send a command to the device to free all resources. However, in all other cases of calling ena_destroy_device(), the device is potentially in an error state and unreachable from the driver. In these cases the driver must not send commands to the device. The current implementation does not request resource freeing from the device even when possible. We add the graceful parameter to ena_destroy_device() to enable resource freeing when possible, and use it in ena_suspend(). Signed-off-by: Netanel Belgazal <netanel@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09net: ena: fix driver when PAGE_SIZE == 64kBNetanel Belgazal
The buffer length field in the ena rx descriptor is 16 bit, and the current driver passes a full page in each ena rx descriptor. When PAGE_SIZE equals 64kB or more, the buffer length field becomes zero. To solve this issue, limit the ena Rx descriptor to use 16kB even when allocating 64kB kernel pages. This change would not impact ena device functionality, as 16kB is still larger than maximum MTU. Signed-off-by: Netanel Belgazal <netanel@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09net: ena: fix surprise unplug NULL dereference kernel crashNetanel Belgazal
Starting with driver version 1.5.0, in case of a surprise device unplug, there is a race caused by invoking ena_destroy_device() from two different places. As a result, the readless register might be accessed after it was destroyed. Signed-off-by: Netanel Belgazal <netanel@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of fixes for x86: - Prevent multiplication result truncation on 32bit. Introduced with the early timestamp reworrk. - Ensure microcode revision storage to be consistent under all circumstances - Prevent write tearing of PTEs - Prevent confusion of user and kernel reegisters when dumping fatal signals verbosely - Make an error return value in a failure path of the vector allocation negative. Returning EINVAL might the caller assume success and causes further wreckage. - A trivial kernel doc warning fix" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Use WRITE_ONCE() when setting PTEs x86/apic/vector: Make error return value negative x86/process: Don't mix user/kernel regs in 64bit __show_regs() x86/tsc: Prevent result truncation on 32bit x86: Fix kernel-doc atomic.h warnings x86/microcode: Update the new microcode revision unconditionally x86/microcode: Make sure boot_cpu_data.microcode is up-to-date
2018-09-09Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timekeeping fixes from Thomas Gleixner: "Two fixes for timekeeping: - Revert to the previous kthread based update, which is unfortunately required due to lock ordering issues. The removal caused boot failures on old Core2 machines. Add a proper comment why the thread needs to stay to prevent accidental removal in the future. - Fix a silly typo in a function declaration" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Revert "Remove kthread" timekeeping: Fix declaration of read_persistent_wall_and_boot_offset()
2018-09-09Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irqchip fix from Thomas Gleixner: "A single fix to prevent allocating excessive memory in the GIC/ITS driver. While the subject of the patch might suggest otherwise this is a real fix as some SoCs exceed the memory allocation limits and fail to boot" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3-its: Cap lpi_id_bits to reduce memory footprint
2018-09-09Merge branch 'smp-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull cpu hotplug fixes from Thomas Gleixner: "Two fixes for the hotplug state machine code: - Move the misplaces smb() in the hotplug thread function to the proper place, otherwise a half update control struct could be observed - Prevent state corruption on error rollback, which causes the state to advance by one and as a consequence skip it in the bringup sequence" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Prevent state corruption on error rollback cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun()
2018-09-09Merge tag 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random Pull random driver fix from Ted Ts'o: "Fix things so the choice of whether or not to trust RDRAND to initialize the CRNG is configurable via the boot option random.trust_cpu={on,off}" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: random: make CPU trust a boot parameter
2018-09-09Merge tag 'kbuild-fixes-v4.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - make setlocalversion more robust about -dirty check - loosen the pkg-config requirement for Kconfig - change missing depmod to a warning from an error - warn modules_install when System.map is missing * tag 'kbuild-fixes-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: modules_install: warn when missing System.map file kbuild: make missing $DEPMOD a Warning instead of an Error kconfig: do not require pkg-config on make {menu,n}config kconfig: remove a spurious self-assignment scripts/setlocalversion: git: Make -dirty check more robust
2018-09-09Merge tag 'iio-fixes-4.19a' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus Jonathan writes: First set of IIO fixes for the 4.19 cycle. * ad9523 - sysfs write should return the number of characters used or an error, not 0 which could result in an infinite loop in userspace. * lsm6dsx - Fix computation of length when updating the watermark to include timestamps avoiding the watermark being set earlier than intended. * maxim_thermocouple - Revert a patch adding the max31856 as it's not actually compatible with the register set that this driver supports. * si1133 - Fix an impossible value check so we don't always error out whether the passed in value is good or bad.
2018-09-09fs/cifs: require sha512Stefan Metzmacher
This got lost in commit 0fdfef9aa7ee68ddd508aef7c98630cfc054f8d6, which removed CONFIG_CIFS_SMB311. Signed-off-by: Stefan Metzmacher <metze@samba.org> Fixes: 0fdfef9aa7ee68ddd ("smb3: simplify code by removing CONFIG_CIFS_SMB311") CC: Stable <stable@vger.kernel.org> CC: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2018-09-09fs/cifs: suppress a string overflow warningStephen Rothwell
A powerpc build of cifs with gcc v8.2.0 produces this warning: fs/cifs/cifssmb.c: In function ‘CIFSSMBNegotiate’: fs/cifs/cifssmb.c:605:3: warning: ‘strncpy’ writing 16 bytes into a region of size 1 overflows the destination [-Wstringop-overflow=] strncpy(pSMB->DialectsArray+count, protocols[i].name, 16); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Since we are already doing a strlen() on the source, change the strncpy to a memcpy(). Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Steve French <stfrench@microsoft.com>
2018-09-09kbuild: modules_install: warn when missing System.map fileRandy Dunlap
If there is no System.map file for "make modules_install", scripts/depmod.sh will silently exit with success, having done nothing. Since this is an unexpected situation, change it to report a Warning for the missing file. The behavior is not changed except for the Warning message. The (previous) silent success and new Warning can be reproduced by: $ make mrproper; make defconfig $ make modules; make modules_install and since System.map is produced by "make vmlinux", the steps above omit producing the System.map file. Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-09-08Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Radim Krčmář: "ARM: - Fix a VFP corruption in 32-bit guest - Add missing cache invalidation for CoW pages - Two small cleanups s390: - Fallout from the hugetlbfs support: pfmf interpretion and locking - VSIE: fix keywrapping for nested guests PPC: - Fix a bug where pages might not get marked dirty, causing guest memory corruption on migration - Fix a bug causing reads from guest memory to use the wrong guest real address for very large HPT guests (>256G of memory), leading to failures in instruction emulation. x86: - Fix out of bound access from malicious pv ipi hypercalls (introduced in rc1) - Fix delivery of pending interrupts when entering a nested guest, preventing arbitrarily late injection - Sanitize kvm_stat output after destroying a guest - Fix infinite loop when emulating a nested guest page fault and improve the surrounding emulation code - Two minor cleanups" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits) KVM: LAPIC: Fix pv ipis out-of-bounds access KVM: nVMX: Fix loss of pending IRQ/NMI before entering L2 arm64: KVM: Remove pgd_lock KVM: Remove obsolete kvm_unmap_hva notifier backend arm64: KVM: Only force FPEXC32_EL2.EN if trapping FPSIMD KVM: arm/arm64: Clean dcache to PoC when changing PTE due to CoW KVM: s390: Properly lock mm context allow_gmap_hpage_1m setting KVM: s390: vsie: copy wrapping keys to right place KVM: s390: Fix pfmf and conditional skey emulation tools/kvm_stat: re-animate display of dead guests tools/kvm_stat: indicate dead guests as such tools/kvm_stat: handle guest removals more gracefully tools/kvm_stat: don't reset stats when setting PID filter for debugfs tools/kvm_stat: fix updates for dead guests tools/kvm_stat: fix handling of invalid paths in debugfs provider tools/kvm_stat: fix python3 issues KVM: x86: Unexport x86_emulate_instruction() KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction() KVM: x86: Do not re-{try,execute} after failed emulation in L2 KVM: x86: Default to not allowing emulation retry in kvm_mmu_page_fault ...
2018-09-08Merge tag 'armsoc-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC fixes from Olof Johansson: "A few more fixes who have trickled in: - MMC bus width fixup for some Allwinner platforms - Fix for NULL deref in ti-aemif when no platform data is passed in - Fix div by 0 in SCMI code - Add a missing module alias in a new RPi driver" * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: memory: ti-aemif: fix a potential NULL-pointer dereference firmware: arm_scmi: fix divide by zero when sustained_perf_level is zero hwmon: rpi: add module alias to raspberrypi-hwmon arm64: allwinner: dts: h6: fix Pine H64 MMC bus width
2018-09-08Merge tag 'sunxi-fixes-for-4.19' of ↵Olof Johansson
https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into fixes Allwinner fixes for 4.19 Just one fix for H6 mmc on the Pine H64: the mmc bus width was missing from the device tree. This was added in 4.19-rc1. * tag 'sunxi-fixes-for-4.19' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux: arm64: allwinner: dts: h6: fix Pine H64 MMC bus width Signed-off-by: Olof Johansson <olof@lixom.net>
2018-09-08iio: light: bh1750: simplify setting PM opsTomasz Duszynski
Relying on CONFIG_PM_SLEEP to set PM ops is not necessary since core will handle everything internally. One have to only make sure that functions that can go unused are marked with __maybe_unused. Signed-off-by: Tomasz Duszynski <tduszyns@gmail.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2018-09-08dt-bindings: adxl372: Document the adxl372 I2C bindingsStefan Popa
The adxl372 is designed to communicate in either SPI or I2C protocol. This patch adds the documentation of device tree bindings for adxl372 I2C. Signed-off-by: Stefan Popa <stefan.popa@analog.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2018-09-08iio: adxl372: Add support for I2C communicationStefan Popa
The adxl372 is designed to communicate in either SPI or I2C protocol. It autodetects the format being used, requiring no configuration control to select the format. Signed-off-by: Stefan Popa <stefan.popa@analog.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2018-09-08iio: adxl372: Refactor the driverStefan Popa
This patch restructures the existing adxl372 driver by adding a module for SPI and a header file, while the baseline module deals with the chip-logic. This is a necessary step, as this driver should support in the future a similar device which differs only in the type of interface used (I2C instead of SPI). Signed-off-by: Stefan Popa <stefan.popa@analog.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2018-09-08dt-bindings: iio: vadc: Fix documentation of 'reg'Matthias Kaehlcke
The documentation of Qualcomm's SPMI PMIC voltage ADC claims that the 'reg' property consists of two values, the SPMI address and the length of the controller's registers. However the SPMI bus to which it is added specifies "#size-cells = <0>;". Remove the controller register length from the documentation of the field and the example. Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2018-09-08iio: pressure: ms5611: switch to SPDX identifierTomasz Duszynski
Drop boilerplate license text and use SPDX identifier instead. Signed-off-by: Tomasz Duszynski <tduszyns@gmail.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2018-09-08iio: light: bh1750: switch to SPDX identifierTomasz Duszynski
Drop boilerplate license text and use SPDX identifier instead. Signed-off-by: Tomasz Duszynski <tduszyns@gmail.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2018-09-08dt-bindings: iio: imu: st_lsm6dsx: add LSM6DSO device bindingsLorenzo Bianconi
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>