2017-10-10  Merge tag 'f2fs-for-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs  [Linus Torvalds]
Pull f2fs fix from Jaegeuk Kim:
 "This contains one bug fix which causes a kernel panic during fstrim introduced in 4.14-rc1"
* tag 'f2fs-for-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: fix potential panic during fstrim
2017-10-10  Merge tag 'linux-kselftest-4.14-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest  [Linus Torvalds]
Pull kselftest fixes from Shuah Khan:
 - fix for x86: sysret_ss_attrs test build failure preventing the x86 tests from running
 - fix mqueue: fix regression in silencing test run output
* tag 'linux-kselftest-4.14-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests: mqueue: fix regression in silencing output from RUN_TESTS
  selftests: x86: sysret_ss_attrs doesn't build on a PIE build
2017-10-10  iommu/amd: Do not disable SWIOTLB if SME is active  [Tom Lendacky]
When SME memory encryption is active it will rely on SWIOTLB to handle DMA for devices that cannot support the addressing requirements of having the encryption mask set in the physical address. The IOMMU currently disables SWIOTLB if it is not running in passthrough mode. This is not desired as non-PCI devices attempting DMA may fail. Update the code to check if SME is active and not disable SWIOTLB. Fixes: 2543a786aa25 ("iommu/amd: Allow the AMD IOMMU to work with memory encryption") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
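A minimal sketch of the resulting decision, assuming the sme_active() helper of this kernel generation; it is illustrative, not the driver's exact code:

  /*
   * SWIOTLB may only be turned off when the IOMMU translates DMA for all
   * devices AND SME is not active: with SME, non-IOMMU (e.g. non-PCI)
   * devices still need SWIOTLB to bounce below the encryption mask.
   */
  static bool sketch_can_disable_swiotlb(bool iommu_passthrough)
  {
          return !iommu_passthrough && !sme_active();
  }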
2017-10-10  Merge tag 'perf-urgent-for-mingo-4.14-20171010' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent  [Ingo Molnar]
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
 - Unbreak 'perf record' for arm/arm64 with events with explicit PMU (Mark Rutland)
 - Add missing separator for "perf script -F ip,brstack" (and brstackoff) (Mark Santaniello)
 - One line, comment only, sync kernel ABI header with tooling header (Arnaldo Carvalho de Melo)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-11  crypto: shash - Fix zero-length shash ahash digest crash  [Herbert Xu]
The shash ahash digest adaptor function may crash if given a zero-length input together with a null SG list. This is because it tries to read the SG list before looking at the length. This patch fixes it by checking the length first. Cc: <stable@vger.kernel.org> Reported-by: Stephan Müller <smueller@chronox.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Stephan Müller <smueller@chronox.de>
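A minimal sketch of the ordering the fix establishes (look at the length before walking the SG list); the function name is hypothetical and the real shash_ahash_digest() handles more cases:

  static int sketch_shash_ahash_digest(struct ahash_request *req,
                                       struct shash_desc *desc)
  {
          /*
           * Zero-length request: req->src may be a NULL SG list, so it must
           * not be dereferenced; an empty digest is simply init + final.
           */
          if (!req->nbytes)
                  return crypto_shash_init(desc) ?:
                         crypto_shash_final(desc, req->result);

          /* Only now is it safe to walk req->src (the SG list). */
          return shash_ahash_update(req, desc) ?:
                 crypto_shash_final(desc, req->result);
  }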
2017-10-10  quota: Generate warnings for DQUOT_SPACE_NOFAIL allocations  [Jan Kara]
Eryu has reported that since commit 7b9ca4c61bc2 "quota: Reduce contention on dq_data_lock" test generic/233 occasionally fails. This is caused by the fact that since that commit we don't generate warning and set grace time for quota allocations that have DQUOT_SPACE_NOFAIL set (these are for example some metadata allocations in ext4). We need these allocations to behave regularly wrt warning generation and grace time setting so fix the code to return to the original behavior. Reported-and-tested-by: Eryu Guan <eguan@redhat.com> CC: stable@vger.kernel.org Fixes: 7b9ca4c61bc278b771fb57d6290a31ab1fc7fdac Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-10  i40e: Fix memory leak related filter programming status  [Alexander Duyck]
It looks like we weren't correctly placing the pages from buffers that had been used to return a filter programming status back on the ring. As a result they were being overwritten and tracking of the pages was lost. This change works to correct that by incorporating part of i40e_put_rx_buffer into the programming status handler code. As a result we should now be correctly placing the pages for those buffers on the re-allocation list instead of letting them stay in place. Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status") Reported-by: Anders K. Pedersen <akp@cohaesio.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Anders K Pedersen <akp@cohaesio.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10  i40e: Fix comment about locking for __i40e_read_nvm_word()  [Stefano Brivio]
Caller needs to acquire the lock. Called functions will not. Fixes: 09f79fd49d94 ("i40e: avoid NVM acquire deadlock during NVM update") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10  workqueue: replace pool->manager_arb mutex with a flag  [Tejun Heo]
Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by lockdep:

  [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
  [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
  [ 1270.473240] -----------------------------------------------------
  [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
  [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
  [ 1270.474994]
  [ 1270.474994] and this task is already holding:
  [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
  [ 1270.476046] which would create a new lock dependency:
  [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
  [ 1270.476949]
  [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
  [ 1270.477553]  (&pool->lock/1){-.-.}
  ...
  [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
  [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
  ...
  [ 1270.494735]  Possible interrupt unsafe locking scenario:
  [ 1270.494735]
  [ 1270.495250]        CPU0                    CPU1
  [ 1270.495600]        ----                    ----
  [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
  [ 1270.496295]                                local_irq_disable();
  [ 1270.496753]                                lock(&pool->lock/1);
  [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
  [ 1270.497744]   <Interrupt>
  [ 1270.497948]     lock(&pool->lock/1);

, which will cause an irq inversion deadlock if the above lock scenario happens.

The root cause of this safe -> unsafe lock order is the mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock held. Unlocking a mutex while holding an irq spinlock was never safe, and this problem has been around forever, but it never got noticed because the mutex is usually only trylocked while holding the irq lock, making actual failures very unlikely, and the lockdep annotation missed the condition until the recent b9c16a0e1f73 ("locking/mutex: Fix lockdep_assert_held() fail").

Using a mutex for pool->manager_arb has always been a bit of a stretch. It primarily is a mechanism to arbitrate managership between workers, which can easily be done with a pool flag. The only reason it became a mutex is that the pool destruction path wants to exclude parallel managing operations.

This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE and makes the destruction path wait for the current manager on a wait queue.

v2: Drop unnecessary flag clearing before pool destruction as suggested by Boqun.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: stable@vger.kernel.org
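A condensed sketch of that scheme; POOL_MANAGER_ACTIVE is named in the text above, while wq_manager_wait and the function bodies are illustrative assumptions rather than the in-tree code:

  static bool sketch_grab_managership(struct worker_pool *pool)
  {
          /* called with pool->lock (an irq-safe spinlock) held; a plain
           * flag replaces the mutex, so nothing is ever unlocked here */
          if (pool->flags & POOL_MANAGER_ACTIVE)
                  return false;
          pool->flags |= POOL_MANAGER_ACTIVE;
          return true;
  }

  static void sketch_wait_for_manager(struct worker_pool *pool)
  {
          /* destruction path: sleep until no worker is managing;
           * wait_event_lock_irq() drops pool->lock while sleeping */
          wait_event_lock_irq(wq_manager_wait,
                              !(pool->flags & POOL_MANAGER_ACTIVE),
                              pool->lock);
  }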
2017-10-10  KVM: MMU: always terminate page walks at level 1  [Ladi Prosek]
is_last_gpte() is not equivalent to the pseudo-code given in commit 6bb69c9b69c31 ("KVM: MMU: simplify last_pte_bitmap") because an incorrect value of last_nonleaf_level may override the result even if level == 1. It is critical for is_last_gpte() to return true on level == 1 to terminate page walks. Otherwise memory corruption may occur as level is used as an index to various data structures throughout the page walking code. Even though the actual bug would be wherever the MMU is initialized (as in the previous patch), be defensive and ensure here that is_last_gpte() returns the correct value. This patch is also enough to fix CVE-2017-12188. Fixes: 6bb69c9b69c315200ddc2bc79aee14c0184cf5b2 Cc: stable@vger.kernel.org Cc: Andy Honig <ahonig@google.com> Signed-off-by: Ladi Prosek <lprosek@redhat.com> [Panic if walk_addr_generic gets an incorrect level; this is a serious bug and it's not worth a WARN_ON where the recovery path might hide further exploitable issues; suggested by Andrew Honig. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
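The guarantee described above, written out as straight-line code; the large-page test is abstracted into a hypothetical helper, and the in-tree function expresses the same logic with bit arithmetic:

  static bool sketch_is_last_gpte(struct kvm_mmu *mmu, unsigned int level,
                                  u64 gpte)
  {
          if (level == 1)
                  return true;    /* level 1 must always terminate the walk */

          /* hypothetical helper: is the large-page bit set and valid here? */
          return sketch_gpte_is_large_leaf(mmu, level, gpte);
  }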
2017-10-10  KVM: nVMX: update last_nonleaf_level when initializing nested EPT  [Ladi Prosek]
The function updates context->root_level but didn't call update_last_nonleaf_level so the previous and potentially wrong value was used for page walks. For example, a zero value of last_nonleaf_level would allow a potential out-of-bounds access in arch/x86/mmu/paging_tmpl.h's walk_addr_generic function (CVE-2017-12188). Fixes: 155a97a3d7c78b46cef6f1a973c831bc5a4f82bb Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-10  xen/vcpu: Use a unified name about cpu hotplug state for pv and pvhvm  [Zhenzhong Duan]
As xen_cpuhp_setup is called by PV and PVHVM, the name of "x86/xen/hvm_guest" is confusing. Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2017-10-10  ALSA: usb-audio: Kill stray URB at exiting  [Takashi Iwai]
USB-audio driver may leave a stray URB for the mixer interrupt when it exits by some error during probe. This leads to a use-after-free error as spotted by syzkaller like:

  ==================================================================
  BUG: KASAN: use-after-free in snd_usb_mixer_interrupt+0x604/0x6f0
  Call Trace:
   <IRQ>
   __dump_stack lib/dump_stack.c:16
   dump_stack+0x292/0x395 lib/dump_stack.c:52
   print_address_description+0x78/0x280 mm/kasan/report.c:252
   kasan_report_error mm/kasan/report.c:351
   kasan_report+0x23d/0x350 mm/kasan/report.c:409
   __asan_report_load8_noabort+0x19/0x20 mm/kasan/report.c:430
   snd_usb_mixer_interrupt+0x604/0x6f0 sound/usb/mixer.c:2490
   __usb_hcd_giveback_urb+0x2e0/0x650 drivers/usb/core/hcd.c:1779
   ....

  Allocated by task 1484:
   save_stack_trace+0x1b/0x20 arch/x86/kernel/stacktrace.c:59
   save_stack+0x43/0xd0 mm/kasan/kasan.c:447
   set_track mm/kasan/kasan.c:459
   kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
   kmem_cache_alloc_trace+0x11e/0x2d0 mm/slub.c:2772
   kmalloc ./include/linux/slab.h:493
   kzalloc ./include/linux/slab.h:666
   snd_usb_create_mixer+0x145/0x1010 sound/usb/mixer.c:2540
   create_standard_mixer_quirk+0x58/0x80 sound/usb/quirks.c:516
   snd_usb_create_quirk+0x92/0x100 sound/usb/quirks.c:560
   create_composite_quirk+0x1c4/0x3e0 sound/usb/quirks.c:59
   snd_usb_create_quirk+0x92/0x100 sound/usb/quirks.c:560
   usb_audio_probe+0x1040/0x2c10 sound/usb/card.c:618
   ....

  Freed by task 1484:
   save_stack_trace+0x1b/0x20 arch/x86/kernel/stacktrace.c:59
   save_stack+0x43/0xd0 mm/kasan/kasan.c:447
   set_track mm/kasan/kasan.c:459
   kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:524
   slab_free_hook mm/slub.c:1390
   slab_free_freelist_hook mm/slub.c:1412
   slab_free mm/slub.c:2988
   kfree+0xf6/0x2f0 mm/slub.c:3919
   snd_usb_mixer_free+0x11a/0x160 sound/usb/mixer.c:2244
   snd_usb_mixer_dev_free+0x36/0x50 sound/usb/mixer.c:2250
   __snd_device_free+0x1ff/0x380 sound/core/device.c:91
   snd_device_free_all+0x8f/0xe0 sound/core/device.c:244
   snd_card_do_free sound/core/init.c:461
   release_card_device+0x47/0x170 sound/core/init.c:181
   device_release+0x13f/0x210 drivers/base/core.c:814
   ....

Actually such a URB is killed properly at disconnection when the device gets probed successfully, and what we need is to apply it for the error-path, too. In this patch, we apply snd_usb_mixer_disconnect() at releasing. Also introduce a new flag, disconnected, to struct usb_mixer_interface for not performing the disconnection procedure twice.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
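A sketch of the double-disconnect guard described above; the 'disconnected' flag comes from the text, while the URB field names and the body are assumptions:

  void sketch_usb_mixer_disconnect(struct usb_mixer_interface *mixer)
  {
          if (mixer->disconnected)
                  return;                 /* already torn down once */

          usb_kill_urb(mixer->urb);       /* stop the interrupt URB */
          usb_kill_urb(mixer->rc_urb);
          mixer->disconnected = true;
  }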
2017-10-10  x86/hyperv: Fix hypercalls with extended CPU ranges for TLB flushing  [Marcelo Henrique Cerri]
Do not consider the fixed size of hv_vp_set when passing the variable header size to hv_do_rep_hypercall(). The Hyper-V hypervisor specification states that for a hypercall with a variable header only the size of the variable portion should be supplied via the input control. For HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX/LIST_EX calls that means the fixed portion of hv_vp_set should not be considered. That fixes random failures of some applications that are unexpectedly killed with SIGBUS or SIGSEGV. Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: Josh Poulson <jopoulso@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Simon Xiao <sixiao@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: devel@linuxdriverproject.org Fixes: 628f54cc6451 ("x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls") Link: http://lkml.kernel.org/r/1507210469-29065-1-git-send-email-marcelo.cerri@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  x86/hyperv: Don't use percpu areas for pcpu_flush/pcpu_flush_ex structures  [Vitaly Kuznetsov]
hv_do_hypercall() does virt_to_phys() translation and with some configs (CONFIG_SLAB) this doesn't work for percpu areas, we pass wrong memory to hypervisor and get #GP. We could use working slow_virt_to_phys() instead but doing so kills the performance. Move pcpu_flush/pcpu_flush_ex structures out of percpu areas and allocate memory on first call. The additional level of indirection gives us a small performance penalty, in future we may consider introducing hypercall functions which avoid virt_to_phys() conversion and cache physical addresses of pcpu_flush/pcpu_flush_ex structures somewhere. Reported-by: Simon Xiao <sixiao@microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20171005113924.28021-1-vkuznets@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
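A rough sketch of the resulting allocation scheme (names, setup and GFP choice are assumptions): the percpu area holds only a pointer, while the buffer actually handed to the hypercall comes from the normal allocator, which virt_to_phys() handles correctly.

  static struct hv_flush_pcpu __percpu **sketch_pcpu_flush;
  /* set up once, e.g.: sketch_pcpu_flush = alloc_percpu(struct hv_flush_pcpu *); */

  static struct hv_flush_pcpu *sketch_get_flush_buffer(void)
  {
          struct hv_flush_pcpu **slot = this_cpu_ptr(sketch_pcpu_flush);

          if (unlikely(!*slot))
                  /* first use on this CPU: allocate kmalloc'ed memory */
                  *slot = kzalloc(PAGE_SIZE, GFP_ATOMIC);

          return *slot;   /* may still be NULL if the allocation failed */
  }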
2017-10-10  x86/hyperv: Clear vCPU banks between calls to avoid flushing unneeded vCPUs  [Vitaly Kuznetsov]
hv_flush_pcpu_ex structures are not cleared between calls for performance reasons (they're variable size up to PAGE_SIZE each) but we must clear hv_vp_set.bank_contents part of it to avoid flushing unneeded vCPUs. The rest of the structure is formed correctly. To do the clearing in an efficient way stash the maximum possible vCPU number (this may differ from Linux CPU id). Reported-by: Jork Loeser <Jork.Loeser@microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20171006154854.18092-1-vkuznets@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  perf/x86/intel/uncore: Fix memory leaks on allocation failures  [Colin Ian King]
Currently if an allocation fails then the error return paths don't free up any currently allocated pmus[].boxes and pmus causing a memory leak. Add an error clean up exit path that frees these objects. Detected by CoverityScan, CID#711632 ("Resource Leak") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-janitors@vger.kernel.org Fixes: 087bfbb03269 ("perf/x86: Add generic Intel uncore PMU support") Link: http://lkml.kernel.org/r/20171009172655.6132-1-colin.king@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
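A generic error-unwind shape for the leak described above; the struct and loop are placeholders rather than the uncore driver's actual types:

  struct sketch_pmu {
          void *boxes;
  };

  static int sketch_alloc_boxes(struct sketch_pmu *pmus, int n, size_t size)
  {
          int i;

          for (i = 0; i < n; i++) {
                  pmus[i].boxes = kzalloc(size, GFP_KERNEL);
                  if (!pmus[i].boxes)
                          goto err;
          }
          return 0;

  err:
          /* free everything allocated so far, then the array itself */
          while (--i >= 0)
                  kfree(pmus[i].boxes);
          kfree(pmus);
          return -ENOMEM;
  }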
2017-10-10  x86/unwind: Disable unwinder warnings on 32-bit  [Josh Poimboeuf]
x86-32 doesn't have stack validation, so in most cases it doesn't make sense to warn about bad frame pointers. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: LKP <lkp@01.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a69658760800bf281e6353248c23e0fa0acf5230.1507597785.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  x86/unwind: Align stack pointer in unwinder dump  [Josh Poimboeuf]
When printing the unwinder dump, the stack pointer could be unaligned, for one of two reasons: - stack corruption; or - GCC created an unaligned stack. There's no way for the unwinder to tell the difference between the two, so we have to assume one or the other. GCC unaligned stacks are very rare, and have only been spotted before GCC 5. Presumably, if we're doing an unwinder stack dump, stack corruption is more likely than a GCC unaligned stack. So always align the stack before starting the dump. Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: LKP <lkp@01.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/2f540c515946ab09ed267e1a1d6421202a0cce08.1507597785.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  x86/unwind: Use MSB for frame pointer encoding on 32-bit  [Josh Poimboeuf]
On x86-32, Tetsuo Handa and Fengguang Wu reported unwinder warnings like:

  WARNING: kernel stack regs at f60bb9c8 in swapper:1 has bad 'bp' value 0ba00000

And also there were some stack dumps with a bunch of unreliable '?' symbols after an apic_timer_interrupt symbol, meaning the unwinder got confused when it tried to read the regs.

The cause of those issues is that, with GCC 4.8 (and possibly older), there are cases where GCC misaligns the stack pointer in a leaf function for no apparent reason:

  c124a388 <acpi_rs_move_data>:
  c124a388:       55                      push   %ebp
  c124a389:       89 e5                   mov    %esp,%ebp
  c124a38b:       57                      push   %edi
  c124a38c:       56                      push   %esi
  c124a38d:       89 d6                   mov    %edx,%esi
  c124a38f:       53                      push   %ebx
  c124a390:       31 db                   xor    %ebx,%ebx
  c124a392:       83 ec 03                sub    $0x3,%esp
  ...
  c124a3e3:       83 c4 03                add    $0x3,%esp
  c124a3e6:       5b                      pop    %ebx
  c124a3e7:       5e                      pop    %esi
  c124a3e8:       5f                      pop    %edi
  c124a3e9:       5d                      pop    %ebp
  c124a3ea:       c3                      ret

If an interrupt occurs in such a function, the regs on the stack will be unaligned, which breaks the frame pointer encoding assumption. So on 32-bit, use the MSB instead of the LSB to encode the regs.

This isn't an issue on 64-bit, because interrupts align the stack before writing to it.

Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: LKP <lkp@01.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/279a26996a482ca716605c7dbc7f2db9d8d91e81.1507597785.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
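A minimal sketch of the 32-bit encoding change, with hypothetical macro names; the point is only that the regs marker moves from the LSB (which an unaligned stack can perturb) to the MSB:

  #ifdef CONFIG_X86_32
  /* mark a saved bp value as "points to pt_regs" using the top bit */
  # define SKETCH_ENCODE_FRAME_POINTER(regs)  ((unsigned long)(regs) | (1UL << 31))
  # define SKETCH_FP_IS_REGS(bp)              ((bp) & (1UL << 31))
  # define SKETCH_DECODE_FRAME_POINTER(bp)    ((bp) & ~(1UL << 31))
  #endif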
2017-10-10  x86/unwind: Fix dereference of untrusted pointer  [Josh Poimboeuf]
Tetsuo Handa and Fengguang Wu reported a panic in the unwinder:

  BUG: unable to handle kernel NULL pointer dereference at 000001f2
  IP: update_stack_state+0xd4/0x340
  *pde = 00000000
  Oops: 0000 [#1] PREEMPT SMP
  CPU: 0 PID: 18728 Comm: 01-cpu-hotplug Not tainted 4.13.0-rc4-00170-gb09be67 #592
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
  task: bb0b53c0 task.stack: bb3ac000
  EIP: update_stack_state+0xd4/0x340
  EFLAGS: 00010002 CPU: 0
  EAX: 0000a570 EBX: bb3adccb ECX: 0000f401 EDX: 0000a570
  ESI: 00000001 EDI: 000001ba EBP: bb3adc6b ESP: bb3adc3f
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
  CR0: 80050033 CR2: 000001f2 CR3: 0b3a7000 CR4: 00140690
  DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
  DR6: fffe0ff0 DR7: 00000400
  Call Trace:
   ? unwind_next_frame+0xea/0x400
   ? __unwind_start+0xf5/0x180
   ? __save_stack_trace+0x81/0x160
   ? save_stack_trace+0x20/0x30
   ? __lock_acquire+0xfa5/0x12f0
   ? lock_acquire+0x1c2/0x230
   ? tick_periodic+0x3a/0xf0
   ? _raw_spin_lock+0x42/0x50
   ? tick_periodic+0x3a/0xf0
   ? tick_periodic+0x3a/0xf0
   ? debug_smp_processor_id+0x12/0x20
   ? tick_handle_periodic+0x23/0xc0
   ? local_apic_timer_interrupt+0x63/0x70
   ? smp_trace_apic_timer_interrupt+0x235/0x6a0
   ? trace_apic_timer_interrupt+0x37/0x3c
   ? strrchr+0x23/0x50
  Code: 0f 95 c1 89 c7 89 45 e4 0f b6 c1 89 c6 89 45 dc 8b 04 85 98 cb 74 bc 88 4d e3 89 45 f0 83 c0 01 84 c9 89 04 b5 98 cb 74 bc 74 3b <8b> 47 38 8b 57 34 c6 43 1d 01 25 00 00 02 00 83 e2 03 09 d0 83
  EIP: update_stack_state+0xd4/0x340 SS:ESP: 0068:bb3adc3f
  CR2: 00000000000001f2
  ---[ end trace 0d147fd4aba8ff50 ]---
  Kernel panic - not syncing: Fatal exception in interrupt

On x86-32, after decoding a frame pointer to get a regs address, regs_size() dereferences the regs pointer when it checks regs->cs to see if the regs are user mode. This is dangerous because it's possible that what looks like a decoded frame pointer is actually a corrupt value, and we don't want the unwinder to make things worse.

Instead of calling regs_size() on an unsafe pointer, just assume they're kernel regs to start with. Later, once it's safe to access the regs, we can do the user mode check and corresponding safety check for the remaining two regs.

Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: LKP <lkp@01.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 5ed8d8bb38c5 ("x86/unwind: Move common code into update_stack_state()")
Link: http://lkml.kernel.org/r/7f95b9a6993dec7674b3f3ab3dcd3294f7b9644d.1507597785.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  powerpc: Don't call lockdep_assert_cpus_held() from arch_update_cpu_topology()  [Thiago Jung Bauermann]
It turns out that not all paths calling arch_update_cpu_topology() hold cpu_hotplug_lock, but that's OK because those paths can't race with any concurrent hotplug events. Warnings were reported with the following trace:

  lockdep_assert_cpus_held
  arch_update_cpu_topology
  sched_init_domains
  sched_init_smp
  kernel_init_freeable
  kernel_init
  ret_from_kernel_thread

Which is safe because it's called early in boot when hotplug is not live yet.

And also this trace:

  lockdep_assert_cpus_held
  arch_update_cpu_topology
  partition_sched_domains
  cpuset_update_active_cpus
  sched_cpu_deactivate
  cpuhp_invoke_callback
  cpuhp_down_callbacks
  cpuhp_thread_fun
  smpboot_thread_fn
  kthread
  ret_from_kernel_thread

Which is safe because it's called as part of CPU hotplug, so although we don't hold the CPU hotplug lock, there is another thread driving the CPU hotplug operation which does hold the lock, and there is no race.

Thanks to tglx for deciphering it for us.

Fixes: 3e401f7a2e51 ("powerpc: Only obtain cpu_hotplug_lock if called by rtasd")
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-10-10  locking/arch: Remove dummy arch_{read,spin,write}_lock_flags() implementations  [Will Deacon]
The arch_{read,spin,write}_lock_flags() macros are simply mapped to the non-flags versions by the majority of architectures, so do this in core code and remove the dummy implementations. Also remove the implementation in spinlock_up.h, since all callers of do_raw_spin_lock_flags() call local_irq_save(flags) anyway. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1507055129-12300-4-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  locking/arch: Remove dummy arch_{read,spin,write}_relax() implementations  [Will Deacon]
arch_{read,spin,write}_relax() are defined as cpu_relax() by the core code, so architectures that can't do better (i.e. most of them) don't need to bother with the dummy definitions. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1507055129-12300-3-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  locking/core: Remove {read,spin,write}_can_lock()  [Will Deacon]
Outside of the locking code itself, {read,spin,write}_can_lock() have no users in tree. Apparmor (the last remaining user of write_can_lock()) got moved over to lockdep by the previous patch. This patch removes the use of {read,spin,write}_can_lock() from the BUILD_LOCK_OPS macro, deferring to the trylock operation for testing the lock status, and subsequently removes the unused macros altogether. They aren't guaranteed to work in a concurrent environment and can give incorrect results in the case of qrwlock. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1507055129-12300-2-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  locking/rwsem, security/apparmor: Replace homebrew use of write_can_lock() with lockdep  [Will Deacon]
The lockdep subsystem provides a robust way to assert that a lock is held, so use that instead of write_can_lock, which can give incorrect results for qrwlocks. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Johansen <john.johansen@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1507055129-12300-1-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  locking/rwsem, fs: Use killable down_read() in iterate_dir()  [Kirill Tkhai]
There was mutex_lock_interruptible() initially, and it was changed to an rwsem, but there were no killable rwsem primitives at that time. From commit 9902af79c01a: "The main issue is the lack of down_write_killable(), so the places like readdir.c switched to plain inode_lock(); once killable variants of rwsem primitives appear, that'll be dealt with". Use down_read_killable() in the shared case, just as down_write_killable() is used in the !shared case, since a concurrent inode_lock() may take a long time and the user may want to interrupt it. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: avagin@virtuozzo.com Cc: davem@davemloft.net Cc: fenghua.yu@intel.com Cc: gorcunov@virtuozzo.com Cc: heiko.carstens@de.ibm.com Cc: hpa@zytor.com Cc: ink@jurassic.park.msu.ru Cc: mattst88@gmail.com Cc: rientjes@google.com Cc: rth@twiddle.net Cc: schwidefsky@de.ibm.com Cc: tony.luck@intel.com Cc: viro@zeniv.linux.org.uk Link: http://lkml.kernel.org/r/150670120820.23930.5455667921545937220.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org>
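The locking change described above, condensed into a helper-shaped sketch (iterate_dir() itself open-codes this):

  static int sketch_iterate_dir_lock(struct inode *inode, bool shared)
  {
          /* both paths are now interruptible by a fatal signal (-EINTR) */
          return shared ? down_read_killable(&inode->i_rwsem)
                        : down_write_killable(&inode->i_rwsem);
  }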
2017-10-10  locking/rwsem: Add down_read_killable()  [Kirill Tkhai]
Similar to down_read() and down_write_killable(), add killable version of down_read(), based on __down_read_killable() function, added in previous patches. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: avagin@virtuozzo.com Cc: davem@davemloft.net Cc: fenghua.yu@intel.com Cc: gorcunov@virtuozzo.com Cc: heiko.carstens@de.ibm.com Cc: hpa@zytor.com Cc: ink@jurassic.park.msu.ru Cc: mattst88@gmail.com Cc: rientjes@google.com Cc: rth@twiddle.net Cc: schwidefsky@de.ibm.com Cc: tony.luck@intel.com Cc: viro@zeniv.linux.org.uk Link: http://lkml.kernel.org/r/150670119884.23930.2585570605960763239.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org>
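A rough shape of the new primitive, mirroring down_write_killable(); the in-tree version also performs lockdep annotation and owner tracking:

  int __sched sketch_down_read_killable(struct rw_semaphore *sem)
  {
          might_sleep();

          /* the arch slow path returns -EINTR if a fatal signal arrives */
          if (__down_read_killable(sem))
                  return -EINTR;

          return 0;
  }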
2017-10-10  locking/arch, x86: Add __down_read_killable()  [Kirill Tkhai]
Similar to __down_write_killable(), add a read killable primitive: extract the current __down_read() code into a macro and teach it to take different functions as the slow_path argument: store the ax register to ret, and add the sp register and preserve its value. Add a call_rwsem_down_read_failed_killable() assembly entry similar to call_rwsem_down_read_failed(): push the dx register to the stack in addition to the common registers, as it's not declared as modifiable in ____down_read(). Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: avagin@virtuozzo.com Cc: davem@davemloft.net Cc: fenghua.yu@intel.com Cc: gorcunov@virtuozzo.com Cc: heiko.carstens@de.ibm.com Cc: hpa@zytor.com Cc: ink@jurassic.park.msu.ru Cc: mattst88@gmail.com Cc: rientjes@google.com Cc: rth@twiddle.net Cc: schwidefsky@de.ibm.com Cc: tony.luck@intel.com Cc: viro@zeniv.linux.org.uk Link: http://lkml.kernel.org/r/150670118802.23930.1316107715255410256.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  locking/arch, s390: Add __down_read_killable()  [Kirill Tkhai]
Similar to __down_write_killable(), add a read killable primitive. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: avagin@virtuozzo.com Cc: davem@davemloft.net Cc: fenghua.yu@intel.com Cc: gorcunov@virtuozzo.com Cc: heiko.carstens@de.ibm.com Cc: hpa@zytor.com Cc: ink@jurassic.park.msu.ru Cc: mattst88@gmail.com Cc: rientjes@google.com Cc: rth@twiddle.net Cc: schwidefsky@de.ibm.com Cc: tony.luck@intel.com Cc: viro@zeniv.linux.org.uk Link: http://lkml.kernel.org/r/150670117817.23930.13068785028558453848.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  locking/arch, ia64: Add __down_read_killable()  [Kirill Tkhai]
Similar to __down_write_killable(), add a read killable primitive. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: avagin@virtuozzo.com Cc: davem@davemloft.net Cc: fenghua.yu@intel.com Cc: gorcunov@virtuozzo.com Cc: heiko.carstens@de.ibm.com Cc: hpa@zytor.com Cc: ink@jurassic.park.msu.ru Cc: mattst88@gmail.com Cc: rientjes@google.com Cc: rth@twiddle.net Cc: schwidefsky@de.ibm.com Cc: tony.luck@intel.com Cc: viro@zeniv.linux.org.uk Link: http://lkml.kernel.org/r/150670116749.23930.14976888440968191759.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  locking/arch, alpha: Add __down_read_killable()  [Kirill Tkhai]
Similar to __down_write_killable(), add a read killable primitive. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: avagin@virtuozzo.com Cc: davem@davemloft.net Cc: fenghua.yu@intel.com Cc: gorcunov@virtuozzo.com Cc: heiko.carstens@de.ibm.com Cc: hpa@zytor.com Cc: ink@jurassic.park.msu.ru Cc: mattst88@gmail.com Cc: rientjes@google.com Cc: rth@twiddle.net Cc: schwidefsky@de.ibm.com Cc: tony.luck@intel.com Cc: viro@zeniv.linux.org.uk Link: http://lkml.kernel.org/r/150670115677.23930.5711263025537758463.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  locking/spinlocks, paravirt, xen: Correct the xen_nopvspin case  [Juergen Gross]
With the boot parameter "xen_nopvspin" specified a Xen guest should not make use of paravirt spinlocks, but behave as if running on bare metal. This is not true, however, as the qspinlock code will fall back to a test-and-set scheme when it is detecting a hypervisor. In order to avoid this disable the virt_spin_lock_key. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akataria@vmware.com Cc: boris.ostrovsky@oracle.com Cc: chrisw@sous-sol.org Cc: hpa@zytor.com Cc: jeremy@goop.org Cc: rusty@rustcorp.com.au Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/20170906173625.18158-3-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
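A sketch of the fix in the Xen spinlock setup path; virt_spin_lock_key is the static key introduced by the following entry, and the function body here is illustrative:

  void __init sketch_xen_init_spinlocks(void)
  {
          if (!xen_pvspin) {
                  /* behave like bare metal: no PV locks, and also no
                   * test-and-set fallback in the qspinlock slow path */
                  static_branch_disable(&virt_spin_lock_key);
                  return;
          }

          /* ... normal PV spinlock setup ... */
  }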
2017-10-10  locking/paravirt: Use new static key for controlling call of virt_spin_lock()  [Juergen Gross]
There are cases where a guest tries to switch spinlocks to bare metal behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this has the downside of falling back to unfair test and set scheme for qspinlocks due to virt_spin_lock() detecting the virtualized environment. Add a static key controlling whether virt_spin_lock() should be called or not. When running on bare metal set the new key to false. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akataria@vmware.com Cc: boris.ostrovsky@oracle.com Cc: chrisw@sous-sol.org Cc: hpa@zytor.com Cc: jeremy@goop.org Cc: rusty@rustcorp.com.au Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/20170906173625.18158-2-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
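A condensed sketch of the key and its use as described above (function names and the elided fallback body are assumptions):

  DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);

  void __init sketch_native_pv_lock_init(void)
  {
          /* bare metal: never take the test-and-set fallback path */
          if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
                  static_branch_disable(&virt_spin_lock_key);
  }

  static inline bool sketch_virt_spin_lock(struct qspinlock *lock)
  {
          if (!static_branch_likely(&virt_spin_lock_key))
                  return false;   /* use the regular qspinlock path */

          /* ... hypervisor fallback: test-and-set spin on the lock ... */
          return true;
  }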
2017-10-10  Merge branch 'locking/urgent' into locking/core, to pick up fixes  [Ingo Molnar]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  x86/tsc: Append the 'tsc=' description for the 'tsc=unstable' boot parameter  [Dou Liyang]
Commit: 8309f86cd41e ("x86/tsc: Provide 'tsc=unstable' boot parameter") added a new 'tsc=unstable' parameter, but didn't document it. Document it. Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <corbet@lwn.net> Cc: <pasha.tatashin@oracle.com> Cc: <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1507539813-11420-1-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  sched/rt: Simplify the IPI based RT balancing logic  [Steven Rostedt (Red Hat)]
When a CPU lowers its priority (schedules out a high priority task for a lower priority one), a check is made to see if any other CPU has overloaded RT tasks (more than one). It checks the rto_mask to determine this and, if so, it will request to pull one of those tasks to itself if the non-running RT task is of higher priority than the new priority of the next task to run on the current CPU.

When we deal with a large number of CPUs, the original pull logic suffered from large lock contention on a single CPU run queue, which caused a huge latency across all CPUs. This was caused by only having one CPU with overloaded RT tasks and a bunch of other CPUs lowering their priority. To solve this issue, commit b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling") changed the way to request a pull. Instead of grabbing the lock of the overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build up, it still could suffer from a large number of IPIs being sent to a single CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet, when I tested this on a 120 CPU box, with a stress test that had lots of RT tasks scheduling on all CPUs, it actually triggered the hard lockup detector! One CPU had so many IPIs sent to it, and due to the restart mechanism that is triggered when the source run queue has a priority status change, the CPU spent minutes! processing the IPIs.

Thinking about this further, I realized there's no reason for each run queue to send its own IPI. As all CPUs with overloaded tasks must be scanned regardless of whether there's one or many CPUs lowering their priority, because there's no current way to find the CPU with the highest priority task that can schedule to one of these CPUs, there really only needs to be one IPI being sent around at a time. This greatly simplifies the code!

The new approach is to have each root domain have its own irq work, as the rto_mask is per root domain. The root domain has the following fields attached to it:

  rto_push_work  - the irq work to process each CPU set in rto_mask
  rto_lock       - the lock to protect some of the other rto fields
  rto_loop_start - an atomic that keeps contention down on rto_lock; the first CPU scheduling in a lower priority task is the one to kick off the process
  rto_loop_next  - an atomic that gets incremented for each CPU that schedules in a lower priority task
  rto_loop       - a variable protected by rto_lock that is used to compare against rto_loop_next
  rto_cpu        - the CPU to send the next IPI to, also protected by the rto_lock

When a CPU schedules in a lower priority task and wants to make sure overloaded CPUs know about it, it increments rto_loop_next. Then it atomically sets rto_loop_start with a cmpxchg. If the old value is not "0", then it is done, as another CPU is kicking off the IPI loop. If the old value is "0", then it will take the rto_lock to synchronize with a possible IPI being sent around to the overloaded CPUs.

If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no IPI being sent around, or one is about to finish. Then rto_cpu is set to the first CPU in rto_mask and an IPI is sent to that CPU. If there are no CPUs set in rto_mask, then there's nothing to be done.

When a CPU receives the IPI, it will first try to push any RT tasks that are queued on the CPU but can't run because a higher priority RT task is currently running on that CPU. Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it finds one, it simply sends an IPI to that CPU and the process continues. If there are no more CPUs in the rto_mask, then rto_loop is compared with rto_loop_next. If they match, everything is done and the process is over. If they do not match, then a CPU scheduled in a lower priority task while the IPI was being passed around, and the process needs to start again. The first CPU in rto_mask is sent the IPI.

This change removes the duplication of work in the IPI logic, and greatly lowers the latency caused by the IPIs. This removed the lockup happening on the 120 CPU machine. It also simplifies the code tremendously. What else could anyone ask for?

Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and supplying me with the rto_start_trylock() and rto_start_unlock() helper functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  block/ioprio: Use a helper to check for RT prio  [Sebastian Andrzej Siewior]
A side effect compared to the old code is that SCHED_DEADLINE is now also recognized. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jens Axboe <axboe@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171004154901.26904-2-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  sched/rt: Add a helper to test for a RT task  [Sebastian Andrzej Siewior]
This helper returns true if a task has elevated priority which is true for RT tasks (SCHED_RR and SCHED_FIFO) and also for SCHED_DEADLINE. A task which runs at RT priority due to PI-boosting is not considered as one with elevated priority. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jens Axboe <axboe@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171004154901.26904-1-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
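A sketch of that contract; the helper name task_is_realtime is an assumption here, and checking the policy (rather than the effective priority) is what keeps PI-boosted tasks out:

  static inline bool sketch_task_is_realtime(struct task_struct *tsk)
  {
          int policy = tsk->policy;

          if (policy == SCHED_FIFO || policy == SCHED_RR)
                  return true;
          if (policy == SCHED_DEADLINE)
                  return true;

          return false;   /* PI-boosted SCHED_NORMAL tasks land here */
  }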
2017-10-10  sched/fair: Fix usage of find_idlest_group() when the local group is idlest  [Brendan Jackman]
find_idlest_group() returns NULL when the local group is idlest. The caller then continues the find_idlest_group() search at a lower level of the current CPU's sched_domain hierarchy. find_idlest_group_cpu() is not consulted and, crucially, @new_cpu is not updated. This means the search is pointless and we return @prev_cpu from select_task_rq_fair(). This is fixed by initialising @new_cpu to @cpu instead of @prev_cpu. Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171005114516.18617-6-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  sched/fair: Fix usage of find_idlest_group() when no groups are allowed  [Brendan Jackman]
When 'p' is not allowed on any of the CPUs in the sched_domain, we currently return NULL from find_idlest_group(), and pointlessly continue the search on lower sched_domain levels (where 'p' is also not allowed) before returning prev_cpu regardless (as we have not updated new_cpu). Add an explicit check for this case, and add a comment to find_idlest_group(). Now when find_idlest_group() returns NULL, it always means that the local group is allowed and idlest. Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171005114516.18617-5-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  sched/fair: Fix find_idlest_group() when local group is not allowed  [Brendan Jackman]
When the local group is not allowed we do not modify this_*_load from their initial value of 0. That means that the load checks at the end of find_idlest_group cause us to incorrectly return NULL. Fixing the initial values to ULONG_MAX means we will instead return the idlest remote group in that case. Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171005114516.18617-4-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  sched/fair: Remove unnecessary comparison with -1  [Brendan Jackman]
Since commit: 83a0a96a5f26 ("sched/fair: Leverage the idle state info when choosing the "idlest" cpu") find_idlest_group_cpu() (formerly find_idlest_cpu) no longer returns -1, so we can simplify the checking of the return value in find_idlest_cpu(). Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171005114516.18617-3-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  sched/fair: Move select_task_rq_fair() slow-path into its own function  [Brendan Jackman]
In preparation for changes that would otherwise require adding a new level of indentation to the while(sd) loop, create a new function find_idlest_cpu() which contains this loop, and rename the existing find_idlest_cpu() to find_idlest_group_cpu(). Code inside the while(sd) loop is unchanged. @new_cpu is added as a variable in the new function, with the same initial value as the @new_cpu in select_task_rq_fair(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171005114516.18617-2-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  sched/fair: Force balancing on NOHZ balance if local group has capacity  [Brendan Jackman]
The "goto force_balance" here is intended to mitigate the fact that avg_load calculations can result in bad placement decisions when priority is asymmetrical. The original commit that adds it: fab476228ba3 ("sched: Force balancing on newidle balance if local group has capacity") explains: Under certain situations, such as a niced down task (i.e. nice = -15) in the presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and kicks away other tasks because of its large weight. This leads to sub-optimal utilization of the machine. Even though the sched group has capacity, it does not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL. A similar but inverted issue also affects ARM big.LITTLE (asymmetrical CPU capacity) systems - consider 8 always-running, same-priority tasks on a system with 4 "big" and 4 "little" CPUs. Suppose that 5 of them end up on the "big" CPUs (which will be represented by one sched_group in the DIE sched_domain) and 3 on the "little" (the other sched_group in DIE), leaving one CPU unused. Because the "big" group has a higher group_capacity its avg_load may not present an imbalance that would cause migrating a task to the idle "little". The force_balance case here solves the problem but currently only for CPU_NEWLY_IDLE balances, which in theory might never happen on the unused CPU. Including CPU_IDLE in the force_balance case means there's an upper bound on the time before we can attempt to solve the underutilization: after DIE's sd->balance_interval has passed the next nohz balance kick will help us out. Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170807163900.25180-1-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  sched/fair: Sync task util before slow-path wakeup  [Brendan Jackman]
We use task_util() in find_idlest_group() via capacity_spare_wake(). This task_util() is updated in wake_cap(). However wake_cap() is not the only reason for ending up in find_idlest_group() - we could have been sent there by wake_wide(). So explicitly sync the task util with prev_cpu when we are about to head to find_idlest_group(). We could simply do this at the beginning of select_task_rq_fair() (i.e. irrespective of whether we're heading to select_idle_sibling() or find_idlest_group() & co), but I didn't want to slow down the select_idle_sibling() path more than necessary. Don't do this during fork balancing, we won't need the task_util and we'd just clobber the last_update_time, which is supposed to be 0. Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andres Oportus <andresoportus@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20170808095519.10077-1-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  sched/fair: Search a task from the tail of the queue  [Uladzislau Rezki]
As a first step this patch makes the cfs_tasks list an MRU one. It means that when the next task is picked to run on a physical CPU, it is moved to the front of the list. Therefore, the cfs_tasks list is more or less sorted (except for woken tasks), starting from recently given CPU time tasks toward tasks with the maximum wait time in the run-queue, i.e. an MRU list.

Second, as part of the load balance operation, this approach starts detach_tasks()/detach_one_task() from the tail of the queue instead of the head, giving some advantages:

  - it tends to pick a task with the highest wait time;
  - tasks located in the tail are less likely to be cache-hot, therefore the can_migrate_task() check is more likely to allow migrating them.

hackbench shows slightly better performance. For example, doing 1000 samples and 40 groups on an i5-3320M CPU, it shows the figures below:

  default: 0.657 avg
  patched: 0.646 avg

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/20170913102430.8985-2-urezki@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
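Two helper-shaped fragments illustrating the scheme, using the existing cfs_tasks list and se.group_node linkage; the surrounding pick/detach functions are elided:

  /* pick path: keep cfs_tasks roughly MRU-ordered */
  static void sketch_mark_recently_run(struct rq *rq, struct task_struct *p)
  {
          list_move(&p->se.group_node, &rq->cfs_tasks);
  }

  /* detach path: take the longest-waiting, least cache-hot task first */
  static struct task_struct *sketch_pick_from_tail(struct list_head *tasks)
  {
          return list_last_entry(tasks, struct task_struct, se.group_node);
  }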
2017-10-10  sched/topology: Introduce NUMA identity node sched domain  [Suravee Suthikulpanit]
On AMD Family17h-based (EPYC) systems, a logical NUMA node can contain up to 8 cores (16 threads) with the following topology:

       ----------------------------
   C0  | T0 T1 |    ||    | T0 T1 |  C4
       --------|    ||    |--------
   C1  | T0 T1 | L3 || L3 | T0 T1 |  C5
       --------|    ||    |--------
   C2  | T0 T1 | #0 || #1 | T0 T1 |  C6
       --------|    ||    |--------
   C3  | T0 T1 |    ||    | T0 T1 |  C7
       ----------------------------

Here, there are 2 last-level (L3) caches per logical NUMA node. A socket can contain up to 4 NUMA nodes, and a system can support up to 2 sockets. With the full system configuration, the current scheduler creates 4 sched domains:

  domain0 SMT  (spans a core)
  domain1 MC   (spans a last-level cache)
  domain2 NUMA (spans a socket: 4 nodes)
  domain3 NUMA (spans a system: 8 nodes)

Note that there is no domain to represent the CPUs spanning a logical NUMA node. With this hierarchy of sched domains, the scheduler does not balance properly in the following cases:

Case 1: When running 8 tasks, a properly balanced system should schedule a task per logical NUMA node. This is not the case for the current scheduler.

Case 2: In some cases, threads are scheduled on the same CPU while other CPUs are idle. This results in run-to-run inconsistency. For example:

  taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
                          --cpu-max-prime=100000 run

Total execution time ranges from 25.1s to 33.5s depending on thread placement, where 25.1s is when all 8 threads are balanced properly on 8 CPUs.

Introducing a NUMA identity node sched domain, which is based on how the SRAT/SLIT tables define a logical NUMA node, results in the following hierarchy of sched domains on the same system described above:

  domain0 SMT  (spans a core)
  domain1 MC   (spans a last-level cache)
  domain2 NODE (spans a logical NUMA node)
  domain3 NUMA (spans a socket: 4 nodes)
  domain4 NUMA (spans a system: 8 nodes)

This fixes the improper load balancing cases mentioned above.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/1504768805-46716-1-git-send-email-suravee.suthikulpanit@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  sched/topology: Restore SD_PREFER_SIBLING on MC domains  [Peter Zijlstra]
The normal x86_topology on NHM+ machines degenerates because the MC and CPU domains are of the same size, therefore MC inherits SD_PREFER_SIBLING from CPU (which then gets taken out). The result is that we'll spread tasks across the first NUMA level in order to maximize cache utilization. However, for the x86_numa_in_package_topology we lose the CPU domain, and we'll not have SD_PREFER_SIBLING set anywhere, giving a distinct difference in behaviour. Commit: 8e7fbcbc22c1 ("sched: Remove stale power aware scheduling remnants and dysfunctional knobs") made a mistake by not preserving the SD_PREFER_SIBLING for the !power_saving case on both CPU and MC. Then commit: 6956dc568f34 ("sched/numa: Add SD_PERFER_SIBLING to CPU domain") adds it back to the CPU but not MC. Restore that now, such that we get consistent spreading behaviour wrt L3 and NUMA. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10  sched/deadline: Use C bitfields for the state flags  [luca abeni]
Ask the compiler to use a single bit for storing true / false values, instead of wasting the size of a whole int value. Tested with gcc 5.4.0 on x86_64, and the compiler produces the expected Assembly (similar to the Assembly code generated when explicitly accessing the bits with bitmasks, "&" and "|"). Signed-off-by: luca abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1504778971-13573-5-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
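An illustrative before/after of the flag storage; the field names follow the deadline entity's flags, and whether every one of them exists in exactly this form is an assumption:

  struct sketch_dl_flags_before {
          int dl_throttled;               /* a whole int per boolean */
          int dl_boosted;
          int dl_yielded;
  };

  struct sketch_dl_flags_after {
          unsigned int dl_throttled : 1;  /* one bit each */
          unsigned int dl_boosted   : 1;
          unsigned int dl_yielded   : 1;
  };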