Age | Commit message (Collapse) | Author |
|
we need to take care of failure exit as well - pages already
in bio should be dropped by analogue of bio_unmap_pages(),
since their refcounts had been bumped only once per reference
in bio.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
bio_map_user_iov and bio_unmap_user do unbalanced pages refcounting if
IO vector has small consecutive buffers belonging to the same page.
bio_add_pc_page merges them into one, but the page reference is never
dropped.
Cc: stable@vger.kernel.org
Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
If for some reason, the newly allocated child need to be freed,
we will call cgroup_put() (via sk_free_unlock_clone()) while the
corresponding cgroup_get() was not yet done, and we will free memory
too soon.
Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit fbb1fb4ad415cb31ce944f65a5ca700aaf73a227.
This was not the proper fix, lets cleanly revert it, so that
following patch can be carried to stable versions.
sock_cgroup_ptr() callers do not expect a NULL return value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the code added to function submit_page_section by commit b1058b981,
sdio->bio can currently be NULL when calling dio_bio_submit. This then
leads to a NULL pointer access in dio_bio_submit, so check for a NULL
bio in submit_page_section before trying to submit it instead.
Fixes xfstest generic/250 on gfs2.
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
struct pci_host_bridge gained hooks to map/swizzle IRQs, so that the IRQ
mapping can be done automatically by PCI core code through the
pci_assign_irq() function instead of resorting to arch-specific
implementation callbacks to carry out the same task which force PCI host
bridge drivers implementation to implement per-arch kludges to carry out a
task that is inherently architecture agnostic.
Commit 769b461fc0c0 ("arm64: PCI: Drop DT IRQ allocation from
pcibios_alloc_irq()") was assuming all PCI host controller drivers had been
converted to use ->map_irq(), but that wasn't the case: pci-aardvark had
not been converted. Due to this, it broke the support for legacy PCI
interrupts when using the pci-aardvark driver (used on Marvell Armada 3720
platforms).
In order to fix this, we make sure the ->map_irq and ->swizzle_irq fields
of pci_host_bridge are properly filled in.
Fixes: 769b461fc0c0 ("arm64: PCI: Drop DT IRQ allocation from pcibios_alloc_irq()")
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org # v4.13+
|
|
This reverts commit d7bd554f27c942e6b8b54100b4044f9be1038edf.
It turns out that Tegra20 has a bug in the implementation of the MSI
target address register (which is worked around by the existence of the
struct tegra_pcie_soc.msi_base_shift parameter) that restricts the MSI
target memory to the lower 32 bits of physical memory on that particular
generation. The offending patch causes a regression on TrimSlice, which
is a Tegra20-based device and has a PCI network interface card.
An initial, simpler fix was to change the MSI target address for Tegra20
only, but it was pointed out that the offending commit also prevents the
use of 32-bit only MSI capable devices, even on later chips. Technically
this was never guaranteed to work with the prior code in the first place
because the allocated page could have resided beyond the 4 GiB boundary,
but it is still possible that this could've introduced a regression.
The proper fix that was settled on is to select a fixed address within
the lowest 32 bits of physical address space that is otherwise unused,
but testing of that patch has provided mixed results that are not fully
understood yet.
Given all of the above and the relative urgency to get this fixed in
v4.13, revert the offending commit until a universal fix is found.
Fixes: d7bd554f27c9 ("PCI: tegra: Do not allocate MSI target memory")
Reported-by: Tomasz Maciej Nowak <tmn505@gmail.com>
Reported-by: Erik Faye-Lund <kusmabite@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org # 4.13.x
|
|
Jakub Kicinski says:
====================
nfp: fix ethtool stats and page allocation
Two fixes for net. First one makes sure we handle gather of stats on
32bit machines correctly (ouch). The second fix solves a potential
NULL-deref if we fail to allocate a page with XDP running.
I used Fixes: tags pointing to where the bug was introduced, but for
patch 1 it has been in the driver "for ever" and fix won't backport
cleanly beyond commit 325945ede6d4 ("nfp: split software and hardware
vNIC statistics") which is in net.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
page_address() does not handle NULL argument gracefully,
make sure we NULL-check the page pointer before passing it
to page_address().
Fixes: ecd63a0217d5 ("nfp: add XDP support in the driver")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The while loop fetching 64 bit ethtool statistics may have
to retry multiple times, it shouldn't modify the outside state.
Fixes: 4c3523623dc0 ("net: add driver for Netronome NFP4000/NFP6000 NIC VFs")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2017-10-10
This series contains updates to i40e only.
Stefano Brivio fixes the grammar in a function header comment.
Alex fixes a memory leak where we were not correctly placing the pages
from buffers that had been used to return a filter programming status
back on the ring.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp fixlet from Kees Cook:
"Minor seccomp fix for v4.14-rc5. I debated sending this at all for
v4.14, but since it fixes a minor issue in the prior fix, which also
went to -stable, it seemed better to just get all of it cleaned up
right now.
- fix missed "static" to avoid Sparse warning (Colin King)"
* tag 'seccomp-v4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
seccomp: make function __get_seccomp_filter static
|
|
Pull nfsd fix from Bruce Fields:
"One fix for a 4.14 regression, and one minor fix to the MAINTAINERs
file. (I was weirdly flattered by the idea that lots of random people
suddenly seemed to think Jeff and I were VFS experts. Turns out it was
just a typo)"
* tag 'nfsd-4.14-1' of git://linux-nfs.org/~bfields/linux:
nfsd4: define nfsd4_secinfo_no_name_release()
MAINTAINERS: associate linux/fs.h with VFS instead of file locking
|
|
Convert Variable Length Array in Struct (VLAIS) to valid C by converting
local struct definition to use a flexible array. The structure is only
used to define a cast of a buffer so the size of the struct is not used
to allocate storage.
Signed-off-by: Behan Webster <behanw@converseincode.com>
Signed-off-by: Mark Charebois <charlebm@gmail.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The function __get_seccomp_filter is local to the source and does
not need to be in global scope, so make it static.
Cleans up sparse warning:
symbol '__get_seccomp_filter' was not declared. Should it be static?
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Fixes: 66a733ea6b61 ("seccomp: fix the usage of get/put_seccomp_filter() in seccomp_get_filter()")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
When RPMSG_QCOM_GLINK_SMEM=m and one driver causes the qcom_common.c file
to be compiled as built-in, we get a link error:
drivers/remoteproc/qcom_common.o: In function `glink_subdev_remove':
qcom_common.c:(.text+0x130): undefined reference to `qcom_glink_smem_unregister'
qcom_common.c:(.text+0x130): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `qcom_glink_smem_unregister'
drivers/remoteproc/qcom_common.o: In function `glink_subdev_probe':
qcom_common.c:(.text+0x160): undefined reference to `qcom_glink_smem_register'
qcom_common.c:(.text+0x160): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `qcom_glink_smem_register'
Out of the three PIL driver instances, QCOM_ADSP_PIL already has a
Kconfig dependency to prevent this from happening, but the other two
do not. This adds the same dependency there.
Fixes: eea07023e6d9 ("remoteproc: qcom: adsp: Allow defining GLINK edge")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
|
|
The priv->mem[] array has IMX7D_RPROC_MEM_MAX elements so the > should
be >= to avoid writing one element beyond the end of the array.
Fixes: a0ff4aa6f010 ("remoteproc: imx_rproc: add a NXP/Freescale imx_rproc driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
|
|
We need to free "intent" and "intent->data" on a couple error paths.
Fixes: 933b45da5d1d ("rpmsg: glink: Add support for TX intents")
Acked-by: Sricharan R <sricharan@codeaurora.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
|
|
If qcom_glink_tx() fails, then we need to unlock before returning the
error code.
Fixes: 27b9c5b66b23 ("rpmsg: glink: Request for intents when unavailable")
Acked-by: Sricharan R <sricharan@codeaurora.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fix from Jaegeuk Kim:
"This contains one bug fix which causes a kernel panic during fstrim
introduced in 4.14-rc1"
* tag 'f2fs-for-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: fix potential panic during fstrim
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
- fix for x86: sysret_ss_attrs test build failure preventing the x86
tests from running
- fix mqueue: fix regression in silencing test run output
* tag 'linux-kselftest-4.14-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests: mqueue: fix regression in silencing output from RUN_TESTS
selftests: x86: sysret_ss_attrs doesn't build on a PIE build
|
|
When SME memory encryption is active it will rely on SWIOTLB to handle
DMA for devices that cannot support the addressing requirements of
having the encryption mask set in the physical address. The IOMMU
currently disables SWIOTLB if it is not running in passthrough mode.
This is not desired as non-PCI devices attempting DMA may fail. Update
the code to check if SME is active and not disable SWIOTLB.
Fixes: 2543a786aa25 ("iommu/amd: Allow the AMD IOMMU to work with memory encryption")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
- Unbreak 'perf record' for arm/arm64 with events with explicit PMU (Mark Rutland)
- Add missing separator for "perf script -F ip,brstack" (and brstackoff) (Mark Santaniello)
- One line, comment only, sync kernel ABI header with tooling header (Arnaldo Carvalho de Melo)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The shash ahash digest adaptor function may crash if given a
zero-length input together with a null SG list. This is because
it tries to read the SG list before looking at the length.
This patch fixes it by checking the length first.
Cc: <stable@vger.kernel.org>
Reported-by: Stephan Müller<smueller@chronox.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Stephan Müller <smueller@chronox.de>
|
|
Eryu has reported that since commit 7b9ca4c61bc2 "quota: Reduce
contention on dq_data_lock" test generic/233 occasionally fails. This is
caused by the fact that since that commit we don't generate warning and
set grace time for quota allocations that have DQUOT_SPACE_NOFAIL set
(these are for example some metadata allocations in ext4). We need these
allocations to behave regularly wrt warning generation and grace time
setting so fix the code to return to the original behavior.
Reported-and-tested-by: Eryu Guan <eguan@redhat.com>
CC: stable@vger.kernel.org
Fixes: 7b9ca4c61bc278b771fb57d6290a31ab1fc7fdac
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
It looks like we weren't correctly placing the pages from buffers that had
been used to return a filter programming status back on the ring. As a
result they were being overwritten and tracking of the pages was lost.
This change works to correct that by incorporating part of
i40e_put_rx_buffer into the programming status handler code. As a result we
should now be correctly placing the pages for those buffers on the
re-allocation list instead of letting them stay in place.
Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status")
Reported-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Anders K Pedersen <akp@cohaesio.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Caller needs to acquire the lock. Called functions will not.
Fixes: 09f79fd49d94 ("i40e: avoid NVM acquire deadlock during NVM update")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
lockdep:
[ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
[ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
[ 1270.473240] -----------------------------------------------------
[ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
[ 1270.474994]
[ 1270.474994] and this task is already holding:
[ 1270.475440] (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
[ 1270.476046] which would create a new lock dependency:
[ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
[ 1270.476949]
[ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
[ 1270.477553] (&pool->lock/1){-.-.}
...
[ 1270.488900] to a HARDIRQ-irq-unsafe lock:
[ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.}
...
[ 1270.494735] Possible interrupt unsafe locking scenario:
[ 1270.494735]
[ 1270.495250] CPU0 CPU1
[ 1270.495600] ---- ----
[ 1270.495947] lock(&(&lock->wait_lock)->rlock);
[ 1270.496295] local_irq_disable();
[ 1270.496753] lock(&pool->lock/1);
[ 1270.497205] lock(&(&lock->wait_lock)->rlock);
[ 1270.497744] <Interrupt>
[ 1270.497948] lock(&pool->lock/1);
, which will cause a irq inversion deadlock if the above lock scenario
happens.
The root cause of this safe -> unsafe lock order is the
mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
held.
Unlocking mutex while holding an irq spinlock was never safe and this
problem has been around forever but it never got noticed because the
only time the mutex is usually trylocked while holding irqlock making
actual failures very unlikely and lockdep annotation missed the
condition until the recent b9c16a0e1f73 ("locking/mutex: Fix
lockdep_assert_held() fail").
Using mutex for pool->manager_arb has always been a bit of stretch.
It primarily is an mechanism to arbitrate managership between workers
which can easily be done with a pool flag. The only reason it became
a mutex is that pool destruction path wants to exclude parallel
managing operations.
This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
and make the destruction path wait for the current manager on a wait
queue.
v2: Drop unnecessary flag clearing before pool destruction as
suggested by Boqun.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: stable@vger.kernel.org
|
|
is_last_gpte() is not equivalent to the pseudo-code given in commit
6bb69c9b69c31 ("KVM: MMU: simplify last_pte_bitmap") because an incorrect
value of last_nonleaf_level may override the result even if level == 1.
It is critical for is_last_gpte() to return true on level == 1 to
terminate page walks. Otherwise memory corruption may occur as level
is used as an index to various data structures throughout the page
walking code. Even though the actual bug would be wherever the MMU is
initialized (as in the previous patch), be defensive and ensure here
that is_last_gpte() returns the correct value.
This patch is also enough to fix CVE-2017-12188.
Fixes: 6bb69c9b69c315200ddc2bc79aee14c0184cf5b2
Cc: stable@vger.kernel.org
Cc: Andy Honig <ahonig@google.com>
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
[Panic if walk_addr_generic gets an incorrect level; this is a serious
bug and it's not worth a WARN_ON where the recovery path might hide
further exploitable issues; suggested by Andrew Honig. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The function updates context->root_level but didn't call
update_last_nonleaf_level so the previous and potentially wrong value
was used for page walks. For example, a zero value of last_nonleaf_level
would allow a potential out-of-bounds access in arch/x86/mmu/paging_tmpl.h's
walk_addr_generic function (CVE-2017-12188).
Fixes: 155a97a3d7c78b46cef6f1a973c831bc5a4f82bb
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As xen_cpuhp_setup is called by PV and PVHVM, the name of "x86/xen/hvm_guest"
is confusing.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
USB-audio driver may leave a stray URB for the mixer interrupt when it
exits by some error during probe. This leads to a use-after-free
error as spotted by syzkaller like:
==================================================================
BUG: KASAN: use-after-free in snd_usb_mixer_interrupt+0x604/0x6f0
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:16
dump_stack+0x292/0x395 lib/dump_stack.c:52
print_address_description+0x78/0x280 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351
kasan_report+0x23d/0x350 mm/kasan/report.c:409
__asan_report_load8_noabort+0x19/0x20 mm/kasan/report.c:430
snd_usb_mixer_interrupt+0x604/0x6f0 sound/usb/mixer.c:2490
__usb_hcd_giveback_urb+0x2e0/0x650 drivers/usb/core/hcd.c:1779
....
Allocated by task 1484:
save_stack_trace+0x1b/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
kmem_cache_alloc_trace+0x11e/0x2d0 mm/slub.c:2772
kmalloc ./include/linux/slab.h:493
kzalloc ./include/linux/slab.h:666
snd_usb_create_mixer+0x145/0x1010 sound/usb/mixer.c:2540
create_standard_mixer_quirk+0x58/0x80 sound/usb/quirks.c:516
snd_usb_create_quirk+0x92/0x100 sound/usb/quirks.c:560
create_composite_quirk+0x1c4/0x3e0 sound/usb/quirks.c:59
snd_usb_create_quirk+0x92/0x100 sound/usb/quirks.c:560
usb_audio_probe+0x1040/0x2c10 sound/usb/card.c:618
....
Freed by task 1484:
save_stack_trace+0x1b/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459
kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:524
slab_free_hook mm/slub.c:1390
slab_free_freelist_hook mm/slub.c:1412
slab_free mm/slub.c:2988
kfree+0xf6/0x2f0 mm/slub.c:3919
snd_usb_mixer_free+0x11a/0x160 sound/usb/mixer.c:2244
snd_usb_mixer_dev_free+0x36/0x50 sound/usb/mixer.c:2250
__snd_device_free+0x1ff/0x380 sound/core/device.c:91
snd_device_free_all+0x8f/0xe0 sound/core/device.c:244
snd_card_do_free sound/core/init.c:461
release_card_device+0x47/0x170 sound/core/init.c:181
device_release+0x13f/0x210 drivers/base/core.c:814
....
Actually such a URB is killed properly at disconnection when the
device gets probed successfully, and what we need is to apply it for
the error-path, too.
In this patch, we apply snd_usb_mixer_disconnect() at releasing.
Also introduce a new flag, disconnected, to struct usb_mixer_interface
for not performing the disconnection procedure twice.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
Do not consider the fixed size of hv_vp_set when passing the variable
header size to hv_do_rep_hypercall().
The Hyper-V hypervisor specification states that for a hypercall with a
variable header only the size of the variable portion should be supplied
via the input control.
For HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX/LIST_EX calls that means the
fixed portion of hv_vp_set should not be considered.
That fixes random failures of some applications that are unexpectedly
killed with SIGBUS or SIGSEGV.
Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: devel@linuxdriverproject.org
Fixes: 628f54cc6451 ("x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls")
Link: http://lkml.kernel.org/r/1507210469-29065-1-git-send-email-marcelo.cerri@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
hv_do_hypercall() does virt_to_phys() translation and with some configs
(CONFIG_SLAB) this doesn't work for percpu areas, we pass wrong memory to
hypervisor and get #GP. We could use working slow_virt_to_phys() instead
but doing so kills the performance.
Move pcpu_flush/pcpu_flush_ex structures out of percpu areas and
allocate memory on first call. The additional level of indirection gives
us a small performance penalty, in future we may consider introducing
hypercall functions which avoid virt_to_phys() conversion and cache
physical addresses of pcpu_flush/pcpu_flush_ex structures somewhere.
Reported-by: Simon Xiao <sixiao@microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20171005113924.28021-1-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
hv_flush_pcpu_ex structures are not cleared between calls for performance
reasons (they're variable size up to PAGE_SIZE each) but we must clear
hv_vp_set.bank_contents part of it to avoid flushing unneeded vCPUs. The
rest of the structure is formed correctly.
To do the clearing in an efficient way stash the maximum possible vCPU
number (this may differ from Linux CPU id).
Reported-by: Jork Loeser <Jork.Loeser@microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20171006154854.18092-1-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Currently if an allocation fails then the error return paths
don't free up any currently allocated pmus[].boxes and pmus causing
a memory leak. Add an error clean up exit path that frees these
objects.
Detected by CoverityScan, CID#711632 ("Resource Leak")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Fixes: 087bfbb03269 ("perf/x86: Add generic Intel uncore PMU support")
Link: http://lkml.kernel.org/r/20171009172655.6132-1-colin.king@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
x86-32 doesn't have stack validation, so in most cases it doesn't make
sense to warn about bad frame pointers.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: LKP <lkp@01.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a69658760800bf281e6353248c23e0fa0acf5230.1507597785.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When printing the unwinder dump, the stack pointer could be unaligned,
for one of two reasons:
- stack corruption; or
- GCC created an unaligned stack.
There's no way for the unwinder to tell the difference between the two,
so we have to assume one or the other. GCC unaligned stacks are very
rare, and have only been spotted before GCC 5. Presumably, if we're
doing an unwinder stack dump, stack corruption is more likely than a
GCC unaligned stack. So always align the stack before starting the
dump.
Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: LKP <lkp@01.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/2f540c515946ab09ed267e1a1d6421202a0cce08.1507597785.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
On x86-32, Tetsuo Handa and Fengguang Wu reported unwinder warnings
like:
WARNING: kernel stack regs at f60bb9c8 in swapper:1 has bad 'bp' value 0ba00000
And also there were some stack dumps with a bunch of unreliable '?'
symbols after an apic_timer_interrupt symbol, meaning the unwinder got
confused when it tried to read the regs.
The cause of those issues is that, with GCC 4.8 (and possibly older),
there are cases where GCC misaligns the stack pointer in a leaf function
for no apparent reason:
c124a388 <acpi_rs_move_data>:
c124a388: 55 push %ebp
c124a389: 89 e5 mov %esp,%ebp
c124a38b: 57 push %edi
c124a38c: 56 push %esi
c124a38d: 89 d6 mov %edx,%esi
c124a38f: 53 push %ebx
c124a390: 31 db xor %ebx,%ebx
c124a392: 83 ec 03 sub $0x3,%esp
...
c124a3e3: 83 c4 03 add $0x3,%esp
c124a3e6: 5b pop %ebx
c124a3e7: 5e pop %esi
c124a3e8: 5f pop %edi
c124a3e9: 5d pop %ebp
c124a3ea: c3 ret
If an interrupt occurs in such a function, the regs on the stack will be
unaligned, which breaks the frame pointer encoding assumption. So on
32-bit, use the MSB instead of the LSB to encode the regs.
This isn't an issue on 64-bit, because interrupts align the stack before
writing to it.
Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: LKP <lkp@01.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/279a26996a482ca716605c7dbc7f2db9d8d91e81.1507597785.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Tetsuo Handa and Fengguang Wu reported a panic in the unwinder:
BUG: unable to handle kernel NULL pointer dereference at 000001f2
IP: update_stack_state+0xd4/0x340
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 18728 Comm: 01-cpu-hotplug Not tainted 4.13.0-rc4-00170-gb09be67 #592
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
task: bb0b53c0 task.stack: bb3ac000
EIP: update_stack_state+0xd4/0x340
EFLAGS: 00010002 CPU: 0
EAX: 0000a570 EBX: bb3adccb ECX: 0000f401 EDX: 0000a570
ESI: 00000001 EDI: 000001ba EBP: bb3adc6b ESP: bb3adc3f
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 80050033 CR2: 000001f2 CR3: 0b3a7000 CR4: 00140690
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: fffe0ff0 DR7: 00000400
Call Trace:
? unwind_next_frame+0xea/0x400
? __unwind_start+0xf5/0x180
? __save_stack_trace+0x81/0x160
? save_stack_trace+0x20/0x30
? __lock_acquire+0xfa5/0x12f0
? lock_acquire+0x1c2/0x230
? tick_periodic+0x3a/0xf0
? _raw_spin_lock+0x42/0x50
? tick_periodic+0x3a/0xf0
? tick_periodic+0x3a/0xf0
? debug_smp_processor_id+0x12/0x20
? tick_handle_periodic+0x23/0xc0
? local_apic_timer_interrupt+0x63/0x70
? smp_trace_apic_timer_interrupt+0x235/0x6a0
? trace_apic_timer_interrupt+0x37/0x3c
? strrchr+0x23/0x50
Code: 0f 95 c1 89 c7 89 45 e4 0f b6 c1 89 c6 89 45 dc 8b 04 85 98 cb 74 bc 88 4d e3 89 45 f0 83 c0 01 84 c9 89 04 b5 98 cb 74 bc 74 3b <8b> 47 38 8b 57 34 c6 43 1d 01 25 00 00 02 00 83 e2 03 09 d0 83
EIP: update_stack_state+0xd4/0x340 SS:ESP: 0068:bb3adc3f
CR2: 00000000000001f2
---[ end trace 0d147fd4aba8ff50 ]---
Kernel panic - not syncing: Fatal exception in interrupt
On x86-32, after decoding a frame pointer to get a regs address,
regs_size() dereferences the regs pointer when it checks regs->cs to see
if the regs are user mode. This is dangerous because it's possible that
what looks like a decoded frame pointer is actually a corrupt value, and
we don't want the unwinder to make things worse.
Instead of calling regs_size() on an unsafe pointer, just assume they're
kernel regs to start with. Later, once it's safe to access the regs, we
can do the user mode check and corresponding safety check for the
remaining two regs.
Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: LKP <lkp@01.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 5ed8d8bb38c5 ("x86/unwind: Move common code into update_stack_state()")
Link: http://lkml.kernel.org/r/7f95b9a6993dec7674b3f3ab3dcd3294f7b9644d.1507597785.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
It turns out that not all paths calling arch_update_cpu_topology() hold
cpu_hotplug_lock, but that's OK because those paths can't race with
any concurrent hotplug events.
Warnings were reported with the following trace:
lockdep_assert_cpus_held
arch_update_cpu_topology
sched_init_domains
sched_init_smp
kernel_init_freeable
kernel_init
ret_from_kernel_thread
Which is safe because it's called early in boot when hotplug is not
live yet.
And also this trace:
lockdep_assert_cpus_held
arch_update_cpu_topology
partition_sched_domains
cpuset_update_active_cpus
sched_cpu_deactivate
cpuhp_invoke_callback
cpuhp_down_callbacks
cpuhp_thread_fun
smpboot_thread_fn
kthread
ret_from_kernel_thread
Which is safe because it's called as part of CPU hotplug, so although
we don't hold the CPU hotplug lock, there is another thread driving
the CPU hotplug operation which does hold the lock, and there is no
race.
Thanks to tglx for deciphering it for us.
Fixes: 3e401f7a2e51 ("powerpc: Only obtain cpu_hotplug_lock if called by rtasd")
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
According to the GCC documentation, the behaviour of __builtin_clz()
and __builtin_clzl() is undefined if the value of the input argument
is zero. Without handling this special case, these builtins have been
used for emulating the following instructions:
* Count Leading Zeros Word (cntlzw[.])
* Count Leading Zeros Doubleword (cntlzd[.])
This fixes the emulated behaviour of these instructions by adding an
additional check for this special case.
Fixes: 3cdfcbfd32b9d ("powerpc: Change analyse_instr so it doesn't modify *regs")
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
When waiting for transfer completion, a3700_spi_wait_completion
returns a boolean indicating if a timeout occurred.
The function was returning 'true' everytime, failing to detect any
timeout.
This patch makes it return 'false' when a timeout is reached.
Signed-off-by: Maxime Chevallier <maxime.chevallier@smile.fr>
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org
|
|
While load_balance() masks the source CPUs against active_mask, it had
a hole against the destination CPU. Ensure the destination CPU is also
part of the 'domain-mask & active-mask' set.
Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The trivial wake_affine_idle() implementation is very good for a
number of workloads, but it comes apart at the moment there are no
idle CPUs left, IOW. the overloaded case.
hackbench:
NO_WA_WEIGHT WA_WEIGHT
hackbench-20 : 7.362717561 seconds 6.450509391 seconds
(win)
netperf:
NO_WA_WEIGHT WA_WEIGHT
TCP_SENDFILE-1 : Avg: 54524.6 Avg: 52224.3
TCP_SENDFILE-10 : Avg: 48185.2 Avg: 46504.3
TCP_SENDFILE-20 : Avg: 29031.2 Avg: 28610.3
TCP_SENDFILE-40 : Avg: 9819.72 Avg: 9253.12
TCP_SENDFILE-80 : Avg: 5355.3 Avg: 4687.4
TCP_STREAM-1 : Avg: 41448.3 Avg: 42254
TCP_STREAM-10 : Avg: 24123.2 Avg: 25847.9
TCP_STREAM-20 : Avg: 15834.5 Avg: 18374.4
TCP_STREAM-40 : Avg: 5583.91 Avg: 5599.57
TCP_STREAM-80 : Avg: 2329.66 Avg: 2726.41
TCP_RR-1 : Avg: 80473.5 Avg: 82638.8
TCP_RR-10 : Avg: 72660.5 Avg: 73265.1
TCP_RR-20 : Avg: 52607.1 Avg: 52634.5
TCP_RR-40 : Avg: 57199.2 Avg: 56302.3
TCP_RR-80 : Avg: 25330.3 Avg: 26867.9
UDP_RR-1 : Avg: 108266 Avg: 107844
UDP_RR-10 : Avg: 95480 Avg: 95245.2
UDP_RR-20 : Avg: 68770.8 Avg: 68673.7
UDP_RR-40 : Avg: 76231 Avg: 75419.1
UDP_RR-80 : Avg: 34578.3 Avg: 35639.1
UDP_STREAM-1 : Avg: 64684.3 Avg: 66606
UDP_STREAM-10 : Avg: 52701.2 Avg: 52959.5
UDP_STREAM-20 : Avg: 30376.4 Avg: 29704
UDP_STREAM-40 : Avg: 15685.8 Avg: 15266.5
UDP_STREAM-80 : Avg: 8415.13 Avg: 7388.97
(wins and losses)
sysbench:
NO_WA_WEIGHT WA_WEIGHT
sysbench-mysql-2 : 2135.17 per sec. 2142.51 per sec.
sysbench-mysql-5 : 4809.68 per sec. 4800.19 per sec.
sysbench-mysql-10 : 9158.59 per sec. 9157.05 per sec.
sysbench-mysql-20 : 14570.70 per sec. 14543.55 per sec.
sysbench-mysql-40 : 22130.56 per sec. 22184.82 per sec.
sysbench-mysql-80 : 20995.56 per sec. 21904.18 per sec.
sysbench-psql-2 : 1679.58 per sec. 1705.06 per sec.
sysbench-psql-5 : 3797.69 per sec. 3879.93 per sec.
sysbench-psql-10 : 7253.22 per sec. 7258.06 per sec.
sysbench-psql-20 : 11166.75 per sec. 11220.00 per sec.
sysbench-psql-40 : 17277.28 per sec. 17359.78 per sec.
sysbench-psql-80 : 17112.44 per sec. 17221.16 per sec.
(increase on the top end)
tbench:
NO_WA_WEIGHT
Throughput 685.211 MB/sec 2 clients 2 procs max_latency=0.123 ms
Throughput 1596.64 MB/sec 5 clients 5 procs max_latency=0.119 ms
Throughput 2985.47 MB/sec 10 clients 10 procs max_latency=0.262 ms
Throughput 4521.15 MB/sec 20 clients 20 procs max_latency=0.506 ms
Throughput 9438.1 MB/sec 40 clients 40 procs max_latency=2.052 ms
Throughput 8210.5 MB/sec 80 clients 80 procs max_latency=8.310 ms
WA_WEIGHT
Throughput 697.292 MB/sec 2 clients 2 procs max_latency=0.127 ms
Throughput 1596.48 MB/sec 5 clients 5 procs max_latency=0.080 ms
Throughput 2975.22 MB/sec 10 clients 10 procs max_latency=0.254 ms
Throughput 4575.14 MB/sec 20 clients 20 procs max_latency=0.502 ms
Throughput 9468.65 MB/sec 40 clients 40 procs max_latency=2.069 ms
Throughput 8631.73 MB/sec 80 clients 80 procs max_latency=8.605 ms
(increase on the top end)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Eric reported a sysbench regression against commit:
3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")
Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
against his v3.10 enterprise kernel.
PRE (current tip/master):
ivb-ep sysbench:
2: [30 secs] transactions: 64110 (2136.94 per sec.)
5: [30 secs] transactions: 143644 (4787.99 per sec.)
10: [30 secs] transactions: 274298 (9142.93 per sec.)
20: [30 secs] transactions: 418683 (13955.45 per sec.)
40: [30 secs] transactions: 320731 (10690.15 per sec.)
80: [30 secs] transactions: 355096 (11834.28 per sec.)
hsw-ex NAS:
OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01
OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89
OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93
lu.C.x_threads_144_run_1.log: Time in seconds = 434.68
lu.C.x_threads_144_run_2.log: Time in seconds = 405.36
lu.C.x_threads_144_run_3.log: Time in seconds = 433.83
POST (+patch):
ivb-ep sysbench:
2: [30 secs] transactions: 64494 (2149.75 per sec.)
5: [30 secs] transactions: 145114 (4836.99 per sec.)
10: [30 secs] transactions: 278311 (9276.69 per sec.)
20: [30 secs] transactions: 437169 (14571.60 per sec.)
40: [30 secs] transactions: 669837 (22326.73 per sec.)
80: [30 secs] transactions: 631739 (21055.88 per sec.)
hsw-ex NAS:
lu.C.x_threads_144_run_1.log: Time in seconds = 23.36
lu.C.x_threads_144_run_2.log: Time in seconds = 22.96
lu.C.x_threads_144_run_3.log: Time in seconds = 22.52
This patch takes out all the shiny wake_affine() stuff and goes back to
utter basics. Between the two CPUs involved with the wakeup (the CPU
doing the wakeup and the CPU we ran on previously) pick the CPU we can
run on _now_.
This restores much of the regressions against the older kernels,
but leaves some ground in the overloaded case. The default-enabled
WA_WEIGHT (which will be introduced in the next patch) is an attempt
to address the overloaded situation.
Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jinpuwang@gmail.com
Cc: vcaputo@pengaru.com
Fixes: 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Update cgroup time when an event is scheduled in by descendants.
Reviewed-and-tested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: brendan.d.gregg@gmail.com
Cc: yang_oliver@hotmail.com
Link: http://lkml.kernel.org/r/CALPjY3mkHiekRkRECzMi9G-bjUQOvOjVBAqxmWkTzc-g+0LwMg@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Since commit:
1fd7e4169954 ("perf/core: Remove perf_cpu_context::unique_pmu")
... when a PMU is unregistered then its associated ->pmu_cpu_context is
unconditionally freed. Whilst this is fine for dynamically allocated
context types (i.e. those registered using perf_invalid_context), this
causes a problem for sharing of static contexts such as
perf_{sw,hw}_context, which are used by multiple built-in PMUs and
effectively have a global lifetime.
Whilst testing the ARM SPE driver, which must use perf_sw_context to
support per-task AUX tracing, unregistering the driver as a result of a
module unload resulted in:
Unable to handle kernel NULL pointer dereference at virtual address 00000038
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in: [last unloaded: arm_spe_pmu]
PC is at ctx_resched+0x38/0xe8
LR is at perf_event_exec+0x20c/0x278
[...]
ctx_resched+0x38/0xe8
perf_event_exec+0x20c/0x278
setup_new_exec+0x88/0x118
load_elf_binary+0x26c/0x109c
search_binary_handler+0x90/0x298
do_execveat_common.isra.14+0x540/0x618
SyS_execve+0x38/0x48
since the software context has been freed and the ctx.pmu->pmu_disable_count
field has been set to NULL.
This patch fixes the problem by avoiding the freeing of static PMU contexts
altogether. Whilst the sharing of dynamic contexts is questionable, this
actually requires the caller to share their context pointer explicitly
and so the burden is on them to manage the object lifetime.
Reported-by: Kim Phillips <kim.phillips@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 1fd7e4169954 ("perf/core: Remove perf_cpu_context::unique_pmu")
Link: http://lkml.kernel.org/r/1507040450-7730-1-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The work-around for the expected failure is providing another failure :/
Only when CONFIG_PROVE_LOCKING=y do we increment unexpected_testcase_failures,
so only then do we need to decrement, otherwise we'll end up with a negative
number and that will again trigger a BUG (printout, not crash).
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: d82fed752942 ("locking/lockdep/selftests: Fix mixed read-write ABBA tests")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
There is some complication between check_prevs_add() and
check_prev_add() wrt. saving stack traces. The problem is that we want
to be frugal with saving stack traces, since it consumes static
resources.
We'll only know in check_prev_add() if we need the trace, but we can
call into it multiple times. So we want to do on-demand and re-use.
A further complication is that check_prev_add() can drop graph_lock
and mess with our static resources.
In any case, the current state; after commit:
ce07a9415f26 ("locking/lockdep: Make check_prev_add() able to handle external stack_trace")
is that we'll assume the trace contains valid data once
check_prev_add() returns '2'. However, as noted by Josh, this is
false, check_prev_add() can return '2' before having saved a trace,
this then result in the possibility of using uninitialized data.
Testing, as reported by Wu, shows a NULL deref.
So simplify.
Since the graph_lock() thing is a debug path that hasn't
really been used in a long while, take it out back and avoid the
head-ache.
Further initialize the stack_trace to a known 'empty' state; as long
as nr_entries == 0, nothing should deref entries. We can then use the
'entries == NULL' test for a valid trace / on-demand saving.
Analyzed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: ce07a9415f26 ("locking/lockdep: Make check_prev_add() able to handle external stack_trace")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|