Age | Commit message (Collapse) | Author |
|
In i40e_vsi_add_vlan we treat attempting to add VID=0 as an error,
because it does not do what the caller might expect. We already special
case VID=0 in i40e_vlan_rx_add_vid so that we avoid this error when
adding the VLAN.
This special casing is necessary so that we do not add the VLAN=0 filter
since we don't want to stop receiving untagged traffic. Unfortunately,
not all callers of i40e_vsi_add_vlan are aware of this, including when
we add VLANs from a VF device.
Rather than special casing every single caller of i40e_vsi_add_vlan,
lets just move this check internally. This makes the code simpler
because the caller does not need to be aware of how VLAN=0 is special,
and we don't forget to add this check in new places.
This fixes a harmless error message displaying when adding a VLAN from
within a VF. The message was meaningless but there is no reason to
confuse end users and system administrators, and this is now avoided.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
When a user gives an invalid command to change a private flag which is
not supported, either because it is read-only, or the device is not
capable of the feature, we simply ignore the request.
A naive solution would simply be to report error codes when one of the
flags was not supported. However, this causes problems because it makes
the operation not atomic. If a user requests multiple private flags
together at once we could end up changing one before failing at the
second flag.
We can do a bit better if we instead update a temporary copy of the
flags variable in the loop, and then copy it into place after. If we
aren't careful this has the pitfall of potentially silently overwriting
any changes caused by other threads.
Avoid this by using cmpxchg64 which will compare and swap the flags
variable only if it currently matched the old value. We'll report
-EAGAIN in the (hopefully rare!) case where the cmpxchg64 fails.
This ensures that we can properly report when flags are not supported in
an atomic fashion without the risk of overwriting other threads changes.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
This patch fixes a problem with the HW ATR eviction feature where the
NVM setting was incorrect. This patch detects the issue on X720
adapters and disables the feature if the NVM setting is incorrect.
Without this patch, HW ATR Evict feature does not work on broken NVMs
and is not detected either. If the HW ATR Evict feature is disabled
the SW Eviction feature will take effect.
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Alice Michael <alice.michael@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Since commit b499ffb0a22c ("i40e: Look up MAC address in Open Firmware
or IDPROM"), we've had support for obtaining the MAC address
form Open Firmware or IDPROM.
This code relied on sending the Open Firmware address directly to the
device firmware instead of relying on our MAC/VLAN filter list. Thus,
a work around was introduced in commit b1b15df59232 ("i40e: Explicitly
write platform-specific mac address after PF reset")
We refactored the Open Firmware address enablement code in the ill-named
commit 41c4c2b50d52 ("i40e: allow look-up of MAC address from Open
Firmware or IDPROM")
Since this refactor, we no longer even set I40E_FLAG_PF_MAC. Further, we
don't need this work around, because we actually store the MAC address
as part of the MAC/VLAN filter hash. Thus, we will restore the address
correctly upon reset.
The refactor above failed to revert the workaround, so do that now.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
The number of flags found in pf->flags has grown quite large, and there
are a lot of different types of flags. Most of the flags are simply
hardware features which are enabled on some firmware or some MAC types.
Other flags are dynamic run-time flags which enable or disable certain
features of the driver.
Separate these two types of flags into pf->hw_features and pf->flags.
The hw_features list will contain a set of features which are enabled at
init time. This will not contain toggles or otherwise dynamically
changing features. These flags should not need atomic protections, as
they will be set once during init and then be essentially read only.
Everything else will remain in the flags variable. These flags may be
modified at any time during run time. A future patch may wish to convert
these flags into set_bit/clear_bit/test_bit or similar approach to
ensure atomic correctness.
The I40E_FLAG_MFP_ENABLED flag may be a good fit for hw_features but
currently is used by ethtool in the private flags settings, and thus has
been left as part of flags.
Additionally, I40E_FLAG_DCB_CAPABLE may be a good fit for the
hw_features but this patch has not tried to untangle it yet.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
The X722 pf flag setup should happen before the VMDq RSS queue count is
initialized for VMDq VSI to get the right number of queues for RSS in
case of X722 devices.
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Alice Michael <alice.michael@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Currently i40evf_close() can return before state transitions to
__I40EVF_DOWN because of the latency involved in processing and
receiving response from PF driver and scheduling of VF watchdog_task.
Due to this inconsistency an immediate call to i40evf_open() fails
because state is still DOWN_PENDING.
When a VF interface is in up state and we try to add it as slave,
The bonding driver calls dev_close() and dev_open() in short duration
resulting in dev_open returning error. The ifenslave command needs
to be run again for dev_open to succeed.
This fix ensures that watchdog timer is scheduled immediately after
admin queue operations are scheduled in i40evf_down(). In addition a
wait condition is added at the end of i40evf_close so that function
wont return when state is still DOWN_PENDING. The timeout value is
chosen after some profiling and includes some buffer.
Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Now that the kernel supports double VLAN tags, we should at least play
nice. Adjust the max packet size to account for two VLAN tags, not just
one.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
There is code duplicated over all architecture's headers for
futex_atomic_op_inuser. Namely op decoding, access_ok check for uaddr,
and comparison of the result.
Remove this duplication and leave up to the arches only the needed
assembly which is now in arch_futex_atomic_op_inuser.
This effectively distributes the Will Deacon's arm64 fix for undefined
behaviour reported by UBSAN to all architectures. The fix was done in
commit 5f16a046f8e1 (arm64: futex: Fix undefined behaviour with
FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump.
And as suggested by Thomas, check for negative oparg too, because it was
also reported to cause undefined behaviour report.
Note that s390 removed access_ok check in d12a29703 ("s390/uaccess:
remove pointless access_ok() checks") as access_ok there returns true.
We introduce it back to the helper for the sake of simplicity (it gets
optimized away anyway).
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390]
Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>
Reviewed-by: Will Deacon <will.deacon@arm.com> [core/arm64]
Cc: linux-mips@linux-mips.org
Cc: Rich Felker <dalias@libc.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: peterz@infradead.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: sparclinux@vger.kernel.org
Cc: Jonas Bonn <jonas@southpole.se>
Cc: linux-s390@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: linux-hexagon@vger.kernel.org
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: linux-snps-arc@lists.infradead.org
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-xtensa@linux-xtensa.org
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: openrisc@lists.librecores.org
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Stafford Horne <shorne@gmail.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Richard Henderson <rth@twiddle.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-parisc@vger.kernel.org
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: linux-alpha@vger.kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20170824073105.3901-1-jslaby@suse.cz
|
|
kernel/irq/proc.c: In function ‘show_irq_affinity’:
include/linux/cpumask.h:24:29: warning: ‘mask’ may be used uninitialized in this function [-Wmaybe-uninitialized]
#define cpumask_bits(maskp) ((maskp)->bits)
gcc is silly, but admittedly it can't know that this won't be called with
anything else than the enumerated constants.
Shut up the warning by creating a default clause.
Fixes: 6bc6d4abd22e ("genirq/proc: Use the the accessor to report the effective affinity
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
This code generates a Smatch warning:
kernel/irq/irqdomain.c:1511 irq_domain_push_irq()
warn: variable dereferenced before check 'root_irq_data' (see line 1508)
irq_get_irq_data() can return a NULL pointer, but the code dereferences
the returned pointer before checking it.
Move the NULL pointer check before the dereference.
[ tglx: Rewrote changelog to be precise and conforming to the instructions
in submitting-patches and added a Fixes tag. Sigh! ]
Fixes: 495c38d3001f ("irqdomain: Add irq_domain_{push,pop}_irq() functions")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Daney <david.daney@cavium.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: kernel-janitors@vger.kernel.org
Link: http://lkml.kernel.org/r/20170825121409.6rfv4vt6ztz2oqkt@mwanda
|
|
kernel/irq/proc.c:69:2-3: Unneeded semicolon
Remove unneeded semicolon.
Generated by: scripts/coccinelle/misc/semicolon.cocci
Fixes: 0d3f54257dc3 ("genirq: Introduce effective affinity mask")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: kbuild-all@01.org
Link: http://lkml.kernel.org/r/20170822075053.GA93890@lkp-hsx02
|
|
Errata list is included in this document:
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/6th-gen-x-series-spec-update.pdf
with more details in:
https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html
But the tl;dr summary (using tags from first of the documents) is:
SKZ4 MBM does not accurately track write bandwidth
SKZ17 CMT counters may not count accurately
SKZ18 CAT may not restrict cacheline allocation under certain conditions
SKZ19 MBM counters may undercount
Disable all these features on Skylake models. Users who understand the
errata may re-enable using boot command line options.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Fenghua" <fenghua.yu@intel.com>
Cc: Ravi V" <ravi.v.shankar@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Andi Kleen" <ak@linux.intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: http://lkml.kernel.org/r/3aea0a3bae219062c812668bd9b7b8f1a25003ba.1503512900.git.tony.luck@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Command line options allow us to ignore features that we don't want.
Also we can re-enable options that have been disabled on a platform
(so long as the underlying h/w actually supports the option).
[ tglx: Marked the option array __initdata and the helper function __init ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Fenghua" <fenghua.yu@intel.com>
Cc: Ravi V" <ravi.v.shankar@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Andi Kleen" <ak@linux.intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: http://lkml.kernel.org/r/0c37b0d4dbc30977a3c1cee08b66420f83662694.1503512900.git.tony.luck@intel.com
|
|
No functional change, but lay the ground work for other per-model
quirks.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Fenghua" <fenghua.yu@intel.com>
Cc: Ravi V" <ravi.v.shankar@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Andi Kleen" <ak@linux.intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: http://lkml.kernel.org/r/f195a83751b5f8b1d8a78bd3c1914300c8fa3142.1503512900.git.tony.luck@intel.com
|
|
In xen_blkif_disconnect, before checking inflight I/O, following code
stops the blkback thread,
if (ring->xenblkd) {
kthread_stop(ring->xenblkd);
wake_up(&ring->shutdown_wq);
}
If there is inflight I/O in any non-last queue, blkback returns -EBUSY
directly, and above code would not be called to stop thread of remaining
queue and processs them. When removing vbd device with lots of disk I/O
load, some queues with inflight I/O still have blkback thread running even
though the corresponding vbd device or guest is gone.
And this could cause some problems, for example, if the backend device type
is file, some loop devices and blkback thread always lingers there forever
after guest is destroyed, and this causes failure of umounting repositories
unless rebooting the dom0.
This patch allows thread of every queue has the chance to get stopped.
Otherwise, only thread of queue previous to(including) first busy one get
stopped, blkthread of remaining queue will still run. So stop all threads
properly and return -EBUSY if any queue has inflight I/O.
Signed-off-by: Annie Li <annie.li@oracle.com>
Reviewed-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Adnan Misherfi <adnan.misherfi@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
Commit 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for
virtqueues"") removed the adjustment of the pre_vectors for the virtio
MSI-X vector allocation which was added in commit fb5e31d9 ("virtio:
allow drivers to request IRQ affinity when creating VQs"). This will
lead to an incorrect assignment of MSI-X vectors, and potential
deadlocks when offlining cpus.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fixes: 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for virtqueues")
Reported-by: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
The message printed on disk resize is incorrect. The following is
printed when resizing to 2 GiB:
$ truncate -s 1G test.img
$ qemu -device virtio-blk-pci,logical_block_size=4096,...
(qemu) block_resize drive1 2G
virtio_blk virtio0: new size: 4194304 4096-byte logical blocks (17.2 GB/16.0 GiB)
The virtio_blk capacity config field is in 512-byte sector units
regardless of logical_block_size as per the VIRTIO specification.
Therefore the message should read:
virtio_blk virtio0: new size: 524288 4096-byte logical blocks (2.15 GB/2.0 GiB)
Note that this only affects the printed message. Thankfully the actual
block device has the correct size because the block layer expects
capacity in sectors.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Make this const as it is only stored as a reference in a const field of
a net_device structure.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Acked-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Enlarge sd_fsname to be big enough for the longest long lock table name
and an arbitrary journal number. This silences two -Wformat-truncation
warnings with gcc 7.1.1.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Before this patch, if GFS2 encountered IO errors while writing to
the journal, it would not report the problem, so they would go
unnoticed, sometimes for many hours. Sometimes this would only be
noticed later, when recovery tried to do journal replay and failed
due to invalid metadata at the blocks that resulted in IO errors.
This patch makes GFS2's log daemon check for IO errors. If it
encounters one, it withdraws from the file system and reports
why in dmesg. A similar action is taken when IO errors occur when
writing to the system statfs file.
These errors are also reported back to any callers of fsync, since
that requires the journal to be flushed. Therefore, any IO errors
that would previously go unnoticed are now noticed and the file
system is withdrawn as early as possible, thus preventing further
file system damage.
Also note that this reintroduces superblock variable sd_log_error,
which Christoph removed with commit f729b66fca.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
On some devices (TH 2.x devices at the moment), the internal time counter
is initially not synchronized to the global crystal clock, so the time
stamps it produces will not be useful. In this case, the driver needs
to force the time counter resync.
This applies the workaround to relevant devices.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
A glue layer may want to install its own hooks into trace capture start
and stop paths to apply workarounds. This adds optional callbacks.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
Allow attaching miscellaneous quirk information to devices as drvdata.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
This adds Intel(R) Trace Hub PCI ID for Cannon Lake PCH-LP.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: <stable@vger.kernel.org>
|
|
This adds Intel(R) Trace Hub PCI ID for Cannon Lake PCH-H.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: <stable@vger.kernel.org>
|
|
The Low Power Path (LPP) output port type, looks mostly like PTI to
the software, with a few additional bits in the control register.
This extends the PTI driver to support LPP ports as well.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
Trace Hub 2.x adds Low Power Path (LPP) output port type, which provides
a low power mode trace path from sources to PTI or BSSB.
This adds an output subdevice for the LPP port.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
When allocating DMA buffers for the MSU, use the real device instead
of GTH.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
Instead of allocating devices for every possible output subdevice,
allow the switch to allocate only the ones that it knows about.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
The switch (GTH) does not directly interact with SOURCE type devices and
may not even be present (in host mode). To reflect this and avoid
inconsistencies between target and host mode, make SOURCE devices
descendant directly from the root (i.e. PCI) device. Their symlinks
will no longer appear under the switch device, but they can still
be found under intel_th bus.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
Make to_intel_th*() accessors available from the main header file.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
Output subdevices that rely on other output subdevices (or otherwise
don't directly talk to an output port on the switch) don't need to be
assigned an output port either.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
The driver forgets to enable bus mastering for the PCI device.
Fix this.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
This patch adds a description to the stm_ftrace device source, an
interface for collecting Ftrace's function trace information via
STM devices.
Signed-off-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
The "size" variable comes from the user so we need to verify that it's
large enough to hold an stp_policy_id struct.
Fixes: 7bd1d4093c2f ("stm class: Introduce an abstraction for System Trace Module devices")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
|
|
The symbolic constants QUEUE_FLAG_SCSI_PASSTHROUGH, QUEUE_FLAG_QUIESCED
and REQ_NOWAIT are missing from blk-mq-debugfs.c. Add these to
blk-mq-debugfs.c such that these appear as names in debugfs instead of
as numbers.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently we unconditionally destroy all sysctl bits and regenerate
them after we've rebuild the domains (even if that rebuild is a
no-op).
And since we unconditionally (re)build the sysctl for all possible
CPUs, onlining all CPUs gets us O(n^2) time. Instead change this to
only rebuild the bits for CPUs we've actually installed new domains
on.
Reported-by: Ofer Levi(SW) <oferle@mellanox.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Fix partition_sched_domains() to try and preserve the existing machine
wide domain instead of unconditionally destroying it. We do this by
attempting to allocate the new single domain, only when that fails to
we reuse the fallback_doms.
When using fallback_doms we need to first destroy and then recreate
because both the old and new could be backed by it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ofer Levi(SW) <oferle@mellanox.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When disabling cpuset.sched_load_balance we expect to be able to online
CPUs without generating sched_domains. However this is currently
completely broken.
What happens is that we generate the sched_domains and then destroy
them. This is because of the spurious 'default' domain build in
cpuset_update_active_cpus(). That builds a single machine wide domain
and then schedules a work to build the 'real' domains. The work then
finds there are _no_ domains and destroys the lot again.
Furthermore, if there actually were cpusets, building the machine wide
domain is actively wrong, because it would allow tasks to 'escape' their
cpuset. Also I don't think its needed, the scheduler really should
respect the active mask.
Reported-by: Ofer Levi(SW) <oferle@mellanox.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Mike provided a better comment for destroy_sched_domain() ...
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Found this issue by kmemleak: the 'sg' and 'sgc' pointers from
__sdt_alloc() might be leaked as each domain holds many groups' ref,
but in destroy_sched_domain(), it only declined the first group ref.
Onlining and offlining a CPU can trigger this leak, and cause OOM.
Reproducer for my 6 CPUs machine:
while true
do
echo 0 > /sys/devices/system/cpu/cpu5/online;
echo 1 > /sys/devices/system/cpu/cpu5/online;
done
unreferenced object 0xffff88007d772a80 (size 64):
comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s)
hex dump (first 32 bytes):
c0 22 77 7d 00 88 ff ff 02 00 00 00 01 00 00 00 ."w}............
40 2a 77 7d 00 88 ff ff 00 00 00 00 00 00 00 00 @*w}............
backtrace:
[<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0
[<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280
[<ffffffff810d94a8>] build_sched_domains+0x1e8/0xf20
[<ffffffff810da674>] partition_sched_domains+0x304/0x360
[<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40
[<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0
[<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400
[<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0
[<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0
[<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160
[<ffffffff810af5d9>] kthread+0x109/0x140
[<ffffffff81770e45>] ret_from_fork+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff88007d772a40 (size 64):
comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s)
hex dump (first 32 bytes):
03 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 ................
00 04 00 00 00 00 00 00 4f 3c fc ff 00 00 00 00 ........O<......
backtrace:
[<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0
[<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280
[<ffffffff810da16d>] build_sched_domains+0xead/0xf20
[<ffffffff810da674>] partition_sched_domains+0x304/0x360
[<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40
[<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0
[<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400
[<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0
[<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0
[<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160
[<ffffffff810af5d9>] kthread+0x109/0x140
[<ffffffff81770e45>] ret_from_fork+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
Reported-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Shu Wang <shuwang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Chunyu Hu <chuhu@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: liwang@redhat.com
Link: http://lkml.kernel.org/r/1502351536-9108-1-git-send-email-shuwang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Nixiaoming pointed out that there is a memory leak in
kvm_vm_ioctl_create_spapr_tce() if the call to anon_inode_getfd()
fails; the memory allocated for the kvmppc_spapr_tce_table struct
is not freed, and nor are the pages allocated for the iommu
tables. In addition, we have already incremented the process's
count of locked memory pages, and this doesn't get restored on
error.
David Hildenbrand pointed out that there is a race in that the
function checks early on that there is not already an entry in the
stt->iommu_tables list with the same LIOBN, but an entry with the
same LIOBN could get added between then and when the new entry is
added to the list.
This fixes all three problems. To simplify things, we now call
anon_inode_getfd() before placing the new entry in the list. The
check for an existing entry is done while holding the kvm->lock
mutex, immediately before adding the new entry to the list.
Finally, on failure we now call kvmppc_account_memlimit to
decrement the process's count of locked memory pages.
Reported-by: Nixiaoming <nixiaoming@huawei.com>
Reported-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
All other fields use __
Cc: Ben Widawsky <ben@bwidawsk.net>
Fixes: db1689aa61b ("drm: Create a format/modifier blob")
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: Daniel Stone <daniels@collabora.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Emil Velikov <emil.l.velikov@gmail.com>
Reviewed-by: Ben Widawsky <ben@bwidawsk.net>
Reviewed-by: Daniel Stone <daniels@collabora.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20170824150814.5878-1-lionel.g.landwerlin@intel.com
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Julia reported that the document looked unfinished, and it is. I
forgot to include the example cooked up by Paul here:
https://lkml.kernel.org/r/20170731174345.GL3730@linux.vnet.ibm.com
and I added an explicit example showing how, while it is an ACQUIRE
pattern, it really does provide an MB.
Reported-by: Julia Cartwright <julia@ni.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The new completion/crossrelease annotations interact unfavourable with
the extant flush_work()/flush_workqueue() annotations.
The problem is that when a single work class does:
wait_for_completion(&C)
and
complete(&C)
in different executions, we'll build dependencies like:
lock_map_acquire(W)
complete_acquire(C)
and
lock_map_acquire(W)
complete_release(C)
which results in the dependency chain: W->C->W, which lockdep thinks
spells deadlock, even though there is no deadlock potential since
works are ran concurrently.
One possibility would be to change the work 'lock' to recursive-read,
but that would mean hitting a lockdep limitation on recursive locks.
Also, unconditinoally switching to recursive-read here would fail to
detect the actual deadlock on single-threaded workqueues, which do
have a problem with this.
For now, forcefully disregard these locks for crossrelease.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: byungchul.park@lge.com
Cc: david@fromorbit.com
Cc: johannes@sipsolutions.net
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The flush_work() annotation as introduced by commit:
e159489baa71 ("workqueue: relax lockdep annotation on flush_work()")
hits on the lockdep problem with recursive read locks.
The situation as described is:
Work W1: Work W2: Task:
ARR(Q) ARR(Q) flush_workqueue(Q)
A(W1) A(W2) A(Q)
flush_work(W2) R(Q)
A(W2)
R(W2)
if (special)
A(Q)
else
ARR(Q)
R(Q)
where: A - acquire, ARR - acquire-read-recursive, R - release.
Where under 'special' conditions we want to trigger a lock recursion
deadlock, but otherwise allow the flush_work(). The allowing is done
by using recursive read locks (ARR), but lockdep is broken for
recursive stuff.
However, there appears to be no need to acquire the lock if we're not
'special', so if we remove the 'else' clause things become much
simpler and no longer need the recursion thing at all.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: byungchul.park@lge.com
Cc: david@fromorbit.com
Cc: johannes@sipsolutions.net
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Currently lockdep has limited support for recursive readers, add a few
mixed read-write ABBA selftests to show the extend of these
limitations.
[ 0.000000] ----------------------------------------------------------------------------
[ 0.000000] | spin |wlock |rlock |mutex | wsem | rsem |
[ 0.000000] --------------------------------------------------------------------------
[ 0.000000] mixed read-lock/lock-write ABBA: |FAILED| | ok |
[ 0.000000] mixed read-lock/lock-read ABBA: | ok | | ok |
[ 0.000000] mixed write-lock/lock-write ABBA: | ok | | ok |
This clearly illustrates the case where lockdep fails to find a
deadlock.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: byungchul.park@lge.com
Cc: david@fromorbit.com
Cc: johannes@sipsolutions.net
Cc: oleg@redhat.com
Cc: tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Better document the ordering around tlb_flush_pending().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|