Age | Commit message (Collapse) | Author |
|
Code that dereferences the struct net_device ip_ptr member must be
protected with an in_dev_get() / in_dev_put() pair. Hence insert
calls to these functions.
Fixes: commit 7b85627b9f02 ("IB/cma: IBoE (RoCE) IP-based GID addressing")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Roland Dreier <roland@purestorage.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
|
|
This patch series adds a driver for a paravirtual RDMA device. The
device is developed for VMware's Virtual Machines and allows existing RDMA
applications to continue to use existing Verbs API when deployed in VMs
on ESXi. We recently did a presentation in the OFA Workshop [1] regarding
this device.
Description and RDMA Support
============================
The virtual device is exposed as a dual function PCIe device. One part
is a virtual network device (VMXNet3) which provides networking properties
like MAC, IP addresses to the RDMA part of the device. The networking
properties are used to register GIDs required by RDMA applications to
communicate.
These patches add support and the all required infrastructure for
letting applications use such a device. We support the mandatory Verbs API as
well as the base memory management extensions (Local Inv, Send with Inv and
Fast Register Work Requests). We currently support both Reliable Connected
and Unreliable Datagram QPs but do not support Shared Receive Queues
(SRQs).
Also, we support the following types of Work Requests:
o Send/Receive (with or without Immediate Data)
o RDMA Write (with or without Immediate Data)
o RDMA Read
o Local Invalidate
o Send with Invalidate
o Fast Register Work Requests
This version only adds support for version 1 of RoCE. We will add RoCEv2
support in a future patch. We do support registration of both MAC-based
and IP-based GIDs. I have also created a git tree for our user-level driver
[2].
Testing
=======
We have tested this internally for various types of Guest OS - Red Hat,
Centos, Ubuntu 12.04/14.04/16.04, Oracle Enterprise Linux, SLES 12
using backported versions of this driver. The tests included several
runs of the performance tests (included with OFED), Intel MPI PingPong
benchmark on OpenMPI, krping for FRWRs. Mellanox has been kind enough
to test the backported version of the driver internally on their hardware
using a VMware provided ESX build. I have also applied and tested this
with Doug's k.o/for-4.9 branch (commit 5603910b). Note, that this patch
series should be applied all together. I split out the commits so that
it may be easier to review.
PVRDMA Resources
================
[1] OFA Workshop Presentation -
https://openfabrics.org/images/eventpresos/2016presentations/102parardma.pdf
[2] Libpvrdma User-level library -
http://git.openfabrics.org/?p=~aditr/libpvrdma.git;a=summary
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: George Zhang <georgezhang@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Reviewed-by: Bryan Tan <bryantan@vmware.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
|
|
|
|
|
|
The kernel side is #ifdef'd to this type, and the UAPI header
should use it directly. It has slightly different alignment
requirments from the usual user space version.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
rdma_consumer_reject_data() will return the private data pointer
and length if any is available.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Return true if the peer consumer application rejected the
connection attempt.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
rdma_reject_msg() returns a pointer to a string message associated with
the transport reject reason codes.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
and rename class version define to indicate SM rather than SMP or SMI
Signed-off-by: Hal Rosenstock <hal@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
An new uverbs command ib_uverbs_ex_modify_qp is added to support more QP
attributes. User driver should choose to call the legacy/extended API
based on input mask.
IB_USER_LAST_QP_ATTR_MASK is added to indicated the maximum bit position
which supports legacy ib_uverbs_modify_qp.
IB_USER_LEGACY_LAST_QP_ATTR_MASK indicates the maximum bit position
which supports ib_uverbs_ex_modify_qp, the value of this mask should be
updated if new mask is added later.
Along with this change, rate_limit is supported by the extended command,
user driver could use it to control packet packing.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Add new member rate_limit to ib_qp_attr which holds the packet pacing rate
in kbps, 0 means unlimited.
IB_QP_RATE_LIMIT is added to ib_attr_mask and could be used by RAW
QPs when changing QP state from RTR to RTS, RTS to RTS.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Enable mlx5 based hardware to report packet pacing capabilities
from kernel to user space. Packet pacing allows to limit the rate to any
number between the maximum and minimum, based on user settings.
The capabilities are exposed to user space through query_device by uhw.
The following capabilities are reported:
1. The maximum and minimum rate limit in kbps supported by packet pacing.
2. Bitmap showing which QP types are supported by packet pacing operation.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Resolving a MAC address for a given IP address in userspace is inefficient.
This patch lets mlx5 user driver using the kernel driver to resolve the mac
and get the answer in the private section of the response.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Add struct ib_udata to the signature of create_ah callback that is
implemented by IB device drivers. This allows HW drivers to return extra
data to the userspace library.
This patch prepares the ground for mlx5 driver to resolve destination
mac address for a given GID and return it to userspace.
This patch was previously submitted by Knut Omang as a part of the
patch set to support Oracle's Infiniband HCA (SIF).
Signed-off-by: Knut Omang <knut.omang@oracle.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
To make mlx5 user driver aware of whether kernel driver returns dmac
in user data response add a new flag that will be returned back to
user-space through alloc_ucontext.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
The function ib_resolve_eth_dmac() requires struct qp_attr * and
qp_attr_mask as parameters while the function might be useful to resolve
dmac for address handles. This patch changes the signature of the
function so it can be used in the flow of creating an address handle.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
For a tunneled packet which contains external and internal headers,
we refer to the external headers as "outer fields" and the internal
headers as "inner fields".
Example of a tunneled packet:
{ L2 | L3 | L4 | tunnel header | L2 | L3 | l4 | data }
| | | | | | |
{ outer fields }{ inner fields }
This patch introduces a new flag for flow steering rules
- IB_FLOW_SPEC_INNER - which specifies that the rule applies
to the inner fields, rather than to the outer fields of the packet.
Signed-off-by: Moses Reuben <mosesr@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Add ib_uverbs_flow_spec_tunnel to define the rule to match Vxlan,
the type, size, reserved fields are identical to rest of the protocols,
and are used to identify the spec.
The tunnel id is the vni value of the Vxlan protocol, and it is used
as part of the steering rule, it is limited by the mask.
The steering rule configured on the hardware does a match
according to vni and other protocols.
In the same way as rest of the protocols that we match,
the uniq field's of each protocol are represented on
the val and the mask.
Signed-off-by: Moses Reuben <mosesr@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Aligned the structure ib_flow_spec_type indentation,
after adding a new definition.
Signed-off-by: Moses Reuben <mosesr@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
In order to support tunneling, that can be used by the QP,
both struct ib_flow_spec_tunnel and struct ib_flow_tunnel_filter can be
used to more IP or UDP based tunneling protocols (e.g NVGRE, GRE, etc).
IB_FLOW_SPEC_VXLAN_TUNNEL type flow specification is added to use this
functionality and match specific Vxlan packets.
In similar to IPv6, we check overflow of the vni value by
comparing with the maximum size.
Signed-off-by: Moses Reuben <mosesr@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
CQE compressing reduces PCI overhead by coalescing and compressing
multiple CQEs into a single merged CQE. Successful compressing
improves message rate especially for small packet traffic.
CQE compressing is supported for all 64B CQE formats (with certain
limitations) generated by RQ/Responder or by SQ/Requestor.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
The capabilities include:
- Max number of compressed and aggregated CQEs in a single session,
while zero means unsupported.
- For Responder, there are two formats of mini CQE: mini CQE with Rx
hash and mini CQE with checksum. They're mutual exclusive.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
The capabilities whether hardware support multi packet WQE or not is
exposed to user space through query_device by uhw.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Multi packet WQE enables sending multiple fix sized packets
using a single WQE. The exposed field reports such HW support.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Add rvt_div_round_up_mtu() and rvt_div_mtu() routines to
do the computation based on the pmtu and the log_pmtu.
Change divides in qib, hfi1 to use the new inlines.
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Add a helper to release mr references held by
an swqe.
Reviewed-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
This is for use by client drivers to drive
send completions into a CQ.
A new exported table allows for the mapping
of ib_wr_opcode into a ib_wc_opcode.
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
For the received packets with payload less or equal 8DWS
RxDmaDataFifoRdUncErr is not reported. There is set RHF.EccErr
if the header is not suppressed. When such packet is detected
on the send side the header suppression mechanism is disabled
by clearing SH bit in the packet header.
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
The VMXNet3 PCI Id will be shared with our paravirtual RDMA driver.
Moved it to the shared location in pci_ids.h.
Suggested-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
This patch moves HNS vendor's specific structures to
common UAPI folder which will be visible to all consumers.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
When MAD arrives to the hypervisor, we need to identify which slave it
should be sent by destination GID. When L3 protocol is IPv4 the
GRH is replaced by an IPv4 header. This patch detects when IPv4 header
needs to be parsed instead of GRH.
Fixes: b6ffaeffaea4 ('mlx4: In RoCE allow guests to have multiple GIDS')
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Save a cacheline by having hot path calldowns together.
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Profiling shows that the key validation is susceptible
to cache line trading when accessing the lkey table.
Fix by separating out the read mostly fields from the write
fields. In addition the shift amount, which is function
of the lkey table size, is precomputed and stored with the
table pointer. Since both the shift and table pointer
are in the same read mostly cacheline, this saves a cache
line in this hot path.
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Currently when calling mlx5_add_flow_rule we accept
only one flow destination, this commit allows to pass
multiple destinations.
This change forces us to change the return structure to a more
flexible one. We introduce a flow handle (struct mlx5_flow_handle),
it holds internally the number for rules created and holds an array
where each cell points the to a flow rule.
From the consumers (of mlx5_add_flow_rule) point of view this
change is only cosmetic and requires only to change the type
of the returned value they store.
From the core point of view, we now need to use a loop when
allocating and deleting rules (e.g given to us a flow handler).
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
TSAR (stands for Transmit Scheduling ARbiter) is a hardware component
that is responsible for selecting the next entity to serve on the
transmit path.
The arbitration defines the QoS policy between the agents connected to
the TSAR.
The TSAR is a consist two main features:
1) BW Allocation between agents:
The TSAR implements a defecit weighted round robin between the agents.
Each agent attached to the TSAR is assigned with a weight and it is
awarded transmission tokens according to this weight.
2) Rate limer per agent:
Each agent attached to the TSAR is (optionally) assigned with a rate
limit.
TSAR will not allow scheduling for an agent exceeding its defined rate
limit.
In this patch we implement the API of manipulating the TSAR.
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
SRQ physical address structure field should be in big-endian format.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Update struct mlx5_ifc_xrqc_bits according to last specification
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix recent ACPICA regressions, an older PCI IRQ management
regression, and an incorrect return value of a function in the APEI
code.
Specifics:
- Fix three ACPICA issues related to the interpreter locking and
introduced by recent changes in that area (Lv Zheng).
- Fix a PCI IRQ management regression introduced during the 4.7 cycle
and related to the configuration of shared IRQs on systems with an
ISA bus (Sinan Kaya).
- Fix up a return value of one function in the APEI code (Punit
Agrawal)"
* tag 'acpi-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPICA: Dispatcher: Fix interpreter locking around acpi_ev_initialize_region()
ACPICA: Dispatcher: Fix an unbalanced lock exit path in acpi_ds_auto_serialize_method()
ACPICA: Dispatcher: Fix order issue of method termination
ACPI / APEI: Fix incorrect return value of ghes_proc()
ACPI/PCI: pci_link: Include PIRQ_PENALTY_PCI_USING for ISA IRQs
ACPI/PCI: pci_link: penalize SCI correctly
ACPI/PCI/IRQ: assign ISA IRQ directly during early boot stages
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc kernel fixes: a virtualization environment related fix, an uncore
PMU driver removal handling fix, a PowerPC fix and new events for
Knights Landing"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors
perf/powerpc: Don't call perf_event_disable() from atomic context
perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
perf/x86/intel/cstate: Add C-state residency events for Knights Landing
|
|
git://people.freedesktop.org/~airlied/linux
Pull drm x86/pat regression fixes from Dave Airlie:
"This is a standalone pull request for the fix for a regression
introduced in -rc1 by a change to vm_insert_mixed to start using the
PAT range tracking to validate page protections. With this fix in
place, all the VRAM mappings for GPU drivers ended up at UC instead of
WC.
There are probably better ways to fix this long term, but nothing I'd
considered for -fixes that wouldn't need more settling in time. So
I've just created a new arch API that the drivers can reserve all
their VRAM aperture ranges as WC"
* tag 'drm-x86-pat-regression-fix' of git://people.freedesktop.org/~airlied/linux:
drm/drivers: add support for using the arch wc mapping API.
x86/io: add interface to reserve io memtype for a resource range. (v1.1)
|
|
The trinity syscall fuzzer triggered following WARN() on powerpc:
WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
...
NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
Call Trace:
[c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
[c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
[c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
[c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
[c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
[c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
Followed by a lockdep warning:
===============================
[ INFO: suspicious RCU usage. ]
4.8.0-rc5+ #7 Tainted: G W
-------------------------------
./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by ls/2998:
#0: (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0
#1: (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0
stack backtrace:
CPU: 9 PID: 2998 Comm: ls Tainted: G W 4.8.0-rc5+ #7
Call Trace:
[c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
[c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
[c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
[c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
[c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
[c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
[c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
[c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
[c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
[c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
[c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
[c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
While it looks like the first WARN() is probably valid, the other one is
triggered by disabling event via perf_event_disable() from atomic context.
The event is disabled here in case we were not able to emulate
the instruction that hit the breakpoint. By disabling the event
we unschedule the event and make sure it's not scheduled back.
But we can't call perf_event_disable() from atomic context, instead
we need to use the event's pending_disable irq_work method to disable it.
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Merge misc fixes from Andrew Morton:
"20 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
drivers/misc/sgi-gru/grumain.c: remove bogus 0x prefix from printk
cris/arch-v32: cryptocop: print a hex number after a 0x prefix
ipack: print a hex number after a 0x prefix
block: DAC960: print a hex number after a 0x prefix
fs: exofs: print a hex number after a 0x prefix
lib/genalloc.c: start search from start of chunk
mm: memcontrol: do not recurse in direct reclaim
CREDITS: update credit information for Martin Kepplinger
proc: fix NULL dereference when reading /proc/<pid>/auxv
mm: kmemleak: ensure that the task stack is not freed during scanning
lib/stackdepot.c: bump stackdepot capacity from 16MB to 128MB
latent_entropy: raise CONFIG_FRAME_WARN by default
kconfig.h: remove config_enabled() macro
ipc: account for kmem usage on mqueue and msg
mm/slab: improve performance of gathering slabinfo stats
mm: page_alloc: use KERN_CONT where appropriate
mm/list_lru.c: avoid error-path NULL pointer deref
h8300: fix syscall restarting
kcov: properly check if we are in an interrupt
mm/slab: fix kmemcg cache creation delayed issue
|
|
The use of config_enabled() is ambiguous. For config options,
IS_ENABLED(), IS_REACHABLE(), etc. will make intention clearer.
Sometimes config_enabled() has been used for non-config options because
it is useful to check whether the given symbol is defined or not.
I have been tackling on deprecating config_enabled(), and now is the
time to finish this work.
Some new users have appeared for v4.9-rc1, but it is trivial to replace
them:
- arch/x86/mm/kaslr.c
replace config_enabled() with IS_ENABLED() because
CONFIG_X86_ESPFIX64 and CONFIG_EFI are boolean.
- include/asm-generic/export.h
replace config_enabled() with __is_defined().
Then, config_enabled() can be removed now.
Going forward, please use IS_ENABLED(), IS_REACHABLE(), etc. for config
options, and __is_defined() for non-config symbols.
Link: http://lkml.kernel.org/r/1476616078-32252-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs fixes from Dave Chinner:
"This update contains fixes for most of the outstanding regressions
introduced with the 4.9-rc1 XFS merge. There is also a fix for an
iomap bug, too.
This is a quite a bit larger than I'd prefer for a -rc3, but most of
the change comes from cleaning up the new reflink copy on write code;
it's much simpler and easier to understand now. These changes fixed
several bugs in the new code, and it wasn't clear that there was an
easier/simpler way to fix them. The rest of the fixes are the usual
size you'd expect at this stage.
I've left the commits to soak in linux-next for a some extra time
because of the size before asking you to pull, no new problems with
them have been reported so I think it's all OK.
Summary:
- iomap page offset masking fix for page faults
- add IOMAP_REPORT to distinguish between read and fiemap map
requests
- cleanups to new shared data extent code
- fix mount active status on failed log recovery
- fix broken dquots in a buffer calculation
- fix locking order issues and merge xfs_reflink_remap_range and
xfs_file_share_range
- rework unmapping of CoW extents and remove now unused functions
- clean state when CoW is done"
* tag 'xfs-fixes-for-linus-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (25 commits)
xfs: clear cowblocks tag when cow fork is emptied
xfs: fix up inode cowblocks tracking tracepoints
fs: Do to trim high file position bits in iomap_page_mkwrite_actor
xfs: remove xfs_bunmapi_cow
xfs: optimize xfs_reflink_end_cow
xfs: optimize xfs_reflink_cancel_cow_blocks
xfs: refactor xfs_bunmapi_cow
xfs: optimize writes to reflink files
xfs: don't bother looking at the refcount tree for reads
xfs: handle "raw" delayed extents xfs_reflink_trim_around_shared
xfs: add xfs_trim_extent
iomap: add IOMAP_REPORT
xfs: merge xfs_reflink_remap_range and xfs_file_share_range
xfs: remove xfs_file_wait_for_io
xfs: move inode locking from xfs_reflink_remap_range to xfs_file_share_range
xfs: fix the same_inode check in xfs_file_share_range
xfs: remove the same fs check from xfs_file_share_range
libxfs: v3 inodes are only valid on crc-enabled filesystems
libxfs: clean up _calc_dquots_per_chunk
xfs: unset MS_ACTIVE if mount fails
...
|
|
The per-zone waitqueues exist because of a scalability issue with the
page waitqueues on some NUMA machines, but it turns out that they hurt
normal loads, and now with the vmalloced stacks they also end up
breaking gfs2 that uses a bit_wait on a stack object:
wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
where 'gh' can be a reference to the local variable 'mount_gh' on the
stack of fill_super().
The reason the per-zone hash table breaks for this case is that there is
no "zone" for virtual allocations, and trying to look up the physical
page to get at it will fail (with a BUG_ON()).
It turns out that I actually complained to the mm people about the
per-zone hash table for another reason just a month ago: the zone lookup
also hurts the regular use of "unlock_page()" a lot, because the zone
lookup ends up forcing several unnecessary cache misses and generates
horrible code.
As part of that earlier discussion, we had a much better solution for
the NUMA scalability issue - by just making the page lock have a
separate contention bit, the waitqueue doesn't even have to be looked at
for the normal case.
Peter Zijlstra already has a patch for that, but let's see if anybody
even notices. In the meantime, let's fix the actual gfs2 breakage by
simplifying the bitlock waitqueues and removing the per-zone issue.
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
A recent change to the mm code in:
87744ab3832b mm: fix cache mode tracking in vm_insert_mixed()
started enforcing checking the memory type against the registered list for
amixed pfn insertion mappings. It happens that the drm drivers for a number
of gpus relied on this being broken. Currently the driver only inserted
VRAM mappings into the tracking table when they came from the kernel,
and userspace mappings never landed in the table. This led to a regression
where all the mapping end up as UC instead of WC now.
I've considered a number of solutions but since this needs to be fixed
in fixes and not next, and some of the solutions were going to introduce
overhead that hadn't been there before I didn't consider them viable at
this stage. These mainly concerned hooking into the TTM io reserve APIs,
but these API have a bunch of fast paths I didn't want to unwind to add
this to.
The solution I've decided on is to add a new API like the arch_phys_wc
APIs (these would have worked but wc_del didn't take a range), and
use them from the drivers to add a WC compatible mapping to the table
for all VRAM on those GPUs. This means we can then create userspace
mapping that won't get degraded to UC.
v1.1: use CONFIG_X86_PAT + add some comments in io.h
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: x86@kernel.org
Cc: mcgrof@suse.com
Cc: Dan Williams <dan.j.williams@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"This is the first batch of clk driver fixes for this release.
We have a handful of fixes for the uniphier clk driver that was
introduced recently, as well as Kconfig option hiding, module
autoloading markings, and a few fixes for clk_hw based registration
patches that went in this merge window"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: at91: Fix a return value in case of error
clk: uniphier: rename MIO clock to SD clock for Pro5, PXs2, LD20 SoCs
clk: uniphier: fix memory overrun bug
clk: hi6220: use CLK_OF_DECLARE_DRIVER for sysctrl and mediactrl clock init
clk: mvebu: armada-37xx-periph: Fix the clock gate flag
clk: bcm2835: Clamp the PLL's requested rate to the hardware limits.
clk: max77686: fix number of clocks setup for clk_hw based registration
clk: mvebu: armada-37xx-periph: Fix the clock provider registration
clk: core: add __init decoration for CLK_OF_DECLARE_DRIVER function
clk: mediatek: Add hardware dependency
clk: samsung: clk-exynos-audss: Fix module autoload
clk: uniphier: fix type of variable passed to regmap_read()
clk: uniphier: add system clock support for sLD3 SoC
|
|
This patch unexports the low-level __get_user_pages() function.
Recent refactoring of the get_user_pages* functions allow flags to be
passed through get_user_pages() which eliminates the need for access to
this function from its one user, kvm.
We can see that the two calls to get_user_pages() which replace
__get_user_pages() in kvm_main.c are equivalent by examining their call
stacks:
get_user_page_nowait():
get_user_pages(start, 1, flags, page, NULL)
__get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
false, flags | FOLL_TOUCH)
__get_user_pages(current, current->mm, start, 1,
flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)
check_user_page_hwpoison():
get_user_pages(addr, 1, flags, NULL, NULL)
__get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
false, flags | FOLL_TOUCH)
__get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
NULL, NULL)
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|