summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-08-29perf time-utils: Adopt rdclock() from perf.hArnaldo Carvalho de Melo
Seems to be a better place for this function to live, further shrinking the hodge-podge that perf.h was. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-0zzt1u9rpyjukdy1ccr2u5r9@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29perf tools: Move everything related to sys_perf_event_open() to perf-sys.hArnaldo Carvalho de Melo
And remove unneeded include directives from perf-sys.h to prune the header dependency tree. Fixup the fallout in places where definitions were being used without the needed include directives that were being satisfied because they were in perf-sys.h. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-7b1zvugiwak4ibfa3j6ott7f@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29perf header: Move CPUINFO_PROC to the only file where it is usedArnaldo Carvalho de Melo
To reduce perf-sys.h and eventually nuke it. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-ars2j5m3if3gypsvkbbijucq@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29perf tools: Remove needless libtraceevent include directivesArnaldo Carvalho de Melo
Remove traceevent/event-parse.h and traceevent/trace-seq.h from places where it is not needed. Should avoid rebuilding those files when these traceevent headers get changed. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tzvetomir Stoyanov <tstoyanov@vmware.com> Link: https://lkml.kernel.org/n/tip-26hn75jn9rdealn4uqtzend6@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29libperf: Warn when exceeding MAX_NR_CPUS in cpumapKyle Meyer
Display a warning when attempting to profile more than MAX_NR_CPU CPUs. This patch should not change any behavior. Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Reviewed-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <russ.anderson@hpe.com> Link: http://lore.kernel.org/lkml/20190827214352.94272-8-meyerk@stormcage.eag.rdlabs.hpecorp.net Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29perf header: Replace MAX_NR_CPUS with cpu__max_cpu()Kyle Meyer
The function cpu__max_cpu() returns the possible number of CPUs as defined in the sysfs and can be used as an alternative for MAX_NR_CPUS in write_cache. MAX_CACHES is replaced by cpu__max_cpu() * MAX_CACHE_LVL. Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Reviewed-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <russ.anderson@hpe.com> Link: http://lore.kernel.org/lkml/20190827214352.94272-7-meyerk@stormcage.eag.rdlabs.hpecorp.net Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29perf machine: Replace MAX_NR_CPUS with perf_env::nr_cpus_onlineKyle Meyer
nr_cpus, the number of CPUs online during a record session bound by MAX_NR_CPUS, can be used as a dynamic alternative for MAX_NR_CPUS in __machine__synthesize_threads and machine__set_current_tid. Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Reviewed-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <russ.anderson@hpe.com> Link: http://lore.kernel.org/lkml/20190827214352.94272-6-meyerk@stormcage.eag.rdlabs.hpecorp.net Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29perf session: Replace MAX_NR_CPUS with perf_env::nr_cpus_onlineKyle Meyer
nr_cpus, the number of CPUs online during a record session bound by MAX_NR_CPUS, can be used as a dynamic alternative for MAX_NR_CPUS in perf_session__cpu_bitmap. Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Reviewed-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <russ.anderson@hpe.com> Link: http://lore.kernel.org/lkml/20190827214352.94272-5-meyerk@stormcage.eag.rdlabs.hpecorp.net Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29perf stat: Replace MAX_NR_CPUS with cpu__max_cpu()Kyle Meyer
The function cpu__max_cpu() returns the possible number of CPUs as defined in the sysfs and can be used as an alternative for MAX_NR_CPUS in zero_per_pkg() and check_per_pkg(). Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Reviewed-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <russ.anderson@hpe.com> Link: http://lore.kernel.org/lkml/20190827214352.94272-4-meyerk@stormcage.eag.rdlabs.hpecorp.net Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29perf svghelper: Replace MAX_NR_CPUS with perf_env::nr_cpus_onlineKyle Meyer
'nr_cpus', the number of CPUs online during a record session bound by MAX_NR_CPUS, can be used as a dynamic alternative for MAX_NR_CPUS in svg_build_topology_map(). The value of nr_cpus can be passed into str_to_bitmap(), scan_core_topology(), and svg_build_topology_map() to replace MAX_NR_CPUS as well. Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Reviewed-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <russ.anderson@hpe.com> Link: http://lore.kernel.org/lkml/20190827214352.94272-3-meyerk@stormcage.eag.rdlabs.hpecorp.net Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29perf timechart: Refactor svg_build_topology_map()Kyle Meyer
Exchange the parameters of svg_build_topology_map() with 'struct perf_env *env' and adjust the function accordingly. This patch should not change any behavior, it is merely refactoring for the following patch. Committer notes: No need to include env.h from svghelper.h, all it needs is a forward declaration for 'struct perf_env', so move the include directive to svghelper.c, where it is really needed. Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Reviewed-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <russ.anderson@hpe.com> Link: http://lore.kernel.org/lkml/20190827214352.94272-2-meyerk@stormcage.eag.rdlabs.hpecorp.net Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29perf c2c: Display proper cpu count in nodes columnJiri Olsa
There's wrong bitmap considered when checking for cpu count of specific node. We do the needed computation for 'set' variable, but at the end we use the 'c2c_he->cpuset' weight, which shows misleading numbers. Fixes: 1e181b92a2da ("perf c2c report: Add 'node' sort key") Reported-by: Joe Mario <jmario@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Michael Petlan <mpetlan@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20190820140219.28338-1-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29i2c: piix4: Fix port selection for AMD Family 16h Model 30hAndrew Cooks
Family 16h Model 30h SMBus controller needs the same port selection fix as described and fixed in commit 0fe16195f891 ("i2c: piix4: Fix SMBus port selection for AMD Family 17h chips") commit 6befa3fde65f ("i2c: piix4: Support alternative port selection register") also fixed the port selection for Hudson2, but unfortunately this is not the exact same device and the AMD naming and PCI Device IDs aren't particularly helpful here. The SMBus port selection register is common to the following Families and models, as documented in AMD's publicly available BIOS and Kernel Developer Guides: 50742 - Family 15h Model 60h-6Fh (PCI_DEVICE_ID_AMD_KERNCZ_SMBUS) 55072 - Family 15h Model 70h-7Fh (PCI_DEVICE_ID_AMD_KERNCZ_SMBUS) 52740 - Family 16h Model 30h-3Fh (PCI_DEVICE_ID_AMD_HUDSON2_SMBUS) The Hudson2 PCI Device ID (PCI_DEVICE_ID_AMD_HUDSON2_SMBUS) is shared between Bolton FCH and Family 16h Model 30h, but the location of the SmBus0Sel port selection bits are different: 51192 - Bolton Register Reference Guide We distinguish between Bolton and Family 16h Model 30h using the PCI Revision ID: Bolton is device 0x780b, revision 0x15 Family 16h Model 30h is device 0x780b, revision 0x1F Family 15h Model 60h and 70h are both device 0x790b, revision 0x4A. The following additional public AMD BKDG documents were checked and do not share the same port selection register: 42301 - Family 15h Model 00h-0Fh doesn't mention any 42300 - Family 15h Model 10h-1Fh doesn't mention any 49125 - Family 15h Model 30h-3Fh doesn't mention any 48751 - Family 16h Model 00h-0Fh uses the previously supported index register SB800_PIIX4_PORT_IDX_ALT at 0x2e Signed-off-by: Andrew Cooks <andrew.cooks@opengear.com> Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: stable@vger.kernel.org [v4.6+] Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2019-08-29nvme-rdma: Use rq_dma_dir macroIsrael Rukshin
Remove code duplication. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-fc: Use rq_dma_dir macroIsrael Rukshin
Remove code duplication. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-pci: Tidy up nvme_unmap_dataIsrael Rukshin
Remove pointless local variable and use rq_dma_dir macro. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: make fabrics command run on a separate request queueSagi Grimberg
We have a fundamental issue that fabric commands use the admin_q. The reason is, that admin-connect, register reads and writes and admin commands cannot be guaranteed ordering while we are running controller resets. For example, when we reset a controller we perform: 1. disable the controller 2. teardown the admin queue 3. re-establish the admin queue 4. enable the controller In order to perform (3), we need to unquiesce the admin queue, however we may have some admin commands that are already pending on the quiesced admin_q and will immediate execute when we unquiesce it before we execute (4). The host must not send admin commands to the controller before enabling the controller. To fix this, we have the fabric commands (admin connect and property get/set, but not I/O queue connect) use a separate fabrics_q and make sure to quiesce the admin_q before we disable the controller, and unquiesce it only after we enable the controller. This fixes the error prints from nvmet in a controller reset storm test: kernel: nvmet: got cmd 6 while CC.EN == 0 on qid = 0 Which indicate that the host is sending an admin command when the controller is not enabled. Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-pci: Support shared tags across queues for Apple 2018 controllersBenjamin Herrenschmidt
Another issue with the Apple T2 based 2018 controllers seem to be that they blow up (and shut the machine down) if there's a tag collision between the IO queue and the Admin queue. My suspicion is that they use our tags for their internal tracking and don't mix them with the queue id. They also seem to not like when tags go beyond the IO queue depth, ie 128 tags. This adds a quirk that marks tags 0..31 of the IO queue reserved Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-pci: Add support for Apple 2018+ modelsBenjamin Herrenschmidt
Based on reverse engineering and original patch by Paul Pawlowski <paul@mrarm.io> This adds support for Apple weird implementation of NVME in their 2018 or later machines. It accounts for the twice-as-big SQ entries for the IO queues, and the fact that only interrupt vector 0 appears to function properly. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-pci: Add support for variable IO SQ element sizeBenjamin Herrenschmidt
The size of a submission queue element should always be 6 (64 bytes) by spec. However some controllers such as Apple's are not properly implementing the standard and require a different size. This provides the ground work for the subsequent quirks for these controllers. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-pci: Pass the queue to SQ_SIZE/CQ_SIZE macrosBenjamin Herrenschmidt
This will make it easier to handle variable queue entry sizes later. No functional change. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: trace bio completionHannes Reinecke
When native multipathing is enabled we cannot enable blktrace for the underlying paths, so any completion is never traced. Signed-off-by: Hannes Reinecke <hare@suse.com> [fixed-up by Mikhail for non-multipath-build] Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-multipath: fix ana log nsid lookup when nsid is not foundAnton Eidelman
ANA log parsing invokes nvme_update_ana_state() per ANA group desc. This updates the state of namespaces with nsids in desc->nsids[]. Both ctrl->namespaces list and desc->nsids[] array are sorted by nsid. Hence nvme_update_ana_state() performs a single walk over ctrl->namespaces: - if current namespace matches the current desc->nsids[n], this namespace is updated, and n is incremented. - the process stops when it encounters the end of either ctrl->namespaces end or desc->nsids[] In case desc->nsids[n] does not match any of ctrl->namespaces, the remaining nsids following desc->nsids[n] will not be updated. Such situation was considered abnormal and generated WARN_ON_ONCE. However ANA log MAY contain nsids not (yet) found in ctrl->namespaces. For example, lets consider the following scenario: - nvme0 exposes namespaces with nsids = [2, 3] to the host - a new namespace nsid = 1 is added dynamically - also, a ANA topology change is triggered - NS_CHANGED aen is generated and triggers scan_work - before scan_work discovers nsid=1 and creates a namespace, a NOTICE_ANA aen was issues and ana_work receives ANA log with nsids=[1, 2, 3] Result: ana_work fails to update ANA state on existing namespaces [2, 3] Solution: Change the way nvme_update_ana_state() namespace list walk checks the current namespace against desc->nsids[n] as follows: a) ns->head->ns_id < desc->nsids[n]: keep walking ctrl->namespaces. b) ns->head->ns_id == desc->nsids[n]: match, update the namespace c) ns->head->ns_id >= desc->nsids[n]: skip to desc->nsids[n+1] This enables correct operation in the scenario described above. This also allows ANA log to contain nsids currently invisible to the host, i.e. inactive nsids. Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvmet-tcp: Add TOS for tcp transportIsrael Rukshin
Set the outgoing packets type of service (TOS) according to the receiving TOS. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Suggested-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-tcp: Add TOS for tcp transportIsrael Rukshin
TOS provide clients the ability to segregate traffic flows for different type of data. One of the TOS usage is bandwidth management which allows setting bandwidth limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A and 20% to controllers at QoS class B. usage examples: nvme connect --tos=0 --transport=tcp --traddr=10.0.1.1 --nqn=test-nvme Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-tcp: Use struct nvme_ctrl directlyIsrael Rukshin
This patch doesn't change any functionality. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-rdma: Add TOS for rdma transportIsrael Rukshin
For RDMA transports, TOS is an extension of IB QoS to provide clients the ability to segregate traffic flows for different type of data. RDMA CM abstract it for ULPs using rdma_set_service_type(). Internally, each traffic flow is represented by a connection with all of its independent resources like that of a normal connection, and is differentiated by service type. In other words, there can be multiple qp connections between an IP pair and each supports a unique service type. One of the TOS usage is bandwidth management which allows setting bandwidth limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A and 20% to controllers at QoS class B. Note: In addition to the TOS configuration, QOS must be configured on the relevant HCA on the target (send RDMA commands) and initiator to effect the traffic. usage examples: nvme connect --tos=0 --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-fabrics: Add type of service (TOS) configurationIsrael Rukshin
TOS is user-defined and needs to be configured via nvme-cli. It must be set before initiating any traffic and once set the TOS cannot be changed. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvmet-tcp: fix possible memory leakSagi Grimberg
when we uninit a command in error flow we also need to free an iovec if it was allocated. Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvmet-tcp: fix possible NULL derefSagi Grimberg
We must only call sgl_free for sgl that we actually allocated. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvmet: trace: parse Get LBA Status command in detailMinwoo Im
Four different fields are in CDWs of Get LBA Status command which means it would be great if we can see in detail when tracing in target side also. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: trace: parse Get LBA Status command in detailMinwoo Im
Four different fields are in CDWs of Get LBA Status command which means it would be great if we can see in detail when tracing. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: trace: support for Get LBA Status opcode parsedMinwoo Im
This patch adds Get LBA Status command's opcode to the macro that is used by the trace feature. Now we can see "get_lba_status" instead of the opcode value itself. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: add Get LBA Status command opcodeMinwoo Im
NVMe 1.4 added Get LBA Status command with opcode 0x86. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvmet: fix data units read and written counters in SMART logTom Wu
In nvme spec 1.3 there is a definition for data write/read counters from SMART log, (See section 5.14.1.2): This value is reported in thousands (i.e., a value of 1 corresponds to 1000 units of 512 bytes read) and is rounded up. However, in nvme target where value is reported with actual units, but not thousands of units as the spec requires. Signed-off-by: Tom Wu <tomwu@mellanox.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-tcp: support simple pollingSagi Grimberg
Simple polling support via socket busy_poll interface. Although we do not shutdown interrupts but simply hammer the socket poll, we can sometimes find completions faster than the normal interrupt driven RX path. We add per queue nr_cqe counter that resets every time RX path is invoked such that .poll callback can return it to stay consistent with the semantics. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: tcp: selects CRYPTO_CRC32C for nvme-tcpMinwoo Im
The tcp host module is now taking those APIs from crypto ahash: (1) crypto_ahash_final() (2) crypto_ahash_digest() (3) crypto_alloc_ahash() nvme-tcp should depends on CRYPTO_CRC32C. Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <kbusch@kernel.org> Cc: Jens Axboe <axboe@fb.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: don't pass cap to nvme_disable_ctrlSagi Grimberg
All seem to call it with ctrl->cap so no need to pass it at all. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: move sqsize setting to the coreSagi Grimberg
nvme_enable_ctrl reads the cap register right after, so no need to do that locally in the transport driver. Have sqsize setting in nvme_init_identify. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-pci: set ctrl sqsize to the device q_depthSagi Grimberg
Align with what the rest of the transports are doing. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: have nvme_init_identify set ctrl->capSagi Grimberg
No need to use a stack cap variable. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-tcp: Use protocol specific operations while reading socketPotnuri Bharat Teja
Using socket specific read_sock() calls instead of directly calling tcp_read_sock() helps lld module registered handlers if any, to be called from nvme-tcp host. This patch therefore replaces the tcp_read_sock() with socket specific prot_ops. Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-tcp: cleanup nvme_tcp_recv_pduSagi Grimberg
Can return directly in the switch statement Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29Merge tag 'perf-core-for-mingo-5.4-20190829' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo: perf top: Namhyung Kim: - Decay all events in the evlist, we were decaying just the first event in a group. - Fix linking of histograms in different evsels in a event group with more than two events. With the two fixes above a command line such as: # perf top -e '{cycles,instructions,cache-misses,cache-references} Should work as expected, with four columns and with all of them being decayed over time, i.e. less weight is given for older samples. perf record: Arnaldo Carvalho de Melo: - Fix collection of build-ids when using setns() to get into namespaces, which had been broken with the introduction of the extra thread to react to PERF_RECORD_BPF_EVENT, i.e. to collect extra info for BPF programs. We need to unshare(CLONE_FS) in that thread so that the main one can do the setns(CLONE_NEWNS) when collectingthe build-ids. Without that symbol resolution gets more difficult and potentially misresolves symbols. core: Igor Lubashev: - Further alignment in permission checking via capabilities to how the kernel checks what tooling tries to do. PowerPC: Naveen N. Rao: - Sync powerpc syscall.tbl, so that 'perf trace' gets the definitions for recent syscalls. libperf: Jiri Olsa: - Move the rest of the PERF_RECORD_ metadata struct definitions so that we can use 'union perf_event'. libtraceevent: Steven Rostedt (VMware): - Do not free tep->cmdlines in add_new_comm() on failure. - Remove unneeded qsort and uses memmove instead Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-29x86/mm/pti: Do not invoke PTI functions when PTI is disabledThomas Gleixner
When PTI is disabled at boot time either because the CPU is not affected or PTI has been disabled on the command line, the boot code still calls into pti_finalize() which then unconditionally invokes: pti_clone_entry_text() pti_clone_kernel_text() pti_clone_kernel_text() was called unconditionally before the 32bit support was added and 32bit added the call to pti_clone_entry_text(). The call has no side effects as cloning the page tables into the available second one, which was allocated for PTI does not create damage. But it does not make sense either and in case that this functionality would be extended later this might actually lead to hard to diagnose issues. Neither function should be called when PTI is runtime disabled. Make the invocation conditional. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190828143124.063353972@linutronix.de
2019-08-29x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()Song Liu
pti_clone_pmds() assumes that the supplied address is either: - properly PUD/PMD aligned or - the address is actually mapped which means that independently of the mapping level (PUD/PMD/PTE) the next higher mapping exists. If that's not the case the unaligned address can be incremented by PUD or PMD size incorrectly. All callers supply mapped and/or aligned addresses, but for the sake of robustness it's better to handle that case properly and to emit a warning. [ tglx: Rewrote changelog and added WARN_ON_ONCE() ] Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908282352470.1938@nanos.tec.linutronix.de
2019-08-29x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel textThomas Gleixner
ftrace does not use text_poke() for enabling trace functionality. It uses its own mechanism and flips the whole kernel text to RW and back to RO. The CPA rework removed a loop based check of 4k pages which tried to preserve a large page by checking each 4k page whether the change would actually cover all pages in the large page. This resulted in endless loops for nothing as in testing it turned out that it actually never preserved anything. Of course testing missed to include ftrace, which is the one and only case which benefitted from the 4k loop. As a consequence enabling function tracing or ftrace based kprobes results in a full 4k split of the kernel text, which affects iTLB performance. The kernel RO protection is the only valid case where this can actually preserve large pages. All other static protections (RO data, data NX, PCI, BIOS) are truly static. So a conflict with those protections which results in a split should only ever happen when a change of memory next to a protected region is attempted. But these conflicts are rightfully splitting the large page to preserve the protected regions. In fact a change to the protected regions itself is a bug and is warned about. Add an exception for the static protection check for kernel text RO when the to be changed region spawns a full large page which allows to preserve the large mappings. This also prevents the syslog to be spammed about CPA violations when ftrace is used. The exception needs to be removed once ftrace switched over to text_poke() which avoids the whole issue. Fixes: 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely") Reported-by: Song Liu <songliubraving@fb.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Song Liu <songliubraving@fb.com> Reviewed-by: Song Liu <songliubraving@fb.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908282355340.1938@nanos.tec.linutronix.de
2019-08-29i2c: designware: Synchronize IRQs when unregistering slave clientJarkko Nikula
Make sure interrupt handler i2c_dw_irq_handler_slave() has finished before clearing the the dev->slave pointer in i2c_dw_unreg_slave(). There is possibility for a race if i2c_dw_irq_handler_slave() is running on another CPU while clearing the dev->slave pointer. Reported-by: Krzysztof Adamski <krzysztof.adamski@nokia.com> Reported-by: Wolfram Sang <wsa@the-dreams.de> Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2019-08-29i2c: i801: Avoid memory leak in check_acpi_smo88xx_device()Andy Shevchenko
check_acpi_smo88xx_device() utilizes acpi_get_object_info() which in its turn allocates a buffer. User is responsible to clean allocated resources. The last has been missed in the original code. Fix it here. While here, replace !ACPI_SUCCESS() with ACPI_FAILURE(). Fixes: 19b07cb4a187 ("i2c: i801: Register optional lis3lv02d I2C device on Dell machines") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Pali Rohár <pali.rohar@gmail.com> Reviewed-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2019-08-29i2c: make i2c_unregister_device() ERR_PTR safeWolfram Sang
We are moving towards returning ERR_PTRs when i2c_new_*_device() calls fail. Make sure its counterpart for unregistering handles ERR_PTRs as well. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>