summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-04-01nfp: flower: replace CFI with vlan presentPieter Jansen van Vuuren
Replace vlan CFI bit with a vlan present bit that indicates the presence of a vlan tag. Previously the driver incorrectly assumed that an vlan id of 0 is not matchable, therefore we indicate vlan presence with a vlan present bit. Fixes: 5571e8c9f241 ("nfp: extend flower matching capabilities") Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Signed-off-by: Louis Peens <louis.peens@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01kcm: switch order of device registration to fix a crashJiri Slaby
When kcm is loaded while many processes try to create a KCM socket, a crash occurs: BUG: unable to handle kernel NULL pointer dereference at 000000000000000e IP: mutex_lock+0x27/0x40 kernel/locking/mutex.c:240 PGD 8000000016ef2067 P4D 8000000016ef2067 PUD 3d6e9067 PMD 0 Oops: 0002 [#1] SMP KASAN PTI CPU: 0 PID: 7005 Comm: syz-executor.5 Not tainted 4.12.14-396-default #1 SLE15-SP1 (unreleased) RIP: 0010:mutex_lock+0x27/0x40 kernel/locking/mutex.c:240 RSP: 0018:ffff88000d487a00 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 000000000000000e RCX: 1ffff100082b0719 ... CR2: 000000000000000e CR3: 000000004b1bc003 CR4: 0000000000060ef0 Call Trace: kcm_create+0x600/0xbf0 [kcm] __sock_create+0x324/0x750 net/socket.c:1272 ... This is due to race between sock_create and unfinished register_pernet_device. kcm_create tries to do "net_generic(net, kcm_net_id)". but kcm_net_id is not initialized yet. So switch the order of the two to close the race. This can be reproduced with mutiple processes doing socket(PF_KCM, ...) and one process doing module removal. Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module") Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01PM / wakeup: Use pm_pr_dbg() instead of pr_debug()Stephen Boyd
These prints are useful if we're doing PM suspend debugging. Having them at pr_debug() level means that we need to either enable DEBUG in this file, or compile the kernel with dynamic debug capabilities. Both of these options have drawbacks like custom compilation or opting into all debug statements being included into the kernel image. Given that we already have infrastructure to collect PM debugging information with CONFIG_PM_DEBUG and friends, let's change the pr_debug usage here to be pm_pr_dbg() instead so we can collect the wakeup information in the kernel logs. Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-04-01Merge branch 'net-sched-fix-stats-accounting-for-child-NOLOCK-qdiscs'David S. Miller
Paolo Abeni says: ==================== net: sched: fix stats accounting for child NOLOCK qdiscs Currently, stats accounting for NOLOCK qdisc enslaved to classful (lock) qdiscs is buggy. Per CPU values are ignored in most places, as a result, stats dump in the above scenario always report 0 length backlog and parent backlog len is not updated correctly on NOLOCK qdisc removal. The first patch address stats dumping, and the second one child qdisc removal. I'm targeting the net tree as this is a bugfix, but it could be moved to net-next due to the relatively large diffstat. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01net: sched: introduce and use qdisc tree flush/purge helpersPaolo Abeni
The same code to flush qdisc tree and purge the qdisc queue is duplicated in many places and in most cases it does not respect NOLOCK qdisc: the global backlog len is used and the per CPU values are ignored. This change addresses the above, factoring-out the relevant code and using the helpers introduced by the previous patch to fetch the correct backlog len. Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01net: sched: introduce and use qstats read helpersPaolo Abeni
Classful qdiscs can't access directly the child qdiscs backlog length: if such qdisc is NOLOCK, per CPU values should be accounted instead. Most qdiscs no not respect the above. As a result, qstats fetching for most classful qdisc is currently incorrect: if the child qdisc is NOLOCK, it always reports 0 len backlog. This change introduces a pair of helpers to safely fetch both backlog and qlen and use them in stats class dumping functions, fixing the above issue and cleaning a bit the code. DRR needs also to access the child qdisc queue length, so it needs custom handling. Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01cpuidle: exynos: Unify target residency for AFTR and coupled AFTR statesMarek Szyprowski
Since commit 45f1ff59e27c ("cpuidle: Return nohz hint from cpuidle_select()") Exynos CPUidle driver stopped entering C1 (AFTR) mode on Exynos4412-based Trats2 board. Further analysis revealed that the CPUidle framework changed the way it handles predicted timer ticks and reported target residency for the given idle states. As a result, the C1 (AFTR) state was not chosen anymore on completely idle device. The main issue was to high target residency value. The similar C1 (AFTR) state for 'coupled' CPUidle version used 10 times lower value for the target residency, despite the fact that it is the same state from the hardware perspective. The 100000us value for standard C1 (AFTR) mode is there from the begining of the support for this idle state, added by the commit 67173ca492ab ("ARM: EXYNOS: Add support AFTR mode on EXYNOS4210"). That commit doesn't give any reason for it, instead it looks like it was blindly copied from the WFI/IDLE state of the same driver that time. That time, that value was probably not really used by the framework for any critical decision, so it didn't matter that much. Now it turned out to be an issue, so unify the target residency with the 'coupled' version, as it seems to better match the real use case values and restores the operation of the Exynos CPUidle driver on the idle device. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-04-01cpufreq: Add cpufreq_cpu_acquire() and cpufreq_cpu_release()Rafael J. Wysocki
It sometimes is necessary to find a cpufreq policy for a given CPU and acquire its rwsem (for writing) immediately after that, so introduce cpufreq_cpu_acquire() as a helper for that and the complementary cpufreq_cpu_release(). Make cpufreq_update_policy() use the new functions. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-04-01cpufreq: intel_pstate: Driver-specific handling of _PPC updatesRafael J. Wysocki
In some cases, the platform firmware disables or enables turbo frequencies for all CPUs globally before triggering a _PPC change notification for one of them. Obviously, that global change affects all CPUs, not just the notified one, and it needs to be acted upon by cpufreq. The intel_pstate driver is able to detect such global changes of the settings, but it also needs to update policy limits for all CPUs if that happens, in particular if turbo frequencies are enabled globally - to allow them to be used. For this reason, introduce a new cpufreq driver callback to be invoked on _PPC notifications, if present, instead of simply calling cpufreq_update_policy() for the notified CPU and make intel_pstate use it to trigger policy updates for all CPUs in the system if global settings change. Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759 Reported-by: Gabriele Mazzotta <gabriele.mzt@gmail.com> Tested-by: Gabriele Mazzotta <gabriele.mzt@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-04-01cpufreq/intel_pstate: Load only on Intel hardwareBorislav Petkov
This driver is Intel-only so loading on anything which is not Intel is pointless. Prevent it from doing so. While at it, correct the "not supported" print statement to say CPU "model" which is what that test does. Fixes: 076b862c7e44 (cpufreq: intel_pstate: Add reasons for failure and debug messages) Suggested-by: Erwan Velu <e.velu@criteo.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Renninger <trenn@suse.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-04-01net/sched: fix ->get helper of the matchall clsNicolas Dichtel
It returned always NULL, thus it was never possible to get the filter. Example: $ ip link add foo type dummy $ ip link add bar type dummy $ tc qdisc add dev foo clsact $ tc filter add dev foo protocol all pref 1 ingress handle 1234 \ matchall action mirred ingress mirror dev bar Before the patch: $ tc filter get dev foo protocol all pref 1 ingress handle 1234 matchall Error: Specified filter handle not found. We have an error talking to the kernel After: $ tc filter get dev foo protocol all pref 1 ingress handle 1234 matchall filter ingress protocol all pref 1 matchall chain 0 handle 0x4d2 not_in_hw action order 1: mirred (Ingress Mirror to device bar) pipe index 1 ref 1 bind 1 CC: Yotam Gigi <yotamg@mellanox.com> CC: Jiri Pirko <jiri@mellanox.com> Fixes: fd62d9f5c575 ("net/sched: matchall: Fix configuration race") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01signal: don't silently convert SI_USER signals to non-current pidfdJann Horn
The current sys_pidfd_send_signal() silently turns signals with explicit SI_USER context that are sent to non-current tasks into signals with kernel-generated siginfo. This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case. If a user actually wants to send a signal with kernel-provided siginfo, they can do that with pidfd_send_signal(pidfd, sig, NULL, 0); so allowing this case is unnecessary. Instead of silently replacing the siginfo, just bail out with an error; this is consistent with other interfaces and avoids special-casing behavior based on security checks. Fixes: 3eb39f47934f ("signal: add pidfd_send_signal() syscall") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Christian Brauner <christian@brauner.io>
2019-04-01dm table: propagate BDI_CAP_STABLE_WRITES to fix sporadic checksum errorsIlya Dryomov
Some devices don't use blk_integrity but still want stable pages because they do their own checksumming. Examples include rbd and iSCSI when data digests are negotiated. Stacking DM (and thus LVM) on top of these devices results in sporadic checksum errors. Set BDI_CAP_STABLE_WRITES if any underlying device has it set. Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01dm: revert 8f50e358153d ("dm: limit the max bio size as BIO_MAX_PAGES * ↵Mikulas Patocka
PAGE_SIZE") The limit was already incorporated to dm-crypt with commit 4e870e948fba ("dm crypt: fix error with too large bios"), so we don't need to apply it globally to all targets. The quantity BIO_MAX_PAGES * PAGE_SIZE is wrong anyway because the variable ti->max_io_len it is supposed to be in the units of 512-byte sectors not in bytes. Reduction of the limit to 1048576 sectors could even cause data corruption in rare cases - suppose that we have a dm-striped device with stripe size 768MiB. The target will call dm_set_target_max_io_len with the value 1572864. The buggy code would reduce it to 1048576. Now, the dm-core will errorneously split the bios on 1048576-sector boundary insetad of 1572864-sector boundary and pass these stripe-crossing bios to the striped target. Cc: stable@vger.kernel.org # v4.16+ Fixes: 8f50e358153d ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01dm init: fix const confusion for dm_allowed_targets arrayAndi Kleen
A non const pointer to const cannot be marked initconst. Mark the array actually const. Fixes: 6bbc923dfcf5 dm: add support to directly boot to a mapped device Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01dm integrity: make dm_integrity_init and dm_integrity_exit staticYueHaibing
Fix sparse warnings: drivers/md/dm-integrity.c:3619:12: warning: symbol 'dm_integrity_init' was not declared. Should it be static? drivers/md/dm-integrity.c:3638:6: warning: symbol 'dm_integrity_exit' was not declared. Should it be static? Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01dm integrity: change memcmp to strncmp in dm_integrity_ctrMikulas Patocka
If the string opt_string is small, the function memcmp can access bytes that are beyond the terminating nul character. In theory, it could cause segfault, if opt_string were located just below some unmapped memory. Change from memcmp to strncmp so that we don't read bytes beyond the end of the string. Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01cifs: a smb2_validate_and_copy_iov failure does not mean the handle is invalid.Ronnie Sahlberg
It only means that we do not have a valid cached value for the file_all_info structure. CC: Stable <stable@vger.kernel.org> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-04-01SMB3: Allow persistent handle timeout to be configurable on mountSteve French
Reconnecting after server or network failure can be improved (to maintain availability and protect data integrity) by allowing the client to choose the default persistent (or resilient) handle timeout in some use cases. Today we default to 0 which lets the server pick the default timeout (usually 120 seconds) but this can be problematic for some workloads. Add the new mount parameter to cifs.ko for SMB3 mounts "handletimeout" which enables the user to override the default handle timeout for persistent (mount option "persistenthandles") or resilient handles (mount option "resilienthandles"). Maximum allowed is 16 minutes (960000 ms). Units for the timeout are expressed in milliseconds. See section 2.2.14.2.12 and 2.2.31.3 of the MS-SMB2 protocol specification for more information. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: Stable <stable@vger.kernel.org>
2019-04-01smb3: Fix enumerating snapshots to AzureSteve French
Some servers (see MS-SMB2 protocol specification section 3.3.5.15.1) expect that the FSCTL enumerate snapshots is done twice, with the first query having EXACTLY the minimum size response buffer requested (16 bytes) which refreshes the snapshot list (otherwise that and subsequent queries get an empty list returned). So had to add code to set the maximum response size differently for the first snapshot query (which gets the size needed for the second query which contains the actual list of snapshots). Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org> # 4.19+
2019-04-01cifs: fix kref underflow in close_shroot()Ronnie Sahlberg
Fix a bug where we used to not initialize the cached fid structure at all in open_shroot() if the open was successful but we did not get a lease. This would leave the structure uninitialized and later when we close the handle we would in close_shroot() try to kref_put() an uninitialized refcount. Fix this by always initializing this structure if the open was successful but only do the extra get() if we got a lease. This extra get() is only used to hold the structure until we get a lease break from the server at which point we will kref_put() it during lease processing. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>
2019-04-01i40e: add tracking of AF_XDP ZC state for each queue pairBjörn Töpel
In commit f3fef2b6e1cc ("i40e: Remove umem from VSI") a regression was introduced; When the VSI was reset, the setup code would try to enable AF_XDP ZC unconditionally (as long as there was a umem placed in the netdev._rx struct). Here, we add a bitmap to the VSI that tracks if a certain queue pair has been "zero-copy enabled" via the ndo_bpf. The bitmap is used in i40e_xsk_umem, and enables zero-copy if and only if XDP is enabled, the corresponding qid in the bitmap is set and the umem is non-NULL. Fixes: f3fef2b6e1cc ("i40e: Remove umem from VSI") Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-04-01perf vendor events intel: Update Silvermont to v14Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update GoldmontPlus to v1.01Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update Goldmont to v13Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update Bonnell to V4Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update KnightsLanding events to v9Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update Haswell events to v28Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update IvyBridge events to v21Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update SandyBridge events to v16Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update JakeTown events to v20Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update IvyTown events to v20Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update HaswellX events to v20Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update BroadwellX events to v14Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update SkylakeX events to v1.12Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update Skylake events to v42Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update Broadwell-DE events to v7Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update Broadwell events to v23Andi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf vendor events intel: Update metrics from TMAM 3.5Andi Kleen
Update all the Intel JSON metrics from Ahmad Yasin's TMAM 3.5 for Intel big core from Sandy Bridge to Cascade Lake. This has many improvements and new metircs - New TopDownL1_SMT group that provides a per SMT thread version of --topdown that does not require -a anymore. The drawback is increased multiplexing though since L1 TopDown does not fit into 4 generic counters anymore. - Added SMT aware versions of other metrics - Split SMT aware metrics into separate metrics to avoid unnecessary event collections - New metrics for better branch analysis: Estimated Branch_Mispredict_Costs, Instructions per taken Branch, Branch Instructions per Taken Branch, etc. - Instruction mix metrics: Instructions per load, Instructions per store, Instructions per Branch, Instructions per Call - New Cache metrics: Bandwidth to L1/L2/L3 caches. L1/L2/L3 misses per kilo instructions. memory level parallelism - New memory controller metrics: Normalized memory bandwidth in interval mode, Average memory latency, Average number of parallel read requests, - 3DXP persistent memory metrics for Cascade Lake: 3dxp read latency, 3dxp read/write bandwidth - Some other useful metrics like Instruction Level Parallelism, - Various other improvements. Not all metrics are available on all CPUs. Skylake has best coverage. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/20190315165219.GA21223@tassilo.jf.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf record: Implement --mmap-flush=<number> optionAlexey Budankov
Implement a --mmap-flush option that specifies minimal number of bytes that is extracted from mmaped kernel buffer to store into a trace. The default option value is 1 byte what means every time trace writing thread finds some new data in the mmaped buffer the data is extracted, possibly compressed and written to a trace. $ tools/perf/perf record --mmap-flush 1024 -e cycles -- matrix.gcc $ tools/perf/perf record --aio --mmap-flush 1K -e cycles -- matrix.gcc The option is independent from -z setting, doesn't vary with compression level and can serve two purposes. The first purpose is to increase the compression ratio of a trace data. Larger data chunks are compressed more effectively so the implemented option allows specifying data chunk size to compress. Also at some cases executing more write syscalls with smaller data size can take longer than executing less write syscalls with bigger data size due to syscall overhead so extracting bigger data chunks specified by the option value could additionally decrease runtime overhead. The second purpose is to avoid self monitoring live-lock issue in system wide (-a) profiling mode. Profiling in system wide mode with compression (-a -z) can additionally induce data into the kernel buffers along with the data from monitored processes. If performance data rate and volume from the monitored processes is high then trace streaming and compression activity in the tool is also high. High tool process activity can lead to subtle live-lock effect when compression of single new byte from some of mmaped kernel buffer leads to generation of the next single byte at some mmaped buffer. So perf tool process ends up in endless self monitoring. Implemented synch parameter is the mean to force data move independently from the specified flush threshold value. Despite the provided flush value the tool needs capability to unconditionally drain memory buffers, at least in the end of the collection. Committer testing: Running with the default value, i.e. as soon as there is something to read go on consuming, we first write the synthesized events, small chunks of about 128 bytes: # perf trace -m 2048 --call-graph dwarf -e write -- perf record <SNIP> 101.142 ( 0.004 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x210db60, count: 120) = 120 __libc_write (/usr/lib64/libpthread-2.28.so) ion (/home/acme/bin/perf) record__write (inlined) process_synthesized_event (/home/acme/bin/perf) perf_tool__process_synth_event (inlined) perf_event__synthesize_mmap_events (/home/acme/bin/perf) Then we move to reading the mmap buffers consuming the events put there by the kernel perf infrastructure: 107.561 ( 0.005 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02000, count: 336) = 336 __libc_write (/usr/lib64/libpthread-2.28.so) ion (/home/acme/bin/perf) record__write (inlined) record__pushfn (/home/acme/bin/perf) perf_mmap__push (/home/acme/bin/perf) record__mmap_read_evlist (inlined) record__mmap_read_all (inlined) __cmd_record (inlined) cmd_record (/home/acme/bin/perf) 12919.953 ( 0.136 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc83150, count: 184984) = 184984 <SNIP same backtrace as in the 107.561 timestamp> 12920.094 ( 0.155 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02150, count: 261816) = 261816 <SNIP same backtrace as in the 107.561 timestamp> 12920.253 ( 0.093 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befb81120, count: 170832) = 170832 <SNIP same backtrace as in the 107.561 timestamp> If we limit it to write only when more than 16MB are available for reading, it throttles that to a quarter of the --mmap-pages set for 'perf record', which by default get to 528384 bytes, found out using 'record -v': mmap flush: 132096 mmap size 528384B With that in place all the writes coming from record__mmap_read_evlist(), i.e. from the mmap buffers setup by the kernel perf infrastructure were at least 132096 bytes long. Trying with a bigger mmap size: perf trace -e write perf record -v -m 2048 --mmap-flush 16M 74982.928 ( 2.471 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff94a6cc000, count: 3580888) = 3580888 74985.406 ( 2.353 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff949ecb000, count: 3453256) = 3453256 74987.764 ( 2.629 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9496ca000, count: 3859232) = 3859232 74990.399 ( 2.341 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff948ec9000, count: 3769032) = 3769032 74992.744 ( 2.064 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9486c8000, count: 3310520) = 3310520 74994.814 ( 2.619 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff947ec7000, count: 4194688) = 4194688 74997.439 ( 2.787 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9476c6000, count: 4029760) = 4029760 Was again limited to a quarter of the mmap size: mmap flush: 2098176 mmap size 8392704B A warning about that would be good to have but can be added later, something like: "max flush is a quarter of the mmap size, if wanting to bump the mmap flush further, bump the mmap size as well using -m/--mmap-pages" Also rename the 'sync' parameters to 'synch' to keep tools/perf building with older glibcs: cc1: warnings being treated as errors builtin-record.c: In function 'record__mmap_read_evlist': builtin-record.c:775: warning: declaration of 'sync' shadows a global declaration /usr/include/unistd.h:933: warning: shadowed declaration is here builtin-record.c: In function 'record__mmap_read_all': builtin-record.c:856: warning: declaration of 'sync' shadows a global declaration /usr/include/unistd.h:933: warning: shadowed declaration is here Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/f6600d72-ecfa-2eb7-7e51-f6954547d500@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01tools build: Implement libzstd feature check, LIBZSTD_DIR and NO_LIBZSTD definesAlexey Budankov
Implement libzstd feature check, NO_LIBZSTD and LIBZSTD_DIR defines to override Zstd library sources or disable the feature from the command line: $ make -C tools/perf LIBZSTD_DIR=/path/to/zstd/sources/ clean all $ make -C tools/perf NO_LIBZSTD=1 clean all Auto detection feature status is reported just before compilation starts. If your system has some version of the zstd library preinstalled then the build system finds and uses it during the build. If you still prefer to compile with some other version of zstd library you have capability to refer the compilation to that version using LIBZSTD_DIR define. Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/9b4cd8b0-10a3-1f1e-8d6b-5922a7ca216b@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01tools lib traceevent: Rename input arguments and local variables of ↵Tzvetomir Stoyanov
libtraceevent from pevent to tep "pevent" to "tep" renaming of: - all "pevent" input arguments of libtraceevent internal functions. - all local "pevent" variables of libtraceevent. This makes the implementation consistent with the chosen naming convention, tep (trace event parser), and will avoid any confusion with the old pevent name Signed-off-by: Tzvetomir Stoyanov <tstoyanov@vmware.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/linux-trace-devel/20190401132111.13727-5-tstoyanov@vmware.com Link: http://lkml.kernel.org/r/20190401164344.944953447@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf tools, tools lib traceevent: Rename "pevent" member of struct ↵Tzvetomir Stoyanov
tep_event_filter to "tep" The member "pevent" of the struct tep_event_filter is renamed to "tep". This makes the struct consistent with the chosen naming convention: tep (trace event parser), instead of the old pevent. Signed-off-by: Tzvetomir Stoyanov <tstoyanov@vmware.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/linux-trace-devel/20190401132111.13727-4-tstoyanov@vmware.com Link: http://lkml.kernel.org/r/20190401164344.785896189@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01perf tools, tools lib traceevent: Rename "pevent" member of struct tep_event ↵Tzvetomir Stoyanov
to "tep" The member "pevent" of the struct tep_event is renamed to "tep". This makes the struct consistent with the chosen naming convention: tep (trace event parser), instead of the old pevent. Signed-off-by: Tzvetomir Stoyanov <tstoyanov@vmware.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/linux-trace-devel/20190401132111.13727-3-tstoyanov@vmware.com Link: http://lkml.kernel.org/r/20190401164344.627724996@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01tools lib traceevent: Rename input arguments of libtraceevent APIs from ↵Tzvetomir Stoyanov
pevent to tep Input arguments of libtraceevent APIs are renamed from "struct tep_handle *pevent" to "struct tep_handle *tep". This makes the API consistent with the chosen naming convention: tep (trace event parser), instead of the old pevent. Signed-off-by: Tzvetomir Stoyanov <tstoyanov@vmware.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/linux-trace-devel/20190401132111.13727-2-tstoyanov@vmware.com Link: http://lkml.kernel.org/r/20190401164344.465573837@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01tools tools, tools lib traceevent: Make traceevent APIs more consistentTzvetomir Stoyanov
Rename some traceevent APIs for consistency: tep_pid_is_registered() to tep_is_pid_registered() tep_file_bigendian() to tep_is_file_bigendian() to make the names and return values consistent with other tep_is_... APIs tep_data_lat_fmt() to tep_data_latency_format() to make the name more descriptive tep_host_bigendian() to tep_is_bigendian() tep_set_host_bigendian() to tep_set_local_bigendian() tep_is_host_bigendian() to tep_is_local_bigendian() "host" can be confused with VMs, and "local" is about the local machine. All tep_is_..._bigendian(struct tep_handle *tep) APIs return the saved data in the tep handle, while tep_is_bigendian() returns the running machine's endianness. All tep_is_... functions are modified to return bool value, instead of int. Signed-off-by: Tzvetomir Stoyanov <tstoyanov@vmware.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/20190327141946.4353-2-tstoyanov@vmware.com Link: http://lkml.kernel.org/r/20190401164344.288624897@goodmis.org [ Removed some extra parenthesis around return statements ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01tools lib traceevent: Remove call to exit() from tep_filter_add_filter_str()Tzvetomir Stoyanov
This patch removes call to exit() from tep_filter_add_filter_str(). A library function should not force the application to exit. In the current implementation tep_filter_add_filter_str() calls exit() when a special "test_filters" mode is set, used only for debugging purposes. When this mode is set and a filter is added - its string is printed to the console and exit() is called. This patch changes the logic - when in "test_filters" mode, the filter string is still printed, but the exit() is not called. It is up to the application to track when "test_filters" mode is set and to call exit, if needed. Signed-off-by: Tzvetomir Stoyanov <tstoyanov@vmware.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/20190326154328.28718-9-tstoyanov@vmware.com Link: http://lkml.kernel.org/r/20190401164344.121717482@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01tools lib traceevent: Remove tep filter trivial APIsTzvetomir Stoyanov
This patch removes trivial filter tep APIs: enum tep_filter_trivial_type tep_filter_event_has_trivial() tep_update_trivial() tep_filter_clear_trivial() Trivial filters is an optimization, used only in the first version of KernelShark. The API is deprecated, the next KernelShark release does not use it. Signed-off-by: Tzvetomir Stoyanov <tstoyanov@vmware.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/20190326154328.28718-4-tstoyanov@vmware.com Link: http://lkml.kernel.org/r/20190401164343.968458918@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01tools lib traceevent: Removed unneeded !! and return parenthesisSteven Rostedt (VMware)
As return is not a function we do not need parenthesis around the return value. Also, a function returning bool does not need to add !!. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Tzvetomir Stoyanov <tstoyanov@vmware.com> Link: http://lkml.kernel.org/r/20190401164343.817886725@goodmis.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-04-01tools lib traceevent: Implement new traceevent APIs for accessing struct ↵Tzvetomir Stoyanov
tep_handler fields As struct tep_handler definition is not exposed as part of libtraceevent API, its fields cannot be accessed directly by the library users. This patch implements new APIs, which can be used to access the struct tep_handler fields: tep_get_event() - retrieves an event pointer at a specific index tep_get_first_event() - is modified to use tep_get_event() tep_clear_flag() - clears a tep handle flag tep_test_flag() - test if a given flag is set tep_get_header_timestamp_size() - returns the size of the timestamp stored in the header. tep_get_cpus() - returns the number of CPUs tep_is_old_format() - returns true if data was created by an older kernel with the old data format tep_set_print_raw() - have the output print in the raw format tep_set_test_filters() - debugging utility for testing tep filters Signed-off-by: Tzvetomir Stoyanov <tstoyanov@vmware.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/linux-trace-devel/20190325145017.30246-4-tstoyanov@vmware.com Link: http://lkml.kernel.org/r/20190401164343.679629539@goodmis.org [ Renamed some newly added "pevent" to "tep" ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>