linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2015-04-08	pinctrl: bcm2835: Fix support for threaded level triggered IRQs	Charles Keepax
	Currently, the driver uses handle_simple_irq for all IRQ types and hard codes the acknowledge for different IRQ types into the handler. It is better to use the IRQ core as intended and let it handle the differences between the various types of IRQ. For example the current system does not work for threaded level triggered IRQs as these need to be masked until the threaded handler has run. Signed-off-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2015-04-08	ARM: mvebu: use 0xf1000000 as internal registers on Armada 370 DB	Thomas Petazzoni
	All Marvell EBU SoCs (Kirkwood, Dove, Orion, Armada) have the capability of changing the location of their internal registers (i.e the registers for most hardware blocks inside the SoC). When coming out of reset, the internal registers are mapped at 0xd0000000, but since years and years, the tradition has been to have the internal registers remapped at 0xf1000000 by the bootloader, and Linux has since then assumed that the internal registers for the SoC were located at 0xf1000000 on Kirkwood, Dove, Orion, etc. Linux has never been aware that those registers are remappable (and there is no way to know where they are mapped at runtime, since the register to configure the address of the registers is itself within the internal registers). Then came the Armada 370 and Armada XP, in which some of the very early silicon steppings had an issue, which forced to use 0xd0000000: the SoC was no longer working properly when the internal registers were remapped at 0xf1000000. This issue is only affecting very early silicon steppings and production steppings are not affected: the issue has been fixed in between. Since what we (Free Electrons) used to do the initial submission of the Armada 370 and Armada XP platforms was evaluation boards with those very early steppings, we submitted Device Tree that assumed the internal registers were mapped at 0xd0000000. This is the case for Armada 370 DB, Armada XP DB and Armada XP GP. However, in practice, since Marvell has been shipping the evaluation boards with production steppings of the SoC, they are shipping those boards with bootloaders that remap the registers to 0xf1000000. We have already changed this internal register address to 0xf1000000 for the Armada XP DB in commit 82066bdb5a75 and for the Armada XP GP in commit 91ed32200e6e (both merged in v3.15). We only recently got our hand on an Armada 370 DB with a production stepping of the SoC, which uses a bootloader that remaps internal registers at 0xf1000000. Therefore, this commit aligns the Armada 370 DB to be like the Armada XP DB and Armada XP GP: assume that the internal registers are mapped at 0xf1000000. We would like to stress out the fact that the usage of 0xd0000000 as the internal register base address was a temporary workaround for early steppings deficiencies, and that the real long-term solution is the usage of 0xf1000000. Having 0xd0000000 is an accident in the life of the Marvell platform support in the kernel, as is confirmed by the usage of 0xf1000000 in all previous Marvell platforms (Dove, Kirkwood, Orion). There are unfortunately a number of commercial devices that continue to use 0xd0000000 even though they use production steppings of the SoC, simply because the vendors of such devices have never bothered using a more recent bootloader version from Marvell. There is not much we can do about it, and we plan on keeping 0xd0000000 in the Device Tree of such devices. The main reason for remapping the internal registers at 0xf1000000 instead of 0xd0000000 is that it leaves more space in the 0 -> 4 GB part of the physical address space for RAM. With registers at 0xd0000000, all RAM between 0xd0000000 to 0xffffffff is lost because it's covered by the I/O registers. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Jason Cooper <jason@lakedameon.net> Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
2015-04-08	gpio: mvebu: Fix mask/unmask managment per irq chip type	Gregory CLEMENT
	Level IRQ handlers and edge IRQ handler are managed by tow different sets of registers. But currently the driver uses the same mask for the both registers. It lead to issues with the following scenario: First, an IRQ is requested on a GPIO to be triggered on front. After, this an other IRQ is requested for a GPIO of the same bank but triggered on level. Then the first one will be also setup to be triggered on level. It leads to an interrupt storm. The different kind of handler are already associated with two different irq chip type. With this patch the driver uses a private mask for each one which solves this issue. It has been tested on an Armada XP based board and on an Armada 375 board. For the both boards, with this patch is applied, there is no such interrupt storm when running the previous scenario. This bug was already fixed but in a different way in the legacy version of this driver by Evgeniy Dushistov: 9ece8839b1277fb9128ff6833411614ab6c88d68 "ARM: orion: Fix for certain sequence of request_irq can cause irq storm". The fact the new version of the gpio drive could be affected had been discussed there: http://thread.gmane.org/gmane.linux.ports.arm.kernel/344670/focus=364012 Reported-by: Evgeniy A. Dushistov <dushistov@mail.ru> Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Cc: <stable@vger.kernel.org> # v3.7 + Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2015-04-08	gfs2: fix quota refresh race in do_glock()	Abhi Das
	quotad periodically syncs in-memory quotas to the ondisk quota file and sets the QDF_REFRESH flag so that a subsequent read of a synced quota is re-read from disk. gfs2_quota_lock() checks for this flag and sets a 'force' bit to force re-read from disk if requested. However, there is a race condition here. It is possible for gfs2_quota_lock() to find the QDF_REFRESH flag unset (i.e force=0) and quotad comes in immediately after and syncs the relevant quota and sets the QDF_REFRESH flag. gfs2_quota_lock() resumes with force=0 and uses the stale in-memory quota usage values that result in miscalculations. This patch fixes this race by moving the check for the QDF_REFRESH flag check further out into the gfs2_quota_lock() process, i.e, in do_glock(), under the protection of the quota glock. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-04-08	crypto: sahara - fix AES descriptor create	Steffen Trumtrar
	The AES implementation still assumes, that the hw_desc[0] has a valid key as long as no new key needs to be set; consequentialy it always sets the AES key header for the first descriptor and puts data into the second one (hw_desc[1]). Change this to only update the key in the hardware, when a new key is to be set and use the first descriptor for data otherwise. Signed-off-by: Steffen Trumtrar <s.trumtrar@pengutronix.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	crypto: sahara - use the backlog	Steffen Trumtrar
	With commit 7e77bdebff5cb1e9876c561f69710b9ab8fa1f7e crypto: af_alg - fix backlog handling in place, the backlog works under all circumstances where it previously failed, atleast for the sahara driver. Use it. Signed-off-by: Steffen Trumtrar <s.trumtrar@pengutronix.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	crypto: user - Fix crypto_alg_match race	Herbert Xu
	The function crypto_alg_match returns an algorithm without taking any references on it. This means that the algorithm can be freed at any time, therefore all users of crypto_alg_match are buggy. This patch fixes this by taking a reference count on the algorithm to prevent such races. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	crypto: atmel-aes - correct usage of dma_sync_* API	Leilei Zhao
	The output buffer is used for CPU access, so the API should be dma_sync_single_for_cpu which makes the cache line invalid in order to reload the value in memory. Signed-off-by: Leilei Zhao <leilei.zhao@atmel.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	crypto: atmel-aes - sync the buf used in DMA or CPU	Leilei Zhao
	The input buffer and output buffer are mapped for DMA transfer in Atmel AES driver. But they are also be used by CPU when the requested crypt length is not bigger than the threshold value 16. The buffers will be cached in cache line when CPU accessed them. When DMA uses the buffers again, the memory can happened to be flushed by cache while DMA starts transfer. So using API dma_sync_single_for_device and dma_sync_single_for_cpu in DMA to ensure DMA coherence and CPU always access the correct value. This fix the issue that the encrypted result periodically goes wrong when doing performance test with OpenSSH. Signed-off-by: Leilei Zhao <leilei.zhao@atmel.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	crypto: atmel-aes - initialize spinlock in probe	Leilei Zhao
	Kernel will report "BUG: spinlock lockup suspected on CPU#0" when CONFIG_DEBUG_SPINLOCK is enabled in kernel config and the spinlock is used at the first time. It's caused by uninitialized spinlock, so just initialize it in probe. Signed-off-by: Leilei Zhao <leilei.zhao@atmel.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	crypto: atmel-tdes - initialize spinlock in probe	Leilei Zhao
	Kernel will report "BUG: spinlock lockup suspected on CPU#0" when CONFIG_DEBUG_SPINLOCK is enabled in kernel config and the spinlock is used at the first time. It's caused by uninitialized spinlock, so just initialize it in probe. Signed-off-by: Leilei Zhao <leilei.zhao@atmel.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	crypto: atmel-sha - correct the max burst size	Leilei Zhao
	The maximum source and destination burst size is 16 according to the datasheet of Atmel DMA. And the value is also checked in function at_xdmac_csize of Atmel DMA driver. With the restrict, the value beyond maximum value will not be processed in DMA driver, so SHA384 and SHA512 will not work and the program will wait forever. So here change the max burst size of all the cases to 16 in order to make SHA384 and SHA512 work and keep consistent with DMA driver and datasheet. Signed-off-by: Leilei Zhao <leilei.zhao@atmel.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	crypto: atmel-sha - initialize spinlock in probe	Leilei Zhao
	Kernel will report "BUG: spinlock lockup suspected on CPU#0" when CONFIG_DEBUG_SPINLOCK is enabled in kernel config and the spinlock is used at the first time. It's caused by uninitialized spinlock, so just initialize it in probe. Signed-off-by: Leilei Zhao <leilei.zhao@atmel.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	crypto: atmel-sha - fix sg list management	Leilei Zhao
	Having a zero length sg doesn't mean it is the end of the sg list. This case happens when calculating HMAC of IPSec packet. Signed-off-by: Leilei Zhao <leilei.zhao@atmel.com> Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	crypto: atmel-sha - correct the way data are split	Ludovic Desroches
	When a hash is requested on data bigger than the buffer allocated by the SHA driver, the way DMA transfers are performed is quite strange: The buffer is filled at each update request. When full, a DMA transfer is done. On next update request, another DMA transfer is done. Then we wait to have a full buffer (or the end of the data) to perform the dma transfer. Such a situation lead sometimes, on SAMA5D4, to a case where dma transfer is finished but the data ready irq never comes. Moreover hash was incorrect in this case. With this patch, dma transfers are only performed when the buffer is full or when there is no more data. So it removes the transfer whose size is equal the update size after the full buffer transmission. Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com> Signed-off-by: Leilei Zhao <leilei.zhao@atmel.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	crypto: atmel-sha - add new version	Leilei Zhao
	Add new version of atmel-sha available with SAMA5D4 devices. Signed-off-by: Leilei Zhao <leilei.zhao@atmel.com> Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	crypto: atmel-aes - add new version	Leilei Zhao
	Add new version of atmel-aes available with SAMA5D4 devices. Signed-off-by: Leilei Zhao <leilei.zhao@atmel.com> Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-08	vfio-pci: Fix use after free	Alex Williamson
	Reported by 0-day test infrastructure. Fixes: ecaa1f6a0154 ("vfio-pci: Add VGA arbiter client") Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-08	perf tools: Add 'I' event modifier for exclude_idle bit	Jiri Olsa
	Adding 'I' event modifier to have complete set of modifiers for perf_event_attr:exclude_* bits. Any event specified with 'I' modifier will have the perf_event_attr:exclude_idle bit set. $ perf record -e cycles:I -vv ls 2>&1 \| grep exclude_idle exclude_hv 0 exclude_idle 1 Adding automated tests. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: William Cohen <wcohen@redhat.com> Link: http://lkml.kernel.org/r/1428441919-23099-2-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08	perf report: Don't call map__kmap if map is NULL.	Wang Nan
	report__warn_kptr_restrict() calls map__kmap(kernel_map) before checking kernel_map againest NULL. Which is dangerous, since map__kmap() will return a invalid and not NULL address. It will trigger a warning message in map__kmap() after the patch "perf: kmaps: enforce usage of kmaps to protect futher bugs." was applied. This patch fixes it by adding the missing checking. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1428490772-135393-1-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08	perf tests: Fix attr tests	Jiri Olsa
	Following commit: 1a5941312414 perf: Add wakeup watermark control to the AUX area enlarged perf_event_attr, but did not updated attr tests. Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Markus T Metzger <markus.t.metzger@intel.com> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Link: http://lkml.kernel.org/n/20150407171715.GA22603@krava.redhat.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08	perf probe: Fix ARM 32 building error	Wang Nan
	Commit 9b118acae310f57baee770b5db402500d8695e50 ("perf probe: Fix to handle aliased symbols in glibc") uses an absolute format '%lx' to print u64 argument, which causes compiling error on ARM 32. This patch replaces it with PRIx64. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1428459274-138470-1-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08	SUNRPC: Export enums in tracepoints to user space	Steven Rostedt (Red Hat)
	The enums used in the tracepoints for __print_symbolic() have their names shown in the tracepoint format files. User space tools do not know how to convert those names into their values to be able to convert the binary data. Use TRACE_DEFINE_ENUM() to export the enum names to their values for userspace to do the parsing correctly. Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Acked-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	mm: tracing: Export enums in tracepoints to user space	Steven Rostedt (Red Hat)
	The enums used in tracepoints with __print_symbolic() have their names shown in the tracepoint format files and not their values. This makes it difficult for user space tools to convert the binary data to the strings as user space does not know what those enums are about. By having them use TRACE_DEFINE_ENUM(), the names of the enums will be mapped to the values and shown to user space. Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	irq/tracing: Export enums in tracepoints to user space	Steven Rostedt (Red Hat)
	The enums used by the softirq mapping is what is shown in the output of the __print_symbolic() and not their values, that are needed to map them to their strings. Export them to userspace with the TRACE_DEFINE_ENUM() macro so that user space tools can map the enums with their values. Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	f2fs: Export the enums in the tracepoints to userspace	Steven Rostedt (Red Hat)
	The tracepoints that use __print_symbolic() use enums as the value to convert to strings. Unfortunately, the format files for these tracepoints show the enum name and not their value. This causes some userspace tools not to know how to convert __print_symbolic() to their strings. Add TRACE_DEFINE_ENUM() macros to export the enums used to userspace to let those tools know what those enum values are. Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	net/9p/tracing: Export enums in tracepoints to userspace	Steven Rostedt (Red Hat)
	The tracepoints in the 9p code use a lot of enums for the __print_symbolic() function. These enums are shown in the tracepoint format files, and user space tools such as trace-cmd does not have the information to parse it. Add helper macros to export the enums with TRACE_DEFINE_ENUM(). Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Eric Van Hensbergen <ericvh@gmail.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	x86/tlb/trace: Export enums in used by tlb_flush tracepoint	Steven Rostedt (Red Hat)
	Have the enums used in __print_symbolic() by the trace_tlb_flush() tracepoint exported to userpace such that they can be parsed by userspace tools. Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Cc: Dave Hansen <dave@sr71.net> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	tracing/samples: Update the trace-event-sample.h with TRACE_DEFINE_ENUM()	Steven Rostedt (Red Hat)
	Document the use of TRACE_DEFINE_ENUM() by adding enums to the trace-event-sample.h and using this macro to convert them in the format files. Also update the comments and sho the use of __print_symbolic() and __print_flags() as well as adding comments abount __print_array(). Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	tracing: Allow for modules to convert their enums to values	Steven Rostedt (Red Hat)
	Update the infrastructure such that modules that declare TRACE_DEFINE_ENUM() will have those enums converted into their values in the tracepoint print fmt strings. Link: http://lkml.kernel.org/r/87vbhjp74q.fsf@rustcorp.com.au Acked-by: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values	Steven Rostedt (Red Hat)
	Several tracepoints use the helper functions __print_symbolic() or __print_flags() and pass in enums that do the mapping between the binary data stored and the value to print. This works well for reading the ASCII trace files, but when the data is read via userspace tools such as perf and trace-cmd, the conversion of the binary value to a human string format is lost if an enum is used, as userspace does not have access to what the ENUM is. For example, the tracepoint trace_tlb_flush() has: __print_symbolic(REC->reason, { TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" }, { TLB_REMOTE_SHOOTDOWN, "remote shootdown" }, { TLB_LOCAL_SHOOTDOWN, "local shootdown" }, { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" }) Which maps the enum values to the strings they represent. But perf and trace-cmd do no know what value TLB_LOCAL_MM_SHOOTDOWN is, and would not be able to map it. With TRACE_DEFINE_ENUM(), developers can place these in the event header files and ftrace will convert the enums to their values: By adding: TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH); TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN); TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN); TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN); $ cat /sys/kernel/debug/tracing/events/tlb/tlb_flush/format [...] __print_symbolic(REC->reason, { 0, "flush on task switch" }, { 1, "remote shootdown" }, { 2, "local shootdown" }, { 3, "local mm shootdown" }) The above is what userspace expects to see, and tools do not need to be modified to parse them. Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Cc: Guilherme Cox <cox@computer.org> Cc: Tony Luck <tony.luck@gmail.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	tracing: Update trace-event-sample with TRACE_SYSTEM_VAR documentation	Steven Rostedt (Red Hat)
	Add documentation about TRACE_SYSTEM needing to be alpha-numeric or with underscores, and that if it is not, then the use of TRACE_SYSTEM_VAR is required to make something that is. An example of this is shown in samples/trace_events/trace-events-sample.h Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	tracing: Give system name a pointer	Steven Rostedt (Red Hat)
	Normally the compiler will use the same pointer for a string throughout the file. But there's no guarantee of that happening. Later changes will require that all events have the same pointer to the system string. Name the system string and have all events point to it. Testing this, it did not increases the size of the text, except for the notes section, which should not harm the real size any. Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	brcmsmac: Move each system tracepoints to their own header	Steven Rostedt (Red Hat)
	Every tracing file must have its own TRACE_SYSTEM defined. The brcmsmac tracepoint header broke this and added in the middle of the file: #undef TRACE_SYSTEM #define TRACE_SYSTEM brcmsmac #undef TRACE_SYSTEM #define TRACE_SYSTEM brcmsmac_tx #undef TRACE_SYSTEM #define TRACE_SYSTEM brcmsmac_msg Unfortunately, this broke new code in the ftrace infrastructure. Moving each of these TRACE_SYSTEMs into their own trace file with just one TRACE_SYSTEM per file fixes the issue. Link: http://lkml.kernel.org/r/5524D99C.1050902@broadcom.com Acked-by: Arend van Spriel <arend@broadcom.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	iwlwifi: Move each system tracepoints to their own header	Steven Rostedt (Red Hat)
	Every tracing file must have its own TRACE_SYSTEM defined. The iwlwifi tracepoint header broke this and added in the middle of the file: #undef TRACE_SYSTEM #define TRACE_SYSTEM iwlwifi_io #undef TRACE_SYSTEM #define TRACE_SYSTEM iwlwifi_ucode #undef TRACE_SYSTEM #define TRACE_SYSTEM iwlwifi_msg #undef TRACE_SYSTEM #define TRACE_SYSTEM iwlwifi_data #undef TRACE_SYSTEM #define TRACE_SYSTEM iwlwifi Unfortunately, this broke new code in the ftrace infrastructure. Moving each of these TRACE_SYSTEMs into their own trace file with just one TRACE_SYSTEM per file fixes the issue. Link: http://lkml.kernel.org/r/1428479094.2809.3.camel@sipsolutions.net Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08	gpio: split GPIO drivers in submenus	Linus Walleij
	Create Kconfig submenus for memory mapped, I2C, MFD, PCI, SPI and USB GPIO drivers to help navigate the forest of drivers in this subsystem. The I2C, SPI and USB menus get dependencies so we don't have to see them unless we have the required subsystem enabled in the first place. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2015-04-08	gpio: move MFD GPIO drivers under their own comment	Linus Walleij
	Get rid of AC97, MODULbus and other weird subheadings for GPIO drivers. Move all MFD drivers out of I2C etc and in under the MFD comment. This is too weird as it is and makes no sense, if the dependent parent driver is MFD, group these as MFD GPIO drivers. Alphabetize and move this comment group inbetween "I2C" and "PCI" to also have the groups in alphabetic order. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2015-04-08	perf tools: Merge all perf_event_attr print functions	Peter Zijlstra
	Currently there's 3 (that I found) different and incomplete implementations of printing perf_event_attr. This is quite silly. Merge the lot. While this patch does not retain the exact form all printing that I found is debug output and thus it should not be critical. Also, I cannot find a single print_event_desc() caller. Pre: $ perf record -vv -e cycles -- sleep 1 ------------------------------------------------------------ perf_event_attr: type 0 size 104 config 0 sample_period 4000 sample_freq 4000 sample_type 0x107 read_format 0 disabled 1 inherit 1 pinned 0 exclusive 0 exclude_user 0 exclude_kernel 0 exclude_hv 0 exclude_idle 0 mmap 1 comm 1 mmap2 1 comm_exec 1 freq 1 inherit_stat 0 enable_on_exec 1 task 1 watermark 0 precise_ip 0 mmap_data 0 sample_id_all 1 exclude_host 0 exclude_guest 1 excl.callchain_kern 0 excl.callchain_user 0 wakeup_events 0 wakeup_watermark 0 bp_type 0 bp_addr 0 config1 0 bp_len 0 config2 0 branch_sample_type 0 sample_regs_user 0 sample_stack_user 0 sample_regs_intr 0 ------------------------------------------------------------ $ perf evlist -vv cycles: sample_freq=4000, size: 104, sample_type: IP\|TID\|TIME\|PERIOD, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, comm_exec: 1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1 Post: $ ./perf record -vv -e cycles -- sleep 1 ------------------------------------------------------------ perf_event_attr: size 112 { sample_period, sample_freq } 4000 sample_type IP\|TID\|TIME\|PERIOD disabled 1 inherit 1 mmap 1 comm 1 freq 1 enable_on_exec 1 task 1 sample_id_all 1 exclude_guest 1 mmap2 1 comm_exec 1 ------------------------------------------------------------ $ ./perf evlist -vv cycles: size: 112, { sample_period, sample_freq }: 4000, sample_type: IP\|TID\|TIME\|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Ahern <dsahern@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150407091150.644238729@infradead.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08	perf record: Add clockid parameter	Peter Zijlstra
	Teach perf-record about the new perf_event_attr::{use_clockid, clockid} fields. Add a simple parameter to set the clock (if any) to be used for the events to be recorded into the data file. Since we store the entire perf_event_attr in the EVENT_DESC section we also already store the used clockid in the data file. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David Ahern <dsahern@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yunlong Song <yunlong.song@huawei.com> Link: http://lkml.kernel.org/r/20150407154851.GR23123@twins.programming.kicks-ass.net [ Conditionally define CLOCK_BOOTTIME, at least rhel6 doesn't have it - dsahern Ditto for CLOCK_MONOTONIC_RAW, sles11sp2 doesn't have it - yunlong.song ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08	gpio: move BCM Kona Kconfig option	Linus Walleij
	Move the Kconfig option for the Broadcom BCM Kona up to the commin GPIO controllers, as it is currently grouped under MODULbus expanders which it definately is not. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2015-04-08	gpio: arrange SPI Kconfig symbols alphabetically	Linus Walleij
	Rearrange the SPI GPIO expanders in alphabetic order as already indicated by the comment in the file. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2015-04-08	gpio: arrange PCI GPIO controllers alphabetically	Linus Walleij
	Rearrange PCI GPIO controllers in alphabetic order as already indicated by the comment in the file. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2015-04-08	gpio: arrange I2C Kconfig symbols alphabetically	Linus Walleij
	Rearrange the I2C GPIO expanders in alphabetic order as already indicated by the comment in the file. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2015-04-08	perf sched replay: Use replay_repeat to calculate the runavg of cpu usage ↵	Yunlong Song
	instead of the default value 10 Since sched->replay_repeat is set to 10 as default, the sched->run_avg, sched->runavg_cpu_usage, and sched->runavg_parent_cpu_usage all use 10 to calculate their value. However, the replay_repeat can be changed to other value by using -r option, so the calculation above should use replay_repeat to achieve more accurate results instead of the default value 10. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-10-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08	perf sched replay: Support using -f to override perf.data file ownership	Yunlong Song
	Enable to use perf.data when it is not owned by current user or root. Example: $ ls -al perf.data -rw------- 1 Yunlong.Song Yunlong.Song 5321918 Mar 25 15:14 perf.data $ sudo id uid=0(root) gid=0(root) groups=0(root),64(pkcs11) Before this patch: $ sudo perf sched replay -f run measurement overhead: 98 nsecs sleep measurement overhead: 52909 nsecs the run test took 1000015 nsecs the sleep test took 1054253 nsecs File perf.data not owned by current user or root (use -f to override) As shown above, the -f option does not work at all. After this patch: $ sudo perf sched replay -f run measurement overhead: 221 nsecs sleep measurement overhead: 40514 nsecs the run test took 1000003 nsecs the sleep test took 1056098 nsecs nr_run_events: 10 nr_sleep_events: 1562 nr_wakeup_events: 5 task 0 ( :1: 1), nr_events: 1 task 1 ( :2: 2), nr_events: 1 task 2 ( :3: 3), nr_events: 1 ... ... task 1549 ( :163132: 163132), nr_events: 1 task 1550 ( :163540: 163540), nr_events: 1 task 1551 ( <unknown>: 0), nr_events: 10 ------------------------------------------------------------ #1 : 50.198, ravg: 50.20, cpu: 2335.18 / 2335.18 #2 : 219.099, ravg: 67.09, cpu: 2835.11 / 2385.17 #3 : 238.626, ravg: 84.24, cpu: 3278.26 / 2474.48 #4 : 200.364, ravg: 95.85, cpu: 2977.41 / 2524.77 #5 : 176.882, ravg: 103.96, cpu: 2801.35 / 2552.43 #6 : 191.093, ravg: 112.67, cpu: 2813.70 / 2578.56 #7 : 189.448, ravg: 120.35, cpu: 2809.21 / 2601.62 #8 : 200.637, ravg: 128.38, cpu: 2849.91 / 2626.45 #9 : 248.338, ravg: 140.37, cpu: 4380.61 / 2801.87 #10 : 511.139, ravg: 177.45, cpu: 3077.73 / 2829.45 As shown above, the -f option really works now. Besides for replay, -f option can also work for latency and map. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-9-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08	perf sched replay: Fix the EMFILE error caused by the limitation of the ↵	Yunlong Song
	maximum open files The soft maximum number of open files for a calling process is 1024, which is defined as INR_OPEN_CUR in include/uapi/linux/fs.h, and the hard maximum number of open files for a calling process is 4096, which is defined as INR_OPEN_MAX in include/uapi/linux/fs.h. Both INR_OPEN_CUR and INR_OPEN_MAX are used to limit the value of RLIMIT_NOFILE in include/asm-generic/resource.h. And the soft maximum number finally decides the limitation of the maximum files which are allowed to be opened. That is to say a process can use at most 1024 file descriptors for its o pened files, or an EMFILE error will happen. This error can be fixed by increasing the soft maximum number, under the constraint that the soft maximum number can not exceed the hard maximum number, or both soft and hard maximum number should be increased simultaneously with privilege. For perf sched replay, it uses sys_perf_event_open to create the file descriptor for each of the tasks in order to handle information of perf events. That is to say each task needs a unique file descriptor. In x86_64, there may be over 1024 or 4096 tasks correspoinding to the record in perf.data, which causes that no enough file descriptors can be used. As a result, EMFILE error happens and stops the replay process. To solve this problem, we adaptively increase the soft and hard maximum number of open files with a '-f' option. Example: Test environment: x86_64 with 160 cores $ cat /proc/sys/kernel/pid_max 163840 $ cat /proc/sys/fs/file-max 6815744 $ ulimit -Sn 1024 $ ulimit -Hn 4096 Before this patch: $ perf sched replay ... task 1549 ( :163132: 163132), nr_events: 1 task 1550 ( :163540: 163540), nr_events: 1 task 1551 ( <unknown>: 0), nr_events: 10 Error: sys_perf_event_open() syscall returned with -1 (Too many open files) After this patch: $ perf sched replay ... task 1549 ( :163132: 163132), nr_events: 1 task 1550 ( :163540: 163540), nr_events: 1 task 1551 ( <unknown>: 0), nr_events: 10 Error: sys_perf_event_open() syscall returned with -1 (Too many open files) Have a try with -f option $ perf sched replay -f ... task 1549 ( :163132: 163132), nr_events: 1 task 1550 ( :163540: 163540), nr_events: 1 task 1551 ( <unknown>: 0), nr_events: 10 ------------------------------------------------------------ #1 : 54.401, ravg: 54.40, cpu: 3285.21 / 3285.21 #2 : 199.548, ravg: 68.92, cpu: 4999.65 / 3456.66 #3 : 170.483, ravg: 79.07, cpu: 1349.94 / 3245.99 #4 : 192.034, ravg: 90.37, cpu: 1322.88 / 3053.67 #5 : 182.929, ravg: 99.62, cpu: 1406.51 / 2888.96 #6 : 152.974, ravg: 104.96, cpu: 1167.54 / 2716.82 #7 : 155.579, ravg: 110.02, cpu: 2992.53 / 2744.39 #8 : 130.557, ravg: 112.08, cpu: 1126.43 / 2582.59 #9 : 138.520, ravg: 114.72, cpu: 1253.22 / 2449.65 #10 : 134.328, ravg: 116.68, cpu: 1587.95 / 2363.48 Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-8-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08	perf sched replay: Handle the dead halt of sem_wait when create_tasks() ↵	Yunlong Song
	fails for any task Since there is sem_wait for each task in the wait_for_tasks(), e.g. sem_wait(&task->work_done_sem). The sem_wait can continue only when work_done_sem is greater than 0, or it will be blocked. For perf sched replay, one task may sem_post the work_done_sem of another task, which causes the work_done_sem of that task processed in a reasonable sequence, e.g. sem_post, sem_wait, sem_wait, sem_post... This sequence simulates the sched process of the running tasks at the time when perf sched record runs. As a result, all the tasks are required and their threads must be successfully created. If any one (task A) of the tasks fails to create its thread, then another task (task B), whose work_done_sem needs sem_post from that failed task A, may likely block itself due to seg_wait. And this is a dead halt, since task B's thread_func cannot continue at all. To solve this problem, perf sched replay should exit once any task fails to create its thread. Example: Test environment: x86_64 with 160 cores Before this patch: $ perf sched replay ... Error: sys_perf_event_open() syscall returned with -1 (Too many open files) ------------------------------------------------------------ <- dead halt After this patch: $ perf sched replay ... task 1551 ( <unknown>: 0), nr_events: 10 Error: sys_perf_event_open() syscall returned with -1 (Too many open files) $ As shown above, perf sched replay finishes the process after printing an error message and does not block itself. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-7-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08	perf sched replay: Fix the segmentation fault problem caused by pr_err in ↵	Yunlong Song
	threads The pr_err in self_open_counters() prints error message to stderr. Unlike stdout, stderr uses memory buffer on the stack of each calling process. The pr_err in self_open_counters() works in a thread called thread_func created in function create_tasks, which concurrently creates sched->nr_tasks threads. If the error happens and pr_err prints the error message in each of these threads, the stack size of the perf process (default is 8192 kbytes) will quickly run out and the segmentation fault will happen then. To solve this problem, pr_err with self_open_counters() should be moved from newly created threads to the old main thread of the perf process. Then the pr_err can work in a stable situation without the strange segmentation fault problem. Example: Test environment: x86_64 with 160 cores Before this patch: $ perf sched replay ... task 1549 ( :163132: 163132), nr_events: 1 task 1550 ( :163540: 163540), nr_events: 1 task 1551 ( <unknown>: 0), nr_events: 10 Segmentation fault After this patch: $ perf sched replay ... task 1549 ( :163132: 163132), nr_events: 1 task 1550 ( :163540: 163540), nr_events: 1 task 1551 ( <unknown>: 0), nr_events: 10 ... As shown above, the result continues without any segmentation fault. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-6-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08	perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to ↵	Yunlong Song
	the different pid_max configurations Although the memory of pid_to_task can be allocated via calloc according to the value of /proc/sys/kernel/pid_max, it cannot handle the case when pid_max is changed after 'perf sched record' has created its perf.data. If the new pid_max configured in 'perf sched replay' is smaller than the old pid_max configured in 'perf sched record', then it will cause the assertion failure problem. To solve this problem, we realloc the memory of pid_to_task stepwise once the passed-in pid parameter in register_pid is larger than the current pid_max. Example: Test environment: x86_64 with 160 cores $ cat /proc/sys/kernel/pid_max 163840 $ perf sched record ls $ echo 5000 > /proc/sys/kernel/pid_max $ cat /proc/sys/kernel/pid_max 5000 Before this patch: $ perf sched replay run measurement overhead: 221 nsecs sleep measurement overhead: 55356 nsecs the run test took 1000011 nsecs the sleep test took 1060940 nsecs perf: builtin-sched.c:337: register_pid: Assertion `!(pid >= (unsigned long)pid_max)' failed. Aborted After this patch: $ perf sched replay run measurement overhead: 221 nsecs sleep measurement overhead: 55611 nsecs the run test took 1000026 nsecs the sleep test took 1060486 nsecs nr_run_events: 10 nr_sleep_events: 1562 nr_wakeup_events: 5 task 0 ( :1: 1), nr_events: 1 task 1 ( :2: 2), nr_events: 1 task 2 ( :3: 3), nr_events: 1 task 3 ( :5: 5), nr_events: 1 ... Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-5-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08	perf sched replay: Alloc the memory of pid_to_task dynamically to adapt to ↵	Yunlong Song
	the unexpected change of pid_max The current memory allocation of struct task_desc pid_to_task[MAX_PID] is in a permanent and preset way, and it has two problems: Problem 1: If the pid_max, which is the max number of pids in the system, is much smaller than MAX_PID (10241000), then it causes a waste of stack memory. This may happen in the case where the number of cpu cores is much smaller than 1000. Problem 2: If the pid_max is changed from the default value to a value larger than MAX_PID, then it will cause assertion failure problem. The maximum value of pid_max can be set to pid_max_max (see pidmap_init defined in kernel/pid.c), which equals to PID_MAX_LIMIT. In x86_64, PID_MAX_LIMIT is 410241024 (defined in include/linux/threads.h). This value is much larger than MAX_PID, and will take up 32768 Kbytes (4102410248/1024) for memory allocation of pid_to_task, which is much larger than the default 8192 Kbytes of the stack size of calling process. Due to these two problems, we use calloc to allocate the memory of pid_to_task dynamically. Example: Test environment: x86_64 with 160 cores $ cat /proc/sys/kernel/pid_max 163840 $ echo 1025000 > /proc/sys/kernel/pid_max $ cat /proc/sys/kernel/pid_max 1025000 Run some applications until the pid of some process is greater than the value of MAX_PID (10241000). Before this patch: $ perf sched replay run measurement overhead: 221 nsecs sleep measurement overhead: 55480 nsecs the run test took 1000008 nsecs the sleep test took 1063151 nsecs perf: builtin-sched.c:330: register_pid: Assertion `!(pid >= 1024000)' failed. Aborted After this patch: $ perf sched replay run measurement overhead: 221 nsecs sleep measurement overhead: 55435 nsecs the run test took 1000004 nsecs the sleep test took 1059312 nsecs nr_run_events: 10 nr_sleep_events: 1562 nr_wakeup_events: 5 task 0 ( :1: 1), nr_events: 1 task 1 ( :2: 2), nr_events: 1 task 2 ( :3: 3), nr_events: 1 task 3 ( :5: 5), nr_events: 1 ... Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-4-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>