From 27ec26ecdca956a32e89a22a148f1c16b5c0becf Mon Sep 17 00:00:00 2001 From: Luiz Capitulino Date: Fri, 12 Dec 2014 16:55:18 -0800 Subject: hugetlb: fix hugepages= entry in kernel-parameters.txt The hugepages= entry in kernel-parameters.txt states that 1GB pages can only be allocated at boot time and not freed afterwards. This is not true since commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime"), at least for x86_64. Instead of adding arch-specifc observations to the hugepages= entry, this commit just drops the out of date information. Further information about arch-specific support and available features can be obtained in the hugetlb documentation. Signed-off-by: Luiz Capitulino Cc: Andi Kleen Acked-by: David Rientjes Cc: Rik van Riel Cc: Yasuaki Ishimatsu Cc: Yinghai Lu Cc: Davidlohr Bueso Acked-by: Naoya Horiguchi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index eacb2e0397ae..24539d1c7d25 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1228,9 +1228,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted. multiple times interleaved with hugepages= to reserve huge pages of different sizes. Valid pages sizes on x86-64 are 2M (when the CPU supports "pse") and 1G - (when the CPU supports the "pdpe1gb" cpuinfo flag) - Note that 1GB pages can only be allocated at boot time - using hugepages= and not freed afterwards. + (when the CPU supports the "pdpe1gb" cpuinfo flag). hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) terminal devices. Valid values: 0..8 -- cgit From 031bc5743f158b2d5498294f489e534a31251626 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Fri, 12 Dec 2014 16:55:52 -0800 Subject: mm/debug-pagealloc: make debug-pagealloc boottime configurable Now, we have prepared to avoid using debug-pagealloc in boottime. So introduce new kernel-parameter to disable debug-pagealloc in boottime, and makes related functions to be disabled in this case. Only non-intuitive part is change of guard page functions. Because guard page is effective only if debug-pagealloc is enabled, turning off according to debug-pagealloc is reasonable thing to do. Signed-off-by: Joonsoo Kim Cc: Mel Gorman Cc: Johannes Weiner Cc: Minchan Kim Cc: Dave Hansen Cc: Michal Nazarewicz Cc: Jungsoo Son Cc: Ingo Molnar Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 24539d1c7d25..6f067954675b 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -829,6 +829,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. CONFIG_DEBUG_PAGEALLOC, hence this option will not help tracking down these problems. + debug_pagealloc= + [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this + parameter enables the feature at boot time. In + default, it is disabled. We can avoid allocating huge + chunk of memory for debug pagealloc if we don't enable + it at boot time and the system will work mostly same + with the kernel built without CONFIG_DEBUG_PAGEALLOC. + on: enable the feature + debugpat [X86] Enable PAT debugging decnet.addr= [HW,NET] -- cgit From 48c96a3685795e52903e60c7ee115e5e22e7d640 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Fri, 12 Dec 2014 16:56:01 -0800 Subject: mm/page_owner: keep track of page owners This is the page owner tracking code which is introduced so far ago. It is resident on Andrew's tree, though, nobody tried to upstream so it remain as is. Our company uses this feature actively to debug memory leak or to find a memory hogger so I decide to upstream this feature. This functionality help us to know who allocates the page. When allocating a page, we store some information about allocation in extra memory. Later, if we need to know status of all pages, we can get and analyze it from this stored information. In previous version of this feature, extra memory is statically defined in struct page, but, in this version, extra memory is allocated outside of struct page. It enables us to turn on/off this feature at boottime without considerable memory waste. Although we already have tracepoint for tracing page allocation/free, using it to analyze page owner is rather complex. We need to enlarge the trace buffer for preventing overlapping until userspace program launched. And, launched program continually dump out the trace buffer for later analysis and it would change system behaviour with more possibility rather than just keeping it in memory, so bad for debug. Moreover, we can use page_owner feature further for various purposes. For example, we can use it for fragmentation statistics implemented in this patch. And, I also plan to implement some CMA failure debugging feature using this interface. I'd like to give the credit for all developers contributed this feature, but, it's not easy because I don't know exact history. Sorry about that. Below is people who has "Signed-off-by" in the patches in Andrew's tree. Contributor: Alexander Nyberg Mel Gorman Dave Hansen Minchan Kim Michal Nazarewicz Andrew Morton Jungsoo Son Signed-off-by: Joonsoo Kim Cc: Mel Gorman Cc: Johannes Weiner Cc: Minchan Kim Cc: Dave Hansen Cc: Michal Nazarewicz Cc: Jungsoo Son Cc: Ingo Molnar Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 6f067954675b..68153642c44e 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2513,6 +2513,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. OSS [HW,OSS] See Documentation/sound/oss/oss-parameters.txt + page_owner= [KNL] Boot-time page_owner enabling option. + Storage of the information about who allocated + each page is disabled in default. With this switch, + we can turn it on. + on: enable the feature + panic= [KNL] Kernel behaviour on panic: delay timeout > 0: seconds before rebooting timeout = 0: wait forever -- cgit From 16a7ade8af3b4ad30aec880177ff291bb5ea86d1 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Fri, 12 Dec 2014 16:56:07 -0800 Subject: Documentation: add new page_owner document page owner is for the tracking about who allocated each page. This document explains what is the page owner feature and what is the merit of it. And, simple HOW-TO is also explained. See the document for detailed information. Signed-off-by: Joonsoo Kim Cc: Mel Gorman Cc: Johannes Weiner Cc: Minchan Kim Cc: Dave Hansen Cc: Michal Nazarewicz Cc: Jungsoo Son Cc: Ingo Molnar Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/page_owner.txt | 81 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 81 insertions(+) create mode 100644 Documentation/vm/page_owner.txt (limited to 'Documentation') diff --git a/Documentation/vm/page_owner.txt b/Documentation/vm/page_owner.txt new file mode 100644 index 000000000000..8f3ce9b3aa11 --- /dev/null +++ b/Documentation/vm/page_owner.txt @@ -0,0 +1,81 @@ +page owner: Tracking about who allocated each page +----------------------------------------------------------- + +* Introduction + +page owner is for the tracking about who allocated each page. +It can be used to debug memory leak or to find a memory hogger. +When allocation happens, information about allocation such as call stack +and order of pages is stored into certain storage for each page. +When we need to know about status of all pages, we can get and analyze +this information. + +Although we already have tracepoint for tracing page allocation/free, +using it for analyzing who allocate each page is rather complex. We need +to enlarge the trace buffer for preventing overlapping until userspace +program launched. And, launched program continually dump out the trace +buffer for later analysis and it would change system behviour with more +possibility rather than just keeping it in memory, so bad for debugging. + +page owner can also be used for various purposes. For example, accurate +fragmentation statistics can be obtained through gfp flag information of +each page. It is already implemented and activated if page owner is +enabled. Other usages are more than welcome. + +page owner is disabled in default. So, if you'd like to use it, you need +to add "page_owner=on" into your boot cmdline. If the kernel is built +with page owner and page owner is disabled in runtime due to no enabling +boot option, runtime overhead is marginal. If disabled in runtime, it +doesn't require memory to store owner information, so there is no runtime +memory overhead. And, page owner inserts just two unlikely branches into +the page allocator hotpath and if it returns false then allocation is +done like as the kernel without page owner. These two unlikely branches +would not affect to allocation performance. Following is the kernel's +code size change due to this facility. + +- Without page owner + text data bss dec hex filename + 40662 1493 644 42799 a72f mm/page_alloc.o + +- With page owner + text data bss dec hex filename + 40892 1493 644 43029 a815 mm/page_alloc.o + 1427 24 8 1459 5b3 mm/page_ext.o + 2722 50 0 2772 ad4 mm/page_owner.o + +Although, roughly, 4 KB code is added in total, page_alloc.o increase by +230 bytes and only half of it is in hotpath. Building the kernel with +page owner and turning it on if needed would be great option to debug +kernel memory problem. + +There is one notice that is caused by implementation detail. page owner +stores information into the memory from struct page extension. This memory +is initialized some time later than that page allocator starts in sparse +memory system, so, until initialization, many pages can be allocated and +they would have no owner information. To fix it up, these early allocated +pages are investigated and marked as allocated in initialization phase. +Although it doesn't mean that they have the right owner information, +at least, we can tell whether the page is allocated or not, +more accurately. On 2GB memory x86-64 VM box, 13343 early allocated pages +are catched and marked, although they are mostly allocated from struct +page extension feature. Anyway, after that, no page is left in +un-tracking state. + +* Usage + +1) Build user-space helper + cd tools/vm + make page_owner_sort + +2) Enable page owner + Add "page_owner=on" to boot cmdline. + +3) Do the job what you want to debug + +4) Analyze information from page owner + cat /sys/kernel/debug/page_owner > page_owner_full.txt + grep -v ^PFN page_owner_full.txt > page_owner.txt + ./page_owner_sort page_owner.txt sorted_page_owner.txt + + See the result about who allocated each page + in the sorted_page_owner.txt. -- cgit From 0050ee059f7fc86b1df2527aaa14ed5dc72f9973 Mon Sep 17 00:00:00 2001 From: Manfred Spraul Date: Fri, 12 Dec 2014 16:58:17 -0800 Subject: ipc/msg: increase MSGMNI, remove scaling SysV can be abused to allocate locked kernel memory. For most systems, a small limit doesn't make sense, see the discussion with regards to SHMMAX. Therefore: increase MSGMNI to the maximum supported. And: If we ignore the risk of locking too much memory, then an automatic scaling of MSGMNI doesn't make sense. Therefore the logic can be removed. The code preserves auto_msgmni to avoid breaking any user space applications that expect that the value exists. Notes: 1) If an administrator must limit the memory allocations, then he can set MSGMNI as necessary. Or he can disable sysv entirely (as e.g. done by Android). 2) MSGMAX and MSGMNB are intentionally not increased, as these values are used to control latency vs. throughput: If MSGMNB is large, then msgsnd() just returns and more messages can be queued before a task switch to a task that calls msgrcv() is forced. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul Cc: Davidlohr Bueso Cc: Rafael Aquini Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/kernel.txt | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index b5d0c8501a18..75511efefc64 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -116,10 +116,12 @@ set during run time. auto_msgmni: -Enables/Disables automatic recomputing of msgmni upon memory add/remove -or upon ipc namespace creation/removal (see the msgmni description -above). Echoing "1" into this file enables msgmni automatic recomputing. -Echoing "0" turns it off. auto_msgmni default value is 1. +This variable has no effect and may be removed in future kernel +releases. Reading it always returns 0. +Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni +upon memory add/remove or upon ipc namespace creation/removal. +Echoing "1" into this file enabled msgmni automatic recomputing. +Echoing "0" turned it off. auto_msgmni default value was 1. ============================================================== -- cgit From 7d94a82e45157318568d839902f46b2085de9e90 Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Fri, 12 Dec 2014 16:58:45 -0800 Subject: percpu: update local_ops.txt to reflect this_cpu operations Update the documentation to reflect changes due to the availability of this_cpu operations. Signed-off-by: Christoph Lameter Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/local_ops.txt | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/local_ops.txt b/Documentation/local_ops.txt index 300da4bdfdbd..407576a23317 100644 --- a/Documentation/local_ops.txt +++ b/Documentation/local_ops.txt @@ -8,6 +8,11 @@ to implement them for any given architecture and shows how they can be used properly. It also stresses on the precautions that must be taken when reading those local variables across CPUs when the order of memory writes matters. +Note that local_t based operations are not recommended for general kernel use. +Please use the this_cpu operations instead unless there is really a special purpose. +Most uses of local_t in the kernel have been replaced by this_cpu operations. +this_cpu operations combine the relocation with the local_t like semantics in +a single instruction and yield more compact and faster executing code. * Purpose of local atomic operations @@ -87,10 +92,10 @@ the per cpu variable. For instance : local_inc(&get_cpu_var(counters)); put_cpu_var(counters); -If you are already in a preemption-safe context, you can directly use -__get_cpu_var() instead. +If you are already in a preemption-safe context, you can use +this_cpu_ptr() instead. - local_inc(&__get_cpu_var(counters)); + local_inc(this_cpu_ptr(&counters)); @@ -134,7 +139,7 @@ static void test_each(void *info) { /* Increment the counter from a non preemptible context */ printk("Increment on cpu %d\n", smp_processor_id()); - local_inc(&__get_cpu_var(counters)); + local_inc(this_cpu_ptr(&counters)); /* This is what incrementing the variable would look like within a * preemptible context (it disables preemption) : -- cgit From 29d293b6007b91a4463f05bc8d0b26e0e65c5816 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 12 Dec 2014 16:58:50 -0800 Subject: cgroups: Documentation: fix trivial typos and wrong paragraph numberings Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/cpusets.txt | 6 +++--- Documentation/cgroups/memory.txt | 8 ++++---- 2 files changed, 7 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt index 3c94ff3f9693..f2235a162529 100644 --- a/Documentation/cgroups/cpusets.txt +++ b/Documentation/cgroups/cpusets.txt @@ -445,7 +445,7 @@ across partially overlapping sets of CPUs would risk unstable dynamics that would be beyond our understanding. So if each of two partially overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we form a single sched domain that is a superset of both. We won't move -a task to a CPU outside it cpuset, but the scheduler load balancing +a task to a CPU outside its cpuset, but the scheduler load balancing code might waste some compute cycles considering that possibility. This mismatch is why there is not a simple one-to-one relation @@ -552,8 +552,8 @@ otherwise initial value -1 that indicates the cpuset has no request. 1 : search siblings (hyperthreads in a core). 2 : search cores in a package. 3 : search cpus in a node [= system wide on non-NUMA system] - ( 4 : search nodes in a chunk of node [on NUMA system] ) - ( 5 : search system wide [on NUMA system] ) + 4 : search nodes in a chunk of node [on NUMA system] + 5 : search system wide [on NUMA system] The system default is architecture dependent. The system default can be changed using the relax_domain_level= boot parameter. diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 46b2b5080317..a22df3ad35ff 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -326,7 +326,7 @@ per cgroup, instead of globally. * tcp memory pressure: sockets memory pressure for the tcp protocol. -2.7.3 Common use cases +2.7.2 Common use cases Because the "kmem" counter is fed to the main user counter, kernel memory can never be limited completely independently of user memory. Say "U" is the user @@ -354,19 +354,19 @@ set: 3. User Interface -0. Configuration +3.0. Configuration a. Enable CONFIG_CGROUPS b. Enable CONFIG_MEMCG c. Enable CONFIG_MEMCG_SWAP (to use swap extension) d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) -1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) +3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) # mount -t tmpfs none /sys/fs/cgroup # mkdir /sys/fs/cgroup/memory # mount -t cgroup none /sys/fs/cgroup/memory -o memory -2. Make the new group and move bash into it +3.2. Make the new group and move bash into it # mkdir /sys/fs/cgroup/memory/0 # echo $$ > /sys/fs/cgroup/memory/0/tasks -- cgit