Age | Commit message (Collapse) | Author |
|
Trace events have been added around the initcall functions defined in
init/main.c. But console and security have their own initcalls. This adds
the trace events associated for those initcall functions.
Link: http://lkml.kernel.org/r/1521765208.19745.2.camel@polymtl.ca
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Abderrahmane Benbachir <abderrahmane.benbachir@polymtl.ca>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Being able to trace the start and stop of initcalls is useful to see where
the timings are an issue. There is already an "initcall_debug" parameter,
but that can cause a large overhead itself, as the printing of the
information may take longer than the initcall functions.
Adding in a start and finish trace event around the initcall functions, as
well as a trace event that records the level of the initcalls, one can get a
much finer measurement of the times and interactions of the initcalls
themselves, as trace events are much lighter than printk()s.
Suggested-by: Abderrahmane Benbachir <abderrahmane.benbachir@polymtl.ca>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
A boot up test function update_pred_fn() dereferences filter->prog without
the proper rcu annotation.
To do this, we must also take the event_mutex first. Normally, this isn't
needed because this test function can not race with other use cases that
touch the event filters (it is disabled if any events are enabled).
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: 80765597bc587 ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
ftrace_function_set_filter() referenences filter->prog without annotation
and sparse complains about it. It needs a rcu_dereference_protected()
wrapper.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: 80765597bc587 ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
In commit 932066a15335 ("tracing: Default to using trace_global_clock if
sched_clock is unstable"), the logic for deciding to override the
default clock if unstable was reversed from the earlier posting. I was
trying to reduce the width of the message by using an early return
rather than a if-block, but reverted back to using the if-block and
accidentally left the predicate inverted.
Link: http://lkml.kernel.org/r/20180404212450.26646-1-chris@chris-wilson.co.uk
Fixes: 932066a15335 ("tracing: Default to using trace_global_clock if sched_clock is unstable")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Function tracing can trace in NMIs and such. If the TSC is determined
to be unstable, the tracing clock will switch to the global clock on
boot up, unless "trace_clock" is specified on the kernel command line.
The global clock disables interrupts to access sched_clock_cpu(), and in
doing so can be done within lockdep internals (because of function
tracing and NMIs). This can trigger false lockdep splats.
The trace_clock_global() is special, best not to trace the irq logic
within it.
Link: http://lkml.kernel.org/r/20180404145015.77bde42d@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
As si_mem_available() can say there is enough memory even though the memory
available is not useable by the ring buffer, it is best to not kill innocent
applications because the ring buffer is taking up all the memory while it is
trying to allocate a great deal of memory.
If the allocator is user space (because kernel threads can also increase the
size of the kernel ring buffer on boot up), then after si_mem_available()
says there is enough memory, set the OOM killer to kill the current task if
an OOM triggers during the allocation.
Link: http://lkml.kernel.org/r/20180404062340.GD6312@dhcp22.suse.cz
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
The ring buffer is made up of a link list of pages. When making the ring
buffer bigger, it will allocate all the pages it needs before adding to the
ring buffer, and if it fails, it frees them and returns an error. This makes
increasing the ring buffer size an all or nothing action. When this was
first created, the pages were allocated with "NORETRY". This was to not
cause any Out-Of-Memory (OOM) actions from allocating the ring buffer. But
NORETRY was too strict, as the ring buffer would fail to expand even when
there's memory available, but was taken up in the page cache.
Commit 848618857d253 ("tracing/ring_buffer: Try harder to allocate") changed
the allocating from NORETRY to RETRY_MAYFAIL. The RETRY_MAYFAIL would
allocate from the page cache, but if there was no memory available, it would
simple fail the allocation and not trigger an OOM.
This worked fine, but had one problem. As the ring buffer would allocate one
page at a time, it could take up all memory in the system before it failed
to allocate and free that memory. If the allocation is happening and the
ring buffer allocates all memory and then tries to take more than available,
its allocation will not trigger an OOM, but if there's any allocation that
happens someplace else, that could trigger an OOM, even though once the ring
buffer's allocation fails, it would free up all the previous memory it tried
to allocate, and allow other memory allocations to succeed.
Commit d02bd27bd33dd ("mm/page_alloc.c: calculate 'available' memory in a
separate function") separated out si_mem_availble() as a separate function
that could be used to see how much memory is available in the system. Using
this function to make sure that the ring buffer could be allocated before it
tries to allocate pages we can avoid allocating all memory in the system and
making it vulnerable to OOMs if other allocations are taking place.
Link: http://lkml.kernel.org/r/1522320104-6573-1-git-send-email-zhaoyang.huang@spreadtrum.com
CC: stable@vger.kernel.org
Cc: linux-mm@kvack.org
Fixes: 848618857d253 ("tracing/ring_buffer: Try harder to allocate")
Requires: d02bd27bd33dd ("mm/page_alloc.c: calculate 'available' memory in a separate function")
Reported-by: Zhaoyang Huang <huangzhaoyang@gmail.com>
Tested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Running a test on a x86_32 kernel I triggered a bug that an interrupt
disable/enable isn't being catched by lockdep. At least knowing where the
last one was found would be helpful, but the warnings that are produced do
not show this information. Even without debugging lockdep, having the WARN()
display the last place hard and soft irqs were enabled or disabled is
valuable.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Commit 841a915d20c7b2 ("printf: Do not have bprintf dereference pointers")
would preprocess various pointers that are dereferenced in the bprintf()
because the recording and printing are done at two different times. Some
pointers stayed dereferenced in the ring buffer because user space could
handle them (namely "%pS" and friends). Pointers that are not dereferenced
should not be processed immediately but instead just saved directly.
Cc: stable@vger.kernel.org
Fixes: 841a915d20c7b2 ("printf: Do not have bprintf dereference pointers")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Smatch complains that idx can be used uninitialized when we check if
(idx < 0). It has to be the first iteration through the loop and the
HIST_FIELD_FL_STACKTRACE bit has to be clear and the HIST_FIELD_FL_VAR
bit has to be set to reach the bug.
Link: http://lkml.kernel.org/r/20180328114815.GC29050@mwanda
Fixes: 30350d65ac56 ("tracing: Add variable support to hist triggers")
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
The strncpy() currently being used for variable string fields can
result in a lack of termination if the string length is equal to the
field size. Use the safer strscpy() instead, which will guarantee
termination.
Link: http://lkml.kernel.org/r/fb97c1e518fb358c12a4057d7445ba2c46956cd7.1522256721.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Actions also need to be considered when checking for matching triggers
- triggers differing only by action should be allowed, but currently
aren't because the matching check ignores the action and erroneously
returns -EEXIST.
Add and call an actions_match() function to address that.
Here's an example using onmatch() actions. The first -EEXIST shouldn't
occur because the onmatch() is different in the second wakeup_latency()
param. The second -EEXIST shouldn't occur because it's a different
action (in this case, it doesn't have an action, so shouldn't be seen
as being the same and therefore rejected).
In the after case, both are correctly accepted (and trying to add one of
them again returns -EEXIST as it should).
before:
# echo 'wakeup_latency u64 lat; pid_t pid' >> /sys/kernel/debug/tracing/synthetic_events
# echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0 if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# echo 'hist:keys=next_pid:onmatch(sched.sched_wakeup).wakeup_latency(sched.sched_switch.$wakeup_lat,next_pid) if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# echo 'hist:keys=next_pid:onmatch(sched.sched_wakeup).wakeup_latency(sched.sched_switch.$wakeup_lat,prev_pid) if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
-su: echo: write error: File exists
# echo 'hist:keys=next_pid if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
-su: echo: write error: File exists
after:
# echo 'wakeup_latency u64 lat; pid_t pid' >> /sys/kernel/debug/tracing/synthetic_events
# echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0 if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# echo 'hist:keys=next_pid:onmatch(sched.sched_wakeup).wakeup_latency(sched.sched_switch.$wakeup_lat,next_pid) if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# echo 'hist:keys=next_pid:onmatch(sched.sched_wakeup).wakeup_latency(sched.sched_switch.$wakeup_lat,prev_pid) if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# echo 'hist:keys=next_pid if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
Link: http://lkml.kernel.org/r/a7fd668b87ec10736c8f016ac4279c8480d50c2b.1522256721.git.tom.zanussi@linux.intel.com
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Variable references should never have flags appended when displayed -
prevent that from happening.
Before:
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0.usecs:...
After:
hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:...
Link: http://lkml.kernel.org/r/913318a5610ef6b24af2522575f671fa6ee19b6b.1522256721.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
When displaying hist triggers, variable references that have the
timestamp field flag set are erroneously displayed as common_timestamp
rather than the variable reference. Additionally, timestamp
expressions are displayed in the same way. Fix this by forcing the
timestamp flag handling to follow variable reference and expression
handling.
Before:
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs:...
After:
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0.usecs:...
Link: http://lkml.kernel.org/r/92746b06be67499c2a6217bd55395b350ad18fad.1522256721.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Avoid a VLA by using a real constant expression instead of a variable.
The compiler should be able to optimize the original code and avoid using
an actual VLA. Anyway this change is useful because it will avoid a false
positive with -Wvla, it might also help the compiler generating better
code.
Link: http://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Link: http://lkml.kernel.org/r/1522399988-8815-1-git-send-email-s.mesoraca16@gmail.com
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Mention the alternative of adding trace_clock=global to the kernel
command line when we detect that we've used an unstable clock across a
suspend/resume cycle.
Link: http://lkml.kernel.org/r/20180330150132.16903-2-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Across suspend, we may see a very large drift in timestamps if the sched
clock is unstable, prompting the global trace's ringbuffer code to warn
and suggest switching to the global clock. Preempt this request by
detecting when the sched clock is unstable (determined during
late_initcall) and automatically switching the default clock over to
trace_global_clock.
This should prevent requiring user interaction to resolve warnings such
as:
Delta way too big! 18446743856563626466 ts=18446744054496180323 write stamp = 197932553857
If you just came from a suspend/resume,
please switch to the trace global clock:
echo global > /sys/kernel/debug/tracing/trace_clock
Link: http://lkml.kernel.org/r/20180330150132.16903-1-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
In preparation for supporting AUX area sampling buffers,
auxtrace_queues__add_buffer() needs to be more generic. To that end, move
CPU filtering into it.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1520327598-1317-8-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
- Show group details on the title line in the annotate browser
and 'perf annotate --stdio2' output, so that the per-event
columns can have headers (Arnaldo Carvalho de Melo)
- Fixup vertical line separating metrics from instructions and
cleaning unused lines at the bottom, both in the annotate TUI
browser (Arnaldo Carvalho de Melo)
- Remove duplicated 'samples' in lost samples warning in
'perf report' (Arnaldo Carvalho de Melo)
- Synchronize i915_drm.h, silencing the perf build process,
automagically adding support for the new DRM_I915_QUERY
ioctl (Arnaldo Carvalho de Melo)
- Make auxtrace_queues__add_buffer() allocate struct buffer,
from a patchkit already applied (Adrian Hunter)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Commit 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
tried to spread the interrupts accross all possible CPUs to make sure that
in case of phsyical hotplug (e.g. virtualization) the CPUs which get
plugged in after the device was initialized are targeted by a hardware
queue and the corresponding interrupt.
This has a downside in cases where the ACPI tables claim that there are
more possible CPUs than present CPUs and the number of interrupts to spread
out is smaller than the number of possible CPUs. These bogus ACPI tables
are unfortunately not uncommon.
In such a case the vector spreading algorithm assigns interrupts to CPUs
which can never be utilized and as a consequence these interrupts are
unused instead of being mapped to present CPUs. As a result the performance
of the device is suboptimal.
To fix this spread the interrupt vectors in two stages:
1) Spread as many interrupts as possible among the present CPUs
2) Spread the remaining vectors among non present CPUs
On a 8 core system, where CPU 0-3 are present and CPU 4-7 are not present,
for a device with 4 queues the resulting interrupt affinity is:
1) Before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0
irq 40, cpu list 1
irq 41, cpu list 2
irq 42, cpu list 3
2) With 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0-2
irq 40, cpu list 3-4,6
irq 41, cpu list 5
irq 42, cpu list 7
3) With the refined vector spread applied:
irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7
On a 8 core system, where all CPUs are present the resulting interrupt
affinity for the 4 queues is:
irq 39, cpu list 0,1
irq 40, cpu list 2,3
irq 41, cpu list 4,5
irq 42, cpu list 6,7
This is independent of the number of CPUs which are online at the point of
initialization because in such a system the offline CPUs can be easily
onlined afterwards, while in non-present CPUs need to be plugged physically
or virtually which requires external interaction.
The downside of this approach is that in case of physical hotplug the
interrupt vector spreading might be suboptimal when CPUs 4-7 are physically
plugged. Suboptimal from a NUMA point of view and due to the single target
nature of interrupt affinities the later plugged CPUs might not be targeted
by interrupts at all.
Though, physical hotplug systems are not the common case while the broken
ACPI table disease is wide spread. So it's preferred to have as many
interrupts as possible utilized at the point where the device is
initialized.
Block multi-queue devices like NVME create a hardware queue per possible
CPU, so the goal of commit 84676c1f21 to assign one interrupt vector per
possible CPU is still achieved even with physical/virtual hotplug.
[ tglx: Changed from online to present CPUs for the first spreading stage,
renamed variables for readability sake, added comments and massaged
changelog ]
Reported-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20180308105358.1506-5-ming.lei@redhat.com
|
|
To support two stage irq vector spreading, it's required to add a starting
point to the spreading function. No functional change, just preparatory
work for the actual two stage change.
[ tglx: Renamed variables, tidied up the code and massaged changelog ]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20180308105358.1506-4-ming.lei@redhat.com
|
|
No functional change, just prepare for converting to 2-stage irq vector
spreading.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20180308105358.1506-3-ming.lei@redhat.com
|
|
The following patches will introduce two stage irq spreading for improving
irq spread on all possible CPUs.
No functional change.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20180308105358.1506-2-ming.lei@redhat.com
|
|
When the allocation of node_to_possible_cpumask fails, then
irq_create_affinity_masks() returns with a pointer to the empty affinity
masks array, which will cause malfunction.
Reorder the allocations so the masks array allocation comes last and every
failure path returns NULL.
Fixes: 9a0ef98e186d ("genirq/affinity: Assign vectors to all present CPUs")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
|
|
Open code readX() inside inX() so that inX() variants have their own
overrideable Port IO barrier combinations as __io_pbr() and __io_par() for
actions to be taken before port IO and after port IO read.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Open code writeX() inside outX() so that outX() variants have their own
overrideable Port IO barrier combinations as __io_pbw() and __io_paw() for
actions to be taken before port IO and after port IO write.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
The default implementation of mapping writeX() to __raw_writeX() is wrong.
writeX() has stronger ordering semantics. Compiler is allowed to reorder
memory writes against __raw_writeX().
Use the previously defined __io_aw() and __io_bw() macros to harden
code generation according to architecture support.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
The default implementation of mapping readX() to __raw_readX() is wrong.
readX() has stronger ordering semantics. Compiler is allowed to reorder
__raw_readX() against the memory accesses following register read.
Use the previously defined __io_ar() and __io_br() macros to harden
code generation according to architecture support.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Getting ready to harden readX()/writeX() and inX()/outX() semantics for the
generic implementation.
Defining two set of macros as __io_br() and __io_ar() to indicate actions
to be taken before and after MMIO read.
Defining two set of macros as __io_bw() and __io_aw() to indicate actions
to be taken before and after MMIO write.
Defining two set of macros as __io_pbw() and __io_paw() to indicate actions
to be taken before and after Port IO write.
Defining two set of macros as __io_pbr() and __io_par() to indicate actions
to be taken before and after Port IO read.
If rmb() is available for the architecture, prefer rmb() as the default
implementation of __io_ar()/__io_par().
If wmb() is available for the architecture, prefer wmb() as the default
implementation of __io_bw()/__io_pbw().
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow a boolean value indicating
whether or not the tick should be stopped before entering the
selected state to be returned from there.
Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return 'false" through
it if:
(1) the idle exit latency is constrained at 0, or
(2) the selected state is a polling one, or
(3) the expected idle period duration is within the tick period
range.
In addition to that, the correction factor computations in the menu
governor need to take the possibility that the tick may not be
stopped into account to avoid artificially small correction factor
values. To that end, add a mechanism to record tick wakeups, as
suggested by Peter Zijlstra, and use it to modify the menu_update()
behavior when tick wakeup occurs. Namely, if the CPU is woken up by
the tick and the return value of tick_nohz_get_sleep_length() is not
within the tick boundary, the predicted idle duration is likely too
short, so make menu_update() try to compensate for that by updating
the governor statistics as though the CPU was idle for a long time.
Since the value returned through the new argument pointer of
cpuidle_select() is not used by its caller yet, this change by
itself is not expected to alter the functionality of the code.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Since the subsequent changes will need a TICK_USEC definition
analogous to TICK_NSEC, rename the existing TICK_USEC as
USER_TICK_USEC, update its users and redefine TICK_USEC
accordingly.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
|
|
Currently calling list_del on smd subdev remove path results in
null pointer dereference on glink only platforms. Fix this by
adding safety checks in glink/smd subdev remove paths.
Signed-off-by: Sibi Sankar <sibis@codeaurora.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
|
|
I got "oom_reaper: unable to reap pid:" messages when the victim thread
was blocked inside free_pgtables() (which occurred after returning from
unmap_vmas() and setting MMF_OOM_SKIP). We don't need to complain when
exit_mmap() already set MMF_OOM_SKIP.
Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: unable to reap pid:7558 (a.out)
a.out D13272 7558 6931 0x00100084
Call Trace:
schedule+0x2d/0x80
rwsem_down_write_failed+0x2bb/0x440
call_rwsem_down_write_failed+0x13/0x20
down_write+0x49/0x60
unlink_file_vma+0x28/0x50
free_pgtables+0x36/0x100
exit_mmap+0xbb/0x180
mmput+0x50/0x110
copy_process.part.41+0xb61/0x1fe0
_do_fork+0xe6/0x560
do_syscall_64+0x74/0x230
entry_SYSCALL_64_after_hwframe+0x42/0xb7
Link: http://lkml.kernel.org/r/201803221946.DHG65638.VFJHFtOSQLOMOF@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch fixes a corner case for KSM. When two pages belong or
belonged to the same transparent hugepage, and they should be merged,
KSM fails to split the page, and therefore no merging happens.
This bug can be reproduced by:
* making sure ksm is running (in case disabling ksmtuned)
* enabling transparent hugepages
* allocating a THP-aligned 1-THP-sized buffer
e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
* filling it with the same values
e.g. memset(p, 42, 1<<21)
* performing madvise to make it mergeable
e.g. madvise(p, 1<<21, MADV_MERGEABLE)
* waiting for KSM to perform a few scans
The expected outcome is that the all the pages get merged (1 shared and
the rest sharing); the actual outcome is that no pages get merged (1
unshared and the rest volatile)
The reason of this behaviour is that we increase the reference count
once for both pages we want to merge, but if they belong to the same
hugepage (or compound page), the reference counter used in both cases is
the one of the head of the compound page. This means that
split_huge_page will find a value of the reference counter too high and
will fail.
This patch solves this problem by testing if the two pages to merge
belong to the same hugepage when attempting to merge them. If so, the
hugepage is split safely. This means that the hugepage is not split if
not necessary.
Link: http://lkml.kernel.org/r/1521548069-24758-1-git-send-email-imbrenda@linux.vnet.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Co-authored-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This fixes a warning shown when phys_addr_t is 32-bit int when compiling
with clang:
mm/memblock.c:927:15: warning: implicit conversion from 'unsigned long long'
to 'phys_addr_t' (aka 'unsigned int') changes value from
18446744073709551615 to 4294967295 [-Wconstant-conversion]
r->base : ULLONG_MAX;
^~~~~~~~~~
./include/linux/kernel.h:30:21: note: expanded from macro 'ULLONG_MAX'
#define ULLONG_MAX (~0ULL)
^~~~~
Link: http://lkml.kernel.org/r/20180319005645.29051-1-stefan@agner.ch
Signed-off-by: Stefan Agner <stefan@agner.ch>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
reason. It looks like it's only a convenience, so remove kmemleak.h
from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
don't already #include it. Also remove <linux/kmemleak.h> from source
files that do not use it.
This is tested on i386 allmodconfig and x86_64 allmodconfig. It would
be good to run it through the 0day bot for other $ARCHes. I have
neither the horsepower nor the storage space for the other $ARCHes.
Update: This patch has been extensively build-tested by both the 0day
bot & kisskb/ozlabs build farms. Both of them reported 2 build failures
for which patches are included here (in v2).
[ slab.h is the second most used header file after module.h; kernel.h is
right there with slab.h. There could be some minor error in the
counting due to some #includes having comments after them and I didn't
combine all of those. ]
[akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au> [2 build failures]
Reported-by: Fengguang Wu <fengguang.wu@intel.com> [2 build failures]
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
At present the construct
if (VM_WARN(...))
will compile OK with CONFIG_DEBUG_VM=y and will fail with
CONFIG_DEBUG_VM=n. The reason is that VM_{WARN,BUG}* have always been
special wrt. {WARN/BUG}* and never generate any code when DEBUG_VM is
disabled. So we cannot really use it in conditionals.
We considered changing things so that this construct works in both cases
but that might cause unwanted code generation with CONFIG_DEBUG_VM=n.
It is safer and simpler to make the build fail in both cases.
[akpm@linux-foundation.org: changelog]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
start_isolate_page_range() is used to set the migrate type of a set of
pageblocks to MIGRATE_ISOLATE while attempting to start a migration
operation. It assumes that only one thread is calling it for the
specified range. This routine is used by CMA, memory hotplug and
gigantic huge pages. Each of these users synchronize access to the
range within their subsystem. However, two subsystems (CMA and gigantic
huge pages for example) could attempt operations on the same range. If
this happens, one thread may 'undo' the work another thread is doing.
This can result in pageblocks being incorrectly left marked as
MIGRATE_ISOLATE and therefore not available for page allocation.
What is ideally needed is a way to synchronize access to a set of
pageblocks that are undergoing isolation and migration. The only thing
we know about these pageblocks is that they are all in the same zone. A
per-node mutex is too coarse as we want to allow multiple operations on
different ranges within the same zone concurrently. Instead, we will
use the migration type of the pageblocks themselves as a form of
synchronization.
start_isolate_page_range sets the migration type on a set of page-
blocks going in order from the one associated with the smallest pfn to
the largest pfn. The zone lock is acquired to check and set the
migration type. When going through the list of pageblocks check if
MIGRATE_ISOLATE is already set. If so, this indicates another thread is
working on this pageblock. We know exactly which pageblocks we set, so
clean up by undo those and return -EBUSY.
This allows start_isolate_page_range to serve as a synchronization
mechanism and will allow for more general use of callers making use of
these interfaces. Update comments in alloc_contig_range to reflect this
new functionality.
Each CPU holds the associated zone lock to modify or examine the
migration type of a pageblock. And, it will only examine/update a
single pageblock per lock acquire/release cycle.
Link: http://lkml.kernel.org/r/20180309224731.16978-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The plan for these patches is to introduce the typedef, initially just
as documentation ("These functions should return a VM_FAULT_ status").
We'll trickle the patches to individual drivers/filesystems in through
the maintainers, as far as possible. Then we'll change the typedef to
an unsigned int and break the compilation of any unconverted
drivers/filesystems.
vmf_insert_page(), vmf_insert_mixed() and vmf_insert_pfn() are three
newly added functions. The various drivers/filesystems where return
value of fault(), huge_fault(), page_mkwrite() and pfn_mkwrite() get
converted, will need them. These functions will return correct
VM_FAULT_ code based on err value.
We've had bugs before where drivers returned -EFOO. And we have this
silly inefficiency where vm_insert_xxx() return an errno which (afaict)
every driver then converts into a VM_FAULT code. In many cases drivers
failed to return correct VM_FAULT code value despite of vm_insert_xxx()
fails. We have indentified and clean up all those existing bugs and
silly inefficiencies in driver/filesystems by adding these three new
inline wrappers. As mentioned above, we will trickle those patches to
individual drivers/filesystems in through maintainers after these three
wrapper functions are merged.
Eventually we can convert vm_insert_xxx() into vmf_insert_xxx() and
remove these inline wrappers, but these are a good intermediate step.
Link: http://lkml.kernel.org/r/20180310162351.GA7422@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since the 2.6 kernel, the oom killer has slightly biased away from
CAP_SYS_ADMIN processes by discounting some of its memory usage in
comparison to other processes.
This has always been implicit and nothing exactly relies on the
behavior.
Gaurav notices that __task_cred() can dereference a potentially freed
pointer if the task under consideration is exiting because a reference
to the task_struct is not held.
Remove the CAP_SYS_ADMIN bias so that all processes are treated equally.
If any CAP_SYS_ADMIN process would like to be biased against, it is
always allowed to adjust /proc/pid/oom_score_adj.
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Kswapd will not wakeup if per-zone watermarks are not failing or if too
many previous attempts at background reclaim have failed.
This can be true if there is a lot of free memory available. For high-
order allocations, kswapd is responsible for waking up kcompactd for
background compaction. If the zone is not below its watermarks or
reclaim has recently failed (lots of free memory, nothing left to
reclaim), kcompactd does not get woken up.
When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
woken up even if kswapd will not reclaim. This allows high-order
allocations, such as thp, to still trigger background compaction even
when the zone has an abundance of free memory.
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
KASAN splats indicate that in some cases we free a live mm, then
continue to access it, with potentially disastrous results. This is
likely due to a mismatched mmdrop() somewhere in the kernel, but so far
the culprit remains elusive.
Let's have __mmdrop() verify that the mm isn't live for the current
task, similar to the existing check for init_mm. This way, we can catch
this class of issue earlier, and without requiring KASAN.
Currently, idle_task_exit() leaves active_mm stale after it switches to
init_mm. This isn't harmful, but will trigger the new assertions, so we
must adjust idle_task_exit() to update active_mm.
Link: http://lkml.kernel.org/r/20180312140103.19235-1-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
During the reclaiming slab of a memcg, shrink_slab iterates over all
registered shrinkers in the system, and tries to count and consume
objects related to the cgroup. In case of memory pressure, this behaves
bad: I observe high system time and time spent in list_lru_count_one()
for many processes on RHEL7 kernel.
This patch makes list_lru_node::memcg_lrus rcu protected, that allows to
skip taking spinlock in list_lru_count_one().
Shakeel Butt with the patch observes significant perf graph change. He
says:
========================================================================
Setup: running a fork-bomb in a memcg of 200MiB on a 8GiB and 4 vcpu
VM and recording the trace with 'perf record -g -a'.
The trace without the patch:
+ 34.19% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
+ 30.77% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
+ 3.53% fb.sh [kernel.kallsyms] [k] list_lru_count_one
+ 2.26% fb.sh [kernel.kallsyms] [k] super_cache_count
+ 1.68% fb.sh [kernel.kallsyms] [k] shrink_slab
+ 0.59% fb.sh [kernel.kallsyms] [k] down_read_trylock
+ 0.48% fb.sh [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
+ 0.38% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
+ 0.32% fb.sh [kernel.kallsyms] [k] queue_work_on
+ 0.26% fb.sh [kernel.kallsyms] [k] count_shadow_nodes
With the patch:
+ 0.16% swapper [kernel.kallsyms] [k] default_idle
+ 0.13% oom_reaper [kernel.kallsyms] [k] mutex_spin_on_owner
+ 0.05% perf [kernel.kallsyms] [k] copy_user_generic_string
+ 0.05% init.real [kernel.kallsyms] [k] wait_consider_task
+ 0.05% kworker/0:0 [kernel.kallsyms] [k] finish_task_switch
+ 0.04% kworker/2:1 [kernel.kallsyms] [k] finish_task_switch
+ 0.04% kworker/3:1 [kernel.kallsyms] [k] finish_task_switch
+ 0.04% kworker/1:0 [kernel.kallsyms] [k] finish_task_switch
+ 0.03% binary [kernel.kallsyms] [k] copy_page
========================================================================
Thanks Shakeel for the testing.
[ktkhai@virtuozzo.com: v2]
Link: http://lkml.kernel.org/r/151203869520.3915.2587549826865799173.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/150583358557.26700.8490036563698102569.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The bool enable_vma_readahead and swap_vma_readahead() are local to the
source and do not need to be in global scope, so make them static.
Cleans up sparse warnings:
mm/swap_state.c:41:6: warning: symbol 'enable_vma_readahead' was not declared. Should it be static?
mm/swap_state.c:742:13: warning: symbol 'swap_vma_readahead' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20180223164852.5159-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Prior to commit d47992f86b30 ("mm: change invalidatepage prototype to
accept length"), an offset of 0 meant that the full page was being
invalidated. After that commit, we need to instead check the length.
Jan said:
:
: The only possible issue is that try_to_release_page() was called more
: often than necessary. Otherwise the issue is harmless but still it's good
: to have this fixed.
Link: http://lkml.kernel.org/r/x49fu5rtnzs.fsf@segfault.boston.devel.redhat.com
Fixes: d47992f86b307 ("mm: change invalidatepage prototype to accept length")
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Link: http://lkml.kernel.org/r/1519585191-10180-4-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The 'cold' parameter was removed from release_pages function by commit
c6f92f9fbe7d ("mm: remove cold parameter for release_pages").
Update the description to match the code.
Link: http://lkml.kernel.org/r/1519585191-10180-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The alloc_mm_area in nommu is a stub, but its description states it
allocates kernel address space. Remove the description to make the code
and the documentation agree.
Link: http://lkml.kernel.org/r/1519585191-10180-2-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Remove ZRAM's enforced "huge object" value and use zsmalloc huge-class
watermark instead, which makes more sense.
TEST
- I used a 1G zram device, LZO compression back-end, original
data set size was 444MB. Looking at zsmalloc classes stats the
test ended up to be pretty fair.
BASE ZRAM/ZSMALLOC
=====================
zram mm_stat
498978816 191482495 199831552 0 199831552 15634 0
zsmalloc classes
class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
...
151 2448 0 0 1240 1240 744 3 0
168 2720 0 0 4200 4200 2800 2 0
190 3072 0 0 10100 10100 7575 3 0
202 3264 0 0 380 380 304 4 0
254 4096 0 0 10620 10620 10620 1 0
Total 7 46 106982 106187 48787 0
PATCHED ZRAM/ZSMALLOC
=====================
zram mm_stat
498978816 182579184 194248704 0 194248704 15628 0
zsmalloc classes
class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
...
151 2448 0 0 1240 1240 744 3 0
168 2720 0 0 4200 4200 2800 2 0
190 3072 0 0 10100 10100 7575 3 0
202 3264 0 0 7180 7180 5744 4 0
254 4096 0 0 3820 3820 3820 1 0
Total 8 45 106959 106193 47424 0
As we can see, we reduced the number of objects stored in class-4096,
because a huge number of objects which we previously forcibly stored in
class-4096 now stored in non-huge class-3264. This results in lower
memory consumption:
- zsmalloc now uses 47424 physical pages, which is less than 48787 pages
zsmalloc used before.
- objects that we store in class-3264 share zspages. That's why overall
the number of pages that both class-4096 and class-3264 consumed went
down from 10924 to 9564.
[sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
Link: http://lkml.kernel.org/r/20180314081833.1096-3-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20180306070639.7389-3-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|