summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2011-10-31mm/debug-pagealloc.c: use memchr_invAkinobu Mita
Use newly introduced memchr_inv() for page verification. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31lib/string.c: introduce memchr_inv()Akinobu Mita
memchr_inv() is mainly used to check whether the whole buffer is filled with just a specified byte. The function name and prototype are stolen from logfs and the implementation is from SLUB. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Joern Engel <joern@logfs.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm/debug-pagealloc.c: use plain __ratelimit() instead of printk_ratelimit()Akinobu Mita
printk_ratelimit() should not be used, because it shares ratelimiting state with all other unrelated printk_ratelimit() callsites. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31vmscan: count pages into balanced for zone with good watermarkShaohua Li
It's possible a zone watermark is ok when entering the balance_pgdat() loop, while the zone is within the requested classzone_idx. Count pages from this zone into `balanced'. In this way, we can skip shrinking zones too much for high order allocation. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: vmscan: immediately reclaim end-of-LRU dirty pages when writeback completesMel Gorman
When direct reclaim encounters a dirty page, it gets recycled around the LRU for another cycle. This patch marks the page PageReclaim similar to deactivate_page() so that the page gets reclaimed almost immediately after the page gets cleaned. This is to avoid reclaiming clean pages that are younger than a dirty page encountered at the end of the LRU that might have been something like a use-once page. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <jweiner@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: vmscan: throttle reclaim if encountering too many dirty pages under ↵Mel Gorman
writeback Workloads that are allocating frequently and writing files place a large number of dirty pages on the LRU. With use-once logic, it is possible for them to reach the end of the LRU quickly requiring the reclaimer to scan more to find clean pages. Ordinarily, processes that are dirtying memory will get throttled by dirty balancing but this is a global heuristic and does not take into account that LRUs are maintained on a per-zone basis. This can lead to a situation whereby reclaim is scanning heavily, skipping over a large number of pages under writeback and recycling them around the LRU consuming CPU. This patch checks how many of the number of pages isolated from the LRU were dirty and under writeback. If a percentage of them under writeback, the process will be throttled if a backing device or the zone is congested. Note that this applies whether it is anonymous or file-backed pages that are under writeback meaning that swapping is potentially throttled. This is intentional due to the fact if the swap device is congested, scanning more pages and dispatching more IO is not going to help matters. The percentage that must be in writeback depends on the priority. At default priority, all of them must be dirty. At DEF_PRIORITY-1, 50% of them must be, DEF_PRIORITY-2, 25% etc. i.e. as pressure increases the greater the likelihood the process will get throttled to allow the flusher threads to make some progress. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: vmscan: do not writeback filesystem pages in kswapd except in high priorityMel Gorman
It is preferable that no dirty pages are dispatched for cleaning from the page reclaim path. At normal priorities, this patch prevents kswapd writing pages. However, page reclaim does have a requirement that pages be freed in a particular zone. If it is failing to make sufficient progress (reclaiming < SWAP_CLUSTER_MAX at any priority priority), the priority is raised to scan more pages. A priority of DEF_PRIORITY - 3 is considered to be the point where kswapd is getting into trouble reclaiming pages. If this priority is reached, kswapd will dispatch pages for writing. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31ext4: warn if direct reclaim tries to writeback pagesMel Gorman
Direct reclaim should never writeback pages. Warn if an attempt is made. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31xfs: warn if direct reclaim tries to writeback pagesMel Gorman
Direct reclaim should never writeback pages. For now, handle the situation and warn about it. Ultimately, this will be a BUG_ON. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: vmscan: remove dead code related to lumpy reclaim waiting on pages under ↵Mel Gorman
writeback Lumpy reclaim worked with two passes - the first which queued pages for IO and the second which waited on writeback. As direct reclaim can no longer write pages there is some dead code. This patch removes it but direct reclaim will continue to wait on pages under writeback while in synchronous reclaim mode. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: vmscan: do not writeback filesystem pages in direct reclaimMel Gorman
Testing from the XFS folk revealed that there is still too much I/O from the end of the LRU in kswapd. Previously it was considered acceptable by VM people for a small number of pages to be written back from reclaim with testing generally showing about 0.3% of pages reclaimed were written back (higher if memory was low). That writing back a small number of pages is ok has been heavily disputed for quite some time and Dave Chinner explained it well; It doesn't have to be a very high number to be a problem. IO is orders of magnitude slower than the CPU time it takes to flush a page, so the cost of making a bad flush decision is very high. And single page writeback from the LRU is almost always a bad flush decision. To complicate matters, filesystems respond very differently to requests from reclaim according to Christoph Hellwig; xfs tries to write it back if the requester is kswapd ext4 ignores the request if it's a delayed allocation btrfs ignores the request As a result, each filesystem has different performance characteristics when under memory pressure and there are many pages being dirtied. In some cases, the request is ignored entirely so the VM cannot depend on the IO being dispatched. The objective of this series is to reduce writing of filesystem-backed pages from reclaim, play nicely with writeback that is already in progress and throttle reclaim appropriately when writeback pages are encountered. The assumption is that the flushers will always write pages faster than if reclaim issues the IO. A secondary goal is to avoid the problem whereby direct reclaim splices two potentially deep call stacks together. There is a potential new problem as reclaim has less control over how long before a page in a particularly zone or container is cleaned and direct reclaimers depend on kswapd or flusher threads to do the necessary work. However, as filesystems sometimes ignore direct reclaim requests already, it is not expected to be a serious issue. Patch 1 disables writeback of filesystem pages from direct reclaim entirely. Anonymous pages are still written. Patch 2 removes dead code in lumpy reclaim as it is no longer able to synchronously write pages. This hurts lumpy reclaim but there is an expectation that compaction is used for hugepage allocations these days and lumpy reclaim's days are numbered. Patches 3-4 add warnings to XFS and ext4 if called from direct reclaim. With patch 1, this "never happens" and is intended to catch regressions in this logic in the future. Patch 5 disables writeback of filesystem pages from kswapd unless the priority is raised to the point where kswapd is considered to be in trouble. Patch 6 throttles reclaimers if too many dirty pages are being encountered and the zones or backing devices are congested. Patch 7 invalidates dirty pages found at the end of the LRU so they are reclaimed quickly after being written back rather than waiting for a reclaimer to find them I consider this series to be orthogonal to the writeback work but it is worth noting that the writeback work affects the viability of patch 8 in particular. I tested this on ext4 and xfs using fs_mark, a simple writeback test based on dd and a micro benchmark that does a streaming write to a large mapping (exercises use-once LRU logic) followed by streaming writes to a mix of anonymous and file-backed mappings. The command line for fs_mark when botted with 512M looked something like ./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760 The number of files was adjusted depending on the amount of available memory so that the files created was about 3xRAM. For multiple threads, the -d switch is specified multiple times. The test machine is x86-64 with an older generation of AMD processor with 4 cores. The underlying storage was 4 disks configured as RAID-0 as this was the best configuration of storage I had available. Swap is on a separate disk. Dirty ratio was tuned to 40% instead of the default of 20%. Testing was run with and without monitors to both verify that the patches were operating as expected and that any performance gain was real and not due to interference from monitors. Here is a summary of results based on testing XFS. 512M1P-xfs Files/s mean 32.69 ( 0.00%) 34.44 ( 5.08%) 512M1P-xfs Elapsed Time fsmark 51.41 48.29 512M1P-xfs Elapsed Time simple-wb 114.09 108.61 512M1P-xfs Elapsed Time mmap-strm 113.46 109.34 512M1P-xfs Kswapd efficiency fsmark 62% 63% 512M1P-xfs Kswapd efficiency simple-wb 56% 61% 512M1P-xfs Kswapd efficiency mmap-strm 44% 42% 512M-xfs Files/s mean 30.78 ( 0.00%) 35.94 (14.36%) 512M-xfs Elapsed Time fsmark 56.08 48.90 512M-xfs Elapsed Time simple-wb 112.22 98.13 512M-xfs Elapsed Time mmap-strm 219.15 196.67 512M-xfs Kswapd efficiency fsmark 54% 56% 512M-xfs Kswapd efficiency simple-wb 54% 55% 512M-xfs Kswapd efficiency mmap-strm 45% 44% 512M-4X-xfs Files/s mean 30.31 ( 0.00%) 33.33 ( 9.06%) 512M-4X-xfs Elapsed Time fsmark 63.26 55.88 512M-4X-xfs Elapsed Time simple-wb 100.90 90.25 512M-4X-xfs Elapsed Time mmap-strm 261.73 255.38 512M-4X-xfs Kswapd efficiency fsmark 49% 50% 512M-4X-xfs Kswapd efficiency simple-wb 54% 56% 512M-4X-xfs Kswapd efficiency mmap-strm 37% 36% 512M-16X-xfs Files/s mean 60.89 ( 0.00%) 65.22 ( 6.64%) 512M-16X-xfs Elapsed Time fsmark 67.47 58.25 512M-16X-xfs Elapsed Time simple-wb 103.22 90.89 512M-16X-xfs Elapsed Time mmap-strm 237.09 198.82 512M-16X-xfs Kswapd efficiency fsmark 45% 46% 512M-16X-xfs Kswapd efficiency simple-wb 53% 55% 512M-16X-xfs Kswapd efficiency mmap-strm 33% 33% Up until 512-4X, the FSmark improvements were statistically significant. For the 4X and 16X tests the results were within standard deviations but just barely. The time to completion for all tests is improved which is an important result. In general, kswapd efficiency is not affected by skipping dirty pages. 1024M1P-xfs Files/s mean 39.09 ( 0.00%) 41.15 ( 5.01%) 1024M1P-xfs Elapsed Time fsmark 84.14 80.41 1024M1P-xfs Elapsed Time simple-wb 210.77 184.78 1024M1P-xfs Elapsed Time mmap-strm 162.00 160.34 1024M1P-xfs Kswapd efficiency fsmark 69% 75% 1024M1P-xfs Kswapd efficiency simple-wb 71% 77% 1024M1P-xfs Kswapd efficiency mmap-strm 43% 44% 1024M-xfs Files/s mean 35.45 ( 0.00%) 37.00 ( 4.19%) 1024M-xfs Elapsed Time fsmark 94.59 91.00 1024M-xfs Elapsed Time simple-wb 229.84 195.08 1024M-xfs Elapsed Time mmap-strm 405.38 440.29 1024M-xfs Kswapd efficiency fsmark 79% 71% 1024M-xfs Kswapd efficiency simple-wb 74% 74% 1024M-xfs Kswapd efficiency mmap-strm 39% 42% 1024M-4X-xfs Files/s mean 32.63 ( 0.00%) 35.05 ( 6.90%) 1024M-4X-xfs Elapsed Time fsmark 103.33 97.74 1024M-4X-xfs Elapsed Time simple-wb 204.48 178.57 1024M-4X-xfs Elapsed Time mmap-strm 528.38 511.88 1024M-4X-xfs Kswapd efficiency fsmark 81% 70% 1024M-4X-xfs Kswapd efficiency simple-wb 73% 72% 1024M-4X-xfs Kswapd efficiency mmap-strm 39% 38% 1024M-16X-xfs Files/s mean 42.65 ( 0.00%) 42.97 ( 0.74%) 1024M-16X-xfs Elapsed Time fsmark 103.11 99.11 1024M-16X-xfs Elapsed Time simple-wb 200.83 178.24 1024M-16X-xfs Elapsed Time mmap-strm 397.35 459.82 1024M-16X-xfs Kswapd efficiency fsmark 84% 69% 1024M-16X-xfs Kswapd efficiency simple-wb 74% 73% 1024M-16X-xfs Kswapd efficiency mmap-strm 39% 40% All FSMark tests up to 16X had statistically significant improvements. For the most part, tests are completing faster with the exception of the streaming writes to a mixture of anonymous and file-backed mappings which were slower in two cases In the cases where the mmap-strm tests were slower, there was more swapping due to dirty pages being skipped. The number of additional pages swapped is almost identical to the fewer number of pages written from reclaim. In other words, roughly the same number of pages were reclaimed but swapping was slower. As the test is a bit unrealistic and stresses memory heavily, the small shift is acceptable. 4608M1P-xfs Files/s mean 29.75 ( 0.00%) 30.96 ( 3.91%) 4608M1P-xfs Elapsed Time fsmark 512.01 492.15 4608M1P-xfs Elapsed Time simple-wb 618.18 566.24 4608M1P-xfs Elapsed Time mmap-strm 488.05 465.07 4608M1P-xfs Kswapd efficiency fsmark 93% 86% 4608M1P-xfs Kswapd efficiency simple-wb 88% 84% 4608M1P-xfs Kswapd efficiency mmap-strm 46% 45% 4608M-xfs Files/s mean 27.60 ( 0.00%) 28.85 ( 4.33%) 4608M-xfs Elapsed Time fsmark 555.96 532.34 4608M-xfs Elapsed Time simple-wb 659.72 571.85 4608M-xfs Elapsed Time mmap-strm 1082.57 1146.38 4608M-xfs Kswapd efficiency fsmark 89% 91% 4608M-xfs Kswapd efficiency simple-wb 88% 82% 4608M-xfs Kswapd efficiency mmap-strm 48% 46% 4608M-4X-xfs Files/s mean 26.00 ( 0.00%) 27.47 ( 5.35%) 4608M-4X-xfs Elapsed Time fsmark 592.91 564.00 4608M-4X-xfs Elapsed Time simple-wb 616.65 575.07 4608M-4X-xfs Elapsed Time mmap-strm 1773.02 1631.53 4608M-4X-xfs Kswapd efficiency fsmark 90% 94% 4608M-4X-xfs Kswapd efficiency simple-wb 87% 82% 4608M-4X-xfs Kswapd efficiency mmap-strm 43% 43% 4608M-16X-xfs Files/s mean 26.07 ( 0.00%) 26.42 ( 1.32%) 4608M-16X-xfs Elapsed Time fsmark 602.69 585.78 4608M-16X-xfs Elapsed Time simple-wb 606.60 573.81 4608M-16X-xfs Elapsed Time mmap-strm 1549.75 1441.86 4608M-16X-xfs Kswapd efficiency fsmark 98% 98% 4608M-16X-xfs Kswapd efficiency simple-wb 88% 82% 4608M-16X-xfs Kswapd efficiency mmap-strm 44% 42% Unlike the other tests, the fsmark results are not statistically significant but the min and max times are both improved and for the most part, tests completed faster. There are other indications that this is an improvement as well. For example, in the vast majority of cases, there were fewer pages scanned by direct reclaim implying in many cases that stalls due to direct reclaim are reduced. KSwapd is scanning more due to skipping dirty pages which is unfortunate but the CPU usage is still acceptable In an earlier set of tests, I used blktrace and in almost all cases throughput throughout the entire test was higher. However, I ended up discarding those results as recording blktrace data was too heavy for my liking. On a laptop, I plugged in a USB stick and ran a similar tests of tests using it as backing storage. A desktop environment was running and for the entire duration of the tests, firefox and gnome terminal were launching and exiting to vaguely simulate a user. 1024M-xfs Files/s mean 0.41 ( 0.00%) 0.44 ( 6.82%) 1024M-xfs Elapsed Time fsmark 2053.52 1641.03 1024M-xfs Elapsed Time simple-wb 1229.53 768.05 1024M-xfs Elapsed Time mmap-strm 4126.44 4597.03 1024M-xfs Kswapd efficiency fsmark 84% 85% 1024M-xfs Kswapd efficiency simple-wb 92% 81% 1024M-xfs Kswapd efficiency mmap-strm 60% 51% 1024M-xfs Avg wait ms fsmark 5404.53 4473.87 1024M-xfs Avg wait ms simple-wb 2541.35 1453.54 1024M-xfs Avg wait ms mmap-strm 3400.25 3852.53 The mmap-strm results were hurt because firefox launching had a tendency to push the test out of memory. On the postive side, firefox launched marginally faster with the patches applied. Time to completion for many tests was faster but more importantly - the "Avg wait" time as measured by iostat was far lower implying the system would be more responsive. It was also the case that "Avg wait ms" on the root filesystem was lower. I tested it manually and while the system felt slightly more responsive while copying data to a USB stick, it was marginal enough that it could be my imagination. This patch: do not writeback filesystem pages in direct reclaim. When kswapd is failing to keep zones above the min watermark, a process will enter direct reclaim in the same manner kswapd does. If a dirty page is encountered during the scan, this page is written to backing storage using mapping->writepage. This causes two problems. First, it can result in very deep call stacks, particularly if the target storage or filesystem are complex. Some filesystems ignore write requests from direct reclaim as a result. The second is that a single-page flush is inefficient in terms of IO. While there is an expectation that the elevator will merge requests, this does not always happen. Quoting Christoph Hellwig; The elevator has a relatively small window it can operate on, and can never fix up a bad large scale writeback pattern. This patch prevents direct reclaim writing back filesystem pages by checking if current is kswapd. Anonymous pages are still written to swap as there is not the equivalent of a flusher thread for anonymous pages. If the dirty pages cannot be written back, they are placed back on the LRU lists. There is now a direct dependency on dirty page balancing to prevent too many pages in the system being dirtied which would prevent reclaim making forward progress. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: add comments to explain mm_struct fieldsChristoph Lameter
Add comments to explain the page statistics field in the mm_struct. [akpm@linux-foundation.org: add missing ;] Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: distinguish between mlocked and pinned pagesChristoph Lameter
Some kernel components pin user space memory (infiniband and perf) (by increasing the page count) and account that memory as "mlocked". The difference between mlocking and pinning is: A. mlocked pages are marked with PG_mlocked and are exempt from swapping. Page migration may move them around though. They are kept on a special LRU list. B. Pinned pages cannot be moved because something needs to directly access physical memory. They may not be on any LRU list. I recently saw an mlockalled process where mm->locked_vm became bigger than the virtual size of the process (!) because some memory was accounted for twice: Once when the page was mlocked and once when the Infiniband layer increased the refcount because it needt to pin the RDMA memory. This patch introduces a separate counter for pinned pages and accounts them seperately. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Mike Marciniszyn <infinipath@qlogic.com> Cc: Roland Dreier <roland@kernel.org> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: vmscan: drop nr_force_scan[] from get_scan_countJohannes Weiner
The nr_force_scan[] tuple holds the effective scan numbers for anon and file pages in case the situation called for a forced scan and the regularly calculated scan numbers turned out zero. However, the effective scan number can always be assumed to be SWAP_CLUSTER_MAX right before the division into anon and file. The numerators and denominator are properly set up for all cases, be it force scan for just file, just anon, or both, to do the right thing. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: output a list of loaded modules when we hit bad_page()Dave Jones
When we get a bad_page bug report, it's useful to see what modules the user had loaded. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31tmpfs: add "tmpfs" to the Kconfig prompt to make it obvious.Robert P. J. Day
Add the leading word "tmpfs" to the Kconfig string to make it blindingly obvious that this selection refers to tmpfs. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31oom: fix race while temporarily setting current's oom_score_adjDavid Rientjes
test_set_oom_score_adj() was introduced in 72788c385604 ("oom: replace PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate current's oom_score_adj for ksm and swapoff without requiring an additional per-process flag. Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and then reinstate the previous value is racy since it's possible that userspace can set the value to something else itself before the old value is reinstated. That results in userspace setting current's oom_score_adj to a different value and then the kernel immediately setting it back to its previous value without notification. To fix this, a new compare_swap_oom_score_adj() function is introduced with the same semantics as the compare and swap CAS instruction, or CMPXCHG on x86. It is used to reinstate the previous value of oom_score_adj if and only if the present value is the same as the old value. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31oom: remove oom_disable_countDavid Rientjes
This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31oom: avoid killing kthreads if they assume the oom killed thread's mmDavid Rientjes
After selecting a task to kill, the oom killer iterates all processes and kills all other threads that share the same mm_struct in different thread groups. It would not otherwise be helpful to kill a thread if its memory would not be subsequently freed. A kernel thread, however, may assume a user thread's mm by using use_mm(). This is only temporary and should not result in sending a SIGKILL to that kthread. This patch ensures that only user threads and not kthreads are sent a SIGKILL if they share the same mm_struct as the oom killed task. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31oom: thaw threads if oom killed thread is frozen before deferringDavid Rientjes
If a thread has been oom killed and is frozen, thaw it before returning to the page allocator. Otherwise, it can stay frozen indefinitely and no memory will be freed. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm/page-writeback.c: document bdi_min_ratioJohannes Weiner
Looks like someone got distracted after adding the comment characters. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31vmscan: add block plug for page reclaimShaohua Li
per-task block plug can reduce block queue lock contention and increase request merge. Currently page reclaim doesn't support it. I originally thought page reclaim doesn't need it, because kswapd thread count is limited and file cache write is done at flusher mostly. When I test a workload with heavy swap in a 4-node machine, each CPU is doing direct page reclaim and swap. This causes block queue lock contention. In my test, without below patch, the CPU utilization is about 2% ~ 7%. With the patch, the CPU utilization is about 1% ~ 3%. Disk throughput isn't changed. This should improve normal kswapd write and file cache write too (increase request merge for example), but might not be so obvious as I explain above. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31radix_tree: clean away saw_unset_tag leftoversHugh Dickins
radix_tree_tag_get()'s BUG (when it sees a tag after saw_unset_tag) was unsafe and removed in 2.6.34, but the pointless saw_unset_tag left behind. Remove it now, and return 0 as soon as we see unset tag - we already rely upon the root tag to be correct, returning 0 immediately if it's not set. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: migration: clean up unmap_and_move()Minchan Kim
unmap_and_move() is one a big messy function. Clean it up. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31Cross Memory AttachChristopher Yeoh
The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory. The following patch attempts to achieve this by allowing a destination process, given an address and size from a source process, to copy memory directly from the source process into its own address space via a system call. There is also a symmetrical ability to copy from the current process's address space into a destination process's address space. - Use of /proc/pid/mem has been considered, but there are issues with using it: - Does not allow for specifying iovecs for both src and dest, assuming preadv or pwritev was implemented either the area read from or written to would need to be contiguous. - Currently mem_read allows only processes who are currently ptrace'ing the target and are still able to ptrace the target to read from the target. This check could possibly be moved to the open call, but its not clear exactly what race this restriction is stopping (reason appears to have been lost) - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix domain socket is a bit ugly from a userspace point of view, especially when you may have hundreds if not (eventually) thousands of processes that all need to do this with each other - Doesn't allow for some future use of the interface we would like to consider adding in the future (see below) - Interestingly reading from /proc/pid/mem currently actually involves two copies! (But this could be fixed pretty easily) As mentioned previously use of vmsplice instead was considered, but has problems. Since you need the reader and writer working co-operatively if the pipe is not drained then you block. Which requires some wrapping to do non blocking on the send side or polling on the receive. In all to all communication it requires ordering otherwise you can deadlock. And in the example of many MPI tasks writing to one MPI task vmsplice serialises the copying. There are some cases of MPI collectives where even a single copy interface does not get us the performance gain we could. For example in an MPI_Reduce rather than copy the data from the source we would like to instead use it directly in a mathops (say the reduce is doing a sum) as this would save us doing a copy. We don't need to keep a copy of the data from the source. I haven't implemented this, but I think this interface could in the future do all this through the use of the flags - eg could specify the math operation and type and the kernel rather than just copying the data would apply the specified operation between the source and destination and store it in the destination. Although we don't have a "second user" of the interface (though I've had some nibbles from people who may be interested in using it for intra process messaging which is not MPI). This interface is something which hardware vendors are already doing for their custom drivers to implement fast local communication. And so in addition to this being useful for OpenMPI it would mean the driver maintainers don't have to fix things up when the mm changes. There was some discussion about how much faster a true zero copy would go. Here's a link back to the email with some testing I did on that: http://marc.info/?l=linux-mm&m=130105930902915&w=2 There is a basic man page for the proposed interface here: http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt This has been implemented for x86 and powerpc, other architecture should mainly (I think) just need to add syscall numbers for the process_vm_readv and process_vm_writev. There are 32 bit compatibility versions for 64-bit kernels. For arch maintainers there are some simple tests to be able to quickly verify that the syscalls are working correctly here: http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <linux-man@vger.kernel.org> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: zone_reclaim: make isolate_lru_page() filter-awareMinchan Kim
In __zone_reclaim case, we don't want to shrink mapped page. Nonetheless, we have isolated mapped page and re-add it into LRU's head. It's unnecessary CPU overhead and makes LRU churning. Of course, when we isolate the page, the page might be mapped but when we try to migrate the page, the page would be not mapped. So it could be migrated. But race is rare and although it happens, it's no big deal. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31ipc/mqueue.c: fix wrong use of schedule_hrtimeout_range_clock()Wanlong Gao
Fix the wrong use of schedule_hrtimeout_range_clock() in wq_sleep(), although it is harmless for the syscall mq_timed* now. It was introduced by 9ca7d8e ("mqueue: Convert message queue timeout to use hrtimers"). Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Cc: Carsten Emde <C.Emde@osadl.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: compaction: make isolate_lru_page() filter-awareMinchan Kim
In async mode, compaction doesn't migrate dirty or writeback pages. So, it's meaningless to pick the page and re-add it to lru list. Of course, when we isolate the page in compaction, the page might be dirty or writeback but when we try to migrate the page, the page would be not dirty, writeback. So it could be migrated. But it's very unlikely as isolate and migration cycle is much faster than writeout. So, this patch helps cpu overhead and prevent unnecessary LRU churning. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31/proc/self/numa_maps: restore "huge" tag for hugetlb vmasAndrew Morton
The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm: use walk_page_range() instead of custom page table walking code"). Reported-by: Stephen Hemminger <shemminger@vyatta.com> Tested-by: Stephen Hemminger <shemminger@vyatta.com> Reviewed-by: Stephen Wilson <wilsons@start.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: change isolate mode from #define to bitwise typeMinchan Kim
Change ISOLATE_XXX macro with bitwise isolate_mode_t type. Normally, macro isn't recommended as it's type-unsafe and making debugging harder as symbol cannot be passed throught to the debugger. Quote from Johannes " Hmm, it would probably be cleaner to fully convert the isolation mode into independent flags. INACTIVE, ACTIVE, BOTH is currently a tri-state among flags, which is a bit ugly." This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31include/linux/dmar.h: forward-declare struct acpi_dmar_headerAndrew Morton
x86_64 allnoconfig: In file included from arch/x86/kernel/pci-dma.c:3: include/linux/dmar.h:248: warning: 'struct acpi_dmar_header' declared inside parameter list include/linux/dmar.h:248: warning: its scope is only this definition or declaration, which is probably not what you want Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: compaction: trivial clean up in acct_isolated()Minchan Kim
acct_isolated of compaction uses page_lru_base_type which returns only base type of LRU list so it never returns LRU_ACTIVE_ANON or LRU_ACTIVE_FILE. In addtion, cc->nr_[anon|file] is used in only acct_isolated so it doesn't have fields in conpact_control. This patch removes fields from compact_control and makes clear function of acct_issolated which counts the number of anon|file pages isolated. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31dma-mapping: fix sync_single_range_* DMA debuggingClemens Ladisch
Commit 5fd75a7850b5 (dma-mapping: remove unnecessary sync_single_range_* in dma_map_ops) unified not only the dma_map_ops but also the corresponding debug_dma_sync_* calls. This led to spurious WARN()ings like the following because the DMA debug code was no longer able to detect the DMA buffer base address without the separate offset parameter: WARNING: at lib/dma-debug.c:911 check_sync+0xce/0x446() firewire_ohci 0000:04:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000000cedaa400] [size=1024 bytes] Call Trace: ... [<ffffffff811326a5>] check_sync+0xce/0x446 [<ffffffff81132ad9>] debug_dma_sync_single_for_device+0x39/0x3b [<ffffffffa01d6e6a>] ohci_queue_iso+0x4f3/0x77d [firewire_ohci] ... To fix this, unshare the sync_single_* and sync_single_range_* implementations so that we are able to call the correct debug_dma_sync_* functions. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31Revert "tracing: Include module.h in define_trace.h"Paul Gortmaker
This reverts commit 3a9f987b3141f086de27832514aad9f50a53f754. With all the files that are real modules now having module.h explicitly called out for inclusion, and no reliance on any implicit presence of module.h assumed, we should no longer need this workaround. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31irq: don't put module.h into irq.h for tracking irqgen modules.Paul Gortmaker
Recent commit "irq: Track the owner of irq descriptor" in commit ID b6873807a7143b7 placed module.h into linux/irq.h but we are trying to limit module.h inclusion to just C files that really need it, due to its size and number of children includes. This targets just reversing that include. Add in the basic "struct module" since that is all we really need to ensure things compile. In theory, b687380 should have added the module.h include to the irqdesc.h header as well, but the implicit module.h everywhere presence masked this from showing up. So give it the "struct module" as well. As for the C files, irqdesc.c is only using THIS_MODULE, so it does not need module.h - give it export.h instead. The C file irq/manage.c is now (as of b687380) using try_module_get and module_put and so it needs module.h (which it already has). Also convert the irq_alloc_descs variants to macros, since all they really do is is call the __irq_alloc_descs primitive. This avoids including export.h and no debug info is lost. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31bluetooth: macroize two small inlines to avoid module.hPaul Gortmaker
These two small inlines make calls to try_module_get() and module_put() which would force us to keep module.h present within yet another common include header. We can avoid this by turning them into macros. The hci_dev_hold construct is patterned off of raw_spin_trylock_irqsave() in spinlock.h Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31ip_vs.h: fix implicit use of module_get/module_put from module.hPaul Gortmaker
This file was using the module get/put functions in two simple inline functions. But module_get/put were only within scope because of the implicit presence of module.h being everywhere. Rather than add module.h to another file in include/ -- which is exactly the thing we are trying to avoid, simply convert these one-line functions into a define, as per what was done for the device_schedule_callback() in commit 523ded71de0c5e669733. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31nf_conntrack.h: fix up fallout from implicit moduleparam.h presencePaul Gortmaker
The implicit presence of module.h everywhere meant that this header also was getting moduleparam.h which defines struct kernel_param. Since it only needs to know that kernel_param is a struct, call that out instead of adding an include of moduleparam.h -- to get rid of this: include/net/netfilter/nf_conntrack.h:316: warning: 'struct kernel_param' declared inside parameter list include/net/netfilter/nf_conntrack.h:316: warning: its scope is only this definition or declaration, which is probably not what you want Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31include: replace linux/module.h with "struct module" wherever possiblePaul Gortmaker
The <linux/module.h> pretty much brings in the kitchen sink along with it, so it should be avoided wherever reasonably possible in terms of being included from other commonly used <linux/something.h> files, as it results in a measureable increase on compile times. The worst culprit was probably device.h since it is used everywhere. This file also had an implicit dependency/usage of mutex.h which was masked by module.h, and is also fixed here at the same time. There are over a dozen other headers that simply declare the struct instead of pulling in the whole file, so follow their lead and simply make it a few more. Most of the implicit dependencies on module.h being present by these headers pulling it in have been now weeded out, so we can finally make this change with hopefully minimal breakage. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31include: convert various register fcns to macros to avoid include chainingPaul Gortmaker
The original implementations reference THIS_MODULE in an inline. We could include <linux/export.h>, but it is better to avoid chaining. Fortunately someone else already thought of this, and made a similar inline into a #define in <linux/device.h> for device_schedule_callback(), [see commit 523ded71de0] so follow that precedent here. Also bubble up any __must_check that were used on the prev. wrapper inline functions up one to the real __register functions, to preserve any prev. sanity checks that were used in those instances. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31crypto.h: remove unused crypto_tfm_alg_modname() inlinePaul Gortmaker
The <linux/crypto.h> (which is in turn in common headers like tcp.h) wants to use module_name() in an inline fcn. But having all of <linux/module.h> along for the ride is overkill and slows down compiles by a measureable amount, since it in turn includes lots of headers. Since the inline is never used anywhere in the kernel[1], we can just remove it, and then also remove the module.h include as well. In all the many crypto modules, there were some relying on crypto.h including module.h -- for them we now explicitly call out module.h for inclusion. [1] git grep shows some staging drivers also define the same static inline, but they also never ever use it. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31uwb.h: fix implicit use of asm/page.h for PAGE_SIZEPaul Gortmaker
Once we clean up the implicit presence of module.h (and all its sub-includes), we'll see an implicit dependency on page.h for the PAGE_SIZE define. So fix it in advance. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31pm_runtime.h: explicitly requires notifier.hPaul Gortmaker
This file was getting notifier.h via device.h --> module.h but the module.h inclusion is going away, so add notifier.h directly. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.hPaul Gortmaker
The implicit presence of module.h and all its sub-includes was masking these implicit header usages: include/linux/dmaengine.h:684: warning: 'struct page' declared inside parameter list include/linux/dmaengine.h:684: warning: its scope is only this definition or declaration, which is probably not what you want include/linux/dmaengine.h:687: warning: 'struct page' declared inside parameter list include/linux/dmaengine.h:736:2: error: implicit declaration of function 'bitmap_zero' With input from Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31miscdevice.h: fix up implicit use of lists and typesPaul Gortmaker
By removing the implicit presence of module.h from this file, we will see things like: In file included from fs/dlm/user.c:9: include/linux/miscdevice.h:50: error: field ‘list’ has incomplete type include/linux/miscdevice.h:54: error: expected specifier-qualifier-list before ‘mode_t’ Call out lists.h and types.h for inclusion to fix each of the above respectively. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31stop_machine.h: fix implicit use of smp.h for smp_processor_idPaul Gortmaker
This will show up on MIPS when we fix all the implicit header presences that are because of module.h being everywhere. In file included from kernel/trace/ftrace.c:16: include/linux/stop_machine.h: In function 'stop_one_cpu': include/linux/stop_machine.h:50: error: implicit declaration of function 'smp_processor_id' include/linux/stop_machine.h: In function 'stop_cpus': include/linux/stop_machine.h:80: error: implicit declaration of function 'raw_smp_processor_id' Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31of: fix implicit use of errno.h in include/linux/of.hPaul Gortmaker
It shows up as a build failure on MIPS, as it is used in three of_property function stubs. include/linux/of.h:275: error: 'ENOSYS' undeclared (first use in this function) include/linux/of.h:282: error: 'ENOSYS' undeclared (first use in this function) include/linux/of.h:295: error: 'ENOSYS' undeclared (first use in this function) Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31of_platform.h: delete needless include <linux/module.h>Paul Gortmaker
There is nothing modular in this file, and no reason to drag in all the 357 headers that module.h brings with it, since it just slows down compiles. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31acpi: remove module.h include from platform/aclinux.hPaul Gortmaker
This file had an include of module.h which was probably added in relation to this line: #define ACPI_EXPORT_SYMBOL(symbol) EXPORT_SYMBOL(symbol); However, we really expect symbol exporters to grab export.h themselves, and since this is only a define, we can remove the module.h include without aclinux.h itself causing any compile issues. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31miscdevice.h: delete unnecessary inclusion of module.hPaul Gortmaker
This file has a define MODULE_ALIAS_MISCDEV which in turn will use the MODULE_ALIAS define, but only if the former is explicitly used by modular device driver code (and such code should be already including module.h). Delete the include, since module.h is such a giant thing that we don't want it implicitly sneaking into compiles where it isn't specifically required. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>