summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2010-05-25mincore: pass ranges as start,end address pairsJohannes Weiner
Instead of passing a start address and a number of pages into the helper functions, convert them to use a start and an end address. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mincore: break do_mincore() into logical piecesJohannes Weiner
Split out functions to handle hugetlb ranges, pte ranges and unmapped ranges, to improve readability but also to prepare the file structure for nested page table walks. No semantic changes intended. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mincore: cleanupsJohannes Weiner
This fixes some minor issues that bugged me while going over the code: o adjust argument order of do_mincore() to match the syscall o simplify range length calculation o drop superfluous shift in huge tlb calculation, address is page aligned o drop dead nr_huge calculation o check pte_none() before pte_present() o comment and whitespace fixes No semantic changes intended. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25cpuset,mm: fix no node to alloc memory when changing cpuset's memsMiao Xie
Before applying this patch, cpuset updates task->mems_allowed and mempolicy by setting all new bits in the nodemask first, and clearing all old unallowed bits later. But in the way, the allocator may find that there is no node to alloc memory. The reason is that cpuset rebinds the task's mempolicy, it cleans the nodes which the allocater can alloc pages on, for example: (mpol: mempolicy) task1 task1's mpol task2 alloc page 1 alloc on node0? NO 1 1 change mems from 1 to 0 1 rebind task1's mpol 0-1 set new bits 0 clear disallowed bits alloc on node1? NO 0 ... can't alloc page goto oom This patch fixes this problem by expanding the nodes range first(set newly allowed bits) and shrink it lazily(clear newly disallowed bits). So we use a variable to tell the write-side task that read-side task is reading nodemask, and the write-side task clears newly disallowed nodes after read-side task ends the current memory allocation. [akpm@linux-foundation.org: fix spello] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mempolicy: restructure rebinding-mempolicy functionsMiao Xie
Nick Piggin reported that the allocator may see an empty nodemask when changing cpuset's mems[1]. It happens only on the kernel that do not do atomic nodemask_t stores. (MAX_NUMNODES > BITS_PER_LONG) But I found that there is also a problem on the kernel that can do atomic nodemask_t stores. The problem is that the allocator can't find a node to alloc page when changing cpuset's mems though there is a lot of free memory. The reason is like this: (mpol: mempolicy) task1 task1's mpol task2 alloc page 1 alloc on node0? NO 1 1 change mems from 1 to 0 1 rebind task1's mpol 0-1 set new bits 0 clear disallowed bits alloc on node1? NO 0 ... can't alloc page goto oom I can use the attached program reproduce it by the following step: # mkdir /dev/cpuset # mount -t cpuset cpuset /dev/cpuset # mkdir /dev/cpuset/1 # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems # echo $$ > /dev/cpuset/1/tasks # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> & <nr_tasks> = max(nr_cpus - 1, 1) # killall -s SIGUSR1 cpuset_mem_hog # ./change_mems.sh several hours later, oom will happen though there is a lot of free memory. This patchset fixes this problem by expanding the nodes range first(set newly allowed bits) and shrink it lazily(clear newly disallowed bits). So we use a variable to tell the write-side task that read-side task is reading nodemask, and the write-side task clears newly disallowed nodes after read-side task ends the current memory allocation. This patch: In order to fix no node to alloc memory, when we want to update mempolicy and mems_allowed, we expand the set of nodes first (set all the newly nodes) and shrink the set of nodes lazily(clean disallowed nodes), But the mempolicy's rebind functions may breaks the expanding. So we restructure the mempolicy's rebind functions and split the rebind work to two steps, just like the update of cpuset's mems: The 1st step: expand the set of the mempolicy's nodes. The 2nd step: shrink the set of the mempolicy's nodes. It is used when there is no real lock to protect the mempolicy in the read-side. Otherwise we can do rebind work at once. In order to implement it, we define enum mpol_rebind_step { MPOL_REBIND_ONCE, MPOL_REBIND_STEP1, MPOL_REBIND_STEP2, MPOL_REBIND_NSTEP, }; If the mempolicy needn't be updated by two steps, we can pass MPOL_REBIND_ONCE to the rebind functions. Or we can pass MPOL_REBIND_STEP1 to do the first step of the rebind work and pass MPOL_REBIND_STEP2 to do the second step work. Besides that, it maybe long time between these two step and we have to release the lock that protects mempolicy and mems_allowed. If we hold the lock once again, we must check whether the current mempolicy is under the rebinding (the first step has been done) or not, because the task may alloc a new mempolicy when we don't hold the lock. So we defined the following flag to identify it: #define MPOL_F_REBINDING (1 << 2) The new functions will be used in the next patch. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mempolicy: document cpuset interaction with tmpfs mpol mount optionLee Schermerhorn
Update Documentation/filesystems/tmpfs.txt to describe the interaction of tmpfs mount option memory policy with tasks' cpuset mems_allowed. Note: the mount(8) man page [in the util-linux-ng package] requires similiar updates. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mempolicy: factor mpol_shared_policy_init() return pathsLee Schermerhorn
Factor out duplicate put/frees in mpol_shared_policy_init() to a common return path. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mempolicy: rename policy_types and cleanup initializationLee Schermerhorn
Rename 'policy_types[]' to 'policy_modes[]' to better match the array contents. Use designated intializer syntax for policy_modes[]. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mempolicy: lose unnecessary loop variable in mpol_parse_str()Lee Schermerhorn
We don't really need the extra variable 'i' in mpol_parse_str(). The only use is as the the loop variable. Then, it's assigned to 'mode'. Just use mode, and loose the 'uninitialized_var()' macro. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mempolicy: don't call mpol_set_nodemask() when no_contextLee Schermerhorn
No need to call mpol_set_nodemask() when we have no context for the mempolicy. This can occur when we're parsing a tmpfs 'mpol' mount option. Just save the raw nodemask in the mempolicy's w.user_nodemask member for use when a tmpfs/shmem file is created. mpol_shared_policy_init() will "contextualize" the policy for the new file based on the creating task's context. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mempolicy: remove redundant checkBob Liu
Lee's patch "mempolicy: use MPOL_PREFERRED for system-wide default policy" has made the MPOL_DEFAULT only used in the memory policy APIs. So, no need to check in __mpol_equal also. Also get rid of mpol_match_intent() and move its logic directly into __mpol_equal(). Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mempolicy: remove case MPOL_INTERLEAVE from policy_zonelist()Bob Liu
In policy_zonelist() mode MPOL_INTERLEAVE shouldn't happen, so fall through to BUG() instead of break to return. I also fixed the comment. Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mempolicy: remove redundant codeBob Liu
1. In funtion is_valid_nodemask(), varibable k will be inited to 0 in the following loop, needn't init to policy_zone anymore. 2. (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES) has already defined to MPOL_MODE_FLAGS in mempolicy.h. Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mm: remove return value of putback_lru_pages()Minchan Kim
putback_lru_page() never can fail. So it doesn't matter count of "the number of pages put back". In addition, users of this functions don't use return value. Let's remove unnecessary code. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25shmem: remove redundant codeHuang Shijie
prep_new_page() will call set_page_private(page, 0) to initialise the page, so the code is redundant. Signed-off-by: Huang Shijie <shijie8@gmail.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25sparsemem: on no vmemmap path put mem_map on node high tooYinghai Lu
We need to put mem_map high when virtual memmap is not used. before this patch free mem pfn range on first node: [ 0.000000] 19 - 1f [ 0.000000] 28 40 - 80 95 [ 0.000000] 702 740 - 1000 1000 [ 0.000000] 347c - 347e [ 0.000000] 34e7 3500 - 3b80 3b8b [ 0.000000] 73b8b 73bc0 - 73c00 73c00 [ 0.000000] 73ddd - 73e00 [ 0.000000] 73fdd - 74000 [ 0.000000] 741dd - 74200 [ 0.000000] 743dd - 74400 [ 0.000000] 745dd - 74600 [ 0.000000] 747dd - 74800 [ 0.000000] 749dd - 74a00 [ 0.000000] 74bdd - 74c00 [ 0.000000] 74ddd - 74e00 [ 0.000000] 74fdd - 75000 [ 0.000000] 751dd - 75200 [ 0.000000] 753dd - 75400 [ 0.000000] 755dd - 75600 [ 0.000000] 757dd - 75800 [ 0.000000] 759dd - 75a00 [ 0.000000] 79bdd 79c00 - 7d540 7d550 [ 0.000000] 7f745 - 7f750 [ 0.000000] 10000b 100040 - 2080000 2080000 so only 79c00 - 7d540 are major free block under 4g... after this patch, we will get [ 0.000000] 19 - 1f [ 0.000000] 28 40 - 80 95 [ 0.000000] 702 740 - 1000 1000 [ 0.000000] 347c - 347e [ 0.000000] 34e7 3500 - 3600 3600 [ 0.000000] 37dd - 3800 [ 0.000000] 39dd - 3a00 [ 0.000000] 3bdd - 3c00 [ 0.000000] 3ddd - 3e00 [ 0.000000] 3fdd - 4000 [ 0.000000] 41dd - 4200 [ 0.000000] 43dd - 4400 [ 0.000000] 45dd - 4600 [ 0.000000] 47dd - 4800 [ 0.000000] 49dd - 4a00 [ 0.000000] 4bdd - 4c00 [ 0.000000] 4ddd - 4e00 [ 0.000000] 4fdd - 5000 [ 0.000000] 51dd - 5200 [ 0.000000] 53dd - 5400 [ 0.000000] 95dd 9600 - 7d540 7d550 [ 0.000000] 7f745 - 7f750 [ 0.000000] 17000b 170040 - 2080000 2080000 we will have 9600 - 7d540 for major free block... sparse-vmemmap path already used __alloc_bootmem_node_high() Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25page allocator: reduce fragmentation in buddy allocator by adding buddies ↵Corrado Zoccolo
that are merging to the tail of the free lists In order to reduce fragmentation, this patch classifies freed pages in two groups according to their probability of being part of a high order merge. Pages belonging to a compound whose next-highest buddy is free are more likely to be part of a high order merge in the near future, so they will be added at the tail of the freelist. The remaining pages are put at the front of the freelist. In this way, the pages that are more likely to cause a big merge are kept free longer. Consequently there is a tendency to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation. This heuristic was tested on three machines, x86, x86-64 and ppc64 with 3GB of RAM in each machine. The tests were kernbench, netperf, sysbench and STREAM for performance and a high-order stress test for huge page allocations. KernBench X86 Elapsed mean 374.77 ( 0.00%) 375.10 (-0.09%) User mean 649.53 ( 0.00%) 650.44 (-0.14%) System mean 54.75 ( 0.00%) 54.18 ( 1.05%) CPU mean 187.75 ( 0.00%) 187.25 ( 0.27%) KernBench X86-64 Elapsed mean 94.45 ( 0.00%) 94.01 ( 0.47%) User mean 323.27 ( 0.00%) 322.66 ( 0.19%) System mean 36.71 ( 0.00%) 36.50 ( 0.57%) CPU mean 380.75 ( 0.00%) 381.75 (-0.26%) KernBench PPC64 Elapsed mean 173.45 ( 0.00%) 173.74 (-0.17%) User mean 587.99 ( 0.00%) 587.95 ( 0.01%) System mean 60.60 ( 0.00%) 60.57 ( 0.05%) CPU mean 373.50 ( 0.00%) 372.75 ( 0.20%) Nothing notable for kernbench. NetPerf UDP X86 64 42.68 ( 0.00%) 42.77 ( 0.21%) 128 85.62 ( 0.00%) 85.32 (-0.35%) 256 170.01 ( 0.00%) 168.76 (-0.74%) 1024 655.68 ( 0.00%) 652.33 (-0.51%) 2048 1262.39 ( 0.00%) 1248.61 (-1.10%) 3312 1958.41 ( 0.00%) 1944.61 (-0.71%) 4096 2345.63 ( 0.00%) 2318.83 (-1.16%) 8192 4132.90 ( 0.00%) 4089.50 (-1.06%) 16384 6770.88 ( 0.00%) 6642.05 (-1.94%)* NetPerf UDP X86-64 64 148.82 ( 0.00%) 154.92 ( 3.94%) 128 298.96 ( 0.00%) 312.95 ( 4.47%) 256 583.67 ( 0.00%) 626.39 ( 6.82%) 1024 2293.18 ( 0.00%) 2371.10 ( 3.29%) 2048 4274.16 ( 0.00%) 4396.83 ( 2.79%) 3312 6356.94 ( 0.00%) 6571.35 ( 3.26%) 4096 7422.68 ( 0.00%) 7635.42 ( 2.79%)* 8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%) 16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)* 1.64% 2.73% NetPerf UDP PPC64 64 49.98 ( 0.00%) 50.25 ( 0.54%) 128 98.66 ( 0.00%) 100.95 ( 2.27%) 256 197.33 ( 0.00%) 191.03 (-3.30%) 1024 761.98 ( 0.00%) 785.07 ( 2.94%) 2048 1493.50 ( 0.00%) 1510.85 ( 1.15%) 3312 2303.95 ( 0.00%) 2271.72 (-1.42%) 4096 2774.56 ( 0.00%) 2773.06 (-0.05%) 8192 4918.31 ( 0.00%) 4793.59 (-2.60%) 16384 7497.98 ( 0.00%) 7749.52 ( 3.25%) The tests are run to have confidence limits within 1%. Results marked with a * were not confident although in this case, it's only outside by small amounts. Even with some results that were not confident, the netperf UDP results were generally positive. NetPerf TCP X86 64 652.25 ( 0.00%)* 648.12 (-0.64%)* 23.80% 22.82% 128 1229.98 ( 0.00%)* 1220.56 (-0.77%)* 21.03% 18.90% 256 2105.88 ( 0.00%) 1872.03 (-12.49%)* 1.00% 16.46% 1024 3476.46 ( 0.00%)* 3548.28 ( 2.02%)* 13.37% 11.39% 2048 4023.44 ( 0.00%)* 4231.45 ( 4.92%)* 9.76% 12.48% 3312 4348.88 ( 0.00%)* 4396.96 ( 1.09%)* 6.49% 8.75% 4096 4726.56 ( 0.00%)* 4877.71 ( 3.10%)* 9.85% 8.50% 8192 4732.28 ( 0.00%)* 5777.77 (18.10%)* 9.13% 13.04% 16384 5543.05 ( 0.00%)* 5906.24 ( 6.15%)* 7.73% 8.68% NETPERF TCP X86-64 netperf-tcp-vanilla-netperf netperf-tcp tcp-vanilla pgalloc-delay 64 1895.87 ( 0.00%)* 1775.07 (-6.81%)* 5.79% 4.78% 128 3571.03 ( 0.00%)* 3342.20 (-6.85%)* 3.68% 6.06% 256 5097.21 ( 0.00%)* 4859.43 (-4.89%)* 3.02% 2.10% 1024 8919.10 ( 0.00%)* 8892.49 (-0.30%)* 5.89% 6.55% 2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)* 7.08% 7.44% 3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)* 6.87% 7.33% 4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)* 6.86% 8.18% 8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)* 7.49% 5.55% 16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)* 7.36% 6.49% NETPERF TCP PPC64 netperf-tcp-vanilla-netperf netperf-tcp tcp-vanilla pgalloc-delay 64 594.17 ( 0.00%) 596.04 ( 0.31%)* 1.00% 2.29% 128 1064.87 ( 0.00%)* 1074.77 ( 0.92%)* 1.30% 1.40% 256 1852.46 ( 0.00%)* 1856.95 ( 0.24%) 1.25% 1.00% 1024 3839.46 ( 0.00%)* 3813.05 (-0.69%) 1.02% 1.00% 2048 4885.04 ( 0.00%)* 4881.97 (-0.06%)* 1.15% 1.04% 3312 5506.90 ( 0.00%) 5459.72 (-0.86%) 4096 6449.19 ( 0.00%) 6345.46 (-1.63%) 8192 7501.17 ( 0.00%) 7508.79 ( 0.10%) 16384 9618.65 ( 0.00%) 9490.10 (-1.35%) There was a distinct lack of confidence in the X86* figures so I included what the devation was where the results were not confident. Many of the results, whether gains or losses were within the standard deviation so no solid conclusion can be reached on performance impact. Looking at the figures, only the X86-64 ones look suspicious with a few losses that were outside the noise. However, the results were so unstable that without knowing why they vary so much, a solid conclusion cannot be reached. SYSBENCH X86 sysbench-vanilla pgalloc-delay 1 7722.85 ( 0.00%) 7756.79 ( 0.44%) 2 14901.11 ( 0.00%) 13683.44 (-8.90%) 3 15171.71 ( 0.00%) 14888.25 (-1.90%) 4 14966.98 ( 0.00%) 15029.67 ( 0.42%) 5 14370.47 ( 0.00%) 14865.00 ( 3.33%) 6 14870.33 ( 0.00%) 14845.57 (-0.17%) 7 14429.45 ( 0.00%) 14520.85 ( 0.63%) 8 14354.35 ( 0.00%) 14362.31 ( 0.06%) SYSBENCH X86-64 1 17448.70 ( 0.00%) 17484.41 ( 0.20%) 2 34276.39 ( 0.00%) 34251.00 (-0.07%) 3 50805.25 ( 0.00%) 50854.80 ( 0.10%) 4 66667.10 ( 0.00%) 66174.69 (-0.74%) 5 66003.91 ( 0.00%) 65685.25 (-0.49%) 6 64981.90 ( 0.00%) 65125.60 ( 0.22%) 7 64933.16 ( 0.00%) 64379.23 (-0.86%) 8 63353.30 ( 0.00%) 63281.22 (-0.11%) 9 63511.84 ( 0.00%) 63570.37 ( 0.09%) 10 62708.27 ( 0.00%) 63166.25 ( 0.73%) 11 62092.81 ( 0.00%) 61787.75 (-0.49%) 12 61330.11 ( 0.00%) 61036.34 (-0.48%) 13 61438.37 ( 0.00%) 61994.47 ( 0.90%) 14 62304.48 ( 0.00%) 62064.90 (-0.39%) 15 63296.48 ( 0.00%) 62875.16 (-0.67%) 16 63951.76 ( 0.00%) 63769.09 (-0.29%) SYSBENCH PPC64 -sysbench-pgalloc-delay-sysbench sysbench-vanilla pgalloc-delay 1 7645.08 ( 0.00%) 7467.43 (-2.38%) 2 14856.67 ( 0.00%) 14558.73 (-2.05%) 3 21952.31 ( 0.00%) 21683.64 (-1.24%) 4 27946.09 ( 0.00%) 28623.29 ( 2.37%) 5 28045.11 ( 0.00%) 28143.69 ( 0.35%) 6 27477.10 ( 0.00%) 27337.45 (-0.51%) 7 26489.17 ( 0.00%) 26590.06 ( 0.38%) 8 26642.91 ( 0.00%) 25274.33 (-5.41%) 9 25137.27 ( 0.00%) 24810.06 (-1.32%) 10 24451.99 ( 0.00%) 24275.85 (-0.73%) 11 23262.20 ( 0.00%) 23674.88 ( 1.74%) 12 24234.81 ( 0.00%) 23640.89 (-2.51%) 13 24577.75 ( 0.00%) 24433.50 (-0.59%) 14 25640.19 ( 0.00%) 25116.52 (-2.08%) 15 26188.84 ( 0.00%) 26181.36 (-0.03%) 16 26782.37 ( 0.00%) 26255.99 (-2.00%) Again, there is little to conclude here. While there are a few losses, the results vary by +/- 8% in some cases. They are the results of most concern as there are some large losses but it's also within the variance typically seen between kernel releases. The STREAM results varied so little and are so verbose that I didn't include them here. The final test stressed how many huge pages can be allocated. The absolute number of huge pages allocated are the same with or without the page. However, the "unusability free space index" which is a measure of external fragmentation was slightly lower (lower is better) throughout the lifetime of the system. I also measured the latency of how long it took to successfully allocate a huge page. The latency was slightly lower and on X86 and PPC64, more huge pages were allocated almost immediately from the free lists. The improvement is slight but there. [mel@csn.ul.ie: Tested, reworked for less branches] [czoccolo@gmail.com: fix oops by checking pfn_valid_within()] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25tmpfs: insert tmpfs cache pages to inactive list at firstKOSAKI Motohiro
Shaohua Li reported parallel file copy on tmpfs can lead to OOM killer. This is regression of caused by commit 9ff473b9a7 ("vmscan: evict streaming IO first"). Wow, It is 2 years old patch! Currently, tmpfs file cache is inserted active list at first. This means that the insertion doesn't only increase numbers of pages in anon LRU, but it also reduces anon scanning ratio. Therefore, vmscan will get totally confused. It scans almost only file LRU even though the system has plenty unused tmpfs pages. Historically, lru_cache_add_active_anon() was used for two reasons. 1) Intend to priotize shmem page rather than regular file cache. 2) Intend to avoid reclaim priority inversion of used once pages. But we've lost both motivation because (1) Now we have separate anon and file LRU list. then, to insert active list doesn't help such priotize. (2) In past, one pte access bit will cause page activation. then to insert inactive list with pte access bit mean higher priority than to insert active list. Its priority inversion may lead to uninteded lru chun. but it was already solved by commit 645747462 (vmscan: detect mapped file pages used only once). (Thanks Hannes, you are great!) Thus, now we can use lru_cache_add_anon() instead. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25xtensa: includecheck fix: vectors.SJaswinder Singh Rajput
fix the following 'make includecheck' warnings: arch/xtensa/kernel/vectors.S: asm/processor.h is included more than once. arch/xtensa/kernel/vectors.S: asm/ptrace.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25xtensa: convert to asm-generic/hardirq.hChristoph Hellwig
Also remove lots of unused irq_cpustat fields. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25xtensa: set ARCH_KMALLOC_MINALIGNFUJITA Tomonori
Architectures that handle DMA-non-coherent memory need to set ARCH_KMALLOC_MINALIGN to make sure that kmalloc'ed buffer is DMA-safe: the buffer doesn't share a cache with the others. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Chris Zankel <chris@zankel.net> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25watchdog: Driver for the watchdog timer on Freescale IMX2 (and later) ↵Wolfram Sang
processors. This is the driver for the hardware watchdog on the Freescale IMX2 and later processors. Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Cc: Vladimir Zapolskiy <vzapolskiy@gmail.com> Cc: Sascha Hauer <s.hauer@pengutronix.de> Tested-by: Juergen Beisert <jbe@pengutronix.de> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2010-05-25power_supply: Fix regression for 'type' propertyDaniel Mack
Commit 5f487cd34f4337f9bc27ca19da72a39d1b0a0ab4 (power_supply: Use attribute groups) causes a regression the power supply core does not export the 'type' attribute anymore. POWER_SUPPLY_PROP_TYPE is handled by the power supply core without the low-level driver, so power_supply_attr_is_visible() must always return the entry as readable. Reported-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Daniel Mack <daniel@caiaq.de> Tested-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com>
2010-05-25regulator: return set_mode is same mode is requestedSundar R Iyer
On Mon, 2010-05-17 at 17:34 +0200, Mark Brown wrote: > This doesn't seem like the right error handling - if the driver has a > set_mode() you'd *expect* it to have a get_mode() but there's no need > for it to be a strict requirement. True. In such a case, even a valid request would be lost! So now in the updated patch: - check if get_mode is present to avoid oops; - if get_mode is not present, proceed anyways for the request. Here is the updated patch: >From bad0d5eb51ef84be5b100e3dd0f5a590ea0529b6 Mon Sep 17 00:00:00 2001 From: Sundar R Iyer <sundar.iyer@stericsson.com> Date: Fri, 14 May 2010 15:14:17 +0530 Subject: [PATCH 1/1] regulator: return set_mode when same mode is requested save I/O costs by returning when the same mode is requested for the regulator Cc: Liam Girdwood <lrg@slimlogic.co.uk> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Acked-by: Linus Walleij <linus.walleij@stericsson.com> Signed-off-by: Sundar R Iyer <sundar.iyer@stericsson.com> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
2010-05-25Regulators: ab3100/bq24022: add a missing .owner field in regulator_descAxel Lin
This patch adds a missing .owner field in regulator_desc, which is used for refcounting. Signed-off-by: Axel Lin <axel.lin@gmail.com> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
2010-05-25twl6030: regulator: Remove vsel tables and use formula for calculationRajendra Nayak
All twl6030 regulators can be programmed from 1.0v to 3.3v with 100mV steps. The below formula can be used to calculate the vsel values to be programmed in the VREG_VOLTAGE registers. Voltage(in mV) = 1000mv + 100mv * (vsel - 1) Ex: if vsel = 0x9, mV = 1000 + 100 * (9 -1) = 1800mV. This patch removes all existing VSEL tables for twl6030 adjustable regulators and just uses the formula directly for vsel calculations after verifing they fall in the allowed range. Signed-off-by: Rajendra Nayak <rnayak@ti.com> Cc: Liam Girdwood <lrg@slimlogic.co.uk> Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
2010-05-25mc13783-regulator: fix vaild voltage range checking for ↵Axel Lin
mc13783_fixed_regulator_set_voltage In the case of "min_uV == max_uV == mc13783_regulators[id].voltages[0]", mc13783_fixed_regulator_set_voltage should return 0 instead of -EINVAL. This patch also adds a missing ">" character for MODULE_AUTHOR, a trivial fix. Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Liam Girdwood <lrg@slimlogic.co.uk> Acked-by: Sascha Hauer <s.hauer@pengutronix.de> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
2010-05-25regulator: use voltage number array in 88pm860xHaojian Zhuang
A lot of condition comparision statements are used in original driver. These statements are used to check the boundary of voltage numbers since voltage number isn't linear. Now use array of voltage numbers instead. Clean code with simpler way. Signed-off-by: Haojian Zhuang <haojian.zhuang@marvell.com> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
2010-05-25regulator: make 88pm860x sharing one driver structureHaojian Zhuang
Remove a lot of driver structures in 88pm860x driver. Make regulators share one driver structure. Signed-off-by: Haojian Zhuang <haojian.zhuang@marvell.com> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
2010-05-25regulator: simplify regulator_register() error handlingJani Nikula
Simply remove all consumer supplies for the regulator on errors. Remove unset_consumer_device_supply() which is no longer used. Signed-off-by: Jani Nikula <ext-jani.1.nikula@nokia.com> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
2010-05-25regulator: fix unset_regulator_supplies() to remove all matchesJani Nikula
Remove all matching consumer supplies, not just the first, to not leave dangling pointers. Signed-off-by: Jani Nikula <ext-jani.1.nikula@nokia.com> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
2010-05-25regulator: prevent registration of matching regulator consumer suppliesJani Nikula
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Pointer comparison is not sufficient for non-NULL device name matching, so use strcmp(). Otherwise the semantics remain the same. Signed-off-by: Jani Nikula <ext-jani.1.nikula@nokia.com> Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
2010-05-25regulator: Allow regulator-regulator supplies to be specified by nameMark Brown
When one regulator supplies another allow the relationship to be specified using names rather than struct regulators, in a similar manner to that allowed for consumer supplies. This allows static configuration at compile time, reducing the need for dynamic init code. Also change the references to LINE supply to be system supply since line is sometimes used for actual supplies and therefore potentially confusing. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
2010-05-25watchdog: s3c2410_wdt - Fix on handling of the request_mem_region failBanajit Goswami
If the request for wdt_mem region fails, this patch modifies the driver such that, it does not try to release the wdt_mem region on exit. Signed-off-by: Banajit Goswami <banajit.g@samsung.com> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com> Signed-off-by: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2010-05-25watchdog: s3c2410_wdt - Add extra option to include watchdog for Samsung SoCsBanajit Goswami
This patch adds HAVE_S3C2410_WATCHDOG to control inclusion of watchdog driver for Samsung SoCs. This option will help to include the driver only for the necessary machines and not for all for any given arch. Signed-off-by: Banajit Goswami <banajit.g@samsung.com> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com> Signed-off-by: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2010-05-25iTCO_wdt: fix TCO V1 timeout values and limitsPádraig Brady
For TCO V1 devices the programmed timeout was twice too long because the fact that the TCO V1 timer needs to count down twice before triggering the watchdog, wasn't accounted for. Also the timeout values in the module description and error message were clarified. And the _STS registers are 16 bit instead of 8 bit. Signed-off-by: Pádraig Brady <P@draigBrady.com> Tested-by: Simon Kagstrom <simon.kagstrom@netinsight.se> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2010-05-25watchdog: twl4030_wdt: Disable watchdog during probingAmeya Palande
If we are not able to register then it is better to have watchdog in disabled state than noticing a system reboot. Signed-off-by: Ameya Palande <ameya.palande@nokia.com> Acked-By: Timo Kokkonen <timo.t.kokkonen@nokia.com> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2010-05-25watchdog: update/improve/consolidate watchdog driverRandy Dunlap
Move the limited watchdog driver help from kernel-parameters.txt to Documentation/watchdog/watchdog-parameters.txt and add info to it for all watchdog drivers except the ones that have driver-specific files already. Correct minor comments and MODULE_PARM_DESC() text in 2 places. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2010-05-25watchdog: booke_wdt: fix ioctl status flagsWim Van Sebroeck
The WDIOC_GETSTATUS & WDIOC_GETBOOTSTATUS ioctl calls return the WDIOF_* flags and nothing else. Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2010-05-25watchdog: fix several MODULE_PARM_DESC stringsRandy Dunlap
Fix MODULE_PARM_DESC() strings in several watchdog drivers. Some are simple as add a parenthesis. Others are problems from __stringify() being used on a variable name instead of a macro name, so the variable name is produced in the string instead of its build-time value. In these cases, create a macro for the value so that the module param description string is useful. Only pc87413_wdt has been built (due to toolchains). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2010-05-25watchdog: bfin: use new common Blackfin watchdog headerMike Frysinger
use new common Blackfin watchdog header Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2010-05-24sock.h: fix kernel-doc warningRandy Dunlap
Fix sock.h kernel-doc warning: Warning(include/net/sock.h:1438): No description found for parameter 'wq' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-24cls_cgroup: Fix build error when built-inHerbert Xu
There is a typo in cgroup_cls_state when cls_cgroup is built-in. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-25spi/xilinx: Fix compile errorGrant Likely
Commit 58f9b0b02414062eaff46716bc04b47d7e79add5, "of: eliminate of_device->node and dev_archdata->{of,prom}_node" changed the location of the device_node pointer. Most drivers were converted to the new location, but the xilinx_spi_of driver was missed and now fails to compile. This patch fixes up the xilinx_spi_of driver to use the new location. Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-05-25Merge remote branch 'origin' into secretlab/next-spiGrant Likely
2010-05-25spi/davinci: Fix clock prescale factor computationThomas Koeller
Computation of the clock prescaler value returned bogus results if the requested SPI clock was impossible to set. It now sets either the maximum or minimum clock frequency, as appropriate. Signed-off-by: Thomas Koeller <thomas.koeller@baslerweb.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-05-25spi: move bitbang txrx utility functions to private headerhartleys
A number of files in drivers/spi fail checkincludes.pl due to the double include of <linux/spi/spi_bitbang.h>. The first include is needed to get the struct spi_bitbang definition and the spi_bitbang_* function prototypes. The second include happens after defining EXPAND_BITBANG_TXRX to get the inlined bitbang_txrx_* utility functions. The <linux/spi/spi_bitbang.h> header is also included by a number of other spi drivers, as well as some arch/ code, in order to use struct spi_bitbang and the associated functions. To fix the double include, and remove any potential confusion about it, move the inlined bitbang_txrx_* functions to a new private header in drivers/spi and also remove the need to define EXPAND_BITBANG_TXRX. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-05-25spi/mpc5121: Add SPI master driver for MPC5121 PSCAnatolij Gustschin
Signed-off-by: John Rigby <jcrigby@gmail.com> Signed-off-by: Anatolij Gustschin <agust@denx.de> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-05-25powerpc/mpc5121: move PSC FIFO memory init to platform codeAnatolij Gustschin
Since PSC could also be used in other modes than UART mode we move PSC FIFO memory initialization from serial driver to common platform code. The initialized FIFO memory slices may not overlap, so the most easy way would be to configure them all at once at init time for all PSC devices. This is now done by this patch. Signed-off-by: Anatolij Gustschin <agust@denx.de> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-05-25spi/ep93xx: implemented driver for Cirrus EP93xx SPI controllerMika Westerberg
This patch adds an SPI master driver for the Cirrus EP93xx SPI controller found in EP93xx chips. Signed-off-by: Mika Westerberg <mika.westerberg@iki.fi> Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Acked-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>