2009-12-15hugetlb: derive huge pages nodes allowed from task mempolicyLee Schermerhorn
This patch derives a "nodes_allowed" node mask from the numa mempolicy of the task modifying the number of persistent huge pages to control the allocation, freeing and adjusting of surplus huge pages when the pool page count is modified via the new sysctl or sysfs attribute "nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer is produced. This will cause the hugetlb subsystem to use node_online_map as the "nodes_allowed". This preserves the behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation, a nodemask with the single preferred node will be produced. "local" policy will NOT track any internode migrations of the task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask will be used.
* Other than to inform the construction of the nodes_allowed node mask, the actual mempolicy mode is ignored. That is, all modes behave like interleave over the resulting nodes_allowed mask with no "fallback".

See the updated documentation [next patch] for more information about the implications of this patch.

Examples:

Starting with:

    Node 0 HugePages_Total: 0
    Node 1 HugePages_Total: 0
    Node 2 HugePages_Total: 0
    Node 3 HugePages_Total: 0

Default behavior [with or without this patch] balances persistent hugepage allocation across nodes [with sufficient contiguous memory]:

    sysctl vm.nr_hugepages[_mempolicy]=32

yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 8
    Node 3 HugePages_Total: 8

Of course, we only have nr_hugepages_mempolicy with the patch, but with default mempolicy, nr_hugepages_mempolicy behaves the same as nr_hugepages.

Applying mempolicy--e.g., with numactl [using '-m' a.k.a. '--membind' because it allows multiple nodes to be specified and it's easy to type]--we can allocate huge pages on individual nodes or sets of nodes. So, starting from the condition above, with 8 huge pages per node, add 8 more to node 2 using:

    numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

This yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

The incremental 8 huge pages were restricted to node 2 by the specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages from specified nodes:

    numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

yields:

    Node 0 HugePages_Total: 4
    Node 1 HugePages_Total: 4
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

The 8 huge pages freed were balanced over nodes 0 and 1.

[rientjes@google.com: accommodate reworked NODEMASK_ALLOC] Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
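The derivation rules listed in this commit message can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: `enum policy_mode`, `struct task_policy` and `derive_nodes_allowed()` are hypothetical names, and a plain 64-bit integer stands in for nodemask_t.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the mempolicy modes named in the message. */
enum policy_mode {
	POLICY_DEFAULT,
	POLICY_PREFERRED,
	POLICY_BIND,
	POLICY_INTERLEAVE
};

struct task_policy {
	enum policy_mode mode;
	int preferred_node;	/* meaningful for POLICY_PREFERRED only */
	uint64_t nodes;		/* meaningful for POLICY_BIND / POLICY_INTERLEAVE */
};

/*
 * Fill *mask with the allowed nodes and return true, or return false for
 * the default policy. A false return plays the role of the NULL nodemask
 * pointer: the caller falls back to all online nodes.
 */
bool derive_nodes_allowed(const struct task_policy *pol, uint64_t *mask)
{
	if (pol == NULL)
		return false;

	switch (pol->mode) {
	case POLICY_PREFERRED:
		/* A mask holding only the single preferred node. */
		*mask = UINT64_C(1) << pol->preferred_node;
		return true;
	case POLICY_BIND:
	case POLICY_INTERLEAVE:
		/* The policy's own node mask is used unchanged. */
		*mask = pol->nodes;
		return true;
	case POLICY_DEFAULT:
	default:
		return false;
	}
}
```

In this sketch the mode only selects how the mask is built; once the mask exists, the caller would treat every mode the same way, which mirrors the "all modes behave like interleave" rule.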
2009-12-15hugetlb: factor init_nodemask_of_node()Lee Schermerhorn
Factor init_nodemask_of_node() out of the nodemask_of_node() macro. This will be used to populate the huge pages "nodes_allowed" nodemask for a single node when basing nodes_allowed on a preferred/local mempolicy or when a persistent huge page pool page count is modified via a per node sysfs attribute. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15hugetlb: add nodemask arg to huge page alloc, free and surplus adjust functionsLee Schermerhorn
In preparation for constraining huge page allocation and freeing by the controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer to the allocate, free and surplus adjustment functions. For now, pass NULL to indicate default behavior--i.e., use node_online_map. A subsequent patch will derive a non-default mask from the controlling task's numa mempolicy. Note that this method of updating the global hstate nr_hugepages under the constraint of a nodemask simplifies keeping the global state consistent--especially the number of persistent and surplus pages relative to reservations and overcommit limits. There are undoubtedly other ways to do this, but this works for both interfaces: mempolicy and per node attributes. [rientjes@google.com: fix HIGHMEM compile error] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15hugetlb: rework hstate_next_node_* functionsLee Schermerhorn
Modify the hstate_next_node* functions to allow them to be called to obtain the "start_nid". Then, whereas prior to this patch we unconditionally called hstate_next_node_to_{alloc|free}(), whether or not we successfully allocated/freed a huge page on the node, now we only call these functions on failure to alloc/free to advance to next allowed node. Factor out the next_node_allowed() function to handle wrap at end of node_online_map. In this version, the allowed nodes include all of the online nodes. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
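The "wrap at end of the node map" behaviour described here can be illustrated with a small userspace helper. This is a sketch under assumptions, not the kernel's next_node_allowed(): the 64-node limit and the use of a uint64_t bitmask in place of a nodemask are choices made only for the example.

```c
#include <stdint.h>

#define SKETCH_MAX_NODES 64	/* assumed limit for this illustration */

/*
 * Return the first allowed node strictly after nid, wrapping around to
 * the lowest allowed node when the end of the mask is reached. If nid is
 * the only allowed node, it is returned again. Returns -1 for an empty
 * mask. nid must be in the range 0..SKETCH_MAX_NODES-1.
 */
int next_node_allowed(int nid, uint64_t allowed)
{
	if (allowed == 0)
		return -1;

	for (int step = 1; step <= SKETCH_MAX_NODES; step++) {
		int candidate = (nid + step) % SKETCH_MAX_NODES;

		if (allowed & (UINT64_C(1) << candidate))
			return candidate;
	}
	return -1;	/* not reached when allowed != 0 */
}
```

A caller following the commit's approach would try the current node first and call such a helper only after an allocation or free fails on that node.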
2009-12-15nodemask: make NODEMASK_ALLOC more generalDavid Rientjes
This is a series of patches to provide control over the location of the allocation and freeing of persistent huge pages on a NUMA platform. Please consider for merging into mmotm. This series uses two mechanisms to constrain the nodes from which persistent huge pages are allocated: 1) the task NUMA mempolicy of the task modifying a new sysctl "nr_hugepages_mempolicy", based on a suggestion by Mel Gorman; and 2) a subset of the hugepages hstate sysfs attributes have been added [in V4] to each node system device under: /sys/devices/node/node[0-9]*/hugepages The per node attributes allow direct assignment of a huge page count on a specific node, regardless of the task's mempolicy or cpuset constraints. This patch: NODEMASK_ALLOC(x, m) assumes x is a type of struct, which is unnecessary. It's perfectly reasonable to use this macro to allocate a nodemask_t, which is anonymous, either dynamically or on the stack depending on NODES_SHIFT. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15mm: move inc_zone_page_state(NR_ISOLATED) to just isolated placeKOSAKI Motohiro
Christoph pointed out that inc_zone_page_state(NR_ISOLATED) should be placed right after isolate_page(). This patch does so. Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15/dev/mem: remove redundant parameter from do_write_kmem()Wu Fengguang
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15/dev/mem: remove the "written" variable in write_kmem()Wu Fengguang
Also rename "len" to "sz". No behavior change. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15/dev/mem: make size_inside_page() logic straightWu Fengguang
Also convert more size_inside_page() users. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15/dev/mem: cleanup unxlate_dev_mem_ptr() callsWu Fengguang
No behaviour change. [akpm@linux-foundation.org: cleanuplets] [akpm@linux-foundation.org: remove unused `ret'] Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Avi Kivity <avi@qumranet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15/dev/mem: introduce size_inside_page()Wu Fengguang
Introduce size_inside_page() to replace duplicate /dev/mem code. Also apply it to /dev/kmem, whose alignment logic was buggy. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Avi Kivity <avi@qumranet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
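What a helper like size_inside_page() computes can be shown in a few lines of userspace C. This sketch is written from the description above rather than copied from the patch; the 4096-byte page size and the `SKETCH_PAGE_SIZE` name are assumptions for the example.

```c
/* Assumed page size for the illustration; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Return how many of the requested `size` bytes starting at address
 * `start` lie within the page that contains `start`. A copy loop can
 * use this to make sure no single step crosses a page boundary.
 */
unsigned long size_inside_page(unsigned long start, unsigned long size)
{
	unsigned long to_page_end =
		SKETCH_PAGE_SIZE - (start & (SKETCH_PAGE_SIZE - 1));

	return size < to_page_end ? size : to_page_end;
}
```

For example, a 100-byte request starting 6 bytes before a page boundary is clipped to 6 bytes, and the caller handles the remaining 94 bytes on the next iteration.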
2009-12-15/dev/mem: remove redundant test on lenWu Fengguang
The len test in write_kmem() is always true, so can be reduced. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Avi Kivity <avi@qumranet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()KOSAKI Motohiro
On ia64, the following test program exits abnormally, because the glibc thread library calls abort().

    ========================================================
    (gdb) bt
    #0 0xa000000000010620 in __kernel_syscall_via_break ()
    #1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
    #2 0x2000000000324090 in abort () from /lib/libc.so.6.1
    #3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
    #4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
    #5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
    ========================================================

The cause is that glibc calls munmap() at thread exit to free the stack, and it assumes munmap() never fails. However, munmap() often has to split a vma, and when the process already has many mappings the split fails with -ENOMEM. That is wrong, because unmapping a stack never increases the map count in the end: the limit is exceeded only temporarily. A temporary, internal excess should not produce ENOMEM. This patch makes munmap() tolerate it.

    test_max_mapcount.c
    ==================================================================
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <pthread.h>
    #include <errno.h>
    #include <unistd.h>

    #define THREAD_NUM 30000
    #define MAL_SIZE (8*1024*1024)

    /* Short-lived thread: allocates a large block, then exits after 10s. */
    void *wait_thread(void *args)
    {
        void *addr;

        addr = malloc(MAL_SIZE);
        sleep(10);

        return NULL;
    }

    /* Long-lived thread: keeps its stack mapping around for 60s. */
    void *wait_thread2(void *args)
    {
        sleep(60);

        return NULL;
    }

    int main(int argc, char *argv[])
    {
        int i;
        pthread_t thread[THREAD_NUM], th;
        int ret, count = 0;
        pthread_attr_t attr;

        ret = pthread_attr_init(&attr);
        if (ret) {
            perror("pthread_attr_init");
        }

        ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
        if (ret) {
            perror("pthread_attr_setdetachstate");
        }

        /* Create pairs of detached threads to drive the map count up. */
        for (i = 0; i < THREAD_NUM; i++) {
            ret = pthread_create(&th, &attr, wait_thread, NULL);
            if (ret) {
                fprintf(stderr, "[%d] ", count);
                perror("pthread_create");
            } else {
                printf("[%d] create OK.\n", count);
            }
            count++;

            ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
            if (ret) {
                fprintf(stderr, "[%d] ", count);
                perror("pthread_create");
            } else {
                printf("[%d] create OK.\n", count);
            }
            count++;
        }
        sleep(3600);
        return 0;
    }
    ==================================================================

[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15page-types: exit early when invoked with -d|--describeAlex Chiang
On a system with a large amount of memory (256GB), invoking page-types can take quite a long time, which is unreasonable considering the user only wants a description of the flags:

    # time ./page-types -d 0x10
    0x0000000000000010 ____D_____________________________ dirty

    real 0m34.285s
    user 0m1.966s
    sys 0m32.313s

This is because we still walk the entire address range. Exiting early seems like a reasonable solution:

    # time ./page-types -d 0x10
    0x0000000000000010 ____D_____________________________ dirty

    real 0m0.007s
    user 0m0.001s
    sys 0m0.005s

Signed-off-by: Alex Chiang <achiang@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Haicheng Li <haicheng.li@intel.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15page-types: whitespace alignmentAlex Chiang
Align the output when page-types -h is invoked. Signed-off-by: Alex Chiang <achiang@hp.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15page-types: learn to describe flags directly from command lineAlex Chiang
Teach page-types to describe page flags directly from the command line.

Why is this useful? For instance, if you're using memory hotplug and see this in /var/log/messages:

    kernel: removing from LRU failed 3836dd0/1/1e00000000000010

It would be nice to decode those page flags without staring at the source.

Example usage and output:

    # Documentation/vm/page-types -d 0x10
    0x0000000000000010 ____D_____________________________ dirty

    # Documentation/vm/page-types -d anon
    0x0000000000001000 ____________a_____________________ anonymous

    # Documentation/vm/page-types -d anon,0x10
    0x0000000000001010 ____D_______a_____________________ dirty,anonymous

[achiang@hp.com: documentation] Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Haicheng Li <haicheng.li@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
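The decoding shown in the examples (0x10 is "dirty", 0x1000 is "anonymous") amounts to a table lookup over set bits. The following is a minimal sketch, not the page-types source: `describe_flags()` and the `known_flags` table are hypothetical, and the table contains only the two bits that appear in this commit message.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct flag_name {
	unsigned int bit;
	const char *name;
};

/* Only the two flags shown in the examples; the real tool knows more. */
static const struct flag_name known_flags[] = {
	{ 4, "dirty" },		/* 0x10 */
	{ 12, "anonymous" },	/* 0x1000 */
};

/*
 * Write the comma-separated names of all known flags set in `flags`
 * into buf (capacity len) and return buf. Output that does not fit is
 * dropped at a name boundary.
 */
char *describe_flags(uint64_t flags, char *buf, size_t len)
{
	size_t used = 0;

	if (len == 0)
		return buf;
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(known_flags) / sizeof(known_flags[0]); i++) {
		int n;

		if (!(flags & (UINT64_C(1) << known_flags[i].bit)))
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "," : "", known_flags[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* discard the partial name */
			break;
		}
		used += (size_t)n;
	}
	return buf;
}
```

Because nothing here touches /proc, a lookup like this finishes immediately, which is why describing flags does not need to walk any address range.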
2009-12-15page-types: unsigned cannot be less than 0 in add_page()Roel Kluin
If not signed, testing of the read() return value in this function will not work. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
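The class of bug fixed here is easy to reproduce in isolation. The sketch below is not the page-types code; `stored_unsigned()` and `read_failed()` are hypothetical helpers that show what happens to a read() error return when it is kept in an unsigned variable versus a signed one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Buggy shape: the read() result is kept in an unsigned variable.
 * An error return of -1 turns into a very large positive value, so a
 * later "is it negative?" test can never detect the failure.
 */
size_t stored_unsigned(ssize_t read_ret)
{
	return (size_t)read_ret;
}

/* Fixed shape: keep the result signed so the error test works. */
bool read_failed(ssize_t read_ret)
{
	return read_ret < 0;
}
```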
2009-12-15page-types: constify read only arraysTommi Rantala
Signed-off-by: Tommi Rantala <tt.rantala@gmail.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15oom: dump stack and VM state when oom killer panicsDavid Rientjes
The oom killer header, including information such as the allocation order and gfp mask, current's cpuset and memory controller, call trace, and VM state information is currently only shown when the oom killer has selected a task to kill. This information is omitted, however, when the oom killer panics either because of panic_on_oom sysctl settings or when no killable task was found. It is still relevant to know crucial pieces of information such as the allocation order and VM state when diagnosing such issues, especially at boot. This patch displays the oom killer header whenever it panics so that bug reports can include pertinent information to debug the issue, if possible. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15MAINTAINERS: new kbuild maintainerMichal Marek
Sam was fine with handing over kbuild maintainership to me. The git trees are already in linux-next, a merge request will follow shortly. Acked-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Michal Marek <mmarek@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15hfs: fix a potential buffer overflowAmerigo Wang
A specially-crafted Hierarchical File System (HFS) filesystem could cause a buffer overflow to occur in a process's kernel stack during a memcpy() call within the hfs_bnode_read() function (at fs/hfs/bnode.c:24). The attacker can provide the source buffer and length, and the destination buffer is a local variable of a fixed length. This local variable (passed as "&entry" from fs/hfs/dir.c:112 and allocated on line 60) is stored in the stack frame of hfs_bnode_read()'s caller, which is hfs_readdir(). Because the hfs_readdir() function executes upon any attempt to read a directory on the filesystem, it gets called whenever a user attempts to inspect any filesystem contents. [amwang@redhat.com: modify this patch and fix coding style problems] Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Eugene Teo <eteo@redhat.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Dave Anderson <anderson@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
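The commit message describes the overflow but not the shape of the fix. A common way to close this kind of hole is to validate an untrusted length against the destination size before copying; the sketch below shows that general technique and is an assumption, not the actual hfs change. `bounded_read()` is a hypothetical name.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `len` bytes from src into dst only if they fit in dst_size.
 * `len` is treated as untrusted (for example, read from disk), so
 * negative and oversized values are rejected with -EIO instead of
 * being passed on to memcpy().
 */
int bounded_read(void *dst, size_t dst_size, const void *src, long len)
{
	if (len < 0 || (size_t)len > dst_size)
		return -EIO;

	memcpy(dst, src, (size_t)len);
	return 0;
}
```

The important property is that the check happens in the caller that owns the fixed-size buffer, since only that caller knows how large the destination is.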
2009-12-15bsdacct: fix uid/gid misreportingAlexey Dobriyan
commit d8e180dcd5bbbab9cd3ff2e779efcf70692ef541 "bsdacct: switch credentials for writing to the accounting file" introduced credential switching during final acct data collection. However, the uid/gid pair continued to be collected from current, whose credentials had by then become those of whoever created the acct file rather than those of the exiting task. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14676 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: Juho K. Juopperi <jkj@kapsi.fi> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15Merge commit 'linus' into nextDmitry Torokhov
2009-12-15ipvs: zero usvc and udestSimon Horman
Make sure that any otherwise uninitialised fields of usvc are zero. This has been observed to cause a problem whereby the port of fwmark services may end up as a non-zero value, which causes scheduling of a destination server to fail for persistent services. As observed by Deon van der Merwe <dvdm@truteq.co.za>. This fix suggested by Julian Anastasov <ja@ssi.bg>. For good measure also zero udest. Cc: Deon van der Merwe <dvdm@truteq.co.za> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Cc: stable@kernel.org Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-12-15netfilter: fix crashes in bridge netfilter caused by fragment jumpsPatrick McHardy
When fragments from bridge netfilter are passed to IPv4 or IPv6 conntrack and a reassembly queue with the same fragment key already exists from reassembling a similar packet received on a different device (e.g. with multicast fragments), the reassembled packet might continue on a different codepath than where the head fragment originated. This can cause crashes in bridge netfilter when a fragment received on a non-bridge device (and thus with skb->nf_bridge == NULL) continues through the bridge netfilter code. Add a new reassembly identifier for packets originating from bridge netfilter and use it to put those packets in isolated queues. Fixes http://bugzilla.kernel.org/show_bug.cgi?id=14805 Reported-and-Tested-by: Chong Qiao <qiaochong@loongson.cn> Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-12-15ipv6: reassembly: use separate reassembly queues for conntrack and local deliveryPatrick McHardy
Currently the same reassembly queue might be used for packets reassembled by conntrack in different positions in the stack (PREROUTING/LOCAL_OUT), as well as local delivery. This can cause "packet jumps" when the fragment completing a reassembled packet is queued from a different position in the stack than the previous ones. Add a "user" identifier to the reassembly queue key to separate the queues of each caller, similar to what we do for IPv4. Signed-off-by: Patrick McHardy <kaber@trash.net>
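The effect of adding a "user" identifier to the queue key can be sketched as follows. This is a simplified illustration, not the kernel's data structures: `enum defrag_user`, `struct frag_key` and `frag_key_match()` are hypothetical, and 32-bit integers stand in for the real addresses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical identifiers for the callers that reassemble packets. */
enum defrag_user {
	USER_LOCAL_DELIVER,
	USER_CONNTRACK_PREROUTING,
	USER_CONNTRACK_LOCAL_OUT
};

/* Simplified queue key; real addresses are wider than 32 bits for IPv6. */
struct frag_key {
	uint32_t id;		/* fragment identification */
	uint32_t saddr;
	uint32_t daddr;
	enum defrag_user user;	/* which caller owns the queue */
};

/*
 * Two fragments share a reassembly queue only if every key field,
 * including the user, matches. Fragments queued from different
 * positions in the stack therefore never complete each other's packets.
 */
bool frag_key_match(const struct frag_key *a, const struct frag_key *b)
{
	return a->id == b->id &&
	       a->saddr == b->saddr &&
	       a->daddr == b->daddr &&
	       a->user == b->user;
}
```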
2009-12-15USB: Close usb_find_interface race v3Russ Dill
USB drivers that create character devices call usb_register_dev in their probe function. This associates the usb_interface device with that minor number and creates the character device and announces it to the world. However, the driver's probe function is called before the new usb_interface is added to the driver's klist_devices. This is a problem because userspace will respond to the character device creation announcement by opening the character device. The driver's open function will then call usb_find_interface to find the usb_interface associated with that minor number. usb_find_interface will walk the driver's list of devices and find the usb_interface with the matching minor number. Because the announcement happens before the usb_interface is added to the driver's klist_devices, a race condition exists. A straightforward fix is to walk the list of devices on usb_bus_type instead since the device is added to that list before the announcement occurs. bus_find_device calls get_device to bump the reference count on the found device. It is arguable that the reference count should be dropped by the caller of usb_find_interface instead of usb_find_interface, however, the current users of usb_find_interface do not expect this. The original version of this patch only matched against minor number instead of driver and minor number. This version matches against both. Signed-off-by: Russ Dill <Russ.Dill@gmail.com> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-15Revert "USB: Close usb_find_interface race"Greg Kroah-Hartman
This reverts commit a2582bd478c13c574d4c16ef1209d333f2a25935. It turned out to be buggy and broke USB printers. Cc: Russ Dill <Russ.Dill@gmail.com> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-15edac, mce: correct corenum reportingBorislav Petkov
Fix core number reporting with NB MCEs. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-15perf diff: Fix documentationArnaldo Carvalho de Melo
Add a newline to fix this problem: ERROR: perf-diff.txt: line 31: closing [blockdef-listing] delimiter expected Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frédéric Weisbecker <fweisbec@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1260882082-10007-1-git-send-email-acme@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-15perf diff: Improve the help textIngo Molnar
Fix the short line displayed by 'perf' and also fix some other details in the longer text. Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-15Blackfin: Convert BUG() to use unreachable()David Daney
Use the new unreachable() macro instead of for(;;); Signed-off-by: David Daney <ddaney@caviumnetworks.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
2009-12-15sh: mach-ecovec24: Add FSI sound supportKuninori Morimoto
Signed-off-by: Kuninori Morimoto <morimoto.kuninori@renesas.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2009-12-15sh: mach-ecovec24: Add mt9t112 camera supportKuninori Morimoto
Signed-off-by: Kuninori Morimoto <morimoto.kuninori@renesas.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2009-12-15sh: mach-ecovec24: Add tw9910 supportKuninori Morimoto
Signed-off-by: Kuninori Morimoto <morimoto.kuninori@renesas.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2009-12-15perf_event: Fix incorrect range check on cpu numberPaul Mackerras
It is quite legitimate for CPUs to be numbered sparsely, meaning that it is possible for an online CPU to have a number which is greater than the total count of possible CPUs. Currently find_get_context() has a sanity check on the cpu number where it checks it against num_possible_cpus(). This test can fail for a legitimate cpu number if the cpu_possible_mask is sparsely populated. This fixes the problem by checking the CPU number against nr_cpumask_bits instead, since that is the appropriate check to ensure that the cpu number is safe to pass to cpu_isset() subsequently. Reported-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Tested-by: Michael Neuling <mikey@neuling.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@kernel.org> LKML-Reference: <20091215084032.GA18661@brick.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
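The difference between the two range checks shows up as soon as the possible-CPU mask has holes. The sketch below is a userspace illustration under assumptions: a uint64_t stands in for cpu_possible_mask, `SKETCH_NR_CPUMASK_BITS` stands in for nr_cpumask_bits, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_CPUMASK_BITS 64	/* assumed width of the CPU mask */

/* Number of CPUs marked possible, i.e. the population count of the mask. */
int count_possible_cpus(uint64_t possible_mask)
{
	int count = 0;

	for (; possible_mask; possible_mask >>= 1)
		count += (int)(possible_mask & 1);
	return count;
}

/* Old-style check: compares the CPU number with the CPU count. */
bool cpu_ok_by_count(int cpu, uint64_t possible_mask)
{
	return cpu >= 0 && cpu < count_possible_cpus(possible_mask);
}

/* New-style check: only requires the number to index into the mask. */
bool cpu_ok_by_mask_width(int cpu)
{
	return cpu >= 0 && cpu < SKETCH_NR_CPUMASK_BITS;
}
```

With CPUs 0, 4 and 8 possible, the count is 3, so the count-based check rejects the valid CPU 8 while the width-based check accepts it and leaves the actual membership test to the mask lookup that follows.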
2009-12-15x86: Split swiotlb initialization into two stagesFUJITA Tomonori
Commit f4780ca005404166cc40af77ef0e86132ab98a81 moved swiotlb initialization before dma32_free_bootmem(). It was supposed to fix a bug introduced by commit 75f1cdf1dda92cae037ec848ae63690d91913eac: we initialized SWIOTLB right after dma32_free_bootmem(), so on machines with a broken BIOS we wrongly stole the memory area allocated earlier for GART. However, the above commit introduced another problem, which likely breaks machines with a huge amount of memory. Such a box uses the majority of DMA32_ZONE, so there is no memory left for swiotlb.

With this patch, the x86 IOMMU initialization sequence is:

1. We set swiotlb to 1 in the case of (max_pfn > MAX_DMA32_PFN && !no_iommu). If swiotlb usage is forced by the boot option, we go to step 3 and finish (we don't try to detect IOMMUs).
2. We call the detection functions of all the IOMMUs. The detection function sets x86_init.iommu.iommu_init to the IOMMU initialization function (so we can avoid calling the initialization functions of all the IOMMUs needlessly).
3. We initialize swiotlb (and set dma_ops to swiotlb_dma_ops) if swiotlb is set to 1.
4. If the IOMMU initialization function doesn't need swiotlb (e.g. the initialization is successful), it sets swiotlb to zero.
5. If we find that swiotlb is set to zero, we free the swiotlb resource.

Reported-by: Yinghai Lu <yinghai@kernel.org> Reported-by: Roland Dreier <rdreier@cisco.com> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> LKML-Reference: <20091215204729A.fujita.tomonori@lab.ntt.co.jp> Tested-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
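The five-step sequence can be modelled as a small decision function. This is a loose sketch of the control flow as described in the message, not the x86 code: the structures, field names and `run_iommu_init_sequence()` are all hypothetical, and hardware detection is reduced to two booleans.

```c
#include <stdbool.h>

/* Hypothetical inputs describing the machine and boot options. */
struct iommu_inputs {
	bool mem_above_dma32;	/* max_pfn > MAX_DMA32_PFN */
	bool no_iommu;		/* IOMMU use disabled */
	bool swiotlb_forced;	/* swiotlb forced by boot option */
	bool hw_iommu_found;	/* a detection function registered an init */
	bool hw_iommu_init_ok;	/* that init succeeded without swiotlb */
};

struct iommu_outcome {
	bool swiotlb_in_use;	/* swiotlb remains the DMA backend */
	bool swiotlb_freed;	/* swiotlb was set up, then released */
};

struct iommu_outcome run_iommu_init_sequence(const struct iommu_inputs *in)
{
	struct iommu_outcome out = { false, false };
	bool allocated = false;

	/* Step 1: decide whether swiotlb may be needed at all. */
	bool swiotlb = in->mem_above_dma32 && !in->no_iommu;

	if (in->swiotlb_forced) {
		/* Forced: initialize swiotlb and skip IOMMU detection. */
		out.swiotlb_in_use = true;
		return out;
	}

	/* Step 2: detection is modelled by in->hw_iommu_found. */

	/* Step 3: initialize swiotlb early, while memory is available. */
	if (swiotlb)
		allocated = true;

	/* Step 4: a working hardware IOMMU makes swiotlb unnecessary. */
	if (in->hw_iommu_found && in->hw_iommu_init_ok)
		swiotlb = false;

	/* Step 5: release swiotlb if it turned out not to be needed. */
	if (!swiotlb && allocated)
		out.swiotlb_freed = true;

	out.swiotlb_in_use = swiotlb;
	return out;
}
```

The point of the two-stage design is visible in steps 3 and 5: swiotlb memory is reserved before anything else can consume the low zone, and given back only once a hardware IOMMU has proven usable.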
2009-12-15Merge branch 'fix/hda' into for-linusTakashi Iwai
2009-12-15perf trace/scripting: Update DocumentationTom Zanussi
Update the perf-trace page with new and missing options and remove some unused ones. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: fweisbec@gmail.com Cc: rostedt@goodmis.org LKML-Reference: <1260867220-15699-7-git-send-email-tzanussi@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-15perf trace/scripting: Add 'record' and 'report' optionsTom Zanussi
Allow scripts to be recorded/executed by simply specifying the script root name (the script name minus extension) along with 'record' or 'report' to 'perf trace'.

The script names shown by 'perf trace -l' can be directly used to run the command-line contained within the corresponding '-record' and '-report' versions of scripts in the scripts/*/bin directories.

For example, to record the trace data needed to run the wakeup-latency.pl script, the user can easily find the name of the corresponding script from the script list and invoke it using 'perf trace record', without having to remember the details of how to do the same thing using the lower-level perf trace command-line options:

    root@tropicana:~# perf trace -l
    List of available trace scripts:
      workqueue-stats                      workqueue stats (ins/exe/create/destroy)
      wakeup-latency                       system-wide min/max/avg wakeup latency
      rw-by-file <comm>                    r/w activity for a program, by file
      check-perf-trace                     useless but exhaustive test script
      rw-by-pid                            system-wide r/w activity

    root@tropicana:~# perf trace record wakeup-latency
    ^C[ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.296 MB perf.data (~12931 samples) ]

To run the wakeup-latency.pl script using the captured data, change 'record' to 'report' in the command-line:

    root@tropicana:~# perf trace report wakeup-latency

    wakeup_latency stats:

    total_wakeups: 65
    avg_wakeup_latency (ns): 22417
    min_wakeup_latency (ns): 3470
    max_wakeup_latency (ns): 223311

    perf trace Perl script stopped

If the script takes options, they can simply be added to the end of the 'report' invocation:

    root@tropicana:~# perf trace record rw-by-file
    ^C[ perf record: Woken up 2 times to write data ]
    [ perf record: Captured and wrote 0.782 MB perf.data (~34171 samples) ]

    root@tropicana:~# perf trace report rw-by-file perf

    file read counts for perf:

        fd     # reads  bytes_requested
    ------  ----------  -----------
       122        1934     1980416
       120           1          32

    file write counts for perf:

        fd    # writes  bytes_written
    ------  ----------  -----------
         3        4006      280568

    perf trace Perl script stopped

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: fweisbec@gmail.com
Cc: rostedt@goodmis.org
LKML-Reference: <1260867220-15699-6-git-send-email-tzanussi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
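The name lookup this commit message describes can be illustrated with a short Python sketch. This is not perf's implementation (perf is written in C); the function names `find_script` and `build_command` and the `scripts_root` parameter are hypothetical, and the only behavior taken from the message is that a script root name plus 'record' or 'report' maps to a `<name>-record` or `<name>-report` file under scripts/*/bin, with any extra options appended to the invocation.

```python
from pathlib import Path


def find_script(scripts_root, name, action):
    """Locate scripts/*/bin/<name>-<action> under scripts_root.

    Hypothetical helper: returns the first matching path, or None.
    """
    if action not in ("record", "report"):
        raise ValueError("action must be 'record' or 'report'")
    pattern = "*/bin/{}-{}".format(name, action)
    matches = sorted(Path(scripts_root).glob(pattern))
    return matches[0] if matches else None


def build_command(script_path, extra_args=()):
    """Build the argv list: the wrapper script followed by user options."""
    return [str(script_path)] + list(extra_args)
```

With this sketch, `perf trace report rw-by-file perf` would correspond to finding `rw-by-file-report` and passing `perf` through as its argument.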
2009-12-15perf trace/scripting: List available scriptsTom Zanussi
Lists the available perf trace scripts, one per line, e.g.:

    root@tropicana:~# perf trace -l
    List of available trace scripts:
      workqueue-stats                      workqueue stats (ins/exe/create/destroy)
      wakeup-latency                       system-wide min/max/avg wakeup latency
      rw-by-file <comm>                    r/w activity for a program, by file
      check-perf-trace                     useless but exhaustive test script
      rw-by-pid                            system-wide r/w activity

To be consistent with the other listing options in perf, the current latency trace option was changed to '-L', and '-l' is now used to access the script listing.

To create the list, it searches each scripts/*/bin directory for files ending with "-report" and reads information found in certain comment lines contained in those shell scripts:

- if the comment line starts with "description:", the rest of the line is used as a 'half-line' description. To keep each entry in the list to a single line, the description should be limited to 40 characters (the rest of the line contains the script name and args)

- if the comment line starts with "args:", the rest of the line names the args the script supports. Required args should be surrounded by <> brackets, optional args by [] brackets.

The current scripts in scripts/perl/bin have also been updated with description: and args: comments.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: fweisbec@gmail.com
Cc: rostedt@goodmis.org
LKML-Reference: <1260867220-15699-5-git-send-email-tzanussi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
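The listing rules above can be sketched in Python. This is only an illustration, not the C code in perf: the helper names (`parse_script_comments`, `list_scripts`, `format_entry`) and the 36-column name field are assumptions. What comes from the message is the search for "-report" files in scripts/*/bin and the "description:" and "args:" comment prefixes.

```python
from pathlib import Path


def parse_script_comments(text):
    """Extract (description, args) from '#' comment lines of a shell script."""
    description, args = "", ""
    for line in text.splitlines():
        if not line.startswith("#"):
            continue
        comment = line.lstrip("#").strip()
        if comment.startswith("description:"):
            description = comment[len("description:"):].strip()
        elif comment.startswith("args:"):
            args = comment[len("args:"):].strip()
    return description, args


def list_scripts(scripts_root):
    """Return (name, args, description) for each scripts/*/bin/*-report file."""
    entries = []
    for path in sorted(Path(scripts_root).glob("*/bin/*-report")):
        name = path.name[: -len("-report")]
        description, args = parse_script_comments(path.read_text())
        entries.append((name, args, description))
    return entries


def format_entry(name, args, description):
    """Format one list line; the 36-column name field is an assumed width."""
    label = (name + " " + args).strip()
    return "  {:<36} {}".format(label, description)
```

A "-report" script containing the comment lines `# description: r/w activity for a program, by file` and `# args: <comm>` would then be listed under the name `rw-by-file` with `<comm>` shown next to it.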
2009-12-15perf trace/scripting: Check return val of perl_run()Tom Zanussi
The return value from perl_run() is currently ignored, but it should be checked and used to exit perf if there are problems loading the script.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: fweisbec@gmail.com
Cc: rostedt@goodmis.org
LKML-Reference: <1260867220-15699-4-git-send-email-tzanussi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-15perf trace/scripting: Don't install unneeded filesTom Zanussi
README and Makefile.PL don't need to be installed for Perl run-time support.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: fweisbec@gmail.com
Cc: rostedt@goodmis.org
LKML-Reference: <1260867220-15699-3-git-send-email-tzanussi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-15perf trace/scripting: Add support for script argsTom Zanussi
One oversight of the original scripting_ops patch was a lack of support for passing args to handler scripts. This adds argc/argv to the start_script() scripting_op, and changes the rw-by-file script to take a 'comm' arg rather than the 'perf' value currently hard-coded.

It also takes the opportunity to do some related minor cleanup.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: fweisbec@gmail.com
Cc: rostedt@goodmis.org
LKML-Reference: <1260867220-15699-2-git-send-email-tzanussi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-15Merge branch 'fix/asoc' into for-linusTakashi Iwai
2009-12-15Merge branch 'fixes' of git://git.alsa-project.org/alsa-kernel into for-linusTakashi Iwai
2009-12-15cfq: set workload as expired if it doesn't have any slice leftGui Jianfeng
When a group is resumed and it has no workload slice left, we should mark workload_expires as expired. Otherwise, we might mistakenly resume from where we left off in the previous group. Thanks to Corrado for the idea.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-12-15Blackfin: define __NR_recvmmsgMike Frysinger
Commit a2e2725541f added recvmmsg to a bunch of arches (including the Blackfin entry.S), but didn't actually add the new __NR_ define for it.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
2009-12-15Input: wacom - separate pen from express keys on GraphirePing Cheng
Since Graphire/Bamboo devices report pen and expresskeys in the same data packet, we need to send an input_sync event to separate pen data from expresskeys so that the X11 driver can process them properly.

Signed-off-by: Ping Cheng <pingc@wacom.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
2009-12-15Input: wacom - add defines for data packet report IDsPing Cheng
Signed-off-by: Ping Cheng <pingc@wacom.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>