summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2012-12-11net: gro: avoid double copy in skb_gro_receive()Eric Dumazet
__copy_skb_header(nskb, p) already copied p->cb[], no need to copy it again. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11bridge: fix seq check in br_mdb_dump()Cong Wang
In case of rehashing, introduce a global variable 'br_mdb_rehash_seq' which gets increased every time when rehashing, and assign net->dev_base_seq + br_mdb_rehash_seq to cb->seq. In theory cb->seq could be wrapped to zero, but this is not easy to fix, as net->dev_base_seq is not visible inside br_mdb_rehash(). In practice, this is rare. Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Graf <tgraf@suug.ch> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11net: smc911x: use io{read,write}*_rep accessorsMatthew Leach
The {read,write}s{b,w,l} operations are not defined by all architectures and are being removed from the asm-generic/io.h interface. This patch replaces the usage of these string functions in the smc911x accessors with io{read,write}{8,16,32}_rep calls instead. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: netdev@vger.kernel.org Signed-off-by: Matthew Leach <matthew@mattleach.net> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11net: remove obsolete simple_strto<foo>Abhijit Pawar
This patch removes the redundant occurences of simple_strto<foo> Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11tun: allow setting ethernet addresss while runningstephen hemminger
This is a pure software device, and ok with live address change. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11net: gro: dev_gro_receive() cleanupEric Dumazet
__napi_gro_receive() is inlined from two call sites for no good reason. Lets move the prep stuff in a function of its own, called only if/when needed. This saves 300 bytes on x86 : # size net/core/dev.o.after net/core/dev.o.before text data bss dec hex filename 51968 1238 1040 54246 d3e6 net/core/dev.o.before 51664 1238 1040 53942 d2b6 net/core/dev.o.after Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11net: fix a race in gro_cell_poll()Eric Dumazet
Dmitry Kravkov reported packet drops for GRE packets since GRO support was added. There is a race in gro_cell_poll() because we call napi_complete() without any synchronization with a concurrent gro_cells_receive() Once bug was triggered, we queued packets but did not schedule NAPI poll. We can fix this issue using the spinlock protected the napi_skbs queue, as we have to hold it to perform skb dequeue anyway. As we open-code skb_dequeue(), we no longer need to mask IRQS, as both producer and consumer run under BH context. Bug added in commit c9e6bc644e (net: add gro_cells infrastructure) Reported-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11bnx2x: use netdev_alloc_frag()Eric Dumazet
Using netdev_alloc_frag() instead of kmalloc() permits better GRO or TCP coalescing behavior, as skb_gro_receive() doesn't have to fallback to frag_list overhead. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dmitry Kravkov <dmitry@broadcom.com> Cc: Eilon Greenstein <eilong@broadcom.com> Acked-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11CIFS: Fix write after setting a read lock for read oplock filesPavel Shilovsky
If we have a read oplock and set a read lock in it, we can't write to the locked area - so, filemap_fdatawrite may fail with a no information for a userspace application even if we request a write to non-locked area. Fix this by populating the page cache without marking affected pages dirty after a successful write directly to the server. Also remove CONFIG_CIFS_SMB2 ifdefs because it's suitable for both CIFS and SMB2 protocols. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: parse the device name into UNC and prepathJeff Layton
This should fix a regression that was introduced when the new mount option parser went in. Also, when the unc= and prefixpath= options are provided, check their values against the ones we parsed from the device string. If they differ, then throw a warning that tells the user that we're using the values from the unc= option for now, but that that will change in 3.10. Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: fix up handling of prefixpath= optionJeff Layton
Currently the code takes care to ensure that the prefixpath has a leading '/' delimiter. What if someone passes us a prefixpath with a leading '\\' instead? The code doesn't properly handle that currently AFAICS. Let's just change the code to skip over any leading delimiter character when copying the prepath. Then, fix up the users of the prepath option to prefix it with the correct delimiter when they use it. Also, there's no need to limit the length of the prefixpath to 1k. If the server can handle it, why bother forbidding it? Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: clean up handling of unc= optionJeff Layton
Make sure we free any existing memory allocated for vol->UNC, just in case someone passes in multiple unc= options. Get rid of the check for too long a UNC. The check for >300 bytes seems arbitrary. We later copy this into the tcon->treeName, for instance and it's a lot shorter than 300 bytes. Eliminate an extra kmalloc and copy as well. Just set the vol->UNC directly with the contents of match_strdup. Establish that the UNC should be stored with '\\' delimiters. Use convert_delimiter to change it in place in the vol->UNC. Finally, move the check for a malformed UNC into cifs_parse_mount_options so we can catch that situation earlier. Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: fix SID binary to string conversionJeff Layton
The authority fields are supposed to be represented by a single 48-bit value. It's also supposed to represent the value as hex if it's equal to or greater than 2^32. This is documented in MS-DTYP, section 2.4.2.1. Also, fix up the max string length to account for this fix. Acked-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11of: *node argument to of_parse_phandle_with_args should be constGuennadi Liakhovetski
The "struct device_node *" argument of of_parse_phandle_with_args() can be const. Making this change makes it explicit that the function will not modify a node. Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> [grant.likely: Resolved conflict with previous patch modifying of_parse_phandle()] Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-12-11of/i2c: add dummy inline functions for when CONFIG_OF_I2C(_MODULE) isn't definedGuennadi Liakhovetski
If CONFIG_OF_I2C and CONFIG_OF_I2C_MODULE are undefined no declaration of of_find_i2c_device_by_node and of_find_i2c_adapter_by_node will be available. Add dummy inline functions to avoid compilation breakage. Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-12-11mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalableIngo Molnar
rmap_walk_anon() and try_to_unmap_anon() appears to be too careful about locking the anon vma: while it needs protection against anon vma list modifications, it does not need exclusive access to the list itself. Transforming this exclusive lock to a read-locked rwsem removes a global lock from the hot path of page-migration intense threaded workloads which can cause pathological performance like this: 96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch | --- perf_trace_sched_switch __schedule schedule schedule_preempt_disabled __mutex_lock_common.isra.6 __mutex_lock_slowpath mutex_lock | |--50.61%-- rmap_walk | move_to_new_page | migrate_pages | migrate_misplaced_page | __do_numa_page.isra.69 | handle_pte_fault | handle_mm_fault | __do_page_fault | do_page_fault | page_fault | __memset_sse2 | | | --100.00%-- worker_thread | | | --100.00%-- start_thread | --49.39%-- page_lock_anon_vma try_to_unmap_anon try_to_unmap migrate_pages migrate_misplaced_page __do_numa_page.isra.69 handle_pte_fault handle_mm_fault __do_page_fault do_page_fault page_fault __memset_sse2 | --100.00%-- worker_thread start_thread With this change applied the profile is now nicely flat and there's no anon-vma related scheduling/blocking. Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(), to make it clearer that it's an exclusive write-lock in that case - suggested by Rik van Riel. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm/rmap: Convert the struct anon_vma::mutex to an rwsemIngo Molnar
Convert the struct anon_vma::mutex to an rwsem, which will help in solving a page-migration scalability problem. (Addressed in a separate patch.) The conversion is simple and straightforward: in every case where we mutex_lock()ed we'll now down_write(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: migrate: Account a transhuge page properly when rate limitingMel Gorman
If there is excessive migration due to NUMA balancing it gets rate limited. It does this by counting the number of pages it has migrated recently but counts a transhuge page as 1 page. Account for it properly. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Account for failed allocations and isolations as migration failuresMel Gorman
Subject says it all. Allocation failures and a failure to isolate should be accounted as a migration failure. This is partially another difference between base page and transhuge page migration. A base page migration makes multiple attempts for these conditions before it would be accounted for as a failure. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Add THP migration for the NUMA working set scanning fault case ↵Mel Gorman
build fix Commit "Add THP migration for the NUMA working set scanning fault case" breaks the build because HPAGE_PMD_SHIFT and HPAGE_PMD_MASK defined to explode without CONFIG_TRANSPARENT_HUGEPAGE: mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put': mm/migrate.c:1549: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1564: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1566: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1573: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1606: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1648: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed CONFIG_NUMA_BALANCING allows compilation without enabling transparent hugepages, so define the dummy function for such a configuration and only define migrate_misplaced_transhuge_page_put() when transparent hugepages are enabled. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Add THP migration for the NUMA working set scanning fault case.Mel Gorman
Note: This is very heavily based on a patch from Peter Zijlstra with fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch put a lot of migration logic into mm/huge_memory.c where it does not belong. This version puts tries to share some of the migration logic with migrate_misplaced_page. However, it should be noted that now migrate.c is doing more with the pagetable manipulation than is preferred. The end result is barely recognisable so as before, the signed-offs had to be removed but will be re-added if the original authors are ok with it. Add THP migration for the NUMA working set scanning fault case. It uses the page lock to serialize. No migration pte dance is necessary because the pte is already unmapped when we decide to migrate. [dhillf@gmail.com: Fix memory leak on isolation failure] [dhillf@gmail.com: Fix transfer of last_nid information] Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: numa: Delay PTE scanning until a task is scheduled on a new nodeMel Gorman
Due to the fact that migrations are driven by the CPU a task is running on there is no point tracking NUMA faults until one task runs on a new node. This patch tracks the first node used by an address space. Until it changes, PTE scanning is disabled and no NUMA hinting faults are trapped. This should help workloads that are short-lived, do not care about NUMA placement or have bound themselves to a single node. This takes advantage of the logic in "mm: sched: numa: Implement slow start for working set sampling" to delay when the checks are made. This will take advantage of processes that set their CPU and node bindings early in their lifetime. It will also potentially allow any initial load balancing to take place. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: numa: Control enabling and disabling of NUMA balancing if ↵Mel Gorman
!SCHED_DEBUG The "mm: sched: numa: Control enabling and disabling of NUMA balancing" depends on scheduling debug being enabled but it's perfectly legimate to disable automatic NUMA balancing even without this option. This should take care of it. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: numa: Control enabling and disabling of NUMA balancingMel Gorman
This patch adds Kconfig options and kernel parameters to allow the enabling and disabling of automatic NUMA balancing. The existance of such a switch was and is very important when debugging problems related to transparent hugepages and we should have the same for automatic NUMA placement. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrateMel Gorman
The PTE scanning rate and fault rates are two of the biggest sources of system CPU overhead with automatic NUMA placement. Ideally a proper policy would detect if a workload was properly placed, schedule and adjust the PTE scanning rate accordingly. We do not track the necessary information to do that but we at least know if we migrated or not. This patch scans slower if a page was not migrated as the result of a NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is now higher than the previous default. Once every minute it will reset the scanner in case of phase changes. This is hilariously crude and the numbers are arbitrary. Workloads will converge quite slowly in comparison to what a proper policy should be able to do. On the plus side, we will chew up less CPU for workloads that have no need for automatic balancing. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Use a two-stage filter to restrict pages being migrated for ↵Mel Gorman
unlikely task<->node relationships Note: This two-stage filter was taken directly from the sched/numa patch "sched, numa, mm: Add the scanning page fault machinery" but is only a partial extraction. As the end result is not necessarily recognisable, the signed-offs-by had to be removed. Will be added back if requested. While it is desirable that all threads in a process run on its home node, this is not always possible or necessary. There may be more threads than exist within the node or the node might over-subscribed with unrelated processes. This can cause a situation whereby a page gets migrated off its home node because the threads clearing pte_numa were running off-node. This patch uses page->last_nid to build a two-stage filter before pages get migrated to avoid problems with short or unlikely task<->node relationships. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: migrate: Set last_nid on newly allocated pageHillf Danton
Pass last_nid from misplaced page to newly allocated migration target page. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: split_huge_page: Transfer last_nid on tail pageHillf Danton
Pass last_nid from head page to tail page. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Introduce last_nid to the page frameMel Gorman
This patch introduces a last_nid field to the page struct. This is used to build a two-stage filter in the next patch that is aimed at mitigating a problem whereby pages migrate to the wrong node when referenced by a process that was running off its home node. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11sched: numa: Slowly increase the scanning period as NUMA faults are handledMel Gorman
Currently the rate of scanning for an address space is controlled by the individual tasks. The next scan is simply determined by 2*p->numa_scan_period. The 2*p->numa_scan_period is arbitrary and never changes. At this point there is still no proper policy that decides if a task or process is properly placed. It just scans and assumes the next NUMA fault will place it properly. As it is assumed that pages will get properly placed over time, increase the scan window each time a fault is incurred. This is a big assumption as noted in the comments. It should be noted that changing to p->numa_scan_period will increase system CPU usage because now the scanning rate has effectively doubled. If that is a problem then the min_rate should be made 200ms instead of restoring the 2* logic. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Rate limit setting of pte_numa if node is saturatedMel Gorman
If there are a large number of NUMA hinting faults and all of them are resulting in migrations it may indicate that memory is just bouncing uselessly around. NUMA balancing cost is likely exceeding any benefit from locality. Rate limit the PTE updates if the node is migration rate-limited. As noted in the comments, this distorts the NUMA faulting statistics. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Rate limit the amount of memory that is migrated between nodesMel Gorman
NOTE: This is very heavily based on similar logic in autonuma. It should be signed off by Andrea but because there was no standalone patch and it's sufficiently different from what he did that the signed-off is omitted. Will be added back if requested. If a large number of pages are misplaced then the memory bus can be saturated just migrating pages between nodes. This patch rate-limits the amount of memory that can be migrating between nodes. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Structures for Migrate On Fault per NUMA migration rate limitingAndrea Arcangeli
This defines the per-node data used by Migrate On Fault in order to rate limit the migration. The rate limiting is applied independently to each destination node. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Migrate pages handled during a pmd_numa hinting faultMel Gorman
To say that the PMD handling code was incorrectly transferred from autonuma is an understatement. The intention was to handle a PMDs worth of pages in the same fault and effectively batch the taking of the PTL and page migration. The copied version instead has the impact of clearing a number of pte_numa PTE entries and whether any page migration takes place depends on racing. This just happens to work in some cases. This patch handles pte_numa faults in batch when a pmd_numa fault is handled. The pages are migrated if they are currently misplaced. Essentially this is making an assumption that NUMA locality is on a PMD boundary but that could be addressed by only setting pmd_numa if all the pages within that PMD are on the same node if necessary. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Migrate on reference policyMel Gorman
This is the simplest possible policy that still does something of note. When a pte_numa is faulted, it is moved immediately. Any replacement policy must at least do better than this and in all likelihood this policy regresses normal workloads. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: numa: Add pte updates, hinting and migration statsMel Gorman
It is tricky to quantify the basic cost of automatic NUMA placement in a meaningful manner. This patch adds some vmstats that can be used as part of a basic costing model. u = basic unit = sizeof(void *) Ca = cost of struct page access = sizeof(struct page) / u Cpte = Cost PTE access = Ca Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock) where Cpte is incurred twice for a read and a write and Wlock is a constant representing the cost of taking or releasing a lock Cnumahint = Cost of a minor page fault = some high constant e.g. 1000 Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u Ci = Cost of page isolation = Ca + Wi where Wi is a constant that should reflect the approximate cost of the locking operation Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma) where Wnuma is the approximate NUMA factor. 1 is local. 1.2 would imply that remote accesses are 20% more expensive Balancing cost = Cpte * numa_pte_updates + Cnumahint * numa_hint_faults + Ci * numa_pages_migrated + Cpagecopy * numa_pages_migrated Note that numa_pages_migrated is used as a measure of how many pages were isolated even though it would miss pages that failed to migrate. A vmstat counter could have been added for it but the isolation cost is pretty marginal in comparison to the overall cost so it seemed overkill. The ideal way to measure automatic placement benefit would be to count the number of remote accesses versus local accesses and do something like benefit = (remote_accesses_before - remove_access_after) * Wnuma but the information is not readily available. As a workload converges, the expection would be that the number of remote numa hints would reduce to 0. convergence = numa_hint_faults_local / numa_hint_faults where this is measured for the last N number of numa hints recorded. When the workload is fully converged the value is 1. This can measure if the placement policy is converging and how fast it is doing it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: sched: numa: Implement slow start for working set samplingPeter Zijlstra
Add a 1 second delay before starting to scan the working set of a task and starting to balance it amongst nodes. [ note that before the constant per task WSS sampling rate patch the initial scan would happen much later still, in effect that patch caused this regression. ] The theory is that short-run tasks benefit very little from NUMA placement: they come and go, and they better stick to the node they were started on. As tasks mature and rebalance to other CPUs and nodes, so does their NUMA placement have to change and so does it start to matter more and more. In practice this change fixes an observable kbuild regression: # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ] !NUMA: 45.291088843 seconds time elapsed ( +- 0.40% ) 45.154231752 seconds time elapsed ( +- 0.36% ) +NUMA, no slow start: 46.172308123 seconds time elapsed ( +- 0.30% ) 46.343168745 seconds time elapsed ( +- 0.25% ) +NUMA, 1 sec slow start: 45.224189155 seconds time elapsed ( +- 0.25% ) 45.160866532 seconds time elapsed ( +- 0.17% ) and it also fixes an observable perf bench (hackbench) regression: # perf stat --null --repeat 10 perf bench sched messaging -NUMA: -NUMA: 0.246225691 seconds time elapsed ( +- 1.31% ) +NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% ) +NUMA 1sec delay: 0.248076230 seconds time elapsed ( +- 1.35% ) The implementation is simple and straightforward, most of the patch deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable knob. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> [ Wrote the changelog, ran measurements, tuned the default. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ↵Mel Gorman
ranges By accounting against the present PTEs, scanning speed reflects the actual present (mapped) memory. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) ratePeter Zijlstra
Previously, to probe the working set of a task, we'd use a very simple and crude method: mark all of its address space PROT_NONE. That method has various (obvious) disadvantages: - it samples the working set at dissimilar rates, giving some tasks a sampling quality advantage over others. - creates performance problems for tasks with very large working sets - over-samples processes with large address spaces but which only very rarely execute Improve that method by keeping a rotating offset into the address space that marks the current position of the scan, and advance it by a constant rate (in a CPU cycles execution proportional manner). If the offset reaches the last mapped address of the mm then it then it starts over at the first address. The per-task nature of the working set sampling functionality in this tree allows such constant rate, per task, execution-weight proportional sampling of the working set, with an adaptive sampling interval/frequency that goes from once per 100ms up to just once per 8 seconds. The current sampling volume is 256 MB per interval. As tasks mature and converge their working set, so does the sampling rate slow down to just a trickle, 256 MB per 8 seconds of CPU time executed. This, beyond being adaptive, also rate-limits rarely executing systems and does not over-sample on overloaded systems. [ In AutoNUMA speak, this patch deals with the effective sampling rate of the 'hinting page fault'. AutoNUMA's scanning is currently rate-limited, but it is also fundamentally single-threaded, executing in the knuma_scand kernel thread, so the limit in AutoNUMA is global and does not scale up with the number of CPUs, nor does it scan tasks in an execution proportional manner. So the idea of rate-limiting the scanning was first implemented in the AutoNUMA tree via a global rate limit. This patch goes beyond that by implementing an execution rate proportional working set sampling rate that is not implemented via a single global scanning daemon. ] [ Dan Carpenter pointed out a possible NULL pointer dereference in the first version of this patch. ] Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com> Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> [ Wrote changelog and fixed bug. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: numa: Add fault driven placement and migrationPeter Zijlstra
NOTE: This patch is based on "sched, numa, mm: Add fault driven placement and migration policy" but as it throws away all the policy to just leave a basic foundation I had to drop the signed-offs-by. This patch creates a bare-bones method for setting PTEs pte_numa in the context of the scheduler that when faulted later will be faulted onto the node the CPU is running on. In itself this does nothing useful but any placement policy will fundamentally depend on receiving hints on placement from fault context and doing something intelligent about it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for nowMel Gorman
The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to explicitly request lazy migration is a good idea but the actual API has not been well reviewed and once released we have to support it. For now this patch prevents an application using the services. This will need to be revisited. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: mempolicy: Implement change_prot_numa() in terms of change_protection()Mel Gorman
This patch converts change_prot_numa() to use change_protection(). As pte_numa and friends check the PTE bits directly it is necessary for change_protection() to use pmd_mknuma(). Hence the required modifications to change_protection() are a little clumsy but the end result is that most of the numa page table helpers are just one or two instructions. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: mempolicy: Add MPOL_MF_LAZYLee Schermerhorn
NOTE: Once again there is a lot of patch stealing and the end result is sufficiently different that I had to drop the signed-offs. Will re-add if the original authors are ok with that. This patch adds another mbind() flag to request "lazy migration". The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected pages are marked PROT_NONE. The pages will be migrated in the fault path on "first touch", if the policy dictates at that time. "Lazy Migration" will allow testing of migrate-on-fault via mbind(). Also allows applications to specify that only subsequently touched pages be migrated to obey new policy, instead of all pages in range. This can be useful for multi-threaded applications working on a large shared data area that is initialized by an initial thread resulting in all pages on one [or a few, if overflowed] nodes. After PROT_NONE, the pages in regions assigned to the worker threads will be automatically migrated local to the threads on 1st touch. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: mempolicy: Use _PAGE_NUMA to migrate pagesMel Gorman
Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but sufficiently different that the signed-off-bys were dropped Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page() pieces into an effective migrate on fault scheme. Note that (on x86) we rely on PROT_NONE pages being !present and avoid the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the page-migration performance. Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: migrate: Drop the misplaced pages reference count if the target node is fullMel Gorman
If we have to avoid migrating to a node that is nearly full, put page and return zero. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: migrate: Introduce migrate_misplaced_page()Peter Zijlstra
Note: This was originally based on Peter's patch "mm/migrate: Introduce migrate_misplaced_page()" but borrows extremely heavily from Andrea's "autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection". The end result is barely recognisable so signed-offs had to be dropped. If original authors are ok with it, I'll re-add the signed-off-bys. Add migrate_misplaced_page() which deals with migrating pages from faults. Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Based-on-work-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: mempolicy: Check for misplaced pageLee Schermerhorn
This patch provides a new function to test whether a page resides on a node that is appropriate for the mempolicy for the vma and address where the page is supposed to be mapped. This involves looking up the node where the page belongs. So, the function returns that node so that it may be used to allocated the page without consulting the policy again. A subsequent patch will call this function from the fault path. Because of this, I don't want to go ahead and allocate the page, e.g., via alloc_page_vma() only to have to free it if it has the correct policy. So, I just mimic the alloc_page_vma() node computation logic--sort of. Note: we could use this function to implement a MPOL_MF_STRICT behavior when migrating pages to match mbind() mempolicy--e.g., to ensure that pages in an interleaved range are reinterleaved rather than left where they are when they reside on any page in the interleave nodemask. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> [ Added MPOL_F_LAZY to trigger migrate-on-fault; simplified code now that we don't have to bother with special crap for interleaved ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: mempolicy: Add MPOL_NOOPLee Schermerhorn
This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy to mbind(). When the NOOP policy is used with the 'MOVE and 'LAZY flags, mbind() will map the pages PROT_NONE so that they will be migrated on the next touch. This allows an application to prepare for a new phase of operation where different regions of shared storage will be assigned to worker threads, w/o changing policy. Note that we could just use "default" policy in this case. However, this also allows an application to request that pages be migrated, only if necessary, to follow any arbitrary policy that might currently apply to a range of pages, without knowing the policy, or without specifying multiple mbind()s for ranges with different policies. [ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ] Bug-Reported-by: Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: mempolicy: Make MPOL_LOCAL a real policyPeter Zijlstra
Make MPOL_LOCAL a real and exposed policy such that applications that relied on the previous default behaviour can explicitly request it. Requested-by: Christoph Lameter <cl@linux.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Create basic numa page hinting infrastructureMel Gorman
Note: This patch started as "mm/mpol: Create special PROT_NONE infrastructure" and preserves the basic idea but steals *very* heavily from "autonuma: numa hinting page faults entry points" for the actual fault handlers without the migration parts. The end result is barely recognisable as either patch so all Signed-off and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with this version, I will re-add the signed-offs-by to reflect the history. In order to facilitate a lazy -- fault driven -- migration of pages, create a special transient PAGE_NUMA variant, we can then use the 'spurious' protection faults to drive our migrations from. The meaning of PAGE_NUMA depends on the architecture but on x86 it is effectively PROT_NONE. Actual PROT_NONE mappings will not generate these NUMA faults for the reason that the page fault code checks the permission on the VMA (and will throw a segmentation fault on actual PROT_NONE mappings), before it ever calls handle_mm_fault. [dhillf@gmail.com: Fix typo] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>