summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2012-12-11Merge tag 'pinctrl-for-v3.8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pinctrl changes from Linus Walleij: "These are the first and major pinctrl changes for the v3.8 merge cycle. Some of this is used as merge base for other trees so I better be early on the trigger. As can be seen from the diffstat the major changes are: - A big conversion of the AT91 pinctrl driver and the associated ACKed platform changes under arch/arm/max-at91 and its device trees. This has been coordinated with the AT91 maintainers to go in through the pinctrl tree. - A larger chunk of changes to the SPEAr drivers and the addition of the "plgpio" driver for the SPEAr as well. - The removal of the remnants of the Nomadik driver from the arch/arm tree and fusion of that into the Nomadik driver and platform data header files. - Some local movement in the Marvell MVEBU drivers, these now have their own subdirectory. - The addition of a chunk of code to gpiolib under drivers/gpio to register gpio-to-pin range mappings from the GPIO side of things. This has been requested by Grant Likely and is now implemented, it is particularly useful for device tree work. Then we have incremental updates all over the place, many of these are cleanups and fixes from Axel Lin who has done a great job of removing minor mistakes and compilation annoyances." * tag 'pinctrl-for-v3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (114 commits) ARM: mmp: select PINCTRL for ARCH_MMP pinctrl: Drop selecting PINCONF for MMP2, PXA168 and PXA910 pinctrl: pinctrl-single: Fix error check condition pinctrl: SPEAr: Update error check for unsigned variables gpiolib: Fix use after free in gpiochip_add_pin_range gpiolib: rename pin range arguments pinctrl: single: support gpio request and free pinctrl: generic: add input schmitt disable parameter pinctrl/u300/coh901: stop spawning pinctrl from GPIO pinctrl/u300/coh901: let the gpio_chip register the range pinctrl: add function to retrieve range from pin gpiolib: return any error code from range creation pinctrl: make range registration defer properly gpiolib: rename find_pinctrl_* gpiolib: let gpiochip_add_pin_range() specify offset ARM: at91: pm9g45: add mmc support ARM: at91: Animeo IP: add mmc support ARM: at91: dt: add mmc pinctrl for Atmel reference boards ARM: at91: dt: at91sam9: add mmc pinctrl support ARM: at91/dts: add nodes for atmel hsmci controllers for atmel boards ...
2012-12-11Merge tag 'hwmon-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging Pull hwmon updates from Guenter Roeck: "New driver: DA9055 Added/improved support for new chips in existing drivers: Z650/670, N550/570, ADS7830, AMD 16h family" * tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: hwmon: (da9055) Fix chan_mux[DA9055_ADC_ADCIN3] setting hwmon: DA9055 HWMON driver hwmon: (coretemp) List TjMax for Z650/670 and N550/570 hwmon: (coretemp) Drop N4xx, N5xx, D4xx, D5xx CPUs from tjmax table hwmon: (coretemp) Use model table instead of if/else to identify CPU models hwmon: da9052: Use da9052_reg_update for rmw operations hwmon: (coretemp) Drop dependency on PCI for TjMax detection on Atom CPUs hwmon: (ina2xx) use module_i2c_driver to simplify the code hwmon: (ads7828) add support for ADS7830 hwmon: (ads7828) driver cleanup x86,AMD: Power driver support for AMD's family 16h processors
2012-12-11Merge tag 'mmc-updates-for-3.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc Pull MMC updates from Chris Ball: "MMC highlights for 3.8: Core: - Expose access to the eMMC RPMB ("Replay Protected Memory Block") area by extending the existing mmc_block ioctl. - Add SDIO powered-suspend DT properties to the core MMC DT binding. - Add no-1-8-v DT flag for boards where the SD controller reports that it supports 1.8V but the board itself has no way to switch to 1.8V. - More work on switching to 1.8V UHS support using a vqmmc regulator. - Fix up a case where the slot-gpio helper may fail to reset the host controller properly if a card was removed during a transfer. - Fix several cases where a broken device could cause an infinite loop while we wait for a register to update. Drivers: - at91-mci: Remove obsolete driver, atmel-mci handles these devices now. - sdhci-dove: Allow using GPIOs for card-detect notifications. - sdhci-esdhc: Fix for recovering from ADMA errors on broken silicon. - sdhci-s3c: Add pinctrl support. - wmt-sdmmc: New driver for WonderMedia SD/MMC controllers." * tag 'mmc-updates-for-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (65 commits) mmc: sdhci: implement the .card_event() method mmc: extend the slot-gpio card-detection to use host's .card_event() method mmc: add a card-event host operation mmc: sdhci-s3c: Fix compilation warning mmc: sdhci-pci: Enable SDHCI_CAN_DO_HISPD for Ricoh SDHCI controller mmc: sdhci-dove: allow GPIOs to be used for card detection on Dove mmc: sdhci-dove: use two-stage initialization for sdhci-pltfm mmc: sdhci-dove: use devm_clk_get() mmc: eSDHC: Recover from ADMA errors mmc: dw_mmc: remove duplicated buswidth code mmc: dw_mmc: relocate where dw_mci_setup_bus() is called from mmc: Limit MMC speed to 52MHz if not HS200 mmc: dw_mmc: use devres functions in dw_mmc mmc: sh_mmcif: remove unneeded clock connection ID mmc: sh_mobile_sdhi: remove unneeded clock connection ID mmc: sh_mobile_sdhi: fix clock frequency printing mmc: Remove redundant null check before kfree in bus.c mmc: Remove redundant null check before kfree in sdio_bus.c mmc: sdhci-imx-esdhc: use more devm_* functions mmc: dt: add no-1-8-v device tree flag ...
2012-12-11net: gro: avoid double copy in skb_gro_receive()Eric Dumazet
__copy_skb_header(nskb, p) already copied p->cb[], no need to copy it again. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11bridge: fix seq check in br_mdb_dump()Cong Wang
In case of rehashing, introduce a global variable 'br_mdb_rehash_seq' which gets increased every time when rehashing, and assign net->dev_base_seq + br_mdb_rehash_seq to cb->seq. In theory cb->seq could be wrapped to zero, but this is not easy to fix, as net->dev_base_seq is not visible inside br_mdb_rehash(). In practice, this is rare. Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Graf <tgraf@suug.ch> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11net: smc911x: use io{read,write}*_rep accessorsMatthew Leach
The {read,write}s{b,w,l} operations are not defined by all architectures and are being removed from the asm-generic/io.h interface. This patch replaces the usage of these string functions in the smc911x accessors with io{read,write}{8,16,32}_rep calls instead. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: netdev@vger.kernel.org Signed-off-by: Matthew Leach <matthew@mattleach.net> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11net: remove obsolete simple_strto<foo>Abhijit Pawar
This patch removes the redundant occurences of simple_strto<foo> Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11tun: allow setting ethernet addresss while runningstephen hemminger
This is a pure software device, and ok with live address change. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11net: gro: dev_gro_receive() cleanupEric Dumazet
__napi_gro_receive() is inlined from two call sites for no good reason. Lets move the prep stuff in a function of its own, called only if/when needed. This saves 300 bytes on x86 : # size net/core/dev.o.after net/core/dev.o.before text data bss dec hex filename 51968 1238 1040 54246 d3e6 net/core/dev.o.before 51664 1238 1040 53942 d2b6 net/core/dev.o.after Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11net: fix a race in gro_cell_poll()Eric Dumazet
Dmitry Kravkov reported packet drops for GRE packets since GRO support was added. There is a race in gro_cell_poll() because we call napi_complete() without any synchronization with a concurrent gro_cells_receive() Once bug was triggered, we queued packets but did not schedule NAPI poll. We can fix this issue using the spinlock protected the napi_skbs queue, as we have to hold it to perform skb dequeue anyway. As we open-code skb_dequeue(), we no longer need to mask IRQS, as both producer and consumer run under BH context. Bug added in commit c9e6bc644e (net: add gro_cells infrastructure) Reported-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11bnx2x: use netdev_alloc_frag()Eric Dumazet
Using netdev_alloc_frag() instead of kmalloc() permits better GRO or TCP coalescing behavior, as skb_gro_receive() doesn't have to fallback to frag_list overhead. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dmitry Kravkov <dmitry@broadcom.com> Cc: Eilon Greenstein <eilong@broadcom.com> Acked-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11CIFS: Fix write after setting a read lock for read oplock filesPavel Shilovsky
If we have a read oplock and set a read lock in it, we can't write to the locked area - so, filemap_fdatawrite may fail with a no information for a userspace application even if we request a write to non-locked area. Fix this by populating the page cache without marking affected pages dirty after a successful write directly to the server. Also remove CONFIG_CIFS_SMB2 ifdefs because it's suitable for both CIFS and SMB2 protocols. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: parse the device name into UNC and prepathJeff Layton
This should fix a regression that was introduced when the new mount option parser went in. Also, when the unc= and prefixpath= options are provided, check their values against the ones we parsed from the device string. If they differ, then throw a warning that tells the user that we're using the values from the unc= option for now, but that that will change in 3.10. Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: fix up handling of prefixpath= optionJeff Layton
Currently the code takes care to ensure that the prefixpath has a leading '/' delimiter. What if someone passes us a prefixpath with a leading '\\' instead? The code doesn't properly handle that currently AFAICS. Let's just change the code to skip over any leading delimiter character when copying the prepath. Then, fix up the users of the prepath option to prefix it with the correct delimiter when they use it. Also, there's no need to limit the length of the prefixpath to 1k. If the server can handle it, why bother forbidding it? Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: clean up handling of unc= optionJeff Layton
Make sure we free any existing memory allocated for vol->UNC, just in case someone passes in multiple unc= options. Get rid of the check for too long a UNC. The check for >300 bytes seems arbitrary. We later copy this into the tcon->treeName, for instance and it's a lot shorter than 300 bytes. Eliminate an extra kmalloc and copy as well. Just set the vol->UNC directly with the contents of match_strdup. Establish that the UNC should be stored with '\\' delimiters. Use convert_delimiter to change it in place in the vol->UNC. Finally, move the check for a malformed UNC into cifs_parse_mount_options so we can catch that situation earlier. Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: fix SID binary to string conversionJeff Layton
The authority fields are supposed to be represented by a single 48-bit value. It's also supposed to represent the value as hex if it's equal to or greater than 2^32. This is documented in MS-DTYP, section 2.4.2.1. Also, fix up the max string length to account for this fix. Acked-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11of: *node argument to of_parse_phandle_with_args should be constGuennadi Liakhovetski
The "struct device_node *" argument of of_parse_phandle_with_args() can be const. Making this change makes it explicit that the function will not modify a node. Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> [grant.likely: Resolved conflict with previous patch modifying of_parse_phandle()] Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-12-11of/i2c: add dummy inline functions for when CONFIG_OF_I2C(_MODULE) isn't definedGuennadi Liakhovetski
If CONFIG_OF_I2C and CONFIG_OF_I2C_MODULE are undefined no declaration of of_find_i2c_device_by_node and of_find_i2c_adapter_by_node will be available. Add dummy inline functions to avoid compilation breakage. Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-12-11mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalableIngo Molnar
rmap_walk_anon() and try_to_unmap_anon() appears to be too careful about locking the anon vma: while it needs protection against anon vma list modifications, it does not need exclusive access to the list itself. Transforming this exclusive lock to a read-locked rwsem removes a global lock from the hot path of page-migration intense threaded workloads which can cause pathological performance like this: 96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch | --- perf_trace_sched_switch __schedule schedule schedule_preempt_disabled __mutex_lock_common.isra.6 __mutex_lock_slowpath mutex_lock | |--50.61%-- rmap_walk | move_to_new_page | migrate_pages | migrate_misplaced_page | __do_numa_page.isra.69 | handle_pte_fault | handle_mm_fault | __do_page_fault | do_page_fault | page_fault | __memset_sse2 | | | --100.00%-- worker_thread | | | --100.00%-- start_thread | --49.39%-- page_lock_anon_vma try_to_unmap_anon try_to_unmap migrate_pages migrate_misplaced_page __do_numa_page.isra.69 handle_pte_fault handle_mm_fault __do_page_fault do_page_fault page_fault __memset_sse2 | --100.00%-- worker_thread start_thread With this change applied the profile is now nicely flat and there's no anon-vma related scheduling/blocking. Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(), to make it clearer that it's an exclusive write-lock in that case - suggested by Rik van Riel. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm/rmap: Convert the struct anon_vma::mutex to an rwsemIngo Molnar
Convert the struct anon_vma::mutex to an rwsem, which will help in solving a page-migration scalability problem. (Addressed in a separate patch.) The conversion is simple and straightforward: in every case where we mutex_lock()ed we'll now down_write(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: migrate: Account a transhuge page properly when rate limitingMel Gorman
If there is excessive migration due to NUMA balancing it gets rate limited. It does this by counting the number of pages it has migrated recently but counts a transhuge page as 1 page. Account for it properly. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Account for failed allocations and isolations as migration failuresMel Gorman
Subject says it all. Allocation failures and a failure to isolate should be accounted as a migration failure. This is partially another difference between base page and transhuge page migration. A base page migration makes multiple attempts for these conditions before it would be accounted for as a failure. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Add THP migration for the NUMA working set scanning fault case ↵Mel Gorman
build fix Commit "Add THP migration for the NUMA working set scanning fault case" breaks the build because HPAGE_PMD_SHIFT and HPAGE_PMD_MASK defined to explode without CONFIG_TRANSPARENT_HUGEPAGE: mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put': mm/migrate.c:1549: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1564: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1566: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1573: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1606: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1648: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed CONFIG_NUMA_BALANCING allows compilation without enabling transparent hugepages, so define the dummy function for such a configuration and only define migrate_misplaced_transhuge_page_put() when transparent hugepages are enabled. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Add THP migration for the NUMA working set scanning fault case.Mel Gorman
Note: This is very heavily based on a patch from Peter Zijlstra with fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch put a lot of migration logic into mm/huge_memory.c where it does not belong. This version puts tries to share some of the migration logic with migrate_misplaced_page. However, it should be noted that now migrate.c is doing more with the pagetable manipulation than is preferred. The end result is barely recognisable so as before, the signed-offs had to be removed but will be re-added if the original authors are ok with it. Add THP migration for the NUMA working set scanning fault case. It uses the page lock to serialize. No migration pte dance is necessary because the pte is already unmapped when we decide to migrate. [dhillf@gmail.com: Fix memory leak on isolation failure] [dhillf@gmail.com: Fix transfer of last_nid information] Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: numa: Delay PTE scanning until a task is scheduled on a new nodeMel Gorman
Due to the fact that migrations are driven by the CPU a task is running on there is no point tracking NUMA faults until one task runs on a new node. This patch tracks the first node used by an address space. Until it changes, PTE scanning is disabled and no NUMA hinting faults are trapped. This should help workloads that are short-lived, do not care about NUMA placement or have bound themselves to a single node. This takes advantage of the logic in "mm: sched: numa: Implement slow start for working set sampling" to delay when the checks are made. This will take advantage of processes that set their CPU and node bindings early in their lifetime. It will also potentially allow any initial load balancing to take place. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: numa: Control enabling and disabling of NUMA balancing if ↵Mel Gorman
!SCHED_DEBUG The "mm: sched: numa: Control enabling and disabling of NUMA balancing" depends on scheduling debug being enabled but it's perfectly legimate to disable automatic NUMA balancing even without this option. This should take care of it. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: numa: Control enabling and disabling of NUMA balancingMel Gorman
This patch adds Kconfig options and kernel parameters to allow the enabling and disabling of automatic NUMA balancing. The existance of such a switch was and is very important when debugging problems related to transparent hugepages and we should have the same for automatic NUMA placement. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrateMel Gorman
The PTE scanning rate and fault rates are two of the biggest sources of system CPU overhead with automatic NUMA placement. Ideally a proper policy would detect if a workload was properly placed, schedule and adjust the PTE scanning rate accordingly. We do not track the necessary information to do that but we at least know if we migrated or not. This patch scans slower if a page was not migrated as the result of a NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is now higher than the previous default. Once every minute it will reset the scanner in case of phase changes. This is hilariously crude and the numbers are arbitrary. Workloads will converge quite slowly in comparison to what a proper policy should be able to do. On the plus side, we will chew up less CPU for workloads that have no need for automatic balancing. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Use a two-stage filter to restrict pages being migrated for ↵Mel Gorman
unlikely task<->node relationships Note: This two-stage filter was taken directly from the sched/numa patch "sched, numa, mm: Add the scanning page fault machinery" but is only a partial extraction. As the end result is not necessarily recognisable, the signed-offs-by had to be removed. Will be added back if requested. While it is desirable that all threads in a process run on its home node, this is not always possible or necessary. There may be more threads than exist within the node or the node might over-subscribed with unrelated processes. This can cause a situation whereby a page gets migrated off its home node because the threads clearing pte_numa were running off-node. This patch uses page->last_nid to build a two-stage filter before pages get migrated to avoid problems with short or unlikely task<->node relationships. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: migrate: Set last_nid on newly allocated pageHillf Danton
Pass last_nid from misplaced page to newly allocated migration target page. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: split_huge_page: Transfer last_nid on tail pageHillf Danton
Pass last_nid from head page to tail page. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Introduce last_nid to the page frameMel Gorman
This patch introduces a last_nid field to the page struct. This is used to build a two-stage filter in the next patch that is aimed at mitigating a problem whereby pages migrate to the wrong node when referenced by a process that was running off its home node. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11sched: numa: Slowly increase the scanning period as NUMA faults are handledMel Gorman
Currently the rate of scanning for an address space is controlled by the individual tasks. The next scan is simply determined by 2*p->numa_scan_period. The 2*p->numa_scan_period is arbitrary and never changes. At this point there is still no proper policy that decides if a task or process is properly placed. It just scans and assumes the next NUMA fault will place it properly. As it is assumed that pages will get properly placed over time, increase the scan window each time a fault is incurred. This is a big assumption as noted in the comments. It should be noted that changing to p->numa_scan_period will increase system CPU usage because now the scanning rate has effectively doubled. If that is a problem then the min_rate should be made 200ms instead of restoring the 2* logic. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Rate limit setting of pte_numa if node is saturatedMel Gorman
If there are a large number of NUMA hinting faults and all of them are resulting in migrations it may indicate that memory is just bouncing uselessly around. NUMA balancing cost is likely exceeding any benefit from locality. Rate limit the PTE updates if the node is migration rate-limited. As noted in the comments, this distorts the NUMA faulting statistics. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Rate limit the amount of memory that is migrated between nodesMel Gorman
NOTE: This is very heavily based on similar logic in autonuma. It should be signed off by Andrea but because there was no standalone patch and it's sufficiently different from what he did that the signed-off is omitted. Will be added back if requested. If a large number of pages are misplaced then the memory bus can be saturated just migrating pages between nodes. This patch rate-limits the amount of memory that can be migrating between nodes. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Structures for Migrate On Fault per NUMA migration rate limitingAndrea Arcangeli
This defines the per-node data used by Migrate On Fault in order to rate limit the migration. The rate limiting is applied independently to each destination node. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Migrate pages handled during a pmd_numa hinting faultMel Gorman
To say that the PMD handling code was incorrectly transferred from autonuma is an understatement. The intention was to handle a PMDs worth of pages in the same fault and effectively batch the taking of the PTL and page migration. The copied version instead has the impact of clearing a number of pte_numa PTE entries and whether any page migration takes place depends on racing. This just happens to work in some cases. This patch handles pte_numa faults in batch when a pmd_numa fault is handled. The pages are migrated if they are currently misplaced. Essentially this is making an assumption that NUMA locality is on a PMD boundary but that could be addressed by only setting pmd_numa if all the pages within that PMD are on the same node if necessary. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Migrate on reference policyMel Gorman
This is the simplest possible policy that still does something of note. When a pte_numa is faulted, it is moved immediately. Any replacement policy must at least do better than this and in all likelihood this policy regresses normal workloads. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: numa: Add pte updates, hinting and migration statsMel Gorman
It is tricky to quantify the basic cost of automatic NUMA placement in a meaningful manner. This patch adds some vmstats that can be used as part of a basic costing model. u = basic unit = sizeof(void *) Ca = cost of struct page access = sizeof(struct page) / u Cpte = Cost PTE access = Ca Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock) where Cpte is incurred twice for a read and a write and Wlock is a constant representing the cost of taking or releasing a lock Cnumahint = Cost of a minor page fault = some high constant e.g. 1000 Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u Ci = Cost of page isolation = Ca + Wi where Wi is a constant that should reflect the approximate cost of the locking operation Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma) where Wnuma is the approximate NUMA factor. 1 is local. 1.2 would imply that remote accesses are 20% more expensive Balancing cost = Cpte * numa_pte_updates + Cnumahint * numa_hint_faults + Ci * numa_pages_migrated + Cpagecopy * numa_pages_migrated Note that numa_pages_migrated is used as a measure of how many pages were isolated even though it would miss pages that failed to migrate. A vmstat counter could have been added for it but the isolation cost is pretty marginal in comparison to the overall cost so it seemed overkill. The ideal way to measure automatic placement benefit would be to count the number of remote accesses versus local accesses and do something like benefit = (remote_accesses_before - remove_access_after) * Wnuma but the information is not readily available. As a workload converges, the expection would be that the number of remote numa hints would reduce to 0. convergence = numa_hint_faults_local / numa_hint_faults where this is measured for the last N number of numa hints recorded. When the workload is fully converged the value is 1. This can measure if the placement policy is converging and how fast it is doing it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: sched: numa: Implement slow start for working set samplingPeter Zijlstra
Add a 1 second delay before starting to scan the working set of a task and starting to balance it amongst nodes. [ note that before the constant per task WSS sampling rate patch the initial scan would happen much later still, in effect that patch caused this regression. ] The theory is that short-run tasks benefit very little from NUMA placement: they come and go, and they better stick to the node they were started on. As tasks mature and rebalance to other CPUs and nodes, so does their NUMA placement have to change and so does it start to matter more and more. In practice this change fixes an observable kbuild regression: # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ] !NUMA: 45.291088843 seconds time elapsed ( +- 0.40% ) 45.154231752 seconds time elapsed ( +- 0.36% ) +NUMA, no slow start: 46.172308123 seconds time elapsed ( +- 0.30% ) 46.343168745 seconds time elapsed ( +- 0.25% ) +NUMA, 1 sec slow start: 45.224189155 seconds time elapsed ( +- 0.25% ) 45.160866532 seconds time elapsed ( +- 0.17% ) and it also fixes an observable perf bench (hackbench) regression: # perf stat --null --repeat 10 perf bench sched messaging -NUMA: -NUMA: 0.246225691 seconds time elapsed ( +- 1.31% ) +NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% ) +NUMA 1sec delay: 0.248076230 seconds time elapsed ( +- 1.35% ) The implementation is simple and straightforward, most of the patch deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable knob. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> [ Wrote the changelog, ran measurements, tuned the default. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ↵Mel Gorman
ranges By accounting against the present PTEs, scanning speed reflects the actual present (mapped) memory. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) ratePeter Zijlstra
Previously, to probe the working set of a task, we'd use a very simple and crude method: mark all of its address space PROT_NONE. That method has various (obvious) disadvantages: - it samples the working set at dissimilar rates, giving some tasks a sampling quality advantage over others. - creates performance problems for tasks with very large working sets - over-samples processes with large address spaces but which only very rarely execute Improve that method by keeping a rotating offset into the address space that marks the current position of the scan, and advance it by a constant rate (in a CPU cycles execution proportional manner). If the offset reaches the last mapped address of the mm then it then it starts over at the first address. The per-task nature of the working set sampling functionality in this tree allows such constant rate, per task, execution-weight proportional sampling of the working set, with an adaptive sampling interval/frequency that goes from once per 100ms up to just once per 8 seconds. The current sampling volume is 256 MB per interval. As tasks mature and converge their working set, so does the sampling rate slow down to just a trickle, 256 MB per 8 seconds of CPU time executed. This, beyond being adaptive, also rate-limits rarely executing systems and does not over-sample on overloaded systems. [ In AutoNUMA speak, this patch deals with the effective sampling rate of the 'hinting page fault'. AutoNUMA's scanning is currently rate-limited, but it is also fundamentally single-threaded, executing in the knuma_scand kernel thread, so the limit in AutoNUMA is global and does not scale up with the number of CPUs, nor does it scan tasks in an execution proportional manner. So the idea of rate-limiting the scanning was first implemented in the AutoNUMA tree via a global rate limit. This patch goes beyond that by implementing an execution rate proportional working set sampling rate that is not implemented via a single global scanning daemon. ] [ Dan Carpenter pointed out a possible NULL pointer dereference in the first version of this patch. ] Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com> Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> [ Wrote changelog and fixed bug. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: numa: Add fault driven placement and migrationPeter Zijlstra
NOTE: This patch is based on "sched, numa, mm: Add fault driven placement and migration policy" but as it throws away all the policy to just leave a basic foundation I had to drop the signed-offs-by. This patch creates a bare-bones method for setting PTEs pte_numa in the context of the scheduler that when faulted later will be faulted onto the node the CPU is running on. In itself this does nothing useful but any placement policy will fundamentally depend on receiving hints on placement from fault context and doing something intelligent about it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for nowMel Gorman
The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to explicitly request lazy migration is a good idea but the actual API has not been well reviewed and once released we have to support it. For now this patch prevents an application using the services. This will need to be revisited. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: mempolicy: Implement change_prot_numa() in terms of change_protection()Mel Gorman
This patch converts change_prot_numa() to use change_protection(). As pte_numa and friends check the PTE bits directly it is necessary for change_protection() to use pmd_mknuma(). Hence the required modifications to change_protection() are a little clumsy but the end result is that most of the numa page table helpers are just one or two instructions. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: mempolicy: Add MPOL_MF_LAZYLee Schermerhorn
NOTE: Once again there is a lot of patch stealing and the end result is sufficiently different that I had to drop the signed-offs. Will re-add if the original authors are ok with that. This patch adds another mbind() flag to request "lazy migration". The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected pages are marked PROT_NONE. The pages will be migrated in the fault path on "first touch", if the policy dictates at that time. "Lazy Migration" will allow testing of migrate-on-fault via mbind(). Also allows applications to specify that only subsequently touched pages be migrated to obey new policy, instead of all pages in range. This can be useful for multi-threaded applications working on a large shared data area that is initialized by an initial thread resulting in all pages on one [or a few, if overflowed] nodes. After PROT_NONE, the pages in regions assigned to the worker threads will be automatically migrated local to the threads on 1st touch. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: mempolicy: Use _PAGE_NUMA to migrate pagesMel Gorman
Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but sufficiently different that the signed-off-bys were dropped Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page() pieces into an effective migrate on fault scheme. Note that (on x86) we rely on PROT_NONE pages being !present and avoid the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the page-migration performance. Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: migrate: Drop the misplaced pages reference count if the target node is fullMel Gorman
If we have to avoid migrating to a node that is nearly full, put page and return zero. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: migrate: Introduce migrate_misplaced_page()Peter Zijlstra
Note: This was originally based on Peter's patch "mm/migrate: Introduce migrate_misplaced_page()" but borrows extremely heavily from Andrea's "autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection". The end result is barely recognisable so signed-offs had to be dropped. If original authors are ok with it, I'll re-add the signed-off-bys. Add migrate_misplaced_page() which deals with migrating pages from faults. Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Based-on-work-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: mempolicy: Check for misplaced pageLee Schermerhorn
This patch provides a new function to test whether a page resides on a node that is appropriate for the mempolicy for the vma and address where the page is supposed to be mapped. This involves looking up the node where the page belongs. So, the function returns that node so that it may be used to allocated the page without consulting the policy again. A subsequent patch will call this function from the fault path. Because of this, I don't want to go ahead and allocate the page, e.g., via alloc_page_vma() only to have to free it if it has the correct policy. So, I just mimic the alloc_page_vma() node computation logic--sort of. Note: we could use this function to implement a MPOL_MF_STRICT behavior when migrating pages to match mbind() mempolicy--e.g., to ensure that pages in an interleaved range are reinterleaved rather than left where they are when they reside on any page in the interleave nodemask. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> [ Added MPOL_F_LAZY to trigger migrate-on-fault; simplified code now that we don't have to bother with special crap for interleaved ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>