Pull clock framework changes from Mike Turquette:
"The common clock framework changes for 3.8 are comprised of lots of
fixes for existing platforms as well as new ports for some ARM
platforms. In addition there are new clk drivers for audio devices
and MFDs."
Fix up trivial conflict in <linux/clk-provider.h> (removal of 'inline'
clashing with return type fixes)
* tag 'clk-for-linus' of git://git.linaro.org/people/mturquette/linux: (51 commits)
MAINTAINERS: bad email address for Mike Turquette
clk: introduce optional disable_unused callback
clk: ux500: fix bit error
clk: clock multiplexers may register out of order
clk: ux500: Initial support for abx500 clock driver
CLK: SPEAr: Remove unused dummy apb_pclk
CLK: SPEAr: Correct index scanning done for clock synths
CLK: SPEAr: Update clock rate table
CLK: SPEAr: Add missing clocks
CLK: SPEAr: Set CLK_SET_RATE_PARENT for few clocks
CLK: SPEAr13xx: fix parent names of multiple clocks
CLK: SPEAr13xx: Fix mux clock names
CLK: SPEAr: Fix dev_id & con_id for multiple clocks
clk: move IM-PD1 clocks to drivers/clk
clk: make ICST driver handle the VCO registers
clk: add GPLv2 headers to the Versatile clock files
clk: mxs: Use a better name for the USB PHY clock
clk: spear: Add stub functions for spear3[0|1|2]0_clk_init()
CLK: clk-twl6040: fix return value check in twl6040_clk_probe()
clk: ux500: Register nomadik keypad clock lookups for u8500
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pinctrl changes from Linus Walleij:
"These are the first and major pinctrl changes for the v3.8 merge
cycle. Some of this is used as merge base for other trees so I better
be early on the trigger.
As can be seen from the diffstat the major changes are:
- A big conversion of the AT91 pinctrl driver and the associated ACKed
platform changes under arch/arm/mach-at91 and its device trees. This
has been coordinated with the AT91 maintainers to go in through the
pinctrl tree.
- A larger chunk of changes to the SPEAr drivers and the addition of
the "plgpio" driver for the SPEAr as well.
- The removal of the remnants of the Nomadik driver from the arch/arm
tree and fusion of that into the Nomadik driver and platform data
header files.
- Some local movement in the Marvell MVEBU drivers, these now have
their own subdirectory.
- The addition of a chunk of code to gpiolib under drivers/gpio to
register gpio-to-pin range mappings from the GPIO side of things.
This has been requested by Grant Likely and is now implemented, it
is particularly useful for device tree work.
Then we have incremental updates all over the place, many of these are
cleanups and fixes from Axel Lin who has done a great job of removing
minor mistakes and compilation annoyances."
* tag 'pinctrl-for-v3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (114 commits)
ARM: mmp: select PINCTRL for ARCH_MMP
pinctrl: Drop selecting PINCONF for MMP2, PXA168 and PXA910
pinctrl: pinctrl-single: Fix error check condition
pinctrl: SPEAr: Update error check for unsigned variables
gpiolib: Fix use after free in gpiochip_add_pin_range
gpiolib: rename pin range arguments
pinctrl: single: support gpio request and free
pinctrl: generic: add input schmitt disable parameter
pinctrl/u300/coh901: stop spawning pinctrl from GPIO
pinctrl/u300/coh901: let the gpio_chip register the range
pinctrl: add function to retrieve range from pin
gpiolib: return any error code from range creation
pinctrl: make range registration defer properly
gpiolib: rename find_pinctrl_*
gpiolib: let gpiochip_add_pin_range() specify offset
ARM: at91: pm9g45: add mmc support
ARM: at91: Animeo IP: add mmc support
ARM: at91: dt: add mmc pinctrl for Atmel reference boards
ARM: at91: dt: at91sam9: add mmc pinctrl support
ARM: at91/dts: add nodes for atmel hsmci controllers for atmel boards
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon updates from Guenter Roeck:
"New driver: DA9055
Added/improved support for new chips in existing drivers: Z650/670,
N550/570, ADS7830, AMD 16h family"
* tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (da9055) Fix chan_mux[DA9055_ADC_ADCIN3] setting
hwmon: DA9055 HWMON driver
hwmon: (coretemp) List TjMax for Z650/670 and N550/570
hwmon: (coretemp) Drop N4xx, N5xx, D4xx, D5xx CPUs from tjmax table
hwmon: (coretemp) Use model table instead of if/else to identify CPU models
hwmon: da9052: Use da9052_reg_update for rmw operations
hwmon: (coretemp) Drop dependency on PCI for TjMax detection on Atom CPUs
hwmon: (ina2xx) use module_i2c_driver to simplify the code
hwmon: (ads7828) add support for ADS7830
hwmon: (ads7828) driver cleanup
x86,AMD: Power driver support for AMD's family 16h processors
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc
Pull MMC updates from Chris Ball:
"MMC highlights for 3.8:
Core:
- Expose access to the eMMC RPMB ("Replay Protected Memory Block")
area by extending the existing mmc_block ioctl.
- Add SDIO powered-suspend DT properties to the core MMC DT binding.
- Add no-1-8-v DT flag for boards where the SD controller reports
that it supports 1.8V but the board itself has no way to switch to
1.8V.
- More work on switching to 1.8V UHS support using a vqmmc regulator.
- Fix up a case where the slot-gpio helper may fail to reset the host
controller properly if a card was removed during a transfer.
- Fix several cases where a broken device could cause an infinite
loop while we wait for a register to update.
Drivers:
- at91-mci: Remove obsolete driver, atmel-mci handles these devices
now.
- sdhci-dove: Allow using GPIOs for card-detect notifications.
- sdhci-esdhc: Fix for recovering from ADMA errors on broken silicon.
- sdhci-s3c: Add pinctrl support.
- wmt-sdmmc: New driver for WonderMedia SD/MMC controllers."
* tag 'mmc-updates-for-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (65 commits)
mmc: sdhci: implement the .card_event() method
mmc: extend the slot-gpio card-detection to use host's .card_event() method
mmc: add a card-event host operation
mmc: sdhci-s3c: Fix compilation warning
mmc: sdhci-pci: Enable SDHCI_CAN_DO_HISPD for Ricoh SDHCI controller
mmc: sdhci-dove: allow GPIOs to be used for card detection on Dove
mmc: sdhci-dove: use two-stage initialization for sdhci-pltfm
mmc: sdhci-dove: use devm_clk_get()
mmc: eSDHC: Recover from ADMA errors
mmc: dw_mmc: remove duplicated buswidth code
mmc: dw_mmc: relocate where dw_mci_setup_bus() is called from
mmc: Limit MMC speed to 52MHz if not HS200
mmc: dw_mmc: use devres functions in dw_mmc
mmc: sh_mmcif: remove unneeded clock connection ID
mmc: sh_mobile_sdhi: remove unneeded clock connection ID
mmc: sh_mobile_sdhi: fix clock frequency printing
mmc: Remove redundant null check before kfree in bus.c
mmc: Remove redundant null check before kfree in sdio_bus.c
mmc: sdhci-imx-esdhc: use more devm_* functions
mmc: dt: add no-1-8-v device tree flag
...
|
|
__copy_skb_header(nskb, p) already copied p->cb[], no need to copy
it again.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To handle rehashing, introduce a global variable 'br_mdb_rehash_seq'
which is incremented on every rehash, and assign
net->dev_base_seq + br_mdb_rehash_seq to cb->seq.
In theory cb->seq could wrap to zero, but this is not
easy to fix, as net->dev_base_seq is not visible inside
br_mdb_rehash(). In practice, this is rare.
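A minimal sketch of the dump-side assignment described above (the
surrounding function body is illustrative, not the exact patch):

/* bumped under the multicast lock on every rehash */
static unsigned int br_mdb_rehash_seq;

static int br_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
{
	struct net *net = sock_net(skb->sk);

	/* lets nl_dump_check_consistent() notice both device-list
	 * changes and mdb rehashes between dump rounds */
	cb->seq = net->dev_base_seq + br_mdb_rehash_seq;

	/* ... walk the hash table and fill the skb ... */

	return skb->len;
}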
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Though the processing of the ordered extents differs a bit from the
delalloc inode flush, it can be seen as a subset of that flush, so we
also handle it with the flush workers.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The processing of the ordered operations is similar to the delalloc
inode flush, so we handle them with the flush workers as well.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
This patch introduces a new worker pool named "flush_workers". If we
want to force all inodes with pending delalloc out to disk, we can
queue those inodes into the work queue of the worker pool; this way,
those inodes will be flushed by multiple tasks.
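A hedged sketch of how inodes get handed to the pool (helper names
follow the btrfs worker API of this era; not the exact patch):

/* hand each inode with pending delalloc to the flush_workers pool so
 * the flushing runs concurrently instead of in the caller's context */
list_for_each_entry(binode, &root->fs_info->delalloc_inodes,
		    delalloc_inodes) {
	struct btrfs_delalloc_work *work;

	work = btrfs_alloc_delalloc_work(&binode->vfs_inode, 0, 0);
	if (!work)
		return -ENOMEM;
	btrfs_queue_worker(&root->fs_info->flush_workers, &work->work);
}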
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Dave gave me an image of a very full file system that would abort the
transaction because it ran out of space while committing the transaction.
This is because we would think there was plenty of room to create a snapshot
even though the global reserve was not full. This happens because we
calculate the global reserve size before we unpin any space, so after we
unpin the space we allow reservations to occur even though we haven't
reserved all of the space for our global reserve. Fix this by adding to the
global reserve while unpinning in order to make sure we always have enough
space to do our work. With this patch we no longer end up with an
aborted transaction; instead we return ENOSPC properly to the person
trying to create the snapshot. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
'disk_key' is not used at all.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The argument 'tree_mod_log' is not necessary since all of its callers enable it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Since we don't use MOD_LOG_KEY_REMOVE_WHILE_MOVING to add nritems
during rewinding, we should insert a MOD_LOG_KEY_REMOVE operation first.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Key MOD_LOG_KEY_REMOVE_WHILE_MOVING means that we're doing memmove inside
an extent buffer node, and the node's number of items remains unchanged
(unless we are inserting a single pointer, but we have MOD_LOG_KEY_ADD for that).
So we don't need to increase the node's number of items during
rewinding; otherwise we may get a node larger than leafsize and cause
general protection errors later.
Here are the details:
- If we do memory move for inserting a single pointer, we need to
add node's nritems by one, and we honor MOD_LOG_KEY_ADD for adding.
- If we do memory move for deleting a single pointer, we need to
decrease node's nritems by one, and we honor MOD_LOG_KEY_REMOVE for
deleting.
- If we do memory move for balance left/right, we need to decrease
the node's nritems, and we honor MOD_LOG_KEY_REMOVE for balancing.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
When we find a bitmap free space entry, we may check whether the
previous extent entry covers the offset. But if we find that this
entry is also a bitmap entry, we will continue checking the previous
entry of the current one in a while loop. This is unnecessary because
it is impossible for an extent entry in front of a bitmap entry to
cover the offset of an entry after that bitmap entry.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Alex reported a problem where we were writing between chunks on a rbd
device. The thing is we do bio_add_page using logical offsets, but the
physical offset may be different. So when we map the bio, now check
to see if the bio is still ok with the physical offset; if it is not,
split the bio up and redo the bio_add_page with the physical sector.
This fixes the
problem for Alex and doesn't affect performance in the normal case. Thanks,
Reported-and-tested-by: Alex Elder <elder@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
In some places (such as when evicting an inode), we just cannot flush
the reserved space of delalloc; flushing the delayed directory index
and delayed inode is OK, but we don't try to flush those things and
just give up when there is not enough space to be reserved. This
patch fixes this problem.
We define 3 types of flush operations: NO_FLUSH, FLUSH_LIMIT and
FLUSH_ALL, as sketched below. If we are in the transaction, we should
not flush anything, or a deadlock would happen, so use NO_FLUSH. If
flushing the reserved space of delalloc would cause a deadlock, use
FLUSH_LIMIT. In the other cases FLUSH_ALL is used, and we will flush
everything.
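The three modes map naturally onto an enum handed to the reservation
code; a sketch along the lines described above:

enum btrfs_reserve_flush_enum {
	/* in a transaction: flushing anything could deadlock */
	BTRFS_RESERVE_NO_FLUSH,
	/* flushing delalloc could deadlock: only flush the safe
	 * things (delayed directory index and delayed inode) */
	BTRFS_RESERVE_FLUSH_LIMIT,
	/* safe to flush everything */
	BTRFS_RESERVE_FLUSH_ALL,
};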
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The comment does not match the code. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
div_factor{_fine} has been implemented twice; clean that up.
Also move the helpers into an independent file named math.h, because
they are common math functions.
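A sketch of the consolidated header (fs/btrfs/math.h), assuming the
helpers keep their historical shape:

#ifndef __BTRFS_MATH_H
#define __BTRFS_MATH_H

#include <asm/div64.h>

/* scale num by factor/10 */
static inline u64 div_factor(u64 num, int factor)
{
	if (factor == 10)
		return num;
	num *= factor;
	do_div(num, 10);
	return num;
}

/* scale num by factor/100 */
static inline u64 div_factor_fine(u64 num, int factor)
{
	if (factor == 100)
		return num;
	num *= factor;
	do_div(num, 100);
	return num;
}

#endif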
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The {read,write}s{b,w,l} operations are not defined by all
architectures and are being removed from the asm-generic/io.h
interface.
This patch replaces the usage of these string functions in the smc911x
accessors with io{read,write}{8,16,32}_rep calls instead.
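The conversion is mechanical; a before/after sketch (DATA_REG is an
illustrative register offset, not the smc911x macro):

/* before: string accessors not defined on all architectures */
readsl(ioaddr + DATA_REG, buf, words);
writesl(ioaddr + DATA_REG, buf, words);

/* after: the portable _rep equivalents */
ioread32_rep(ioaddr + DATA_REG, buf, words);
iowrite32_rep(ioaddr + DATA_REG, buf, words);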
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Matthew Leach <matthew@mattleach.net>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch removes the redundant occurrences of simple_strto<foo>.
Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is a pure software device, and it is OK with a live address change.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
__napi_gro_receive() is inlined from two call sites for no good reason.
Let's move the prep stuff into a function of its own, called only
if/when needed. This saves 300 bytes on x86:
# size net/core/dev.o.after net/core/dev.o.before
text data bss dec hex filename
51968 1238 1040 54246 d3e6 net/core/dev.o.before
51664 1238 1040 53942 d2b6 net/core/dev.o.after
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Dmitry Kravkov reported packet drops for GRE packets since GRO support
was added.
There is a race in gro_cell_poll() because we call napi_complete()
without any synchronization with a concurrent gro_cells_receive().
Once the bug was triggered, we had queued packets but no scheduled
NAPI poll.
We can fix this issue using the spinlock protecting the napi_skbs
queue, as we have to hold it to perform the skb dequeue anyway. As we
open-code skb_dequeue(), we no longer need to mask IRQs, as both
producer and consumer run under BH context.
Bug added in commit c9e6bc644e (net: add gro_cells infrastructure)
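A sketch of the fixed polling loop (assuming the upstream gro_cells
layout): napi_complete() is now decided while the queue lock is held,
so a concurrent enqueue can no longer be stranded without a NAPI
reschedule:

static int gro_cell_poll(struct napi_struct *napi, int budget)
{
	struct gro_cell *cell = container_of(napi, struct gro_cell, napi);
	struct sk_buff *skb;
	int work_done = 0;

	/* open-coded skb_dequeue(): producer and consumer both run in
	 * BH context, so plain spin_lock() (no IRQ masking) is enough */
	spin_lock(&cell->napi_skbs.lock);
	while (work_done < budget) {
		skb = __skb_dequeue(&cell->napi_skbs);
		if (!skb)
			break;
		spin_unlock(&cell->napi_skbs.lock);
		napi_gro_receive(napi, skb);
		work_done++;
		spin_lock(&cell->napi_skbs.lock);
	}

	if (work_done < budget)
		napi_complete(napi);
	spin_unlock(&cell->napi_skbs.lock);
	return work_done;
}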
Reported-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Using netdev_alloc_frag() instead of kmalloc() permits better GRO or
TCP coalescing behavior, as skb_gro_receive() doesn't have to fallback
to frag_list overhead.
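The pattern, roughly (the rx_frag_size field is illustrative of the
driver's bookkeeping, not necessarily its exact name):

/* take rx buffer memory from the per-cpu page-frag cache rather than
 * kmalloc(); frag-backed skbs let skb_gro_receive() merge segments as
 * page frags instead of chaining them onto frag_list */
data = netdev_alloc_frag(fp->rx_frag_size);
if (!data)
	return -ENOMEM;
skb = build_skb(data, fp->rx_frag_size);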
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dmitry Kravkov <dmitry@broadcom.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
Acked-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If we have a read oplock and set a read lock in it, we can't write to
the locked area, so filemap_fdatawrite may fail with no information
for a userspace application even if we request a write to a
non-locked area. Fix this by populating the page cache without
marking affected pages dirty after a successful write directly to the
server.
Also remove CONFIG_CIFS_SMB2 ifdefs because it's suitable for both CIFS
and SMB2 protocols.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
This should fix a regression that was introduced when the new mount
option parser went in. Also, when the unc= and prefixpath= options
are provided, check their values against the ones we parsed from
the device string. If they differ, then throw a warning that tells
the user that we're using the values from the unc= option for now,
but that that will change in 3.10.
Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
Currently the code takes care to ensure that the prefixpath has a
leading '/' delimiter. What if someone passes us a prefixpath with a
leading '\\' instead? The code doesn't properly handle that currently
AFAICS.
Let's just change the code to skip over any leading delimiter character
when copying the prepath. Then, fix up the users of the prepath option
to prefix it with the correct delimiter when they use it.
Also, there's no need to limit the length of the prefixpath to 1k. If
the server can handle it, why bother forbidding it?
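A hedged sketch of the copy (variable names illustrative):

/* accept either delimiter style: skip one leading '/' or '\\' and
 * store the prepath without it; no arbitrary length cap */
pos = value;
if (*pos == '/' || *pos == '\\')
	pos++;

vol->prepath = kstrdup(pos, GFP_KERNEL);
if (!vol->prepath)
	goto out_nomem;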
Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
Make sure we free any existing memory allocated for vol->UNC, just in
case someone passes in multiple unc= options.
Get rid of the check for too long a UNC. The check for >300 bytes seems
arbitrary. We later copy this into tcon->treeName, for instance, and
it's a lot shorter than 300 bytes.
Eliminate an extra kmalloc and copy as well. Just set the vol->UNC
directly with the contents of match_strdup.
Establish that the UNC should be stored with '\\' delimiters. Use
convert_delimiter to change it in place in the vol->UNC.
Finally, move the check for a malformed UNC into
cifs_parse_mount_options so we can catch that situation earlier.
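Roughly, the unc= handling becomes (a sketch; the error label is
illustrative):

case Opt_unc:
	kfree(vol->UNC);		/* tolerate repeated unc= options */
	vol->UNC = match_strdup(args);	/* no extra kmalloc+copy */
	if (!vol->UNC)
		goto cifs_parse_mount_err;

	convert_delimiter(vol->UNC, '\\');	/* canonical '\\' form */
	if (vol->UNC[0] != '\\' || vol->UNC[1] != '\\')
		goto cifs_parse_mount_err;	/* malformed UNC */
	break;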
Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
The authority fields are supposed to be represented by a single 48-bit
value. It's also supposed to represent the value as hex if it's equal to
or greater than 2^32. This is documented in MS-DTYP, section 2.4.2.1.
Also, fix up the max string length to account for this fix.
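Per MS-DTYP 2.4.2.1 the six authority bytes form one 48-bit big-endian
value; a sketch of the formatting (field and buffer names
illustrative):

/* assemble the 6-byte NT authority into one 48-bit value */
u64 auth = 0;
int i;

for (i = 0; i < NUM_AUTHS; i++)
	auth = (auth << 8) | sid->authority[i];

/* values >= 2^32 are printed in hex, smaller ones in decimal */
if (auth >= (1ULL << 32))
	len = sprintf(p, "-0x%llx", (unsigned long long)auth);
else
	len = sprintf(p, "-%llu", (unsigned long long)auth);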
Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
The "struct device_node *" argument of of_parse_phandle_with_args() can
be const. Making this change makes it explicit that the function will
not modify a node.
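The constified prototype, for reference:

/* np is only read, never modified, hence the const qualifier */
extern int of_parse_phandle_with_args(const struct device_node *np,
				      const char *list_name,
				      const char *cells_name,
				      int index,
				      struct of_phandle_args *out_args);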
Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
[grant.likely: Resolved conflict with previous patch modifying of_parse_phandle()]
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
|
|
If CONFIG_OF_I2C and CONFIG_OF_I2C_MODULE are undefined no declaration of
of_find_i2c_device_by_node and of_find_i2c_adapter_by_node will be
available. Add dummy inline functions to avoid compilation breakage.
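The usual stub pattern; a sketch of the of_i2c.h change:

#if defined(CONFIG_OF_I2C) || defined(CONFIG_OF_I2C_MODULE)
extern struct i2c_client *of_find_i2c_device_by_node(struct device_node *node);
extern struct i2c_adapter *of_find_i2c_adapter_by_node(struct device_node *node);
#else
static inline struct i2c_client *of_find_i2c_device_by_node(struct device_node *node)
{
	return NULL;
}

static inline struct i2c_adapter *of_find_i2c_adapter_by_node(struct device_node *node)
{
	return NULL;
}
#endif /* CONFIG_OF_I2C */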
Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
|
|
If the server sends us a target that looks like an outlier, but
is lower than the existing target, then respect it anyway.
However, defer actually updating the generation counter until
we get a target that doesn't look like an outlier.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
If ore_write() fails, we would unlock the pages of pcol, which is now
empty, rather than pcol_copy, which owns the pages when ore_write() is
called. This means that no pages will actually be unlocked
(pcol.nr_pages == 0) and the writing process (more accurately, the
syncing process) will hang waiting for a writeback notification that
never comes.
Moreover, if ore_write() fails, pcol_free() is called for pcol,
whereas pcol_copy is the object owning the ore_io_state, thus leaking
the ore_io_state.
[Boaz]
I have simplified Idan's original patch a bit; everything else still
holds.
Signed-off-by: Idan Kedar <idank@tonian.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
|
|
Most (all) NFS4ERR_BADSLOT errors are due to the client failing to
respect the server's sr_highest_slotid limit. This mainly happens
due to reordered RPC requests.
The way to handle it is simply to drop the slot that we're using,
and retry using the new highest_slotid limits.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
rmap_walk_anon() and try_to_unmap_anon() appear to be too careful
about locking the anon vma: while they need protection against anon
vma list modifications, they do not need exclusive access to the list
itself.
Transforming this exclusive lock to a read-locked rwsem removes
a global lock from the hot path of page-migration intense
threaded workloads which can cause pathological performance like
this:
96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
|
--- perf_trace_sched_switch
__schedule
schedule
schedule_preempt_disabled
__mutex_lock_common.isra.6
__mutex_lock_slowpath
mutex_lock
|
|--50.61%-- rmap_walk
| move_to_new_page
| migrate_pages
| migrate_misplaced_page
| __do_numa_page.isra.69
| handle_pte_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| __memset_sse2
| |
| --100.00%-- worker_thread
| |
| --100.00%-- start_thread
|
--49.39%-- page_lock_anon_vma
try_to_unmap_anon
try_to_unmap
migrate_pages
migrate_misplaced_page
__do_numa_page.isra.69
handle_pte_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
__memset_sse2
|
--100.00%-- worker_thread
start_thread
With this change applied the profile is now nicely flat
and there's no anon-vma related scheduling/blocking.
Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
to make it clearer that it's an exclusive write-lock in
that case - suggested by Rik van Riel.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Convert the struct anon_vma::mutex to an rwsem, which will help
in solving a page-migration scalability problem. (Addressed in
a separate patch.)
The conversion is simple and straightforward: in every case
where we mutex_lock()ed we'll now down_write().
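A sketch of the conversion (the rwsem keeps living behind the root
anon_vma, as the mutex did):

struct anon_vma {
	struct anon_vma *root;		/* root of this anon_vma tree */
	struct rw_semaphore rwsem;	/* was: struct mutex mutex */
	/* ... */
};

/* every former mutex_lock()/mutex_unlock() pair becomes an
 * exclusive down_write()/up_write() */
static inline void anon_vma_lock(struct anon_vma *anon_vma)
{
	down_write(&anon_vma->root->rwsem);
}

static inline void anon_vma_unlock(struct anon_vma *anon_vma)
{
	up_write(&anon_vma->root->rwsem);
}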
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
If there is excessive migration due to NUMA balancing it gets rate
limited. It does this by counting the number of pages it has migrated
recently but counts a transhuge page as 1 page. Account for it properly.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Subject says it all. Allocation failures and a failure to isolate should
be accounted as a migration failure. This is partially another
difference between base page and transhuge page migration. A base page
migration makes multiple attempts for these conditions before it would
be accounted for as a failure.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Build fix: the commit "Add THP migration for the NUMA working set
scanning fault case" breaks the build because HPAGE_PMD_SHIFT and
HPAGE_PMD_MASK are defined to explode without
CONFIG_TRANSPARENT_HUGEPAGE:
mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put':
mm/migrate.c:1549: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1564: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1566: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1573: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1606: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1648: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
CONFIG_NUMA_BALANCING allows compilation without enabling transparent
hugepages, so define the dummy function for such a configuration and only
define migrate_misplaced_transhuge_page_put() when transparent hugepages
are enabled.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Note: This is very heavily based on a patch from Peter Zijlstra with
fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch
put a lot of migration logic into mm/huge_memory.c where it does
not belong. This version tries to share some of the migration
logic with migrate_misplaced_page. However, it should be noted
that now migrate.c is doing more with the pagetable manipulation
than is preferred. The end result is barely recognisable so as
before, the signed-offs had to be removed but will be re-added if
the original authors are ok with it.
Add THP migration for the NUMA working set scanning fault case.
It uses the page lock to serialize. No migration pte dance is
necessary because the pte is already unmapped when we decide
to migrate.
[dhillf@gmail.com: Fix memory leak on isolation failure]
[dhillf@gmail.com: Fix transfer of last_nid information]
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Due to the fact that migrations are driven by the CPU a task is running
on, there is no point tracking NUMA faults until one task runs on a new
node. This patch tracks the first node used by an address space. Until
it changes, PTE scanning is disabled and no NUMA hinting faults are
trapped. This should help workloads that are short-lived, do not care
about NUMA placement or have bound themselves to a single node.
This takes advantage of the logic in "mm: sched: numa: Implement slow
start for working set sampling" to delay when the checks are made. This
will take advantage of processes that set their CPU and node bindings
early in their lifetime. It will also potentially allow any initial load
balancing to take place.
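Loosely, the check in the scanner looks like this (field and constant
names are assumptions based on the description above):

/* remember the first node the address space ran on; until the task
 * is scheduled on some other node, skip PTE scanning entirely */
if (mm->first_nid == NUMA_PTE_SCAN_INIT)
	mm->first_nid = numa_node_id();
if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
	if (mm->first_nid == numa_node_id())
		return;				/* still on the first node */
	mm->first_nid = NUMA_PTE_SCAN_ACTIVE;	/* node changed: start scanning */
}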
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
The patch "mm: sched: numa: Control enabling and disabling of NUMA
balancing" depends on scheduling debug being enabled, but it's
perfectly legitimate to disable automatic NUMA balancing even with
!SCHED_DEBUG. This should take care of it.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
This patch adds Kconfig options and kernel parameters to allow the
enabling and disabling of automatic NUMA balancing. The existence
of such a switch was and is very important when debugging problems
related to transparent hugepages and we should have the same for
automatic NUMA placement.
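A hedged sketch of the boot-time half of the switch (handler shape
assumed; set_numabalancing_state() is the toggle the description
implies):

static int __init setup_numabalancing(char *str)
{
	int ret = 0;

	if (!str)
		goto out;

	if (!strcmp(str, "enable")) {
		set_numabalancing_state(true);
		ret = 1;
	} else if (!strcmp(str, "disable")) {
		set_numabalancing_state(false);
		ret = 1;
	}
out:
	if (!ret)
		pr_warn("Unable to parse numa_balancing=\n");

	return ret;
}
__setup("numa_balancing=", setup_numabalancing);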
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
The PTE scanning rate and fault rates are two of the biggest sources of
system CPU overhead with automatic NUMA placement. Ideally a proper policy
would detect if a workload was properly placed, schedule and adjust the
PTE scanning rate accordingly. We do not track the necessary information
to do that but we at least know if we migrated or not.
This patch scans slower if a page was not migrated as the result of a
NUMA hinting fault, up to sysctl_numa_balancing_scan_period_max,
which is now higher than the previous default. Once every minute it
will reset
the scanner in case of phase changes.
This is hilariously crude and the numbers are arbitrary. Workloads will
converge quite slowly in comparison to what a proper policy should be able
to do. On the plus side, we will chew up less CPU for workloads that have
no need for automatic balancing.
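The heuristic is a couple of lines in the fault-accounting path; a
sketch (the exact increment is an assumption):

/* if the last hinting fault did not lead to a migration, assume the
 * task is converging and back the scanner off, up to the maximum;
 * this is reset once a minute elsewhere in case of phase changes */
if (!migrated)
	p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
				  p->numa_scan_period + jiffies_to_msecs(10));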
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Note: This two-stage filter was taken directly from the sched/numa patch
"sched, numa, mm: Add the scanning page fault machinery" but is
only a partial extraction. As the end result is not necessarily
recognisable, the signed-offs-by had to be removed. Will be added
back if requested.
While it is desirable that all threads in a process run on its home
node, this is not always possible or necessary. There may be more
threads than the node can accommodate, or the node might be
over-subscribed with unrelated processes.
This can cause a situation whereby a page gets migrated off its home
node because the threads clearing pte_numa were running off-node. This
patch uses page->last_nid to build a two-stage filter before pages get
migrated to avoid problems with short or unlikely task<->node
relationships.
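The filter itself is tiny; a sketch of the check in the fault path
(page_xchg_last_nid() is the helper from the related last_nid
patches):

/* two-stage filter: record which node is faulting now and only
 * migrate once the same node faults the page twice in a row */
last_nid = page_xchg_last_nid(page, polnid);
if (last_nid != polnid)
	goto out;	/* first fault from this node: defer migration */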
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Pass last_nid from misplaced page to newly allocated migration target page.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Pass last_nid from head page to tail page.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
This patch introduces a last_nid field to the page struct. This is used
to build a two-stage filter in the next patch that is aimed at
mitigating a problem whereby pages migrate to the wrong node when
referenced by a process that was running off its home node.
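A sketch of the new field (guarded so non-NUMA-balancing configs pay
nothing):

struct page {
	/* ... existing fields ... */
#ifdef CONFIG_NUMA_BALANCING
	int last_nid;	/* node that most recently took a NUMA
			 * hinting fault on this page */
#endif
};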
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Currently the rate of scanning for an address space is controlled
by the individual tasks. The next scan is simply determined by
2*p->numa_scan_period.
The 2*p->numa_scan_period is arbitrary and never changes. At this point
there is still no proper policy that decides if a task or process is
properly placed. It just scans and assumes the next NUMA fault will
place it properly. As it is assumed that pages will get properly placed
over time, increase the scan window each time a fault is incurred. This
is a big assumption as noted in the comments.
It should be noted that changing to p->numa_scan_period will increase
system CPU usage because now the scanning rate has effectively doubled.
If that is a problem then the min_rate should be made 200ms instead of
restoring the 2* logic.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|