summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2013-07-03mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()Cody P Schafer
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm/page_alloc: relocate comment to be directly above code it refers to.Cody P Schafer
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm/page_alloc: factor setup_pageset() into pageset_init() and ↵Cody P Schafer
pageset_set_batch() Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly ↵Cody P Schafer
recalulate high Simply moves calculation of the new 'high' value outside the for_each_possible_cpu() loop, as it does not depend on the cpu. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead ↵Cody P Schafer
of stop_machine() zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a percpu pageset based on a zone's ->managed_pages. We don't need to drain the entire percpu pageset just to modify these fields. This lets us avoid calling setup_pageset() (and the draining required to call it) and instead allows simply setting the fields' values (with some attention paid to memory barriers to prevent the relationship between ->batch and ->high from being thrown off). This does change the behavior of zone_pcp_update() as the percpu pagesets will not be drained when zone_pcp_update() is called (they will end up being shrunk, not completely drained, later when a 0-order page is freed in free_hot_cold_page()). Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCECody P Schafer
pcp->batch could change at any point, avoid relying on it being a stable value. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm/page_alloc: insert memory barriers to allow async update of pcp batch and ↵Cody P Schafer
high Introduce pageset_update() to perform a safe transision from one set of pcp->{batch,high} to a new set using memory barriers. This ensures that batch is always set to a safe value (1) prior to updating high, and ensure that high is fully updated before setting the real value of batch. It avoids ->batch ever rising above ->high. Suggested by Gilad Ben-Yossef in these threads: https://lkml.org/lkml/2013/4/9/23 https://lkml.org/lkml/2013/4/10/49 Also reproduces his proposed comment. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->highCody P Schafer
Because we are going to rely upon a careful transision between old and new ->high and ->batch values using memory barriers and will remove stop_machine(), we need to prevent multiple updaters from interweaving their memory writes. Add a simple mutex to protect both update loops. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm/page_alloc: factor out setting of pcp->high and pcp->batchCody P Schafer
"Problems" with the current code: 1: there is a lack of synchronization in setting ->high and ->batch in percpu_pagelist_fraction_sysctl_handler() 2: stop_machine() in zone_pcp_update() is unnecissary. 3: zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero To fix: 1: add memory barriers, a safe ->batch value, an update side mutex when updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users that expect a stable value. 2: avoid draining pages in zone_pcp_update(), rely upon the memory barriers added to fix #1 3: factor out quite a few functions, and then call the appropriate one. Note that it results in a change to the behavior of zone_pcp_update(), which is used by memory_hotplug. I'm rather certain that I've diserned (and preserved) the essential behavior (changing ->high and ->batch), and only eliminated unneeded actions (draining the per cpu pages), but this may not be the case. Further note that the draining of pages that previously took place in zone_pcp_update() occured after repeated draining when attempting to offline a page, and after the offline has "succeeded". It appears that the draining was added to zone_pcp_update() to avoid refactoring setup_pageset() into 2 funtions. This patch: Creates pageset_set_batch() for use in setup_pageset(). pageset_set_batch() imitates the functionality of setup_pagelist_highmark(), but uses the boot time (percpu_pagelist_fraction == 0) calculations for determining ->high based on ->batch. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFTLibin
(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented as a inline funcion vma_pages() in linux/mm.h, so using it. Signed-off-by: Libin <huawei.libin@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: remove compressed copy from zram in-memoryMinchan Kim
Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can avoid unnecessary write. But the problem in in-memory swap(ex, zram) is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. This patch makes swap subsystem free swap slot as soon as swap-read is completed and make the swapcache page dirty so the page should be written out the swap device to reclaim it. It means we never lose it. I tested this patch with kernel compile workload. 1. before compile time : 9882.42 zram max wasted space by fragmentation: 13471881 byte memory space consumed by zram: 174227456 byte the number of slot free notify: 206684 2. after compile time : 9653.90 zram max wasted space by fragmentation: 11805932 byte memory space consumed by zram: 154001408 byte the number of slot free notify: 426972 [akpm@linux-foundation.org: tweak comment text] [artem.savkov@gmail.com: fix BUG due to non-swapcache pages in end_swap_bio_read()] [akpm@linux-foundation.org: invert unlikely() test, augment comment, 80-col cleanup] Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Artem Savkov <artem.savkov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Konrad Rzeszutek Wilk <konrad@darnok.org> Cc: Shaohua Li <shli@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm, memcg: don't take task_lock in task_in_mem_cgroupDavid Rientjes
For processes that have detached their mm's, task_in_mem_cgroup() unnecessarily takes task_lock() when rcu_read_lock() is all that is necessary to call mem_cgroup_from_task(). While we're here, switch task_in_mem_cgroup() to return bool. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: soft-dirty bits for user memory changes trackingPavel Emelyanov
The soft-dirty is a bit on a PTE which helps to track which pages a task writes to. In order to do this tracking one should 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs) 2. Wait some time. 3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries) To do this tracking, the writable bit is cleared from PTEs when the soft-dirty bit is. Thus, after this, when the task tries to modify a page at some virtual address the #PF occurs and the kernel sets the soft-dirty bit on the respective PTE. Note, that although all the task's address space is marked as r/o after the soft-dirty bits clear, the #PF-s that occur after that are processed fast. This is so, since the pages are still mapped to physical memory, and thus all the kernel does is finds this fact out and puts back writable, dirty and soft-dirty bits on the PTE. Another thing to note, is that when mremap moves PTEs they are marked with soft-dirty as well, since from the user perspective mremap modifies the virtual memory at mremap's new address. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03Merge tag 'pm+acpi-3.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management and ACPI updates from Rafael Wysocki: "This time the total number of ACPI commits is slightly greater than the number of cpufreq commits, but Viresh Kumar (who works on cpufreq) remains the most active patch submitter. To me, the most significant change is the addition of offline/online device operations to the driver core (with the Greg's blessing) and the related modifications of the ACPI core hotplug code. Next are the freezer updates from Colin Cross that should make the freezing of tasks a bit less heavy weight. We also have a couple of regression fixes, a number of fixes for issues that have not been identified as regressions, two new drivers and a bunch of cleanups all over. Highlights: - Hotplug changes to support graceful hot-removal failures. It sometimes is necessary to fail device hot-removal operations gracefully if they cannot be carried out completely. For example, if memory from a memory module being hot-removed has been allocated for the kernel's own use and cannot be moved elsewhere, it's desirable to fail the hot-removal operation in a graceful way rather than to crash the kernel, but currenty a success or a kernel crash are the only possible outcomes of an attempted memory hot-removal. Needless to say, that is not a very attractive alternative and it had to be addressed. However, in order to make it work for memory, I first had to make it work for CPUs and for this purpose I needed to modify the ACPI processor driver. It's been split into two parts, a resident one handling the low-level initialization/cleanup and a modular one playing the actual driver's role (but it binds to the CPU system device objects rather than to the ACPI device objects representing processors). That's been sort of like a live brain surgery on a patient who's riding a bike. So this is a little scary, but since we found and fixed a couple of regressions it caused to happen during the early linux-next testing (a month ago), nobody has complained. As a bonus we remove some duplicated ACPI hotplug code, because the ACPI-based CPU hotplug is now going to use the common ACPI hotplug code. - Lighter weight freezing of tasks. These changes from Colin Cross and Mandeep Singh Baines are targeted at making the freezing of tasks a bit less heavy weight operation. They reduce the number of tasks woken up every time during the freezing, by using the observation that the freezer simply doesn't need to wake up some of them and wait for them all to call refrigerator(). The time needed for the freezer to decide to report a failure is reduced too. Also reintroduced is the check causing a lockdep warining to trigger when try_to_freeze() is called with locks held (which is generally unsafe and shouldn't happen). - cpufreq updates First off, a commit from Srivatsa S Bhat fixes a resume regression introduced during the 3.10 cycle causing some cpufreq sysfs attributes to return wrong values to user space after resume. The fix is kind of fresh, but also it's pretty obvious once Srivatsa has identified the root cause. Second, we have a new freqdomain_cpus sysfs attribute for the acpi-cpufreq driver to provide information previously available via related_cpus. From Lan Tianyu. Finally, we fix a number of issues, mostly related to the CPUFREQ_POSTCHANGE notifier and cpufreq Kconfig options and clean up some code. The majority of changes from Viresh Kumar with bits from Jacob Shin, Heiko Stübner, Xiaoguang Chen, Ezequiel Garcia, Arnd Bergmann, and Tang Yuantian. - ACPICA update A usual bunch of updates from the ACPICA upstream. During the 3.4 cycle we introduced support for ACPI 5 extended sleep registers, but they are only supposed to be used if the HW-reduced mode bit is set in the FADT flags and the code attempted to use them without checking that bit. That caused suspend/resume regressions to happen on some systems. Fix from Lv Zheng causes those registers to be used only if the HW-reduced mode bit is set. Apart from this some other ACPICA bugs are fixed and code cleanups are made by Bob Moore, Tomasz Nowicki, Lv Zheng, Chao Guan, and Zhang Rui. - cpuidle updates New driver for Xilinx Zynq processors is added by Michal Simek. Multidriver support simplification, addition of some missing kerneldoc comments and Kconfig-related fixes come from Daniel Lezcano. - ACPI power management updates Changes to make suspend/resume work correctly in Xen guests from Konrad Rzeszutek Wilk, sparse warning fix from Fengguang Wu and cleanups and fixes of the ACPI device power state selection routine. - ACPI documentation updates Some previously missing pieces of ACPI documentation are added by Lv Zheng and Aaron Lu (hopefully, that will help people to uderstand how the ACPI subsystem works) and one outdated doc is updated by Hanjun Guo. - Assorted ACPI updates We finally nailed down the IA-64 issue that was the reason for reverting commit 9f29ab11ddbf ("ACPI / scan: do not match drivers against objects having scan handlers"), so we can fix it and move the ACPI scan handler check added to the ACPI video driver back to the core. A mechanism for adding CMOS RTC address space handlers is introduced by Lan Tianyu to allow some EC-related breakage to be fixed on some systems. A spec-compliant implementation of acpi_os_get_timer() is added by Mika Westerberg. The evaluation of _STA is added to do_acpi_find_child() to avoid situations in which a pointer to a disabled device object is returned instead of an enabled one with the same _ADR value. From Jeff Wu. Intel BayTrail PCH (Platform Controller Hub) support is added to the ACPI driver for Intel Low-Power Subsystems (LPSS) and that driver is modified to work around a couple of known BIOS issues. Changes from Mika Westerberg and Heikki Krogerus. The EC driver is fixed by Vasiliy Kulikov to use get_user() and put_user() instead of dereferencing user space pointers blindly. Code cleanups are made by Bjorn Helgaas, Nicholas Mazzuca and Toshi Kani. - Assorted power management updates The "runtime idle" helper routine is changed to take the return values of the callbacks executed by it into account and to call rpm_suspend() if they return 0, which allows us to reduce the overall code bloat a bit (by dropping some code that's not necessary any more after that modification). The runtime PM documentation is updated by Alan Stern (to reflect the "runtime idle" behavior change). New trace points for PM QoS are added by Sahara (<keun-o.park@windriver.com>). PM QoS documentation is updated by Lan Tianyu. Code cleanups are made and minor issues are addressed by Bernie Thompson, Bjorn Helgaas, Julius Werner, and Shuah Khan. - devfreq updates New driver for the Exynos5-bus device from Abhilash Kesavan. Minor cleanups, fixes and MAINTAINERS update from MyungJoo Ham, Abhilash Kesavan, Paul Bolle, Rajagopal Venkat, and Wei Yongjun. - OMAP power management updates Adaptive Voltage Scaling (AVS) SmartReflex voltage control driver updates from Andrii Tseglytskyi and Nishanth Menon." * tag 'pm+acpi-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (162 commits) cpufreq: Fix cpufreq regression after suspend/resume ACPI / PM: Fix possible NULL pointer deref in acpi_pm_device_sleep_state() PM / Sleep: Warn about system time after resume with pm_trace cpufreq: don't leave stale policy pointer in cdbs->cur_policy acpi-cpufreq: Add new sysfs attribute freqdomain_cpus cpufreq: make sure frequency transitions are serialized ACPI: implement acpi_os_get_timer() according the spec ACPI / EC: Add HP Folio 13 to ec_dmi_table in order to skip DSDT scan ACPI: Add CMOS RTC Operation Region handler support ACPI / processor: Drop unused variable from processor_perflib.c cpufreq: tegra: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: s3c64xx: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: omap: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: imx6q: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: exynos: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: dbx500: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: davinci: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: arm-big-little: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: powernow-k8: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: pcc: call CPUFREQ_POSTCHANGE notfier in error cases ...
2013-07-03Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "In this update, Smack learns to love IPv6 and to mount a filesystem with a transmutable hierarchy (i.e. security labels are inherited from parent directory upon creation rather than creating process). The rest of the changes are maintenance" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (37 commits) tpm/tpm_i2c_infineon: Remove unused header file tpm: tpm_i2c_infinion: Don't modify i2c_client->driver evm: audit integrity metadata failures integrity: move integrity_audit_msg() evm: calculate HMAC after initializing posix acl on tmpfs maintainers: add Dmitry Kasatkin Smack: Fix the bug smackcipso can't set CIPSO correctly Smack: Fix possible NULL pointer dereference at smk_netlbl_mls() Smack: Add smkfstransmute mount option Smack: Improve access check performance Smack: Local IPv6 port based controls tpm: fix regression caused by section type conflict of tpm_dev_release() in ppc builds maintainers: Remove Kent from maintainers tpm: move TPM_DIGEST_SIZE defintion tpm_tis: missing platform_driver_unregister() on error in init_tis() security: clarify cap_inode_getsecctx description apparmor: no need to delay vfree() apparmor: fix fully qualified name parsing apparmor: fix setprocattr arg processing for onexec apparmor: localize getting the security context to a few macros ...
2013-07-03Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64 Pull ARM64 updates from Catalin Marinas: "Main features: - KVM and Xen ports to AArch64 - Hugetlbfs and transparent huge pages support for arm64 - Applied Micro X-Gene Kconfig entry and dts file - Cache flushing improvements For arm64 huge pages support, there are x86 changes moving part of arch/x86/mm/hugetlbpage.c into mm/hugetlb.c to be re-used by arm64" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (66 commits) arm64: Add initial DTS for APM X-Gene Storm SOC and APM Mustang board arm64: Add defines for APM ARMv8 implementation arm64: Enable APM X-Gene SOC family in the defconfig arm64: Add Kconfig option for APM X-Gene SOC family arm64/Makefile: provide vdso_install target ARM64: mm: THP support. ARM64: mm: Raise MAX_ORDER for 64KB pages and THP. ARM64: mm: HugeTLB support. ARM64: mm: Move PTE_PROT_NONE bit. ARM64: mm: Make PAGE_NONE pages read only and no-execute. ARM64: mm: Restore memblock limit when map_mem finished. mm: thp: Correct the HPAGE_PMD_ORDER check. x86: mm: Remove general hugetlb code from x86. mm: hugetlb: Copy general hugetlb code from x86 to mm. x86: mm: Remove x86 version of huge_pmd_share. mm: hugetlb: Copy huge_pmd_share from x86 to mm. arm64: KVM: document kernel object mappings in HYP arm64: KVM: MAINTAINERS update arm64: KVM: userspace API documentation arm64: KVM: enable initialization of a 32bit vcpu ...
2013-07-03Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second set of VFS changes from Al Viro: "Assorted f_pos race fixes, making do_splice_direct() safe to call with i_mutex on parent, O_TMPFILE support, Jeff's locks.c series, ->d_hash/->d_compare calling conventions changes from Linus, misc stuff all over the place." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) Document ->tmpfile() ext4: ->tmpfile() support vfs: export lseek_execute() to modules lseek_execute() doesn't need an inode passed to it block_dev: switch to fixed_size_llseek() cpqphp_sysfs: switch to fixed_size_llseek() tile-srom: switch to fixed_size_llseek() proc_powerpc: switch to fixed_size_llseek() ubi/cdev: switch to fixed_size_llseek() pci/proc: switch to fixed_size_llseek() isapnp: switch to fixed_size_llseek() lpfc: switch to fixed_size_llseek() locks: give the blocked_hash its own spinlock locks: add a new "lm_owner_key" lock operation locks: turn the blocked_list into a hashtable locks: convert fl_link to a hlist_node locks: avoid taking global lock if possible when waking up blocked waiters locks: protect most of the file_lock handling with i_lock locks: encapsulate the fl_link list handling locks: make "added" in __posix_lock_file a bool ...
2013-07-03vfs: export lseek_execute() to modulesJie Liu
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support SEEK_DATA/SEEK_HOLE functions, we end up handling the similar matter in lseek_execute() to update the current file offset to the desired offset if it is valid, ceph also does the simliar things at ceph_llseek(). To reduce the duplications, this patch make lseek_execute() public accessible so that we can call it directly from the underlying file systems. Thanks Dave Chinner for this suggestion. [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back] v2->v1: - Add kernel-doc comments for lseek_execute() - Call lseek_execute() in ceph->llseek() Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Josef Bacik <jbacik@fusionio.com> Cc: Ben Myers <bpm@sgi.com> Cc: Ted Tso <tytso@mit.edu> Cc: Hugh Dickins <hughd@google.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Sage Weil <sage@inktank.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-02Merge branch 'sched-mm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull voluntary preemption fixes from Ingo Molnar: "This tree contains a speedup which is achieved through better might_sleep()/might_fault() preemption point annotations for uaccess functions, by Michael S Tsirkin: 1. The only reason uaccess routines might sleep is if they fault. Make this explicit for all architectures. 2. A voluntary preemption point in uaccess functions means compiler can't inline them efficiently, this breaks assumptions that they are very fast and small that e.g. net code seems to make. Remove this preemption point so behaviour matches with what callers assume. 3. Accesses (e.g through socket ops) to kernel memory with KERNEL_DS like net/sunrpc does will never sleep. Remove an unconditinal might_sleep() in the might_fault() inline in kernel.h (used when PROVE_LOCKING is not set). 4. Accesses with pagefault_disable() return EFAULT but won't cause caller to sleep. Check for that and thus avoid might_sleep() when PROVE_LOCKING is set. These changes offer a nice speedup for CONFIG_PREEMPT_VOLUNTARY=y kernels, here's a network bandwidth measurement between a virtual machine and the host: before: incoming: 7122.77 Mb/s outgoing: 8480.37 Mb/s after: incoming: 8619.24 Mb/s [ +21.0% ] outgoing: 9455.42 Mb/s [ +11.5% ] I kept these changes in a separate tree, separate from scheduler changes, because it's a mixed MM and scheduler topic" * 'sched-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mm, sched: Allow uaccess in atomic with pagefault_disable() mm, sched: Drop voluntary schedule from might_fault() x86: uaccess s/might_sleep/might_fault/ tile: uaccess s/might_sleep/might_fault/ powerpc: uaccess s/might_sleep/might_fault/ mn10300: uaccess s/might_sleep/might_fault/ microblaze: uaccess s/might_sleep/might_fault/ m32r: uaccess s/might_sleep/might_fault/ frv: uaccess s/might_sleep/might_fault/ arm64: uaccess s/might_sleep/might_fault/ asm-generic: uaccess s/might_sleep/might_fault/
2013-07-02Merge branch 'core-locking-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking changes from Ingo Molnar: "Four miscellanous standalone fixes for futexes, rtmutexes and Kconfig.locks." * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Use freezable blocking call futex: Take hugepages into account when generating futex_key rtmutex: Document rt_mutex_adjust_prio_chain() locking: Fix copy/paste errors of "ARCH_INLINE_*_UNLOCK_BH"
2013-07-02Merge tag 'driver-core-3.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here's the big driver core merge for 3.11-rc1 Lots of little things, and larger firmware subsystem updates, all described in the shortlog. Nice thing here is that we finally get rid of CONFIG_HOTPLUG, after 10+ years, thanks to Stephen Rohtwell (it had been always on for a number of kernel releases, now it's just removed)" * tag 'driver-core-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (27 commits) driver core: device.h: fix doc compilation warnings firmware loader: fix another compile warning with PM_SLEEP unset build some drivers only when compile-testing firmware loader: fix compile warning with PM_SLEEP set kobject: sanitize argument for format string sysfs_notify is only possible on file attributes firmware loader: simplify holding module for request_firmware firmware loader: don't export cache_firmware and uncache_firmware drivers/base: Use attribute groups to create sysfs memory files firmware loader: fix compile warning firmware loader: fix build failure with !CONFIG_FW_LOADER_USER_HELPER Documentation: Updated broken link in HOWTO Finally eradicate CONFIG_HOTPLUG driver core: firmware loader: kill FW_ACTION_NOHOTPLUG requests before suspend driver core: firmware loader: don't cache FW_ACTION_NOHOTPLUG firmware Documentation: Tidy up some drivers/base/core.c kerneldoc content. platform_device: use a macro instead of platform_driver_register firmware: move EXPORT_SYMBOL annotations firmware: Avoid deadlock of usermodehelper lock at shutdown dell_rbu: Select CONFIG_FW_LOADER_USER_HELPER explicitly ...
2013-07-02Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 update from Ted Ts'o: "Lots of bug fixes, cleanups and optimizations. In the bug fixes category, of note is a fix for on-line resizing file systems where the block size is smaller than the page size (i.e., file systems 1k blocks on x86, or more interestingly file systems with 4k blocks on Power or ia64 systems.) In the cleanup category, the ext4's punch hole implementation was significantly improved by Lukas Czerner, and now supports bigalloc file systems. In addition, Jan Kara significantly cleaned up the write submission code path. We also improved error checking and added a few sanity checks. In the optimizations category, two major optimizations deserve mention. The first is that ext4_writepages() is now used for nodelalloc and ext3 compatibility mode. This allows writes to be submitted much more efficiently as a single bio request, instead of being sent as individual 4k writes into the block layer (which then relied on the elevator code to coalesce the requests in the block queue). Secondly, the extent cache shrink mechanism, which was introduce in 3.9, no longer has a scalability bottleneck caused by the i_es_lru spinlock. Other optimizations include some changes to reduce CPU usage and to avoid issuing empty commits unnecessarily." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits) ext4: optimize starting extent in ext4_ext_rm_leaf() jbd2: invalidate handle if jbd2_journal_restart() fails ext4: translate flag bits to strings in tracepoints ext4: fix up error handling for mpage_map_and_submit_extent() jbd2: fix theoretical race in jbd2__journal_restart ext4: only zero partial blocks in ext4_zero_partial_blocks() ext4: check error return from ext4_write_inline_data_end() ext4: delete unnecessary C statements ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree() jbd2: move superblock checksum calculation to jbd2_write_superblock() ext4: pass inode pointer instead of file pointer to punch hole ext4: improve free space calculation for inline_data ext4: reduce object size when !CONFIG_PRINTK ext4: improve extent cache shrink mechanism to avoid to burn CPU time ext4: implement error handling of ext4_mb_new_preallocation() ext4: fix corruption when online resizing a fs with 1K block size ext4: delete unused variables ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents jbd2: remove debug dependency on debug_fs and update Kconfig help text jbd2: use a single printk for jbd_debug() ...
2013-07-02mm/cma: Move dma contiguous changes into a seperate configAneesh Kumar K.V
We want to use CMA for allocating hash page table and real mode area for PPC64. Hence move DMA contiguous related changes into a seperate config so that ppc64 can enable CMA without requiring DMA contiguous. Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [removed defconfig changes] Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
2013-07-01Merge tag 'v3.10' into nextBenjamin Herrenschmidt
Merge 3.10 in order to get some of the last minute powerpc changes, resolve conflicts and add additional fixes on top of them.
2013-06-29[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK ↵Al Viro
now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-28treewide: relase -> releaseGeert Uytterhoeven
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-06-28Merge branch 'acpi-hotplug'Rafael J. Wysocki
* acpi-hotplug: ACPI: Do not use CONFIG_ACPI_HOTPLUG_MEMORY_MODULE ACPI / cpufreq: Add ACPI processor device IDs to acpi-cpufreq Memory hotplug: Move alternative function definitions to header ACPI / processor: Fix potential NULL pointer dereference in acpi_processor_add() Memory hotplug / ACPI: Simplify memory removal ACPI / scan: Add second pass of companion offlining to hot-remove code Driver core / MM: Drop offline_memory_block() ACPI / processor: Pass processor object handle to acpi_bind_one() ACPI: Drop removal_type field from struct acpi_device Driver core / memory: Simplify __memory_block_change_state() ACPI / processor: Initialize per_cpu(processors, pr->id) properly CPU: Fix sysfs cpu/online of offlined CPUs Driver core: Introduce offline/online callbacks for memory blocks ACPI / memhotplug: Bind removable memory blocks to ACPI device nodes ACPI / processor: Use common hotplug infrastructure ACPI / hotplug: Use device offline/online for graceful hot-removal Driver core: Use generic offline/online for CPU offline/online Driver core: Add offline/online device operations
2013-06-25futex: Take hugepages into account when generating futex_keyZhang Yi
The futex_keys of process shared futexes are generated from the page offset, the mapping host and the mapping index of the futex user space address. This should result in an unique identifier for each futex. Though this is not true when futexes are located in different subpages of an hugepage. The reason is, that the mapping index for all those futexes evaluates to the index of the base page of the hugetlbfs mapping. So a futex at offset 0 of the hugepage mapping and another one at offset PAGE_SIZE of the same hugepage mapping have identical futex_keys. This happens because the futex code blindly uses page->index. Steps to reproduce the bug: 1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0 and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs mapping. The mutexes must be initialized as PTHREAD_PROCESS_SHARED because PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as their keys solely depend on the user space address. 2. Lock mutex1 and mutex2 3. Create thread1 and in the thread function lock mutex1, which results in thread1 blocking on the locked mutex1. 4. Create thread2 and in the thread function lock mutex2, which results in thread2 blocking on the locked mutex2. 5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2 still blocks on mutex2 because the futex_key points to mutex1. To solve this issue we need to take the normal page index of the page which contains the futex into account, if the futex is in an hugetlbfs mapping. In other words, we calculate the normal page mapping index of the subpage in the hugetlbfs mapping. Mappings which are not based on hugetlbfs are not affected and still use page->index. Thanks to Mel Gorman who provided a patch for adding proper evaluation functions to the hugetlbfs code to avoid exposing hugetlbfs specific details to the futex code. [ tglx: Massaged changelog ] Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn> Reviewed-by: 'Mel Gorman' <mgorman@suse.de> Acked-by: 'Darren Hart' <dvhart@linux.intel.com> Cc: 'Peter Zijlstra' <peterz@infradead.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-24Merge 3.10-rc7 into driver-core-nextGreg Kroah-Hartman
We want the firmware merge fixes, and other bits, in here now. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-20evm: calculate HMAC after initializing posix acl on tmpfsMimi Zohar
Included in the EVM hmac calculation is the i_mode. Any changes to the i_mode need to be reflected in the hmac. shmem_mknod() currently calls generic_acl_init(), which modifies the i_mode, after calling security_inode_init_security(). This patch reverses the order in which they are called. Reported-by: Sven Vermeulen <sven.vermeulen@siphos.be> Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Hugh Dickins <hughd@google.com>
2013-06-20mm/THP: deposit the transpare huge pgtable before set_pmdAneesh Kumar K.V
Architectures like powerpc use the deposited pgtable to store hash index values. We need to make the deposted pgtable is visible to other cpus before we are ready to take a hash fault. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-20mm/THP: withdraw the pgtable after pmdp related operationsAneesh Kumar K.V
For architectures like ppc64 we look at deposited pgtable when calling pmdp_get_and_clear. So do the pgtable_trans_huge_withdraw after finishing pmdp related operations. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-20mm/THP: add pmd args to pgtable deposit and withdraw APIsAneesh Kumar K.V
This will be later used by powerpc THP support. In powerpc we want to use pgtable for storing the hash index values. So instead of adding them to mm_context list, we would like to store them in the second half of pmd Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-20mm/thp: use the correct function when updating access flagsAneesh Kumar K.V
We should use pmdp_set_access_flags to update access flags. Archs like powerpc use extra checks(_PAGE_BUSY) when updating a hugepage PTE. A set_pmd_at doesn't do those checks. We should use set_pmd_at only when updating a none hugepage PTE. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com>a Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-18Merge branch 'slab/urgent' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB fix from Pekka Enberg: "A slab regression fix by Sasha Levin" * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: slab: prevent warnings when allocating with __GFP_NOWARN
2013-06-17Merge 3.10-rc6 into driver-core-nextGreg Kroah-Hartman
We want these fixes here too. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-14mm: hugetlb: Copy general hugetlb code from x86 to mm.Steve Capper
The huge_pte_alloc, huge_pte_offset and follow_huge_p[mu]d functions in x86/mm/hugetlbpage.c do not rely on any architecture specific knowledge other than the fact that pmds and puds can be treated as huge ptes. To allow other architectures to use this code (and reduce the need for code duplication), this patch copies these functions into mm, replaces the use of pud_large with pud_huge and provides a config flag to activate them: CONFIG_ARCH_WANT_GENERAL_HUGETLB If CONFIG_ARCH_WANT_HUGE_PMD_SHARE is also active then the huge_pmd_share code will be called by huge_pte_alloc (othewise we call pmd_alloc and skip the sharing code). Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2013-06-14mm: hugetlb: Copy huge_pmd_share from x86 to mm.Steve Capper
Under x86, multiple puds can be made to reference the same bank of huge pmds provided that they represent a full PUD_SIZE of shared huge memory that is aligned to a PUD_SIZE boundary. The code to share pmds does not require any architecture specific knowledge other than the fact that pmds can be indexed, thus can be beneficial to some other architectures. This patch copies the huge pmd sharing (and unsharing) logic from x86/ to mm/ and introduces a new config option to activate it: CONFIG_ARCH_WANTS_HUGE_PMD_SHARE Signed-off-by: Steve Capper <steve.capper@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2013-06-13slab: prevent warnings when allocating with __GFP_NOWARNSasha Levin
Sasha Levin noticed that the warning introduced by commit 6286ae9 ("slab: Return NULL for oversized allocations) is being triggered: WARNING: CPU: 15 PID: 21519 at mm/slab_common.c:376 kmalloc_slab+0x2f/0xb0() can: request_module (can-proto-4) failed. mpoa: proc_mpc_write: could not parse '' Modules linked in: CPU: 15 PID: 21519 Comm: trinity-child15 Tainted: G W 3.10.0-rc4-next-20130607-sasha-00011-gcd78395-dirty #2 0000000000000009 ffff880020a95e30 ffffffff83ff4041 0000000000000000 ffff880020a95e68 ffffffff8111fe12 fffffffffffffff0 00000000000082d0 0000000000080000 0000000000080000 0000000001400000 ffff880020a95e78 Call Trace: [<ffffffff83ff4041>] dump_stack+0x4e/0x82 [<ffffffff8111fe12>] warn_slowpath_common+0x82/0xb0 [<ffffffff8111fe55>] warn_slowpath_null+0x15/0x20 [<ffffffff81243dcf>] kmalloc_slab+0x2f/0xb0 [<ffffffff81278d54>] __kmalloc+0x24/0x4b0 [<ffffffff8196ffe3>] ? security_capable+0x13/0x20 [<ffffffff812a26b7>] ? pipe_fcntl+0x107/0x210 [<ffffffff812a26b7>] pipe_fcntl+0x107/0x210 [<ffffffff812b7ea0>] ? fget_raw_light+0x130/0x3f0 [<ffffffff812aa5fb>] SyS_fcntl+0x60b/0x6a0 [<ffffffff8403ca98>] tracesys+0xe1/0xe6 Andrew Morton writes: __GFP_NOWARN is frequently used by kernel code to probe for "how big an allocation can I get". That's a bit lame, but it's used on slow paths and is pretty simple. However, SLAB would still spew a warning when a big allocation happens if the __GFP_NOWARN flag is _not_ set to expose kernel bugs. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> [ penberg@kernel.org: improve changelog ] Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-06-12mm: memcontrol: fix lockless reclaim hierarchy iteratorJohannes Weiner
The lockless reclaim hierarchy iterator currently has a misplaced barrier that can lead to use-after-free crashes. The reclaim hierarchy iterator consist of a sequence count and a position pointer that are read and written locklessly, with memory barriers enforcing ordering. The write side sets the position pointer first, then updates the sequence count to "publish" the new position. Likewise, the read side must read the sequence count first, then the position. If the sequence count is up to date, it's guaranteed that the position is up to date as well: writer: reader: iter->position = position if iter->sequence == expected: smp_wmb() smp_rmb() iter->sequence = sequence position = iter->position However, the read side barrier is currently misplaced, which can lead to dereferencing stale position pointers that no longer point to valid memory. Fix this. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Glauber Costa <glommer@parallels.com> Cc: <stable@kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12frontswap: fix incorrect zeroing and allocation size for frontswap_mapAkinobu Mita
The bitmap accessed by bitops must have enough size to hold the required numbers of bits rounded up to a multiple of BITS_PER_LONG. And the bitmap must not be zeroed by memset() if the number of bits cleared is not a multiple of BITS_PER_LONG. This fixes incorrect zeroing and allocation size for frontswap_map. The incorrect zeroing part doesn't cause any problem because frontswap_map is freed just after zeroing. But the wrongly calculated allocation size may cause the problem. For 32bit systems, the allocation size of frontswap_map is about twice as large as required size. For 64bit systems, the allocation size is smaller than requeired if the number of bits is not a multiple of BITS_PER_LONG. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12mm: migration: add migrate_entry_wait_huge()Naoya Horiguchi
When we have a page fault for the address which is backed by a hugepage under migration, the kernel can't wait correctly and do busy looping on hugepage fault until the migration finishes. As a result, users who try to kick hugepage migration (via soft offlining, for example) occasionally experience long delay or soft lockup. This is because pte_offset_map_lock() can't get a correct migration entry or a correct page table lock for hugepage. This patch introduces migration_entry_wait_huge() to solve this. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@vger.kernel.org> [2.6.35+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12mm/page_alloc.c: fix watermark check in __zone_watermark_ok()Tomasz Stanislawski
The watermark check consists of two sub-checks. The first one is: if (free_pages <= min + lowmem_reserve) return false; The check assures that there is minimal amount of RAM in the zone. If CMA is used then the free_pages is reduced by the number of free pages in CMA prior to the over-mentioned check. if (!(alloc_flags & ALLOC_CMA)) free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES); This prevents the zone from being drained from pages available for non-movable allocations. The second check prevents the zone from getting too fragmented. for (o = 0; o < order; o++) { free_pages -= z->free_area[o].nr_free << o; min >>= 1; if (free_pages <= min) return false; } The field z->free_area[o].nr_free is equal to the number of free pages including free CMA pages. Therefore the CMA pages are subtracted twice. This may cause a false positive fail of __zone_watermark_ok() if the CMA area gets strongly fragmented. In such a case there are many 0-order free pages located in CMA. Those pages are subtracted twice therefore they will quickly drain free_pages during the check against fragmentation. The test fails even though there are many free non-cma pages in the zone. This patch fixes this issue by subtracting CMA pages only for a purpose of (free_pages <= min + lowmem_reserve) check. Laura said: We were observing allocation failures of higher order pages (order 5 = 128K typically) under tight memory conditions resulting in driver failure. The output from the page allocation failure showed plenty of free pages of the appropriate order/type/zone and mostly CMA pages in the lower orders. For full disclosure, we still observed some page allocation failures even after applying the patch but the number was drastically reduced and those failures were attributed to fragmentation/other system issues. Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Tested-by: Laura Abbott <lauraa@codeaurora.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: <stable@vger.kernel.org> [3.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12swap: avoid read_swap_cache_async() race to deadlock while waiting on ↵Rafael Aquini
discard I/O completion read_swap_cache_async() can race against get_swap_page(), and stumble across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought into the swapcache yet. This transient swap_map state is expected to be transitory, but the actual placement of discard at scan_swap_map() inserts a wait for I/O completion thus making the thread at read_swap_cache_async() to loop around its -EEXIST case, while the other end at get_swap_page() is scheduled away at scan_swap_map(). This can leave the system deadlocked if the I/O completion happens to be waiting on the CPU waitqueue where read_swap_cache_async() is busy looping and !CONFIG_PREEMPT. This patch introduces a cond_resched() call to make the aforementioned read_swap_cache_async() busy loop condition to bail out when necessary, thus avoiding the subtle race window. Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12memcg: don't initialize kmem-cache destroying work for root cachesAndrey Vagin
struct memcg_cache_params has a union. Different parts of this union are used for root and non-root caches. A part with destroying work is used only for non-root caches. BUG: unable to handle kernel paging request at 0000000fffffffe0 IP: kmem_cache_alloc+0x41/0x1f0 Modules linked in: netlink_diag af_packet_diag udp_diag tcp_diag inet_diag unix_diag ip6table_filter ip6_tables i2c_piix4 virtio_net virtio_balloon microcode i2c_core pcspkr floppy CPU: 0 PID: 1929 Comm: lt-vzctl Tainted: G D 3.10.0-rc1+ #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 RIP: kmem_cache_alloc+0x41/0x1f0 Call Trace: getname_flags.part.34+0x30/0x140 getname+0x38/0x60 do_sys_open+0xc5/0x1e0 SyS_open+0x22/0x30 system_call_fastpath+0x16/0x1b Code: f4 53 48 83 ec 18 8b 05 8e 53 b7 00 4c 8b 4d 08 21 f0 a8 10 74 0d 4c 89 4d c0 e8 1b 76 4a 00 4c 8b 4d c0 e9 92 00 00 00 4d 89 f5 <4d> 8b 45 00 65 4c 03 04 25 48 cd 00 00 49 8b 50 08 4d 8b 38 49 RIP [<ffffffff8116b641>] kmem_cache_alloc+0x41/0x1f0 Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Glauber Costa <glommer@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Li Zefan <lizefan@huawei.com> Cc: <stable@vger.kernel.org> [3.9.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-08mm, slab: moved kmem_cache_alloc_node comment to correct placeZhouping Liu
After several fixing about kmem_cache_alloc_node(), its comment was splitted. This patch moved it on top of kmem_cache_alloc_node() definition. Signed-off-by: Zhouping Liu <zliu@redhat.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-06-06arch, mm: Remove tlb_fast_mode()Peter Zijlstra
Since the introduction of preemptible mmu_gather TLB fast mode has been broken. TLB fast mode relies on there being absolutely no concurrency; it frees pages first and invalidates TLBs later. However now we can get concurrency and stuff goes *bang*. This patch removes all tlb_fast_mode() code; it was found the better option vs trying to patch the hole by entangling tlb invalidation with the scheduler. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Reported-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-03Finally eradicate CONFIG_HOTPLUGStephen Rothwell
Ever since commit 45f035ab9b8f ("CONFIG_HOTPLUG should be always on"), it has been basically impossible to build a kernel with CONFIG_HOTPLUG turned off. Remove all the remaining references to it. Cc: Russell King <linux@arm.linux.org.uk> Cc: Doug Thompson <dougthompson@xmission.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-01Memory hotplug: Move alternative function definitions to headerRafael J. Wysocki
Move the definitions of offline_pages() and remove_memory() for CONFIG_MEMORY_HOTREMOVE to memory_hotplug.h, where they belong, and make them static inline. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-06-01Memory hotplug / ACPI: Simplify memory removalRafael J. Wysocki
Now that the memory offlining should be taken care of by the companion device offlining code in acpi_scan_hot_remove(), the ACPI memory hotplug driver doesn't need to offline it in remove_memory() any more. Moreover, since the return value of remove_memory() is not used, it's better to make it be a void function and trigger a BUG() if the memory scheduled for removal is not offline. Change the code in accordance with the above observations. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Toshi Kani <toshi.kani@hp.com>