summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2014-01-23kexec: add sysctl to disable kexec_loadKees Cook
For general-purpose (i.e. distro) kernel builds it makes sense to build with CONFIG_KEXEC to allow end users to choose what kind of things they want to do with kexec. However, in the face of trying to lock down a system with such a kernel, there needs to be a way to disable kexec_load (much like module loading can be disabled). Without this, it is too easy for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM and modules_disabled are set. With this change, it is still possible to load an image for use later, then disable kexec_load so the image (or lack of image) can't be altered. The intention is for using this in environments where "perfect" enforcement is hard. Without a verified boot, along with verified modules, and along with verified kexec, this is trying to give a system a better chance to defend itself (or at least grow the window of discoverability) against attack in the face of a privilege escalation. In my mind, I consider several boot scenarios: 1) Verified boot of read-only verified root fs loading fd-based verification of kexec images. 2) Secure boot of writable root fs loading signed kexec images. 3) Regular boot loading kexec (e.g. kcrash) image early and locking it. 4) Regular boot with no control of kexec image at all. 1 and 2 don't exist yet, but will soon once the verified kexec series has landed. 4 is the state of things now. The gap between 2 and 4 is too large, so this change creates scenario 3, a middle-ground above 4 when 2 and 1 are not possible for a system. Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23exec: kill task_struct->did_execOleg Nesterov
We can kill either task->did_exec or PF_FORKNOEXEC, they are mutually exclusive. The patch kills ->did_exec because it has a single user. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23kernel/fork.c: make dup_mm() staticDaeSeok Youn
dup_mm() is used only in kernel/fork.c Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23proc: cleanup/simplify get_task_state/task_state_arrayOleg Nesterov
get_task_state() and task_state_array[] look confusing and suboptimal, it is not clear what it can actually report to user-space and task_state_array[] blows .data for no reason. 1. state = (tsk->state & TASK_REPORT) | tsk->exit_state is not clear. TASK_REPORT is self-documenting but it is not clear what ->exit_state can add. Move the potential exit_state's (EXIT_ZOMBIE and EXIT_DEAD) into TASK_REPORT and use it to calculate the final result. 2. With the change above it is obvious that task_state_array[] has the unused entries just to make BUILD_BUG_ON() happy. Change this BUILD_BUG_ON() to use TASK_REPORT rather than TASK_STATE_MAX and shrink task_state_array[]. 3. Turn the "while (state)" loop into fls(state). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23coredump: make __get_dumpable/get_dumpable inline, kill fs/coredump.hOleg Nesterov
1. Remove fs/coredump.h. It is not clear why do we need it, it only declares __get_dumpable(), signal.c includes it for no reason. 2. Now that get_dumpable() and __get_dumpable() are really trivial make them inline in linux/sched.h. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alex Kelly <alex.page.kelly@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23coredump: kill MMF_DUMPABLE and MMF_DUMP_SECURELYOleg Nesterov
Nobody actually needs MMF_DUMPABLE/MMF_DUMP_SECURELY, they are only used to enforce the encoding of SUID_DUMP_* enum in mm->flags & MMF_DUMPABLE_MASK. Now that set_dumpable() updates both bits atomically we can kill them and simply store the value "as is" in 2 lower bits. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alex Kelly <alex.page.kelly@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/ramfs/file-nommu.c: make ramfs_nommu_get_unmapped_area() and ↵Axel Lin
ramfs_nommu_mmap() static Since commit 853ac43ab194 ("shmem: unify regular and tiny shmem"), ramfs_nommu_get_unmapped_area() and ramfs_nommu_mmap() are not directly referenced outside of file-nommu.c. Thus make them static. Signed-off-by: Axel Lin <axel.lin@ingics.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23printk/cache: mark printk_once test variable __read_mostlyJoe Perches
Add #include <linux/cache.h> to define __read_mostly. Convert cache.h to use uapi/linux/kernel.h instead of linux/kernel.h to avoid recursive #includes. Convert the ALIGN macro to __ALIGN_KERNEL. printk_once only sets the bool variable tested once so mark it __read_mostly. Neaten the alignment so it matches the rest of the pr_<level>_once #defines too. Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: James Hogan <james.hogan@imgtec.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23lib/parser.c: add match_wildcard() functionDu, Changbin
match_wildcard function is a simple implementation of wildcard matching algorithm. It only supports two usual wildcardes: '*' - matches zero or more characters '?' - matches one character This algorithm is safe since it is non-recursive. Signed-off-by: Du, Changbin <changbin.du@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23uapi: convert u64 to __u64 in exported headersMike Frysinger
The u64 type is not defined in any exported kernel headers, so trying to use it will lead to build failures. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23include/uapi/linux/dn.h: pull in ioctl.h headerMike Frysinger
This header uses _IOW/_IOR defines but doesn't include ioctl.h for it. If you try to use this w/out including ioctl.h yourself, it can fail to build, so add the explicit include. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23include/uapi/linux/ppp-ioctl.h: pull in ppp_defs.hMike Frysinger
This header uses enum NPmode but doesn't include ppp_defs.h. If you try to use this header w/out including the defs header first, it leads to a build failure. So add the explicit include to fix it. Don't know of any packages directly impacted, but noticed while building some ppp code by hand. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Paul Mackerras <paulus@samba.org> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23include/linux/of.h: make for_each_child_of_node() reference its args when ↵David Howells
CONFIG_OF=n Make for_each_child_of_node() reference its args when CONFIG_OF=n to avoid warnings like: drivers/leds/leds-pwm.c:88:22: warning: unused variable 'node' [-Wunused-variable] struct device_node *node = pdev->dev.of_node; ^ Signed-off-by: David Howells <dhowells@redhat.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <robh+dt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23remove extra definitions of U32_MAXAlex Elder
Now that the definition is centralized in <linux/kernel.h>, the definitions of U32_MAX (and related) elsewhere in the kernel can be removed. Signed-off-by: Alex Elder <elder@linaro.org> Acked-by: Sage Weil <sage@inktank.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23kernel.h: define u8, s8, u32, etc. limitsAlex Elder
Create constants that define the maximum and minimum values representable by the kernel types u8, s8, u16, s16, and so on. Signed-off-by: Alex Elder <elder@linaro.org> Cc: Sage Weil <sage@inktank.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23conditionally define U32_MAXAlex Elder
The symbol U32_MAX is defined in several spots. Change these definitions to be conditional. This is in preparation for the next patch, which centralizes the definition in <linux/kernel.h>. Signed-off-by: Alex Elder <elder@linaro.org> Cc: Sage Weil <sage@inktank.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23add generic fixmap.hMark Salter
Many architectures provide an asm/fixmap.h which defines support for compile-time 'special' virtual mappings which need to be made before paging_init() has run. This support is also used for early ioremap on x86. Much of this support is identical across the architectures. This patch consolidates all of the common bits into asm-generic/fixmap.h which is intended to be included from arch/*/include/asm/fixmap.h. Signed-off-by: Mark Salter <msalter@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: James Hogan <james.hogan@imgtec.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jonas Bonn <jonas.bonn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23asm/types.h: Remove include/asm-generic/int-l64.hGeert Uytterhoeven
Now all 64-bit architectures have been converted to int-ll64.h, we can remove int-l64.h in kernelspace. For backwards compatibility, alpha, ia64, mips64, and powerpc64 still use int-l64.h in userspace. This is the (reworked for UAPI) non-documentation part of more than two year old "asm/types.h: All architectures use int-ll64.h in kernelspace" (https://lkml.org/lkml/2011/8/13/104) Since <asm/types.h> (from include/uapi/asm-generic/types.h) is used for both kernel and user space, include/asm-generic/int-ll64.h cannot just become include/asm-generic/types.h, as Arnd suggested. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23include/linux/genalloc.h: spinlock_t needs spinlock_types.hShawn Guo
Compiling a C file which includes genalloc.h but without spinlock_types.h being included before, we will see the compile error below. include/linux/genalloc.h:54:2: error: unknown type name `spinlock_t' Include spinlock_types.h from genalloc.h to fix the problem. Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23numa: add a sysctl for numa_balancingAndi Kleen
Add a working sysctl to enable/disable automatic numa memory balancing at runtime. This allows us to track down performance problems with this feature and is generally a good idea. This was possible earlier through debugfs, but only with special debugging options set. Also fix the boot message. [akpm@linux-foundation.org: s/sched_numa_balancing/sysctl_numa_balancing/] Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm: free memblock.memory in free_all_bootmemPhilipp Hachtmann
When calling free_all_bootmem() the free areas under memblock's control are released to the buddy allocator. Additionally the reserved list is freed if it was reallocated by memblock. The same should apply for the memory list. Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23memcg, slab: RCU protect memcg_params for root cachesVladimir Davydov
We relocate root cache's memcg_params whenever we need to grow the memcg_caches array to accommodate all kmem-active memory cgroups. Currently on relocation we free the old version immediately, which can lead to use-after-free, because the memcg_caches array is accessed lock-free (see cache_from_memcg_idx()). This patch fixes this by making memcg_params RCU-protected for root caches. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23memcg, slab: clean up memcg cache initialization/destructionVladimir Davydov
Currently, we have rather a messy function set relating to per-memcg kmem cache initialization/destruction. Per-memcg caches are created in memcg_create_kmem_cache(). This function calls kmem_cache_create_memcg() to allocate and initialize a kmem cache and then "registers" the new cache in the memcg_params::memcg_caches array of the parent cache. During its work-flow, kmem_cache_create_memcg() executes the following memcg-related functions: - memcg_alloc_cache_params(), to initialize memcg_params of the newly created cache; - memcg_cache_list_add(), to add the new cache to the memcg_slab_caches list. On the other hand, kmem_cache_destroy() called on a cache destruction only calls memcg_release_cache(), which does all the work: it cleans the reference to the cache in its parent's memcg_params::memcg_caches, removes the cache from the memcg_slab_caches list, and frees memcg_params. Such an inconsistency between destruction and initialization paths make the code difficult to read, so let's clean this up a bit. This patch moves all the code relating to registration of per-memcg caches (adding to memcg list, setting the pointer to a cache from its parent) to the newly created memcg_register_cache() and memcg_unregister_cache() functions making the initialization and destruction paths look symmetrical. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23memcg, slab: kmem_cache_create_memcg(): fix memleak on fail pathVladimir Davydov
We do not free the cache's memcg_params if __kmem_cache_create fails. Fix this. Plus, rename memcg_register_cache() to memcg_alloc_cache_params(), because it actually does not register the cache anywhere, but simply initialize kmem_cache::memcg_params. [akpm@linux-foundation.org: fix build] Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGESasha Levin
Most of the VM_BUG_ON assertions are performed on a page. Usually, when one of these assertions fails we'll get a BUG_ON with a call stack and the registers. I've recently noticed based on the requests to add a small piece of code that dumps the page to various VM_BUG_ON sites that the page dump is quite useful to people debugging issues in mm. This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what VM_BUG_ON() does, also dumps the page before executing the actual BUG_ON. [akpm@linux-foundation.org: fix up includes] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm: print more details for bad_page()Dave Hansen
bad_page() is cool in that it prints out a bunch of data about the page. But, I can never remember which page flags are good and which are bad, or whether ->index or ->mapping is required to be NULL. This patch allows bad/dump_page() callers to specify a string about why they are dumping the page and adds explanation strings to a number of places. It also adds a 'bad_flags' argument to bad_page(), which it then dumps out separately from the flags which are actually set. This way, the messages will show specifically why the page was bad, *specifically* which flags it is complaining about, if it was a page flag combination which was the problem. [akpm@linux-foundation.org: switch to pr_alert] Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse update from Miklos Szeredi: "This contains a fix for a potential use-after-module-unload bug noticed by Al and caching improvements for read-only fuse filesystems by Andrew Gallagher" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: support clients that don't implement 'open' fuse: don't invalidate attrs when not using atime fuse: fix SetPageUptodate() condition in STORE fuse: fix pipe_buf_operations
2014-01-23Merge tag 'for-f2fs-3.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, a couple of sysfs entries were introduced to tune the f2fs at runtime. In addition, f2fs starts to support inline_data and improves the read/write performance in some workloads by refactoring bio-related flows. This patch-set includes the following major enhancement patches. - support inline_data - refactor bio operations such as merge operations and rw type assignment - enhance the direct IO path - enhance bio operations - truncate a node page when it becomes obsolete - add sysfs entries: small_discards, max_victim_search, and in-place-update - add a sysfs entry to control max_victim_search The other bug fixes are as follows. - fix a bug in truncate_partial_nodes - avoid warnings during sparse and build process - fix error handling flows - fix potential bit overflows And, there are a bunch of cleanups" * tag 'for-f2fs-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (95 commits) f2fs: drop obsolete node page when it is truncated f2fs: introduce NODE_MAPPING for code consistency f2fs: remove the orphan block page array f2fs: add help function META_MAPPING f2fs: move a branch for code redability f2fs: call mark_inode_dirty to flush dirty pages f2fs: clean checkpatch warnings f2fs: missing REQ_META and REQ_PRIO when sync_meta_pages(META_FLUSH) f2fs: avoid f2fs_balance_fs call during pageout f2fs: add delimiter to seperate name and value in debug phrase f2fs: use spinlock rather than mutex for better speed f2fs: move alloc new orphan node out of lock protection region f2fs: move grabing orphan pages out of protection region f2fs: remove the needless parameter of f2fs_wait_on_page_writeback f2fs: update documents and a MAINTAINERS entry f2fs: add a sysfs entry to control max_victim_search f2fs: improve write performance under frequent fsync calls f2fs: avoid to read inline data except first page f2fs: avoid to left uninitialized data in page when read inline data f2fs: fix truncate_partial_nodes bug ...
2014-01-22Merge tag 'modules-next-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module updates from Rusty Russell. * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: module: Add missing newline in printk call. module: fix coding style export: declare ksymtab symbols module.h: Remove unnecessary semicolon params: improve standard definitions Add Documentation/module-signing.txt file
2014-01-22Merge tag 'stable/for-linus-3.14-rc0-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull Xen updates from Konrad Rzeszutek Wilk: "Two major features that Xen community is excited about: The first is event channel scalability by David Vrabel - we switch over from an two-level per-cpu bitmap of events (IRQs) - to an FIFO queue with priorities. This lets us be able to handle more events, have lower latency, and better scalability. Good stuff. The other is PVH by Mukesh Rathor. In short, PV is a mode where the kernel lets the hypervisor program page-tables, segments, etc. With EPT/NPT capabilities in current processors, the overhead of doing this in an HVM (Hardware Virtual Machine) container is much lower than the hypervisor doing it for us. In short we let a PV guest run without doing page-table, segment, syscall, etc updates through the hypervisor - instead it is all done within the guest container. It is a "hybrid" PV - hence the 'PVH' name - a PV guest within an HVM container. The major benefits are less code to deal with - for example we only use one function from the the pv_mmu_ops (which has 39 function calls); faster performance for syscall (no context switches into the hypervisor); less traps on various operations; etc. It is still being baked - the ABI is not yet set in stone. But it is pretty awesome and we are excited about it. Lastly, there are some changes to ARM code - you should get a simple conflict which has been resolved in #linux-next. In short, this pull has awesome features. Features: - FIFO event channels. Key advantages: support for over 100,000 events (2^17), 16 different event priorities, improved fairness in event latency through the use of FIFOs. - Xen PVH support. "It’s a fully PV kernel mode, running with paravirtualized disk and network, paravirtualized interrupts and timers, no emulated devices of any kind (and thus no qemu), no BIOS or legacy boot — but instead of requiring PV MMU, it uses the HVM hardware extensions to virtualize the pagetables, as well as system calls and other privileged operations." (from "The Paravirtualization Spectrum, Part 2: From poles to a spectrum") Bug-fixes: - Fixes in balloon driver (refactor and make it work under ARM) - Allow xenfb to be used in HVM guests. - Allow xen_platform_pci=0 to work properly. - Refactors in event channels" * tag 'stable/for-linus-3.14-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (52 commits) xen/pvh: Set X86_CR0_WP and others in CR0 (v2) MAINTAINERS: add git repository for Xen xen/pvh: Use 'depend' instead of 'select'. xen: delete new instances of __cpuinit usage xen/fb: allow xenfb initialization for hvm guests xen/evtchn_fifo: fix error return code in evtchn_fifo_setup() xen-platform: fix error return code in platform_pci_init() xen/pvh: remove duplicated include from enlighten.c xen/pvh: Fix compile issues with xen_pvh_domain() xen: Use dev_is_pci() to check whether it is pci device xen/grant-table: Force to use v1 of grants. xen/pvh: Support ParaVirtualized Hardware extensions (v3). xen/pvh: Piggyback on PVHVM XenBus. xen/pvh: Piggyback on PVHVM for grant driver (v4) xen/grant: Implement an grant frame array struct (v3). xen/grant-table: Refactor gnttab_init xen/grants: Remove gnttab_max_grant_frames dependency on gnttab_init. xen/pvh: Piggyback on PVHVM for event channels (v2) xen/pvh: Update E820 to work with PVH (v2) xen/pvh: Secondary VCPU bringup (non-bootup CPUs) ...
2014-01-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "First round of KVM updates for 3.14; PPC parts will come next week. Nothing major here, just bugfixes all over the place. The most interesting part is the ARM guys' virtualized interrupt controller overhaul, which lets userspace get/set the state and thus enables migration of ARM VMs" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits) kvm: make KVM_MMU_AUDIT help text more readable KVM: s390: Fix memory access error detection KVM: nVMX: Update guest activity state field on L2 exits KVM: nVMX: Fix nested_run_pending on activity state HLT KVM: nVMX: Clean up handling of VMX-related MSRs KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit KVM: nVMX: Leave VMX mode on clearing of feature control MSR KVM: VMX: Fix DR6 update on #DB exception KVM: SVM: Fix reading of DR6 KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS add support for Hyper-V reference time counter KVM: remove useless write to vcpu->hv_clock.tsc_timestamp KVM: x86: fix tsc catchup issue with tsc scaling KVM: x86: limit PIT timer frequency KVM: x86: handle invalid root_hpa everywhere kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub kvm: vfio: silence GCC warning KVM: ARM: Remove duplicate include arm/arm64: KVM: relax the requirements of VMA alignment for THP ...
2014-01-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "Usual rocket science stuff from trivial.git" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) neighbour.h: fix comment sched: Fix warning on make htmldocs caused by wait.h slab: struct kmem_cache is protected by slab_mutex doc: Fix typo in USB Gadget Documentation of/Kconfig: Spelling s/one/once/ mkregtable: Fix sscanf handling lp5523, lp8501: comment improvements thermal: rcar: comment spelling treewide: fix comments and printk msgs IXP4xx: remove '1 &&' from a condition check in ixp4xx_restart() Documentation: update /proc/uptime field description Documentation: Fix size parameter for snprintf arm: fix comment header and macro name asm-generic: uaccess: Spelling s/a ny/any/ mtd: onenand: fix comment header doc: driver-model/platform.txt: fix a typo drivers: fix typo in DEVTMPFS_MOUNT Kconfig help text doc: Fix typo (acces_process_vm -> access_process_vm) treewide: Fix typos in printk drivers/gpu/drm/qxl/Kconfig: reformat the help text ...
2014-01-22Merge tag 'dm-3.14-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device-mapper changes from Mike Snitzer: "A lot of attention was paid to improving the thin-provisioning target's handling of metadata operation failures and running out of space. A new 'error_if_no_space' feature was added to allow users to error IOs rather than queue them when either the data or metadata space is exhausted. Additional fixes/features include: - a few fixes to properly support thin metadata device resizing - a solution for reliably waiting for a DM device's embedded kobject to be released before destroying the device - old dm-snapshot is updated to use the dm-bufio interface to take advantage of readahead capabilities that improve snapshot activation - new dm-cache target tunables to control how quickly data is promoted to the cache (fast) device - improved write efficiency of cluster mirror target by combining userspace flush and mark requests" * tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits) dm log userspace: allow mark requests to piggyback on flush requests dm space map metadata: fix bug in resizing of thin metadata dm cache: add policy name to status output dm thin: fix pool feature parsing dm sysfs: fix a module unload race dm snapshot: use dm-bufio prefetch dm snapshot: use dm-bufio dm snapshot: prepare for switch to using dm-bufio dm snapshot: use GFP_KERNEL when initializing exceptions dm cache: add block sizes and total cache blocks to status output dm btree: add dm_btree_find_lowest_key dm space map metadata: fix extending the space map dm space map common: make sure new space is used during extend dm: wait until embedded kobject is released before destroying a device dm: remove pointless kobject comparison in dm_get_from_kobject dm snapshot: call destroy_work_on_stack() to pair with INIT_WORK_ONSTACK() dm cache policy mq: introduce three promotion threshold tunables dm cache policy mq: use list_del_init instead of list_del + INIT_LIST_HEAD dm thin: fix set_pool_mode exposed pool operation races dm thin: eliminate the no_free_space flag ...
2014-01-22Merge tag 'scsi-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This patch set is a lot of driver updates for qla4xxx, bfa, hpsa, qla2xxx. It also removes the aic7xxx_old driver (which has been deprecated for nearly a decade) and adds support for deadlines in error handling" * tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (75 commits) [SCSI] hpsa: allow SCSI mid layer to handle unit attention [SCSI] hpsa: do not require board "not ready" status after hard reset [SCSI] hpsa: enable unit attention reporting [SCSI] hpsa: rename scsi prefetch field [SCSI] hpsa: use workqueue instead of kernel thread for lockup detection [SCSI] ipr: increase dump size in ipr driver [SCSI] mac_scsi: Fix crash on out of memory [SCSI] st: fix enlarge_buffer [SCSI] qla1280: Annotate timer on stack so object debug does not complain [SCSI] qla4xxx: Update driver version to 5.04.00-k3 [SCSI] qla4xxx: Recreate chap data list during get chap operation [SCSI] qla4xxx: Add support for ISCSI_PARAM_LOCAL_IPADDR sysfs attr [SCSI] libiscsi: Add local_ipaddr parameter in iscsi_conn struct [SCSI] scsi_transport_iscsi: Export ISCSI_PARAM_LOCAL_IPADDR attr for iscsi_connection [SCSI] qla4xxx: Add host statistics support [SCSI] scsi_transport_iscsi: Add host statistics support [SCSI] qla4xxx: Added support for Diagnostics MBOX command [SCSI] bfa: Driver version upgrade to 3.2.23.0 [SCSI] bfa: change FC_ELS_TOV to 20sec [SCSI] bfa: Observed auto D-port mode instead of manual ...
2014-01-22Merge tag 'pci-v3.14-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "PCI changes for the v3.14 merge window: Resource management - Change pci_bus_region addresses to dma_addr_t (Bjorn Helgaas) - Support 64-bit AGP BARs (Bjorn Helgaas, Yinghai Lu) - Add pci_bus_address() to get bus address of a BAR (Bjorn Helgaas) - Use pci_resource_start() for CPU address of AGP BARs (Bjorn Helgaas) - Enforce bus address limits in resource allocation (Yinghai Lu) - Allocate 64-bit BARs above 4G when possible (Yinghai Lu) - Convert pcibios_resource_to_bus() to take pci_bus, not pci_dev (Yinghai Lu) PCI device hotplug - Major rescan/remove locking update (Rafael J. Wysocki) - Make ioapic builtin only (not modular) (Yinghai Lu) - Fix release/free issues (Yinghai Lu) - Clean up pciehp (Bjorn Helgaas) - Announce pciehp slot info during enumeration (Bjorn Helgaas) MSI - Add pci_msi_vec_count(), pci_msix_vec_count() (Alexander Gordeev) - Add pci_enable_msi_range(), pci_enable_msix_range() (Alexander Gordeev) - Deprecate "tri-state" interfaces: fail/success/fail+info (Alexander Gordeev) - Export MSI mode using attributes, not kobjects (Greg Kroah-Hartman) - Drop "irq" param from *_restore_msi_irqs() (DuanZhenzhong) SR-IOV - Clear NumVFs when disabling SR-IOV in sriov_init() (ethan.zhao) Virtualization - Add support for save/restore of extended capabilities (Alex Williamson) - Add Virtual Channel to save/restore support (Alex Williamson) - Never treat a VF as a multifunction device (Alex Williamson) - Add pci_try_reset_function(), et al (Alex Williamson) AER - Ignore non-PCIe error sources (Betty Dall) - Support ACPI HEST error sources for domains other than 0 (Betty Dall) - Consolidate HEST error source parsers (Bjorn Helgaas) - Add a TLP header print helper (Borislav Petkov) Freescale i.MX6 - Remove unnecessary code (Fabio Estevam) - Make reset-gpio optional (Marek Vasut) - Report "link up" only after link training completes (Marek Vasut) - Start link in Gen1 before negotiating for Gen2 mode (Marek Vasut) - Fix PCIe startup code (Richard Zhu) Marvell MVEBU - Remove duplicate of_clk_get_by_name() call (Andrew Lunn) - Drop writes to bridge Secondary Status register (Jason Gunthorpe) - Obey bridge PCI_COMMAND_MEM and PCI_COMMAND_IO bits (Jason Gunthorpe) - Support a bridge with no IO port window (Jason Gunthorpe) - Use max_t() instead of max(resource_size_t,) (Jingoo Han) - Remove redundant of_match_ptr (Sachin Kamat) - Call pci_ioremap_io() at startup instead of dynamically (Thomas Petazzoni) NVIDIA Tegra - Disable Gen2 for Tegra20 and Tegra30 (Eric Brower) Renesas R-Car - Add runtime PM support (Valentine Barshak) - Fix rcar_pci_probe() return value check (Wei Yongjun) Synopsys DesignWare - Fix crash in dw_msi_teardown_irq() (Bjørn Erik Nilsen) - Remove redundant call to pci_write_config_word() (Bjørn Erik Nilsen) - Fix missing MSI IRQs (Harro Haan) - Add dw_pcie prefix before cfg_read/write (Pratyush Anand) - Fix I/O transfers by using CPU (not realio) address (Pratyush Anand) - Whitespace cleanup (Jingoo Han) EISA - Call put_device() if device_register() fails (Levente Kurusa) - Revert EISA initialization breakage ((Bjorn Helgaas) Miscellaneous - Remove unused code, including PCIe 3.0 interfaces (Stephen Hemminger) - Prevent bus conflicts while checking for bridge apertures (Bjorn Helgaas) - Stop clearing bridge Secondary Status when setting up I/O aperture (Bjorn Helgaas) - Use dev_is_pci() to identify PCI devices (Yijing Wang) - Deprecate DEFINE_PCI_DEVICE_TABLE (Joe Perches) - Update documentation 00-INDEX (Erik Ekman)" * tag 'pci-v3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (119 commits) Revert "EISA: Initialize device before its resources" Revert "EISA: Log device resources in dmesg" vfio-pci: Use pci "try" reset interface PCI: Check parent kobject in pci_destroy_dev() xen/pcifront: Use global PCI rescan-remove locking powerpc/eeh: Use global PCI rescan-remove locking PCI: Fix pci_check_and_unmask_intx() comment typos PCI: Add pci_try_reset_function(), pci_try_reset_slot(), pci_try_reset_bus() MPT / PCI: Use pci_stop_and_remove_bus_device_locked() platform / x86: Use global PCI rescan-remove locking PCI: hotplug: Use global PCI rescan-remove locking pcmcia: Use global PCI rescan-remove locking ACPI / hotplug / PCI: Use global PCI rescan-remove locking ACPI / PCI: Use global PCI rescan-remove locking in PCI root hotplug PCI: Add global pci_lock_rescan_remove() PCI: Cleanup pci.h whitespace PCI: Reorder so actual code comes before stubs PCI/AER: Support ACPI HEST AER error sources for PCI domains other than 0 ACPICA: Add helper macros to extract bus/segment numbers from HEST table. PCI: Make local functions static ...
2014-01-22Merge tag 'trace-3.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "This pull request has a new feature to ftrace, namely the trace event triggers by Tom Zanussi. A trigger is a way to enable an action when an event is hit. The actions are: o trace on/off - enable or disable tracing o snapshot - save the current trace buffer in the snapshot o stacktrace - dump the current stack trace to the ringbuffer o enable/disable events - enable or disable another event Namhyung Kim added updates to the tracing uprobes code. Having the uprobes add support for fetch methods. The rest are various bug fixes with the new code, and minor ones for the old code" * tag 'trace-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (38 commits) tracing: Fix buggered tee(2) on tracing_pipe tracing: Have trace buffer point back to trace_array ftrace: Fix synchronization location disabling and freeing ftrace_ops ftrace: Have function graph only trace based on global_ops filters ftrace: Synchronize setting function_trace_op with ftrace_trace_function tracing: Show available event triggers when no trigger is set tracing: Consolidate event trigger code tracing: Fix counter for traceon/off event triggers tracing: Remove double-underscore naming in syscall trigger invocations tracing/kprobes: Add trace event trigger invocations tracing/probes: Fix build break on !CONFIG_KPROBE_EVENT tracing/uprobes: Add @+file_offset fetch method uprobes: Allocate ->utask before handler_chain() for tracing handlers tracing/uprobes: Add support for full argument access methods tracing/uprobes: Fetch args before reserving a ring buffer tracing/uprobes: Pass 'is_return' to traceprobe_parse_probe_arg() tracing/probes: Implement 'memory' fetch method for uprobes tracing/probes: Add fetch{,_size} member into deref fetch method tracing/probes: Move 'symbol' fetch method to kprobes tracing/probes: Implement 'stack' fetch method for uprobes ...
2014-01-22fuse: fix pipe_buf_operationsMiklos Szeredi
Having this struct in module memory could Oops when if the module is unloaded while the buffer still persists in a pipe. Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal merge them into nosteal_pipe_buf_ops (this is the same as default_pipe_buf_ops except stealing the page from the buffer is not allowed). Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org
2014-01-22neighbour.h: fix commentLi Zhong
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-01-22sched: Fix warning on make htmldocs caused by wait.hMasanari Iida
Missing "@" in include/linux/wait.h cause "make htmldocs" failed with following warning messages. Warning(/home/iida/Repo/linux-next//include/linux/wait.h:304): No description found for parameter 'cmd1' Warning(/home/iida/Repo/linux-next//include/linux/wait.h:304): No description found for parameter 'cmd2' Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-01-21dm log userspace: allow mark requests to piggyback on flush requestsDongmao Zhang
In the cluster evironment, cluster write has poor performance because userspace_flush() has to contact a userspace program (cmirrord) for clear/mark/flush requests. But both mark and flush requests require cmirrord to communicate the message to all the cluster nodes for each flush call. This behaviour is really slow. To address this we now merge mark and flush requests together to reduce the kernel-userspace-kernel time. We allow a new directive, "integrated_flush" that can be used to instruct the kernel log code to combine flush and mark requests when directed by userspace. If not directed by userspace (due to an older version of the userspace code perhaps), the kernel will function as it did previously - preserving backwards compatibility. Additionally, flush requests are performed lazily when only clear requests exist. Signed-off-by: Dongmao Zhang <dmzhang@suse.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-21Merge branch 'akpm' (incoming from Andrew)Linus Torvalds
Merge first patch-bomb from Andrew Morton: - a couple of misc things - inotify/fsnotify work from Jan - ocfs2 updates (partial) - about half of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits) mm/migrate: remove unused function, fail_migrate_page() mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages mm/migrate: correct failure handling if !hugepage_migration_support() mm/migrate: add comment about permanent failure path mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure mm: compaction: reset scanner positions immediately when they meet mm: compaction: do not mark unmovable pageblocks as skipped in async compaction mm: compaction: detect when scanners meet in isolate_freepages mm: compaction: reset cached scanner pfn's before reading them mm: compaction: encapsulate defer reset logic mm: compaction: trace compaction begin and end memcg, oom: lock mem_cgroup_print_oom_info sched: add tracepoints related to NUMA task migration mm: numa: do not automatically migrate KSM pages mm: numa: trace tasks that fail migration due to rate limiting mm: numa: limit scope of lock for NUMA migrate rate limiting mm: numa: make NUMA-migrate related functions static lib/show_mem.c: show num_poisoned_pages when oom mm/hwpoison: add '#' to hwpoison_inject mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter ...
2014-01-21Merge branch 'for-3.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata Pull libata updates from Tejun Heo: "Support for some new embedded controllers. A couple late (<= a week) fixes have stable cc'd and one patch ("SATA: MV: Add support for the optional PHYs") got committed yesterday because otherwise the resulting kernel would fail boot on an embedded board due to interdependent changes in its platform tree. Other than that, nothing too noteworthy" * 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: SATA: MV: Add support for the optional PHYs sata-highbank: Remove unnecessary ahci_platform.h include libata: disable LPM for some WD SATA-I devices ARM: mvebu: update the SATA compatible string for Armada 370/XP ata: sata_mv: fix disk hotplug for Armada 370/XP SoCs ata: sata_mv: introduce compatible string "marvell, armada-370-sata" ata: pata_samsung_cf: Remove unused macros ata: pata_samsung_cf: Use devm_ioremap_resource() ata: pata_samsung_cf: Merge pata_samsung_cf.h into pata_samsung_cf.c ata: pata_samsung_cf: Move plat/regs-ata.h to drivers/ata drivers: ata: Mark the function as static in libahci.c drivers: ata: Mark the function ahci_init_interrupts() as static in ahci.c ahci: imx: fix the error handling in imx_ahci_probe() ahci: imx: ahci_imx_softreset() can be static ahci: imx: Add i.MX53 support ahci: imx: Pull out the clock enable/disable calls libata, dt: Document sata_rcar bindings sata_rcar: Add R-Car Gen2 SATA PHY support ahci: mcp89: enter AHCI mode under Apple BIOS emulation ata: libata-eh: Remove unnecessary snprintf arithmetic
2014-01-21Merge branch 'for-3.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "The bulk of changes are cleanups and preparations for the upcoming kernfs conversion. - cgroup_event mechanism which is and will be used only by memcg is moved to memcg. - pidlist handling is updated so that it can be served by seq_file. Also, the list is not sorted if sane_behavior. cgroup documentation explicitly states that the file is not sorted but it has been for quite some time. - All cgroup file handling now happens on top of seq_file. This is to prepare for kernfs conversion. In addition, all operations are restructured so that they map 1-1 to kernfs operations. - Other cleanups and low-pri fixes" * 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (40 commits) cgroup: trivial style updates cgroup: remove stray references to css_id doc: cgroups: Fix typo in doc/cgroups cgroup: fix fail path in cgroup_load_subsys() cgroup: fix missing unlock on error in cgroup_load_subsys() cgroup: remove for_each_root_subsys() cgroup: implement for_each_css() cgroup: factor out cgroup_subsys_state creation into create_css() cgroup: combine css handling loops in cgroup_create() cgroup: reorder operations in cgroup_create() cgroup: make for_each_subsys() useable under cgroup_root_mutex cgroup: css iterations and css_from_dir() are safe under cgroup_mutex cgroup: unify pidlist and other file handling cgroup: replace cftype->read_seq_string() with cftype->seq_show() cgroup: attach cgroup_open_file to all cgroup files cgroup: generalize cgroup_pidlist_open_file cgroup: unify read path so that seq_file is always used cgroup: unify cgroup_write_X64() and cgroup_write_string() cgroup: remove cftype->read(), ->read_map() and ->write() hugetlb_cgroup: convert away from cftype->read() ...
2014-01-21Merge tag 'gfs2-merge-window' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw Pull GFS2 updates from Steven Whitehouse: "The main topics this time are allocation, in the form of Bob's improvements when searching resource groups and several updates to quotas which should increase scalability. The quota changes follow on from those in the last merge window, and there will likely be further work to come in this area in due course. There are also a few patches which help to improve efficiency of adding entries into directories, and clean up some of that code. One on-disk change is included this time, which is to write some additional information which should be useful to fsck and also potentially for debugging. Other than that, its just a few small random bug fixes and clean ups" * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (24 commits) GFS2: revert "GFS2: d_splice_alias() can't return error" GFS2: Small cleanup GFS2: Don't use ENOBUFS when ENOMEM is the correct error code GFS2: Fix kbuild test robot reported warning GFS2: Move quota bitmap operations under their own lock GFS2: Clean up quota slot allocation GFS2: Only run logd and quota when mounted read/write GFS2: Use RCU/hlist_bl based hash for quotas GFS2: No need to invalidate pages for a dio read GFS2: Add initialization for address space in super block GFS2: Add hints to directory leaf blocks GFS2: For exhash conversion, only one block is needed GFS2: Increase i_writecount during gfs2_setattr_chown GFS2: Remember directory insert point GFS2: Consolidate transaction blocks calculation for dir add GFS2: Add directory addition info structure GFS2: Use only a single address space for rgrps GFS2: Use range based functions for rgrp sync/invalidation GFS2: Remove test which is always true GFS2: Remove gfs2_quota_change_host structure ...
2014-01-21mm/migrate: remove unused function, fail_migrate_page()Joonsoo Kim
fail_migrate_page() isn't used anywhere, so remove it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pagesJoonsoo Kim
Some part of putback_lru_pages() and putback_movable_pages() is duplicated, so it could confuse us what we should use. We can remove putback_lru_pages() since it is not really needed now. This makes us undestand and maintain the code more easily. And comment on putback_movable_pages() is stale now, so fix it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21mm: compaction: encapsulate defer reset logicVlastimil Babka
Currently there are several functions to manipulate the deferred compaction state variables. The remaining case where the variables are touched directly is when a successful allocation occurs in direct compaction, or is expected to be successful in the future by kswapd. Here, the lowest order that is expected to fail is updated, and in the case of successful allocation, the deferred status and counter is reset completely. Create a new function compaction_defer_reset() to encapsulate this functionality and make it easier to understand the code. No functional change. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21mm: compaction: trace compaction begin and endMel Gorman
The broad goal of the series is to improve allocation success rates for huge pages through memory compaction, while trying not to increase the compaction overhead. The original objective was to reintroduce capturing of high-order pages freed by the compaction, before they are split by concurrent activity. However, several bugs and opportunities for simple improvements were found in the current implementation, mostly through extra tracepoints (which are however too ugly for now to be considered for sending). The patches mostly deal with two mechanisms that reduce compaction overhead, which is caching the progress of migrate and free scanners, and marking pageblocks where isolation failed to be skipped during further scans. Patch 1 (from mgorman) adds tracepoints that allow calculate time spent in compaction and potentially debug scanner pfn values. Patch 2 encapsulates the some functionality for handling deferred compactions for better maintainability, without a functional change type is not determined without being actually needed. Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after they have been read to initialize a compaction run. Patch 4 fixes a bug where scanners meeting is sometimes not properly detected and can lead to multiple compaction attempts quitting early without doing any work. Patch 5 improves the chances of sync compaction to process pageblocks that async compaction has skipped due to being !MIGRATE_MOVABLE. Patch 6 improves the chances of sync direct compaction to actually do anything when called after async compaction fails during allocation slowpath. The impact of patches were validated using mmtests's stress-highalloc benchmark with mmtests's stress-highalloc benchmark on a x86_64 machine with 4GB memory. Due to instability of the results (mostly related to the bugs fixed by patches 2 and 3), 10 iterations were performed, taking min,mean,max values for success rates and mean values for time and vmstat-based metrics. First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the patches stacked on top of v3.13-rc2. Patch 2 is OK to serve as baseline due to no functional changes in 1 and 2. Comments below. stress-highalloc 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 2-nothp 3-nothp 4-nothp 5-nothp 6-nothp Success 1 Min 9.00 ( 0.00%) 10.00 (-11.11%) 43.00 (-377.78%) 43.00 (-377.78%) 33.00 (-266.67%) Success 1 Mean 27.50 ( 0.00%) 25.30 ( 8.00%) 45.50 (-65.45%) 45.90 (-66.91%) 46.30 (-68.36%) Success 1 Max 36.00 ( 0.00%) 36.00 ( 0.00%) 47.00 (-30.56%) 48.00 (-33.33%) 52.00 (-44.44%) Success 2 Min 10.00 ( 0.00%) 8.00 ( 20.00%) 46.00 (-360.00%) 45.00 (-350.00%) 35.00 (-250.00%) Success 2 Mean 26.40 ( 0.00%) 23.50 ( 10.98%) 47.30 (-79.17%) 47.60 (-80.30%) 48.10 (-82.20%) Success 2 Max 34.00 ( 0.00%) 33.00 ( 2.94%) 48.00 (-41.18%) 50.00 (-47.06%) 54.00 (-58.82%) Success 3 Min 65.00 ( 0.00%) 63.00 ( 3.08%) 85.00 (-30.77%) 84.00 (-29.23%) 85.00 (-30.77%) Success 3 Mean 76.70 ( 0.00%) 70.50 ( 8.08%) 86.20 (-12.39%) 85.50 (-11.47%) 86.00 (-12.13%) Success 3 Max 87.00 ( 0.00%) 86.00 ( 1.15%) 88.00 ( -1.15%) 87.00 ( 0.00%) 87.00 ( 0.00%) 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 2-nothp 3-nothp 4-nothp 5-nothp 6-nothp User 6437.72 6459.76 5960.32 5974.55 6019.67 System 1049.65 1049.09 1029.32 1031.47 1032.31 Elapsed 1856.77 1874.48 1949.97 1994.22 1983.15 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 2-nothp 3-nothp 4-nothp 5-nothp 6-nothp Minor Faults 253952267 254581900 250030122 250507333 250157829 Major Faults 420 407 506 530 530 Swap Ins 4 9 9 6 6 Swap Outs 398 375 345 346 333 Direct pages scanned 197538 189017 298574 287019 299063 Kswapd pages scanned 1809843 1801308 1846674 1873184 1861089 Kswapd pages reclaimed 1806972 1798684 1844219 1870509 1858622 Direct pages reclaimed 197227 188829 298380 286822 298835 Kswapd efficiency 99% 99% 99% 99% 99% Kswapd velocity 953.382 970.449 952.243 934.569 922.286 Direct efficiency 99% 99% 99% 99% 99% Direct velocity 104.058 101.832 153.961 143.200 148.205 Percentage direct scans 9% 9% 13% 13% 13% Zone normal velocity 347.289 359.676 348.063 339.933 332.983 Zone dma32 velocity 710.151 712.605 758.140 737.835 737.507 Zone dma velocity 0.000 0.000 0.000 0.000 0.000 Page writes by reclaim 557.600 429.000 353.600 426.400 381.800 Page writes file 159 53 7 79 48 Page writes anon 398 375 345 346 333 Page reclaim immediate 825 644 411 575 420 Sector Reads 2781750 2769780 2878547 2939128 2910483 Sector Writes 12080843 12083351 12012892 12002132 12010745 Page rescued immediate 0 0 0 0 0 Slabs scanned 1575654 1545344 1778406 1786700 1794073 Direct inode steals 9657 10037 15795 14104 14645 Kswapd inode steals 46857 46335 50543 50716 51796 Kswapd skipped wait 0 0 0 0 0 THP fault alloc 97 91 81 71 77 THP collapse alloc 456 506 546 544 565 THP splits 6 5 5 4 4 THP fault fallback 0 1 0 0 0 THP collapse fail 14 14 12 13 12 Compaction stalls 1006 980 1537 1536 1548 Compaction success 303 284 562 559 578 Compaction failures 702 696 974 976 969 Page migrate success 1177325 1070077 3927538 3781870 3877057 Page migrate failure 0 0 0 0 0 Compaction pages isolated 2547248 2306457 8301218 8008500 8200674 Compaction migrate scanned 42290478 38832618 153961130 154143900 159141197 Compaction free scanned 89199429 79189151 356529027 351943166 356326727 Compaction cost 1566 1426 5312 5156 5294 NUMA PTE updates 0 0 0 0 0 NUMA hint faults 0 0 0 0 0 NUMA hint local faults 0 0 0 0 0 NUMA hint local percent 100 100 100 100 100 NUMA pages migrated 0 0 0 0 0 AutoNUMA cost 0 0 0 0 0 Observations: - The "Success 3" line is allocation success rate with system idle (phases 1 and 2 are with background interference). I used to get stable values around 85% with vanilla 3.11. The lower min and mean values came with 3.12. This was bisected to commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") As explained in comment for patch 3, I don't think the commit is wrong, but that it makes the effect of compaction bugs worse. From patch 3 onwards, the results are OK and match the 3.11 results. - Patch 4 also clearly helps phases 1 and 2, and exceeds any results I've seen with 3.11 (I didn't measure it that thoroughly then, but it was never above 40%). - Compaction cost and number of scanned pages is higher, especially due to patch 4. However, keep in mind that patches 3 and 4 fix existing bugs in the current design of compaction overhead mitigation, they do not change it. If overhead is found unacceptable, then it should be decreased differently (and consistently, not due to random conditions) than the current implementation does. In contrast, patches 5 and 6 (which are not strictly bug fixes) do not increase the overhead (but also not success rates). This might be a limitation of the stress-highalloc benchmark as it's quite uniform. Another set of results is when configuring stress-highalloc t allocate with similar flags as THP uses: (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD) stress-highalloc 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 2-thp 3-thp 4-thp 5-thp 6-thp Success 1 Min 2.00 ( 0.00%) 7.00 (-250.00%) 18.00 (-800.00%) 19.00 (-850.00%) 26.00 (-1200.00%) Success 1 Mean 19.20 ( 0.00%) 17.80 ( 7.29%) 29.20 (-52.08%) 29.90 (-55.73%) 32.80 (-70.83%) Success 1 Max 27.00 ( 0.00%) 29.00 ( -7.41%) 35.00 (-29.63%) 36.00 (-33.33%) 37.00 (-37.04%) Success 2 Min 3.00 ( 0.00%) 8.00 (-166.67%) 21.00 (-600.00%) 21.00 (-600.00%) 32.00 (-966.67%) Success 2 Mean 19.30 ( 0.00%) 17.90 ( 7.25%) 32.20 (-66.84%) 32.60 (-68.91%) 35.70 (-84.97%) Success 2 Max 27.00 ( 0.00%) 30.00 (-11.11%) 36.00 (-33.33%) 37.00 (-37.04%) 39.00 (-44.44%) Success 3 Min 62.00 ( 0.00%) 62.00 ( 0.00%) 85.00 (-37.10%) 75.00 (-20.97%) 64.00 ( -3.23%) Success 3 Mean 66.30 ( 0.00%) 65.50 ( 1.21%) 85.60 (-29.11%) 83.40 (-25.79%) 83.50 (-25.94%) Success 3 Max 70.00 ( 0.00%) 69.00 ( 1.43%) 87.00 (-24.29%) 86.00 (-22.86%) 87.00 (-24.29%) 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 2-thp 3-thp 4-thp 5-thp 6-thp User 6547.93 6475.85 6265.54 6289.46 6189.96 System 1053.42 1047.28 1043.23 1042.73 1038.73 Elapsed 1835.43 1821.96 1908.67 1912.74 1956.38 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 2-thp 3-thp 4-thp 5-thp 6-thp Minor Faults 256805673 253106328 253222299 249830289 251184418 Major Faults 395 375 423 434 448 Swap Ins 12 10 10 12 9 Swap Outs 530 537 487 455 415 Direct pages scanned 71859 86046 153244 152764 190713 Kswapd pages scanned 1900994 1870240 1898012 1892864 1880520 Kswapd pages reclaimed 1897814 1867428 1894939 1890125 1877924 Direct pages reclaimed 71766 85908 153167 152643 190600 Kswapd efficiency 99% 99% 99% 99% 99% Kswapd velocity 1029.000 1067.782 1000.091 991.049 951.218 Direct efficiency 99% 99% 99% 99% 99% Direct velocity 38.897 49.127 80.747 79.983 96.468 Percentage direct scans 3% 4% 7% 7% 9% Zone normal velocity 351.377 372.494 348.910 341.689 335.310 Zone dma32 velocity 716.520 744.414 731.928 729.343 712.377 Zone dma velocity 0.000 0.000 0.000 0.000 0.000 Page writes by reclaim 669.300 604.000 545.700 538.900 429.900 Page writes file 138 66 58 83 14 Page writes anon 530 537 487 455 415 Page reclaim immediate 806 655 772 548 517 Sector Reads 2711956 2703239 2811602 2818248 2839459 Sector Writes 12163238 12018662 12038248 11954736 11994892 Page rescued immediate 0 0 0 0 0 Slabs scanned 1385088 1388364 1507968 1513292 1558656 Direct inode steals 1739 2564 4622 5496 6007 Kswapd inode steals 47461 46406 47804 48013 48466 Kswapd skipped wait 0 0 0 0 0 THP fault alloc 110 82 84 69 70 THP collapse alloc 445 482 467 462 539 THP splits 6 5 4 5 3 THP fault fallback 3 0 0 0 0 THP collapse fail 15 14 14 14 13 Compaction stalls 659 685 1033 1073 1111 Compaction success 222 225 410 427 456 Compaction failures 436 460 622 646 655 Page migrate success 446594 439978 1085640 1095062 1131716 Page migrate failure 0 0 0 0 0 Compaction pages isolated 1029475 1013490 2453074 2482698 2565400 Compaction migrate scanned 9955461 11344259 24375202 27978356 30494204 Compaction free scanned 27715272 28544654 80150615 82898631 85756132 Compaction cost 552 555 1344 1379 1436 NUMA PTE updates 0 0 0 0 0 NUMA hint faults 0 0 0 0 0 NUMA hint local faults 0 0 0 0 0 NUMA hint local percent 100 100 100 100 100 NUMA pages migrated 0 0 0 0 0 AutoNUMA cost 0 0 0 0 0 There are some differences from the previous results for THP-like allocations: - Here, the bad result for unpatched kernel in phase 3 is much more consistent to be between 65-70% and not related to the "regression" in 3.12. Still there is the improvement from patch 4 onwards, which brings it on par with simple GFP_HIGHUSER_MOVABLE allocations. - Compaction costs have increased, but nowhere near as much as the non-THP case. Again, the patches should be worth the gained determininsm. - Patches 5 and 6 somewhat increase the number of migrate-scanned pages. This is most likely due to __GFP_NO_KSWAPD flag, which means the cached pfn's and pageblock skip bits are not reset by kswapd that often (at least in phase 3 where no concurrent activity would wake up kswapd) and the patches thus help the sync-after-async compaction. It doesn't however show that the sync compaction would help so much with success rates, which can be again seen as a limitation of the benchmark scenario. This patch (of 6): Add two tracepoints for compaction begin and end of a zone. Using this it is possible to calculate how much time a workload is spending within compaction and potentially debug problems related to cached pfns for scanning. In combination with the direct reclaim and slab trace points it should be possible to estimate most allocation-related overhead for a workload. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21sched: add tracepoints related to NUMA task migrationMel Gorman
This patch adds three tracepoints o trace_sched_move_numa when a task is moved to a node o trace_sched_swap_numa when a task is swapped with another task o trace_sched_stick_numa when a numa-related migration fails The tracepoints allow the NUMA scheduler activity to be monitored and the following high-level metrics can be calculated o NUMA migrated stuck nr trace_sched_stick_numa o NUMA migrated idle nr trace_sched_move_numa o NUMA migrated swapped nr trace_sched_swap_numa o NUMA local swapped trace_sched_swap_numa src_nid == dst_nid (should never happen) o NUMA remote swapped trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped) o NUMA group swapped trace_sched_swap_numa src_ngid == dst_ngid Maybe a small number of these are acceptable but a high number would be a major surprise. It would be even worse if bounces are frequent. o NUMA avg task migs. Average number of migrations for tasks o NUMA stddev task mig Self-explanatory o NUMA max task migs. Maximum number of migrations for a single task In general the intent of the tracepoints is to help diagnose problems where automatic NUMA balancing appears to be doing an excessive amount of useless work. [akpm@linux-foundation.org: remove semicolon-after-if, repair coding-style] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21mm: numa: trace tasks that fail migration due to rate limitingMel Gorman
A low local/remote numa hinting fault ratio is potentially explained by failed migrations. This patch adds a tracepoint that fires when migration fails due to migration rate limitation. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>