summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2012-03-22xfs: introduce an allocation workqueueDave Chinner
We currently have significant issues with the amount of stack that allocation in XFS uses, especially in the writeback path. We can easily consume 4k of stack between mapping the page, manipulating the bmap btree and allocating blocks from the free list. Not to mention btree block readahead and other functionality that issues IO in the allocation path. As a result, we can no longer fit allocation in the writeback path in the stack space provided on x86_64. To alleviate this problem, introduce an allocation workqueue and move all allocations to a seperate context. This can be easily added as an interposing layer into xfs_alloc_vextent(), which takes a single argument structure and does not return until the allocation is complete or has failed. To do this, add a work structure and a completion to the allocation args structure. This allows xfs_alloc_vextent to queue the args onto the workqueue and wait for it to be completed by the worker. This can be done completely transparently to the caller. The worker function needs to ensure that it sets and clears the PF_TRANS flag appropriately as it is being run in an active transaction context. Work can also be queued in a memory reclaim context, so a rescuer is needed for the workqueue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-03-22xfs: Fix open flag handling in open_by_handle codeDave Chinner
Sparse identified some unsafe handling of open flags in the xfs open by handle ioctl code. Update the code to use the correct access macros to ensure that we handle the open flags correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-03-22xfs: fix deadlock in xfs_rtfree_extentKamal Dasu
To fix the deadlock caused by repeatedly calling xfs_rtfree_extent - removed xfs_ilock() and xfs_trans_ijoin() from xfs_rtfree_extent(), instead added asserts that the inode is locked and has an inode_item attached to it. - in xfs_bunmapi() when dealing with an inode with the rt flag call xfs_ilock() and xfs_trans_ijoin() so that the reference count is bumped on the inode and attached it to the transaction before calling into xfs_bmap_del_extent, similar to what we do in xfs_bmap_rtalloc. Signed-off-by: Kamal Dasu <kdasu.kdev@gmail.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-03-22fs: xfs: fix section mismatch in linux-nextGerard Snitselaar
xfs_qm_exit() is called in init_xfs_fs(). Signed-off-by: Gerard Snitselaar <dev@snitselaar.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-03-22Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge first batch of patches from Andrew Morton: "A few misc things and all the MM queue" * emailed from Andrew Morton <akpm@linux-foundation.org>: (92 commits) memcg: avoid THP split in task migration thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE memcg: clean up existing move charge code mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read() mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event() mm/memcontrol.c: s/stealed/stolen/ memcg: fix performance of mem_cgroup_begin_update_page_stat() memcg: remove PCG_FILE_MAPPED memcg: use new logic for page stat accounting memcg: remove PCG_MOVE_LOCK flag from page_cgroup memcg: simplify move_account() check memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat) memcg: kill dead prev_priority stubs memcg: remove PCG_CACHE page_cgroup flag memcg: let css_get_next() rely upon rcu_read_lock() cgroup: revert ss_id_lock to spinlock idr: make idr_get_next() good for rcu_read_lock() memcg: remove unnecessary thp check in page stat accounting memcg: remove redundant returns memcg: enum lru_list lru ...
2012-03-22ceph: fix three bugs, two in ceph_vxattrcb_file_layout()Alex Elder
In ceph_vxattrcb_file_layout(), there is a check to determine whether a preferred PG should be formatted into the output buffer. That check assumes that a preferred PG number of 0 indicates "no preference," but that is wrong. No preference is indicated by a negative (specifically, -1) PG number. In addition, if that condition yields true, the preferred value is formatted into a sized buffer, but the size consumed by the earlier snprintf() call is not accounted for, opening up the possibilty of a buffer overrun. Finally, in ceph_vxattrcb_dir_rctime() where the nanoseconds part of the time displayed did not include leading 0's, which led to erroneous (sub-second portion of) time values being shown. This fixes these three issues: http://tracker.newdream.net/issues/2155 http://tracker.newdream.net/issues/2156 http://tracker.newdream.net/issues/2157 Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: ensure Boolean options support both sensesAlex Elder
Many ceph-related Boolean options offer the ability to both enable and disable a feature. For all those that don't offer this, add a new option so that they do. Note that ceph_show_options()--which reports mount options currently in effect--only reports the option if it is different from the default value. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22rbd: make ceph_parse_options() return a pointerAlex Elder
ceph_parse_options() takes the address of a pointer as an argument and uses it to return the address of an allocated structure if successful. With this interface is not evident at call sites that the pointer is always initialized. Change the interface to return the address instead (or a pointer-coded error code) to make the validity of the returned pointer obvious. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: make ceph_setxattr() and ceph_removexattr() more alikeAlex Elder
This patch just rearranges a few bits of code to make more portions of ceph_setxattr() and ceph_removexattr() identical. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: avoid repeatedly computing the size of constant vxattr namesAlex Elder
All names defined in the directory and file virtual extended attribute tables are constant, and the size of each is known at compile time. So there's no need to compute their length every time any file's attribute is listed. Record the length of each string and use it when needed to determine the space need to represent them. In addition, compute the aggregate size of strings in each table just once at initialization time. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: encode type in vxattr callback routinesAlex Elder
The names of the callback functions used for virtual extended attributes are based only on the last component of the attribute name. Because of the way these are defined, this precludes allowing a single (lowest) attribute name for different callbacks, dependent on the type of file being operated on. (For example, it might be nice to support both "ceph.dir.layout" and "ceph.file.layout".) Just change the callback names to avoid this problem. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: drop "_cb" from name of struct ceph_vxattr_cbAlex Elder
A struct ceph_vxattr_cb does not represent a callback at all, but rather a virtual extended attribute itself. Drop the "_cb" suffix from its name to reflect that. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: use macros to normalize vxattr table definitionsAlex Elder
Entries in the ceph virtual extended attribute tables all follow a distinct pattern in their definition. Enforce this pattern through the use of a macro. Also, a null name field signals the end of the table, so make that be the first field in the ceph_vxattr_cb structure. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: use a symbolic name for "ceph." extended attribute namespaceAlex Elder
Use symbolic constants to define the top-level prefix for "ceph." extended attribute names. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: pass inode rather than table to ceph_match_vxattr()Alex Elder
All callers of ceph_match_vxattr() determine what to pass as the first argument by calling ceph_inode_vxattrs(inode). Just do that inside ceph_match_vxattr() itself, changing it to take an inode rather than the vxattr pointer as its first argument. Also ensure the function works correctly for an empty table (i.e., containing only a terminating null entry). Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: don't null-terminate xattr valuesAlex Elder
For some reason, ceph_setxattr() allocates an extra byte in which a '\0' is stored past the end of an extended attribute value. This is not needed, and is potentially misleading, so get rid of it. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: fix overflow check in build_snap_context()Xi Wang
The overflow check for a + n * b should be (n > (ULONG_MAX - a) / b), rather than (n > ULONG_MAX / b - a). Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: avoid panic with mismatched symlink sizes in fill_inode()Xi Wang
Return -EINVAL rather than panic if iinfo->symlink_len and inode->i_size do not match. Also use kstrndup rather than kmalloc/memcpy. Signed-off-by: Xi Wang <xi.wang@gmail.com> Reviewed-by: Alex Elder <elder@dreamhost.com>
2012-03-22ceph: use 2 instead of 1 as fallback for 32-bit inode numberAmon Ott
The root directory of the Ceph mount has inode number 1, so falling back to 1 always creates a collision. 2 is unused on my test systems and seems less likely to collide. Signed-off-by: Amon Ott <ao@m-privacy.de> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: don't reset s_cap_ttl to zeroAlex Elder
Avoid the need to check for a special zero s_cap_ttl value by just using (jiffies - 1) as the value assigned to indicate "sometime in the past." Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22btrfs: Fix busyloop in transaction_kthread()Jan Kara
When a filesystem got aborted due do error, transaction_kthread() will busyloop. Fix it by going to sleep in that case as well. Maybe we should just stop transaction_kthread() when filesystem is aborted but that would be more complex. Signed-off-by: Jan Kara <jack@suse.cz>
2012-03-22btrfs: replace many BUG_ONs with proper error handlingJeff Mahoney
btrfs currently handles most errors with BUG_ON. This patch is a work-in- progress but aims to handle most errors other than internal logic errors and ENOMEM more gracefully. This iteration prevents most crashes but can run into lockups with the page lock on occasion when the timing "works out." Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-21ext4: remove useless s_dirt assignmentArtem Bityutskiy
Clean-up ext4 a tiny bit by removing useless s_dirt assignment in 'ext4_fill_super()' because a bit later we anyway call 'ext4_setup_super()' which writes the superblock to the media unconditionally. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21ext4: write superblock only once on unmountArtem Bityutskiy
In some rather rare cases it is possible that ext4 may the superblock to the media twice. This patch makes sure this does not happen. This should speed up unmounting in those cases. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21ext4: do not mark superblock as dirty unnecessarilyArtem Bityutskiy
Commit a0375156ca1041574b5d47cc7e32f10b891151b0 cleaned up superblock dirtying handling, but missed one place. This patch does what was intended: if we have the journal, then we update the superblock through the journal rather than doing this directly. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21ext4: correct ext4_punch_hole return codesAllison Henderson
ext4_punch_hole returns -ENOTSUPP but it should be using -EOPNOTSUPP Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc merge from Benjamin Herrenschmidt: "Here's the powerpc batch for this merge window. It is going to be a bit more nasty than usual as in touching things outside of arch/powerpc mostly due to the big iSeriesectomy :-) We finally got rid of the bugger (legacy iSeries support) which was a PITA to maintain and that nobody really used anymore. Here are some of the highlights: - Legacy iSeries is gone. Thanks Stephen ! There's still some bits and pieces remaining if you do a grep -ir series arch/powerpc but they are harmless and will be removed in the next few weeks hopefully. - The 'fadump' functionality (Firmware Assisted Dump) replaces the previous (equivalent) "pHyp assisted dump"... it's a rewrite of a mechanism to get the hypervisor to do crash dumps on pSeries, the new implementation hopefully being much more reliable. Thanks Mahesh Salgaonkar. - The "EEH" code (pSeries PCI error handling & recovery) got a big spring cleaning, motivated by the need to be able to implement a new backend for it on top of some new different type of firwmare. The work isn't complete yet, but a good chunk of the cleanups is there. Note that this adds a field to struct device_node which is not very nice and which Grant objects to. I will have a patch soon that moves that to a powerpc private data structure (hopefully before rc1) and we'll improve things further later on (hopefully getting rid of the need for that pointer completely). Thanks Gavin Shan. - I dug into our exception & interrupt handling code to improve the way we do lazy interrupt handling (and make it work properly with "edge" triggered interrupt sources), and while at it found & fixed a wagon of issues in those areas, including adding support for page fault retry & fatal signals on page faults. - Your usual random batch of small fixes & updates, including a bunch of new embedded boards, both Freescale and APM based ones, etc..." I fixed up some conflicts with the generalized irq-domain changes from Grant Likely, hopefully correctly. * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (141 commits) powerpc/ps3: Do not adjust the wrapper load address powerpc: Remove the rest of the legacy iSeries include files powerpc: Remove the remaining CONFIG_PPC_ISERIES pieces init: Remove CONFIG_PPC_ISERIES powerpc: Remove FW_FEATURE ISERIES from arch code tty/hvc_vio: FW_FEATURE_ISERIES is no longer selectable powerpc/spufs: Fix double unlocks powerpc/5200: convert mpc5200 to use of_platform_populate() powerpc/mpc5200: add options to mpc5200_defconfig powerpc/mpc52xx: add a4m072 board support powerpc/mpc5200: update mpc5200_defconfig to fit for charon board Documentation/powerpc/mpc52xx.txt: Checkpatch cleanup powerpc/44x: Add additional device support for APM821xx SoC and Bluestone board powerpc/44x: Add support PCI-E for APM821xx SoC and Bluestone board MAINTAINERS: Update PowerPC 4xx tree powerpc/44x: The bug fixed support for APM821xx SoC and Bluestone board powerpc: document the FSL MPIC message register binding powerpc: add support for MPIC message register API powerpc/fsl: Added aliased MSIIR register address to MSI node in dts powerpc/85xx: mpc8548cds - add 36-bit dts ...
2012-03-21ext4: remove restrictive checks for EOFBLOCKS_FLLukas Czerner
We are going to remove the EOFBLOCKS_FL flag in the future, so this is the first part of the removal. We can not remove it entirely just now, since the e2fsck is still checking for it and it might cause headache to some people. Instead, remove the restrictive checks now and the rest later, when the new e2fsck code is out and common enough. This is also needed because punch hole already breaks the EOFBLOCKS_FL semantics, so it might cause the some troubles. So simply remove it. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21ext4: always set then trimmed blocks count into lenLukas Czerner
Currently if the range to trim is too small, for example on 1K fs the request to trim the first block, then the 'range->len' is not set reporting wrong number of discarded block to the caller. Fix this by always setting the 'range->len' before we return. Note that when there is a failure (-EINVAL) caller can not depend on 'range->len' being set properly. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21ext4: fix trimmed block count accuntingLukas Czerner
Currently when there is not enough free blocks in the block group to discard (grp->bb_free < minlen) the 'trimmed' is bumped up anyway with the number of discarded blocks from the previous iteration. Fix this by bumping up 'trimmed' only if the ext4_trim_all_free() was actually run. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21ext4: fix start and len arguments handling in ext4_trim_fs()Lukas Czerner
The overflow can happen when we are calling get_group_no_and_offset() which stores the group number in the ext4_grpblk_t type which is actually int. However when the blocknr is big enough the group number might be bigger than ext4_grpblk_t resulting in overflow. This will most likely happen with FITRIM default argument len = ULLONG_MAX. Fix this by using "end" variable instead of "start+len" as it is easier to get right and specifically check that the end is not beyond the end of the file system, so we are sure that the result of get_group_no_and_offset() will not overflow. Otherwise truncate it to the size of the file system. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmwLinus Torvalds
Pull gfs2 changes from Steven Whitehouse. * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: GFS2: Change truncate page allocation to be GFP_NOFS GFS2: call gfs2_write_alloc_required for each chunk GFS2: Clean up log flush header writing GFS2: Remove a __GFP_NOFAIL allocation GFS2: Flush pending glock work when evicting an inode GFS2: make sure rgrps are up to date in func gfs2_blk2rgrpd GFS2: Eliminate sd_rindex_mutex GFS2: Unlock rindex mutex on glock error GFS2: Make bd_cmp() static GFS2: Sort the ordered write list GFS2: FITRIM ioctl support GFS2: Move two functions from log.c to lops.c GFS2: glock statistics gathering
2012-03-21hugetlbfs: return error code when initializing moduleHillf Danton
Return an errno upon failure to create inode kmem cache, and unregister the FS upon failure to mount. [akpm@linux-foundation.org: remove unneeded test of `error'] Signed-off-by: Hillf Danton <dhillf@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21hugetlbfs: fix alignment of huge page requestsSteven Truelove
When calling shmget() with SHM_HUGETLB, shmget aligns the request size to PAGE_SIZE, but this is not sufficient. Modify hugetlb_file_setup() to align requests to the huge page size, and to accept an address argument so that all alignment checks can be performed in hugetlb_file_setup(), rather than in its callers. Change newseg() and mmap_pgoff() to match the new prototype and eliminate a now redundant alignment check. [akpm@linux-foundation.org: fix build] Signed-off-by: Steven Truelove <steven.truelove@utoronto.ca> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21mm, hugetlb: add thread name and pid to SHM_HUGETLB mlock rlimit warningDavid Rientjes
Add the thread name and pid of the application that is allocating shm segments with MAP_HUGETLB without being a part of /proc/sys/vm/hugetlb_shm_group or having CAP_IPC_LOCK. This identifies the application so it may be fixed by avoiding using the deprecated exception (see Documentation/feature-removal-schedule.txt). Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat()David Rientjes
sync_mm_rss() can only be used for current to avoid race conditions in iterating and clearing its per-task counters. Remove the task argument for it and its helper function, __sync_task_rss_stat(), to avoid thinking it can be used safely for anything other than current. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21hugepages: fix use after free bug in "quota" handlingDavid Gibson
hugetlbfs_{get,put}_quota() are badly named. They don't interact with the general quota handling code, and they don't much resemble its behaviour. Rather than being about maintaining limits on on-disk block usage by particular users, they are instead about maintaining limits on in-memory page usage (including anonymous MAP_PRIVATE copied-on-write pages) associated with a particular hugetlbfs filesystem instance. Worse, they work by having callbacks to the hugetlbfs filesystem code from the low-level page handling code, in particular from free_huge_page(). This is a layering violation of itself, but more importantly, if the kernel does a get_user_pages() on hugepages (which can happen from KVM amongst others), then the free_huge_page() can be delayed until after the associated inode has already been freed. If an unmount occurs at the wrong time, even the hugetlbfs superblock where the "quota" limits are stored may have been freed. Andrew Barry proposed a patch to fix this by having hugepages, instead of storing a pointer to their address_space and reaching the superblock from there, had the hugepages store pointers directly to the superblock, bumping the reference count as appropriate to avoid it being freed. Andrew Morton rejected that version, however, on the grounds that it made the existing layering violation worse. This is a reworked version of Andrew's patch, which removes the extra, and some of the existing, layering violation. It works by introducing the concept of a hugepage "subpool" at the lower hugepage mm layer - that is a finite logical pool of hugepages to allocate from. hugetlbfs now creates a subpool for each filesystem instance with a page limit set, and a pointer to the subpool gets added to each allocated hugepage, instead of the address_space pointer used now. The subpool has its own lifetime and is only freed once all pages in it _and_ all other references to it (i.e. superblocks) are gone. subpools are optional - a NULL subpool pointer is taken by the code to mean that no subpool limits are in effect. Previous discussion of this bug found in: "Fix refcounting in hugetlbfs quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or http://marc.info/?l=linux-mm&m=126928970510627&w=1 v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to alloc_huge_page() - since it already takes the vma, it is not necessary. Signed-off-by: Andrew Barry <abarry@cray.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21hugetlb: cleanup hugetlb.hDavid Gibson
Make a couple of small cleanups to linux/include/hugetlb.h. The set_file_hugepages() function, which was not used anywhere is removed, and the hugetlbfs_config and hugetlbfs_inode_info structures with its HUGETLBFS_I helper function are moved into inode.c, the only place they were used. These structures are really linked to the hugetlbfs filesystem specifically not to hugepage mm handling in general, so they belong in the filesystem code not in a generally available header. It would be nice to move the hugetlbfs_sb_info (superblock) structure in there as well, but it's currently needed in a number of places via the hstate_vma() and hstate_inode(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Cc: Hugh Dickins <hughd@google.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Andrew Barry <abarry@cray.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21hugetlbfs: avoid taking i_mutex from hugetlbfs_read()Aneesh Kumar K.V
Taking i_mutex in hugetlbfs_read() can result in deadlock with mmap as explained below Thread A: read() on hugetlbfs hugetlbfs_read() called i_mutex grabbed hugetlbfs_read_actor() called __copy_to_user() called page fault is triggered Thread B, sharing address space with A: mmap() the same file ->mmap_sem is grabbed on task_B->mm->mmap_sem hugetlbfs_file_mmap() is called attempt to grab ->i_mutex and block waiting for A to give it up Thread A: pagefault handled blocked on attempt to grab task_A->mm->mmap_sem, which happens to be the same thing as task_B->mm->mmap_sem. Block waiting for B to give it up. AFAIU the i_mutex locking was added to hugetlbfs_read() as per http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3066.html to take care of the race between truncate and read. This patch fixes this by looking at page->mapping under lock_page() (find_lock_page()) to ensure that the inode didn't get truncated in the range during a parallel read. Ideally we can extend the patch to make sure we don't increase i_size in mmap. But that will break userspace, because applications will now have to use truncate(2) to increase i_size in hugetlbfs. Based on the original patch from Hillf Danton. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@kernel.org> [everything after 2007 :)] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21procfs: mark thread stack correctly in proc/<pid>/mapsSiddhesh Poyarekar
Stack for a new thread is mapped by userspace code and passed via sys_clone. This memory is currently seen as anonymous in /proc/<pid>/maps, which makes it difficult to ascertain which mappings are being used for thread stacks. This patch uses the individual task stack pointers to determine which vmas are actually thread stacks. For a multithreaded program like the following: #include <pthread.h> void *thread_main(void *foo) { while(1); } int main() { pthread_t t; pthread_create(&t, NULL, thread_main, NULL); pthread_join(t, NULL); } proc/PID/maps looks like the following: 00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out 00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out 019ef000-01a10000 rw-p 00000000 00:00 0 [heap] 7f8a44491000-7f8a44492000 ---p 00000000 00:00 0 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0 7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0 7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0 7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0 7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack] 7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since the earlier vma that has no permissions (7f8a44e3d000-7f8a4503d000) but that is not always a reliable way to find out which vma is a thread stack. Also, /proc/PID/maps and /proc/PID/task/TID/maps has the same content. With this patch in place, /proc/PID/task/TID/maps are treated as 'maps as the task would see it' and hence, only the vma that that task uses as stack is marked as [stack]. All other 'stack' vmas are marked as anonymous memory. /proc/PID/maps acts as a thread group level view, where all thread stack vmas are marked as [stack:TID] where TID is the process ID of the task that uses that vma as stack, while the process stack is marked as [stack]. So /proc/PID/maps will look like this: 00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out 00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out 019ef000-01a10000 rw-p 00000000 00:00 0 [heap] 7f8a44491000-7f8a44492000 ---p 00000000 00:00 0 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442] 7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0 7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0 7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0 7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0 7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack] 7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] Thus marking all vmas that are used as stacks by the threads in the thread group along with the process stack. The task level maps will however like this: 00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out 00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out 019ef000-01a10000 rw-p 00000000 00:00 0 [heap] 7f8a44491000-7f8a44492000 ---p 00000000 00:00 0 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack] 7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0 7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0 7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0 7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0 7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] where only the vma that is being used as a stack by *that* task is marked as [stack]. Analogous changes have been made to /proc/PID/smaps, /proc/PID/numa_maps, /proc/PID/task/TID/smaps and /proc/PID/task/TID/numa_maps. Relevant snippets from smaps and numa_maps: [siddhesh@localhost ~ ]$ pgrep a.out 1441 [siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack" 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442] 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack] [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack" 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack] [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack" 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack] [siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack" 7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2 7fff6273a000 default stack anon=3 dirty=3 N0=3 [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack" 7f8a44492000 default stack anon=2 dirty=2 N0=2 [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack" 7fff6273a000 default stack anon=3 dirty=3 N0=3 [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix build] Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jamie Lokier <jamie@shareable.org> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21pagemap: introduce data structure for pagemap entryNaoya Horiguchi
Currently a local variable of pagemap entry in pagemap_pte_range() is named pfn and typed with u64, but it's not correct (pfn should be unsigned long.) This patch introduces special type for pagemap entries and replaces code with it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21pagemap: export KPF_THPNaoya Horiguchi
This flag shows that a given page is a subpage of a transparent hugepage. It helps us debug and test the kernel by showing physical address of thp. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21thp: optimize away unnecessary page table lockingNaoya Horiguchi
Currently when we check if we can handle thp as it is or we need to split it into regular sized pages, we hold page table lock prior to check whether a given pmd is mapping thp or not. Because of this, when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead. To remove it, this patch introduces a optimized check function and replace several similar logics with it. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21pagemap: avoid splitting thp when reading /proc/pid/pagemapNaoya Horiguchi
Thp split is not necessary if we explicitly check whether pmds are mapping thps or not. This patch introduces this check and adds code to generate pagemap entries for pmds mapping thps, which results in less performance impact of pagemap on thp. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21hugetlbfs: fix hugetlb_get_unmapped_area()Xiao Guangrong
Use/update cached_hole_size and free_area_cache properly to speedup finding of a free region. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21seq_file: fix mishandling of consecutive pread() invocations.Earl Chew
The following program illustrates the problem: char buf[8192]; int fd = open("/proc/self/maps", O_RDONLY); n = pread(fd, buf, sizeof(buf), 0); printf("%d\n", n); /* lseek(fd, 0, SEEK_CUR); */ /* Uncomment to work around */ n = pread(fd, buf, sizeof(buf), 0); printf("%d\n", n); The second printf() prints zero, but uncommenting the lseek() corrects its behaviour. To fix, make seq_read() mirror seq_lseek() when processing changes in *ppos. Restore m->version first, then if required traverse and update read_pos on success. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=11856 Signed-off-by: Earl Chew <echew@ixiacom.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21fs/namei.c: fix warnings on 32-bitAndrew Morton
i386 allnoconfig: fs/namei.c: In function 'has_zero': fs/namei.c:1617: warning: integer constant is too large for 'unsigned long' type fs/namei.c:1617: warning: integer constant is too large for 'unsigned long' type fs/namei.c: In function 'hash_name': fs/namei.c:1635: warning: integer constant is too large for 'unsigned long' type There must be a tidier way of doing this. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read modeAndrea Arcangeli
In some cases it may happen that pmd_none_or_clear_bad() is called with the mmap_sem hold in read mode. In those cases the huge page faults can allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a false positive from pmd_bad() that will not like to see a pmd materializing as trans huge. It's not khugepaged causing the problem, khugepaged holds the mmap_sem in write mode (and all those sites must hold the mmap_sem in read mode to prevent pagetables to go away from under them, during code review it seems vm86 mode on 32bit kernels requires that too unless it's restricted to 1 thread per process or UP builds). The race is only with the huge pagefaults that can convert a pmd_none() into a pmd_trans_huge(). Effectively all these pmd_none_or_clear_bad() sites running with mmap_sem in read mode are somewhat speculative with the page faults, and the result is always undefined when they run simultaneously. This is probably why it wasn't common to run into this. For example if the madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page fault, the hugepage will not be zapped, if the page fault runs first it will be zapped. Altering pmd_bad() not to error out if it finds hugepmds won't be enough to fix this, because zap_pmd_range would then proceed to call zap_pte_range (which would be incorrect if the pmd become a pmd_trans_huge()). The simplest way to fix this is to read the pmd in the local stack (regardless of what we read, no need of actual CPU barriers, only compiler barrier needed), and be sure it is not changing under the code that computes its value. Even if the real pmd is changing under the value we hold on the stack, we don't care. If we actually end up in zap_pte_range it means the pmd was not none already and it was not huge, and it can't become huge from under us (khugepaged locking explained above). All we need is to enforce that there is no way anymore that in a code path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad can run into a hugepmd. The overhead of a barrier() is just a compiler tweak and should not be measurable (I only added it for THP builds). I don't exclude different compiler versions may have prevented the race too by caching the value of *pmd on the stack (that hasn't been verified, but it wouldn't be impossible considering pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines and there's no external function called in between pmd_trans_huge and pmd_none_or_clear_bad). if (pmd_trans_huge(*pmd)) { if (next-addr != HPAGE_PMD_SIZE) { VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem)); split_huge_page_pmd(vma->vm_mm, pmd); } else if (zap_huge_pmd(tlb, vma, pmd, addr)) continue; /* fall through */ } if (pmd_none_or_clear_bad(pmd)) Because this race condition could be exercised without special privileges this was reported in CVE-2012-1179. The race was identified and fully explained by Ulrich who debugged it. I'm quoting his accurate explanation below, for reference. ====== start quote ======= mapcount 0 page_mapcount 1 kernel BUG at mm/huge_memory.c:1384! At some point prior to the panic, a "bad pmd ..." message similar to the following is logged on the console: mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7). The "bad pmd ..." message is logged by pmd_clear_bad() before it clears the page's PMD table entry. 143 void pmd_clear_bad(pmd_t *pmd) 144 { -> 145 pmd_ERROR(*pmd); 146 pmd_clear(pmd); 147 } After the PMD table entry has been cleared, there is an inconsistency between the actual number of PMD table entries that are mapping the page and the page's map count (_mapcount field in struct page). When the page is subsequently reclaimed, __split_huge_page() detects this inconsistency. 1381 if (mapcount != page_mapcount(page)) 1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n", 1383 mapcount, page_mapcount(page)); -> 1384 BUG_ON(mapcount != page_mapcount(page)); The root cause of the problem is a race of two threads in a multithreaded process. Thread B incurs a page fault on a virtual address that has never been accessed (PMD entry is zero) while Thread A is executing an madvise() system call on a virtual address within the same 2 MB (huge page) range. virtual address space .---------------------. | | | | .-|---------------------| | | | | | |<-- B(fault) | | | 2 MB | |/////////////////////|-. huge < |/////////////////////| > A(range) page | |/////////////////////|-' | | | | | | '-|---------------------| | | | | '---------------------' - Thread A is executing an madvise(..., MADV_DONTNEED) system call on the virtual address range "A(range)" shown in the picture. sys_madvise // Acquire the semaphore in shared mode. down_read(&current->mm->mmap_sem) ... madvise_vma switch (behavior) case MADV_DONTNEED: madvise_dontneed zap_page_range unmap_vmas unmap_page_range zap_pud_range zap_pmd_range // // Assume that this huge page has never been accessed. // I.e. content of the PMD entry is zero (not mapped). // if (pmd_trans_huge(*pmd)) { // We don't get here due to the above assumption. } // // Assume that Thread B incurred a page fault and .---------> // sneaks in here as shown below. | // | if (pmd_none_or_clear_bad(pmd)) | { | if (unlikely(pmd_bad(*pmd))) | pmd_clear_bad | { | pmd_ERROR | // Log "bad pmd ..." message here. | pmd_clear | // Clear the page's PMD entry. | // Thread B incremented the map count | // in page_add_new_anon_rmap(), but | // now the page is no longer mapped | // by a PMD entry (-> inconsistency). | } | } | v - Thread B is handling a page fault on virtual address "B(fault)" shown in the picture. ... do_page_fault __do_page_fault // Acquire the semaphore in shared mode. down_read_trylock(&mm->mmap_sem) ... handle_mm_fault if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) // We get here due to the above assumption (PMD entry is zero). do_huge_pmd_anonymous_page alloc_hugepage_vma // Allocate a new transparent huge page here. ... __do_huge_pmd_anonymous_page ... spin_lock(&mm->page_table_lock) ... page_add_new_anon_rmap // Here we increment the page's map count (starts at -1). atomic_set(&page->_mapcount, 0) set_pmd_at // Here we set the page's PMD entry which will be cleared // when Thread A calls pmd_clear_bad(). ... spin_unlock(&mm->page_table_lock) The mmap_sem does not prevent the race because both threads are acquiring it in shared mode (down_read). Thread B holds the page_table_lock while the page's map count and PMD table entry are updated. However, Thread A does not synchronize on that lock. ====== end quote ======= [akpm@linux-foundation.org: checkpatch fixes] Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Jones <davej@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.38+] Cc: Mark Salter <msalter@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22btrfs: enhance transaction abort infrastructureJeff Mahoney
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22btrfs: add varargs to btrfs_errorJeff Mahoney
btrfs currently handles most errors with BUG_ON. This patch is a work-in- progress but aims to handle most errors other than internal logic errors and ENOMEM more gracefully. This iteration prevents most crashes but can run into lockups with the page lock on occasion when the timing "works out." Signed-off-by: Jeff Mahoney <jeffm@suse.com>