summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2012-12-11mm: adjust address_space_operations.migratepage() return codeRafael Aquini
Memory fragmentation introduced by ballooning might reduce significantly the number of 2MB contiguous memory blocks that can be used within a guest, thus imposing performance penalties associated with the reduced number of transparent huge pages that could be used by the guest workload. This patch-set follows the main idea discussed at 2012 LSFMMS session: "Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/ to introduce the required changes to the virtio_balloon driver, as well as the changes to the core compaction & migration bits, in order to make those subsystems aware of ballooned pages and allow memory balloon pages become movable within a guest, thus avoiding the aforementioned fragmentation issue Following are numbers that prove this patch benefits on allowing compaction to be more effective at memory ballooned guests. Results for STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite, running on a 4gB RAM KVM guest which was ballooning 512mB RAM in 64mB chunks, at every minute (inflating/deflating), while test was running: ===BEGIN stress-highalloc STRESS-HIGHALLOC highalloc-3.7 highalloc-3.7 rc4-clean rc4-patch Pass 1 55.00 ( 0.00%) 62.00 ( 7.00%) Pass 2 54.00 ( 0.00%) 62.00 ( 8.00%) while Rested 75.00 ( 0.00%) 80.00 ( 5.00%) MMTests Statistics: duration 3.7 3.7 rc4-clean rc4-patch User 1207.59 1207.46 System 1300.55 1299.61 Elapsed 2273.72 2157.06 MMTests Statistics: vmstat 3.7 3.7 rc4-clean rc4-patch Page Ins 3581516 2374368 Page Outs 11148692 10410332 Swap Ins 80 47 Swap Outs 3641 476 Direct pages scanned 37978 33826 Kswapd pages scanned 1828245 1342869 Kswapd pages reclaimed 1710236 1304099 Direct pages reclaimed 32207 31005 Kswapd efficiency 93% 97% Kswapd velocity 804.077 622.546 Direct efficiency 84% 91% Direct velocity 16.703 15.682 Percentage direct scans 2% 2% Page writes by reclaim 79252 9704 Page writes file 75611 9228 Page writes anon 3641 476 Page reclaim immediate 16764 11014 Page rescued immediate 0 0 Slabs scanned 2171904 2152448 Direct inode steals 385 2261 Kswapd inode steals 659137 609670 Kswapd skipped wait 1 69 THP fault alloc 546 631 THP collapse alloc 361 339 THP splits 259 263 THP fault fallback 98 50 THP collapse fail 20 17 Compaction stalls 747 499 Compaction success 244 145 Compaction failures 503 354 Compaction pages moved 370888 474837 Compaction move failure 77378 65259 ===END stress-highalloc This patch: Introduce MIGRATEPAGE_SUCCESS as the default return code for address_space_operations.migratepage() method and documents the expected return code for the same method in failure cases. Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <andi@firstfloor.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm: use vm_unmapped_area() in hugetlbfsMichel Lespinasse
Update the hugetlb_get_unmapped_area function to make use of vm_unmapped_area() instead of implementing a brute force search. Signed-off-by: Michel Lespinasse <walken@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLBAndi Kleen
There was some desire in large applications using MAP_HUGETLB or SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on others. This is useful together with NUMA policy: use 2MB interleaving on some mappings, but 1GB on local mappings. This patch extends the IPC/SHM syscall interfaces slightly to allow specifying the page size. It borrows some upper bits in the existing flag arguments and allows encoding the log of the desired page size in addition to the *_HUGETLB flag. When 0 is specified the default size is used, this makes the change fully compatible. Extending the internal hugetlb code to handle this is straight forward. Instead of a single mount it just keeps an array of them and selects the right mount based on the specified page size. When no page size is specified it uses the mount of the default page size. The change is not visible in /proc/mounts because internal mounts don't appear there. It also has very little overhead: the additional mounts just consume a super block, but not more memory when not used. I also exported the new flags to the user headers (they were previously under __KERNEL__). Right now only symbols for x86 and some other architecture for 1GB and 2MB are defined. The interface should already work for all other architectures though. Only architectures that define multiple hugetlb sizes actually need it (that is currently x86, tile, powerpc). However tile and powerpc have user configurable hugetlb sizes, so it's not easy to add defines. A program on those architectures would need to query sysfs and use the appropiate log2. [akpm@linux-foundation.org: cleanups] [rientjes@google.com: fix build] [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11writeback: remove nr_pages_dirtied arg from balance_dirty_pages_ratelimited_nr()Namjae Jeon
There is no reason to pass the nr_pages_dirtied argument, because nr_pages_dirtied value from the caller is unused in balance_dirty_pages_ratelimited_nr(). Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Signed-off-by: Vivek Trivedi <vtrivedi018@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11Merge tag 'tty-3.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull TTY/Serial merge from Greg Kroah-Hartman: "Here's the big tty/serial tree set of changes for 3.8-rc1. Contained in here is a bunch more reworks of the tty port layer from Jiri and bugfixes from Alan, along with a number of other tty and serial driver updates by the various driver authors. Also, Jiri has been coerced^Wconvinced to be the co-maintainer of the TTY layer, which is much appreciated by me. All of these have been in the linux-next tree for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" Fixed up some trivial conflicts in the staging tree, due to the fwserial driver having come in both ways (but fixed up a bit in the serial tree), and the ioctl handling in the dgrp driver having been done slightly differently (staging tree got that one right, and removed both TIOCGSOFTCAR and TIOCSSOFTCAR). * tag 'tty-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (146 commits) staging: sb105x: fix potential NULL pointer dereference in mp_chars_in_buffer() staging/fwserial: Remove superfluous free staging/fwserial: Use WARN_ONCE when port table is corrupted staging/fwserial: Destruct embedded tty_port on teardown staging/fwserial: Fix build breakage when !CONFIG_BUG staging: fwserial: Add TTY-over-Firewire serial driver drivers/tty/serial/serial_core.c: clean up HIGH_BITS_OFFSET usage staging: dgrp: dgrp_tty.c: Audit the return values of get/put_user() staging: dgrp: dgrp_tty.c: Remove the TIOCSSOFTCAR ioctl handler from dgrp driver serial: ifx6x60: Add modem power off function in the platform reboot process serial: mxs-auart: unmap the scatter list before we copy the data serial: mxs-auart: disable the Receive Timeout Interrupt when DMA is enabled serial: max310x: Setup missing "can_sleep" field for GPIO tty/serial: fix ifx6x60.c declaration warning serial: samsung: add devicetree properties for non-Exynos SoCs serial: samsung: fix potential soft lockup during uart write tty: vt: Remove redundant null check before kfree. tty/8250 Add check for pci_ioremap_bar failure tty/8250 Add support for Commtech's Fastcom Async-335 and Fastcom Async-PCIe cards tty/8250 Add XR17D15x devices to the exar_handle_irq override ...
2012-12-11Merge tag 'driver-core-3.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg Kroah-Hartman: "Here's the large driver core updates for 3.8-rc1. The biggest thing here is the various __dev* marking removals. This is going to be a pain for the merge with different subsystem trees, I know, but all of the patches included here have been ACKed by their various subsystem maintainers, as they wanted them to go through here. If this is too much of a pain, I can pull all of them out of this tree and just send you one with the other fixes/updates and then, after 3.8-rc1 is out, do the rest of the removals to ensure we catch them all, it's up to you. The merges should all be trivial, and Stephen has been doing them all in linux-next for a few weeks now quite easily. Other than the __dev* marking removals, there's nothing major here, some firmware loading updates and other minor things in the driver core. All of these have (much to Stephen's annoyance), been in linux-next for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" Fixed up trivial conflicts in drivers/gpio/gpio-{em,stmpe}.c due to gpio update. * tag 'driver-core-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (93 commits) modpost.c: Stop checking __dev* section mismatches init.h: Remove __dev* sections from the kernel acpi: remove use of __devinit PCI: Remove __dev* markings PCI: Always build setup-bus when PCI is enabled PCI: Move pci_uevent into pci-driver.c PCI: Remove CONFIG_HOTPLUG ifdefs unicore32/PCI: Remove CONFIG_HOTPLUG ifdefs sh/PCI: Remove CONFIG_HOTPLUG ifdefs powerpc/PCI: Remove CONFIG_HOTPLUG ifdefs mips/PCI: Remove CONFIG_HOTPLUG ifdefs microblaze/PCI: Remove CONFIG_HOTPLUG ifdefs dma: remove use of __devinit dma: remove use of __devexit_p firewire: remove use of __devinitdata firewire: remove use of __devinit leds: remove use of __devexit leds: remove use of __devinit leds: remove use of __devexit_p mmc: remove use of __devexit ...
2012-12-11Btrfs: make ordered extent be flushed by multi-taskMiao Xie
Though the process of the ordered extents is a bit different with the delalloc inode flush, but we can see it as a subset of the delalloc inode flush, so we also handle them by flush workers. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: make ordered operations be handled by multi-taskMiao Xie
The process of the ordered operations is similar to the delalloc inode flush, so we handle them by flush workers. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: make delalloc inodes be flushed by multi-taskMiao Xie
This patch introduce a new worker pool named "flush_workers", and if we want to force all the inode with pending delalloc to the disks, we can queue those inodes into the work queue of the worker pool, in this way, those inodes will be flushed by multi-task. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: fill the global reserve when unpinning spaceJosef Bacik
Dave gave me an image of a very full file system that would abort the transaction because it ran out of space while committing the transaction. This is because we would think there was plenty of room to create a snapshot even though the global reserve was not full. This happens because we calculate the global reserve size before we unpin any space, so after we unpin the space we allow reservations to occur even though we haven't reserved all of the space for our global reserve. Fix this by adding to the global reserve while unpinning in order to make sure we always have enough space to do our work. With this patch we no longer end up with an aborted transaction, we return ENOSPC properly to the person trying to create the snapshot. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: cleanup unused argumentsLiu Bo
'disk_key' is not used at all. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: kill unnecessary arguments in del_ptrLiu Bo
The argument 'tree_mod_log' is not necessary since all of callers enable it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: reorder tree mod log operations in deleting a pointerLiu Bo
Since we don't use MOD_LOG_KEY_REMOVE_WHILE_MOVING to add nritems during rewinding, we should insert a MOD_LOG_KEY_REMOVE operation first. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritemsLiu Bo
Key MOD_LOG_KEY_REMOVE_WHILE_MOVING means that we're doing memmove inside an extent buffer node, and the node's number of items remains unchanged (unless we are inserting a single pointer, but we have MOD_LOG_KEY_ADD for that). So we don't need to increase node's number of items during rewinding, otherwise we may get an node larger than leafsize and cause general protection errors later. Here is the details, - If we do memory move for inserting a single pointer, we need to add node's nritems by one, and we honor MOD_LOG_KEY_ADD for adding. - If we do memory move for deleting a single pointer, we need to decrease node's nritems by one, and we honor MOD_LOG_KEY_REMOVE for deleting. - If we do memory move for balance left/right, we need to decrease node's nritems, and we honor MOD_LOG_KEY_REMOVE for balaning. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: fix unnecessary while loop when search the free space, cacheMiao Xie
When we find a bitmap free space entry, we may check the previous extent entry covers the offset or not. But if we find this entry is also a bitmap entry, we will continue to check the previous entry of the current one by a while loop. It is unnecessary because it is impossible that the extent entry which is in front of a bitmap entry can cover the offset of the entry after that bitmap entry. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: recheck bio against block device when we map the bioJosef Bacik
Alex reported a problem where we were writing between chunks on a rbd device. The thing is we do bio_add_page using logical offsets, but the physical offset may be different. So when we map the bio now check to see if the bio is still ok with the physical offset, and if it is not split the bio up and redo the bio_add_page with the physical sector. This fixes the problem for Alex and doesn't affect performance in the normal case. Thanks, Reported-and-tested-by: Alex Elder <elder@inktank.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: improve the noflush reservationMiao Xie
In some places(such as: evicting inode), we just can not flush the reserved space of delalloc, flushing the delayed directory index and delayed inode is OK, but we don't try to flush those things and just go back when there is no enough space to be reserved. This patch fixes this problem. We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL. If we can in the transaction, we should not flush anything, or the deadlock would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used, and we will flush all things. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: fix wrong comment in can_overcommit()Miao Xie
The comment is not coincident with the code. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: cleanup duplicated division functionsMiao Xie
div_factor{_fine} has been implemented for two times, cleanup it. And I move them into a independent file named math.h because they are common math functions. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11CIFS: Fix write after setting a read lock for read oplock filesPavel Shilovsky
If we have a read oplock and set a read lock in it, we can't write to the locked area - so, filemap_fdatawrite may fail with a no information for a userspace application even if we request a write to non-locked area. Fix this by populating the page cache without marking affected pages dirty after a successful write directly to the server. Also remove CONFIG_CIFS_SMB2 ifdefs because it's suitable for both CIFS and SMB2 protocols. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: parse the device name into UNC and prepathJeff Layton
This should fix a regression that was introduced when the new mount option parser went in. Also, when the unc= and prefixpath= options are provided, check their values against the ones we parsed from the device string. If they differ, then throw a warning that tells the user that we're using the values from the unc= option for now, but that that will change in 3.10. Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: fix up handling of prefixpath= optionJeff Layton
Currently the code takes care to ensure that the prefixpath has a leading '/' delimiter. What if someone passes us a prefixpath with a leading '\\' instead? The code doesn't properly handle that currently AFAICS. Let's just change the code to skip over any leading delimiter character when copying the prepath. Then, fix up the users of the prepath option to prefix it with the correct delimiter when they use it. Also, there's no need to limit the length of the prefixpath to 1k. If the server can handle it, why bother forbidding it? Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: clean up handling of unc= optionJeff Layton
Make sure we free any existing memory allocated for vol->UNC, just in case someone passes in multiple unc= options. Get rid of the check for too long a UNC. The check for >300 bytes seems arbitrary. We later copy this into the tcon->treeName, for instance and it's a lot shorter than 300 bytes. Eliminate an extra kmalloc and copy as well. Just set the vol->UNC directly with the contents of match_strdup. Establish that the UNC should be stored with '\\' delimiters. Use convert_delimiter to change it in place in the vol->UNC. Finally, move the check for a malformed UNC into cifs_parse_mount_options so we can catch that situation earlier. Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: fix SID binary to string conversionJeff Layton
The authority fields are supposed to be represented by a single 48-bit value. It's also supposed to represent the value as hex if it's equal to or greater than 2^32. This is documented in MS-DTYP, section 2.4.2.1. Also, fix up the max string length to account for this fix. Acked-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11NFSv4.1: Be conservative about the client highest slotidTrond Myklebust
If the server sends us a target that looks like an outlier, but is lower than the existing target, then respect it anyway. However defer actually updating the generation counter until we get a target that doesn't look like an outlier. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-11exofs: clean up the correct page collection on write errorIdan Kedar
if ore_write() fails, we would unlock the pages of pcol, which is now empty, rather than pcol_copy which owns the pages when ore_write() is called. this means that no pages will actually be unlocked (pcol.nr_pages == 0) and the writing process (more accurately, the syncing process) will hang waiting for a writeback notification that never comes. moreover, if ore_write() fails, pcol_free() is called for pcol, whereas pcol_copy is the object owning the ore_io_state, thus leaking the ore_io_state. [Boaz] I have simplified Idan's original patch a bit, everything else still holds Signed-off-by: Idan Kedar <idank@tonian.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-12-11NFSv4.1: Handle NFS4ERR_BADSLOT errors correctlyTrond Myklebust
Most (all) NFS4ERR_BADSLOT errors are due to the client failing to respect the server's sr_highest_slotid limit. This mainly happens due to reordered RPC requests. The way to handle it is simply to drop the slot that we're using, and retry using the new highest_slotid limits. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-11Merge branch 'bugfixes' into nfs-for-nextTrond Myklebust
2012-12-11nfs: don't extend writes to cover entire page if pagecache is invalidJeff Layton
Jian reported that the following sequence would leave "testfile" with corrupt data: # mount localhost:/export /mnt/nfs/ -o vers=3 # echo abc > /mnt/nfs/testfile; echo def >> /export/testfile; echo ghi >> /mnt/nfs/testfile # cat -v /export/testfile abc ^@^@^@^@ghi While there's no locking involved here, the operations are serialized, so CTO should prevent corruption. The first write to the file is fine and writes 4 bytes. The file is then extended on the server. When it's reopened a GETATTR is issued and the size change is noticed. This causes NFS_INO_INVALID_DATA to be set on the file. Because the file is opened for write only, nfs_want_read_modify_write() returns 0 to nfs_write_begin(). nfs_updatepage then calls nfs_write_pageuptodate() to see if it should extend the nfs_page to cover the whole page. NFS_INO_INVALID_DATA is still set on the file at that point, but that flag is ignored and nfs_pageuptodate erroneously extends the write to cover the whole page, with the write done on the server side filled in with zeroes. This patch just has that function check for NFS_INO_INVALID_DATA in addition to NFS_INO_REVAL_PAGECACHE. This fixes the bug, but looking over the code, I wonder if we might have a similar bug in nfs_revalidate_size(). The difference between those two flags is very subtle, so it seems like we ought to be checking for NFS_INO_INVALID_DATA in most of the places that we look for NFS_INO_REVAL_PAGECACHE. I believe this is regression introduced by commit 8d197a568. The code did check for NFS_INO_INVALID_DATA prior to that patch. Original bug report is here: https://bugzilla.redhat.com/show_bug.cgi?id=885743 Cc: <stable@vger.kernel.org> # 3.5+ Reported-by: Jian Li <jiali@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-11NFSv4: Check for buffer length in __nfs4_get_acl_uncachedSven Wegener
Commit 1f1ea6c "NFSv4: Fix buffer overflow checking in __nfs4_get_acl_uncached" accidently dropped the checking for too small result buffer length. If someone uses getxattr on "system.nfs4_acl" on an NFSv4 mount supporting ACLs, the ACL has not been cached and the buffer suplied is too short, we still copy the complete ACL, resulting in kernel and user space memory corruption. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Cc: stable@kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-11Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled"Ingo Molnar
This reverts commit 5258f386ea4e8454bc801fb443e8a4217da1947c, because the underlying autogroups bug got fixed upstream in a better way, via: fd8ef11730f1 Revert "sched, autogroup: Stop going ahead if autogroup is disabled" Cc: Mike Galbraith <efault@gmx.de> Cc: Yong Zhang <yong.zhang0@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-12-11ext4: zero out inline data using memset() instead of empty_zero_pageTheodore Ts'o
Not all architectures (in particular, sparc64) have empty_zero_page. So instead of copying from empty_zero_page, use memset to clear the inline data by signalling to ext4_xattr_set_entry() via a magic pointer value, EXT4_ZERO_ATTR_VALUE, which is defined by casting -1 to a pointer. This fixes a build failure on sparc64, and the memset() should be more efficient than using memcpy() anyway. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: ensure Inode flags consistency are checked at build timeCarlos Maiolino
Flags being used by atomic operations in inode flags (e.g. ext4_test_inode_flag(), should be consistent with that actually stored in inodes, i.e.: EXT4_XXX_FL. It ensures that this consistency is checked at build-time, not at run-time. Currently, the flags consistency are being checked at run-time, but, there is no real reason to not do a build-time check instead of a run-time check. The code is comparing macro defined values with enum type variables, where both are constants, so, there is no problem in comparing constants at build-time. enum variables are treated as constants by the C compiler, according to the C99 specs (see www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf sec. 6.2.5, item 16), so, there is no real problem in comparing an enumeration type at build time Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: Remove CONFIG_EXT4_FS_XATTRTao Ma
Ted has sent out a RFC about removing this feature. Eric and Jan confirmed that both RedHat and SUSE enable this feature in all their product. David also said that "As far as I know, it's enabled in all Android kernels that use ext4." So it seems OK for us. And what's more, as inline data depends its implementation on xattr, and to be frank, I don't run any test again inline data enabled while xattr disabled. So I think we should add inline data and remove this config option in the same release. [ The savings if you disable CONFIG_EXT4_FS_XATTR is only 27k, which isn't much in the grand scheme of things. Since no one seems to be testing this configuration except for some automated compile farms, on balance we are better removing this config option, and so that it is effectively always enabled. -- tytso ] Cc: David Brown <davidb@codeaurora.org> Cc: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: remove unused variable from ext4_ext_in_cache()Zhi Yong Wu
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Zheng Liu <gnehzuil.liu@gmail.com>
2012-12-10ext4: remove redundant initialization in ext4_fill_super()Guo Chao
We use kzalloc() to allocate sbi, no need to zero its field. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: remove redundant code in ext4_alloc_inode()Guo Chao
inode_init_always() will initialize inode->i_data.writeback_index anyway, no need to do this in ext4_alloc_inode(). Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2012-12-10ext4: use sync_inode_metadata() when syncing inode metadataGuo Chao
We have a dedicated interface to sync inode metadata. Use it to simplify ext4's code some. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2012-12-10ext4: enable ext4 inline supportTao Ma
Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: let fallocate handle inline data correctlyTao Ma
If we are punching hole in a file, we will return ENOTSUPP. As for the fallocation of some extents, we will convert the inline data to a normal extent based file first. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: let ext4_truncate handle inline data correctlyTao Ma
Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: evict inline data out if we need to strore xattr in inodeTao Ma
Now we that store data in the inode, in case we need to store some xattrs and inode doesn't have enough space, Andreas suggested that we should keep the xattr(metadata) in and data should be pushed out. So this patch does the work. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: let fiemap work with inline dataTao Ma
fiemap is used to find the disk layout of a file, as for inline data, let us just pretend like a file with just one extent. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: let ext4_rename handle inline dirTao Ma
In case we rename a directory, ext4_rename has to read the dir block and change its dotdot's information. The old ext4_rename encapsulated the dir_block read into itself. So this patch adds a new function ext4_get_first_dir_block() which gets the dir buffer information so the ext4_rename can handle it properly. As it will also change the parent inode number, we return the parent_de so that ext4_rename() can handle it more easily. ext4_find_entry is also changed so that the caller(rename) can tell whether the found entry is an inlined one or not and journaling the corresponding buffer head. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: let empty_dir handle inline dirTao Ma
empty_dir is used when deleting a dir. So it should handle inline dir properly. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: let ext4_delete_entry() handle inline dataTao Ma
Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: make ext4_delete_entry genericTao Ma
Currently ext4_delete_entry() is used only for dir entry removing from a dir block. So let us create a new function ext4_generic_delete_entry and this function takes a entry_buf and a buf_size so that it can be used for inline data. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: let ext4_find_entry handle inline dataTao Ma
Create a new function ext4_find_inline_entry() to handle the case of inline data. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: create a new function search_dirTao Ma
search_dirblock is used to search a dir block, but the code is almost the same for searching an inline dir. So create a new fuction search_dir and let search_dirblock call it. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: let ext4_readdir handle inline dataTao Ma
For "." and "..", we just call filldir by ourselves instead of iterating the real dir entry. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>