summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-08-15ext4: ratelimit the file system mounted messageTheodore Ts'o
The xfstests ext4/305 will mount and unmount the same file system over 4,000 times, and each one of these will cause a system log message. Ratelimit this message since if we are getting more than a few dozen of these messages, they probably aren't going to be helpful. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15ext4: silence a format string false positiveDan Carpenter
Static checkers complain that the format string should be "%s". It does not make a difference for the current code. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15ext4: simplify some code in read_mmp_block()Dan Carpenter
My static check complains because we have: if (!*bh) return -ENOMEM; if (*bh) { The second check is unnecessary. I've simplified this code by moving the "if (!*bh)" checks around. Also Andreas Dilger says we should probably print a warning if sb_getblk() fails. [ Restructured the code so that we print a warning message as well if the mmp block doesn't check out, and to print the error code to disambiguate between the error cases. - TYT ] Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15ext4: don't manipulate recovery flag when freezing no-journal fsEric Sandeen
At some point along this sequence of changes: f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops bb04457 ext4: support freezing ext2 (nojournal) file systems 9ca9238 ext4: Use separate super_operations structure for no_journal filesystems ext4 started setting needs_recovery on filesystems without journals when they are unfrozen. This makes no sense, and in fact confuses blkid to the point where it doesn't recognize the filesystem at all. (freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs, see needs_recovery set on fs w/ no journal). To fix this, don't manipulate the INCOMPAT_RECOVER feature on filesystems without journals. Reported-by: Stu Mark <smark@datto.com> Reviewed-by: Jan Kara <jack@suse.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-08-04jbd2: limit number of reserved creditsLukas Czerner
Currently there is no limitation on number of reserved credits we can ask for. If we ask for more reserved credits than 1/2 of maximum transaction size, or if total number of credits exceeds the maximum transaction size per operation (which is currently only possible with the former) we will spin forever in start_this_handle(). Fix this by adding this limitation at the start of start_this_handle(). This patch also removes the credit limitation 1/2 of maximum transaction size, since we really only want to limit the number of reserved credits. There is not much point to limit the credits if there is still space in the journal. This accidentally also fixes the online resize, where due to the limitation of the journal credits we're unable to grow file systems with 1k block size and size between 16M and 32M. It has been partially fixed by 2c869b262a10ca99cb866d04087d75311587a30c, but not entirely. Thanks Jan Kara for helping me getting the correct fix. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-28ext4 crypto: remove duplicate header filezilong.liu
Remove key.h which is included twice in crypto_fname.c Signed-off-by: zilong.liu <liuziloong@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-28ext4: update c/mtime on truncate upEryu Guan
Commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <= isize") introduced a bug that c/mtime is not updated on truncate up. Fix the issue by setting c/mtime explicitly in the truncate up case. Note that ftruncate(2) is not affected, so you won't see this bug using truncate(1) and xfs_io(1). Signed-off-by: Zirong Lang <zorro.lang@gmail.com> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-28jbd2: avoid infinite loop when destroying aborted journalJan Kara
Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal superblock fails" changed jbd2_cleanup_journal_tail() to return EIO when the journal is aborted. That makes logic in jbd2_log_do_checkpoint() bail out which is fine, except that jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make a progress in cleaning the journal. Without it jbd2_journal_destroy() just loops in an infinite loop. Fix jbd2_journal_destroy() to cleanup journal checkpoint lists of jbd2_log_do_checkpoint() fails with error. Reported-by: Eryu Guan <guaneryu@gmail.com> Tested-by: Eryu Guan <guaneryu@gmail.com> Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-23ext4, jbd2: add REQ_FUA flag when recording an error in the superblockDaeho Jeong
When an error condition is detected, an error status should be recorded into superblocks of EXT4 or JBD2. However, the write request is submitted now without REQ_FUA flag, even in "barrier=1" mode, which is followed by panic() function in "errors=panic" mode. On mobile devices which make whole system reset as soon as kernel panic occurs, this write request containing an error flag will disappear just from storage cache without written to the physical cells. Therefore, when next start, even forever, the error flag cannot be shown in both superblocks, and e2fsck cannot fix the filesystem problems automatically, unless e2fsck is executed in force checking mode. [ Changed use test_opt(sb, BARRIER) of checking the journal flags -- TYT ] Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-22ext4 crypto: fix spelling typo in commentLaurent Navet
Signed-off-by: Laurent Navet <laurent.navet@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-22ext4 crypto: exit cleanly if ext4_derive_key_aes() failsLaurent Navet
Return value of ext4_derive_key_aes() is stored but not used. Add test to exit cleanly if ext4_derive_key_aes() fail. Also fix coverity CID 1309760. Signed-off-by: Laurent Navet <laurent.navet@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-21ext4: reject journal options for ext2 mountsCarlos Maiolino
There is no reason to allow ext2 filesystems be mounted with journal mount options. So, this patch adds them to the MOPT_NO_EXT2 mount options list. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-21ext4: implement cgroup writeback supportTejun Heo
For ordered and writeback data modes, all data IOs go through ext4_io_submit. This patch adds cgroup writeback support by invoking wbc_init_bio() from io_submit_init_bio() and wbc_account_io() in io_submit_add_bh(). Journal data which is written by jbd2 worker is left alone by this patch and will always be written out from the root cgroup. ext4_fill_super() is updated to set MS_CGROUPWB when data mode is either ordered or writeback. In journaled data mode, most IOs become synchronous through the journal and enabling cgroup writeback support doesn't make much sense or difference. Journaled data mode is left alone. Lightly tested with sequential data write workload. Behaves as expected. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-21ext4: replace ext4_io_submit->io_op with ->io_wbcTejun Heo
ext4_io_submit_init() takes the pointer to writeback_control to test its sync_mode and determine between WRITE and WRITE_SYNC and records the result in ->io_op. This patch makes it record the pointer directly and moves the test to ext4_io_submit(). This doesn't cause any noticeable differences now but having writeback_control available throughout IO submission path will be depended upon by the planned cgroup writeback support. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-17ext4 crypto: check for too-short encrypted file namesTheodore Ts'o
An encrypted file name should never be shorter than an 16 bytes, the AES block size. The 3.10 crypto layer will oops and crash the kernel if ciphertext shorter than the block size is passed to it. Fortunately, in modern kernels the crypto layer will not crash the kernel in this scenario, but nevertheless, it represents a corrupted directory, and we should detect it and mark the file system as corrupted so that e2fsck can fix this. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-17ext4 crypto: use a jbd2 transaction when adding a crypto policyTheodore Ts'o
Start a jbd2 transaction, and mark the inode dirty on the inode under that transaction after setting the encrypt flag. Otherwise if the directory isn't modified after setting the crypto policy, the encrypted flag might not survive the inode getting pushed out from memory, or the the file system getting unmounted and remounted. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-12jbd2: speedup jbd2_journal_dirty_metadata()Jan Kara
It is often the case that we mark buffer as having dirty metadata when the buffer is already in that state (frequent for bitmaps, inode table blocks, superblock). Thus it is unnecessary to contend on grabbing journal head reference and bh_state lock. Avoid that by checking whether any modification to the buffer is needed before grabbing any locks or references. [ Note: this is a fixed version of commit 2143c1965a761, which was reverted in ebeaa8ddb3663b5 due to a false positive triggering of an assertion check. -- Ted ] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS fixes from Al Viro: "Fixes for this cycle regression in overlayfs and a couple of long-standing (== all the way back to 2.6.12, at least) bugs" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: freeing unlinked file indefinitely delayed fix a braino in ovl_d_select_inode() 9p: don't leave a half-initialized inode sitting around
2015-07-12freeing unlinked file indefinitely delayedAl Viro
Normally opening a file, unlinking it and then closing will have the inode freed upon close() (provided that it's not otherwise busy and has no remaining links, of course). However, there's one case where that does *not* happen. Namely, if you open it by fhandle with cold dcache, then unlink() and close(). In normal case you get d_delete() in unlink(2) notice that dentry is busy and unhash it; on the final dput() it will be forcibly evicted from dcache, triggering iput() and inode removal. In this case, though, we end up with *two* dentries - disconnected (created by open-by-fhandle) and regular one (used by unlink()). The latter will have its reference to inode dropped just fine, but the former will not - it's considered hashed (it is on the ->s_anon list), so it will stay around until the memory pressure will finally do it in. As the result, we have the final iput() delayed indefinitely. It's trivial to reproduce - void flush_dcache(void) { system("mount -o remount,rw /"); } static char buf[20 * 1024 * 1024]; main() { int fd; union { struct file_handle f; char buf[MAX_HANDLE_SZ]; } x; int m; x.f.handle_bytes = sizeof(x); chdir("/root"); mkdir("foo", 0700); fd = open("foo/bar", O_CREAT | O_RDWR, 0600); close(fd); name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0); flush_dcache(); fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR); unlink("foo/bar"); write(fd, buf, sizeof(buf)); system("df ."); /* 20Mb eaten */ close(fd); system("df ."); /* should've freed those 20Mb */ flush_dcache(); system("df ."); /* should be the same as #2 */ } will spit out something like Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 303843 1131 100% / Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 303843 1131 100% / Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 283282 21692 93% / - inode gets freed only when dentry is finally evicted (here we trigger than by remount; normally it would've happened in response to memory pressure hell knows when). Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/ Acked-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-12fix a braino in ovl_d_select_inode()Al Viro
when opening a directory we want the overlayfs inode, not one from the topmost layer. Reported-By: Andrey Jr. Melnikov <temnota.am@gmail.com> Tested-By: Andrey Jr. Melnikov <temnota.am@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-129p: don't leave a half-initialized inode sitting aroundAl Viro
Cc: stable@vger.kernel.org # all branches Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-11Merge branch 'for-linus-4.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is an assortment of fixes. Most of the commits are from Filipe (fsync, the inode allocation cache and a few others). Mark kicked in a series fixing corners in the extent sharing ioctls, and everyone else fixed up on assorted other problems" * 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix wrong check for btrfs_force_chunk_alloc() Btrfs: fix warning of bytes_may_use Btrfs: fix hang when failing to submit bio of directIO Btrfs: fix a comment in inode.c:evict_inode_truncate_pages() Btrfs: fix memory corruption on failure to submit bio for direct IO btrfs: don't update mtime/ctime on deduped inodes btrfs: allow dedupe of same inode btrfs: fix deadlock with extent-same and readpage btrfs: pass unaligned length to btrfs_cmp_data() Btrfs: fix fsync after truncate when no_holes feature is enabled Btrfs: fix fsync xattr loss in the fast fsync path Btrfs: fix fsync data loss after append write Btrfs: fix crash on close_ctree() if cleaner starts new transaction Btrfs: fix race between caching kthread and returning inode to inode cache Btrfs: use kmem_cache_free when freeing entry in inode cache Btrfs: fix race between balance and unused block group deletion btrfs: add error handling for scrub_workers_get() btrfs: cleanup noused initialization of dev in btrfs_end_bio() btrfs: qgroup: allow user to clear the limitation on qgroup
2015-07-09hpfs: hpfs_error: Remove static buffer, use vsprintf extension %pV insteadJoe Perches
Removing unnecessary static buffers is good. Use the vsprintf %pV extension instead. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Cc: stable@vger.kernel.org # v2.6.36+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-09hpfs: kstrdup() out of memory handlingSanidhya Kashyap
There is a possibility of nothing being allocated to the new_opts in case of memory pressure, therefore return ENOMEM for such case. Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-09hpfs: Remove unessary castFiro Yang
Avoid a pointless kmem_cache_alloc() return value cast in fs/hpfs/super.c::hpfs_alloc_inode() Signed-off-by: Firo Yang <firogm@gmail.com> Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-09hpfs: add fstrim supportMikulas Patocka
This patch adds support for fstrim to the HPFS filesystem. Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-09ioctl_compat: handle FITRIMMikulas Patocka
The FITRIM ioctl has the same arguments on 32-bit and 64-bit architectures, so we can add it to the list of compatible ioctls and drop it from compat_ioctl method of various filesystems. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ted Ts'o <tytso@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-05Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "Bug fixes (all for stable kernels) for ext4: - address corner cases for indirect blocks->extent migration - fix reserved block accounting invalidate_page when page_size != block_size (i.e., ppc or 1k block size file systems) - fix deadlocks when a memcg is under heavy memory pressure - fix fencepost error in lazytime optimization" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: replace open coded nofail allocation in ext4_free_blocks() ext4: correctly migrate a file with a hole at the beginning ext4: be more strict when migrating to non-extent based file ext4: fix reservation release on invalidatepage for delalloc fs ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp bufferhead: Add _gfp version for sb_getblk() ext4: fix fencepost error in lazytime optimization
2015-07-05ext4: replace open coded nofail allocation in ext4_free_blocks()Michal Hocko
ext4_free_blocks is looping around the allocation request and mimics __GFP_NOFAIL behavior without any allocation fallback strategy. Let's remove the open coded loop and replace it with __GFP_NOFAIL. Without the flag the allocator has no way to find out never-fail requirement and cannot help in any way. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-07-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted VFS fixes and related cleanups (IMO the most interesting in that part are f_path-related things and Eric's descriptor-related stuff). UFS regression fixes (it got broken last cycle). 9P fixes. fs-cache series, DAX patches, Jan's file_remove_suid() work" [ I'd say this is much more than "fixes and related cleanups". The file_table locking rule change by Eric Dumazet is a rather big and fundamental update even if the patch isn't huge. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits) 9p: cope with bogus responses from server in p9_client_{read,write} p9_client_write(): avoid double p9_free_req() 9p: forgetting to cancel request on interrupted zero-copy RPC dax: bdev_direct_access() may sleep block: Add support for DAX reads/writes to block devices dax: Use copy_from_iter_nocache dax: Add block size note to documentation fs/file.c: __fget() and dup2() atomicity rules fs/file.c: don't acquire files->file_lock in fd_install() fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation vfs: avoid creation of inode number 0 in get_next_ino namei: make set_root_rcu() return void make simple_positive() public ufs: use dir_pages instead of ufs_dir_pages() pagemap.h: move dir_pages() over there remove the pointless include of lglock.h fs: cleanup slight list_entry abuse xfs: Correctly lock inode when removing suid and file capabilities fs: Call security_ops->inode_killpriv on truncate fs: Provide function telling whether file_remove_privs() will do anything ...
2015-07-04dax: bdev_direct_access() may sleepMatthew Wilcox
The brd driver is the only in-tree driver that may sleep currently. After some discussion on linux-fsdevel, we decided that any driver may choose to sleep in its ->direct_access method. To ensure that all callers of bdev_direct_access() are prepared for this, add a call to might_sleep(). Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-04block: Add support for DAX reads/writes to block devicesMatthew Wilcox
If a block device supports the ->direct_access methods, bypass the normal DIO path and use DAX to go straight to memcpy() instead of allocating a DIO and a BIO. Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in do_blockdev_direct_IO(). Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-04dax: Use copy_from_iter_nocacheMatthew Wilcox
When userspace does a write, there's no need for the written data to pollute the CPU cache. This matches the original XIP code. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-04Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Debug info and other statistics fixes and related enhancements" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/numa: Fix numa balancing stats in /proc/pid/sched sched/numa: Show numa_group ID in /proc/sched_debug task listings sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y sched/stat: Simplify the sched_info accounting dependency
2015-07-04sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=yNaveen N. Rao
Expand /proc/pid/schedstat output: - enable it on CONFIG_TASK_DELAY_ACCT=y && !CONFIG_SCHEDSTATS kernels. - dump all zeroes on kernels that are booted with the 'nodelayacct' option, which boot option disables delay accounting on CONFIG_TASK_DELAY_ACCT=y kernels. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: a.p.zijlstra@chello.nl Cc: ricklind@us.ibm.com Link: http://lkml.kernel.org/r/5ccbef17d4bc841084ea6e6421d4e4a23b7b806f.1435654789.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-04ext4: correctly migrate a file with a hole at the beginningEryu Guan
Currently ext4_ind_migrate() doesn't correctly handle a file which contains a hole at the beginning of the file. This caused the migration to be done incorrectly, and then if there is a subsequent following delayed allocation write to the "hole", this would reclaim the same data blocks again and results in fs corruption. # assmuing 4k block size ext4, with delalloc enabled # skip the first block and write to the second block xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile # converting to indirect-mapped file, which would move the data blocks # to the beginning of the file, but extent status cache still marks # that region as a hole chattr -e /mnt/ext4/testfile # delayed allocation writes to the "hole", reclaim the same data block # again, results in i_blocks corruption xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile umount /mnt/ext4 e2fsck -nf /dev/sda6 ... Inode 53, i_blocks is 16, should be 8. Fix? no ... Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-07-03ext4: be more strict when migrating to non-extent based fileEryu Guan
Currently the check in ext4_ind_migrate() is not enough before doing the real conversion: a) delayed allocated extents could bypass the check on eh->eh_entries and eh->eh_depth This can be demonstrated by this script xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile chattr -e /mnt/ext4/testfile where testfile has two extents but still be converted to non-extent based file format. b) only extent length is checked but not the offset, which would result in data lose (delalloc) or fs corruption (nodelalloc), because non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30) blocks This can be demostrated by xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile chattr -e /mnt/ext4/testfile sync If delalloc is enabled, dmesg prints EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53 EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5 EXT4-fs (dm-4): This should not happen!! Data will be lost If delalloc is disabled, e2fsck -nf shows corruption Inode 53, i_size is 5497558142976, should be 4096. Fix? no Fix the two issues by a) forcing all delayed allocation blocks to be allocated before checking eh->eh_depth and eh->eh_entries b) limiting the last logical block of the extent is within direct map Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-07-03ext4: fix reservation release on invalidatepage for delalloc fsLukas Czerner
On delalloc enabled file system on invalidatepage operation in ext4_da_page_release_reservation() we want to clear the delayed buffer and remove the extent covering the delayed buffer from the extent status tree. However currently there is a bug where on the systems with page size > block size we will always remove extents from the start of the page regardless where the actual delayed buffers are positioned in the page. This leads to the errors like this: EXT4-fs warning (device loop0): ext4_da_release_space:1225: ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data blocks This however can cause data loss on writeback time if the file system is in ENOSPC condition because we're releasing reservation for someones else delayed buffer. Fix this by only removing extents that corresponds to the part of the page we want to invalidate. This problem is reproducible by the following fio receipt (however I was only able to reproduce it with fio-2.1 or older. [global] bs=8k iodepth=1024 iodepth_batch=60 randrepeat=1 size=1m directory=/mnt/test numjobs=20 [job1] ioengine=sync bs=1k direct=1 rw=randread filename=file1:file2 [job2] ioengine=libaio rw=randwrite direct=1 filename=file1:file2 [job3] bs=1k ioengine=posixaio rw=randwrite direct=1 filename=file1:file2 [job5] bs=1k ioengine=sync rw=randread filename=file1:file2 [job7] ioengine=libaio rw=randwrite filename=file1:file2 [job8] ioengine=posixaio rw=randwrite filename=file1:file2 [job10] ioengine=mmap rw=randwrite bs=1k filename=file1:file2 [job11] ioengine=mmap rw=randwrite direct=1 filename=file1:file2 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
2015-07-03Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace updates from Eric Biederman: "Long ago and far away when user namespaces where young it was realized that allowing fresh mounts of proc and sysfs with only user namespace permissions could violate the basic rule that only root gets to decide if proc or sysfs should be mounted at all. Some hacks were put in place to reduce the worst of the damage could be done, and the common sense rule was adopted that fresh mounts of proc and sysfs should allow no more than bind mounts of proc and sysfs. Unfortunately that rule has not been fully enforced. There are two kinds of gaps in that enforcement. Only filesystems mounted on empty directories of proc and sysfs should be ignored but the test for empty directories was insufficient. So in my tree directories on proc, sysctl and sysfs that will always be empty are created specially. Every other technique is imperfect as an ordinary directory can have entries added even after a readdir returns and shows that the directory is empty. Special creation of directories for mount points makes the code in the kernel a smidge clearer about it's purpose. I asked container developers from the various container projects to help test this and no holes were found in the set of mount points on proc and sysfs that are created specially. This set of changes also starts enforcing the mount flags of fresh mounts of proc and sysfs are consistent with the existing mount of proc and sysfs. I expected this to be the boring part of the work but unfortunately unprivileged userspace winds up mounting fresh copies of proc and sysfs with noexec and nosuid clear when root set those flags on the previous mount of proc and sysfs. So for now only the atime, read-only and nodev attributes which userspace happens to keep consistent are enforced. Dealing with the noexec and nosuid attributes remains for another time. This set of changes also addresses an issue with how open file descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of /proc/<pid>/fd has been triggering a WARN_ON that has not been meaningful since it was added (as all of the code in the kernel was converted) and is not now actively wrong. There is also a short list of issues that have not been fixed yet that I will mention briefly. It is possible to rename a directory from below to above a bind mount. At which point any directory pointers below the renamed directory can be walked up to the root directory of the filesystem. With user namespaces enabled a bind mount of the bind mount can be created allowing the user to pick a directory whose children they can rename to outside of the bind mount. This is challenging to fix and doubly so because all obvious solutions must touch code that is in the performance part of pathname resolution. As mentioned above there is also a question of how to ensure that developers by accident or with purpose do not introduce exectuable files on sysfs and proc and in doing so introduce security regressions in the current userspace that will not be immediately obvious and as such are likely to require breaking userspace in painful ways once they are recognized" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: vfs: Remove incorrect debugging WARN in prepend_path mnt: Update fs_fully_visible to test for permanently empty directories sysfs: Create mountpoints with sysfs_create_mount_point sysfs: Add support for permanently empty directories to serve as mount points. kernfs: Add support for always empty directories. proc: Allow creating permanently empty directories that serve as mount points sysctl: Allow creating permanently empty directories that serve as mountpoints. fs: Add helper functions for permanently empty directories. vfs: Ignore unlocked mounts in fs_fully_visible mnt: Modify fs_fully_visible to deal with locked ro nodev and atime mnt: Refactor the logic for mounting sysfs and proc in a user namespace
2015-07-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "We have a pile of bug fixes from Ilya, including a few patches that sync up the CRUSH code with the latest from userspace. There is also a long series from Zheng that fixes various issues with snapshots, inline data, and directory fsync, some simplification and improvement in the cap release code, and a rework of the caching of directory contents. To top it off there are a few small fixes and cleanups from Benoit and Hong" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits) rbd: use GFP_NOIO in rbd_obj_request_create() crush: fix a bug in tree bucket decode libceph: Fix ceph_tcp_sendpage()'s more boolean usage libceph: Remove spurious kunmap() of the zero page rbd: queue_depth map option rbd: store rbd_options in rbd_device rbd: terminate rbd_opts_tokens with Opt_err ceph: fix ceph_writepages_start() rbd: bump queue_max_segments ceph: rework dcache readdir crush: sync up with userspace crush: fix crash from invalid 'take' argument ceph: switch some GFP_NOFS memory allocation to GFP_KERNEL ceph: pre-allocate data structure that tracks caps flushing ceph: re-send flushing caps (which are revoked) in reconnect stage ceph: send TID of the oldest pending caps flush to MDS ceph: track pending caps flushing globally ceph: track pending caps flushing accurately libceph: fix wrong name "Ceph filesystem for Linux" ceph: fix directory fsync ...
2015-07-02Merge tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable patches: - Fix a crash in the NFSv4 file locking code. - Fix an fsync() regression, where we were failing to retry I/O in some circumstances. - Fix an infinite loop in NFSv4.0 OPEN stateid recovery - Fix a memory leak when an attempted pnfs fails. - Fix a memory leak in the backchannel code - Large hostnames were not supported correctly in NFSv4.1 - Fix a pNFS/flexfiles bug that was impeding error reporting on I/O. - Fix a couple of credential issues in pNFS/flexfiles Bugfixes + cleanups: - Open flag sanity checks in the NFSv4 atomic open codepath - More NFSv4 delegation related bugfixes - Various NFSv4.1 backchannel bugfixes and cleanups - Fix the NFS swap socket code - Various cleanups of the NFSv4 SETCLIENTID and EXCHANGE_ID code - Fix a UDP transport deadlock issue Features: - More RDMA client transport improvements - NFSv4.2 LAYOUTSTATS functionality for pnfs flexfiles" * tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (87 commits) nfs: Remove invalid tk_pid from debug message nfs: Remove invalid NFS_ATTR_FATTR_V4_REFERRAL checking in nfs4_get_rootfh nfs: Drop bad comment in nfs41_walk_client_list() nfs: Remove unneeded micro checking of CONFIG_PROC_FS nfs: Don't setting FILE_CREATED flags always nfs: Use remove_proc_subtree() instead remove_proc_entry() nfs: Remove unused argument in nfs_server_set_fsinfo() nfs: Fix a memory leak when meeting an unsupported state protect nfs: take extra reference to fl->fl_file when running a LOCKU operation NFSv4: When returning a delegation, don't reclaim an incompatible open mode. NFSv4.2: LAYOUTSTATS is optional to implement NFSv4.2: Fix up a decoding error in layoutstats pNFS/flexfiles: Fix the reset of struct pgio_header when resending pNFS/flexfiles: Turn off layoutcommit for servers that don't need it pnfs/flexfiles: protect ktime manipulation with mirror lock nfs: provide pnfs_report_layoutstat when NFS42 is disabled nfs: verify open flags before allowing open nfs: always update creds in mirror, even when we have an already connected ds nfs: fix potential credential leak in ff_layout_update_mirror_cred pnfs/flexfiles: report layoutstat regularly ...
2015-07-02Merge branch 'overlayfs-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "This relaxes the requirements on the lower layer filesystem: now ones that implement .d_revalidate, such as NFS, can be used. Upper layer filesystems still has the "no .d_revalidate" requirement. Also a bad interaction with jffs2 locking has been fixed" * 'overlayfs-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: lookup whiteouts outside iterate_dir() ovl: allow distributed fs as lower layer ovl: don't traverse automount points
2015-07-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "This is the start of improving fuse scalability. An input queue and a processing queue is split out from the monolithic fuse connection, each of those having their own spinlock. The end of the patchset adds the ability to clone a fuse connection. This means, that instead of having to read/write requests/answers on a single fuse device fd, the fuse daemon can have multiple distinct file descriptors open. Each of those can be used to receive requests and send answers, currently the only constraint is that a request must be answered on the same fd as it was read from. This can be extended further to allow binding a device clone to a specific CPU or NUMA node. Based on a patchset by Srinivas Eeda and Ashish Samant. Thanks to Ashish for the review of this series" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (40 commits) fuse: update MAINTAINERS entry fuse: separate pqueue for clones fuse: introduce per-instance fuse_dev structure fuse: device fd clone fuse: abort: no fc->lock needed for request ending fuse: no fc->lock for pqueue parts fuse: no fc->lock in request_end() fuse: cleanup request_end() fuse: request_end(): do once fuse: add req flag for private list fuse: pqueue locking fuse: abort: group pqueue accesses fuse: cleanup fuse_dev_do_read() fuse: move list_del_init() from request_end() into callers fuse: duplicate ->connected in pqueue fuse: separate out processing queue fuse: simplify request_wait() fuse: no fc->lock for iqueue parts fuse: allow interrupt queuing without fc->lock fuse: iqueue locking ...
2015-07-02Merge tag 'module_init-alternate_initcall-v4.1-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux Pull module_init replacement part two from Paul Gortmaker: "Replace module_init with appropriate alternate initcall in non modules. This series converts non-modular code that is using the module_init() call to hook itself into the system to instead use one of our alternate priority initcalls. Unlike the previous series that used device_initcall and hence was a runtime no-op, these commits change to one of the alternate initcalls, because (a) we have them and (b) it seems like the right thing to do. For example, it would seem logical to use arch_initcall for arch specific setup code and fs_initcall for filesystem setup code. This does mean however, that changes in the init ordering will be taking place, and so there is a small risk that some kind of implicit init ordering issue may lie uncovered. But I think it is still better to give these ones sensible priorities than to just assign them all to device_initcall in order to exactly preserve the old ordering. Thad said, we have already made similar changes in core kernel code in commit c96d6660dc65 ("kernel: audit/fix non-modular users of module_init in core code") without any regressions reported, so this type of change isn't without precedent. It has also got the same local testing and linux-next coverage as all the other pull requests that I'm sending for this merge window have got. Once again, there is an unused module_exit function removal that shows up as an outlier upon casual inspection of the diffstat" * tag 'module_init-alternate_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: x86: perf_event_intel_pt.c: use arch_initcall to hook in enabling x86: perf_event_intel_bts.c: use arch_initcall to hook in enabling mm/page_owner.c: use late_initcall to hook in enabling lib/list_sort: use late_initcall to hook in self tests arm: use subsys_initcall in non-modular pl320 IPC code powerpc: don't use module_init for non-modular core hugetlb code powerpc: use subsys_initcall for Freescale Local Bus x86: don't use module_init for non-modular core bootflag code netfilter: don't use module_init/exit in core IPV4 code fs/notify: don't use module_init for non-modular inotify_user code mm: replace module_init usages with subsys_initcall in nommu.c
2015-07-02ext4: avoid deadlocks in the writeback path by using sb_getblk_gfpNikolay Borisov
Switch ext4 to using sb_getblk_gfp with GFP_NOFS added to fix possible deadlocks in the page writeback path. Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-07-01ext4: fix fencepost error in lazytime optimizationTheodore Ts'o
Commit 8f4d8558391: "ext4: fix lazytime optimization" was not a complete fix. In the case where the inode number is a multiple of 16, and we could still end up updating an inode with dirty timestamps written to the wrong inode on disk. Oops. This can be easily reproduced by using generic/005 with a file system with metadata_csum and lazytime enabled. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-07-01Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge third patchbomb from Andrew Morton: - the rest of MM - scripts/gdb updates - ipc/ updates - lib/ updates - MAINTAINERS updates - various other misc things * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (67 commits) genalloc: rename of_get_named_gen_pool() to of_gen_pool_get() genalloc: rename dev_get_gen_pool() to gen_pool_get() x86: opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit MAINTAINERS: add zpool MAINTAINERS: BCACHE: Kent Overstreet has changed email address MAINTAINERS: move Jens Osterkamp to CREDITS MAINTAINERS: remove unused nbd.h pattern MAINTAINERS: update brcm gpio filename pattern MAINTAINERS: update brcm dts pattern MAINTAINERS: update sound soc intel patterns MAINTAINERS: remove website for paride MAINTAINERS: update Emulex ocrdma email addresses bcache: use kvfree() in various places libcxgbi: use kvfree() in cxgbi_free_big_mem() target: use kvfree() in session alloc and free IB/ehca: use kvfree() in ipz_queue_{cd}tor() drm/nouveau/gem: use kvfree() in u_free() drm: use kvfree() in drm_free_large() cxgb4: use kvfree() in t4_free_mem() cxgb3: use kvfree() in cxgb_free_mem() ...
2015-07-01Btrfs: fix wrong check for btrfs_force_chunk_alloc()Shilong Wang
btrfs_force_chunk_alloc() return 1 for allocation chunk successfully. This problem exists since commit c87f08ca4. With this patch, we might fix some enospc problems for balances. Signed-off-by: Wang Shilong <wangshilong1991@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01Btrfs: fix warning of bytes_may_useLiu Bo
While running generic/019, dmesg got several warnings from btrfs_free_reserved_data_space(). Test generic/019 produces some disk failures so sumbit dio will get errors, in which case, btrfs_direct_IO() goes to the error handling and free bytes_may_use, but the problem is that bytes_may_use has been free'd during get_block(). This adds a runtime flag to show if we've gone through get_block(), if so, don't do the cleanup work. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01Btrfs: fix hang when failing to submit bio of directIOLiu Bo
The hang is uncoverd by generic/019. btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits an error, thus those added ordered extents will never get processed, which block processes that waiting for them via btrfs_start_ordered_extent(). This fixes the above, and meanwhile finish_ordered_fn will do the space accounting work. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>