summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-10-21block: Inline blk_integrity in struct gendiskMartin K. Petersen
Up until now the_integrity profile has been dynamically allocated and attached to struct gendisk after the disk has been made active. This causes problems because NVMe devices need to register the profile prior to the partition table being read due to a mandatory metadata buffer requirement. In addition, DM goes through hoops to deal with preallocating, but not initializing integrity profiles. Since the integrity profile is small (4 bytes + a pointer), Christoph suggested moving it to struct gendisk proper. This requires several changes: - Moving the blk_integrity definition to genhd.h. - Inlining blk_integrity in struct gendisk. - Removing the dynamic allocation code. - Adding helper functions which allow gendisk to set up and tear down the integrity sysfs dir when a disk is added/deleted. - Adding a blk_integrity_revalidate() callback for updating the stable pages bdi setting. - The calls that depend on whether a device has an integrity profile or not now key off of the bi->profile pointer. - Simplifying the integrity support routines in DM (Mike Snitzer). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21Merge branch 'nfsclone'Trond Myklebust
* nfsclone: nfs: add missing linux/types.h NFS: Fix an 'unused variable' complaint when #ifndef CONFIG_NFS_V4_2 nfs42: add NFS_IOC_CLONE_RANGE ioctl nfs42: respect clone_blksize nfs: get clone_blksize when probing fsinfo nfs42: add NFS_IOC_CLONE ioctl nfs42: add CLONE proc functions nfs42: add CLONE xdr functions
2015-10-21btrfs: reada: Fix returned errno codeLuis de Bethencourt
reada is using -1 instead of the -ENOMEM defined macro to specify that a buffer allocation failed. Since the error number is propagated, the caller will get a -EPERM which is the wrong error condition. Also, updating the caller to return the exact value from reada_add_block. Smatch tool warning: reada_add_block() warn: returning -1 instead of -ENOMEM is sloppy Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: check-integrity: Fix returned errno codesLuis de Bethencourt
check-integrity is using -1 instead of the -ENOMEM defined macro to specify that a buffer allocation failed. Since the error number is propagated, the caller will get a -EPERM which is the wrong error condition. Also, the smatch tool complains with the following warnings: btrfsic_process_superblock() warn: returning -1 instead of -ENOMEM is sloppy btrfsic_read_block() warn: returning -1 instead of -ENOMEM is sloppy Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: compress: put variables defined per compress type in struct to make ↵Byongho Lee
cache friendly Below variables are defined per compress type. - struct list_head comp_idle_workspace[BTRFS_COMPRESS_TYPES] - spinlock_t comp_workspace_lock[BTRFS_COMPRESS_TYPES] - int comp_num_workspace[BTRFS_COMPRESS_TYPES] - atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES] - wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES] BTW, while accessing one compress type of these variables, the next or before address is other compress types of it. So this patch puts these variables in a struct to make cache friendly. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: cleanup iterating over prop_handlers arrayByongho Lee
This patch eliminates the last item of prop_handlers array which is used to check end of array and instead uses ARRAY_SIZE macro. Though this is a very tiny optimization, using ARRAY_SIZE macro is a good practice to iterate array. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: fix a comment typoGeliang Tang
Just fix a typo in the code comment. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: declare rsv_count as unsigned int instead of intAlexandru Moise
rsv_count ultimately gets passed to start_transaction() which now takes an unsigned int as its num_items parameter. The value of rsv_count should always be positive so declare it as being unsigned. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: change num_items type from u64 to unsigned intAlexandru Moise
The value of num_items that start_transaction() ultimately always takes is a small one, so a 64 bit integer is overkill. Also change num_items for btrfs_start_transaction() and btrfs_start_transaction_lflush() as well. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: cleanup btrfs_balance profile validity checksAlexandru Moise
Improve readability by generalizing the profile validity checks. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs/file.c: remove an unsed varialbe first_indexShan Hai
The commit b37392ea86761 ("Btrfs: cleanup unnecessary parameter and variant of prepare_pages()") makes it redundant. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Shan Hai <haishan.bai@hotmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: use btrfs_raid_array in btrfs_reduce_alloc_profileZhao Lei
btrfs_raid_array[] holds attributes of all raid types. Use btrfs_raid_array[].devs_min is best way for request in btrfs_reduce_alloc_profile(), instead of use complex condition of each raid types. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: use btrfs_raid_array for btrfs_get_num_tolerated_disk_barrier_failures()Zhao Lei
btrfs_raid_array[] is used to define all raid attributes, use it to get tolerated_failures in btrfs_get_num_tolerated_disk_barrier_failures(), instead of complex condition in function. It can make code simple and auto-support other possible raid-type in future. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: Move btrfs_raid_array to publicZhao Lei
This array is used to record attributes of each raid type, make it public, and many functions will benifit with this array. For example, num_tolerated_disk_barrier_failures(), we can avoid complex conditions in this function, and get raid attribute simply by accessing above array. It can also make code logic simple, reduce duplication code, and increase maintainability. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: use a single if() statement for one outcome in get_block_rsv()Alexandru Moise
Rather than have three separate if() statements for the same outcome we should just OR them together in the same if() statement. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: memset cur_trans->delayed_refs to zeroAlexandru Moise
Use memset() to null out the btrfs_delayed_ref_root of btrfs_transaction instead of setting all the members to 0 by hand. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: remove unnecessary list_delByongho Lee
We can safely iterate whole list items, without using list_del macro. So remove the list_del call. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: replace unnecessary list_for_each_entry_safe to list_for_each_entryByongho Lee
There is no removing list element while iterating over list. So, replace list_for_each_entry_safe to list_for_each_entry. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: trimming some start_transaction() code awayAlexandru Moise
Just call kmem_cache_zalloc() instead of calling kmem_cache_alloc(). We're just initializing most fields to 0, false and NULL later on _anyway_, so to make the code mode readable and potentially gain a bit of performance (completely untested claim), we should fill our btrfs_trans_handle with zeros on allocation then just initialize those five remaining fields (not counting the list_heads) as normal. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: Fixed declaration of old_lenAlexandru Moise
old_len is used to store the return value of btrfs_item_size_nr(). The return value of btrfs_item_size_nr() is of type u32. To improve code correctness and avoid mixing signed and unsigned integers I've changed old_len to be of type u32 as well. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21btrfs: Fixed dsize and last_off declarationsAlexandru Moise
The return values of btrfs_item_offset_nr and btrfs_item_size_nr are of type u32. To avoid mixing signed and unsigned integers we should also declare dsize and last_off to be of type u32. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21Btrfs: btrfs_submit_bio_hook: Use btrfs_wq_endio_type values instead of ↵Chandan Rajendra
integer constants btrfs_submit_bio_hook() uses integer constants instead of values from "enum btrfs_wq_endio_type". Fix this. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21pstore: add a helper function pstore_register_kmsgGeliang Tang
Add a new wrapper function pstore_register_kmsg to keep the consistency with other similar pstore_register_* functions. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-10-21pstore: add vmalloc error checkGeliang Tang
If vmalloc fails, make write_pmsg return -ENOMEM. Signed-off-by: Geliang Tang <geliangtang@163.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-10-21KEYS: Merge the type-specific data with the payload dataDavid Howells
Merge the type-specific data with the payload data into one four-word chunk as it seems pointless to keep them separate. Use user_key_payload() for accessing the payloads of overloaded user-defined keys. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-cifs@vger.kernel.org cc: ecryptfs@vger.kernel.org cc: linux-ext4@vger.kernel.org cc: linux-f2fs-devel@lists.sourceforge.net cc: linux-nfs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: linux-ima-devel@lists.sourceforge.net
2015-10-20btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode sizeQu Wenruo
Current code will always truncate tailing page if its alloc_start is smaller than inode size. For example, the file extent layout is like: 0 4K 8K 16K 32K |<-----Extent A---------------->| |<--Inode size: 18K---------->| But if calling fallocate even for range [0,4K), it will cause btrfs to re-truncate the range [16,32K), causing COW and a new extent. 0 4K 8K 16K 32K |///////| <- Fallocate call range |<-----Extent A-------->|<--B-->| The cause is quite easy, just a careless btrfs_truncate_inode() in a else branch without extra judgment. Fix it by add judgment on whether the fallocate range is beyond isize. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-20f2fs: support fiemap for inline_dataJaegeuk Kim
There is a FIEMAP_EXTENT_INLINE_DATA, pointed out by Marc. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-20f2fs: flush dirty data for bmapJaegeuk Kim
Users expect bmap will give allocated block addresses. Let's play likewise ext4. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-19ext2: Add locking for DAX faultsRoss Zwisler
Add locking to ensure that DAX faults are isolated from ext2 operations that modify the data blocks allocation for an inode. This is intended to be analogous to the work being done in XFS by Dave Chinner: http://www.spinics.net/lists/linux-fsdevel/msg90260.html Compared with XFS the ext2 case is greatly simplified by the fact that ext2 already allocates and zeros new blocks before they are returned as part of ext2_get_block(), so DAX doesn't need to worry about getting unmapped or unwritten buffer heads. This means that the only work we need to do in ext2 is to isolate the DAX faults from inode block allocation changes. I believe this just means that we need to isolate the DAX faults from truncate operations. The newly introduced dax_sem is intended to replicate the protection offered by i_mmaplock in XFS. In addition to truncate the i_mmaplock also protects XFS operations like hole punching, fallocate down, extent manipulation IOCTLS like xfs_ioc_space() and extent swapping. Truncate is the only one of these operations supported by ext2. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.com>
2015-10-19ext4: fix abs() usage in ext4_mb_check_group_paJohn Stultz
The ext4_fsblk_t type is a long long, which should not be used with abs(), as is done in ext4_mb_check_group_pa(). This patch modifies ext4_mb_check_group_pa() to use abs64() instead. Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-18ext4: do not allow journal_opts for fs w/o journalDmitry Monakhov
It is appeared that we can pass journal related mount options and such options be shown in /proc/mounts Example: #mkfs.ext4 -F /dev/vdb #tune2fs -O ^has_journal /dev/vdb #mount /dev/vdb /mnt/ -ocommit=20,journal_async_commit #cat /proc/mounts | grep /mnt /dev/vdb /mnt ext4 rw,relatime,journal_checksum,journal_async_commit,commit=20,data=ordered 0 0 But options:"journal_checksum,journal_async_commit,commit=20,data=ordered" has nothing with reality because there is no journal at all. This patch disallow following options for journalless configurations: - journal_checksum - journal_async_commit - commit=%ld - data={writeback,ordered,journal} Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2015-10-18ext4: explicit mount options parsing cleanupDmitry Monakhov
Currently MOPT_EXPLICIT treated as EXPLICIT_DELALLOC which may be changed in future. Let's fix it now. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-19sysfs: added __compat_only_sysfs_link_entry_to_kobj()Jarkko Sakkinen
Added a new function __compat_only_sysfs_link_group_to_kobj() that adds a symlink from attribute or group to a kobject. This needed for maintaining backwards compatibility with PPI attributes in the TPM driver. Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
2015-10-18ext4, jbd2: ensure entering into panic after recording an error in superblockDaeho Jeong
If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the journaling will be aborted first and the error number will be recorded into JBD2 superblock and, finally, the system will enter into the panic state in "errors=panic" option. But, in the rare case, this sequence is little twisted like the below figure and it will happen that the system enters into panic state, which means the system reset in mobile environment, before completion of recording an error in the journal superblock. In this case, e2fsck cannot recognize that the filesystem failure occurred in the previous run and the corruption wouldn't be fixed. Task A Task B ext4_handle_error() -> jbd2_journal_abort() -> __journal_abort_soft() -> __jbd2_journal_abort_hard() | -> journal->j_flags |= JBD2_ABORT; | | __ext4_abort() | -> jbd2_journal_abort() | | -> __journal_abort_soft() | | -> if (journal->j_flags & JBD2_ABORT) | | return; | -> panic() | -> jbd2_journal_update_sb_errno() Tested-by: Hobin Woo <hobin.woo@samsung.com> Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-10-18debugfs: Add debugfs_create_ulong()Viresh Kumar
Add debugfs_create_ulong() for the users of type 'unsigned long'. These will be 32 bits long on a 32 bit machine and 64 bits long on a 64 bit machine. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17debugfs: Add read-only/write-only bool file opsStephen Boyd
There aren't any read-only or write-only bool file ops, but there is a caller of debugfs_create_bool() that calls it with mode equal to 0400. This leads to the possibility of userspace modifying the file, so let's use the newly created debugfs_create_mode() helper here to fix this. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17debugfs: Add read-only/write-only size_t file opsStephen Boyd
There aren't any read-only or write-only size_t file ops, but there is a caller of debugfs_create_size_t() that calls it with mode equal to 0400. This leads to the possibility of userspace modifying the file, so let's use the newly created debugfs_create_mode() helper here to fix this. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17debugfs: Add read-only/write-only x64 file opsStephen Boyd
There aren't any read-only or write-only x64 file ops, but there is a caller of debugfs_create_x64() that calls it with mode equal to S_IRUGO. This leads to the possibility of userspace modifying the file, so let's use the newly created debugfs_create_mode() helper here to fix this. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17debugfs: Consolidate file mode checks in debugfs_create_*()Stephen Boyd
The code that creates debugfs file with different file ops based on the file mode is duplicated in each debugfs_create_*() API. Consolidate that code into debugfs_create_mode(), that takes three file ops structures so that we don't have to keep copy/pasting that logic. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-18[PATCH] fix calculation of meta_bg descriptor backupsAndy Leiserson
"group" is the group where the backup will be placed, and is initialized to zero in the declaration. This meant that backups for meta_bg descriptors were erroneously written to the backup block group descriptors in groups 1 and (desc_per_block-1). Reproduction information: mke2fs -Fq -t ext4 -b 1024 -O ^resize_inode /tmp/foo.img 16G truncate -s 24G /tmp/foo.img losetup /dev/loop0 /tmp/foo.img mount /dev/loop0 /mnt resize2fs /dev/loop0 umount /dev/loop0 dd if=/dev/zero of=/dev/loop0 bs=1024 count=2 e2fsck -fy /dev/loop0 losetup -d /dev/loop0 Signed-off-by: Andy Leiserson <andy@leiserson.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-10-17ext4: fix potential use after free in __ext4_journal_stopLukas Czerner
There is a use-after-free possibility in __ext4_journal_stop() in the case that we free the handle in the first jbd2_journal_stop() because we're referencing handle->h_err afterwards. This was introduced in 9705acd63b125dee8b15c705216d7186daea4625 and it is wrong. Fix it by storing the handle->h_err value beforehand and avoid referencing potentially freed handle. Fixes: 9705acd63b125dee8b15c705216d7186daea4625 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@vger.kernel.org
2015-10-17jbd2: fix checkpoint list cleanupJan Kara
Unlike comments and expectation of callers journal_clean_one_cp_list() returned 1 not only if it freed the transaction but also if it freed some buffers in the transaction. That could make __jbd2_journal_clean_checkpoint_list() skip processing t_checkpoint_io_list and continue with processing the next transaction. This is mostly a cosmetic issue since the only result is we can sometimes free less memory than we could. But it's still worth fixing. Fix journal_clean_one_cp_list() to return 1 only if the transaction was really freed. Fixes: 50849db32a9f529235a84bcc84a6b8e631b1d0ec Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-10-17ext4: fix xfstest generic/269 double revoked buffer bug with bigallocDaeho Jeong
When you repeatly execute xfstest generic/269 with bigalloc_1k option enabled using the below command: "./kvm-xfstests -c bigalloc_1k -m nodelalloc -C 1000 generic/269" you can easily see the below bug message. "JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);" This means that an already revoked buffer is erroneously revoked again and it is caused by doing revoke for the buffer at the wrong position in ext4_free_blocks(). We need to re-position the buffer revoke procedure for an unspecified buffer after checking the cluster boundary for bigalloc option. If not, some part of the cluster can be doubly revoked. Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
2015-10-17ext4: make the bitmap read routines return real error codesDarrick J. Wong
Make the bitmap reaading routines return real error codes (EIO, EFSCORRUPTED, EFSBADCRC) which can then be reflected back to userspace for more precise diagnosis work. In particular, this means that mballoc no longer claims that we're out of memory if the block bitmaps become corrupt. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17jbd2: clean up feature test macros with predicate functionsDarrick J. Wong
Create separate predicate functions to test/set/clear feature flags, thereby replacing the wordy old macros. Furthermore, clean out the places where we open-coded feature tests. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17ext4: clean up feature test macros with predicate functionsDarrick J. Wong
Create separate predicate functions to test/set/clear feature flags, thereby replacing the wordy old macros. Furthermore, clean out the places where we open-coded feature tests. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2015-10-17ext4: call out CRC and corruption errors with specific error codesDarrick J. Wong
Instead of overloading EIO for CRC errors and corrupt structures, return the same error codes that XFS returns for the same issues. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17ext4: store checksum seed in superblockDarrick J. Wong
Allow the filesystem to store the metadata checksum seed in the superblock and add an incompat feature to say that we're using it. This enables tune2fs to change the UUID on a mounted metadata_csum FS without having to (racy!) rewrite all disk metadata. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17ext4: reserve code points for the project quota featureTheodore Ts'o
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-16Btrfs: fix truncation of compressed and inlined extentsFilipe Manana
When truncating a file to a smaller size which consists of an inline extent that is compressed, we did not discard (or made unusable) the data between the new file size and the old file size, wasting metadata space and allowing for the truncated data to be leaked and the data corruption/loss mentioned below. We were also not correctly decrementing the number of bytes used by the inode, we were setting it to zero, giving a wrong report for callers of the stat(2) syscall. The fsck tool also reported an error about a mismatch between the nbytes of the file versus the real space used by the file. Now because we weren't discarding the truncated region of the file, it was possible for a caller of the clone ioctl to actually read the data that was truncated, allowing for a security breach without requiring root access to the system, using only standard filesystem operations. The scenario is the following: 1) User A creates a file which consists of an inline and compressed extent with a size of 2000 bytes - the file is not accessible to any other users (no read, write or execution permission for anyone else); 2) The user truncates the file to a size of 1000 bytes; 3) User A makes the file world readable; 4) User B creates a file consisting of an inline extent of 2000 bytes; 5) User B issues a clone operation from user A's file into its own file (using a length argument of 0, clone the whole range); 6) User B now gets to see the 1000 bytes that user A truncated from its file before it made its file world readbale. User B also lost the bytes in the range [1000, 2000[ bytes from its own file, but that might be ok if his/her intention was reading stale data from user A that was never supposed to be public. Note that this contrasts with the case where we truncate a file from 2000 bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In this case reading any byte from the range [1000, 2000[ will return a value of 0x00, instead of the original data. This problem exists since the clone ioctl was added and happens both with and without my recent data loss and file corruption fixes for the clone ioctl (patch "Btrfs: fix file corruption and data loss after cloning inline extents"). So fix this by truncating the compressed inline extents as we do for the non-compressed case, which involves decompressing, if the data isn't already in the page cache, compressing the truncated version of the extent, writing the compressed content into the inline extent and then truncate it. The following test case for fstests reproduces the problem. In order for the test to pass both this fix and my previous fix for the clone ioctl that forbids cloning a smaller inline extent into a larger one, which is titled "Btrfs: fix file corruption and data loss after cloning inline extents", are needed. Without that other fix the test fails in a different way that does not leak the truncated data, instead part of destination file gets replaced with zeroes (because the destination file has a larger inline extent than the source). seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount "-o compress" # Create our test files. File foo is going to be the source of a clone operation # and consists of a single inline extent with an uncompressed size of 512 bytes, # while file bar consists of a single inline extent with an uncompressed size of # 256 bytes. For our test's purpose, it's important that file bar has an inline # extent with a size smaller than foo's inline extent. $XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128" \ -c "pwrite -S 0x2a 128 384" \ $SCRATCH_MNT/foo | _filter_xfs_io $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io # Now durably persist all metadata and data. We do this to make sure that we get # on disk an inline extent with a size of 512 bytes for file foo. sync # Now truncate our file foo to a smaller size. Because it consists of a # compressed and inline extent, btrfs did not shrink the inline extent to the # new size (if the extent was not compressed, btrfs would shrink it to 128 # bytes), it only updates the inode's i_size to 128 bytes. $XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo # Now clone foo's inline extent into bar. # This clone operation should fail with errno EOPNOTSUPP because the source # file consists only of an inline extent and the file's size is smaller than # the inline extent of the destination (128 bytes < 256 bytes). However the # clone ioctl was not prepared to deal with a file that has a size smaller # than the size of its inline extent (something that happens only for compressed # inline extents), resulting in copying the full inline extent from the source # file into the destination file. # # Note that btrfs' clone operation for inline extents consists of removing the # inline extent from the destination inode and copy the inline extent from the # source inode into the destination inode, meaning that if the destination # inode's inline extent is larger (N bytes) than the source inode's inline # extent (M bytes), some bytes (N - M bytes) will be lost from the destination # file. Btrfs could copy the source inline extent's data into the destination's # inline extent so that we would not lose any data, but that's currently not # done due to the complexity that would be needed to deal with such cases # (specially when one or both extents are compressed), returning EOPNOTSUPP, as # it's normally not a very common case to clone very small files (only case # where we get inline extents) and copying inline extents does not save any # space (unlike for normal, non-inlined extents). $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar # Now because the above clone operation used to succeed, and due to foo's inline # extent not being shinked by the truncate operation, our file bar got the whole # inline extent copied from foo, making us lose the last 128 bytes from bar # which got replaced by the bytes in range [128, 256[ from foo before foo was # truncated - in other words, data loss from bar and being able to read old and # stale data from foo that should not be possible to read anymore through normal # filesystem operations. Contrast with the case where we truncate a file from a # size N to a smaller size M, truncate it back to size N and then read the range # [M, N[, we should always get the value 0x00 for all the bytes in that range. # We expected the clone operation to fail with errno EOPNOTSUPP and therefore # not modify our file's bar data/metadata. So its content should be 256 bytes # long with all bytes having the value 0xbb. # # Without the btrfs bug fix, the clone operation succeeded and resulted in # leaking truncated data from foo, the bytes that belonged to its range # [128, 256[, and losing data from bar in that same range. So reading the # file gave us the following content: # # 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 # * # 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a # * # 0000400 echo "File bar's content after the clone operation:" od -t x1 $SCRATCH_MNT/bar # Also because the foo's inline extent was not shrunk by the truncate # operation, btrfs' fsck, which is run by the fstests framework everytime a # test completes, failed reporting the following error: # # root 5 inode 257 errors 400, nbytes wrong status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>