path: root/fs
Age  Commit message  Author
2016-07-22xfs: skip dirty pages in ->releasepage()Brian Foster
XFS has had scattered reports of delalloc blocks present at ->releasepage() time. This results in a warning with a stack trace similar to the following: ... Call Trace: [<ffffffffa23c5b8f>] dump_stack+0x63/0x84 [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0 [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140 [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0 [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150 [<ffffffffa21521c2>] try_to_release_page+0x32/0x50 [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0 [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0 [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0 [<ffffffffa2168539>] kswapd+0x4f9/0x970 [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0 [<ffffffffa20a0d99>] kthread+0xc9/0xe0 [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100 [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70 [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100 This occurs because it is possible for shrink_active_list() to send pages marked dirty to ->releasepage() when certain buffer_head threshold conditions are met. shrink_active_list() doesn't check the page dirty state apparently to handle an old ext3 corner case where in some cases clean pages would not have the dirty bit cleared, thus it is up to the filesystem to determine how to handle the page. XFS currently handles the delalloc case properly, but this behavior makes the warning spurious. Update the XFS ->releasepage() handler to explicitly skip dirty pages. Retain the existing delalloc/unwritten checks so we continue to warn if such buffers exist on clean pages when they shouldn't. Diagnosed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-21GFS2: Fix gfs2_replay_incr_blk for multiple journal sizesBob Peterson
Before this patch, if you used gfs2_jadd to add new journals of a size smaller than the existing journals, replaying those new journals would withdraw. That's because function gfs2_replay_incr_blk was using the number of journal blocks (jd_block) from the superblock's journal pointer. In other words, "My journal's max size" rather than "the journal we're replaying's size." This patch changes the function to use the size of the pertinent journal rather than always using the journal we happen to be using. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
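The wrap-around described above can be sketched as follows. This is a minimal illustration, not the GFS2 code: `struct journal`, `nr_blocks` and `replay_incr_blk` are hypothetical names, and the only point is that the limit comes from the journal being replayed rather than from the journal the node itself is using.

```c
#include <stdint.h>

/* Hypothetical stand-in for a journal descriptor; only its size matters. */
struct journal {
    uint32_t nr_blocks;   /* number of blocks in this particular journal */
};

/*
 * Advance a replay position by one block, wrapping to 0 at the end of
 * the journal that is being replayed. Passing the replayed journal in
 * explicitly avoids using the size of some other journal by mistake.
 */
static void replay_incr_blk(const struct journal *replaying, uint32_t *blk)
{
    if (++*blk == replaying->nr_blocks)
        *blk = 0;
}
```

With an 8-block journal, a position of 7 wraps to 0 even if the node's own journal is much larger.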
2016-07-21Merge branch 'for-miklos' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-nextMiklos Szeredi
2016-07-20block: add QUEUE_FLAG_DAX for devices to advertise their DAX supportToshi Kani
Currently, presence of direct_access() in block_device_operations indicates support of DAX on its block device. Because block_device_operations is instantiated with 'const', this DAX capability may not be enabled conditionally. In preparation for supporting DAX on device-mapper devices, add QUEUE_FLAG_DAX to request_queue flags to advertise their DAX support. This will allow the DAX capability to be set based on how the mapped device is composed. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <linux-s390@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20block: get rid of bio_rw and READAChristoph Hellwig
These two are confusing leftovers of the old world order, combining values of the REQ_OP_ and REQ_ namespaces. For callers that don't special case we mostly just replace bi_rw with bio_data_dir or op_is_write, except for the few cases where a switch over the REQ_OP_ values makes more sense. Any check for READA is replaced with an explicit check for REQ_RAHEAD. Also remove the READA alias for REQ_RAHEAD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20f2fs: handle error case with f2fs_bug_onJaegeuk Kim
It's enough to show BUG or WARN by f2fs_bug_on for error case. Then, we don't need to remain corrupted filesystem. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-20f2fs: avoid data race when deciding checkpoint in f2fs_sync_fileJaegeuk Kim
When fs utilization is almost full, f2fs_sync_file should do a checkpoint if there is not enough space for roll-forward later (i.e. space_for_roll_forward). Currently we hold no lock for sbi->alloc_valid_block_count, resulting in a race condition. In rare cases, we can get -ENOSPC when doing roll-forward, which triggers

	if (is_valid_blkaddr(sbi, dest, META_POR)) {
		if (src == NULL_ADDR) {
			err = reserve_new_block(&dn);
			f2fs_bug_on(sbi, err);
			...
		}
		...
	}

in do_recover_data. So, this patch avoids that situation in advance. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-20f2fs: support an ioctl to move a range of data blocksJaegeuk Kim
This patch implements moving a range of data blocks from source file to destination file. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-20f2fs: fix to report error number of f2fs_find_entryChao Yu
This patch makes f2fs_find_entry report the right error number to its caller. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-20cifs: fix crash due to race in hmac(md5) handlingRabin Vincent
The secmech hmac(md5) structures are present in the TCP_Server_Info struct and can be shared among multiple CIFS sessions. However, the server mutex is not currently held when these structures are allocated and used, which can lead to kernel crashes, as in the scenario below:

mount.cifs(8) #1                              mount.cifs(8) #2

Is secmech.sdeschmaccmd5 allocated?
// false
                                              Is secmech.sdeschmaccmd5 allocated?
                                              // false
secmech.hmacmd = crypto_alloc_shash..
secmech.sdeschmaccmd5 = kzalloc..
sdeschmaccmd5->shash.tfm = &secmec.hmacmd;
                                              secmech.sdeschmaccmd5 = kzalloc
                                              // sdeschmaccmd5->shash.tfm
                                              // not yet assigned
crypto_shash_update()
 deref NULL sdeschmaccmd5->shash.tfm

Unable to handle kernel paging request at virtual address 00000030
epc : 8027ba34 crypto_shash_update+0x38/0x158
ra : 8020f2e8 setup_ntlmv2_rsp+0x4bc/0xa84
Call Trace:
 crypto_shash_update+0x38/0x158
 setup_ntlmv2_rsp+0x4bc/0xa84
 build_ntlmssp_auth_blob+0xbc/0x34c
 sess_auth_rawntlmssp_authenticate+0xac/0x248
 CIFS_SessSetup+0xf0/0x178
 cifs_setup_session+0x4c/0x84
 cifs_get_smb_ses+0x2c8/0x314
 cifs_mount+0x38c/0x76c
 cifs_do_mount+0x98/0x440
 mount_fs+0x20/0xc0
 vfs_kern_mount+0x58/0x138
 do_mount+0x1e8/0xccc
 SyS_mount+0x88/0xd4
 syscall_common+0x30/0x54

Fix this by locking the srv_mutex around the code which uses these hmac(md5) structures. All the other secmech algos already have similar locking. Fixes: 95dc8dd14e2e84cc ("Limit allocation of crypto mechanisms to dialect which requires") Signed-off-by: Rabin Vincent <rabinv@axis.com> Acked-by: Sachin Prabhu <sprabhu@redhat.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <smfrench@gmail.com>
2016-07-20Merge branch 'xfs-4.8-dir2-sf-fixes' into for-nextDave Chinner
2016-07-20Merge branch 'xfs-4.8-split-dax-dio' into for-nextDave Chinner
2016-07-20Merge branch 'xfs-4.8-buf-fixes' into for-nextDave Chinner
2016-07-20Merge branch 'xfs-4.8-misc-fixes-3' into for-nextDave Chinner
2016-07-20xfs: remove __arch_packChristoph Hellwig
Instead we always declare struct xfs_dir2_sf_hdr as packed. That's the expected layout, and while most major architectures do the packing by default the new structure size and offset checker showed that not only the ARM old ABI got this wrong, but various minor embedded architectures did as well. [Verified that no code change on x86-64 results from this change] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
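A minimal sketch of what declaring the header as packed buys, assuming a GCC- or Clang-compatible compiler and C11. The struct name and field layout below are illustrative assumptions, not copied from the XFS headers; the compile-time assertion stands in for the kind of structure size checking the message refers to.

```c
#include <stdint.h>

/*
 * Illustrative short-form directory header (hypothetical layout):
 * two one-byte counters followed by an 8-byte parent inode number
 * stored as raw bytes. The packed attribute tells the compiler not to
 * add padding, so the in-memory size matches the on-disk size on every
 * ABI, including ones that would otherwise round structure sizes up.
 */
struct sf_hdr_sketch {
    uint8_t count;       /* number of entries */
    uint8_t i8count;     /* number of entries with 8-byte inode numbers */
    uint8_t parent[8];   /* parent inode number, raw bytes */
} __attribute__((packed));

/* Fails at compile time if any ABI pads the structure. */
_Static_assert(sizeof(struct sf_hdr_sketch) == 10,
               "short-form header must be exactly 10 bytes");
```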
2016-07-20xfs: kill xfs_dir2_inou_tChristoph Hellwig
And use an array of unsigned char values directly to avoid problems with architectures that pad the size of structures. This also gets rid of the xfs_dir2_ino4_t and xfs_dir2_ino8_t types, and introduces new constants for the size of 4 and 8 bytes as well as the size difference between the two. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
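The idea of reading inode numbers straight from an array of unsigned chars can be sketched as below. The constant and function names are hypothetical and chosen only for illustration; the sketch assumes the big-endian byte order used by XFS on disk.

```c
#include <stdint.h>

/* Hypothetical names for the 4-byte size, 8-byte size and their difference. */
#define INO32_SIZE 4
#define INO64_SIZE 8
#define INO64_DIFF (INO64_SIZE - INO32_SIZE)

/*
 * Decode a big-endian inode number from raw bytes. Working on a plain
 * byte array means no structure type is involved, so no architecture
 * can introduce padding or alignment requirements.
 */
static uint64_t sf_get_ino(const uint8_t *from, int is_64bit)
{
    int len = is_64bit ? INO64_SIZE : INO32_SIZE;
    uint64_t ino = 0;

    for (int i = 0; i < len; i++)
        ino = (ino << 8) | from[i];
    return ino;
}
```

A caller walking a short-form directory would advance by `INO32_SIZE` or `INO64_SIZE` bytes depending on the header, with `INO64_DIFF` giving the per-entry growth when converting between the two forms.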
2016-07-20xfs: kill xfs_dir2_sf_off_tChristoph Hellwig
Just use an array of two unsigned chars directly to avoid problems with architectures that pad the size of structures. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: split direct I/O and DAX pathChristoph Hellwig
So far the DAX code overloaded the direct I/O code path. There is very little in common between the two, and untangling them allows us to clean up both variants. As a side effect we also get separate trace points for both I/O types. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: direct calls in the direct I/O pathChristoph Hellwig
We control both the callers and callees of ->direct_IO, so remove the indirect calls. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: stop using generic_file_read_iter for direct I/OChristoph Hellwig
XFS already implements its own flushing of the pagecache because it implements proper synchronization for direct I/O reads. This means calling generic_file_read_iter for direct I/O is rather useless, as it doesn't do much beyond updating the atime and iocb position for us. This also gets rid of the buffered I/O fallback that isn't used for XFS. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: split xfs_file_read_iter into buffered and direct I/O helpersChristoph Hellwig
Similar to what we did on the write side a while ago. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: remove s_maxbytes enforcement in xfs_file_read_iterChristoph Hellwig
All three low-level read implementations that we might call already take care of not overflowing the maximum supported bytes, so there is no need to duplicate it here. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: kill ioflagsChristoph Hellwig
Now that we have the direct I/O kiocb flag there is no real need to sample the value inside of XFS, and the invis flag was always just partially used and isn't worth keeping this infrastructure around for. This also splits the read tracepoint into buffered vs direct as we've done for writes a long time ago. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: don't pass ioflags around in the ioctl pathChristoph Hellwig
Instead check the file pointer for the invisible I/O flag directly, and use the chance to drop redundant arguments from the xfs_ioc_space prototype. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: track and serialize in-flight async buffers against unmountBrian Foster
Newly allocated XFS metadata buffers are added to the LRU once the hold count is released, which typically occurs after I/O completion. There is no other mechanism at current that tracks the existence or I/O state of a new buffer. Further, readahead I/O tends to be submitted asynchronously by nature, which means the I/O can remain in flight and actually complete long after the calling context is gone. This means that file descriptors or any other holds on the filesystem can be released, allowing the filesystem to be unmounted while I/O is still in flight. When I/O completion occurs, core data structures may have been freed, causing completion to run into invalid memory accesses and likely to panic. This problem is reproduced on XFS via directory readahead. A filesystem is mounted, a directory is opened/closed and the filesystem immediately unmounted. The open/close cycle triggers a directory readahead that if delayed long enough, runs buffer I/O completion after the unmount has completed. To address this problem, add a mechanism to track all in-flight, asynchronous buffers using per-cpu counters in the buftarg. The buffer is accounted on the first I/O submission after the current reference is acquired and unaccounted once the buffer is returned to the LRU or freed. Update xfs_wait_buftarg() to wait on all in-flight I/O before walking the LRU list. Once in-flight I/O has completed and the workqueue has drained, all new buffers should have been released onto the LRU. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: exclude never-released buffers from buftarg I/O accountingBrian Foster
The upcoming buftarg I/O accounting mechanism maintains a count of all buffers that have undergone I/O in the current hold-release cycle. Certain buffers associated with core infrastructure (e.g., the xfs_mount superblock buffer, log buffers) are never released, however. This means that accounting I/O submission on such buffers elevates the buftarg count indefinitely and could lead to lockup on unmount. Define a new buffer flag to explicitly exclude buffers from buftarg I/O accounting. Set the flag on the superblock and associated log buffers. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: don't reset b_retries to 0 on every failureEric Sandeen
With the code as it stands today, b_retries never increments because it gets reset to 0 in the error callback. Remove that, and fix a similar problem where the first retry time was constantly being overwritten, which defeated the timeout tunable as well. We now only set first retry time if a non-zero timeout is set, to match the behavior of only incrementing retries if a retry value is set. This way max retries & timeouts consistently take effect after a tunable is set, rather than acting retroactively on a buffer which has failed at some point in the past and has accumulated state from those prior failures. Thanks to dchinner for talking through this with me. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: remove extraneous buffer flag changesEric Sandeen
Fix up a couple places where extra flag manipulation occurs. In the first case we clear XBF_ASYNC and then immediately reset it - so don't bother clearing in the first place. In the 2nd case we are at a point in the function where the buffer must already be async, so there is no need to reset it. Add consistent spacing around the " | " while we're at it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: fix xfs_error_get_cfg for negative errnosEric Sandeen
xfs_error_get_cfg() is called with bp->b_error as an arg, which is negative, so the switch statement won't ever find any matches. This results in only the default error handler having any effect, as EIO/ENOSPC/ENODEV get ignored due to the wrong sign. It seems simplest to always flip the error sign to positive, so that we can handle either negative errors in bp->b_error, or possibly a positive errno via something like xfs_error_get_cfg(EIO) - this future-proofs the function. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
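The sign problem is easy to reproduce outside the kernel. The following is a minimal sketch, assuming a hypothetical `error_class()` helper and enum in place of the real configuration lookup: normalising the sign before the switch lets both `-EIO` and `EIO` reach the same case.

```c
#include <errno.h>

/* Hypothetical result type standing in for the per-error configuration. */
enum err_class { ERR_CLASS_DEFAULT, ERR_CLASS_EIO, ERR_CLASS_ENOSPC, ERR_CLASS_ENODEV };

/*
 * Map an errno value to its class. Kernel code usually carries errors
 * as negative values, while the case labels are positive, so flip the
 * sign first. (Errno values are small, so the negation cannot overflow.)
 */
static enum err_class error_class(int error)
{
    if (error < 0)
        error = -error;

    switch (error) {
    case EIO:
        return ERR_CLASS_EIO;
    case ENOSPC:
        return ERR_CLASS_ENOSPC;
    case ENODEV:
        return ERR_CLASS_ENODEV;
    default:
        return ERR_CLASS_DEFAULT;
    }
}
```

Without the sign flip, a call such as `error_class(-EIO)` would always fall through to the default case, which is the behaviour the commit message describes.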
2016-07-20xfs: remove the magic numbers in xfs_btree_block-related len macrosHou Tao
Replace the magic numbers with offsetof(...) and sizeof(...), and add two extra checks to xfs_check_ondisk_structs(). [dchinner: renamed header structures to be more descriptive] Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: indentation fix in xfs_btree_get_iroot()Kaho Ng
The indentation in this function differs from that of the other functions. Convert the spaces to tabs to improve readability. Signed-off-by: Kaho Ng <ngkaho1234@gmail.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: don't allow negative error tagsDan Carpenter
Errors go from zero which means no error to XFS_ERRTAG_MAX (22). My static checker complains that xfs_errortag_add() puts an upper bound on this but not a lower bound. Let's fix it by making it unsigned. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
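Why an unsigned type closes the gap can be shown with a short sketch. The function names are hypothetical, and whether the real upper bound check is inclusive is an assumption here; the point is only that a single comparison against the maximum also rejects negative input once the parameter is unsigned.

```c
#define ERRTAG_MAX 22   /* value quoted in the commit message */

/* Signed parameter: -1 is not greater than the maximum, so it slips through. */
static int tag_rejected_signed(int tag)
{
    return tag > ERRTAG_MAX;
}

/*
 * Unsigned parameter: a caller passing -1 ends up with a very large
 * positive value, so the same upper bound check rejects it.
 */
static int tag_rejected_unsigned(unsigned int tag)
{
    return tag > ERRTAG_MAX;
}
```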
2016-07-20xfs: fix type confusion in xfs_ioc_swapextJann Horn
When calling fdget() in xfs_ioc_swapext(), we need to verify that the file descriptors passed into the ioctl point to XFS inodes before we start operations on them. If we don't do this, we could be referencing arbitrary kernel memory as an XFS inode. This could lead to memory corruption and/or performing locking operations on attacker-chosen structures in kernel memory. [dchinner: rewrite commit message ] [dchinner: add comment explaining new check ] Signed-off-by: Jann Horn <jann@thejh.net> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
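One common way to perform such a check is to compare the file's operations table against the filesystem's own table before treating the inode as filesystem-specific. The sketch below uses simplified, hypothetical types (`struct file_sketch`, `struct file_ops_sketch`) and is not the kernel code; it only illustrates the pattern of validating the type before any downcast.

```c
/* Simplified stand-ins for the kernel's file and file operations types. */
struct file_ops_sketch { int unused; };
struct file_sketch { const struct file_ops_sketch *f_op; };

/* Each filesystem has its own operations table; addresses are distinct. */
static const struct file_ops_sketch xfs_ops_sketch = { 0 };
static const struct file_ops_sketch other_ops_sketch = { 0 };

/*
 * Return non-zero only if the file is backed by "our" filesystem.
 * Callers must check this before converting the generic inode into a
 * filesystem-specific one; otherwise the conversion would produce a
 * pointer into unrelated memory.
 */
static int is_xfs_file_sketch(const struct file_sketch *f)
{
    return f->f_op == &xfs_ops_sketch;
}
```

An ioctl handler following this pattern would return an error such as `-EINVAL` as soon as either descriptor fails the check.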
2016-07-19cifs: unbreak TCP session reuseRabin Vincent
adfeb3e0 ("cifs: Make echo interval tunable") added a comparison of vol->echo_interval to server->echo_interval as a criterion to match_server(), but: (1) A default value is set for server->echo_interval but not for vol->echo_interval, meaning these can never match if the echo_interval option is not specified. (2) vol->echo_interval is in seconds but server->echo_interval is in jiffies, meaning these can never match even if the echo_interval option is specified. This broke TCP session reuse since match_server() can never return 1. Fix it. Fixes: adfeb3e0 ("cifs: Make echo interval tunable") Signed-off-by: Rabin Vincent <rabinv@axis.com> Acked-by: Sachin Prabhu <sprabhu@redhat.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <smfrench@gmail.com>
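Both problems come down to comparing values that are not in the same unit or not initialised the same way. A minimal sketch, assuming a hypothetical tick rate, a hypothetical default of 60 seconds, and the convention that 0 means the option was not specified:

```c
/* Hypothetical values, for illustration only. */
#define SKETCH_HZ                250UL  /* ticks (jiffies) per second */
#define SKETCH_DEFAULT_ECHO_SECS 60UL   /* default when the option is absent */

/*
 * Compare a mount request's echo interval (seconds) with an existing
 * server's echo interval (jiffies). Apply the default to the request
 * first, then convert to the server's unit before comparing.
 */
static int echo_interval_matches(unsigned long vol_secs,
                                 unsigned long server_jiffies)
{
    if (vol_secs == 0)
        vol_secs = SKETCH_DEFAULT_ECHO_SECS;

    return server_jiffies == vol_secs * SKETCH_HZ;
}
```

Comparing `vol_secs` directly with `server_jiffies`, as the broken check effectively did, would only succeed if the tick rate happened to be 1.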
2016-07-19bdev: get rid of ->bd_inodesAl Viro
Since 2006 we have ->i_bdev pinning bdev in question, so there's no way to get to bdev ->evict_inode() while there's an aliasing inode anywhere. In other words, the only place walking the list of aliases is guaranteed to do it only when the list is empty... Remove the detritus; it should've been done in "[PATCH] Fix a race condition between ->i_mapping and iput()", but nobody had noticed it back then. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-19fuse: don't mess with blocking signalsAl Viro
just use wait_event_killable{,_exclusive}(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-18Btrfs: fix comparison in __btrfs_map_block()Vincent Stehlé
Add missing comparison to op in expression, which was forgotten when doing the REQ_OP transition. Fixes: b3d3fa519905 ("btrfs: update __btrfs_map_block for REQ_OP transition") Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-18f2fs: avoid memory allocation failure due to a long lengthJaegeuk Kim
We need to avoid ENOMEM due to unexpected long length. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-15f2fs: reset default idle interval valueChao Yu
The default idle interval is 2 minutes, but most of the time, even after the screen is turned off, there are still operations within that 2-minute interval, and the GC thread's sleep time is about 30 to 60 seconds, so the GC thread has almost no chance to do garbage collection. Change the default idle interval from 2 minutes to 5 seconds to fix this. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-15f2fs: use blk_plug in all the possible pathsJaegeuk Kim
This patch reverts 19a5f5e2ef37 (f2fs: drop any block plugging), and adds blk_plug in write paths additionally. The main reason is that blk_start_plug can be used to wake up from low-power mode before submitting further bios. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-15f2fs: fix to avoid data update racing between GC and DIOChao Yu
Data in a file can be operated on by GC and DIO simultaneously, so we face the races below:

For write case:
Thread A                                Thread B
- generic_file_direct_write
 - invalidate_inode_pages2_range
 - f2fs_direct_IO
  - do_blockdev_direct_IO
   - do_direct_IO
    - get_more_blocks
                                        - f2fs_gc
                                         - do_garbage_collect
                                          - gc_data_segment
                                           - move_data_page
                                            - do_write_data_page
                                            migrate data block to new block address
   - dio_bio_submit
   update user data to old block address

For read case:
Thread A                                Thread B
- generic_file_direct_write
 - invalidate_inode_pages2_range
 - f2fs_direct_IO
  - do_blockdev_direct_IO
   - do_direct_IO
    - get_more_blocks
                                        - f2fs_balance_fs
                                         - f2fs_gc
                                          - do_garbage_collect
                                           - gc_data_segment
                                            - move_data_page
                                             - do_write_data_page
                                             migrate data block to new block address
                                         - write_checkpoint
                                          - do_checkpoint
                                           - clear_prefree_segments
                                            - f2fs_issue_discard
                                            discard old block address
   - dio_bio_submit
   update user buffer from obsolete block address

In order to fix this, for a given file, we should make DIO and GC mutually exclusive. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-15f2fs: add maximum prefree segmentsJaegeuk Kim
In 1TB storage, we need to admit 22841 prefree segments, which can consume too many segments. This patch caps prefree segments at 8GB in that case. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
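To put the 8GB cap in terms of segments, the arithmetic can be sketched as below. The 2MB segment size is an assumption (the usual f2fs default) and the helper name is hypothetical.

```c
/*
 * Number of whole segments that fit in a byte budget. With an 8GB
 * budget and 2MB segments (assumed default segment size), this gives
 * the maximum number of prefree segments allowed by the cap.
 */
static unsigned long long segments_in(unsigned long long budget_bytes,
                                      unsigned long long segment_bytes)
{
    return budget_bytes / segment_bytes;
}
```

Under that assumption, `segments_in(8ULL << 30, 2ULL << 20)` evaluates to 4096 segments, well below the 22841 quoted for 1TB of storage.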
2016-07-15f2fs: disable extent_cache for fcollapse/finsert inodesJaegeuk Kim
This reduces the elapsed time to do xfstests/generic/017. Before: 458 s After: 390 s Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-15f2fs: refactor __exchange_data_block for speed upJaegeuk Kim
This reduces the elapsed time to do xfstests/generic/017. Before: 715 s After: 458 s Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-15f2fs: fix ERR_PTR returned by bioJaegeuk Kim
This is to fix wrong error pointer handling flow reported by Dan. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-16xfs: fix type confusion in xfs_ioc_swapextJann Horn
Without this check, the following XFS_I invocations would return bad pointers when used on non-XFS inodes (perhaps pointers into preceding allocator chunks). This could be used by an attacker to trick xfs_swap_extents into performing locking operations on attacker-chosen structures in kernel memory, potentially leading to code execution in the kernel. (I have not investigated how likely this is to be usable for an attack in practice.) Signed-off-by: Jann Horn <jann@thejh.net> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15tracing: Use __get_str() when manipulating stringsDaniel Bristot de Oliveira
Use __get_str(str) rather than __get_dynamic_array(str) when dealing with strings. It is just a code cleanup, with no changes to the tracepoint ABI. Link: http://lkml.kernel.org/r/ea260df91817411cca2a1f3db2abd88860094788.1467407618.git.bristot@redhat.com Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-nfs@vger.kernel.org Suggested-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-07-15x86/syscalls: Add compat_sys_preadv64v2/compat_sys_pwritev64v2H.J. Lu
Don't use the same syscall numbers for 2 different syscalls:

	534	x32	preadv		compat_sys_preadv64
	535	x32	pwritev		compat_sys_pwritev64
	534	x32	preadv2		compat_sys_preadv2
	535	x32	pwritev2	compat_sys_pwritev2

Add compat_sys_preadv64v2() and compat_sys_pwritev64v2() so that 64-bit offset is passed in one 64-bit register on x32, similar to compat_sys_preadv64() and compat_sys_pwritev64(). Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/CAMe9rOovCMf-RQfx_n1U_Tu_DX1BYkjtFr%3DQ4-_PFVSj9BCzUA@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15ext4: verify extent header depthVegard Nossum
Although the extent tree depth of 5 should be enough for the worst case of 2^32 extents of length 1, the extent tree code does not currently merge nodes which are less than half-full with a sibling node, or shrink the tree depth if possible. So it's possible, at least in theory, for the tree depth to be greater than 5. However, even in the worst case, a tree depth of 32 is highly unlikely, and if the file system is maliciously corrupted, an insanely large eh_depth can cause memory allocation failures that will trigger kernel warnings (here, eh_depth = 65280):

JBD2: ext4.exe wants too many credits credits:195849 rsv_credits:0 max:256
------------[ cut here ]------------
WARNING: CPU: 0 PID: 50 at fs/jbd2/transaction.c:293 start_this_handle+0x569/0x580
CPU: 0 PID: 50 Comm: ext4.exe Not tainted 4.7.0-rc5+ #508
Stack:
 604a8947 625badd8 0002fd09 00000000
 60078643 00000000 62623910 601bf9bc
 62623970 6002fc84 626239b0 900000125
Call Trace:
 [<6001c2dc>] show_stack+0xdc/0x1a0
 [<601bf9bc>] dump_stack+0x2a/0x2e
 [<6002fc84>] __warn+0x114/0x140
 [<6002fdff>] warn_slowpath_null+0x1f/0x30
 [<60165829>] start_this_handle+0x569/0x580
 [<60165d4e>] jbd2__journal_start+0x11e/0x220
 [<60146690>] __ext4_journal_start_sb+0x60/0xa0
 [<60120a81>] ext4_truncate+0x131/0x3a0
 [<60123677>] ext4_setattr+0x757/0x840
 [<600d5d0f>] notify_change+0x16f/0x2a0
 [<600b2b16>] do_truncate+0x76/0xc0
 [<600c3e56>] path_openat+0x806/0x1300
 [<600c55c9>] do_filp_open+0x89/0xf0
 [<600b4074>] do_sys_open+0x134/0x1e0
 [<600b4140>] SyS_open+0x20/0x30
 [<6001ea68>] handle_syscall+0x88/0x90
 [<600295fd>] userspace+0x3fd/0x500
 [<6001ac55>] fork_handler+0x85/0x90
---[ end trace 08b0b88b6387a244 ]---

[ Commit message modified and the extent tree depth check changed from 5 to 32 -- tytso ] Cc: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
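The check itself is a simple bound on an untrusted on-disk field. A minimal sketch, with a hypothetical helper name and the limit of 32 taken from the note above; conversion from the on-disk little-endian format is assumed to have happened already:

```c
#include <stdint.h>

#define SKETCH_MAX_EXTENT_DEPTH 32   /* limit quoted in the commit message */

/*
 * Validate the depth field of an extent header read from disk.
 * Rejecting absurd values up front prevents later code from sizing
 * allocations or journal credits from a corrupted depth.
 */
static int extent_depth_valid(uint16_t eh_depth)
{
    return eh_depth <= SKETCH_MAX_EXTENT_DEPTH;
}
```

The corrupted value from the log above, `extent_depth_valid(65280)`, is rejected, while a depth of 5 passes.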
2016-07-14ext4: short-cut orphan cleanup on errorVegard Nossum
If we encounter a filesystem error during orphan cleanup, we should stop. Otherwise, we may end up in an infinite loop where the same inode is processed again and again. EXT4-fs (loop0): warning: checktime reached, running e2fsck is recommended EXT4-fs error (device loop0): ext4_mb_generate_buddy:758: group 2, block bitmap and bg descriptor inconsistent: 6117 vs 0 free clusters Aborting journal on device loop0-8. EXT4-fs (loop0): Remounting filesystem read-only EXT4-fs error (device loop0) in ext4_free_blocks:4895: Journal has aborted EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted EXT4-fs error (device loop0) in ext4_ext_remove_space:3068: IO failure EXT4-fs error (device loop0) in ext4_ext_truncate:4667: Journal has aborted EXT4-fs error (device loop0) in ext4_orphan_del:2927: Journal has aborted EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted EXT4-fs (loop0): Inode 16 (00000000618192a0): orphan list check failed! [...] EXT4-fs (loop0): Inode 16 (0000000061819748): orphan list check failed! [...] EXT4-fs (loop0): Inode 16 (0000000061819bf0): orphan list check failed! [...] See-also: c9eb13a9105 ("ext4: fix hang when processing corrupted orphaned inode list") Cc: Jan Kara <jack@suse.cz> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org