summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-05-05ovl: use an auxiliary var for overlay root entryAmir Goldstein
Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05ovl: store file handle of lower inode on copy upAmir Goldstein
Sometimes it is interesting to know if an upper file is pure upper or a copy up target, and if it is a copy up target, it may be interesting to find the copy up origin. This will be used to preserve lower inode numbers across copy up. Store the lower inode file handle in upper inode extended attribute overlay.origin on copy up to use it later for these cases. Store the lower filesystem uuid along side the file handle, so we can validate that we are looking for the origin file in the original fs. If lower fs does not support NFS export ops store a zero sized xattr so we can always use the overlay.origin xattr to distinguish between a copy up and a pure upper inode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05ovl: check if all layers are on the same fsAmir Goldstein
Some features can only work when all layers are on the same fs. Test this condition during mount time, so features can check them later. Add helper ovl_same_sb() to return the common super block in case all layers are on the same fs. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-04Merge tag 'char-misc-4.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here is the big set of new char/misc driver drivers and features for 4.12-rc1. There's lots of new drivers added this time around, new firmware drivers from Google, more auxdisplay drivers, extcon drivers, fpga drivers, and a bunch of other driver updates. Nothing major, except if you happen to have the hardware for these drivers, and then you will be happy :) All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits) firmware: google memconsole: Fix return value check in platform_memconsole_init() firmware: Google VPD: Fix return value check in vpd_platform_init() goldfish_pipe: fix build warning about using too much stack. goldfish_pipe: An implementation of more parallel pipe fpga fr br: update supported version numbers fpga: region: release FPGA region reference in error path fpga altera-hps2fpga: disable/unprepare clock on error in alt_fpga_bridge_probe() mei: drop the TODO from samples firmware: Google VPD sysfs driver firmware: Google VPD: import lib_vpd source files misc: lkdtm: Add volatile to intentional NULL pointer reference eeprom: idt_89hpesx: Add OF device ID table misc: ds1682: Add OF device ID table misc: tsl2550: Add OF device ID table w1: Remove unneeded use of assert() and remove w1_log.h w1: Use kernel common min() implementation uio_mf624: Align memory regions to page size and set correct offsets uio_mf624: Refactor memory info initialization uio: Allow handling of non page-aligned memory regions hangcheck-timer: Fix typo in comment ...
2017-05-04btrfs: fix the gfp_mask for the reada_zones radix treeChris Mason
Commits cc8385b59e17 and 7ef70b4d9987a7 added preallocation for the reada radix trees and also switched them over to GFP_KERNEL for the default gfp mask. Since we're doing radix tree insertions under spinlocks, we need to make sure the mask doesn't allow sleeping. This fix keeps the radix preallocation but switches back to the original gfp_mask. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2017-05-04orangefs: count directory pieces correctlyMartin Brandenburg
A large directory full of differently sized file names triggered this. Most directories, even very large directories with shorter names, would be lucky enough to fit in one server response. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-05-04orangefs: invalidate stored directory on seekMartin Brandenburg
If an application seeks to a position before the point which has been read, it must want updates which have been made to the directory. So delete the copy stored in the kernel so it will be fetched again. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-05-04orangefs: skip forward to the next directory entry if seek is shortMartin Brandenburg
If userspace seeks to a position in the stream which is not correct, it would have returned EIO because the data in the buffer at that offset would be incorrect. This and the userspace daemon returning a corrupt directory are indistinguishable. Now if the data does not look right, skip forward to the next chunk and try again. The motivation is that if the directory changes, an application may seek to a position that was valid and no longer is valid. It is not yet possible for a directory to change. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-05-04ext4: clean up ext4_match() and callersEric Biggers
When ext4 encryption was originally merged, we were encrypting the user-specified filename in ext4_match(), introducing a lot of additional complexity into ext4_match() and its callers. This has since been changed to encrypt the filename earlier, so we can remove the gunk that's no longer needed. This more or less reverts ext4_search_dir() and ext4_find_dest_de() to the way they were in the v4.0 kernel. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04f2fs: switch to using fscrypt_match_name()Eric Biggers
Switch f2fs directory searches to use the fscrypt_match_name() helper function. There should be no functional change. Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04ext4: switch to using fscrypt_match_name()Eric Biggers
Switch ext4 directory searches to use the fscrypt_match_name() helper function. There should be no functional change. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04fscrypt: introduce helper function for filename matchingEric Biggers
Introduce a helper function fscrypt_match_name() which tests whether a fscrypt_name matches a directory entry. Also clean up the magic numbers and document things properly. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04fscrypt: avoid collisions when presenting long encrypted filenamesEric Biggers
When accessing an encrypted directory without the key, userspace must operate on filenames derived from the ciphertext names, which contain arbitrary bytes. Since we must support filenames as long as NAME_MAX, we can't always just base64-encode the ciphertext, since that may make it too long. Currently, this is solved by presenting long names in an abbreviated form containing any needed filesystem-specific hashes (e.g. to identify a directory block), then the last 16 bytes of ciphertext. This needs to be sufficient to identify the actual name on lookup. However, there is a bug. It seems to have been assumed that due to the use of a CBC (ciphertext block chaining)-based encryption mode, the last 16 bytes (i.e. the AES block size) of ciphertext would depend on the full plaintext, preventing collisions. However, we actually use CBC with ciphertext stealing (CTS), which handles the last two blocks specially, causing them to appear "flipped". Thus, it's actually the second-to-last block which depends on the full plaintext. This caused long filenames that differ only near the end of their plaintexts to, when observed without the key, point to the wrong inode and be undeletable. For example, with ext4: # echo pass | e4crypt add_key -p 16 edir/ # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l 100000 # sync # echo 3 > /proc/sys/vm/drop_caches # keyctl new_session # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l 2004 # rm -rf edir/ rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning ... To fix this, when presenting long encrypted filenames, encode the second-to-last block of ciphertext rather than the last 16 bytes. Although it would be nice to solve this without depending on a specific encryption mode, that would mean doing a cryptographic hash like SHA-256 which would be much less efficient. This way is sufficient for now, and it's still compatible with encryption modes like HEH which are strong pseudorandom permutations. Also, changing the presented names is still allowed at any time because they are only provided to allow applications to do things like delete encrypted directories. They're not designed to be used to persistently identify files --- which would be hard to do anyway, given that they're encrypted after all. For ease of backports, this patch only makes the minimal fix to both ext4 and f2fs. It leaves ubifs as-is, since ubifs doesn't compare the ciphertext block yet. Follow-on patches will clean things up properly and make the filesystems use a shared helper function. Fixes: 5de0b4d0cd15 ("ext4 crypto: simplify and speed up filename encryption") Reported-by: Gwendal Grignou <gwendal@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04f2fs: check entire encrypted bigname when finding a dentryJaegeuk Kim
If user has no key under an encrypted dir, fscrypt gives digested dentries. Previously, when looking up a dentry, f2fs only checks its hash value with first 4 bytes of the digested dentry, which didn't handle hash collisions fully. This patch enhances to check entire dentry bytes likewise ext4. Eric reported how to reproduce this issue by: # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch # find edir -type f | xargs stat -c %i | sort | uniq | wc -l 100000 # sync # echo 3 > /proc/sys/vm/drop_caches # keyctl new_session # find edir -type f | xargs stat -c %i | sort | uniq | wc -l 99999 Cc: <stable@vger.kernel.org> Reported-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> (fixed f2fs_dentry_hash() to work even when the hash is 0) Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04ubifs: check for consistent encryption contexts in ubifs_lookup()Eric Biggers
As ext4 and f2fs do, ubifs should check for consistent encryption contexts during ->lookup() in an encrypted directory. This protects certain users of filesystem encryption against certain types of offline attacks. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04f2fs: sync f2fs_lookup() with ext4_lookup()Eric Biggers
As for ext4, now that fscrypt_has_permitted_context() correctly handles the case where we have the key for the parent directory but not the child, f2fs_lookup() no longer has to work around it. Also add the same warning message that ext4 uses. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04ext4: remove "nokey" check from ext4_lookup()Eric Biggers
Now that fscrypt_has_permitted_context() correctly handles the case where we have the key for the parent directory but not the child, we don't need to try to work around this in ext4_lookup(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04fscrypt: fix context consistency check when key(s) unavailableEric Biggers
To mitigate some types of offline attacks, filesystem encryption is designed to enforce that all files in an encrypted directory tree use the same encryption policy (i.e. the same encryption context excluding the nonce). However, the fscrypt_has_permitted_context() function which enforces this relies on comparing struct fscrypt_info's, which are only available when we have the encryption keys. This can cause two incorrect behaviors: 1. If we have the parent directory's key but not the child's key, or vice versa, then fscrypt_has_permitted_context() returned false, causing applications to see EPERM or ENOKEY. This is incorrect if the encryption contexts are in fact consistent. Although we'd normally have either both keys or neither key in that case since the master_key_descriptors would be the same, this is not guaranteed because keys can be added or removed from keyrings at any time. 2. If we have neither the parent's key nor the child's key, then fscrypt_has_permitted_context() returned true, causing applications to see no error (or else an error for some other reason). This is incorrect if the encryption contexts are in fact inconsistent, since in that case we should deny access. To fix this, retrieve and compare the fscrypt_contexts if we are unable to set up both fscrypt_infos. While this slightly hurts performance when accessing an encrypted directory tree without the key, this isn't a case we really need to be optimizing for; access *with* the key is much more important. Furthermore, the performance hit is barely noticeable given that we are already retrieving the fscrypt_context and doing two keyring searches in fscrypt_get_encryption_info(). If we ever actually wanted to optimize this case we might start by caching the fscrypt_contexts. Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04jbd2: cleanup write flags handling from jbd2_write_superblock()Jan Kara
Currently jbd2_write_superblock() silently adds REQ_SYNC to flags with which journal superblock is written. Make this explicit by making flags passed down to jbd2_write_superblock() contain REQ_SYNC. CC: linux-ext4@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04ext4: mark superblock writes synchronous for nobarrier mountsJan Kara
Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. generic_make_request_checks() however strips REQ_FUA flag from a bio when the storage doesn't report volatile write cache and thus write effectively becomes asynchronous which can lead to performance regressions. This affects superblock writes for ext4. Fix the problem by marking superblock writes always as synchronous. Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3 CC: linux-ext4@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04nfs: Fix bdi handling for cloned superblocksJan Kara
In commit 0d3b12584972 "nfs: Convert to separately allocated bdi" I have wrongly cloned bdi reference in nfs_clone_super(). Further inspection has shown that originally the code was actually allocating a new bdi (in ->clone_server callback) which was later registered in nfs_fs_mount_common() and used for sb->s_bdi in nfs_initialise_sb(). This could later result in bdi for the original superblock not getting unregistered when that superblock got shutdown (as the cloned sb still held bdi reference) and later when a new superblock was created under the same anonymous device number, a clash in sysfs has happened on bdi registration: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 10284 at /linux-next/fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x74 sysfs: cannot create duplicate filename '/devices/virtual/bdi/0:32' Modules linked in: axp20x_usb_power gpio_axp209 nvmem_sunxi_sid sun4i_dma sun4i_ss virt_dma CPU: 1 PID: 10284 Comm: mount.nfs Not tainted 4.11.0-rc4+ #14 Hardware name: Allwinner sun7i (A20) Family [<c010f19c>] (unwind_backtrace) from [<c010bc74>] (show_stack+0x10/0x14) [<c010bc74>] (show_stack) from [<c03c6e24>] (dump_stack+0x78/0x8c) [<c03c6e24>] (dump_stack) from [<c0122200>] (__warn+0xe8/0x100) [<c0122200>] (__warn) from [<c0122250>] (warn_slowpath_fmt+0x38/0x48) [<c0122250>] (warn_slowpath_fmt) from [<c02ac178>] (sysfs_warn_dup+0x64/0x74) [<c02ac178>] (sysfs_warn_dup) from [<c02ac254>] (sysfs_create_dir_ns+0x84/0x94) [<c02ac254>] (sysfs_create_dir_ns) from [<c03c8b8c>] (kobject_add_internal+0x9c/0x2ec) [<c03c8b8c>] (kobject_add_internal) from [<c03c8e24>] (kobject_add+0x48/0x98) [<c03c8e24>] (kobject_add) from [<c048d75c>] (device_add+0xe4/0x5a0) [<c048d75c>] (device_add) from [<c048ddb4>] (device_create_groups_vargs+0xac/0xbc) [<c048ddb4>] (device_create_groups_vargs) from [<c048dde4>] (device_create_vargs+0x20/0x28) [<c048dde4>] (device_create_vargs) from [<c02075c8>] (bdi_register_va+0x44/0xfc) [<c02075c8>] (bdi_register_va) from [<c023d378>] (super_setup_bdi_name+0x48/0xa4) [<c023d378>] (super_setup_bdi_name) from [<c0312ef4>] (nfs_fill_super+0x1a4/0x204) [<c0312ef4>] (nfs_fill_super) from [<c03133f0>] (nfs_fs_mount_common+0x140/0x1e8) [<c03133f0>] (nfs_fs_mount_common) from [<c03335cc>] (nfs4_remote_mount+0x50/0x58) [<c03335cc>] (nfs4_remote_mount) from [<c023ef98>] (mount_fs+0x14/0xa4) [<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128) [<c025cba0>] (vfs_kern_mount) from [<c033352c>] (nfs_do_root_mount+0x80/0xa0) [<c033352c>] (nfs_do_root_mount) from [<c0333818>] (nfs4_try_mount+0x28/0x3c) [<c0333818>] (nfs4_try_mount) from [<c0313874>] (nfs_fs_mount+0x2cc/0x8c4) [<c0313874>] (nfs_fs_mount) from [<c023ef98>] (mount_fs+0x14/0xa4) [<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128) [<c025cba0>] (vfs_kern_mount) from [<c02600f0>] (do_mount+0x158/0xc7c) [<c02600f0>] (do_mount) from [<c0260f98>] (SyS_mount+0x8c/0xb4) [<c0260f98>] (SyS_mount) from [<c0107840>] (ret_fast_syscall+0x0/0x3c) Fix the problem by always creating new bdi for a superblock as we used to do. Reported-and-tested-by: Corentin Labbe <clabbe.montjoie@gmail.com> Fixes: 0d3b12584972ce5781179ad3f15cca3cdb5cae05 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04ceph: fix memory leak in __ceph_setxattr()Luis Henriques
The ceph_inode_xattr needs to be released when removing an xattr. Easily reproducible running the 'generic/020' test from xfstests or simply by doing: attr -s attr0 -V 0 /mnt/test && attr -r attr0 /mnt/test While there, also fix the error path. Here's the kmemleak splat: unreferenced object 0xffff88001f86fbc0 (size 64): comm "attr", pid 244, jiffies 4294904246 (age 98.464s) hex dump (first 32 bytes): 40 fa 86 1f 00 88 ff ff 80 32 38 1f 00 88 ff ff @........28..... 00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................ backtrace: [<ffffffff81560199>] kmemleak_alloc+0x49/0xa0 [<ffffffff810f3e5b>] kmem_cache_alloc+0x9b/0xf0 [<ffffffff812b157e>] __ceph_setxattr+0x17e/0x820 [<ffffffff812b1c57>] ceph_set_xattr_handler+0x37/0x40 [<ffffffff8111fb4b>] __vfs_removexattr+0x4b/0x60 [<ffffffff8111fd37>] vfs_removexattr+0x77/0xd0 [<ffffffff8111fdd1>] removexattr+0x41/0x60 [<ffffffff8111fe65>] path_removexattr+0x75/0xa0 [<ffffffff81120aeb>] SyS_lremovexattr+0xb/0x10 [<ffffffff81564b20>] entry_SYSCALL_64_fastpath+0x13/0x94 [<ffffffffffffffff>] 0xffffffffffffffff Cc: stable@vger.kernel.org Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04ceph: fix file open flags on ppc64Alexander Graf
The file open flags (O_foo) are platform specific and should never go out to an interface that is not local to the system. Unfortunately these flags have leaked out onto the wire in the cephfs implementation. That lead to bogus flags getting transmitted on ppc64. This patch converts the kernel view of flags to the ceph view of file open flags. Fixes: 124e68e74 ("ceph: file operations") Signed-off-by: Alexander Graf <agraf@suse.de> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04ceph: choose readdir frag based on previous readdir replyYan, Zheng
The dirfragtree is lazily updated, it's not always accurate. Infinite loops happens in following circumstance. - client send request to read frag A - frag A has been fragmented into frag B and C. So mds fills the reply with contents of frag B - client wants to read next frag C. ceph_choose_frag(frag value of C) return frag A. The fix is using previous readdir reply to calculate next readdir frag when possible. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04ceph: when seeing write errors on an inode, switch to sync writesJeff Layton
Currently, we don't have a real feedback mechanism in place for when we start seeing buffered writeback errors. If writeback is failing, there is nothing that prevents an application from continuing to dirty pages that aren't being cleaned. In the event that we're seeing write errors of any sort occur on an inode, have the callback set a flag to force further writes to be synchronous. When the next write succeeds, clear the flag to allow buffered writeback to continue. Since this is just a hint to the write submission mechanism, we only take the i_ceph_lock when a lockless check shows that the flag needs to be changed. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: "Yan, Zheng” <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04Revert "ceph: SetPageError() for writeback pages if writepages fails"Jeff Layton
This reverts commit b109eec6f4332bd517e2f41e207037c4b9065094. If I'm filling up a filesystem with this sort of command: $ dd if=/dev/urandom of=/mnt/cephfs/fillfile bs=2M oflag=sync ...then I'll eventually get back EIO on a write. Further calls will give us ENOSPC. I'm not sure what prompted this change, but I don't think it's what we want to do. If writepages failed, we will have already set the mapping error appropriately, and that's what gets reported by fsync() or close(). __filemap_fdatawait_range however, does this: wait_on_page_writeback(page); if (TestClearPageError(page)) ret = -EIO; ...and that -EIO ends up trumping the mapping's error if one exists. When writepages fails, we only want to set the error in the mapping, and not flag the individual pages. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: "Yan, Zheng” <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04ceph: handle epoch barriers in cap messagesJeff Layton
Have the client store and update the osdc epoch_barrier when a cap message comes in with one. When sending cap messages, send the epoch barrier as well. This allows clients to inform servers that their released caps may not be used until a particular OSD map epoch. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: "Yan, Zheng” <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04libceph: allow requests to return immediately on full conditions if caller ↵Jeff Layton
wishes Usually, when the osd map is flagged as full or the pool is at quota, write requests just hang. This is not what we want for cephfs, where it would be better to simply report -ENOSPC back to userland instead of stalling. If the caller knows that it will want an immediate error return instead of blocking on a full or at-quota error condition then allow it to set a flag to request that behavior. Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller), and on any other write request from ceph.ko. A later patch will deal with requests that were submitted before the new map showing the full condition came in. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04ceph: make seeky readdir more efficientYan, Zheng
Current cephfs client uses string to indicate start position of readdir. The string is last entry of previous readdir reply. This approach does not work for seeky readdir because we can not easily convert the new postion to a string. For seeky readdir, mds needs to return dentries from the beginning. Client keeps retrying if the reply does not contain the dentry it wants. In current version of ceph, mds sorts CDentry in its cache in hash order. Client also uses dentry hash to compose dir postion. For seeky readdir, if client passes the hash part of dir postion to mds. mds can avoid replying useless dentries. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04ceph: close stopped mds' sessionYan, Zheng
If a mds has stopped, close its session and clean up its session requests/caps. The process is similar to handling SESSION_CLOSE initiated by mds. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04ceph: fix potential use-after-freeYan, Zheng
__unregister_session() free the session if it drops the last reference. We should grab an extra reference if we want to use session after __unregister_session(). Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04ceph: allow connecting to mds whose rank >= mdsmap::m_max_mdsYan, Zheng
mdsmap::m_max_mds is the expected count of active mds. It's not the max rank of active mds. User can decrease mdsmap::m_max_mds, but does not stop mds whose rank >= mdsmap::m_max_mds. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04ceph: fix wrong check in ceph_renew_caps()Yan, Zheng
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04libceph: convert ceph_pagelist.refcnt from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04ceph: convert ceph_cap_snap.nref from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04ceph: convert ceph_mds_session.s_ref from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04libceph, ceph: always advertise all supported featuresIlya Dryomov
No reason to hide CephFS-specific features in the rbd case. Recent feature bits mix RADOS and CephFS-specific stuff together anyway. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-03SMB3: Work around mount failure when using SMB3 dialect to MacsSteve French
Macs send the maximum buffer size in response on ioctl to validate negotiate security information, which causes us to fail the mount as the response buffer is larger than the expected response. Changed ioctl response processing to allow for padding of validate negotiate ioctl response and limit the maximum response size to maximum buffer size. Signed-off-by: Steve French <steve.french@primarydata.com> CC: Stable <stable@vger.kernel.org>
2017-05-03f2fs: fix a mount fail for wrong next_scan_nidYunlei He
-write_checkpoint -do_checkpoint -next_free_nid <--- something wrong with next free nid -f2fs_fill_super -build_node_manager -build_free_nids -get_current_nat_page -__get_meta_page <--- attempt to access beyond end of device Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: - a few misc things - most of MM - KASAN updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits) kasan: separate report parts by empty lines kasan: improve double-free report format kasan: print page description after stacks kasan: improve slab object description kasan: change report header kasan: simplify address description logic kasan: change allocation and freeing stack traces headers kasan: unify report headers kasan: introduce helper functions for determining bug type mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page mm: hwpoison: call shake_page() unconditionally mm/swapfile.c: fix swap space leak in error path of swap_free_entries() mm/gup.c: fix access_ok() argument type mm/truncate: avoid pointless cleancache_invalidate_inode() calls. mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty fs/block_dev: always invalidate cleancache in invalidate_bdev() fs: fix data invalidation in the cleancache during direct IO zram: reduce load operation in page_same_filled zram: use zram_free_page instead of open-coded zram: introduce zram data accessor ...
2017-05-03cifs: fix CIFS_IOC_GET_MNT_INFO oopsDavid Disseldorp
An open directory may have a NULL private_data pointer prior to readdir. Fixes: 0de1f4c6f6c0 ("Add way to query server fs info for smb3") Cc: stable@vger.kernel.org Signed-off-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-03CIFS: fix mapping of SFM_SPACE and SFM_PERIODBjörn Jacke
- trailing space maps to 0xF028 - trailing period maps to 0xF029 This fix corrects the mapping of file names which have a trailing character that would otherwise be illegal (period or space) but is allowed by POSIX. Signed-off-by: Bjoern Jacke <bjacke@samba.org> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-03fs/block_dev: always invalidate cleancache in invalidate_bdev()Andrey Ryabinin
invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0 which doen't make any sense. Make sure that invalidate_bdev() always calls cleancache_invalidate_inode() regardless of mapping->nrpages value. Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache") Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Kuznetsov <kuznet@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nikolay Borisov <n.borisov.lkml@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03fs: fix data invalidation in the cleancache during direct IOAndrey Ryabinin
Patch series "Properly invalidate data in the cleancache", v2. We've noticed that after direct IO write, buffered read sometimes gets stale data which is coming from the cleancache. The reason for this is that some direct write hooks call call invalidate_inode_pages2[_range]() conditionally iff mapping->nrpages is not zero, so we may not invalidate data in the cleancache. Another odd thing is that we check only for ->nrpages and don't check for ->nrexceptional, but invalidate_inode_pages2[_range] also invalidates exceptional entries as well. So we invalidate exceptional entries only if ->nrpages != 0? This doesn't feel right. - Patch 1 fixes direct IO writes by removing ->nrpages check. - Patch 2 fixes similar case in invalidate_bdev(). Note: I only fixed conditional cleancache_invalidate_inode() here. Do we also need to add ->nrexceptional check in into invalidate_bdev()? - Patches 3-4: some optimizations. This patch (of 4): Some direct IO write fs hooks call invalidate_inode_pages2[_range]() conditionally iff mapping->nrpages is not zero. This can't be right, because invalidate_inode_pages2[_range]() also invalidate data in the cleancache via cleancache_invalidate_inode() call. So if page cache is empty but there is some data in the cleancache, buffered read after direct IO write would get stale data from the cleancache. Also it doesn't feel right to check only for ->nrpages because invalidate_inode_pages2[_range] invalidates exceptional entries as well. Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages state. Note: nfs,cifs,9p doesn't need similar fix because the never call cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so they are not affected by this bug. Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache") Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Kuznetsov <kuznet@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nikolay Borisov <n.borisov.lkml@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03jbd2: make the whole kjournald2 kthread NOFS safeMichal Hocko
kjournald2 is central to the transaction commit processing. As such any potential allocation from this kernel thread has to be GFP_NOFS. Make sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170306131408.9828-8-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.cz> Cc: Brian Foster <bfoster@redhat.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Nikolay Borisov <nborisov@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03jbd2: mark the transaction context with the scope GFP_NOFS contextMichal Hocko
now that we have memalloc_nofs_{save,restore} api we can mark the whole transaction context as implicitly GFP_NOFS. All allocations will automatically inherit GFP_NOFS this way. This means that we do not have to mark any of those requests with GFP_NOFS and moreover all the ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded GFP_KERNEL allocations deep inside the vmalloc will be NOFS now. [akpm@linux-foundation.org: tweak comments] Link: http://lkml.kernel.org/r/20170306131408.9828-7-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.cz> Cc: Brian Foster <bfoster@redhat.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Nikolay Borisov <nborisov@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*Michal Hocko
kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore} API to prevent from reclaim recursion into the fs because vmalloc can invoke unconditional GFP_KERNEL allocations and these functions might be called from the NOFS contexts. The memalloc_noio_save will enforce GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should provide exactly what we need here - implicit GFP_NOFS context. Link: http://lkml.kernel.org/r/20170306131408.9828-6-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Nikolay Borisov <nborisov@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: introduce memalloc_nofs_{save,restore} APIMichal Hocko
GFP_NOFS context is used for the following 5 reasons currently: - to prevent from deadlocks when the lock held by the allocation context would be needed during the memory reclaim - to prevent from stack overflows during the reclaim because the allocation is performed from a deep context already - to prevent lockups when the allocation context depends on other reclaimers to make a forward progress indirectly - just in case because this would be safe from the fs POV - silence lockdep false positives Unfortunately overuse of this allocation context brings some problems to the MM. Memory reclaim is much weaker (especially during heavy FS metadata workloads), OOM killer cannot be invoked because the MM layer doesn't have enough information about how much memory is freeable by the FS layer. In many cases it is far from clear why the weaker context is even used and so it might be used unnecessarily. We would like to get rid of those as much as possible. One way to do that is to use the flag in scopes rather than isolated cases. Such a scope is declared when really necessary, tracked per task and all the allocation requests from within the context will simply inherit the GFP_NOFS semantic. Not only this is easier to understand and maintain because there are much less problematic contexts than specific allocation requests, this also helps code paths where FS layer interacts with other layers (e.g. crypto, security modules, MM etc...) and there is no easy way to convey the allocation context between the layers. Introduce memalloc_nofs_{save,restore} API to control the scope of GFP_NOFS allocation context. This is basically copying memalloc_noio_{save,restore} API we have for other restricted allocation context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is just an alias for PF_FSTRANS which has been xfs specific until recently. There are no more PF_FSTRANS users anymore so let's just drop it. PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags is renamed to current_gfp_context because it now cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve their semantic. kmem_flags_convert() doesn't need to evaluate the flag anymore. This patch shouldn't introduce any functional changes. Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS) usage as much as possible and only use a properly documented memalloc_nofs_{save,restore} checkpoints where they are appropriate. [akpm@linux-foundation.org: fix comment typo, reflow comment] Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Brian Foster <bfoster@redhat.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Nikolay Borisov <nborisov@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFSMichal Hocko
xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite some time ago. We would like to make this concept more generic and use it for other filesystems as well. Let's start by giving the flag a more generic name PF_MEMALLOC_NOFS which is in line with an exiting PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO contexts. Replace all PF_FSTRANS usage from the xfs code in the first step before we introduce a full API for it as xfs uses the flag directly anyway. This patch doesn't introduce any functional change. Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Nikolay Borisov <nborisov@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03proc: show MADV_FREE pages info in smapsShaohua Li
Show MADV_FREE pages info of each vma in smaps. The interface is for diganose or monitoring purpose, userspace could use it to understand what happens in the application. Since userspace could dirty MADV_FREE pages without notice from kernel, this interface is the only place we can get accurate accounting info about MADV_FREE pages. [mhocko@kernel.org: update Documentation/filesystems/proc.txt] Link: http://lkml.kernel.org/r/89efde633559de1ec07444f2ef0f4963a97a2ce8.1487965799.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>