path: root/fs
2016-04-25  ext4: handle unwritten or delalloc buffers before enabling data journaling  (Daeho Jeong)
We already allocate delalloc blocks before changing the inode mode into "per-file data journal" mode to prevent delalloc blocks from remaining unallocated, but another issue involving "BH_Unwritten" status still exists. For example, with fallocate(), several buffers' status changes to "BH_Unwritten", but these buffers cannot be processed by ext4_alloc_da_blocks(). So they remain in unwritten status after per-file data journaling is enabled, they cannot be changed into written status any more, and, if they are journaled and eventually checkpointed, these unwritten buffers will cause a kernel panic via the BUG_ON() below in submit_bh_wbc() when they are submitted during checkpointing.

  static int submit_bh_wbc(int rw, struct buffer_head *bh, ...)
  {
          ...
          BUG_ON(buffer_unwritten(bh));

Moreover, when the "dioread_nolock" option is enabled, a buffer's status is changed to "BH_Unwritten" after write_begin() completes, and the "BH_Unwritten" status is cleared only after the I/O is done. Therefore, if a buffer's status is changed to unwritten but the buffer's I/O has not been submitted and completed, it can cause the same problem after enabling per-file data journaling. You can easily reproduce this bug by executing the following command:

  ./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269

To resolve these problems and define a boundary between the previous mode and per-file data journaling mode, we need to flush and wait for all the I/O of a file's buffers before enabling per-file data journaling of the file. Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2016-04-25  ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()  (Theodore Ts'o)
The function jbd2_journal_extend() takes as its argument the number of new credits to be added to the handle. We weren't taking into account the currently unused handle credits; worse, we would try to extend the handle by N credits when it already had N credits available. In the case where jbd2_journal_extend() fails because the transaction is too large, when jbd2_journal_restart() gets called, the N credits owned by the handle get returned to the transaction, the transaction commit is asynchronously requested, and then start_this_handle() will be able to successfully attach the handle to the current transaction since the required credits are now available. This is mostly harmless, but since ext4_ext_truncate_extend_restart() returns EAGAIN, the truncate machinery will once again try to call ext4_ext_truncate_extend_restart(), which will repeat the above sequence over and over again until the transaction has committed. This was found while I was debugging a lockup caused by running xfstests generic/074 in the data=journal case. I'm still not sure why we ended up looping forever, which suggests there may still be another bug hiding in the transaction accounting machinery, but this commit prevents us from looping in the first place. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
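A minimal standalone C sketch of the credit arithmetic described above (the helper name and the numbers are made up for illustration; this is not the jbd2 API): a handle should only ask to be extended by the credits it is actually missing.

  #include <stdio.h>

  /* Hypothetical helper, not the jbd2 API: a handle should only be extended
   * by the credits it is missing, never by the full amount it wants. */
  static int credits_to_request(int wanted, int unused_in_handle)
  {
          int missing = wanted - unused_in_handle;
          return missing > 0 ? missing : 0;
  }

  int main(void)
  {
          /* Old behaviour: the handle already holds 8 unused credits, the
           * caller wants 8, yet jbd2 was asked to extend by the full 8. */
          printf("old request: 8 extra credits\n");
          /* Fixed behaviour: account for the unused credits first. */
          printf("new request: %d extra credits\n", credits_to_request(8, 8));
          return 0;
  }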
2016-04-25  libceph: make authorizer destruction independent of ceph_auth_client  (Ilya Dryomov)
Starting the kernel client with cephx disabled and then enabling cephx and restarting userspace daemons can result in a crash:

  [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
  [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
  [262671.584334] PGD 0
  [262671.635847] Oops: 0000 [#1] SMP
  [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
  [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
  [262672.268937] Workqueue: ceph-msgr con_work [libceph]
  [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
  [262672.428330] RIP: 0010:[<ffffffff811cd04a>] [<ffffffff811cd04a>] kfree+0x5a/0x130
  [262672.535880] RSP: 0018:ffff880149aeba58 EFLAGS: 00010286
  [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
  [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
  [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
  [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
  [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
  [262673.131722] FS: 0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
  [262673.245377] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
  [262673.417556] Stack:
  [262673.472943] ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
  [262673.583767] ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
  [262673.694546] ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
  [262673.805230] Call Trace:
  [262673.859116] [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
  [262673.968705] [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
  [262674.078852] [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
  [262674.134249] [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
  [262674.189124] [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
  [262674.243749] [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
  [262674.297485] [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
  [262674.350813] [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
  [262674.403312] [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
  [262674.454712] [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
  [262674.505096] [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
  [262674.555104] [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
  [262674.604072] [<ffffffff810901ea>] worker_thread+0x11a/0x470
  [262674.652187] [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
  [262674.699022] [<ffffffff810957a2>] kthread+0xd2/0xf0
  [262674.744494] [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
  [262674.789543] [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
  [262674.834094] [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0

What happens is the following:

  (1) new MON session is established
  (2) old "none" ac is destroyed
  (3) new "cephx" ac is constructed
  ...
  (4) old OSD session (w/ "none" authorizer) is put
      ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)

osd->o_auth.authorizer in the "none" case is just a bare pointer into ac, which contains a single static copy for all services. By the time we get to (4), "none" ac, freed in (2), is long gone. On top of that, a new vtable installed in (3) points us at ceph_x_destroy_authorizer(), so we end up trying to destroy a "none" authorizer with a "cephx" destructor operating on invalid memory!

To fix this, decouple authorizer destruction from ac and do away with a single static "none" authorizer by making a copy for each OSD or MDS session. Authorizers themselves are independent of ac and so there is no reason for destroy_authorizer() to be an ac op. Make it an op on the authorizer itself by turning ceph_authorizer into a real struct.

Fixes: http://tracker.ceph.com/issues/15447 Reported-by: Alan Zhang <alan.zhang@linux.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
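A small self-contained C sketch of the design the fix adopts, with invented names rather than the real libceph structures: each authorizer carries its own destroy callback, so tearing it down no longer depends on whichever auth client happens to be installed at that moment.

  #include <stdlib.h>
  #include <stdio.h>

  /* Hypothetical names, not the libceph structures: the destroy operation
   * travels with the authorizer object itself. */
  struct authorizer {
          void (*destroy)(struct authorizer *);
          void *proto_private;
  };

  static void none_destroy(struct authorizer *a)
  {
          free(a);
  }

  static struct authorizer *none_create(void)
  {
          struct authorizer *a = calloc(1, sizeof(*a));
          if (a)
                  a->destroy = none_destroy;      /* bound at creation time */
          return a;
  }

  int main(void)
  {
          struct authorizer *a = none_create();

          /* Safe even after the auth client that created it has been
           * replaced: each session holds its own copy with its own
           * destructor. */
          if (a)
                  a->destroy(a);
          return 0;
  }

Because the callback is bound when the authorizer is created, replacing the auth client later (steps (2) and (3) above) cannot redirect the teardown to the wrong destructor.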
2016-04-25  udf: Fix conversion of 'dstring' fields to UTF8  (Andrew Gabbasov)
Commit 9293fcfbc1812a22ad5ce1b542eb90c1bbe01be1 ("udf: Remove struct ustr as non-needed intermediate storage"), while getting rid of 'struct ustr', does not take any special care of 'dstring' fields and effectively uses the fixed field length instead of the actual string length, which is encoded in the last byte of the field. Also, commit 484a10f49387e4386bf2708532e75bf78ffea2cb ("udf: Merge linux specific translation into CS0 conversion function") introduced checking of the length of the string being converted, requiring proper alignment to the number of bytes constituting each character. The UDF volume identifier is represented as a 32-byte 'dstring', and needs to be converted from CS0 to UTF8 while mounting a UDF filesystem. The changes in the mentioned commits can in some cases lead to incorrect handling of the volume identifier: - if the actual string in the 'dstring' is of maximal length and does not have zero bytes separating it from the dstring's encoded length in the last byte, that last byte may be included in the conversion, producing an incorrect resulting string; - if the identifier is encoded with 2-byte characters (compression code 16), the length of 31 bytes (32 bytes of field length minus 1 byte of compression code), taken as the string length, is reported as an incorrect (unaligned) length, and the conversion fails, which in turn leads to a volume mounting failure. This patch introduces handling of the 'dstring' encoded length field in the udf_CS0toUTF8 function, which is used in all and only those cases where 'dstring' fields are converted. Currently these cases are the processing of the Volume Identifier and Volume Set Identifier fields. The function is also renamed to udf_dstrCS0toUTF8 to distinctly indicate that it handles 'dstring' input. Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com> Signed-off-by: Jan Kara <jack@suse.cz>
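A small userspace sketch of the 'dstring' layout this fix relies on, as described above (byte 0 holds the compression ID, the last byte holds the encoded length, which counts the ID byte); the helper is illustrative and is not the kernel's udf_dstrCS0toUTF8():

  #include <stdio.h>
  #include <stdint.h>

  /* Illustrative dstring decode: ocu[0] is the compression ID (8 or 16),
   * ocu[field_len - 1] is the number of bytes actually used, including the
   * compression ID byte. */
  static int dstring_payload(const uint8_t *ocu, int field_len,
                             const uint8_t **payload, int *payload_len)
  {
          uint8_t used = ocu[field_len - 1];

          if (used == 0 || used >= field_len)
                  return -1;                      /* corrupt length byte */
          *payload = ocu + 1;                     /* skip the compression ID */
          *payload_len = used - 1;
          return ocu[0];                          /* compression ID */
  }

  int main(void)
  {
          uint8_t volid[32] = { 8, 'L', 'i', 'n', 'u', 'x' };
          const uint8_t *p;
          int n;

          volid[31] = 6;                          /* 1 ID byte + 5 payload bytes */
          int comp = dstring_payload(volid, sizeof(volid), &p, &n);
          printf("compression %d, %d payload bytes: %.*s\n", comp, n, n, p);
          return 0;
  }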
2016-04-25  fuse: Fix return value from fuse_get_user_pages()  (Ashish Samant)
fuse_get_user_pages() should return error or 0. Otherwise fuse_direct_io read will not return 0 to indicate that read has completed. Fixes: 742f992708df ("fuse: return patrial success from fuse_direct_io()") Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-04-24  ext4: do not ask jbd2 to write data for delalloc buffers  (Jan Kara)
Currently we ask jbd2 to write all dirty allocated buffers before committing a transaction when doing writeback of delay allocated blocks. However, this is unnecessary since we move all pages to writeback state before dropping a transaction handle and then submit all the necessary IO. We still need the transaction commit to wait for all the outstanding writeback before flushing disk caches, though, to avoid data exposure issues. Use the new jbd2 capability and ask it to only wait for outstanding writeback during transaction commit when writing back data in ext4_writepages(). Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24  jbd2: add support for avoiding data writes during transaction commits  (Jan Kara)
Currently, when a filesystem needs to make sure data is on permanent storage before committing a transaction, it adds the inode to the transaction's inode list. During transaction commit, jbd2 writes back all dirty buffers that have allocated underlying blocks and waits for the IO to finish. However, when doing writeback for delayed allocated data, we allocate blocks and immediately submit the data, so asking jbd2 to write dirty pages just unnecessarily adds more work to jbd2, possibly writing back other redirtied blocks. Add support to jbd2 to allow a filesystem to ask jbd2 to only wait for outstanding data writes before committing a transaction, and thus avoid unnecessary writes. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24  ext4: remove EXT4_STATE_ORDERED_MODE  (Jan Kara)
This flag just duplicates what ext4_should_order_data() tells you and is used in a single place. Furthermore, it doesn't reflect changes to the inode's data journalling flag, so it may be misleading. Just remove it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24  ext4: fix data exposure after a crash  (Jan Kara)
Huang has reported that in his powerfail testing he is seeing stale block contents in some recently allocated blocks although he mounts ext4 in data=ordered mode. After some investigation I have found out that indeed when delayed allocation is used, we don't add the inode to the transaction's list of inodes needing flushing before commit. Originally we were doing that, but commit f3b59291a69d removed the logic with the flawed argument that it is not needed. The problem is that although for delayed allocated blocks we write their contents immediately after allocating them, there is no guarantee that the IO scheduler or device doesn't reorder things, and thus the transaction allocating blocks and attaching them to the inode can reach stable storage before the actual block contents. Actually, whenever we attach freshly allocated blocks to an inode using a written extent, we should add the inode to the transaction's ordered inode list to make sure we properly wait for the block contents to be written before committing the transaction. So that is what we do in this patch. This also handles other cases where stale data exposure was possible - like filling a hole via mmap in data=ordered,nodelalloc mode. The only exception to the above rule are extending direct IO writes, where blkdev_direct_IO() waits for IO to complete before increasing i_size and thus stale data exposure is not possible. For now we don't complicate the code with optimizing this special case since the overhead is pretty low. In case this is observed to be a performance problem we can always handle it using a special flag to ext4_map_blocks(). CC: stable@vger.kernel.org Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com> Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-23  ext4: allow readdir()'s of large empty directories to be interrupted  (Theodore Ts'o)
If a directory has a large number of empty blocks, iterating over all of them can take a long time, leading to scheduler warnings and users getting irritated when they can't kill a process in the middle of one of these long-running readdir operations. Fix this by adding checks to ext4_readdir() and ext4_htree_fill_tree(). This was reverted earlier due to a typo in the original commit where I experimented with using signal_pending() instead of fatal_signal_pending(). The test was in the wrong place if we were going to return signal_pending(), since we would end up returning duplicate entries. See 9f2394c9be47 for a more detailed explanation. Added a fix as suggested by Linus to check for signal_pending() in the filldir() functions. Reported-by: Benjamin LaHaise <bcrl@kvack.org> Google-Bug-Id: 27880676 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-23  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
Conflicts were two cases of simple overlapping changes, nothing serious. In the UDP case, we need to add a hlist_add_tail_rcu() to linux/rculist.h, because we've moved UDP socket handling away from using nulls lists. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23  ceph: kill __ceph_removexattr()  (Yan, Zheng)
When removing an xattr, generic_removexattr() calls __ceph_setxattr() with a NULL value and the XATTR_REPLACE flag. __ceph_removexattr() is not used any more. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23  ceph: Switch to generic xattr handlers  (Andreas Gruenbacher)
Add a catch-all xattr handler at the end of ceph_xattr_handlers. Check for valid attribute names there, and remove those checks from __ceph_{get,set,remove}xattr instead. No "system.*" xattrs need to be handled by the catch-all handler anymore. The set xattr handler is called with a NULL value to indicate that the attribute should be removed; __ceph_setxattr already handles that case correctly (ceph_set_acl could already be calling __ceph_setxattr with a NULL value). Move the check for snapshots from ceph_{set,remove}xattr into __ceph_{set,remove}xattr. With that, ceph_{get,set,remove}xattr can be replaced with the generic iops. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23  ceph: Get rid of d_find_alias in ceph_set_acl  (Andreas Gruenbacher)
Create a variant of ceph_setattr that takes an inode instead of a dentry. Change __ceph_setxattr (and also __ceph_removexattr) to take an inode instead of a dentry. Use those in ceph_set_acl so that we no longer need a dentry there. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23  cifs: Switch to generic xattr handlers  (Andreas Gruenbacher)
Use xattr handlers for resolving attribute names. The amount of setup code required on cifs is nontrivial, so use the same get and set functions for all handlers, with switch statements for the different types of attributes in them. The set_EA handler can handle NULL values, so we don't need a separate removexattr function anymore. Remove the cifs_dbg statements related to xattr name resolution; they don't add much. Don't build xattr.o when CONFIG_CIFS_XATTR is not defined. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23  cifs: Fix removexattr for os2.* xattrs  (Andreas Gruenbacher)
If cifs_removexattr finds a "user." or "os2." xattr name prefix, it skips 5 bytes, one byte too many for "os2.". Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
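A tiny standalone C illustration of the off-by-one being fixed: hard-coding a skip of 5 bytes is right for the 5-byte "user." prefix but one byte too many for the 4-byte "os2." prefix, so the sketch below derives the offset from the prefix itself (the macros are defined locally for the example, not taken from the cifs code):

  #include <stdio.h>
  #include <string.h>

  #define XATTR_USER_PREFIX "user."
  #define XATTR_OS2_PREFIX  "os2."

  /* Return the attribute name with its namespace prefix stripped,
   * computing the prefix length instead of hard-coding 5. */
  static const char *strip_prefix(const char *name)
  {
          if (!strncmp(name, XATTR_USER_PREFIX, strlen(XATTR_USER_PREFIX)))
                  return name + strlen(XATTR_USER_PREFIX);
          if (!strncmp(name, XATTR_OS2_PREFIX, strlen(XATTR_OS2_PREFIX)))
                  return name + strlen(XATTR_OS2_PREFIX);
          return name;
  }

  int main(void)
  {
          printf("%s\n", strip_prefix("os2.comment"));    /* "comment", not "omment" */
          printf("%s\n", strip_prefix("user.comment"));   /* "comment" */
          return 0;
  }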
2016-04-23  cifs: Check for equality with ACL_TYPE_ACCESS and ACL_TYPE_DEFAULT  (Andreas Gruenbacher)
The two values ACL_TYPE_ACCESS and ACL_TYPE_DEFAULT are meant to be enumerations, not bits in a bit mask. Use '==' instead of '&' to check for these values. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
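A short standalone example of why '==' is the right test here: the two type constants are enumeration values, so a bitwise check can accept values that are neither of them (the constants are redefined locally with their commonly used kernel values so the demo compiles on its own):

  #include <stdio.h>

  #define ACL_TYPE_ACCESS  (0x8000)
  #define ACL_TYPE_DEFAULT (0x4000)

  int main(void)
  {
          int type = ACL_TYPE_ACCESS | ACL_TYPE_DEFAULT;  /* not a valid type */

          /* The bitwise test accepts this bogus value as "access"... */
          printf("'&'  test matches ACL_TYPE_ACCESS: %d\n",
                 (type & ACL_TYPE_ACCESS) != 0);
          /* ...whereas the equality test correctly rejects it. */
          printf("'==' test matches ACL_TYPE_ACCESS: %d\n",
                 type == ACL_TYPE_ACCESS);
          return 0;
  }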
2016-04-23  cifs: Fix xattr name checks  (Andreas Gruenbacher)
Use strcmp(str, name) instead of strncmp(str, name, strlen(name)) for checking if str and name are the same (as opposed to name being a prefix of str) in the getxattr and setxattr inode operations. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
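A minimal demonstration of the difference this fix cares about: strncmp() bounded by strlen(name) also matches any longer string that merely starts with name, while strcmp() requires exact equality (the attribute name below is just an example string, not taken from the cifs code):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          const char *wanted = "system.posix_acl_access";
          const char *given  = "system.posix_acl_accessXX";  /* bogus longer name */

          /* Prefix comparison: wrongly reports a match. */
          printf("strncmp says match: %d\n",
                 strncmp(given, wanted, strlen(wanted)) == 0);
          /* Exact comparison: correctly rejects it. */
          printf("strcmp  says match: %d\n", strcmp(given, wanted) == 0);
          return 0;
  }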
2016-04-20  eCryptfs: Do not allocate hash tfm in NORECLAIM context  (Herbert Xu)
You cannot allocate crypto tfm objects in NORECLAIM or NOFS contexts. The ecryptfs code currently does exactly that for the MD5 tfm. This patch fixes it by preallocating the MD5 tfm in a safe context. The MD5 tfm is also reentrant so this patch removes the superfluous cs_hash_tfm_mutex. Reported-by: Nicolas Boichat <drinkcat@chromium.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-04-19  GFS2: Add calls to gfs2_holder_uninit in two error handlers  (Daniel DeFreez)
This patch fixes two locations that do not call gfs2_holder_uninit if gfs2_glock_nq returns an error. Signed-off-by: Daniel DeFreez <dcdefreez@ucdavis.edu> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-04-19  Merge branch 'ptmx-cleanup'  (Linus Torvalds)
Merge the ptmx internal interface cleanup branch. This doesn't change semantics, but it should be a sane basis for eventually getting the multi-instance devpts code into some sane shape where we can get rid of the kernel config option. Which we can hopefully get done next merge window.. * ptmx-cleanup: devpts: clean up interface to pty drivers
2016-04-18  devpts: clean up interface to pty drivers  (Linus Torvalds)
This gets rid of the horrible notion of having that struct inode *ptmx_inode be the linchpin of the interface between the pty code and devpts. By de-emphasizing the ptmx inode, a lot of things actually get cleaner, and we will have a much saner way forward. In particular, this will allow us to associate with any particular devpts instance at open-time, and not be artificially tied to one particular ptmx inode. The patch itself is actually fairly straightforward, and apart from some locking and return path cleanups it's pretty mechanical: - the interfaces that devpts exposes all take "struct pts_fs_info *" instead of "struct inode *ptmx_inode" now. NOTE! The "struct pts_fs_info" thing is a completely opaque structure as far as the pty driver is concerned: it's still declared entirely internally to devpts. So the pty code can't actually access it in any way, just pass it as a "cookie" to the devpts code. - the "look up the pts fs info" is now a single clear operation, that also does the reference count increment on the pts superblock. So "devpts_add/del_ref()" is gone, and replaced by a "lookup and get ref" operation (devpts_get_ref(inode)), along with a "put ref" op (devpts_put_ref()). - the pty master "tty->driver_data" field now contains the pts_fs_info, not the ptmx inode. - because we don't care about the ptmx inode any more as some kind of base index, the ref counting can now drop the inode games - it just gets the ref on the superblock. - the pts_fs_info now has a back-pointer to the super_block. That's so that we can easily look up the information we actually need. Although quite often, the pts fs info was actually all we wanted, and not having to look it up based on some magical inode makes things more straightforward. In particular, now that "devpts_get_ref(inode)" operation should really be the *only* place we need to look up what devpts instance we're associated with, and we do it exactly once, at ptmx_open() time. The other side of this is that one ptmx node could now be associated with multiple different devpts instances - you could have a single /dev/ptmx node, and then have multiple mount namespaces with their own instances of devpts mounted on /dev/pts/. And that's all perfectly sane in a model where we just look up the pts instance at open time. This will eventually allow us to get rid of our odd single-vs-multiple pts instance model, but this patch in itself changes no semantics, only an internal binding model. Cc: Eric Biederman <ebiederm@xmission.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Aurelien Jarno <aurelien@aurel32.net> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Jann Horn <jann@thejh.net> Cc: Greg KH <greg@kroah.com> Cc: Jiri Slaby <jslaby@suse.com> Cc: Florian Weimer <fw@deneb.enyo.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-19  Merge 4.6-rc4 into driver-core-next  (Greg Kroah-Hartman)
We want those fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-18  treewide: Fix typos in printk  (Masanari Iida)
This patch fixes spelling typos found in printk messages within various parts of the kernel sources. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-18  Merge branch 'master' into for-next  (Jiri Kosina)
Sync with Linus' tree so that patches against newer codebase can be applied. Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-18  Doc: treewide : Fix typos in DocBook/filesystem.xml  (Masanari Iida)
This patch fixes spelling typos found in DocBook/filesystem.xml. Because the file is generated from comments in the code, I had to fix the comments in the code instead of the xml file. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-16  Merge tag 'driver-core-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core  (Linus Torvalds)
Pull misc fixes from Greg KH: "Here are three small fixes for 4.6-rc4. Two fix up some lz4 issues with big endian systems, and the remaining one resolves a minor debugfs issue that was reported. All have been in linux-next with no reported issues"

* tag 'driver-core-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  lib: lz4: cleanup unaligned access efficiency detection
  lib: lz4: fixed zram with lz4 on big endian machines
  debugfs: Make automount point inodes permanently empty
2016-04-15  f2fs: flush dirty pages before starting atomic writes  (Jaegeuk Kim)
If somebody wrote some data before the atomic writes start, we should flush that data first in order to handle the atomic data within the right period. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15  f2fs: don't invalidate atomic page if successful  (Jaegeuk Kim)
If we committed atomic write successfully, we don't need to invalidate pages. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15  f2fs: give -E2BIG for no space in xattr  (Jaegeuk Kim)
This patch returns -E2BIG if there is no space to add an xattr entry. This should fix generic/026 in xfstests as well. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15  f2fs: remove redundant condition check  (Jaegeuk Kim)
This patch resolves the redundant condition check reported by David. Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15  f2fs: unset atomic/volatile flag in f2fs_release_file  (Jaegeuk Kim)
The atomic/volatile operations should be done in a pair of start and commit ioctls. For example, if a killed process leaves an open-ended atomic operation, we should drop its flag as well as its atomic data. Otherwise, if sqlite initiates another operation which doesn't require atomic writes, it will lose all its data, since f2fs still treats it as atomic writes; nobody will trigger its commit. Reported-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15  f2fs: fix dropping inmemory pages in a wrong time  (Jaegeuk Kim)
When one reader closes its file while the other writer is doing atomic writes, f2fs_release_file drops the atomic data, resulting in an empty commit. This patch fixes this wrong-commit problem by checking the openness of the file.

  Process0                 Process1
  open file
  start atomic write
  write data
                           read data
                           close file
                           f2fs_release_file()
                           clear atomic data
  commit atomic write

Reported-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15  f2fs: add BUG_ON to avoid unnecessary flow  (Jaegeuk Kim)
This patch adds a BUG_ON instead of the retrying loop. In the case of node pages, we already got this inode page, but unlocked it. Given that we don't truncate any node pages during these operations, the page's mapping should be unchangeable. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15  f2fs: use PGP_LOCK to check its truncation  (Jaegeuk Kim)
Previously, after trylock_page() succeeded, we didn't check the page's mapping. In order to fix that, we can just pass PGP_LOCK to pagecache_get_page. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15  f2fs: fix to convert inline directory correctly  (Chao Yu)
With the steps below, we will lose part of the dirents:
 1) mount f2fs with the inline_dentry option
 2) echo 1 > /sys/fs/f2fs/sdX/dir_level
 3) mkdir dir
 4) touch 180 files named [1-180] in dir
 5) touch 181 in dir
 6) echo 3 > /proc/sys/vm/drop_caches
 7) ll dir

  ls: cannot access 2: No such file or directory
  ls: cannot access 4: No such file or directory
  ls: cannot access 5: No such file or directory
  ls: cannot access 6: No such file or directory
  ls: cannot access 8: No such file or directory
  ls: cannot access 9: No such file or directory
  ...
  total 360
  drwxr-xr-x 2 root root 4096 Feb 19 15:12 ./
  drwxr-xr-x 3 root root 4096 Feb 19 15:11 ../
  -rw-r--r-- 1 root root 0 Feb 19 15:12 1
  -rw-r--r-- 1 root root 0 Feb 19 15:12 10
  -rw-r--r-- 1 root root 0 Feb 19 15:12 100
  -????????? ? ? ? ? ? 101
  -????????? ? ? ? ? ? 102
  -????????? ? ? ? ? ? 103
  ...

The reason is: when doing the inline dir conversion, we didn't consider that a directory has a hierarchical hash structure which can be configured through the sysfs interface 'dir_level'. By default, the dir_level of a directory inode is 0; it means we have one bucket in the hash table located in the first level, and all dirents will be hashed into this bucket, so there is no problem in doing the duplication simply between the inline dentry page and the converted normal dentry page. However, if we configure dir_level with a value N (greater than 0), it will expand the bucket number of the first-level hash table by 2^N - 1, and it hashes dirents into different buckets according to their hash values. If we still move all dirents to the first bucket, it results in incorrect locations for inline dirents; as a result, although we can iterate all dirents through ->readdir, we can't stat some of them in ->lookup, which is based on hash table searching. This patch fixes this issue by rehashing dirents into the correct positions when converting an inline directory. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
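A rough standalone sketch of the first-level bucket selection described above: with dir_level N the first hash-table level has 2^N buckets, so the target bucket depends on the dirent's hash, and dumping everything into bucket 0 breaks later hash-based lookups. The arithmetic below is a simplified illustration, not the exact f2fs dir_buckets() logic:

  #include <stdio.h>

  /* Simplified model: at hash-table level 0 with a given dir_level,
   * the bucket count is 1 << dir_level. */
  static unsigned int bucket_of(unsigned int hash, unsigned int dir_level)
  {
          unsigned int nbuckets = 1u << dir_level;

          return hash % nbuckets;
  }

  int main(void)
  {
          unsigned int hashes[] = { 0x1234, 0xbeef, 0x00ff };

          for (int i = 0; i < 3; i++)
                  printf("hash %#06x -> bucket %u at dir_level=1, bucket %u at dir_level=0\n",
                         hashes[i], bucket_of(hashes[i], 1), bucket_of(hashes[i], 0));
          return 0;
  }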
2016-04-15  f2fs: show current mount status  (Jaegeuk Kim)
This patch adds the current mount status to the f2fs status info. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15  f2fs: treat as a normal umount when remounting ro  (Jaegeuk Kim)
When user remounts f2fs as read-only, we can mark the checkpoint as umount. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15  f2fs: give -EINVAL for norecovery and rw mount  (Jaegeuk Kim)
Once detecting something to recover, f2fs should stop mounting, given norecovery and rw mount options. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15  f2fs: recover superblock at RW remounts  (Jaegeuk Kim)
This patch adds a sbi flag, SBI_NEED_SB_WRITE, which indicates it needs to recover superblock when (re)mounting as RW. This is set only when f2fs is mounted as RO. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15  f2fs: give RO message when recovering superblock  (Jaegeuk Kim)
When one of superblocks is missing, f2fs recovers it with the valid one. But, even if f2fs is mounted as RO, we'd better notify that too. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-14  Merge tag 'for-linus-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs  (Linus Torvalds)
Pull f2fs/fscrypto fixes from Jaegeuk Kim: "In addition to f2fs/fscrypto fixes, I've added one patch which prevents RCU mode lookup in d_revalidate, as Al mentioned. These patches fix f2fs and fscrypto based on -rc3 bug fixes in ext4 crypto, which have not yet been fully propagated as follows.
 - use of dget_parent and file_dentry to avoid crashes
 - disallow RCU-mode lookup in d_invalidate
 - disallow -ENOMEM in the core data encryption path"

* tag 'for-linus-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  ext4/fscrypto: avoid RCU lookup in d_revalidate
  fscrypto: don't let data integrity writebacks fail with ENOMEM
  f2fs: use dget_parent and file_dentry in f2fs_file_open
  fscrypto: use dget_parent() in fscrypt_d_revalidate()
2016-04-14  Make file credentials available to the seqfile interfaces  (Linus Torvalds)
A lot of seqfile users seem to be using things like %pK that uses the credentials of the current process, but that is actually completely wrong for filesystem interfaces. The unix semantics for permission checking files is to check permissions at _open_ time, not at read or write time, and that is not just a small detail: passing off stdin/stdout/stderr to a suid application and making the actual IO happen in privileged context is a classic exploit technique. So if we want to be able to look at permissions at read time, we need to use the file open credentials, not the current ones. Normal file accesses can just use "f_cred" (or any of the helper functions that do that, like file_ns_capable()), but the seqfile interfaces do not have any such options. It turns out that seq_file _does_ save away the user_ns information of the file, though. Since user_ns is just part of the full credential information, replace that special case with saving off the cred pointer instead, and suddenly seq_file has all the permission information it needs. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
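A compact userspace analogy of the rule described above: capture the caller's identity when the handle is opened and consult that saved identity on later reads, rather than whoever happens to be reading. The kernel change itself saves a struct cred pointer in seq_file; the types below are invented purely for illustration:

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/types.h>

  /* Hypothetical handle that remembers the opener's identity. */
  struct handle {
          uid_t opener_uid;
  };

  static struct handle open_handle(void)
  {
          struct handle h = { .opener_uid = getuid() };   /* credentials at open time */

          return h;
  }

  static int may_see_sensitive(const struct handle *h)
  {
          /* Decide based on the saved opener, not on getuid() at read time. */
          return h->opener_uid == 0;
  }

  int main(void)
  {
          struct handle h = open_handle();

          printf("show sensitive data: %s\n", may_see_sensitive(&h) ? "yes" : "no");
          return 0;
  }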
2016-04-14  GFS2: Don't dereference inode in gfs2_inode_lookup until it's valid  (Bob Peterson)
Function gfs2_inode_lookup was dereferencing the inode and only afterwards checking whether the value was NULL. We need to do that check first. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-04-13  sock: tigthen lockdep checks for sock_owned_by_user  (Hannes Frederic Sowa)
sock_owned_by_user should not be used without socket lock held. It seems to be a common practice to check .owned before lock reclassification, so provide a little help to abstract this check away. Cc: linux-cifs@vger.kernel.org Cc: linux-bluetooth@vger.kernel.org Cc: linux-nfs@vger.kernel.org Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-12  ext4/fscrypto: avoid RCU lookup in d_revalidate  (Jaegeuk Kim)
As Al pointed out, d_revalidate should bail out of RCU lookup before using d_inode. This was originally introduced by: commit 34286d666230 ("fs: rcu-walk aware d_revalidate method"). Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: stable <stable@vger.kernel.org>
2016-04-12  debugfs: Make automount point inodes permanently empty  (Seth Forshee)
Starting with 4.1 the tracing subsystem has its own filesystem which is automounted in the tracing subdirectory of debugfs. Prior to this debugfs could be bind mounted in a cloned mount namespace, but if tracefs has been mounted under debugfs this now fails because there is a locked child mount. This creates a regression for container software which bind mounts debugfs to satisfy the assumption of some userspace software. In other pseudo filesystems such as proc and sysfs we're already creating mountpoints like this in such a way that no dirents can be created in the directories, allowing them to be exceptions to some MNT_LOCKED tests. In fact we already do this for the tracefs mountpoint in sysfs. Do the same in debugfs_create_automount(), since the intention here is clearly to create a mountpoint. This fixes the regression, as locked child mounts on permanently empty directories do not cause a bind mount to fail. Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12  debugfs: unproxify files created through debugfs_create_u32_array()  (Nicolai Stange)
The struct file_operations u32_array_fops associated with files created through debugfs_create_u32_array() has been lifetime aware already: everything needed for subsequent operation is copied to a ->f_private buffer at file opening time in u32_array_open(). Now, ->open() is always protected against file removal issues by the debugfs core. There is no need for the debugfs core to wrap the u32_array_fops with a file lifetime managing proxy. Make debugfs_create_u32_array() create its files in non-proxying operation mode by means of debugfs_create_file_unsafe(). Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12  debugfs: unproxify files created through debugfs_create_blob()  (Nicolai Stange)
Currently, the struct file_operations fops_blob associated with files created through the debugfs_create_blob() helpers is not file lifetime aware. Thus, a lifetime managing proxy is created around fops_blob each time such a file is opened, which is an unnecessary waste of resources. Implement file lifetime management for the fops_blob file_operations. Namely, make read_file_blob() safe against file removals by means of debugfs_use_file_start() and debugfs_use_file_finish(). Make debugfs_create_blob() create its files in non-proxying operation mode by means of debugfs_create_file_unsafe(). Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12  debugfs: unproxify files created through debugfs_create_bool()  (Nicolai Stange)
Currently, the struct file_operations fops_bool associated with files created through the debugfs_create_bool() helpers are not file lifetime aware. Thus, a lifetime managing proxy is created around fops_bool each time such a file is opened which is an unnecessary waste of resources. Implement file lifetime management for the fops_bool file_operations. Namely, make debugfs_read_file_bool() and debugfs_write_file_bool() safe against file removals by means of debugfs_use_file_start() and debugfs_use_file_finish(). Make debugfs_create_bool() create its files in non-proxying operation mode through debugfs_create_mode_unsafe(). Finally, purge debugfs_create_mode() as debugfs_create_bool() had been its last user. Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>