summaryrefslogtreecommitdiff
path: root/fs/gfs2/super.c
AgeCommit message (Collapse)Author
2019-05-07gfs2: read journal in large chunksAbhi Das
Use bios to read in the journal into the address space of the journal inode (jd_inode), sequentially and in large chunks. This is faster for locating the journal head that the previous binary search approach. When performing recovery, we keep the journal in the address space until recovery is done, which further speeds up things. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07gfs2: fix race between gfs2_freeze_func and unmountAbhi Das
As part of the freeze operation, gfs2_freeze_func() is left blocking on a request to hold the sd_freeze_gl in SH. This glock is held in EX by the gfs2_freeze() code. A subsequent call to gfs2_unfreeze() releases the EXclusively held sd_freeze_gl, which allows gfs2_freeze_func() to acquire it in SH and resume its operation. gfs2_unfreeze(), however, doesn't wait for gfs2_freeze_func() to complete. If a umount is issued right after unfreeze, it could result in an inconsistent filesystem because some journal data (statfs update) isn't written out. Refer to commit 24972557b12c for a more detailed explanation of how freeze/unfreeze work. This patch causes gfs2_unfreeze() to wait for gfs2_freeze_func() to complete before returning to the user. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07gfs2: Remove misleading comments in gfs2_evict_inodeAndreas Gruenbacher
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07gfs2: Replace gl_revokes with a GLF flagBob Peterson
The gl_revokes value determines how many outstanding revokes a glock has on the superblock revokes list; this is used to avoid unnecessary log flushes. However, gl_revokes is only ever tested for being zero, and it's only decremented in revoke_lo_after_commit, which removes all revokes from the list, so we know that the gl_revoke values of all the glocks on the list will reach zero. Therefore, we can replace gl_revokes with a bit flag. This saves an atomic counter in struct gfs2_glock. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-01gfs2: switch to ->free_inode()Al Viro
... and use GFS2_I() to get the containing gfs2_inode by inode; yes, we can feed the address of the first member of structure to kmem_cache_free(), but let's do it in an obviously safe way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-14Revert "gfs2: read journal in large chunks to locate the head"Bob Peterson
This reverts commit 2a5f14f279f59143139bcd1606903f2f80a34241. This patch causes xfstests generic/311 to fail. Reverting this for now until we have a proper fix. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-11gfs2: read journal in large chunks to locate the headAbhi Das
Use bio(s) to read in the journal sequentially in large chunks and locate the head of the journal. This version addresses the issues Christoph pointed out w.r.t error handling and using deprecated API. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Cc: Christoph Hellwig <hch@infradead.org>
2018-10-19gfs2: Fix minor typo: couln't versus couldn't.Bob Peterson
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-09GFS2: Flush the GFS2 delete workqueue before stopping the kernel threadsTim Smith
Flushing the workqueue can cause operations to happen which might call gfs2_log_reserve(), or get stuck waiting for locks taken by such operations. gfs2_log_reserve() can io_schedule(). If this happens, it will never wake because the only thing which can wake it is gfs2_logd() which was already stopped. This causes umount of a gfs2 filesystem to wedge permanently if, for example, the umount immediately follows a large delete operation. When this occured, the following stack trace was obtained from the umount command [<ffffffff81087968>] flush_workqueue+0x1c8/0x520 [<ffffffffa0666e29>] gfs2_make_fs_ro+0x69/0x160 [gfs2] [<ffffffffa0667279>] gfs2_put_super+0xa9/0x1c0 [gfs2] [<ffffffff811b7edf>] generic_shutdown_super+0x6f/0x100 [<ffffffff811b7ff7>] kill_block_super+0x27/0x70 [<ffffffffa0656a71>] gfs2_kill_sb+0x71/0x80 [gfs2] [<ffffffff811b792b>] deactivate_locked_super+0x3b/0x70 [<ffffffff811b79b9>] deactivate_super+0x59/0x60 [<ffffffff811d2998>] cleanup_mnt+0x58/0x80 [<ffffffff811d2a12>] __cleanup_mnt+0x12/0x20 [<ffffffff8108c87d>] task_work_run+0x7d/0xa0 [<ffffffff8106d7d9>] exit_to_usermode_loop+0x73/0x98 [<ffffffff81003961>] syscall_return_slowpath+0x41/0x50 [<ffffffff815a594c>] int_ret_from_sys_call+0x25/0x8f [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Tim Smith <tim.smith@citrix.com> Signed-off-by: Mark Syms <mark.syms@citrix.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-07-24Merge branch 'iomap-write' into linux-gfs2/for-nextAndreas Gruenbacher
Pull in the gfs2 iomap-write changes: Tweak the existing code to properly support iomap write and eliminate an unnecessary special case in gfs2_block_map. Implement iomap write support for buffered and direct I/O. Simplify some of the existing code and eliminate code that is no longer used: gfs2: Remove gfs2_write_{begin,end} gfs2: iomap direct I/O support gfs2: gfs2_extent_length cleanup gfs2: iomap buffered write support gfs2: Further iomap cleanups This is based on the following changes on the xfs 'iomap-4.19-merge' branch: iomap: add private pointer to struct iomap iomap: add a page_done callback iomap: generic inline data handling iomap: complete partial direct I/O writes synchronously iomap: mark newly allocated buffer heads as new fs: factor out a __generic_write_end helper Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-05gfs2: Eliminate redundant ip->i_rgdAndreas Gruenbacher
GFS2 remembers the last rgrp used for allocations in ip->i_rgd. However, block allocations are made by way of a reservations structure, ip->i_res, which keeps the last rgrp in ip->i_res.rs_rgd, and ip->i_res is kept in sync with ip->i_res.rs_rgd, so it's redundant. Get rid of ip->i_rgd and just use ip->i_res.rs_rgd in its place. Based on patches by Robert Peterson. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-12treewide: kmalloc() -> kmalloc_array()Kees Cook
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a * b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(char) * COUNT + COUNT , ...) | kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) | kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) | kmalloc(sizeof(TYPE) * C2, ...) | kmalloc(C1 * C2 * C3, ...) | kmalloc(C1 * C2, ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) | - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) | - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-28fs: move I_DIRTY_INODE to fs.hChristoph Hellwig
And use it in a few more places rather than opencoding the values. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-30gfs2: Remove inode from ordered write list in gfs2_write_inode()Abhi Das
The vfs clears the I_DIRTY inode flag before calling gfs2_write_inode() having queued any data that needed to be written to disk. This is a good time to remove such inodes from our ordered write list so they don't hang around for long periods of time. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-23GFS2: Log the reason for log flushes in every log headerBob Peterson
This patch just adds the capability for GFS2 to track which function called gfs2_log_flush. This should make it easier to diagnose problems based on the sequence of events found in the journals. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-01-23GFS2: Introduce new gfs2_log_header_v2Bob Peterson
This patch adds a new structure called gfs2_log_header_v2 which is used to store expanded fields into previously unused areas of the log headers (i.e., this change is backwards compatible). Some of these are used for debug purposes so we can backtrack when problems occur. Others are reserved for future expansion. This patch is based on a prototype from Steve Whitehouse. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-11-27Rename superblock flags (MS_xyz -> SB_xyz)Linus Torvalds
This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-31GFS2: flush the log and all pages for jdata as we do for WB_SYNC_ALLBob Peterson
In function gfs2_write_inode, starting with patch a9185b41a4f84, we only flush the log and call filemap_fdatawait if we're passed in a wbc sync_mode of WB_SYNC_ALL. We also need to do these things if we're evicting a jdata inode, because we might have jdata pages still attached to bufdata descriptors that need to be revoked, but by the time it gets to evict() it's too late to start a new transaction. This patch changes it to treat jdata inodes as if WB_SYNC_ALL had been specified. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Abhijith Das <adas@redhat.com>
2017-09-14Merge branch 'work.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount flag updates from Al Viro: "Another chunk of fmount preparations from dhowells; only trivial conflicts for that part. It separates MS_... bits (very grotty mount(2) ABI) from the struct super_block ->s_flags (kernel-internal, only a small subset of MS_... stuff). This does *not* convert the filesystems to new constants; only the infrastructure is done here. The next step in that series is where the conflicts would be; that's the conversion of filesystems. It's purely mechanical and it's better done after the merge, so if you could run something like list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$') sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \ -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \ -e 's/\<MS_NODEV\>/SB_NODEV/g' \ -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \ -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \ -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \ -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \ -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \ -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \ -e 's/\<MS_SILENT\>/SB_SILENT/g' \ -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \ -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \ -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \ -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \ $list and commit it with something along the lines of 'convert filesystems away from use of MS_... constants' as commit message, it would save a quite a bit of headache next cycle" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: VFS: Differentiate mount flags (MS_*) from internal superblock flags VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-08-25GFS2: Withdraw for IO errors writing to the journal or statfsBob Peterson
Before this patch, if GFS2 encountered IO errors while writing to the journal, it would not report the problem, so they would go unnoticed, sometimes for many hours. Sometimes this would only be noticed later, when recovery tried to do journal replay and failed due to invalid metadata at the blocks that resulted in IO errors. This patch makes GFS2's log daemon check for IO errors. If it encounters one, it withdraws from the file system and reports why in dmesg. A similar action is taken when IO errors occur when writing to the system statfs file. These errors are also reported back to any callers of fsync, since that requires the journal to be flushed. Therefore, any IO errors that would previously go unnoticed are now noticed and the file system is withdrawn as early as possible, thus preventing further file system damage. Also note that this reintroduces superblock variable sd_log_error, which Christoph removed with commit f729b66fca. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-10gfs2: Defer deleting inodes under memory pressureAndreas Gruenbacher
When under memory pressure and an inode's link count has dropped to zero, defer deleting the inode to the delete workqueue. This avoids calling into DLM under memory pressure, which can deadlock. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-10gfs2: gfs2_evict_inode: Put glocks asynchronouslyAndreas Gruenbacher
gfs2_evict_inode is called to free inodes under memory pressure. The function calls into DLM when an inode's last cluster-wide reference goes away (remote unlink) and to release the glock and associated DLM lock before finally destroying the inode. However, if DLM is blocked on memory to become available, calling into DLM again will deadlock. Avoid that by decoupling releasing glocks from destroying inodes in that case: with gfs2_glock_queue_put, glocks will be dequeued asynchronously in work queue context, when the associated inodes have likely already been destroyed. With this change, inodes can end up being unlinked, remote-unlink can be triggered, and then the inode can be reallocated before all remote-unlink callbacks are processed. To detect that, revalidate the link count in gfs2_evict_inode to make sure we're not deleting an allocated, referenced inode. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-09gfs2: Fix trivial typosAndreas Gruenbacher
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-09GFS2: Delete debugfs files only after we evict the glocksBob Peterson
This patch moves the call to gfs2_delete_debugfs_file so that it comes after the glock hash table has been cleared. This way we can query the debugfs files if umount hangs. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-09GFS2: Clear gl_object when deleting an inode in gfs2_delete_inodeBob Peterson
This patch adds some calls to clear gl_object in function gfs2_delete_inode. Since we are deleting the inode, and the glock typically outlives the inode in core, we must clear gl_object so subsequent use of the glock (e.g. for a new inode in its place) will not have the old pointer sitting there. In error cases we need to tidy up after ourselves. In non-error cases, we need to clear gl_object before we set the block free in the bitmap so residules aren't left for potential inode creators. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-07-21GFS2: Introduce helper for clearing gl_objectBob Peterson
This patch introduces a new helper function in glock.h that clears gl_object, with an added integrity check. An additional integrity check has been added to glock_set_object, plus comments. This is step 1 in a series to ensure gl_object integrity. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-07-17VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)David Howells
Firstly by applying the following with coccinelle's spatch: @@ expression SB; @@ -SB->s_flags & MS_RDONLY +sb_rdonly(SB) to effect the conversion to sb_rdonly(sb), then by applying: @@ expression A, SB; @@ ( -(!sb_rdonly(SB)) && A +!sb_rdonly(SB) && A | -A != (sb_rdonly(SB)) +A != sb_rdonly(SB) | -A == (sb_rdonly(SB)) +A == sb_rdonly(SB) | -!(sb_rdonly(SB)) +!sb_rdonly(SB) | -A && (sb_rdonly(SB)) +A && sb_rdonly(SB) | -A || (sb_rdonly(SB)) +A || sb_rdonly(SB) | -(sb_rdonly(SB)) != A +sb_rdonly(SB) != A | -(sb_rdonly(SB)) == A +sb_rdonly(SB) == A | -(sb_rdonly(SB)) && A +sb_rdonly(SB) && A | -(sb_rdonly(SB)) || A +sb_rdonly(SB) || A ) @@ expression A, B, SB; @@ ( -(sb_rdonly(SB)) ? 1 : 0 +sb_rdonly(SB) | -(sb_rdonly(SB)) ? A : B +sb_rdonly(SB) ? A : B ) to remove left over excess bracketage and finally by applying: @@ expression A, SB; @@ ( -(A & MS_RDONLY) != sb_rdonly(SB) +(bool)(A & MS_RDONLY) != sb_rdonly(SB) | -(A & MS_RDONLY) == sb_rdonly(SB) +(bool)(A & MS_RDONLY) == sb_rdonly(SB) ) to make comparisons against the result of sb_rdonly() (which is a bool) work correctly. Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-05gfs2: gfs2_create_inode: Keep glock across iputAndreas Gruenbacher
On failure, keep the inode glock across the final iput of the new inode so that gfs2_evict_inode doesn't have to re-acquire the glock. That way, gfs2_evict_inode won't need to revalidate the block type. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-07-05gfs2: Protect gl->gl_object by spin lockAndreas Gruenbacher
Put all remaining accesses to gl->gl_object under the gl->gl_lockref.lock spinlock to prevent races. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-07-05gfs2: Get rid of flush_delayed_work in gfs2_evict_inodeAndreas Gruenbacher
So far, gfs2_evict_inode clears gl->gl_object and then flushes the glock work queue to make sure that inode glops which dereference gl->gl_object have finished running before the inode is destroyed. However, flushing the work queue may do more work than needed, and in particular, it may call into DLM, which we want to avoid here. Use a bit lock (GIF_GLOP_PENDING) to synchronize between the inode glops and gfs2_evict_inode instead to get rid of the flushing. In addition, flush the work queues of existing glocks before reusing them for new inodes to get those glocks into a known state: the glock state engine currently doesn't handle glock re-appropriation correctly. (We may be able to fix the glock state engine instead later.) Based on a patch by Steven Whitehouse <swhiteho@redhat.com>. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-04-03Revert "GFS2: Wait for iopen glock dequeues"Andreas Gruenbacher
Revert commit 86d067a797d4e8546a7c92b985f31e8cd3ec39ad: it turns out that waiting for iopen glock dequeues here isn't needed anymore because the bugs that commit was meant to fix have been fixed otherwise. In addition, we want to avoid waiting on glocks in gfs2_evict_inode in shrinker context because the shrinker may be invoked on behalf of DLM, in which case calling into DLM again would deadlock. This commit makes the described scenario less likely without completely avoiding it; it's still a step in the right direction, though. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-03-16GFS2: Prevent BUG from occurring when normal Withdraws occurBob Peterson
When the GFS2 file system withdraws due to metadata corruption, it often has outstanding transactions in the journal and delayed work queued for its glocks. This patch adds some new checks for a withdrawn file system before proceeding with operations that would obviously cause a BUG() to be triggered. That allows GFS2 to be safely unmounted rather than cause the system to go down. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-03-02sched/headers: Prepare to move signal wakeup & sigpending methods from ↵Ingo Molnar
<linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-02GFS2: use BIT() macroFabian Frederick
Replace 1 << value shift by more explicit BIT() macro Also fixes two bare unsigned definitions: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' + unsigned hsize = BIT(ip->i_depth); Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-06-27gfs2: Lock holder cleanupAndreas Gruenbacher
Make the code more readable by cleaning up the different ways of initializing lock holders and checking for initialized lock holders: mark lock holders as uninitialized by setting the holder's glock to NULL (gfs2_holder_mark_uninitialized) instead of zeroing out the entire object or using a separate flag. Recognize initialized holders by their non-NULL glock (gfs2_holder_initialized). Don't zero out holder objects which are immeditiately initialized via gfs2_holder_init or gfs2_glock_nq_init. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-04-10don't bother with ->d_inode->i_sb - it's always equal to ->d_sbAl Viro
... and neither can ever be NULL Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-14GFS2: Check if iopen is held when deleting inodeBob Peterson
This patch fixes an error condition in which an inode is partially created in gfs2_create_inode() but then some error is discovered, which causes it to fail and call iput() before the iopen glock is created or held. In that case, gfs2_delete_inode would try to unlock an iopen glock that doesn't yet exist. Therefore, we test its holder (which must exist) for the HIF_HOLDER bit before trying to dq it. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-12-18GFS2: Truncate address space mapping when deleting an inodeBob Peterson
In function gfs2_delete_inode() we write and flush the mapping for a glock, among other things. We truncate the mapping for the inode, but we never truncate the mapping for the glock. This patch makes it also truncate the metamapping. This avoid cases where the glock is reused by another process who is trying to recreate an inode in its place using the same block. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-12-18GFS2: Wait for iopen glock dequeuesBob Peterson
This patch changes every glock_dq for iopen glocks into a dq_wait. This makes sure that iopen glocks do not outlive the inode itself. In turn, that ensures that anyone trying to unlink the glock will be able to find the inode when it receives a remote iopen callback. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-12-14gfs2: clear journal live bit in gfs2_log_flushBenjamin Marzinski
When gfs2 was unmounting filesystems or changing them to read-only it was clearing the SDF_JOURNAL_LIVE bit before the final log flush. This caused a race. If an inode glock got demoted in the gap between clearing the bit and the shutdown flush, it would be unable to reserve log space to clear out the active items list in inode_go_sync, causing an error in inode_go_inval because the glock was still dirty. To solve this, the SDF_JOURNAL_LIVE bit is now cleared inside the shutdown log flush. This means that, because of the locking on the log blocks, either inode_go_sync will be able to reserve space to clean the glock before the shutdown flush, or the shutdown flush will clean the glock itself, before inode_go_sync fails to reserve the space. Either way, the glock will be clean before inode_go_inval. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-14gfs2: change gfs2 readdir cookieBenjamin Marzinski
gfs2 currently returns 31 bits of filename hash as a cookie that readdir uses for an offset into the directory. When there are a large number of directory entries, the likelihood of a collision goes up way too quickly. GFS2 will now return cookies that are guaranteed unique for a while, and then fail back to using 30 bits of filename hash. Specifically, the directory leaf blocks are divided up into chunks based on the minimum size of a gfs2 directory entry (48 bytes). Each entry's cookie is based off the chunk where it starts, in the linked list of leaf blocks that it hashes to (there are 131072 hash buckets). Directory entries will have unique names until they take reach chunk 8192. Assuming the largest filenames possible, and the least efficient spacing possible, this new method will still be able to return unique names when the previous method has statistically more than a 99% chance of a collision. The non-unique names it fails back to are guaranteed to not collide with the unique names. unique cookies will be in this format: - 1 bit "0" to make sure the the returned cookie is positive - 17 bits for the hash table index - 1 bit for the mode "0" - 13 bits for the offset non-unique cookies will be in this format: - 1 bit "0" to make sure the the returned cookie is positive - 17 bits for the hash table index - 1 bit for the mode "1" - 13 more bits of the name hash Another benefit of location based cookies, is that once a directory's exhash table is fully extended (so that multiple hash table indexs do not use the same leaf blocks), gfs2 can skip sorting the directory entries until it reaches the non-unique ones, and then it only needs to sort these. This provides a significant speed up for directory reads of very large directories. The only issue is that for these cookies to continue to point to the correct entry as files are added and removed from the directory, gfs2 must keep the entries at the same offset in the leaf block when they are split (see my previous patch). This means that until all the nodes in a cluster are running with code that will split the directory leaf blocks this way, none of the nodes can use the new cookie code. To deal with this, gfs2 now has the mount option loccookie, which, if set, will make it return these new location based cookies. This option must not be set until all nodes in the cluster are at least running this version of the kernel code, and you have guaranteed that there are no outstanding cookies required by other software, such as NFS. gfs2 uses some of the extra space at the end of the gfs2_dirent structure to store the calculated readdir cookies. This keeps us from needing to allocate a seperate array to hold these values. gfs2 recomputes the cookie stored in de_cookie for every readdir call. The time it takes to do so is small, and if gfs2 expected this value to be saved on disk, the new code wouldn't work correctly on filesystems created with an earlier version of gfs2. One issue with adding de_cookie to the union in the gfs2_dirent structure is that it caused the union to align itself to a 4 byte boundary, instead of its previous 2 byte boundary. This changed the offset of de_rahead. To solve that, I pulled de_rahead out of the union, since it does not need to be there. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-14GFS2: Update master statfs buffer with sd_statfs_spin lockedBob Peterson
Before this patch, function update_statfs called gfs2_statfs_change_out to update the master statfs buffer without the sd_statfs_spin held. In theory, another process could call gfs2_statfs_sync, which takes the sd_statfs_spin lock and re-reads m_sc from the buffer. So there's a theoretical timing window in which one process could write the master statfs buffer, then another comes along and re-reads it, wiping out the changes. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-14GFS2: Make rgrp reservations part of the gfs2_inode structureBob Peterson
Before this patch, multi-block reservation structures were allocated from a special slab. This patch folds the structure into the gfs2_inode structure. The disadvantage is that the gfs2_inode needs more memory, even when a file is opened read-only. The advantages are: (a) we don't need the special slab and the extra time it takes to allocate and deallocate from it. (b) we no longer need to worry that the structure exists for things like quota management. (c) This also allows us to remove the calls to get_write_access and put_write_access since we know the structure will exist. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-11-24GFS2: Extract quota data from reservations structure (revert 5407e24)Bob Peterson
This patch basically reverts the majority of patch 5407e24. That patch eliminated the gfs2_qadata structure in favor of just using the reservations structure. The problem with doing that is that it increases the size of the reservations structure. That is not an issue until it comes time to fold the reservations structure into the inode in memory so we know it's always there. By separating out the quota structure again, we aren't punishing the non-quota users by making all the inodes bigger, requiring more slab space. This patch creates a new slab area to allocate the quota stuff so it's managed a little more sanely. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-11-16gfs2: Extended attribute readaheadAndreas Gruenbacher
When gfs2 allocates an inode and its extended attribute block next to each other at inode create time, the inode's directory entry indicates that in de_rahead. In that case, we can readahead the extended attribute block when we read in the inode. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-09-04fs: create and use seq_show_option for escapingKees Cook
Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky: $ BASE="ovl" $ MNT="$BASE/mnt" $ LOW="$BASE/lower" $ UP="$BASE/upper" $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000" $ mkdir -p "$LOW" "$UP" "$WORK" $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt $ cat /proc/mounts none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0 $ fusermount -u /proc $ cat /proc/mounts cat: /proc/mounts: No such file or directory This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [akpm@linux-foundation.org: add lost chunk, per Kees] [keescook@chromium.org: seq_show_option should be using const parameters] Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: J. R. Okajima <hooanon05g@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-02writeback: move bandwidth related fields from backing_dev_info into ↵Tejun Heo
bdi_writeback Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bandwidth related fields from backing_dev_info into bdi_writeback. * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp, write_bandwidth, avg_write_bandwidth, dirty_ratelimit, balanced_dirty_ratelimit, completions and dirty_exceeded. * writeback_chunk_size() and over_bground_thresh() now take @wb instead of @bdi. * bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...) bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...) bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...) bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...) [__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...) bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...) bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...) * Init/exits of the relocated fields are moved to bdi_wb_init/exit() respectively. Note that explicit zeroing is dropped in the process as wb's are cleared in entirety anyway. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[] introducing no behavior changes. v2: Typo in description fixed as suggested by Jan. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-15VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-20fs: export inode_to_bdi and use it in favor of mapping->backing_dev_infoChristoph Hellwig
Now that we got rid of the bdi abuse on character devices we can always use sb->s_bdi to get at the backing_dev_info for a file, except for the block device special case. Export inode_to_bdi and replace uses of mapping->backing_dev_info with it to prepare for the removal of mapping->backing_dev_info. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-17GFS2: update freeze code to use freeze/thaw_super on all nodesBenjamin Marzinski
The current gfs2 freezing code is considerably more complicated than it should be because it doesn't use the vfs freezing code on any node except the one that begins the freeze. This is because it needs to acquire a cluster glock before calling the vfs code to prevent a deadlock, and without the new freeze_super and thaw_super hooks, that was impossible. To deal with the issue, gfs2 had to do some hacky locking tricks to make sure that a frozen node couldn't be holding on a lock it needed to do the unfreeze ioctl. This patch makes use of the new hooks to simply the gfs2 locking code. Now, all the nodes in the cluster freeze and thaw in exactly the same way. Every node in the cluster caches the freeze glock in the shared state. The new freeze_super hook allows the freezing node to grab this freeze glock in the exclusive state without first calling the vfs freeze_super function. All the nodes in the cluster see this lock change, and call the vfs freeze_super function. The vfs locking code guarantees that the nodes can't get stuck holding the glocks necessary to unfreeze the system. To unfreeze, the freezing node uses the new thaw_super hook to drop the freeze glock. Again, all the nodes notice this, reacquire the glock in shared mode and call the vfs thaw_super function. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>