summaryrefslogtreecommitdiff
path: root/fs/gfs2/glops.c
AgeCommit message (Collapse)Author
2019-05-07gfs2: read journal in large chunksAbhi Das
Use bios to read in the journal into the address space of the journal inode (jd_inode), sequentially and in large chunks. This is faster for locating the journal head that the previous binary search approach. When performing recovery, we keep the journal in the address space until recovery is done, which further speeds up things. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-02-14Revert "gfs2: read journal in large chunks to locate the head"Bob Peterson
This reverts commit 2a5f14f279f59143139bcd1606903f2f80a34241. This patch causes xfstests generic/311 to fail. Reverting this for now until we have a proper fix. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-12gfs2: Dump nrpages for inodes and their glocksBob Peterson
This patch is based on an idea from Steve Whitehouse. The idea is to dump the number of pages for inodes in the glock dumps. The additional locking required me to drop const from quite a few places. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-12-11gfs2: read journal in large chunks to locate the headAbhi Das
Use bio(s) to read in the journal sequentially in large chunks and locate the head of the journal. This version addresses the issues Christoph pointed out w.r.t error handling and using deprecated API. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Cc: Christoph Hellwig <hch@infradead.org>
2018-06-05vfs: change inode times to use struct timespec64Deepa Dinamani
struct timespec is not y2038 safe. Transition vfs to use y2038 safe struct timespec64 instead. The change was made with the help of the following cocinelle script. This catches about 80% of the changes. All the header file and logic changes are included in the first 5 rules. The rest are trivial substitutions. I avoid changing any of the function signatures or any other filesystem specific data structures to keep the patch simple for review. The script can be a little shorter by combining different cases. But, this version was sufficient for my usecase. virtual patch @ depends on patch @ identifier now; @@ - struct timespec + struct timespec64 current_time ( ... ) { - struct timespec now = current_kernel_time(); + struct timespec64 now = current_kernel_time64(); ... - return timespec_trunc( + return timespec64_trunc( ... ); } @ depends on patch @ identifier xtime; @@ struct \( iattr \| inode \| kstat \) { ... - struct timespec xtime; + struct timespec64 xtime; ... } @ depends on patch @ identifier t; @@ struct inode_operations { ... int (*update_time) (..., - struct timespec t, + struct timespec64 t, ...); ... } @ depends on patch @ identifier t; identifier fn_update_time =~ "update_time$"; @@ fn_update_time (..., - struct timespec *t, + struct timespec64 *t, ...) { ... } @ depends on patch @ identifier t; @@ lease_get_mtime( ... , - struct timespec *t + struct timespec64 *t ) { ... } @te depends on patch forall@ identifier ts; local idexpression struct inode *inode_node; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; identifier fn_update_time =~ "update_time$"; identifier fn; expression e, E3; local idexpression struct inode *node1; local idexpression struct inode *node2; local idexpression struct iattr *attr1; local idexpression struct iattr *attr2; local idexpression struct iattr attr; identifier i_xtime1 =~ "^i_[acm]time$"; identifier i_xtime2 =~ "^i_[acm]time$"; identifier ia_xtime1 =~ "^ia_[acm]time$"; identifier ia_xtime2 =~ "^ia_[acm]time$"; @@ ( ( - struct timespec ts; + struct timespec64 ts; | - struct timespec ts = current_time(inode_node); + struct timespec64 ts = current_time(inode_node); ) <+... when != ts ( - timespec_equal(&inode_node->i_xtime, &ts) + timespec64_equal(&inode_node->i_xtime, &ts) | - timespec_equal(&ts, &inode_node->i_xtime) + timespec64_equal(&ts, &inode_node->i_xtime) | - timespec_compare(&inode_node->i_xtime, &ts) + timespec64_compare(&inode_node->i_xtime, &ts) | - timespec_compare(&ts, &inode_node->i_xtime) + timespec64_compare(&ts, &inode_node->i_xtime) | ts = current_time(e) | fn_update_time(..., &ts,...) | inode_node->i_xtime = ts | node1->i_xtime = ts | ts = inode_node->i_xtime | <+... attr1->ia_xtime ...+> = ts | ts = attr1->ia_xtime | ts.tv_sec | ts.tv_nsec | btrfs_set_stack_timespec_sec(..., ts.tv_sec) | btrfs_set_stack_timespec_nsec(..., ts.tv_nsec) | - ts = timespec64_to_timespec( + ts = ... -) | - ts = ktime_to_timespec( + ts = ktime_to_timespec64( ...) | - ts = E3 + ts = timespec_to_timespec64(E3) | - ktime_get_real_ts(&ts) + ktime_get_real_ts64(&ts) | fn(..., - ts + timespec64_to_timespec(ts) ,...) ) ...+> ( <... when != ts - return ts; + return timespec64_to_timespec(ts); ...> ) | - timespec_equal(&node1->i_xtime1, &node2->i_xtime2) + timespec64_equal(&node1->i_xtime2, &node2->i_xtime2) | - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2) + timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2) | - timespec_compare(&node1->i_xtime1, &node2->i_xtime2) + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2) | node1->i_xtime1 = - timespec_trunc(attr1->ia_xtime1, + timespec64_trunc(attr1->ia_xtime1, ...) | - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2, + attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2, ...) | - ktime_get_real_ts(&attr1->ia_xtime1) + ktime_get_real_ts64(&attr1->ia_xtime1) | - ktime_get_real_ts(&attr.ia_xtime1) + ktime_get_real_ts64(&attr.ia_xtime1) ) @ depends on patch @ struct inode *node; struct iattr *attr; identifier fn; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; expression e; @@ ( - fn(node->i_xtime); + fn(timespec64_to_timespec(node->i_xtime)); | fn(..., - node->i_xtime); + timespec64_to_timespec(node->i_xtime)); | - e = fn(attr->ia_xtime); + e = fn(timespec64_to_timespec(attr->ia_xtime)); ) @ depends on patch forall @ struct inode *node; struct iattr *attr; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; identifier fn; @@ { + struct timespec ts; <+... ( + ts = timespec64_to_timespec(node->i_xtime); fn (..., - &node->i_xtime, + &ts, ...); | + ts = timespec64_to_timespec(attr->ia_xtime); fn (..., - &attr->ia_xtime, + &ts, ...); ) ...+> } @ depends on patch forall @ struct inode *node; struct iattr *attr; struct kstat *stat; identifier ia_xtime =~ "^ia_[acm]time$"; identifier i_xtime =~ "^i_[acm]time$"; identifier xtime =~ "^[acm]time$"; identifier fn, ret; @@ { + struct timespec ts; <+... ( + ts = timespec64_to_timespec(node->i_xtime); ret = fn (..., - &node->i_xtime, + &ts, ...); | + ts = timespec64_to_timespec(node->i_xtime); ret = fn (..., - &node->i_xtime); + &ts); | + ts = timespec64_to_timespec(attr->ia_xtime); ret = fn (..., - &attr->ia_xtime, + &ts, ...); | + ts = timespec64_to_timespec(attr->ia_xtime); ret = fn (..., - &attr->ia_xtime); + &ts); | + ts = timespec64_to_timespec(stat->xtime); ret = fn (..., - &stat->xtime); + &ts); ) ...+> } @ depends on patch @ struct inode *node; struct inode *node2; identifier i_xtime1 =~ "^i_[acm]time$"; identifier i_xtime2 =~ "^i_[acm]time$"; identifier i_xtime3 =~ "^i_[acm]time$"; struct iattr *attrp; struct iattr *attrp2; struct iattr attr ; identifier ia_xtime1 =~ "^ia_[acm]time$"; identifier ia_xtime2 =~ "^ia_[acm]time$"; struct kstat *stat; struct kstat stat1; struct timespec64 ts; identifier xtime =~ "^[acmb]time$"; expression e; @@ ( ( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ; | node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \); | node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \); | node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \); | stat->xtime = node2->i_xtime1; | stat1.xtime = node2->i_xtime1; | ( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ; | ( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2; | - e = node->i_xtime1; + e = timespec64_to_timespec( node->i_xtime1 ); | - e = attrp->ia_xtime1; + e = timespec64_to_timespec( attrp->ia_xtime1 ); | node->i_xtime1 = current_time(...); | node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = - e; + timespec_to_timespec64(e); | node->i_xtime1 = node->i_xtime3 = - e; + timespec_to_timespec64(e); | - node->i_xtime1 = e; + node->i_xtime1 = timespec_to_timespec64(e); ) Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: <anton@tuxera.com> Cc: <balbi@kernel.org> Cc: <bfields@fieldses.org> Cc: <darrick.wong@oracle.com> Cc: <dhowells@redhat.com> Cc: <dsterba@suse.com> Cc: <dwmw2@infradead.org> Cc: <hch@lst.de> Cc: <hirofumi@mail.parknet.co.jp> Cc: <hubcap@omnibond.com> Cc: <jack@suse.com> Cc: <jaegeuk@kernel.org> Cc: <jaharkes@cs.cmu.edu> Cc: <jslaby@suse.com> Cc: <keescook@chromium.org> Cc: <mark@fasheh.com> Cc: <miklos@szeredi.hu> Cc: <nico@linaro.org> Cc: <reiserfs-devel@vger.kernel.org> Cc: <richard@nod.at> Cc: <sage@redhat.com> Cc: <sfrench@samba.org> Cc: <swhiteho@redhat.com> Cc: <tj@kernel.org> Cc: <trond.myklebust@primarydata.com> Cc: <tytso@mit.edu> Cc: <viro@zeniv.linux.org.uk>
2018-01-23GFS2: Log the reason for log flushes in every log headerBob Peterson
This patch just adds the capability for GFS2 to track which function called gfs2_log_flush. This should make it easier to diagnose problems based on the sequence of events found in the journals. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-01-23GFS2: Introduce new gfs2_log_header_v2Bob Peterson
This patch adds a new structure called gfs2_log_header_v2 which is used to store expanded fields into previously unused areas of the log headers (i.e., this change is backwards compatible). Some of these are used for debug purposes so we can backtrack when problems occur. Others are reserved for future expansion. This patch is based on a prototype from Steve Whitehouse. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-09-14Merge branch 'work.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount flag updates from Al Viro: "Another chunk of fmount preparations from dhowells; only trivial conflicts for that part. It separates MS_... bits (very grotty mount(2) ABI) from the struct super_block ->s_flags (kernel-internal, only a small subset of MS_... stuff). This does *not* convert the filesystems to new constants; only the infrastructure is done here. The next step in that series is where the conflicts would be; that's the conversion of filesystems. It's purely mechanical and it's better done after the merge, so if you could run something like list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$') sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \ -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \ -e 's/\<MS_NODEV\>/SB_NODEV/g' \ -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \ -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \ -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \ -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \ -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \ -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \ -e 's/\<MS_SILENT\>/SB_SILENT/g' \ -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \ -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \ -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \ -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \ $list and commit it with something along the lines of 'convert filesystems away from use of MS_... constants' as commit message, it would save a quite a bit of headache next cycle" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: VFS: Differentiate mount flags (MS_*) from internal superblock flags VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-08-10gfs2: Get rid of gfs2_set_nlinkAndreas Gruenbacher
Remove gfs2_set_nlink which prevents the link count of an inode from becoming non-zero once it has reached zero. The next commit reduces the amount of waiting on glocks when an inode is evicted from memory. With that, an inode can become reallocated before all the remote-unlink callbacks from a previous delete are processed, which causes the link count to change from zero to non-zero. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-07-21GFS2: fix code parameter error in inode_go_lockWang Xibo
In inode_go_lock() function, the parameter order of list_add() is error. According to the define of list_add(), the first parameter is new entry and the second is the list head, so ip->i_trunc_list should be the first parameter and the sdp->sd_trunc_list should be second. Signed-off-by: Wang Xibo<wang.xibo@zte.com.cn> Signed-off-by: Xiao Likun<xiao.likun@zte.com.cn> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-07-17VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)David Howells
Firstly by applying the following with coccinelle's spatch: @@ expression SB; @@ -SB->s_flags & MS_RDONLY +sb_rdonly(SB) to effect the conversion to sb_rdonly(sb), then by applying: @@ expression A, SB; @@ ( -(!sb_rdonly(SB)) && A +!sb_rdonly(SB) && A | -A != (sb_rdonly(SB)) +A != sb_rdonly(SB) | -A == (sb_rdonly(SB)) +A == sb_rdonly(SB) | -!(sb_rdonly(SB)) +!sb_rdonly(SB) | -A && (sb_rdonly(SB)) +A && sb_rdonly(SB) | -A || (sb_rdonly(SB)) +A || sb_rdonly(SB) | -(sb_rdonly(SB)) != A +sb_rdonly(SB) != A | -(sb_rdonly(SB)) == A +sb_rdonly(SB) == A | -(sb_rdonly(SB)) && A +sb_rdonly(SB) && A | -(sb_rdonly(SB)) || A +sb_rdonly(SB) || A ) @@ expression A, B, SB; @@ ( -(sb_rdonly(SB)) ? 1 : 0 +sb_rdonly(SB) | -(sb_rdonly(SB)) ? A : B +sb_rdonly(SB) ? A : B ) to remove left over excess bracketage and finally by applying: @@ expression A, SB; @@ ( -(A & MS_RDONLY) != sb_rdonly(SB) +(bool)(A & MS_RDONLY) != sb_rdonly(SB) | -(A & MS_RDONLY) == sb_rdonly(SB) +(bool)(A & MS_RDONLY) == sb_rdonly(SB) ) to make comparisons against the result of sb_rdonly() (which is a bool) work correctly. Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-05gfs2: Protect gl->gl_object by spin lockAndreas Gruenbacher
Put all remaining accesses to gl->gl_object under the gl->gl_lockref.lock spinlock to prevent races. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-07-05gfs2: Get rid of flush_delayed_work in gfs2_evict_inodeAndreas Gruenbacher
So far, gfs2_evict_inode clears gl->gl_object and then flushes the glock work queue to make sure that inode glops which dereference gl->gl_object have finished running before the inode is destroyed. However, flushing the work queue may do more work than needed, and in particular, it may call into DLM, which we want to avoid here. Use a bit lock (GIF_GLOP_PENDING) to synchronize between the inode glops and gfs2_evict_inode instead to get rid of the flushing. In addition, flush the work queues of existing glocks before reusing them for new inodes to get those glocks into a known state: the glock state engine currently doesn't handle glock re-appropriation correctly. (We may be able to fix the glock state engine instead later.) Based on a patch by Steven Whitehouse <swhiteho@redhat.com>. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-04-05GFS2: Get rid of dead code in inode_go_demote_okBob Peterson
Function inode_go_demote_ok had some code that was only executed if gl_holders was not empty. However, if gl_holders was not empty, the only caller, demote_ok(), returns before inode_go_demote_ok would ever be called. Therefore, it's dead code, so I removed it. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-12-24gfs2: Invalid security labels of inodes when they go invalidAndreas Gruenbacher
When gfs2 releases the glock of an inode, it must invalidate all information cached for that inode, including the page cache and acls. Use the new security_inode_invalidate_secctx hook to also invalidate security labels in that case. These items will be reread from disk when needed after reacquiring the glock. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Cc: cluster-devel@redhat.com [PM: fixed spelling errors and description line lengths] Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-10-29gfs2: Remove gl_spin defineAndreas Gruenbacher
Commit e66cf161 replaced the gl_spin spinlock in struct gfs2_glock with a gl_lockref lockref and defined gl_spin as gl_lockref.lock (the spinlock in gl_lockref). Remove that define to make the references to gl_lockref.lock more obvious. Signed-off-by: Andreas Gruenbacher <andreas.gruenbacher@gmail.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-09-03GFS2: Move glock superblock pointer to field gl_nameBob Peterson
What uniquely identifies a glock in the glock hash table is not gl_name, but gl_name and its superblock pointer. This patch makes the gl_name field correspond to a unique glock identifier. That will allow us to simplify hashing with a future patch, since the hash algorithm can then take the gl_name and hash its components in one operation. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-06-19GFS2: Don't brelse rgrp buffer_heads every allocationBob Peterson
This patch allows the block allocation code to retain the buffers for the resource groups so they don't need to be re-read from buffer cache with every request. This is a performance improvement that's especially noticeable when resource groups are very large. For example, with 2GB resource groups and 4K blocks, there can be 33 blocks for every resource group. This patch allows those 33 buffers to be kept around and not read in and thrown away with every operation. The buffers are released when the resource group is either synced or invalidated. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Steven Whitehouse <swhiteho@redhat.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
2015-06-18GFS2: Don't add all glocks to the lruBob Peterson
The glocks used for resource groups often come and go hundreds of thousands of times per second. Adding them to the lru list just adds unnecessary contention for the lru_lock spin_lock, especially considering we're almost certainly going to re-use the glock and take it back off the lru microseconds later. We never want the glock shrinker to cull them anyway. This patch adds a new bit in the glops that determines which glock types get put onto the lru list and which ones don't. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-17GFS2: update freeze code to use freeze/thaw_super on all nodesBenjamin Marzinski
The current gfs2 freezing code is considerably more complicated than it should be because it doesn't use the vfs freezing code on any node except the one that begins the freeze. This is because it needs to acquire a cluster glock before calling the vfs code to prevent a deadlock, and without the new freeze_super and thaw_super hooks, that was impossible. To deal with the issue, gfs2 had to do some hacky locking tricks to make sure that a frozen node couldn't be holding on a lock it needed to do the unfreeze ioctl. This patch makes use of the new hooks to simply the gfs2 locking code. Now, all the nodes in the cluster freeze and thaw in exactly the same way. Every node in the cluster caches the freeze glock in the shared state. The new freeze_super hook allows the freezing node to grab this freeze glock in the exclusive state without first calling the vfs freeze_super function. All the nodes in the cluster see this lock change, and call the vfs freeze_super function. The vfs locking code guarantees that the nodes can't get stuck holding the glocks necessary to unfreeze the system. To unfreeze, the freezing node uses the new thaw_super hook to drop the freeze glock. Again, all the nodes notice this, reacquire the glock in shared mode and call the vfs thaw_super function. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-10-08GFS2: use _RET_IP_ instead of (unsigned long)__builtin_return_address(0)Fabian Frederick
use macro definition Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18GFS2: memcontrol: Spelling s/invlidate/invalidate/Geert Uytterhoeven
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: cluster-devel@redhat.com Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-06-04Merge tag 'gfs2-merge-window' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw into next Pull gfs2 updates from Steven Whitehouse: "This must be about the smallest merge window patch set ever for GFS2. It is probably also the first one without a single patch from me. That is down to a combination of factors, and I have some things in the works that are not quite ready yet, that I hope to put in next time around. Returning to what is here this time... we have 3 patches which fix various warnings. Two are bug fixes (for quotas and also a rare recovery race condition). The final patch, from Ben Marzinski, is an important change in the freeze code which has been in progress for some time. This removes the need to take and drop the transaction lock for every single transaction, when the only time it was used, was at file system freeze time. Ben's patch integrates the freeze operation into the journal flush code as an alternative with lower overheads and also lands up resolving some difficult to fix races at the same time" * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: GFS2: Prevent recovery before the local journal is set GFS2: fs/gfs2/file.c: kernel-doc warning fixes GFS2: fs/gfs2/bmap.c: kernel-doc warning fixes GFS2: remove transaction glock GFS2: lops.c: replace 0 by NULL for pointers GFS2: quotas not being refreshed in gfs2_adjust_quota
2014-05-14GFS2: remove transaction glockBenjamin Marzinski
GFS2 has a transaction glock, which must be grabbed for every transaction, whose purpose is to deal with freezing the filesystem. Aside from this involving a large amount of locking, it is very easy to make the current fsfreeze code hang on unfreezing. This patch rewrites how gfs2 handles freezing the filesystem. The transaction glock is removed. In it's place is a freeze glock, which is cached (but not held) in a shared state by every node in the cluster when the filesystem is mounted. This lock only needs to be grabbed on freezing, and actions which need to be safe from freezing, like recovery. When a node wants to freeze the filesystem, it grabs this glock exclusively. When the freeze glock state changes on the nodes (either from shared to unlocked, or shared to exclusive), the filesystem does a special log flush. gfs2_log_flush() does all the work for flushing out the and shutting down the incore log, and then it tries to grab the freeze glock in a shared state again. Since the filesystem is stuck in gfs2_log_flush, no new transaction can start, and nothing can be written to disk. Unfreezing the filesytem simply involes dropping the freeze glock, allowing gfs2_log_flush() to grab and then release the shared lock, so it is cached for next time. However, in order for the unfreezing ioctl to occur, gfs2 needs to get a shared lock on the filesystem root directory inode to check permissions. If that glock has already been grabbed exclusively, fsfreeze will be unable to get the shared lock and unfreeze the filesystem. In order to allow the unfreeze, this patch makes gfs2 grab a shared lock on the filesystem root directory during the freeze, and hold it until it unfreezes the filesystem. The functions which need to grab a shared lock in order to allow the unfreeze ioctl to be issued now use the lock grabbed by the freeze code instead. The freeze and unfreeze code take care to make sure that this shared lock will not be dropped while another process is using it. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-04-18arch: Mass conversion of smp_mb__*()Peter Zijlstra
Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-24GFS2: Move log buffer lists into transactionSteven Whitehouse
Over time, we hope to be able to improve the concurrency available in the log code. This is one small step towards that, by moving the buffer lists from the super block, and into the transaction structure, so that each transaction builds its own buffer lists. At transaction commit time, the buffer lists are merged into the currently accumulating transaction. That transaction then is passed into the before and after commit functions at journal flush time. Thus there should be no change in overall behaviour yet. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-16GFS2: Don't use ENOBUFS when ENOMEM is the correct error codeSteven Whitehouse
Al Viro has tactfully pointed out that we are using the incorrect error code in some cases. This patch fixes that, and also removes the (unused) return value for glock dumping. > * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is > "No buffer space available (POSIX.1 (XSI STREAMS option))" and since > we don't support STREAMS it's probably fair game, but... what the hell? Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk>
2014-01-03GFS2: Use only a single address space for rgrpsSteven Whitehouse
Prior to this patch, GFS2 had one address space for each rgrp, stored in the glock. This patch changes them to use a single address space in the super block. This therefore saves (sizeof(struct address_space) * nr_of_rgrps) bytes of memory and for large filesystems, that can be significant. It would be nice to be able to do something similar and merge the inode metadata address space into the same global address space. However, that is rather more complicated as the on-disk location doesn't have a 1:1 mapping with the inodes in general. So while it could be done, it will be a more complicated operation as it requires changing a lot more code paths. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-03GFS2: Use range based functions for rgrp sync/invalidationSteven Whitehouse
Each rgrp header is represented as a single extent on disk, so we can calculate the position within the address space, since we are using address spaces mapped 1:1 to the disk. This means that it is possible to use the range based versions of filemap_fdatawrite/wait and for invalidating the page cache. Our eventual intent is to then be able to merge the address spaces used for rgrps into a single address space, rather than to have one for each glock, saving memory and reducing complexity. Since during umount, the rgrp structures are disposed of before the glocks, we need to store the extent information in the glock so that is is available for a final invalidation. This patch uses a field which is otherwise unused in rgrp glocks to do that, so that we do not have to expand the size of a glock. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-20GFS2: Wait for async DIO in glock state changesSteven Whitehouse
We need to wait for any outstanding DIO to complete in a couple of situations. Firstly, in case we are changing out of deferred mode (in inode_go_sync) where GLF_DIRTY will not be set. That call could be prefixed with a test for gl_state == LM_ST_DEFERRED but it doesn't seem worth it bearing in mind that the test for outstanding DIO is very quick anyway, in the usual case that there is none. The second case is in inode_go_lock which will catch the cases where we have a cached EX lock, but where we grant deferred locks against it so that there is no glock state transistion. We only need to wait if the state is not deferred, since DIO is valid anyway in that state. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-10-15GFS2: Use lockref for glocksSteven Whitehouse
Currently glocks have an atomic reference count and also a spinlock which covers various internal fields, such as the state. This intent of this patch is to replace the spinlock and the atomic reference count with a lockref structure. This contains a spinlock which we can continue to use as before, and a reference counter which is used in conjuction with the spinlock to replace the previous atomic counter. As a result of this there are some new rules for reference counting on glocks. We need to distinguish between reference count changes under gl_spin (which are now just increment or decrement of the new counter, provided the count cannot hit zero) and those which are outside of gl_spin, but which now take gl_spin internally. The conversion is relatively straight forward. There is probably some further clean up which can be done, but the priority at this stage is to make the change in as simple a manner as possible. A consequence of this change is that the reference count is being decoupled from the lru list processing. This should allow future adoption of the lru_list code with glocks in due course. The reason for using the "dead" state and not just relying on 0 being the "invalid state" is so that in due course 0 ref counts can be allowable. The intent is to eventually be able to remove the ref count changes which are currently hidden away in state_change(). Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-08-19GFS2: don't overrun reserved revokesBenjamin Marzinski
When run during fsync, a gfs2_log_flush could happen between the time when gfs2_ail_flush checked the number of blocks to revoke, and when it actually started the transaction to do those revokes. This occassionally caused it to need more revokes than it reserved, causing gfs2 to crash. Instead of just reserving enough revokes to handle the blocks that currently need them, this patch makes gfs2_ail_flush reserve the maximum number of revokes it can, without increasing the total number of reserved log blocks. This patch also passes the number of reserved revokes to __gfs2_ail_flush() so that it doesn't go over its limit and cause a crash like we're seeing. Non-fsync calls to __gfs2_ail_flush will still cause a BUG() necessary revokes are skipped. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-19GFS2: aggressively issue revokes in gfs2_log_flushBenjamin Marzinski
This patch looks at all the outstanding blocks in all the transactions on the log, and moves the completed ones to the ail2 list. Then it issues revokes for these blocks. This will hopefully speed things up in situations where there is a lot of contention for glocks, especially if they are acquired serially. revoke_lo_before_commit will issue at most one log block's full of these preemptive revokes. The amount of reserved log space that gfs2_log_reserve() ignores has been incremented to allow for this extra block. This patch also consolidates the common revoke instructions into one function, gfs2_add_revoke(). Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-04-10GFS2: Add origin indicator to glock callbacksSteven Whitehouse
This patch adds a bool indicating whether the demote request was originated locally or remotely. This is then used by the iopen ->go_callback() to make 100% sure that it will only respond to remote callbacks. Since ->evict_inode() uses GL_NOCACHE when it attempts to get an exclusive lock on the iopen lock, this may result in extra scheduling of the workqueue in case that the exclusive promotion request failed. This patch prevents that from happening. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-02-13gfs2: Convert uids and gids between dinodes and vfs inodes.Eric W. Biederman
When reading dinodes from the disk convert uids and gids into kuids and kgids to store in vfs data structures. When writing to dinodes to the disk convert kuids and kgids in the in memory structures into plain uids and gids. For now all on disk data structures are assumed to be stored in the initial user namespace. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-15GFS2: only use lvb on glocks that need itDavid Teigland
Save the effort of allocating, reading and writing the lvb for most glocks that do not use it. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-11-07GFS2: Rename glops go_xmote_th to go_syncBob Peterson
[Editorial: This is a nit, but has been a minor irritation for a long time:] This patch renames glops structure item for go_xmote_th to go_sync. The functionality is unchanged; it's just for readability. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-11-07GFS2: Review bug traps in glops.cSteven Whitehouse
Two of the bug traps here could really be warnings. The others are converted from BUG() to GLOCK_BUG_ON() since we'll most likely need to know the glock state in order to debug any issues which arise. As a result of this, __dump_glock has to be renamed and is no longer static. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-09-24GFS2: fix s_writers.counter imbalance in gfs2_ail_empty_glEric Sandeen
gfs2_ail_empty_gl() contains an "inline version" of gfs2_trans_begin(), so it needs an explicit sb_start_intwrite() as well, to balance the sb_end_intwrite() which will be called by gfs2_trans_end(). With this, xfstest 068 passes on lock_nolock local gfs2. Without it, we reach a writer count of -1 and get stuck. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-05-08GFS2: Remove redundant metadata block type checkBob Peterson
This patch removes a redundant metadata block check. See description below. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-04-24GFS2: Remove bd_list_trSteven Whitehouse
This is another clean up in the logging code. This per-transaction list was largely unused. Its main function was to ensure that the number of buffers in a transaction was correct, however that counter was only used to check the number of buffers in the bd_list_tr, plus an assert at the end of each transaction. With the assert now changed to use the calculated buffer counts, we can remove both bd_list_tr and its associated counter. This should make the code easier to understand as well as shrinking a couple of structures. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-11-02filesystems: add set_nlink()Miklos Szeredi
Replace remaining direct i_nlink updates with a new set_nlink() updater function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-21GFS2: Fix AIL flush issue during fsyncSteven Whitehouse
Unfortunately, it is not enough to just ignore locked buffers during the AIL flush from fsync. We need to be able to ignore all buffers which are locked, dirty or pinned at this stage as they might have been added subsequent to the log flush earlier in the fsync function. In addition, this means that we no longer need to rely on i_mutex to keep out writes during fsync, so we can, as a side-effect, remove that protection too. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Tested-By: Abhijith Das <adas@redhat.com>
2011-10-21GFS2: Make resource groups "append only" during life of fsSteven Whitehouse
Since we have ruled out supporting online filesystem shrink, it is possible to make the resource group list append only during the life of a super block. This gives several benefits: Firstly, we only need to read new rindex elements as they are added rather than needing to reread the whole rindex file each time one element is added. Secondly, the rindex glock can be held for much shorter periods of time, and is completely removed from the fast path for allocations. The lock is taken in shared mode only when updating the resource groups when the first allocation occurs, and after a grow has taken place. Thirdly, this results in a reduction in code size, and everything gets a lot simpler to understand in this area. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count schemeBob Peterson
Here is an update of Bob's original rbtree patch which, in addition, also resolves the rather strange ref counting that was being done relating to the bitmap blocks. Originally we had a dual system for journaling resource groups. The metadata blocks were journaled and also the rgrp itself was added to a list. The reason for adding the rgrp to the list in the journal was so that the "repolish clones" code could be run to update the free space, and potentially send any discard requests when the log was flushed. This was done by comparing the "cloned" bitmap with what had been written back on disk during the transaction commit. Due to this, there was a requirement to hang on to the rgrps' bitmap buffers until the journal had been flushed. For that reason, there was a rather complicated set up in the ->go_lock ->go_unlock functions for rgrps involving both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference count on the buffers. However, the journal maintains a reference count on the buffers anyway, since they are being journaled as metadata buffers. So by moving the code which deals with the post-journal accounting for bitmap blocks to the metadata journaling code, we can entirely dispense with the rather strange buffer ref counting scheme and also the requirement to journal the rgrps. The net result of all this is that the ->sd_rindex_spin is left to do exactly one job, and that is to look after the rbtree or rgrps. This patch is designed to be a stepping stone towards using RCU for the rbtree of resource groups, however the reduction in the number of uses of the ->sd_rindex_spin is likely to have benefits for multi-threaded workloads, anyway. The patch retains ->go_lock and ->go_unlock for rgrps, however these maybe also be removed in future in favour of calling the functions directly where required in the code. That will allow locking of resource groups without needing to actually read them in - something that could be useful in speeding up statfs. In the mean time though it is valid to dereference ->bi_bh only when the rgrp is locked. This is basically the same rule as before, modulo the references not being valid until the following journal flush. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Cc: Benjamin Marzinski <bmarzins@redhat.com>
2011-10-21GFS2: Fix bug trap and journaled data fsyncSteven Whitehouse
Journaled data requires that a complete flush of all dirty data for the file is done, in order that the ail flush which comes after will succeed. Also the recently enhanced bug trap can trigger falsely in case an ail flush from fsync races with a page read. This updates the bug trap such that it will ignore buffers which are locked and only trigger on dirty and/or pinned buffers when the ail flush is run from fsync. The original bug trap is retained when ail flush is run from ->go_sync() Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21GFS2: Fix bug-trap in ail flush codeSteven Whitehouse
The assert was being tested under the wrong lock, a legacy of the original code. Also, if it does trigger, the resulting information was not always a lot of help. This moves the patch under the correct lock and also prints out more useful information in tacking down the source of the problem. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-07-15GFS2: Add S_NOSEC supportSteven Whitehouse
This adds S_NOSEC support to GFS2. We set/reset the flag either when a user calls setattr or when we have just regained the glock from another node. The flag is only set if there are no xattrs on the inode and there is no suid bit set. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Al Viro <viro@ZenIV.linux.org.uk>
2011-07-15GFS2: Automatically adjust glock min hold timeBob Peterson
This patch is a performance improvement for GFS2 in a clustered environment. It makes the glock hold time self-adjusting. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-07-15GFS2: Cache dir hash table in a contiguous bufferSteven Whitehouse
This patch adds a cache for the hash table to the directory code in order to help simplify the way in which the hash table is accessed. This is intended to be a first step towards introducing some performance improvements in the directory code. There are two follow ups that I'm hoping to see fairly shortly. One is to simplify the hash table reading code now that we always read the complete hash table, whether we want one entry or all of them. The other is to introduce readahead on the heads of the hash chains which are referred to from the table. The hash table is a maximum of 128k in size, so it is not worth trying to read it in small chunks. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>