linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2018-01-29	tools/virtio: fix smp_mb on x86	Michael S. Tsirkin
	Offset 128 overlaps the last word of the redzone. Use 132 which is always beyond that. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	tools/virtio: copy READ/WRITE_ONCE	Michael S. Tsirkin
	This is to make ptr_ring test build again. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	tools/virtio: more stubs to fix tools build	Michael S. Tsirkin
	Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	tools/virtio: switch to __ptr_ring_empty	Michael S. Tsirkin
	We don't rely on lockless guarantees, but it seems cleaner than inverting __ptr_ring_peek. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	ptr_ring: prevent queue load/store tearing	Michael S. Tsirkin
	In theory compiler could tear queue loads or stores in two. It does not seem to be happening in practice but it seems easier to convert the cases where this would be a problem to READ/WRITE_ONCE than worry about it. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	skb_array: use __ptr_ring_empty	Michael S. Tsirkin
	__skb_array_empty should use __ptr_ring_empty since that's the only legal lockless function. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	Revert "net: ptr_ring: otherwise safe empty checks can overrun array bounds"	Michael S. Tsirkin
	This reverts commit bcecb4bbf88aa03171c30652bca761cf27755a6b. If we try to allocate an extra entry as the above commit did, and when the requested size is UINT_MAX, addition overflows causing zero size to be passed to kmalloc(). kmalloc then returns ZERO_SIZE_PTR with a subsequent crash. Reported-by: syzbot+87678bcf753b44c39b67@syzkaller.appspotmail.com Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	ptr_ring: disallow lockless __ptr_ring_full	Michael S. Tsirkin
	Similar to bcecb4bbf88a ("net: ptr_ring: otherwise safe empty checks can overrun array bounds") a lockless use of __ptr_ring_full might cause an out of bounds access. We can fix this, but it's easier to just disallow lockless __ptr_ring_full for now. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	tap: fix use-after-free	Michael S. Tsirkin
	Lockless access to __ptr_ring_full is only legal if ring is never resized, otherwise it might cause use-after free errors. Simply drop the lockless test, we'll drop the packet a bit later when produce fails. Fixes: 362899b8 ("macvtap: switch to use skb array") Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	ptr_ring: READ/WRITE_ONCE for __ptr_ring_empty	Michael S. Tsirkin
	Lockless __ptr_ring_empty requires that consumer head is read and written at once, atomically. Annotate accordingly to make sure compiler does it correctly. Switch locked callers to __ptr_ring_peek which does not support the lockless operation. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	ptr_ring: clean up documentation	Michael S. Tsirkin
	The only function safe to call without locks is __ptr_ring_empty. Move documentation about lockless use there to make sure people do not try to use __ptr_ring_peek outside locks. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	ptr_ring: keep consumer_head valid at all times	Michael S. Tsirkin
	The comment near __ptr_ring_peek says: * If ring is never resized, and if the pointer is merely * tested, there's no need to take the lock - see e.g. __ptr_ring_empty. but this was in fact never possible since consumer_head would sometimes point outside the ring. Refactor the code so that it's always pointing within a ring. Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array") Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	GFS2: Don't try to end a non-existent transaction in unlink	Bob Peterson
	Before this patch, if function gfs2_unlink failed to get a valid transaction (for example, not enough journal blocks) it would go to label out_end_trans which did gfs2_trans_end. But if the trans_begin failed, there's no transaction to end, and trying to do so results in: kernel BUG at fs/gfs2/trans.c:117! This patch changes the goto so that it does not try to end a non-existent transaction. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-29	ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only	Martin KaFai Lau
	If a sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, it implicitly implies it is an ipv6only socket. However, in inet6_bind(), this addr_type checking and setting sk->sk_ipv6only to 1 are only done after sk->sk_prot->get_port(sk, snum) has been completed successfully. This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses the 'get_port()'. In particular, when binding SO_REUSEPORT UDP sockets, udp_reuseport_add_sock(sk,...) is called. udp_reuseport_add_sock() checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to sk2->sk_reuseport_cb. In this case, ipv6_only_sock(sk2) could be 1 while ipv6_only_sock(sk) is still 0 here. The end result is, reuseport_alloc(sk) is called instead of adding sk to the existing sk2->sk_reuseport_cb. It can be reproduced by binding two SO_REUSEPORT UDP sockets on an IPv6 address (!ANY and !MAPPED). Only one of the socket will receive packet. The fix is to set the implicit sk_ipv6only before calling get_port(). The original sk_ipv6only has to be saved such that it can be restored in case get_port() failed. The situation is similar to the inet_reset_saddr(sk) after get_port() has failed. Thanks to Calvin Owens <calvinowens@fb.com> who created an easy reproduction which leads to a fix. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	Merge branch 'rtnetlink-enable-IFLA_IF_NETNSID-for-RTM_DELLINK-RTM_SETINK'	David S. Miller
	Christian Brauner says: ==================== rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK Based on the previous discussion this enables passing a IFLA_IF_NETNSID property along with RTM_SETLINK and RTM_DELLINK requests. The patch for RTM_NEWLINK will be sent out in a separate patch since there are more corner-cases to think about. Changelog 2018-01-24: * Preserve old behavior and report -ENODEV when either ifindex or ifname is provided and IFLA_GROUP is set. Spotted by Wolfgang Bumiller. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK	Christian Brauner
	- Backwards Compatibility: If userspace wants to determine whether RTM_DELLINK supports the IFLA_IF_NETNSID property they should first send an RTM_GETLINK request with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply does not include IFLA_IF_NETNSID userspace should assume that IFLA_IF_NETNSID is not supported on this kernel. If the reply does contain an IFLA_IF_NETNSID property userspace can send an RTM_DELLINK with a IFLA_IF_NETNSID property. If they receive EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property with RTM_DELLINK. Userpace should then fallback to other means. - Security: Callers must have CAP_NET_ADMIN in the owning user namespace of the target network namespace. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK	Christian Brauner
	- Backwards Compatibility: If userspace wants to determine whether RTM_SETLINK supports the IFLA_IF_NETNSID property they should first send an RTM_GETLINK request with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply does not include IFLA_IF_NETNSID userspace should assume that IFLA_IF_NETNSID is not supported on this kernel. If the reply does contain an IFLA_IF_NETNSID property userspace can send an RTM_SETLINK with a IFLA_IF_NETNSID property. If they receive EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property with RTM_SETLINK. Userpace should then fallback to other means. To retain backwards compatibility the kernel will first check whether a IFLA_NET_NS_PID or IFLA_NET_NS_FD property has been passed. If either one is found it will be used to identify the target network namespace. This implies that users who do not care whether their running kernel supports IFLA_IF_NETNSID with RTM_SETLINK can pass both IFLA_NET_NS_{FD,PID} and IFLA_IF_NETNSID referring to the same network namespace. - Security: Callers must have CAP_NET_ADMIN in the owning user namespace of the target network namespace. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	rtnetlink: enable IFLA_IF_NETNSID in do_setlink()	Christian Brauner
	RTM_{NEW,SET}LINK already allow operations on other network namespaces by identifying the target network namespace through IFLA_NET_NS_{FD,PID} properties. This is done by looking for the corresponding properties in do_setlink(). Extend do_setlink() to also look for the IFLA_IF_NETNSID property. This introduces no functional changes since all callers of do_setlink() currently block IFLA_IF_NETNSID by reporting an error before they reach do_setlink(). This introduces the helpers: static struct net rtnl_link_get_net_by_nlattr(struct net src_net, struct nlattr tb[]) static struct net rtnl_link_get_net_capable(const struct sk_buff skb, struct net src_net, struct nlattr tb[], int cap) to simplify permission checks and target network namespace retrieval for RTM_ requests that already support IFLA_NET_NS_{FD,PID} but get extended to IFLA_IF_NETNSID. To perserve backwards compatibility the helpers look for IFLA_NET_NS_{FD,PID} properties first before checking for IFLA_IF_NETNSID. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	xfs: remove experimental tag for reflinks	Christoph Hellwig
	But reject reflink + DAX file systems for now until the code to support reflinks on DAX is actually implemented. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: port to 4.16] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29	xfs: don't screw up direct writes when freesp is fragmented	Darrick J. Wong
	xfs_bmap_btalloc is given a range of file offset blocks that must be allocated to some data/attr/cow fork. If the fork has an extent size hint associated with it, the request will be enlarged on both ends to try to satisfy the alignment hint. If free space is fragmentated, sometimes we can allocate some blocks but not enough to fulfill any of the requested range. Since bmapi_allocate always trims the new extent mapping to match the originally requested range, this results in bmapi_write returning zero and no mapping. The consequences of this vary -- buffered writes will simply re-call bmapi_write until it can satisfy at least one block from the original request. Direct IO overwrites notice nmaps == 0 and return -ENOSPC through the dio mechanism out to userspace with the weird result that writes fail even when we have enough space because the ENOSPC return overrides any partial write status. For direct CoW writes the situation was disastrous because nobody notices us returning an invalid zero-length wrong-offset mapping to iomap and the write goes off into space. Therefore, if free space is so fragmented that we managed to allocate some space but not enough to map into even a single block of the original allocation request range, we should break the alignment hint in order to guarantee at least some forward progress for the direct write. If we return a short allocation to iomap_apply it'll call back about the remaining blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	xfs: check reflink allocation mappings	Darrick J. Wong
	There's a really bad bug in xfs_reflink_allocate_cow -- if bmapi_write can return a zero error code but no mappings. This happens if there's an extent size hint (which causes allocation requests to be rounded to extsz granularity internally), but there wasn't a big enough chunk of free space to start filling at the extsz granularity and fill even one block of the range that we actually requested. In any case, if we got no mappings we can't possibly do anything useful with the contents of imap, so we must bail out with ENOSPC here. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	iomap: warn on zero-length mappings	Darrick J. Wong
	Don't let the iomap callback get away with feeding us a garbage zero length mapping -- there was a bug in xfs that resulted in those leaking out to hilarious effect. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	xfs: treat CoW fork operations as delalloc for quota accounting	Darrick J. Wong
	Since the CoW fork only exists in memory, it is incorrect to update the on-disk quota block counts when we modify the CoW fork. Unlike the data fork, even real extents in the CoW fork are only delalloc-style reservations (on-disk they're owned by the refcountbt) so they must not be tracked in the on disk quota info. Ensure the i_delayed_blks accounting reflects this too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	xfs: only grab shared inode locks for source file during reflink	Darrick J. Wong
	Reflink and dedupe operations remap blocks from a source file into a destination file. The destination file needs exclusive locks on all levels because we're updating its block map, but the source file isn't undergoing any block map changes so we can use a shared lock. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	xfs: allow xfs_lock_two_inodes to take different EXCL/SHARED modes	Darrick J. Wong
	Refactor xfs_lock_two_inodes to take separate locking modes for each inode. Specifically, this enables us to take a SHARED lock on one inode and an EXCL lock on the other. The lock class (MMAPLOCK/ILOCK) must be the same for each inode. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	xfs: reflink should break pnfs leases before sharing blocks	Darrick J. Wong
	Before we share blocks between files, we need to break the pnfs leases on the layout before we start slicing and dicing the block map. The structure of this function sets us up for the lock contention reduction in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	xfs: don't clobber inobt/finobt cursors when xref with rmap	Darrick J. Wong
	Even if we can't use the inobt/finobt cursors to count the number of inode btree blocks, we are never allowed to clobber the cursor of the btree being checked, so don't do this. Found by fuzzing level = ones in xfs/364. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	xfs: skip CoW writes past EOF when writeback races with truncate	Darrick J. Wong
	Every so often we blow the ASSERT(type != XFS_IO_COW) in xfs_map_blocks when running fsstress, as we do in generic/269. The cause of this is writeback racing with truncate -- writeback doesn't take the iolock, so truncate can sneak in to decrease i_size and truncate page cache while writeback is gathering buffer heads to schedule writeout. If we hit this race on a block that has a CoW mapping, we'll get a valid imap from the CoW fork but the reduced i_size trims the mapping to zero length (which makes it invalid), so we call xfs_map_blocks to try again. This doesn't do much anyway, since any mapping we get out of that will also be invalid, so we might as well skip the assert and just stop. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	xfs: preserve i_rdev when recycling a reclaimable inode	Amir Goldstein
	Commit 66f364649d870 ("xfs: remove if_rdev") moved storing of rdev value for special inodes to VFS inodes, but forgot to preserve the value of i_rdev when recycling a reclaimable xfs_inode. This was detected by xfstest overlay/017 with inodex=on mount option and xfs base fs. The test does a lookup of overlay chardev and blockdev right after drop caches. Overlayfs inodes hold a reference on underlying xfs inodes when mount option index=on is configured. If drop caches reclaim xfs inodes, before it relclaims overlayfs inodes, that can sometimes leave a reclaimable xfs inode and that test hits that case quite often. When that happens, the xfs inode cache remains broken (zere i_rdev) until the next cycle mount or drop caches. Fixes: 66f364649d870 ("xfs: remove if_rdev") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29	xfs: refactor accounting updates out of xfs_bmap_btalloc	Darrick J. Wong
	Move all the inode and quota accounting updates out of xfs_bmap_btalloc in preparation for fixing some quota accounting problems with copy on write. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-01-29	xfs: refactor inode verifier corruption error printing	Darrick J. Wong
	Refactor inode verifier error reporting into a non-libxfs function so that we aren't encoding the message format in libxfs. This also changes the kernel dmesg output to resemble buffer verifier errors more closely. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	xfs: make tracepoint inode number format consistent	Darrick J. Wong
	Fix all the inode number formats to be consistently (0x%llx) in all trace point definitions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	xfs: always zero di_flags2 when we free the inode	Darrick J. Wong
	Always zero the di_flags2 field when we free the inode so that we never end up with an on-disk record for an unallocated inode that also has the reflink iflag set. This is in keeping with the general principle that only files can have the reflink iflag set, even though we'll zero out di_flags2 if we ever reallocate the inode. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	xfs: call xfs_qm_dqattach before performing reflink operations	Darrick J. Wong
	Ensure that we've attached all the necessary dquots before performing reflink operations so that quota accounting is accurate. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29	xfs: bmap code cleanup	Shan Hai
	Remove the extent size hint and realtime inode relevant code from the xfs_bmapi_reserve_delalloc since it is not called on the inode with extent size hint set or on a realtime inode. Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29	Use list_head infra-structure for buffer's log items list	Carlos Maiolino
	Now that buffer's b_fspriv has been split, just replace the current singly linked list of xfs_log_items, by the list_head infrastructure. Also, remove the xfs_log_item argument from xfs_buf_resubmit_failed_buffers(), there is no need for this argument, once the log items can be walked through the list_head in the buffer. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: minor style cleanups] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29	Split buffer's b_fspriv field	Carlos Maiolino
	By splitting the b_fspriv field into two different fields (b_log_item and b_li_list). It's possible to get rid of an old ABI workaround, by using the new b_log_item field to store xfs_buf_log_item separated from the log items attached to the buffer, which will be linked in the new b_li_list field. This way, there is no more need to reorder the log items list to place the buf_log_item at the beginning of the list, simplifying a bit the logic to handle buffer IO. This also opens the possibility to change buffer's log items list into a proper list_head. b_log_item field is still defined as a void *, because it is still used by the log buffers to store xlog_in_core structures, and there is no need to add an extra field on xfs_buf just for xlog_in_core. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: minor style changes] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29	Get rid of xfs_buf_log_item_t typedef	Carlos Maiolino
	Take advantage of the rework on xfs_buf log items list, to get rid of ths typedef for xfs_buf_log_item. This patch also fix some indentation alignment issues found along the way. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29	irqchip/stm32: Fix copyright	Benjamin Gaignard
	Uniformize STMicroelectronics copyrights header and add SPDX identifier CC: Maxime Coquelin <mcoquelin.stm32@gmail.com> Signed-off-by: Benjamin Gaignard <benjamin.gaignard@st.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Alexandre TORGUE <alexandre.torgue@st.com> Acked-by: Maxime Coquelin <mcoquelin.stm32@gmail.com> Cc: jason@lakedaemon.net Cc: marc.zyngier@arm.com Cc: linux-arm-kernel@lists.infradead.org Link: https://lkml.kernel.org/r/20171130084500.23439-1-benjamin.gaignard@st.com
2018-01-29	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	David S. Miller
	Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29	Coccinelle: memdup: drop spurious line	Julia Lawall
	The kmemdup line in the non-patch case was left over from the added kmemdup line in the patch case. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-01-29	fs: handle inode->i_version more efficiently	Jeff Layton
	Since i_version is mostly treated as an opaque value, we can exploit that fact to avoid incrementing it when no one is watching. With that change, we can avoid incrementing the counter on writes, unless someone has queried for it since it was last incremented. If the a/c/mtime don't change, and the i_version hasn't changed, then there's no need to dirty the inode metadata on a write. Convert the i_version counter to an atomic64_t, and use the lowest order bit to hold a flag that will tell whether anyone has queried the value since it was last incremented. When we go to maybe increment it, we fetch the value and check the flag bit. If it's clear then we don't need to do anything if the update isn't being forced. If we do need to update, then we increment the counter by 2, and clear the flag bit, and then use a CAS op to swap it into place. If that works, we return true. If it doesn't then do it again with the value that we fetch from the CAS operation. On the query side, if the flag is already set, then we just shift the value down by 1 bit and return it. Otherwise, we set the flag in our on-stack value and again use cmpxchg to swap it into place if it hasn't changed. If it has, then we use the value from the cmpxchg as the new "old" value and try again. This method allows us to avoid incrementing the counter on writes (and dirtying the metadata) under typical workloads. We only need to increment if it has been queried since it was last changed. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
2018-01-29	btrfs: only dirty the inode in btrfs_update_time if something was changed	Jeff Layton
	At this point, we know that "now" and the file times may differ, and we suspect that the i_version has been flagged to be bumped. Attempt to bump the i_version, and only mark the inode dirty if that actually occurred or if one of the times was updated. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2018-01-29	xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing	Jeff Layton
	If XFS_ILOG_CORE is already set then go ahead and increment it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com>
2018-01-29	fs: only set S_VERSION when updating times if necessary	Jeff Layton
	We only really need to update i_version if someone has queried for it since we last incremented it. By doing that, we can avoid having to update the inode if the times haven't changed. If the times have changed, then we go ahead and forcibly increment the counter, under the assumption that we'll be going to the storage anyway, and the increment itself is relatively cheap. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2018-01-29	IMA: switch IMA over to new i_version API	Jeff Layton
	Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-01-29	xfs: convert to new i_version API	Jeff Layton
	Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com>
2018-01-29	ufs: use new i_version API	Jeff Layton
	Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-01-29	ocfs2: convert to new i_version API	Jeff Layton
	Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2018-01-29	nfsd: convert to new i_version API	Jeff Layton
	Mostly just making sure we use the "get" wrappers so we know when it is being fetched for later use. Signed-off-by: Jeff Layton <jlayton@redhat.com>