summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-08-30GFS2: Fix non-recursive truncate bugBob Peterson
Before this patch if you truncated a file to a smaller size it wasn't freeing all the blocks properly. There are two reasons. First, the metapath comparison was not comparing previous heights. I added a function, mp_eq_to_hgt, which checks the metapath at all heights prior to the target height. Second, in function find_nonnull_ptr, it needed to zero out all pointers for heights following the target height. Translated into decimal integer terms, this way a number like 299, when incremented, becomes 300, not 399. The 2 gets incremented to 3, and the following digits need to be reset. These two things allow the truncate state machine to properly find the blocks it needs to delete. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-30fsnotify: make dnotify_fsnotify_ops constBhumika Goyal
Make this const as it is never modified. Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-30gfs2: constify rhashtable_paramsArvind Yadav
rhashtable_params are not supposed to change at runtime. All Functions rhashtable_* working with const rhashtable_params provided by <linux/rhashtable.h>. So mark the non-const structs as const. Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-30GFS2: Fix gl_object warningsAndreas Gruenbacher
The following cleanup is needed to avoid spilling the syslog with false warnings. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-29f2fs: trigger fdatasync for non-atomic_write fileChao Yu
Sqlite only cares about synchronization of file data instead of other data unrelated attribute of inode, so in commit flow, call fdatasync is enough. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29f2fs: fix to avoid race in between aio and gcChao Yu
We won't wait DIO synchronously when doing AIO, so there will be potential IO reorder in between AIO and GC, which will cause data corruption. This patch adds inode_dio_wait to serialize aio and data GC to avoid this issue. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29f2fs: wake up discard_thread iff there is a candidateJaegeuk Kim
This patch fixes to avoid needless wake ups. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29f2fs: return error when accessing insane flie offsetJaegeuk Kim
If file offset is insane, we have to return error instead of kernel panic. Reported-by: Eric Zhang <followme999@163.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29f2fs: trigger normal fsync for non-atomic_write fileChao Yu
If file was not opened with atomic write mode, but user uses atomic write ioctl to fsync datas, in the flow, we should not fsync that file with atomic write mode. Fixes: 608514deba38 ("f2fs: set fsync mark only for the last dnode") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29f2fs: clear FI_HOT_DATA correctlyChao Yu
This patch fixes to clear FI_HOT_DATA correctly in below path: - error handling in f2fs_ioc_start_atomic_write - after commit atomic write in f2fs_ioc_commit_atomic_write - after drop atomic write in drop_inmem_pages Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29f2fs: fix out-of-order execution in f2fs_issue_flushChao Yu
In f2fs_issue_flush, due to out-of-order execution of CPU, wake_up can be called before we insert issue_list, result in long latency of wait_for_completion. Fix this by adding smp_mb() to force the order of related codes. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29f2fs: issue discard commands if gc_urgent is setJaegeuk Kim
It's time to issue all the discard commands, if user sets the idle time. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29rxrpc: Add notification of end-of-Tx phaseDavid Howells
Add a callback to rxrpc_kernel_send_data() so that a kernel service can get a notification that the AF_RXRPC call has transitioned out the Tx phase and is now waiting for a reply or a final ACK. This is called from AF_RXRPC with the call state lock held so the notification is guaranteed to come before any reply is passed back. Further, modify the AFS filesystem to make use of this so that we don't have to change the afs_call state before sending the last bit of data. Signed-off-by: David Howells <dhowells@redhat.com>
2017-08-28fs/select: Fix memory corruption in compat_get_fd_set()Helge Deller
Commit 464d62421cb8 ("select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()") changed the calculation on how many bytes need to be zeroed when userspace handed over a NULL pointer for a fdset array in the select syscall. The calculation was changed in compat_get_fd_set() wrongly from memset(fdset, 0, ((nr + 1) & ~1)*sizeof(compat_ulong_t)); to memset(fdset, 0, ALIGN(nr, BITS_PER_LONG)); The ALIGN(nr, BITS_PER_LONG) calculates the number of _bits_ which need to be zeroed in the target fdset array (rounded up to the next full bits for an unsigned long). But the memset() call expects the number of _bytes_ to be zeroed. This leads to clearing more memory than wanted (on the stack area or even at kmalloc()ed memory areas) and to random kernel crashes as we have seen them on the parisc platform. The correct change should have been memset(fdset, 0, (ALIGN(nr, BITS_PER_LONG) / BITS_PER_LONG) * BYTES_PER_LONG); which is the same as can be archieved with a call to zero_fd_set(nr, fdset). Fixes: 464d62421cb8 ("select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()" Acked-by:: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-28kernfs: Clarify lockdep name for kn->countWaiman Long
The reference count in kernfs_node structure is treated like a rwsem by using lockdep instrumentation code. The lockdep name, however, is still "s_active" which is carried over from the old sysfs code. As s_active is no longer the variable name, its use may confuse users on where the lock is when it is reported by lockdep. So it is changed to "kn->count" which is how this variable is normally referenced in kernfs code. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-28Merge 4.13-rc7 into char-misc-nextGreg Kroah-Hartman
We want the binder fix in here as well for testing and merge issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-28fput: Don't reinvent the wheel but use existing llist APIByungchul Park
Although llist provides proper APIs, they are not used. Make them used. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-28namespace.c: Don't reinvent the wheel but use existing llist APIByungchul Park
Although llist provides proper APIs, they are not used. Make them used. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "6 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/memblock.c: reversed logic in memblock_discard() fork: fix incorrect fput of ->exe_file causing use-after-free mm/madvise.c: fix freeing of locked page with MADV_FREE dax: fix deadlock due to misaligned PMD faults mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled PM/hibernate: touch NMI watchdog when creating snapshot
2017-08-25Merge tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd fixes from Bruce Fields: "Two nfsd bugfixes, neither 4.13 regressions, but both potentially serious" * tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linux: net: sunrpc: svcsock: fix NULL-pointer exception nfsd: Limit end of page list when decoding NFSv4 WRITE
2017-08-25Merge tag 'cifs-fixes-for-4.13-rc6-and-stable' of ↵Linus Torvalds
git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Some bug fixes for stable for cifs" * tag 'cifs-fixes-for-4.13-rc6-and-stable' of git://git.samba.org/sfrench/cifs-2.6: cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup() cifs: Fix df output for users with quota limits
2017-08-25GFS2: Fix up some sparse warningsBob Peterson
This patch cleans up various pieces of GFS2 to avoid sparse errors. This doesn't fix them all, but it fixes several. The first error, in function glock_hash_walk was a genuine bug where the rhashtable could be started and not stopped. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-25dax: fix deadlock due to misaligned PMD faultsRoss Zwisler
In DAX there are two separate places where the 2MiB range of a PMD is defined. The first is in the page tables, where a PMD mapping inserted for a given address spans from (vmf->address & PMD_MASK) to ((vmf->address & PMD_MASK) + PMD_SIZE - 1). That is, from the 2MiB boundary below the address to the 2MiB boundary above the address. So, for example, a fault at address 3MiB (0x30 0000) falls within the PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000). The second PMD range is in the mapping->page_tree, where a given file offset is covered by a radix tree entry that spans from one 2MiB aligned file offset to another 2MiB aligned file offset. So, for example, the file offset for 3MiB (pgoff 768) falls within the PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff 512) to 4MiB (pgoff 1024). This system works so long as the addresses and file offsets for a given mapping both have the same offsets relative to the start of each PMD. Consider the case where the starting address for a given file isn't 2MiB aligned - say our faulting address is 3 MiB (0x30 0000), but that corresponds to the beginning of our file (pgoff 0). Now all the PMDs in the mapping are misaligned so that the 2MiB range defined in the page tables never matches up with the 2MiB range defined in the radix tree. The current code notices this case for DAX faults to storage with the following test in dax_pmd_insert_mapping(): if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR) goto unlock_fallback; This test makes sure that the pfn we get from the driver is 2MiB aligned, and relies on the assumption that the 2MiB alignment of the pfn we get back from the driver matches the 2MiB alignment of the faulting address. However, faults to holes were not checked and we could hit the problem described above. This was reported in response to the NVML nvml/src/test/pmempool_sync TEST5: $ cd nvml/src/test/pmempool_sync $ make TEST5 You can grab NVML here: https://github.com/pmem/nvml/ The dmesg warning you see when you hit this error is: WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310 Where we notice in dax_insert_mapping_entry() that the radix tree entry we are about to replace doesn't match the locked entry that we had previously inserted into the tree. This happens because the initial insertion was done in grab_mapping_entry() using a pgoff calculated from the faulting address (vmf->address), and the replacement in dax_pmd_load_hole() => dax_insert_mapping_entry() is done using vmf->pgoff. In our failure case those two page offsets (one calculated from vmf->address, one using vmf->pgoff) point to different order 9 radix tree entries. This failure case can result in a deadlock because the radix tree unlock also happens on the pgoff calculated from vmf->address. This means that the locked radix tree entry that we swapped in to the tree in dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all future faults to that 2MiB range will block forever. Fix this by validating that the faulting address's PMD offset matches the PMD offset from the start of the file. This check is done at the very beginning of the fault and covers faults that would have mapped to storage as well as faults to holes. I left the COLOUR check in dax_pmd_insert_mapping() in place in case we ever hit the insanity condition where the alignment of the pfn we get from the driver doesn't match the alignment of the userspace address. Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: "Slusarz, Marcin" <marcin.slusarz@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-25gfs2: Silence gcc format-truncation warningAndreas Gruenbacher
Enlarge sd_fsname to be big enough for the longest long lock table name and an arbitrary journal number. This silences two -Wformat-truncation warnings with gcc 7.1.1. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-25GFS2: Withdraw for IO errors writing to the journal or statfsBob Peterson
Before this patch, if GFS2 encountered IO errors while writing to the journal, it would not report the problem, so they would go unnoticed, sometimes for many hours. Sometimes this would only be noticed later, when recovery tried to do journal replay and failed due to invalid metadata at the blocks that resulted in IO errors. This patch makes GFS2's log daemon check for IO errors. If it encounters one, it withdraws from the file system and reports why in dmesg. A similar action is taken when IO errors occur when writing to the system statfs file. These errors are also reported back to any callers of fsync, since that requires the journal to be flushed. Therefore, any IO errors that would previously go unnoticed are now noticed and the file system is withdrawn as early as possible, thus preventing further file system damage. Also note that this reintroduces superblock variable sd_log_error, which Christoph removed with commit f729b66fca. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-25Merge branch 'linus' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-24sunrpc: Const-ify struct sv_serv_opsChuck Lever
Close an attack vector by moving the arrays of per-server methods to read-only memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24nfsd: Const-ify NFSv4 encoding and decoding ops arraysChuck Lever
Close an attack vector by moving the arrays of encoding and decoding methods to read-only memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24nfsd4: individual encoders no longer see error casesJ. Bruce Fields
With a few exceptions, most individual encoders don't handle error cases. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24nfsd4: skip encoder in trivial error casesJ. Bruce Fields
Most encoders do nothing in the error case. But they can still screw things up in that case: most errors happen very early in rpc processing, possibly before argument fields are filled in and bounds-tested, so encoders that do anything other than immediately bail on error can easily crash in odd error cases. So just handle errors centrally most of the time to remove the chance of error. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24nfsd4: define ->op_release for compound opsJ. Bruce Fields
Run a separate ->op_release function if necessary instead of depending on the xdr encoder to do this. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24nfsd4: opdesc will be useful outside nfs4proc.cJ. Bruce Fields
Trivial cleanup, no change in behavior. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24nfsd: Limit end of page list when decoding NFSv4 WRITEChuck Lever
When processing an NFSv4 WRITE operation, argp->end should never point past the end of the data in the final page of the page list. Otherwise, nfsd4_decode_compound can walk into uninitialized memory. More critical, nfsd4_decode_write is failing to increment argp->pagelen when it increments argp->pagelist. This can cause later xdr decoders to assume more data is available than really is, which can cause server crashes on malformed requests. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24Merge branch 'for-4.13-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "We have one more fixup that stems from the blk_status_t conversion that did not quite cover everything. The normal cases were not affected because the code is 0, but any error and retries could mix up new and old values" * 'for-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix blk_status_t/errno confusion
2017-08-24pty: Repair TIOCGPTPEEREric W. Biederman
The implementation of TIOCGPTPEER has two issues. When /dev/ptmx (as opposed to /dev/pts/ptmx) is opened the wrong vfsmount is passed to dentry_open. Which results in the kernel displaying the wrong pathname for the peer. The second is simply by caching the vfsmount and dentry of the peer it leaves them open, in a way they were not previously Which because of the inreased reference counts can cause unnecessary behaviour differences resulting in regressions. To fix these move the ioctl into tty_io.c at a generic level allowing the ioctl to have access to the struct file on which the ioctl is being called. This allows the path of the slave to be derived when opening the slave through TIOCGPTPEER instead of requiring the path to the slave be cached. Thus removing the need for caching the path. A new function devpts_ptmx_path is factored out of devpts_acquire and used to implement a function devpts_mntget. The new function devpts_mntget takes a filp to perform the lookup on and fsi so that it can confirm that the superblock that is found by devpts_ptmx_path is the proper superblock. v2: Lots of fixes to make the code actually work v3: Suggestions by Linus - Removed the unnecessary initialization of filp in ptm_open_peer - Simplified devpts_ptmx_path as gotos are no longer required [ This is the fix for the issue that was reverted in commit 143c97cc6529, but this time without breaking 'pbuilder' due to increased reference counts - Linus ] Fixes: 54ebbfb16034 ("tty: add TIOCGPTPEER ioctl") Reported-by: Christian Brauner <christian.brauner@canonical.com> Reported-and-tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-24ext4: fix fault handling when mounted with -o dax,roRandy Dodgen
If an ext4 filesystem is mounted with both the DAX and read-only options, executables on that filesystem will fail to start (claiming 'Segmentation fault') due to the fault handler returning VM_FAULT_SIGBUS. This is due to the DAX fault handler (see ext4_dax_huge_fault) attempting to write to the journal when FAULT_FLAG_WRITE is set. This is the wrong behavior for write faults which will lead to a COW page; in particular, this fails for readonly mounts. This change avoids journal writes for faults that are expected to COW. It might be the case that this could be better handled in ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside dax_iomap_fault). These is some overlap already (e.g. grabbing journal handles). Signed-off-by: Randy Dodgen <dodgen@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
2017-08-24ext4: fix quota inconsistency during orphan cleanup for read-only mountszhangyi (F)
Quota does not get enabled for read-only mounts if filesystem has quota feature, so that quotas cannot updated during orphan cleanup, which will lead to quota inconsistency. This patch turn on quotas during orphan cleanup for this case, make sure quotas can be updated correctly. Reported-by: Jan Kara <jack@suse.cz> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org # 3.18+
2017-08-24ext4: fix incorrect quotaoff if the quota feature is enabledzhangyi (F)
Current ext4 quota should always "usage enabled" if the quota feautre is enabled. But in ext4_orphan_cleanup(), it turn quotas off directly (used for the older journaled quota), so we cannot turn it on again via "quotaon" unless umount and remount ext4. Simple reproduce: mkfs.ext4 -O project,quota /dev/vdb1 mount -o prjquota /dev/vdb1 /mnt chattr -p 123 /mnt chattr +P /mnt touch /mnt/aa /mnt/bb exec 100<>/mnt/aa rm -f /mnt/aa sync echo c > /proc/sysrq-trigger #reboot and mount mount -o prjquota /dev/vdb1 /mnt #query status quotaon -Ppv /dev/vdb1 #output quotaon: Cannot find mountpoint for device /dev/vdb1 quotaon: No correct mountpoint specified. This patch add check for journaled quotas to avoid incorrect quotaoff when ext4 has quota feautre. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org # 3.18
2017-08-24ext4: remove useless test and assignment in strtohash functionsDamien Guibouret
On transformation of str to hash, computed value is initialised before first byte modulo 4. But it is already initialised before entering loop and after processing last byte modulo 4. So the corresponding test and initialisation could be removed. Signed-off-by: Damien Guibouret <damien.guibouret@partition-saving.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-24ext4: backward compatibility support for Lustre ea_inode implementationTahsin Erdogan
Original Lustre ea_inode feature did not have ref counts on xattr inodes because there was always one parent that referenced it. New implementation expects ref count to be initialized which is not true for Lustre case. Handle this by detecting Lustre created xattr inode and set its ref count to 1. The quota handling of xattr inodes have also changed with deduplication support. New implementation manually manages quotas to support sharing across multiple users. A consequence is that, a referencing inode incorporates the blocks of xattr inode into its own i_block field. We need to know how a xattr inode was created so that we can reverse the block charges during reference removal. This is handled by introducing a EXT4_STATE_LUSTRE_EA_INODE flag. The flag is set on a xattr inode if inode appears to have been created by Lustre. During xattr inode reference removal, the manual quota uncharge is skipped if the flag is set. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-24ext4: remove timebomb in ext4_decode_extra_time()Christoph Hellwig
Changing behavior based on the version code is a timebomb waiting to happen, and not easily bisectable. Drop it and leave any removal to explicit developer action. (And I don't think file system should _ever_ remove backwards compatibility that has no explicit flag, but I'll leave that to the ext4 folks). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Biggers <ebiggers@google.com>
2017-08-24ext4: use sizeof(*ptr)Markus Elfring
Replace the specification of data structures by pointer dereferences as the parameter for the operator "sizeof" to make the corresponding size determination a bit safer according to the Linux coding style convention. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2017-08-24ext4: in ext4_seek_{hole,data}, return -ENXIO for negative offsetsDarrick J. Wong
In the ext4 implementations of SEEK_HOLE and SEEK_DATA, make sure we return -ENXIO for negative offsets instead of banging around inside the extent code and returning -EFSCORRUPTED. Reported-by: Mateusz S <muttdini@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org # 4.6
2017-08-24ext4: reduce lock contention in __ext4_new_inodeWang Shilong
While running number of creating file threads concurrently, we found heavy lock contention on group spinlock: FUNC TOTAL_TIME(us) COUNT AVG(us) ext4_create 1707443399 1440000 1185.72 _raw_spin_lock 1317641501 180899929 7.28 jbd2__journal_start 287821030 1453950 197.96 jbd2_journal_get_write_access 33441470 73077185 0.46 ext4_add_nondir 29435963 1440000 20.44 ext4_add_entry 26015166 1440049 18.07 ext4_dx_add_entry 25729337 1432814 17.96 ext4_mark_inode_dirty 12302433 5774407 2.13 most of cpu time blames to _raw_spin_lock, here is some testing numbers with/without patch. Test environment: Server : SuperMicro Sever (2 x E5-2690 v3@2.60GHz, 128GB 2133MHz DDR4 Memory, 8GbFC) Storage : 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB Read Intensive SSD) format command: mkfs.ext4 -J size=4096 test command: mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \ -r -i 1 -v -p 10 -u #first run to load inode mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \ -r -i 3 -v -p 10 -u Kernel version: 4.13.0-rc3 Test 1,440,000 files with 48 directories by 48 processes: Without patch: File Creation File removal 79,033 289,569 ops/per second 81,463 285,359 79,875 288,475 With patch: File Creation File removal 810669 301694 812805 302711 813965 297670 Creation performance is improved more than 10X with large journal size. The main problem here is we test bitmap and do some check and journal operations which could be slept, then we test and set with lock hold, this could be racy, and make 'inode' steal by other process. However, after first try, we could confirm handle has been started and inode bitmap journaled too, then we could find and set bit with lock hold directly, this will mostly gurateee success with second try. Tested-by: Shuichi Ihara <sihara@ddn.com> Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2017-08-24ext4: cleanup goto next groupWang Shilong
avoid duplicated codes, also we need goto next group in case we found reserved inode. Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2017-08-24ext4: do not unnecessarily allocate buffer in recently_deleted()Jan Kara
In recently_deleted() function we want to check whether inode is still cached in buffer cache. Use sb_find_get_block() for that instead of sb_getblk() to avoid unnecessary allocation of bdev page and buffer heads. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-24Btrfs: fix blk_status_t/errno confusionOmar Sandoval
This fixes several instances of blk_status_t and bare errno ints being mixed up, some of which are real bugs. In the normal case, 0 matches BLK_STS_OK, so we don't observe any effects of the missing conversion, but in case of errors or passes through the repair/retry paths, the errors get mixed up. The changes were identified using 'sparse', we don't have reports of the buggy behaviour. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-23Revert "pty: fix the cached path of the pty slave file descriptor in the master"Linus Torvalds
This reverts commit c8c03f1858331e85d397bacccd34ef409aae993c. It turns out that while fixing the ptmx file descriptor to have the correct 'struct path' to the associated slave pty is a really good thing, it breaks some user space tools for a very annoying reason. The problem is that /dev/ptmx and its associated slave pty (/dev/pts/X) are on different mounts. That was what caused us to have the wrong path in the first place (we would mix up the vfsmount of the 'ptmx' node, with the dentry of the pty slave node), but it also means that now while we use the right vfsmount, having the pty master open also keeps the pts mount busy. And it turn sout that that makes 'pbuilder' very unhappy, as noted by Stefan Lippers-Hollmann: "This patch introduces a regression for me when using pbuilder 0.228.7[2] (a helper to build Debian packages in a chroot and to create and update its chroots) when trying to umount /dev/ptmx (inside the chroot) on Debian/ unstable (full log and pbuilder configuration file[3] attached). [...] Setting up build-essential (12.3) ... Processing triggers for libc-bin (2.24-15) ... I: unmounting dev/ptmx filesystem W: Could not unmount dev/ptmx: umount: /var/cache/pbuilder/build/1340/dev/ptmx: target is busy (In some cases useful info about processes that use the device is found by lsof(8) or fuser(1).)" apparently pbuilder tries to unmount the /dev/pts filesystem while still holding at least one master node open, which is arguably not very nice, but we don't break user space even when fixing other bugs. So this commit has to be reverted. I'll try to figure out a way to avoid caching the path to the slave pty in the master pty. The only thing that actually wants that slave pty path is the "TIOCGPTPEER" ioctl, and I think we could just recreate the path at that time. Reported-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: Eric W Biederman <ebiederm@xmission.com> Cc: Christian Brauner <christian.brauner@canonical.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>