summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-07-07gfs2: Fix glock rhashtable rcu bugAndreas Gruenbacher
Before commit 88ffbf3e03 "GFS2: Use resizable hash table for glocks", glocks were freed via call_rcu to allow reading the glock hashtable locklessly using rcu. This was then changed to free glocks immediately, which made reading the glock hashtable unsafe. Bring back the original code for freeing glocks via call_rcu. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Cc: stable@vger.kernel.org # 4.3+
2017-07-07f2fs: avoid deadlock caused by lock order of page and lock_opJaegeuk Kim
- punch_hole - fill_zero - f2fs_lock_op - get_new_data_page - lock_page - f2fs_write_data_pages - lock_page - do_write_data_page - f2fs_lock_op Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07f2fs: use spin_{,un}lock_irq{save,restore}Chao Yu
generic/361 reports below warning, this is because: once, there is someone entering into critical region of sbi.cp_lock, if write_end_io. f2fs_stop_checkpoint is invoked from an triggered IRQ, we will encounter deadlock. So this patch changes to use spin_{,un}lock_irq{save,restore} to create critical region without IRQ enabled to avoid potential deadlock. irq event stamp: 83391573 loop: Write error at byte offset 438729728, length 1024. hardirqs last enabled at (83391573): [<c1809752>] restore_all+0xf/0x65 hardirqs last disabled at (83391572): [<c1809eac>] reschedule_interrupt+0x30/0x3c loop: Write error at byte offset 438860288, length 1536. softirqs last enabled at (83389244): [<c180cc4e>] __do_softirq+0x1ae/0x476 softirqs last disabled at (83389237): [<c101ca7c>] do_softirq_own_stack+0x2c/0x40 loop: Write error at byte offset 438990848, length 2048. ================================ WARNING: inconsistent lock state 4.12.0-rc2+ #30 Tainted: G O -------------------------------- inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage. xfs_io/7959 [HC1[1]:SC0[0]:HE0:SE1] takes: (&(&sbi->cp_lock)->rlock){?.+...}, at: [<f96f96cc>] f2fs_stop_checkpoint+0x1c/0x50 [f2fs] {HARDIRQ-ON-W} state was registered at: __lock_acquire+0x527/0x7b0 lock_acquire+0xae/0x220 _raw_spin_lock+0x42/0x50 do_checkpoint+0x165/0x9e0 [f2fs] write_checkpoint+0x33f/0x740 [f2fs] __f2fs_sync_fs+0x92/0x1f0 [f2fs] f2fs_sync_fs+0x12/0x20 [f2fs] sync_filesystem+0x67/0x80 generic_shutdown_super+0x27/0x100 kill_block_super+0x22/0x50 kill_f2fs_super+0x3a/0x40 [f2fs] deactivate_locked_super+0x3d/0x70 deactivate_super+0x40/0x60 cleanup_mnt+0x39/0x70 __cleanup_mnt+0x10/0x20 task_work_run+0x69/0x80 exit_to_usermode_loop+0x57/0x85 do_fast_syscall_32+0x18c/0x1b0 entry_SYSENTER_32+0x4c/0x7b irq event stamp: 1957420 hardirqs last enabled at (1957419): [<c1808f37>] _raw_spin_unlock_irq+0x27/0x50 hardirqs last disabled at (1957420): [<c1809f9c>] call_function_single_interrupt+0x30/0x3c softirqs last enabled at (1953784): [<c180cc4e>] __do_softirq+0x1ae/0x476 softirqs last disabled at (1953773): [<c101ca7c>] do_softirq_own_stack+0x2c/0x40 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&sbi->cp_lock)->rlock); <Interrupt> lock(&(&sbi->cp_lock)->rlock); *** DEADLOCK *** 2 locks held by xfs_io/7959: #0: (sb_writers#13){.+.+.+}, at: [<c11fd7ca>] vfs_write+0x16a/0x190 #1: (&sb->s_type->i_mutex_key#16){+.+.+.}, at: [<f96e33f5>] f2fs_file_write_iter+0x25/0x140 [f2fs] stack backtrace: CPU: 2 PID: 7959 Comm: xfs_io Tainted: G O 4.12.0-rc2+ #30 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 Call Trace: dump_stack+0x5f/0x92 print_usage_bug+0x1d3/0x1dd ? check_usage_backwards+0xe0/0xe0 mark_lock+0x23d/0x280 __lock_acquire+0x699/0x7b0 ? __this_cpu_preempt_check+0xf/0x20 ? trace_hardirqs_off_caller+0x91/0xe0 lock_acquire+0xae/0x220 ? f2fs_stop_checkpoint+0x1c/0x50 [f2fs] _raw_spin_lock+0x42/0x50 ? f2fs_stop_checkpoint+0x1c/0x50 [f2fs] f2fs_stop_checkpoint+0x1c/0x50 [f2fs] f2fs_write_end_io+0x147/0x150 [f2fs] bio_endio+0x7a/0x1e0 blk_update_request+0xad/0x410 blk_mq_end_request+0x16/0x60 lo_complete_rq+0x3c/0x70 __blk_mq_complete_request_remote+0x11/0x20 flush_smp_call_function_queue+0x6d/0x120 ? debug_smp_processor_id+0x12/0x20 generic_smp_call_function_single_interrupt+0x12/0x30 smp_call_function_single_interrupt+0x25/0x40 call_function_single_interrupt+0x37/0x3c EIP: _raw_spin_unlock_irq+0x2d/0x50 EFLAGS: 00000296 CPU: 2 EAX: 00000001 EBX: d2ccc51c ECX: 00000001 EDX: c1aacebd ESI: 00000000 EDI: 00000000 EBP: c96c9d1c ESP: c96c9d18 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 ? inherit_task_group.isra.98.part.99+0x6b/0xb0 __add_to_page_cache_locked+0x1d4/0x290 add_to_page_cache_lru+0x38/0xb0 pagecache_get_page+0x8e/0x200 f2fs_write_begin+0x96/0xf00 [f2fs] ? trace_hardirqs_on_caller+0xdd/0x1c0 ? current_time+0x17/0x50 ? trace_hardirqs_on+0xb/0x10 generic_perform_write+0xa9/0x170 __generic_file_write_iter+0x1a2/0x1f0 ? f2fs_preallocate_blocks+0x137/0x160 [f2fs] f2fs_file_write_iter+0x6e/0x140 [f2fs] ? __lock_acquire+0x429/0x7b0 __vfs_write+0xc1/0x140 vfs_write+0x9b/0x190 SyS_pwrite64+0x63/0xa0 do_fast_syscall_32+0xa1/0x1b0 entry_SYSENTER_32+0x4c/0x7b EIP: 0xb7786c61 EFLAGS: 00000293 CPU: 2 EAX: ffffffda EBX: 00000003 ECX: 08416000 EDX: 00001000 ESI: 18b24000 EDI: 00000000 EBP: 00000003 ESP: bf9b36b0 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b Fixes: aaec2b1d1879 ("f2fs: introduce cp_lock to protect updating of ckpt_flags") Cc: stable@vger.kernel.org Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07f2fs: relax migratepage for atomic written pageJaegeuk Kim
In order to avoid lock contention for atomic written pages, we'd better give EBUSY in f2fs_migrate_page when mode is asynchronous. We expect it will be released soon as transaction commits. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07f2fs: don't count inode block in in-memory inode.i_blocksChao Yu
Previously, we count all inode consumed blocks including inode block, xattr block, index block, data block into i_blocks, for other generic filesystems, they won't count inode block into i_blocks, so for userspace applications or quota system, they may detect incorrect block count according to i_blocks value in inode. This patch changes to count all blocks into inode.i_blocks excluding inode block, for on-disk i_blocks, we keep counting inode block for backward compatibility. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07Revert "f2fs: fix to clean previous mount option when remount_fs"Chao Yu
Don't clear old mount option before parse new option during ->remount_fs like other generic filesystems. This reverts commit 26666c8a4366debae30ae37d0688b2bec92d196a. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07f2fs: do not set LOST_PINO for renamed dirSheng Yong
After renaming a directory, fsck could detect unmatched pino. The scenario can be reproduced as the following: $ mkdir /bar/subbar /foo $ rename /bar/subbar /foo Then fsck will report: [ASSERT] (__chk_dots_dentries:1182) --> Bad inode number[0x3] for '..', parent parent ino is [0x4] Rename sets LOST_PINO for old_inode. However, the flag cannot be cleared, since dir is written back with CP. So, let's get rid of LOST_PINO for a renamed dir and fix the pino directly at the end of rename. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07f2fs: do not set LOST_PINO for newly created dirSheng Yong
Since directories will be written back with checkpoint and fsync a directory will always write CP, there is no need to set LOST_PINO after creating a directory. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07f2fs: skip ->writepages for {mete,node}_inode during recoveryChao Yu
Skip ->writepages in prior to ->writepage for {meta,node}_inode during recovery, hence unneeded loop in ->writepages can be avoided. Moreover, check SBI_POR_DOING earlier while writebacking pages. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07f2fs: introduce __check_sit_bitmapChao Yu
After we introduce discard thread, discard command can be issued concurrently with data allocating, this patch adds new function to heck sit bitmap to ensure that userdata was invalid in which on-going discard command covered. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07f2fs: stop gc/discard thread in prior during umountChao Yu
This patch resolves kernel panic for xfstests/081, caused by recent f2fs_bug_on f2fs: add f2fs_bug_on in __remove_discard_cmd For fixing, we will stop gc/discard thread in prior in ->kill_sb in order to avoid referring and releasing race among them. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07f2fs: introduce reserved_blocks in sysfsChao Yu
In this patch, we add a new sysfs interface, with it, we can control number of reserved blocks in system which could not be used by user, it enable f2fs to let user to configure for adjusting over-provision ratio dynamically instead of changing it by mkfs. So we can expect it will help to reserve more free space for relieving GC in both filesystem and flash device. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07f2fs: avoid redundant f2fs_flush after remountYunlong Song
create_flush_cmd_control will create redundant issue_flush_thread after each remount with flush_merge option. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07f2fs: report # of free inodes more preciselyJaegeuk Kim
If the partition is small, we don't need to report total # of inodes including hidden free nodes. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07Merge tag 'libnvdimm-for-4.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "libnvdimm updates for the latest ACPI and UEFI specifications. This pull request also includes new 'struct dax_operations' enabling to undo the abuse of copy_user_nocache() for copy operations to pmem. The dax work originally missed 4.12 to address concerns raised by Al. Summary: - Introduce the _flushcache() family of memory copy helpers and use them for persistent memory write operations on x86. The _flushcache() semantic indicates that the cache is either bypassed for the copy operation (movnt) or any lines dirtied by the copy operation are written back (clwb, clflushopt, or clflush). - Extend dax_operations with ->copy_from_iter() and ->flush() operations. These operations and other infrastructure updates allow all persistent memory specific dax functionality to be pushed into libnvdimm and the pmem driver directly. It also allows dax-specific sysfs attributes to be linked to a host device, for example: /sys/block/pmem0/dax/write_cache - Add support for the new NVDIMM platform/firmware mechanisms introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2 namespace label format, extensions to the address-range-scrub command set, new error injection commands, and a new BTT (block-translation-table) layout. These updates support inter-OS and pre-OS compatibility. - Fix a longstanding memory corruption bug in nfit_test. - Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2) capable. - Miscellaneous fixes and small updates across libnvdimm and the nfit driver. Acknowledgements that came after the branch was pushed: commit 6aa734a2f38e ("libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani <toshi.kani@hpe.com>" * tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits) libnvdimm, namespace: record 'lbasize' for pmem namespaces acpi/nfit: Issue Start ARS to retrieve existing records libnvdimm: New ACPI 6.2 DSM functions acpi, nfit: Show bus_dsm_mask in sysfs libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru. acpi, nfit: Enable DSM pass thru for root functions. libnvdimm: passthru functions clear to send libnvdimm, btt: convert some info messages to warn/err libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime libnvdimm: fix the clear-error check in nsio_rw_bytes libnvdimm, btt: fix btt_rw_page not returning errors acpi, nfit: quiet invalid block-aperture-region warnings libnvdimm, btt: BTT updates for UEFI 2.7 format acpi, nfit: constify *_attribute_group libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region libnvdimm, pmem, dax: export a cache control attribute dax: convert to bitmask for flags dax: remove default copy_from_iter fallback libnvdimm, nfit: enable support for volatile ranges libnvdimm, pmem: fix persistence warning ...
2017-07-07xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLENDarrick J. Wong
XFS has a maximum symlink target length of 1024 bytes; this is a holdover from the Irix days. Unfortunately, the constant establishing this is 'MAXPATHLEN' and is /not/ the same as the Linux MAXPATHLEN, which is 4096. The kernel enforces its 1024 byte MAXPATHLEN on symlink targets, but xfsprogs picks up the (Linux) system 4096 byte MAXPATHLEN, which means that xfs_repair doesn't complain about oversized symlinks. Since this is an on-disk format constraint, put the define in the XFS namespace and move everything over to use the new name. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-07-07ceph: update ceph_dentry_info::lease_session when necessaryYan, Zheng
Current code does not update ceph_dentry_info::lease_session once it is set. If auth mds of corresponding dentry changes, dentry lease keeps in an invalid state. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07ceph: new mount option that specifies fscache uniquifierYan, Zheng
Current ceph uses FSID as primary index key of fscache data. This allows ceph to retain cached data across remount. But this causes problem (kernel opps, fscache does not support sharing data) when a filesystem get mounted several times (with fscache enabled, with different mount options). The fix is adding a new mount option, which specifies uniquifier for fscache. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07ceph: avoid accessing freeing inode in ceph_check_delayed_caps()Yan, Zheng
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07ceph: avoid invalid memory dereference in the middle of umountYan, Zheng
extra_mon_dispatch() and debugfs' foo_show functions dereference fsc->mdsc. we should clean up fsc->client->extra_mon_dispatch and debugfs before destroying fsc->mds. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07ceph: getattr before read on ceph.* xattrsYan, Zheng
Previously we were returning values for quota, layout xattrs without any kind of update -- the user just got whatever happened to be in our cache. Clearly this extra round trip has a cost, but reads of these xattrs are fairly rare, happening on admin intervention rather than in normal operation. Link: http://tracker.ceph.com/issues/17939 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07ceph: don't re-send interrupted flock requestYan, Zheng
Don't re-send interrupted flock request in cases of mds failover and receiving request forward. Because corresponding 'lock intr' request may have been finished, it won't get re-sent. Link: http://tracker.ceph.com/issues/20170 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07ceph: cleanup writepage_nounlock()Yan, Zheng
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07ceph: redirty page when writepage_nounlock() skips unwritable pageYan, Zheng
Ceph needs to flush dirty page in the order in which in which snap context they belong to. Dirty pages belong to older snap context should be flushed earlier. if writepage_nounlock() can not flush a page, it should redirty the page. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07ceph: remove useless page->mapping check in writepage_nounlock()Yan, Zheng
Callers of writepage_nounlock() have already ensured non-null page->mapping. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07ceph: update the 'approaching max_size' codeYan, Zheng
The old 'approaching max_size' code expects MDS set max_size to '2 * reported_size'. This is no longer true. The new code reports file size when half of previous max_size increment has been used. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07ceph: re-request max size after importing capsYan, Zheng
The 'wanted max size' could be sent to inode's old auth mds, re-send it to inode's new auth mds if necessary. Otherwise write syscall may hang. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: - a few hotfixes - various misc updates - ocfs2 updates - most of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (108 commits) mm, memory_hotplug: move movable_node to the hotplug proper mm, memory_hotplug: drop CONFIG_MOVABLE_NODE mm, memory_hotplug: drop artificial restriction on online/offline mm: memcontrol: account slab stats per lruvec mm: memcontrol: per-lruvec stats infrastructure mm: memcontrol: use generic mod_memcg_page_state for kmem pages mm: memcontrol: use the node-native slab memory counters mm: vmstat: move slab statistics from zone to node counters mm/zswap.c: delete an error message for a failed memory allocation in zswap_dstmem_prepare() mm/zswap.c: improve a size determination in zswap_frontswap_init() mm/zswap.c: delete an error message for a failed memory allocation in zswap_pool_create() mm/swapfile.c: sort swap entries before free mm/oom_kill: count global and memory cgroup oom kills mm: per-cgroup memory reclaim stats mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objects mm: kmemleak: factor object reference updating out of scan_block() mm: kmemleak: slightly reduce the size of some structures on 64-bit architectures mm, mempolicy: don't check cpuset seqlock where it doesn't matter mm, cpuset: always use seqlock when changing task's nodemask mm, mempolicy: simplify rebinding mempolicies when updating cpusets ...
2017-07-06Merge branch 'misc.compat' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc compat stuff updates from Al Viro: "This part is basically untangling various compat stuff. Compat syscalls moved to their native counterparts, getting rid of quite a bit of double-copying and/or set_fs() uses. A lot of field-by-field copyin/copyout killed off. - kernel/compat.c is much closer to containing just the copyin/copyout of compat structs. Not all compat syscalls are gone from it yet, but it's getting there. - ipc/compat_mq.c killed off completely. - block/compat_ioctl.c cleaned up; floppy compat ioctls moved to drivers/block/floppy.c where they belong. Yes, there are several drivers that implement some of the same ioctls. Some are m68k and one is 32bit-only pmac. drivers/block/floppy.c is the only one in that bunch that can be built on biarch" * 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: mqueue: move compat syscalls to native ones usbdevfs: get rid of field-by-field copyin compat_hdio_ioctl: get rid of set_fs() take floppy compat ioctls to sodding floppy.c ipmi: get rid of field-by-field __get_user() ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one rt_sigtimedwait(): move compat to native select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap() put_compat_rusage(): switch to copy_to_user() sigpending(): move compat to native getrlimit()/setrlimit(): move compat to native times(2): move compat to native compat_{get,put}_bitmap(): use unsafe_{get,put}_user() fb_get_fscreeninfo(): don't bother with do_fb_ioctl() do_sigaltstack(): lift copying to/from userland into callers take compat_sys_old_getrlimit() to native syscall trim __ARCH_WANT_SYS_OLD_GETRLIMIT
2017-07-06mm: per-cgroup memory reclaim statsRoman Gushchin
Track the following reclaim counters for every memory cgroup: PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED. These values are exposed using the memory.stats interface of cgroup v2. The meaning of each value is the same as for global counters, available using /proc/vmstat. Also, for consistency, rename mem_cgroup_count_vm_event() to count_memcg_event_mm(). Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm/hugetlb: add size parameter to huge_pte_offset()Punit Agrawal
A poisoned or migrated hugepage is stored as a swap entry in the page tables. On architectures that support hugepages consisting of contiguous page table entries (such as on arm64) this leads to ambiguity in determining the page table entry to return in huge_pte_offset() when a poisoned entry is encountered. Let's remove the ambiguity by adding a size parameter to convey additional information about the requested address. Also fixup the definition/usage of huge_pte_offset() throughout the tree. Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com Signed-off-by: Punit Agrawal <punit.agrawal@arm.com> Acked-by: Steve Capper <steve.capper@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: James Hogan <james.hogan@imgtec.com> (odd fixer:METAG ARCHITECTURE) Cc: Ralf Baechle <ralf@linux-mips.org> (supporter:MIPS) Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm: update callers to use HASH_ZERO flagPavel Tatashin
Update dcache, inode, pid, mountpoint, and mount hash tables to use HASH_ZERO, and remove initialization after allocations. In case of places where HASH_EARLY was used such as in __pv_init_lock_hash the zeroed hash table was already assumed, because memblock zeroes the memory. CPU: SPARC M6, Memory: 7T Before fix: Dentry cache hash table entries: 1073741824 Inode-cache hash table entries: 536870912 Mount-cache hash table entries: 16777216 Mountpoint-cache hash table entries: 16777216 ftrace: allocating 20414 entries in 40 pages Total time: 11.798s After fix: Dentry cache hash table entries: 1073741824 Inode-cache hash table entries: 536870912 Mount-cache hash table entries: 16777216 Mountpoint-cache hash table entries: 16777216 ftrace: allocating 20414 entries in 40 pages Total time: 3.198s CPU: Intel Xeon E5-2630, Memory: 2.2T: Before fix: Dentry cache hash table entries: 536870912 Inode-cache hash table entries: 268435456 Mount-cache hash table entries: 8388608 Mountpoint-cache hash table entries: 8388608 CPU: Physical Processor ID: 0 Total time: 3.245s After fix: Dentry cache hash table entries: 536870912 Inode-cache hash table entries: 268435456 Mount-cache hash table entries: 8388608 Mountpoint-cache hash table entries: 8388608 CPU: Physical Processor ID: 0 Total time: 3.244s Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Babu Moger <babu.moger@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06fs/userfaultfd.c: drop dead codeMike Rapoport
Calculation of start end end in __wake_userfault function are not used and can be removed. Link: http://lkml.kernel.org/r/1494930917-3134-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06fs/file.c: replace alloc_fdmem() with kvmalloc() alternativeMichal Hocko
There is no real reason to duplicate kvmalloc* helpers so drop alloc_fdmem and replace it with the appropriate library function. Link: http://lkml.kernel.org/r/20170531155145.17111-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06ocfs2: constify attribute_group structuresArvind Yadav
attribute_groups are not supposed to change at runtime. All functions working with attribute_groups provided by <linux/sysfs.h> work with const attribute_group. So mark the non-const structs as const. File size before: text data bss dec hex filename 4402 1088 38 5528 1598 fs/ocfs2/stackglue.o File size After adding 'const': text data bss dec hex filename 4442 1024 38 5504 1580 fs/ocfs2/stackglue.o Link: http://lkml.kernel.org/r/cab4e59b4918db3ed2ec77073a4cb310c4429ef5.1498808026.git.arvind.yadav.cs@gmail.com Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06ocfs2: free 'dummy_sc' in sc_fop_release() to prevent memory leakpiaojun
'sd->dbg_sock' is malloced in sc_common_open(), but not freed at the end of sc_fop_release(). Link: http://lkml.kernel.org/r/594FB0A4.2050105@huawei.com Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06ocfs2: use magic.hFabian Frederick
Filesystems generally use SUPER_MAGIC values from magic.h instead of a local definition. Link: http://lkml.kernel.org/r/20170521154217.27917-1-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06ocfs2: fix a static checker warningGang He
Fix a static code checker warning: fs/ocfs2/inode.c:179 ocfs2_iget() warn: passing zero to 'ERR_PTR' Fixes: d56a8f32e4c6 ("ocfs2: check/fix inode block for online file check") Link: http://lkml.kernel.org/r/1495516634-1952-1-git-send-email-ghe@suse.com Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Reviewed-by: Eric Ren <zren@suse.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06ext4: fix spelling mistake: "prellocated" -> "preallocated"Colin Ian King
Trivial fix to spelling mistake in mb_debug debug message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-07-06Merge branch 'for-4.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "These are the percpu changes for the v4.13-rc1 merge window. There are a couple visibility related changes - tracepoints and allocator stats through debugfs, along with __ro_after_init markings and a cosmetic rename in percpu_counter. Please note that the simple O(#elements_in_the_chunk) area allocator used by percpu allocator is again showing scalability issues, primarily with bpf allocating and freeing large number of counters. Dennis is working on the replacement allocator and the percpu allocator will be seeing increased churns in the coming cycles" * 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: fix static checker warnings in pcpu_destroy_chunk percpu: fix early calls for spinlock in pcpu_stats percpu: resolve err may not be initialized in pcpu_alloc percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch percpu: add tracepoint support for percpu memory percpu: expose statistics about percpu memory via debugfs percpu: migrate percpu data structures to internal header percpu: add missing lockdep_assert_held to func pcpu_free_area mark most percpu globals as __ro_after_init
2017-07-06move file_{start,end}_write() out of do_iter_write()Al Viro
... and do *not* grab it in vfs_write_iter(). Fixes: "fs: implement vfs_iter_read using do_iter_read" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-06btrfs: minimal conversion to errseq_t writeback error reporting on fsyncJeff Layton
Just check and advance the errseq_t in the file before returning, and use an errseq_t based check for writeback errors. Other internal callers of filemap_* functions are left as-is. Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-06xfs: minimal conversion to errseq_t writeback error reportingJeff Layton
Just check and advance the data errseq_t in struct file before before returning from fsync on normal files. Internal filemap_* callers are left as-is. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-06ext4: use errseq_t based error handling for reporting data writeback errorsJeff Layton
Add a call to filemap_report_wb_err at the end of ext4_sync_file. This will ensure that we check and advance the errseq_t in the file, which allows us to track and report errors on all open fds when they occur. Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-06fs: convert __generic_file_fsync to use errseq_t based reportingJeff Layton
Many simple, block-based filesystems use generic_file_fsync as their fsync operation. Some others (ext* and fat) also call this function to handle syncing out data. Switch this code over to use errseq_t based error reporting so that all of these filesystems get reliable error reporting via fsync, fdatasync and msync. Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-06block: convert to errseq_t based writeback error trackingJeff Layton
This is a very minimal conversion to errseq_t based error tracking for raw block device access. Just have it use the standard file_write_and_wait_range call. Note that there are internal callers that call sync_blockdev and the like that are not affected by this. They'll continue to use the AS_EIO/AS_ENOSPC flags for error reporting like they always have for now. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-06dax: set errors in mapping when writeback failsJeff Layton
Jan Kara's description for this patch is much better than mine, so I'm quoting it verbatim here: DAX currently doesn't set errors in the mapping when cache flushing fails in dax_writeback_mapping_range(). Since this function can get called only from fsync(2) or sync(2), this is actually as good as it can currently get since we correctly propagate the error up from dax_writeback_mapping_range() to filemap_fdatawrite() However, in the future better writeback error handling will enable us to properly report these errors on fsync(2) even if there are multiple file descriptors open against the file or if sync(2) gets called before fsync(2). So convert DAX to using standard error reporting through the mapping. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
2017-07-06fs: new infrastructure for writeback error handling and reportingJeff Layton
Most filesystems currently use mapping_set_error and filemap_check_errors for setting and reporting/clearing writeback errors at the mapping level. filemap_check_errors is indirectly called from most of the filemap_fdatawait_* functions and from filemap_write_and_wait*. These functions are called from all sorts of contexts to wait on writeback to finish -- e.g. mostly in fsync, but also in truncate calls, getattr, etc. The non-fsync callers are problematic. We should be reporting writeback errors during fsync, but many places spread over the tree clear out errors before they can be properly reported, or report errors at nonsensical times. If I get -EIO on a stat() call, there is no reason for me to assume that it is because some previous writeback failed. The fact that it also clears out the error such that a subsequent fsync returns 0 is a bug, and a nasty one since that's potentially silent data corruption. This patch adds a small bit of new infrastructure for setting and reporting errors during address_space writeback. While the above was my original impetus for adding this, I think it's also the case that current fsync semantics are just problematic for userland. Most applications that call fsync do so to ensure that the data they wrote has hit the backing store. In the case where there are multiple writers to the file at the same time, this is really hard to determine. The first one to call fsync will see any stored error, and the rest get back 0. The processes with open fds may not be associated with one another in any way. They could even be in different containers, so ensuring coordination between all fsync callers is not really an option. One way to remedy this would be to track what file descriptor was used to dirty the file, but that's rather cumbersome and would likely be slow. However, there is a simpler way to improve the semantics here without incurring too much overhead. This set adds an errseq_t to struct address_space, and a corresponding one is added to struct file. Writeback errors are recorded in the mapping's errseq_t, and the one in struct file is used as the "since" value. This changes the semantics of the Linux fsync implementation such that applications can now use it to determine whether there were any writeback errors since fsync(fd) was last called (or since the file was opened in the case of fsync having never been called). Note that those writeback errors may have occurred when writing data that was dirtied via an entirely different fd, but that's the case now with the current mapping_set_error/filemap_check_error infrastructure. This will at least prevent you from getting a false report of success. The new behavior is still consistent with the POSIX spec, and is more reliable for application developers. This patch just adds some basic infrastructure for doing this, and ensures that the f_wb_err "cursor" is properly set when a file is opened. Later patches will change the existing code to use this new infrastructure for reporting errors at fsync time. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06jbd2: don't clear and reset errors after waiting on writebackJeff Layton
Resetting this flag is almost certainly racy, and will be problematic with some coming changes. Make filemap_fdatawait_keep_errors return int, but not clear the flag(s). Have jbd2 call it instead of filemap_fdatawait and don't attempt to re-set the error flag if it fails. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-06buffer: set errors in mapping at the time that the error occursJeff Layton
I noticed on xfs that I could still sometimes get back an error on fsync on a fd that was opened after the error condition had been cleared. The problem is that the buffer code sets the write_io_error flag and then later checks that flag to set the error in the mapping. That flag perisists for quite a while however. If the file is later opened with O_TRUNC, the buffers will then be invalidated and the mapping's error set such that a subsequent fsync will return error. I think this is incorrect, as there was no writeback between the open and fsync. Add a new mark_buffer_write_io_error operation that sets the flag and the error in the mapping at the same time. Replace all calls to set_buffer_write_io_error with mark_buffer_write_io_error, and remove the places that check this flag in order to set the error in the mapping. This sets the error in the mapping earlier, at the time that it's first detected. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>