linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2016-05-19	fs: poll/select/recvmmsg: use timespec64 for timeout events	Deepa Dinamani
	struct timespec is not y2038 safe. Even though timespec might be sufficient to represent timeouts, use struct timespec64 here as the plan is to get rid of all timespec reference in the kernel. The patch transitions the common functions: poll_select_set_timeout() and select_estimate_accuracy() to use timespec64. And, all the syscalls that use these functions are transitioned in the same patch. The restart block parameters for poll uses monotonic time. Use timespec64 here as well to assign timeout value. This parameter in the restart block need not change because this only holds the monotonic timestamp at which timeout should occur. And, unsigned long data type should be big enough for this timestamp. The system call interfaces will be handled in a separate series. Compat interfaces need not change as timespec64 is an alias to struct timespec on a 64 bit system. Link: http://lkml.kernel.org/r/1461947989-21926-3-git-send-email-deepa.kernel@gmail.com Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: John Stultz <john.stultz@linaro.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19	fsnotify: avoid spurious EMFILE errors from inotify_init()	Jan Kara
	Inotify instance is destroyed when all references to it are dropped. That not only means that the corresponding file descriptor needs to be closed but also that all corresponding instance marks are freed (as each mark holds a reference to the inotify instance). However marks are freed only after SRCU period ends which can take some time and thus if user rapidly creates and frees inotify instances, number of existing inotify instances can exceed max_user_instances limit although from user point of view there is always at most one existing instance. Thus inotify_init() returns EMFILE error which is hard to justify from user point of view. This problem is exposed by LTP inotify06 testcase on some machines. We fix the problem by making sure all group marks are properly freed while destroying inotify instance. We wait for SRCU period to end in that path anyway since we have to make sure there is no event being added to the instance while we are tearing down the instance. So it takes only some plumbing to allow for marks to be destroyed in that path as well and not from a dedicated work item. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> Tested-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20	Merge branch 'xfs-4.7-inode-reclaim' into for-next	Dave Chinner

2016-05-20	Merge branch 'xfs-4.7-error-cfg' into for-next	Dave Chinner

2016-05-20	Merge branch 'xfs-4.7-misc-fixes' into for-next	Dave Chinner

2016-05-20	Merge branch 'xfs-4.7-cleanup-attr-listent' into for-next	Dave Chinner

2016-05-20	Merge branch 'xfs-4.7-optimise-inline-symlinks' into for-next	Dave Chinner

2016-05-20	Merge branch 'xfs-4.7-trans-type-cleanup' into for-next	Dave Chinner

2016-05-20	Merge branch 'xfs-4.7-writeback-bio' into for-next	Dave Chinner

2016-05-20	xfs: fix warning in xfs_finish_page_writeback for non-debug builds	Christoph Hellwig
	blockmask is unused if ASSERTs are disabled. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-05-19	dax: Remove i_mmap_lock protection	Jan Kara
	Currently faults are protected against truncate by filesystem specific i_mmap_sem and page lock in case of hole page. Cow faults are protected DAX radix tree entry locking. So there's no need for i_mmap_lock in DAX code. Remove it. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
2016-05-19	dax: Use radix tree entry lock to protect cow faults	Jan Kara
	When doing cow faults, we cannot directly fill in PTE as we do for other faults as we rely on generic code to do proper accounting of the cowed page. We also have no page to lock to protect against races with truncate as other faults have and we need the protection to extend until the moment generic code inserts cowed page into PTE thus at that point we have no protection of fs-specific i_mmap_sem. So far we relied on using i_mmap_lock for the protection however that is completely special to cow faults. To make fault locking more uniform use DAX entry lock instead. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
2016-05-19	dax: New fault locking	Jan Kara
	Currently DAX page fault locking is racy. CPU0 (write fault) CPU1 (read fault) __dax_fault() __dax_fault() get_block(inode, block, &bh, 0) -> not mapped get_block(inode, block, &bh, 0) -> not mapped if (!buffer_mapped(&bh)) if (vmf->flags & FAULT_FLAG_WRITE) get_block(inode, block, &bh, 1) -> allocates blocks if (page) -> no if (!buffer_mapped(&bh)) if (vmf->flags & FAULT_FLAG_WRITE) { } else { dax_load_hole(); } dax_insert_mapping() And we are in a situation where we fail in dax_radix_entry() with -EIO. Another problem with the current DAX page fault locking is that there is no race-free way to clear dirty tag in the radix tree. We can always end up with clean radix tree and dirty data in CPU cache. We fix the first problem by introducing locking of exceptional radix tree entries in DAX mappings acting very similarly to page lock and thus synchronizing properly faults against the same mapping index. The same lock can later be used to avoid races when clearing radix tree dirty tag. Reviewed-by: NeilBrown <neilb@suse.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
2016-05-19	dax: Define DAX lock bit for radix tree exceptional entry	Jan Kara
	We will use lowest available bit in the radix tree exceptional entry for locking of the entry. Define it. Also clean up definitions of DAX entry type bits in DAX exceptional entries to use defined constants instead of hardcoding numbers and cleanup checking of these bits to not rely on how other bits in the entry are set. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
2016-05-19	dax: Make huge page handling depend of CONFIG_BROKEN	Jan Kara
	Currently the handling of huge pages for DAX is racy. For example the following can happen: CPU0 (THP write fault) CPU1 (normal read fault) __dax_pmd_fault() __dax_fault() get_block(inode, block, &bh, 0) -> not mapped get_block(inode, block, &bh, 0) -> not mapped if (!buffer_mapped(&bh) && write) get_block(inode, block, &bh, 1) -> allocates blocks truncate_pagecache_range(inode, lstart, lend); dax_load_hole(); This results in data corruption since process on CPU1 won't see changes into the file done by CPU0. The race can happen even if two normal faults race however with THP the situation is even worse because the two faults don't operate on the same entries in the radix tree and we want to use these entries for serialization. So make THP support in DAX code depend on CONFIG_BROKEN for now. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
2016-05-19	dax: Fix condition for filling of PMD holes	Jan Kara
	Currently dax_pmd_fault() decides to fill a PMD-sized hole only if returned buffer has BH_Uptodate set. However that doesn't get set for any mapping buffer so that branch is actually a dead code. The BH_Uptodate check doesn't make any sense so just remove it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
2016-05-19	ocfs2/cluster: block BH in TCP callbacks	Eric Dumazet
	TCP stack can now run from process context. Use read_lock_bh() variant to restore previous assumption. Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog") Fixes: d41a69f1d390 ("tcp: make tcp_sendmsg() aware of socket backlog") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-19	Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus	Linus Torvalds
	Pull MIPS updates from Ralf Baechle: "This is the main pull request for MIPS for 4.7. Here's the summary of the changes: - ATH79: Support for DTB passuing using the UHI boot protocol - ATH79: Remove support for builtin DTB. - ATH79: Add zboot debug serial support. - ATH79: Add initial support for Dragino MS14 (Dragine 2), Onion Omega and DPT-Module. - ATH79: Update devicetree clock support for AR9132 and AR9331. - ATH79: Cleanup the DT code. - ATH79: Support newer SOCs in ath79_ddr_ctrl_init. - ATH79: Fix regression in PCI window initialization. - BCM47xx: Move SPROM driver to drivers/firmware/ - BCM63xx: Enable partition parser in defconfig. - BMIPS: BMIPS5000 has I cache filing from D cache - BMIPS: BMIPS: Add cpu-feature-overrides.h - BMIPS: Add Whirlwind support - BMIPS: Adjust mips-hpt-frequency for BCM7435 - BMIPS: Remove maxcpus from BCM97435SVMB DTS - BMIPS: Add missing 7038 L1 register cells to BCM7435 - BMIPS: Various tweaks to initialization code. - BMIPS: Enable partition parser in defconfig. - BMIPS: Cache tweaks. - BMIPS: Add UART, I2C and SATA devices to DT. - BMIPS: Add BCM6358 and BCM63268support - BMIPS: Add device tree example for BCM6358. - BMIPS: Improve Improve BCM6328 and BCM6368 device trees - Lantiq: Add support for device tree file from boot loader - Lantiq: Allow build with no built-in DT. - Loongson 3: Reserve 32MB for RS780E integrated GPU. - Loongson 3: Fix build error after ld-version.sh modification - Loongson 3: Move chipset ACPI code from drivers to arch. - Loongson 3: Speedup irq processing. - Loongson 3: Add basic Loongson 3A support. - Loongson 3: Set cache flush handlers to nop. - Loongson 3: Invalidate special TLBs when needed. - Loongson 3: Fast TLB refill handler. - MT7620: Fallback strategy for invalid syscfg0. - Netlogic: Fix CP0_EBASE redefinition warnings - Octeon: Initialization fixes - Octeon: Add DTS files for the D-Link DSR-1000N and EdgeRouter Lite - Octeon: Enable add Octeon-drivers in cavium_octeon_defconfig - Octeon: Correctly handle endian-swapped initramfs images. - Octeon: Support CN73xx, CN75xx and CN78xx. - Octeon: Remove dead code from cvmx-sysinfo. - Octeon: Extend number of supported CPUs past 32. - Octeon: Remove some code limiting NR_IRQS to 255. - Octeon: Simplify octeon_irq_ciu_gpio_set_type. - Octeon: Mark some functions __init in smp.c - Octeon: Octeon: Add Octeon III CN7xxx interface detection - PIC32: Add serial driver and bindings for it. - PIC32: Add PIC32 deadman timer driver and bindings. - PIC32: Add PIC32 clock timer driver and bindings. - Pistachio: Determine SoC revision during boot - Sibyte: Fix Kconfig dependencies of SIBYTE_BUS_WATCHER. - Sibyte: Strip redundant comments from bcm1480_regs.h. - Panic immediately if panic_on_oops is set. - module: fix incorrect IS_ERR_VALUE macro usage. - module: Make consistent use of pr_* - Remove no longer needed work_on_cpu() call. - Remove CONFIG_IPV6_PRIVACY from defconfigs. - Fix registers of non-crashing CPUs in dumps. - Handle MIPSisms in new vmcore_elf32_check_arch. - Select CONFIG_HANDLE_DOMAIN_IRQ and make it work. - Allow RIXI to be used on non-R2 or R6 cores. - Reserve nosave data for hibernation - Fix siginfo.h to use strict POSIX types. - Don't unwind user mode with EVA. - Fix watchpoint restoration - Ptrace watchpoints for R6. - Sync icache when it fills from dcache - I6400 I-cache fills from dcache. - Various MSA fixes. - Cleanup MIPS_CPU_* definitions. - Signal: Move generic copy_siginfo to signal.h - Signal: Fix uapi include in exported asm/siginfo.h - Timer fixes for sake of KVM. - XPA TLB refill fixes. - Treat perf counter feature - Update John Crispin's email address - Add PIC32 watchdog and bindings. - Handle R10000 LL/SC bug in set_pte() - cpufreq: Various fixes for Longson1. - R6: Fix R2 emulation. - mathemu: Cosmetic fix to ADDIUPC emulation, plenty of other small fixes - ELF: ABI and FP fixes. - Allow for relocatable kernel and use that to support KASLR. - Fix CPC_BASE_ADDR mask - Plenty fo smp-cps, CM, R6 and M6250 fixes. - Make reset_control_ops const. - Fix kernel command line handling of leading whitespace. - Cleanups to cache handling. - Add brcm, bcm6345-l1-intc device tree bindings. - Use generic clkdev.h header - Remove CLK_IS_ROOT usage. - Misc small cleanups. - CM: Fix compilation error when !MIPS_CM - oprofile: Fix a preemption issue - Detect DSP ASE v3 support:1" * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (275 commits) MIPS: pic32mzda: fix getting timer clock rate. MIPS: ath79: fix regression in PCI window initialization MIPS: ath79: make ath79_ddr_ctrl_init() compatible for newer SoCs MIPS: Fix VZ probe gas errors with binutils <2.24 MIPS: perf: Fix I6400 event numbers MIPS: DEC: Export `ioasic_ssr_lock' to modules MIPS: MSA: Fix a link error on `_init_msa_upper' with older GCC MIPS: CM: Fix compilation error when !MIPS_CM MIPS: Fix genvdso error on rebuild USB: ohci-jz4740: Remove obsolete driver MIPS: JZ4740: Probe OHCI platform device via DT MIPS: JZ4740: Qi LB60: Remove support for AVT2 variant MIPS: pistachio: Determine SoC revision during boot MIPS: BMIPS: Adjust mips-hpt-frequency for BCM7435 mips: mt7620: fallback to SDRAM when syscfg0 does not have a valid value for the memory type MIPS: Prevent "restoration" of MSA context in non-MSA kernels MIPS: cevt-r4k: Dynamically calculate min_delta_ns MIPS: malta-time: Take seconds into account MIPS: malta-time: Start GIC count before syncing to RTC MIPS: Force CPUs to lose FP context during mode switches ...
2016-05-19	Merge branch 'next' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "Highlights: - A new LSM, "LoadPin", from Kees Cook is added, which allows forcing of modules and firmware to be loaded from a specific device (this is from ChromeOS, where the device as a whole is verified cryptographically via dm-verity). This is disabled by default but can be configured to be enabled by default (don't do this if you don't know what you're doing). - Keys: allow authentication data to be stored in an asymmetric key. Lots of general fixes and updates. - SELinux: add restrictions for loading of kernel modules via finit_module(). Distinguish non-init user namespace capability checks. Apply execstack check on thread stacks" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (48 commits) LSM: LoadPin: provide enablement CONFIG Yama: use atomic allocations when reporting seccomp: Fix comment typo ima: add support for creating files using the mknodat syscall ima: fix ima_inode_post_setattr vfs: forbid write access when reading a file into memory fs: fix over-zealous use of "const" selinux: apply execstack check on thread stacks selinux: distinguish non-init user namespace capability checks LSM: LoadPin for kernel file loading restrictions fs: define a string representation of the kernel_read_file_id enumeration Yama: consolidate error reporting string_helpers: add kstrdup_quotable_file string_helpers: add kstrdup_quotable_cmdline string_helpers: add kstrdup_quotable selinux: check ss_initialized before revalidating an inode label selinux: delay inode label lookup as long as possible selinux: don't revalidate an inode's label when explicitly setting it selinux: Change bool variable name to index. KEYS: Add KEYCTL_DH_COMPUTE command ...
2016-05-19	udf: Use correct partition reference number for metadata	Alden Tondettar
	UDF/OSTA terminology is confusing. Partition Numbers (PNs) are arbitrary 16-bit values, one for each physical partition in the volume. Partition Reference Numbers (PRNs) are indices into the the Partition Map Table and do not necessarily equal the PN of the mapped partition. The current metadata code mistakenly uses the PN instead of the PRN when mapping metadata blocks to physical/sparable blocks. Windows-created UDF 2.5 discs for some reason use large, arbitrary PNs, resulting in mount failure and KASAN read warnings in udf_read_inode(). For example, a NetBSD UDF 2.5 partition might look like this: PRN PN Type --- -- ---- 0 0 Sparable 1 0 Metadata Since PRN == PN, we are fine. But Windows could gives us: PRN PN Type --- ---- ---- 0 8192 Sparable 1 8192 Metadata So udf_read_inode() will start out by checking the partition length in sbi->s_partmaps[8192], which is obviously out of bounds. Fix this by creating a new field (s_phys_partition_ref) in struct udf_meta_data, referencing whatever physical or sparable map has the same partition number as the metadata partition. [JK: Add comment about s_phys_partition_ref, change its name] Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2016-05-19	udf: Use IS_ERR when loading metadata mirror file entry	Alden Tondettar
	Currently when udf_get_pblock_meta25() fails to map a block using the primary metadata file, it will attempt to load the mirror file entry by calling udf_find_metadata_inode_efe(). That function will return a ERR_PTR if it fails, but the return value is only checked against NULL. Test the return value using IS_ERR() and change it to NULL if needed. Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2016-05-19	udf: Don't BUG on missing metadata partition descriptor	Alden Tondettar
	Currently, if a metadata partition map is missing its partition descriptor, then udf_get_pblock_meta25() will BUG() out the first time it is called. This is rather drastic for a corrupted filesystem, so just treat this case as an invalid mapping instead. Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2016-05-18	f2fs: make exit_f2fs_fs more clear	Tiezhu Yang
	init_f2fs_fs does: 1) f2fs_build_trace_ios 2) init_inodecache 3) create_node_manager_caches 4) create_segment_manager_caches 5) create_checkpoint_caches 6) create_extent_cache 7) kset_create_and_add 8) kobject_init_and_add 9) register_shrinker 10) register_filesystem 11) f2fs_create_root_stats 12) proc_mkdir exit_f2fs_fs should do cleanup in the reverse order to make the code more clear. Signed-off-by: Tiezhu Yang <kernelpatch@126.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-18	f2fs: use percpu_counter for total_valid_inode_count	Jaegeuk Kim
	This patch uses percpu_counter to avoid stat_lock. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-18	f2fs: use percpu_counter for alloc_valid_block_count	Jaegeuk Kim
	This patch uses percpu_count for sbi->alloc_valid_block_count. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-18	f2fs: use percpu_counter for # of dirty pages in inode	Jaegeuk Kim
	This patch adds percpu_counter for # of dirty pages in inode. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-18	f2fs: use percpu_counter for page counters	Jaegeuk Kim
	This patch substitutes percpu_counter for atomic_counter when counting various types of pages. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-18	f2fs: use bio count instead of F2FS_WRITEBACK page count	Jaegeuk Kim
	This can reduce page counting overhead. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-18	Merge branch 'work.misc' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs cleanups from Al Viro: "Assorted cleanups and fixes all over the place" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: coredump: only charge written data against RLIMIT_CORE coredump: get rid of coredump_params->written ecryptfs_lookup(): try either only encrypted or plaintext name ecryptfs: avoid multiple aliases for directories bpf: reject invalid names right in ->lookup() __d_alloc(): treat NULL name as QSTR("/", 1) mtd: switch ubi_open_volume_path() to vfs_stat() mtd: switch open_mtd_by_chdev() to use of vfs_stat()
2016-05-18	Merge branch 'work.iov_iter' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter cleanups from Al Viro. * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fold checks into iterate_and_advance() rw_verify_area(): saner calling conventions aio: remove a pointless assignment
2016-05-18	dax: fix a comment in dax_zero_page_range and dax_truncate_page	Vishal Verma
	The distinction between PAGE_SIZE and PAGE_CACHE_SIZE was removed in 09cbfea mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros The comments for the above functions described a distinction between those, that is now redundant, so remove those paragraphs Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-18	dax: for truncate/hole-punch, do zeroing through the driver if possible	Vishal Verma
	In the truncate or hole-punch path in dax, we clear out sub-page ranges. If these sub-page ranges are sector aligned and sized, we can do the zeroing through the driver instead so that error-clearing is handled automatically. For sub-sector ranges, we still have to rely on clear_pmem and have the possibility of tripping over errors. Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-18	dax: export a low-level __dax_zero_page_range helper	Christoph Hellwig
	This allows XFS to perform zeroing using the iomap infrastructure and avoid buffer heads. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> [vishal: fix conflicts with dax-error-handling] Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-18	dax: use sb_issue_zerout instead of calling dax_clear_sectors	Matthew Wilcox
	dax_clear_sectors() cannot handle poisoned blocks. These must be zeroed using the BIO interface instead. Convert ext2 and XFS to use only sb_issue_zerout(). Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> [vishal: Also remove the dax_clear_sectors function entirely] Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-18	dax: enable dax in the presence of known media errors (badblocks)	Dan Williams
	1/ If a mapping overlaps a bad sector fail the request. 2/ Do not opportunistically report more dax-capable capacity than is requested when errors present. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> [vishal: fix a conflict with system RAM collision patches] [vishal: add a 'size' parameter to ->direct_access] [vishal: fix a conflict with DAX alignment check patches] Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-18	Merge branch 'work.lookups' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull parallel lookup fixups from Al Viro: "Fix for xfs parallel readdir (turns out the cxfs exposure was not enough to catch all problems), and a reversion of btrfs back to ->iterate() until the fs/btrfs/delayed-inode.c gets fixed" * 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: xfs: concurrent readdir hangs on data buffer locks Revert "btrfs: switch to ->iterate_shared()"
2016-05-18	xfs: concurrent readdir hangs on data buffer locks	Dave Chinner
	There's a three-process deadlock involving shared/exclusive barriers and inverted lock orders in the directory readdir implementation. It's a pre-existing problem with lock ordering, exposed by the VFS parallelisation code. process 1 process 2 process 3 --------- --------- --------- readdir iolock(shared) get_leaf_dents iterate entries ilock(shared) map, lock and read buffer iunlock(shared) process entries in buffer ..... readdir iolock(shared) get_leaf_dents iterate entries ilock(shared) map, lock buffer <blocks> finish ->iterate_shared file_accessed() ->update_time start transaction ilock(excl) <blocks> ..... finishes processing buffer get next buffer ilock(shared) <blocks> And that's the deadlock. Fix this by dropping the current buffer lock in process 1 before trying to map the next buffer. This means we keep the lock order of ilock -> buffer lock intact and hence will allow process 3 to make progress and drop it's ilock(shared) once it is done. Reported-by: Xiong Zhou <xzhou@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-18	Revert "btrfs: switch to ->iterate_shared()"	Al Viro
	This reverts commit 972b241f8441dc37a3f89dcd7e71d7f013873d13. Quoth Chris: didn't take the delayed inode stuff into account it got an rbtree of items and it pulls things out so in shared mode, its hugely racey sorry, lets revert and fix it for real inside of btrfs Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-18	Merge branch 'sendmsg.cifs' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull cifs iovec cleanups from Al Viro. * 'sendmsg.cifs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: cifs: don't bother with kmap on read_pages side cifs_readv_receive: use cifs_read_from_socket() cifs: no need to wank with copying and advancing iovec on recvmsg side either cifs: quit playing games with draining iovecs cifs: merge the hash calculation helpers
2016-05-18	Merge branch 'for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull remaining vfs xattr work from Al Viro: "The rest of work.xattr (non-cifs conversions)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: btrfs: Switch to generic xattr handlers ubifs: Switch to generic xattr handlers jfs: Switch to generic xattr handlers jfs: Clean up xattr name mapping gfs2: Switch to generic xattr handlers ceph: kill __ceph_removexattr() ceph: Switch to generic xattr handlers ceph: Get rid of d_find_alias in ceph_set_acl
2016-05-18	Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6	Linus Torvalds
	Pull cifs updates from Steve French: "Various small CIFS and SMB3 fixes (including some for stable)" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: remove directory incorrectly tries to set delete on close on non-empty directories Update cifs.ko version to 2.09 fs/cifs: correctly to anonymous authentication for the NTLM(v2) authentication fs/cifs: correctly to anonymous authentication for the NTLM(v1) authentication fs/cifs: correctly to anonymous authentication for the LANMAN authentication fs/cifs: correctly to anonymous authentication via NTLMSSP cifs: remove any preceding delimiter from prefix_path cifs: Use file_dentry()
2016-05-18	xfs: move reclaim tagging functions	Dave Chinner
	Rearrange the inode tagging functions so that they are higher up in xfs_cache.c and so there is no need for forward prototypes to be defined. This is purely code movement, no other change. Signed-off-by: Dave Chinner <dchinner@redhat.com>
2016-05-18	xfs: simplify inode reclaim tagging interfaces	Dave Chinner
	Inode radix tree tagging for reclaim passes a lot of unnecessary variables around. Over time the xfs-perag has grown a xfs_mount backpointer, and an internal agno so we don't need to pass other variables into the tagging functions to supply this information. Rework the functions to pass the minimal variable set required and simplify the internal logic and flow. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18	xfs: rename variables in xfs_iflush_cluster for clarity	Dave Chinner
	The cluster inode variable uses unconventional naming - iq - which makes it hard to distinguish it between the inode passed into the function - ip - and that is a vector for mistakes to be made. Rename all the cluster inode variables to use a more conventional prefixes to reduce potential future confusion (cilist, cilist_size, cip). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18	xfs: xfs_iflush_cluster has range issues	Dave Chinner
	xfs_iflush_cluster() does a gang lookup on the radix tree, meaning it can find inodes beyond the current cluster if there is sparse cache population. gang lookups return results in ascending index order, so stop trying to cluster inodes once the first inode outside the cluster mask is detected. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18	xfs: mark reclaimed inodes invalid earlier	Dave Chinner
	The last thing we do before using call_rcu() on an xfs_inode to be freed is mark it as invalid. This means there is a window between when we know for certain that the inode is going to be freed and when we do actually mark it as "freed". This is important in the context of RCU lookups - we can look up the inode, find that it is valid, and then use it as such not realising that it is in the final stages of being freed. As such, mark the inode as being invalid the moment we know it is going to be reclaimed. This can be done while we still hold the XFS_ILOCK_EXCL and the flush lock in xfs_inode_reclaim, meaning that it occurs well before we remove it from the radix tree, and that the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as synchronisation points for detecting that an inode is about to go away. For defensive purposes, this allows us to add a further check to xfs_iflush_cluster to ensure we skip inodes that are being freed after we grab the XFS_ILOCK_SHARED and the flush lock - we know that if the inode number if valid while we have these locks held we know that it has not progressed through reclaim to the point where it is clean and is about to be freed. [bfoster: fixed __xfs_inode_clear_reclaim() using ip->i_ino after it had already been zeroed.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18	xfs: xfs_inode_free() isn't RCU safe	Dave Chinner
	The xfs_inode freed in xfs_inode_free() has multiple allocated structures attached to it. We free these in xfs_inode_free() before we mark the inode as invalid, and before we run call_rcu() to queue the structure for freeing. Unfortunately, this freeing can race with other accesses that are in the RCU current grace period that have found the inode in the radix tree with a valid state. This includes xfs_iflush_cluster(), which calls xfs_inode_clean(), and that accesses the inode log item on the xfs_inode. The log item structure is freed in xfs_inode_free(), so there is the possibility we can be accessing freed memory in xfs_iflush_cluster() after validating the xfs_inode structure as being valid for this RCU context. Hence we can get spuriously incorrect clean state returned from such checks. This can lead to use thinking the inode is dirty when it is, in fact, clean, and so incorrectly attaching it to the buffer for IO and completion processing. This then leads to use-after-free situations on the xfs_inode itself if the IO completes after the current RCU grace period expires. The buffer callbacks will access the xfs_inode and try to do all sorts of things it shouldn't with freed memory. IOWs, xfs_iflush_cluster() only works correctly when racing with inode reclaim if the inode log item is present and correctly stating the inode is clean. If the inode is being freed, then reclaim has already made sure the inode is clean, and hence xfs_iflush_cluster can skip it. However, we are accessing the inode inode under RCU read lock protection and so also must ensure that all dynamically allocated memory we reference in this context is not freed until the RCU grace period expires. To fix this, move all the potential memory freeing into xfs_inode_free_callback() so that we are guarantee RCU protected lookup code will always have the memory structures it needs available during the RCU grace period that lookup races can occur in. Discovered-by: Brain Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18	xfs: optimise xfs_iext_destroy	Alex Lyakas
	When unmounting XFS, we call: xfs_inode_free => xfs_idestroy_fork => xfs_iext_destroy This goes over the whole indirection array and calls xfs_iext_irec_remove for each one of the erps (from the last one to the first one). As a result, we keep shrinking (reallocating actually) the indirection array until we shrink out all of its elements. When we have files with huge numbers of extents, umount takes 30-80 sec, depending on the amount of files that XFS loaded and the amount of indirection entries of each file. The unmount stack looks like: [<ffffffffc0b6d200>] xfs_iext_realloc_indirect+0x40/0x60 [xfs] [<ffffffffc0b6cd8e>] xfs_iext_irec_remove+0xee/0xf0 [xfs] [<ffffffffc0b6cdcd>] xfs_iext_destroy+0x3d/0xb0 [xfs] [<ffffffffc0b6cef6>] xfs_idestroy_fork+0xb6/0xf0 [xfs] [<ffffffffc0b87002>] xfs_inode_free+0xb2/0xc0 [xfs] [<ffffffffc0b87260>] xfs_reclaim_inode+0x250/0x340 [xfs] [<ffffffffc0b87583>] xfs_reclaim_inodes_ag+0x233/0x370 [xfs] [<ffffffffc0b8823d>] xfs_reclaim_inodes+0x1d/0x20 [xfs] [<ffffffffc0b96feb>] xfs_unmountfs+0x7b/0x1a0 [xfs] [<ffffffffc0b98e4d>] xfs_fs_put_super+0x2d/0x70 [xfs] [<ffffffff811e9e36>] generic_shutdown_super+0x76/0x100 [<ffffffff811ea207>] kill_block_super+0x27/0x70 [<ffffffff811ea519>] deactivate_locked_super+0x49/0x60 [<ffffffff811eaaee>] deactivate_super+0x4e/0x70 [<ffffffff81207593>] cleanup_mnt+0x43/0x90 [<ffffffff81207632>] __cleanup_mnt+0x12/0x20 [<ffffffff8108f8e7>] task_work_run+0xa7/0xe0 [<ffffffff81014ff7>] do_notify_resume+0x97/0xb0 [<ffffffff81717c6f>] int_signal+0x12/0x17 Further, this reallocation prevents us from freeing the extent list from a RCU callback as allocation can block. Hence if the extent list is in indirect format, optimise the freeing of the extent list to only use kmem_free calls by freeing entire extent buffer pages at a time, rather than extent by extent. [dchinner: simplified freeing loop based on Christoph's suggestion] Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18	xfs: skip stale inodes in xfs_iflush_cluster	Dave Chinner
	We don't write back stale inodes so we should skip them in xfs_iflush_cluster, too. cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18	xfs: fix inode validity check in xfs_iflush_cluster	Dave Chinner
	Some careless idiot() wrote crap code in commit 1a3e8f3 ("xfs: convert inode cache lookups to use RCU locking") back in late 2010, and so xfs_iflush_cluster checks the wrong inode for whether it is still valid under RCU protection. Fix it to lock and check the correct inode. () Careless-idiot: Dave Chinner <dchinner@redhat.com> cc: <stable@vger.kernel.org> # 3.10.x- Discovered-by: Brain Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>