summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-03-25ocfs2: code clean up for direct ioRyan Ding
Clean up ocfs2_file_write_iter & ocfs2_prepare_inode_for_write: * remove append dio check: it will be checked in ocfs2_direct_IO() * remove file hole check: file hole is supported for now * remove inline data check: it will be checked in ocfs2_direct_IO() * remove the full_coherence check when append dio: we will get the inode_lock in ocfs2_dio_get_block, there is no need to fall back to buffer io to ensure the coherence semantics. Now the drop dio procedure is gone. :) [akpm@linux-foundation.org: remove unused label] Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix sparse file & data ordering issue in direct ioRyan Ding
There are mainly three issues in the direct io code path after commit 24c40b329e03 ("ocfs2: implement ocfs2_direct_IO_write"): * Does not support sparse file. * Does not support data ordering. eg: when write to a file hole, it will alloc extent first. If system crashed before io finished, data will corrupt. * Potential risk when doing aio+dio. The -EIOCBQUEUED return value is likely to be ignored by ocfs2_direct_IO_write(). To resolve above problems, re-design direct io code with following ideas: * Use buffer io to fill in holes. And this will make better performance also. * Clear unwritten after direct write finished. So we can make sure meta data changes after data write to disk. (Unwritten extent is invisible to user, from user's view, meta data is not changed when allocate an unwritten extent.) * Clear ocfs2_direct_IO_write(). Do all ending work in end_io. This patch has passed fs,dio,ltp-aiodio.part1,ltp-aiodio.part2,ltp-aiodio.part4 test cases of ltp. For performance improvement, see following test result: ocfs2 cluster size 1MB, ocfs2 volume is mounted on /mnt/. The original way: + rm /mnt/test.img -f + dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 1707.83 s, 2.5 MB/s + rm /mnt/test.img -f + dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct 16384+0 records in 16384+0 records out 4294967296 bytes (4.3 GB) copied, 582.705 s, 7.4 MB/s After this patch: + rm /mnt/test.img -f + dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 64.6412 s, 66.4 MB/s + rm /mnt/test.img -f + dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct 16384+0 records in 16384+0 records out 4294967296 bytes (4.3 GB) copied, 34.7611 s, 124 MB/s Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: record UNWRITTEN extents when populate write descRyan Ding
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. There is still one issue in the direct write procedure. phase 1: alloc extent with UNWRITTEN flag phase 2: submit direct data to disk, add zero page to page cache phase 3: clear UNWRITTEN flag when data has been written to disk When there are 2 direct write A(0~3KB),B(4~7KB) writing to the same cluster 0~7KB (cluster size 8KB). Write request A arrive phase 2 first, it will zero the region (4~7KB). Before request A enter to phase 3, request B arrive phase 2, it will zero region (0~3KB). This is just like request B steps request A. To resolve this issue, we should let request B knows this cluster is already under zero, to prevent it from steps the previous write request. This patch will add function ocfs2_unwritten_check() to do this job. It will record all clusters that are under direct write(it will be recorded in the 'ip_unwritten_list' member of inode info), and prevent the later direct write writing to the same cluster to do the zero work again. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: return the physical address in ocfs2_write_clusterRyan Ding
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. Direct io needs to get the physical address from write_begin, to map the user page. This patch is to change the arg 'phys' of ocfs2_write_cluster to a pointer, so it can be retrieved to write_begin. And we can retrieve it to the direct io procedure. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: do not change i_size in write_end for direct ioRyan Ding
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. Append direct io do not change i_size in get block phase. It only move to orphan when starting write. After data is written to disk, it will delete itself from orphan and update i_size. So skip i_size change section in write_begin for direct io. And when there is no extents alloc, no meta data changes needed for direct io (since write_begin start trans for 2 reason: alloc extents & change i_size. Now none of them needed). So we can skip start trans procedure. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: test target page before change itRyan Ding
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. Direct io data will not appear in buffer. The w_target_page member will not be filled by direct io. So avoid to use it when it's NULL. Unlinke buffer io and mmap, direct io will call write_begin with more than 1 page a time. So the target_index is not sufficient to describe the actual data. change it to a range start at target_index, end in end_index. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: use c_new to indicate newly allocated extentsRyan Ding
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. There is a problem in ocfs2's direct io implement: if system crashed after extents allocated, and before data return, we will get a extent with dirty data on disk. This problem violate the journal=order semantics, which means meta changes take effect after data written to disk. To resolve this issue, direct write can use the UNWRITTEN flag to describe a extent during direct data writeback. The direct write procedure should act in the following order: phase 1: alloc extent with UNWRITTEN flag phase 2: submit direct data to disk, add zero page to page cache phase 3: clear UNWRITTEN flag when data has been written to disk This patch is to change the 'c_unwritten' member of ocfs2_write_cluster_desc to 'c_clear_unwritten'. Means whether to clear the unwritten flag. It do not care if a extent is allocated or not. And use 'c_new' to specify a newly allocated extent. So the direct io procedure can use c_clear_unwritten to control the UNWRITTEN bit on extent. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: add ocfs2_write_type_t type to identify the caller of writeRyan Ding
Patchset: fix ocfs2 direct io code patch to support sparse file and data ordering semantics The idea is to use buffer io(more precisely use the interface ocfs2_write_begin_nolock & ocfs2_write_end_nolock) to do the zero work beyond block size. And clear UNWRITTEN flag until direct io data has been written to disk, which can prevent data corruption when system crashed during direct write. And we will also archive a better performance: eg. dd direct write new file with block size 4KB: before this patchset: 2.5 MB/s after this patchset: 66.4 MB/s This patch (of 8): To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. Remove unused args filp & flags. Add new arg type. The type is one of buffer/direct/mmap. Indicate 3 way to perform write. buffer/mmap type has implemented. direct type will be implemented later. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: o2hb: fix double free bugJunxiao Bi
This is a regression issue and caused the following kernel panic when do ocfs2 multiple test. BUG: unable to handle kernel paging request at 00000002000800c0 IP: [<ffffffff81192978>] kmem_cache_alloc+0x78/0x160 PGD 7bbe5067 PUD 0 Oops: 0000 [#1] SMP Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront CPU: 2 PID: 4044 Comm: mpirun Not tainted 4.5.0-rc5-next-20160225 #1 Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014 task: ffff88007a521a80 ti: ffff88007aed0000 task.ti: ffff88007aed0000 RIP: 0010:[<ffffffff81192978>] [<ffffffff81192978>] kmem_cache_alloc+0x78/0x160 RSP: 0018:ffff88007aed3a48 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000001991 RDX: 0000000000001990 RSI: 00000000024000c0 RDI: 000000000001b330 RBP: ffff88007aed3a98 R08: ffff88007d29b330 R09: 00000002000800c0 R10: 0000000c51376d87 R11: ffff8800792cac38 R12: ffff88007cc30f00 R13: 00000000024000c0 R14: ffffffff811b053f R15: ffff88007aed3ce7 FS: 0000000000000000(0000) GS:ffff88007d280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000002000800c0 CR3: 000000007aeb2000 CR4: 00000000000406e0 Call Trace: __d_alloc+0x2f/0x1a0 d_alloc+0x17/0x80 lookup_dcache+0x8a/0xc0 path_openat+0x3c3/0x1210 do_filp_open+0x80/0xe0 do_sys_open+0x110/0x200 SyS_open+0x19/0x20 do_syscall_64+0x72/0x230 entry_SYSCALL64_slow_path+0x25/0x25 Code: 05 e6 77 e7 7e 4d 8b 08 49 8b 40 10 4d 85 c9 0f 84 dd 00 00 00 48 85 c0 0f 84 d4 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 1c 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 3c 01 75 b6 49 63 RIP kmem_cache_alloc+0x78/0x160 CR2: 00000002000800c0 ---[ end trace 823969e602e4aaac ]--- Fixes: a4a1dfa4bb8b("ocfs2/cluster: fix memory leak in o2hb_region_release") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ceph: use kmem_cache_zallocGeliang Tang
Use kmem_cache_zalloc() instead of kmem_cache_alloc() with flag GFP_ZERO. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25ceph: use lookup request to revalidate dentryYan, Zheng
If dentry has no lease, ceph_d_revalidate() previously return 0. This causes VFS to invalidate the dentry and create a new dentry for later lookup. Invalidating a dentry also detach any underneath mount points. So mount point inside cephfs can disapear mystically (even the mount point is not modified by other hosts). The fix is using lookup request to revalidate dentry without lease. This can partly solve the mount points disapear issue (as long as the mount point is not modified by other hosts) Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: kill ceph_get_dentry_parent_inode()Yan, Zheng
use vfs helper dget_parent() instead Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: fix security xattr deadlockYan, Zheng
When security is enabled, security module can call filesystem's getxattr/setxattr callbacks during d_instantiate(). For cephfs, d_instantiate() is usually called by MDS' dispatch thread, while handling MDS reply. If the MDS reply does not include xattrs and corresponding caps, getxattr/setxattr need to send a new request to MDS and waits for the reply. This makes MDS' dispatch sleep, nobody handles later MDS replies. The fix is make sure lookup/atomic_open reply include xattrs and corresponding caps. So getxattr can be handled by cached xattrs. This requires some modification to both MDS and request message. (Client tells MDS what caps it wants; MDS encodes proper caps in the reply) Smack security module may call setxattr during d_instantiate(). Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL to us. So just make setxattr return error when called by MDS' dispatch thread. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: don't request vxattrs from MDSYan, Zheng
It's uselese because MDS reply does not carry any vxattr. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: fix mounting same fs multiple timesYan, Zheng
Now __ceph_open_session() only accepts closed client. An opened client will tigger BUG_ON(). Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: remove unnecessary NULL checkYan, Zheng
If page->mapping is NULL, releasepage() callback does not get called. Remove the unnecessary NULL check to make static code analysis tool happy Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: avoid updating directory inode's i_size accidentallyYan, Zheng
Directory inode's i_size is used by readdir cache. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: fix race during filling readdir cacheYan, Zheng
Readdir cache uses page cache to save dentry pointers. When adding dentry pointers to middle of a page, we need to make sure the page already exists. Otherwise the beginning part of the page will be invalid pointers. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: kill ceph_empty_snapcIlya Dryomov
ceph_empty_snapc->num_snaps == 0 at all times. Passing such a snapc to ceph_osdc_alloc_request() (possibly through ceph_osdc_new_request()) is equivalent to passing NULL, as ceph_osdc_alloc_request() uses it only for sizing the request message. Further, in all four cases the subsequent ceph_osdc_build_request() is passed NULL for snapc, meaning that 0 is encoded for seq and num_snaps and making ceph_empty_snapc entirely useless. The two cases where it actually mattered were removed in commits 860560904962 ("ceph: avoid sending unnessesary FLUSHSNAP message") and 23078637e054 ("ceph: fix queuing inode to mdsdir's snaprealm"). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: fix a wrong comparisonAnton Protopopov
A negative value rc compared to the positive value ENOENT in the finish_read() function. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: replace CURRENT_TIME by current_fs_time()Deepa Dinamani
CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_fs_time() instead. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: scattered page writebackYan, Zheng
This patch makes ceph_writepages_start() try using single OSD request to write all dirty pages within a strip unit. When a nonconsecutive dirty page is found, ceph_writepages_start() tries starting a new write operation to existing OSD request. If it succeeds, it uses the new operation to writeback the dirty page. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: remove useless BUG_ONYan, Zheng
ceph_osdc_start_request() never return -EOLDSNAP Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: don't enable rbytes mount option by defaultYan, Zheng
When rbytes mount option is enabled, directory size is recursive size. Recursive size is not updated instantly. This can cause directory size to change between successive stat(1) Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: encode ctime in cap messageYan, Zheng
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25libceph: revamp subs code, switch to SUBSCRIBE2 protocolIlya Dryomov
It is currently hard-coded in the mon_client that mdsmap and monmap subs are continuous, while osdmap sub is always "onetime". To better handle full clusters/pools in the osd_client, we need to be able to issue continuous osdmap subs. Revamp subs code to allow us to specify for each sub whether it should be continuous or not. Although not strictly required for the above, switch to SUBSCRIBE2 protocol while at it, eliminating the ambiguity between a request for "every map since X" and a request for "just the latest" when we don't have a map yet (i.e. have epoch 0). SUBSCRIBE2 feature bit is now required - it's been supported since pre-argonaut (2010). Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling in before we validate the epoch and successfully install the new map can mess up mon_client sub state. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-24Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Final round of fixes for this merge window - some of this has come up after the initial pull request, and some of it was put in a post-merge branch before the merge window. This contains: - Fix for a bad check for an error on dma mapping in the mtip32xx driver, from Alexey Khoroshilov. - A set of fixes for lightnvm, from Javier, Matias, and Wenwei. - An NVMe completion record corruption fix from Marta, ensuring that we read things in the right order. - Two writeback fixes from Tejun, marked for stable@ as well. - A blk-mq sw queue iterator fix from Thomas, fixing an oops for sparse CPU maps. They hit this in the hot plug/unplug rework" * 'for-linus' of git://git.kernel.dk/linux-block: nvme: avoid cqe corruption when update at the same time as read writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list() blk-mq: Use proper cpumask iterator mtip32xx: fix checks for dma mapping errors lightnvm: do not load L2P table if not supported lightnvm: do not reserve lun on l2p loading nvme: lightnvm: return ppa completion status lightnvm: add a bitmap of luns lightnvm: specify target's logical address area null_blk: add lightnvm null_blk device to the nullb_list
2016-03-24Merge tag 'for-linus-20160324' of git://git.infradead.org/linux-mtdLinus Torvalds
Pull MTD updates from Brian Norris: "NAND: - Add sunxi_nand randomizer support - begin refactoring NAND ecclayout structs - fix pxa3xx_nand dmaengine usage - brcmnand: fix support for v7.1 controller - add Qualcomm NAND controller driver SPI NOR: - add new ls1021a, ls2080a support to Freescale QuadSPI - add new flash ID entries - support bottom-block protection for Winbond flash - support Status Register Write Protect - remove broken QPI support for Micron SPI flash JFFS2: - improve post-mount CRC scan efficiency General: - refactor bcm63xxpart parser, to later extend for NAND - add writebuf size parameter to mtdram Other minor code quality improvements" * tag 'for-linus-20160324' of git://git.infradead.org/linux-mtd: (72 commits) mtd: nand: remove kerneldoc for removed function parameter mtd: nand: Qualcomm NAND controller driver dt/bindings: qcom_nandc: Add DT bindings mtd: nand: don't select chip in nand_chip's block_bad op mtd: spi-nor: support lock/unlock for a few Winbond chips mtd: spi-nor: add TB (Top/Bottom) protect support mtd: spi-nor: add SPI_NOR_HAS_LOCK flag mtd: spi-nor: use BIT() for flash_info flags mtd: spi-nor: disallow further writes to SR if WP# is low mtd: spi-nor: make lock/unlock bounds checks more obvious and robust mtd: spi-nor: silently drop lock/unlock for already locked/unlocked region mtd: spi-nor: wait for SR_WIP to clear on initial unlock mtd: nand: simplify nand_bch_init() usage mtd: mtdswap: remove useless if (!mtd->ecclayout) test mtd: create an mtd_oobavail() helper and make use of it mtd: kill the ecclayout->oobavail field mtd: nand: check status before reporting timeout mtd: bcm63xxpart: give width specifier an 'int', not 'size_t' mtd: mtdram: Add parameter for setting writebuf size mtd: nand: pxa3xx_nand: kill unused field 'drcmr_cmd' ...
2016-03-24Merge tag 'upstream-4.6-rc1' of git://git.infradead.org/linux-ubifsLinus Torvalds
Pull UBI/UBIFS updates from Richard Weinberger: "This contains cleanups and a maintainer update for UBI and UBIFS" * tag 'upstream-4.6-rc1' of git://git.infradead.org/linux-ubifs: ubifs: Remove unused header MAINTAINERS: Update UBIFS entry mtd: ubi: Add logging functions ubi_msg, ubi_warn and ubi_err ubifs: Add logging functions for ubifs_msg, ubifs_err and ubifs_warn
2016-03-24Merge tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull more nfsd updates from Bruce Fields: "Apologies for the previous request, which omitted the top 8 commits from my for-next branch (including the SCSI layout commits). Thanks to Trond for spotting my error!" This actually includes the new layout types, so here's that part of the pull message repeated: "Support for a new pnfs layout type from Christoph Hellwig. The new layout type is a variant of the block layout which uses SCSI features to offer improved fencing and device identification. Note this pull request also includes the client side of SCSI layout, with Trond's permission" * tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linux: nfsd: use short read as well as i_size to set eof nfsd: better layoutupdate bounds-checking nfsd: block and scsi layout drivers need to depend on CONFIG_BLOCK nfsd: add SCSI layout support nfsd: move some blocklayout code nfsd: add a new config option for the block layout driver nfs/blocklayout: add SCSI layout support nfs4.h: add SCSI layout definitions
2016-03-24Merge branch 'misc-4.6' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6
2016-03-24Merge tag 'nfsd-4.6' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "Various bugfixes, a RDMA update from Chuck Lever, and support for a new pnfs layout type from Christoph Hellwig. The new layout type is a variant of the block layout which uses SCSI features to offer improved fencing and device identification. (Also: note this pull request also includes the client side of SCSI layout, with Trond's permission.)" * tag 'nfsd-4.6' of git://linux-nfs.org/~bfields/linux: sunrpc/cache: drop reference when sunrpc_cache_pipe_upcall() detects a race nfsd: recover: fix memory leak nfsd: fix deadlock secinfo+readdir compound nfsd4: resfh unused in nfsd4_secinfo svcrdma: Use new CQ API for RPC-over-RDMA server send CQs svcrdma: Use new CQ API for RPC-over-RDMA server receive CQs svcrdma: Remove close_out exit path svcrdma: Hook up the logic to return ERR_CHUNK svcrdma: Use correct XID in error replies svcrdma: Make RDMA_ERROR messages work rpcrdma: Add RPCRDMA_HDRLEN_ERR svcrdma: svc_rdma_post_recv() should close connection on error svcrdma: Close connection when a send error occurs nfsd: Lower NFSv4.1 callback message size limit svcrdma: Do not send Write chunk XDR pad with inline content svcrdma: Do not write xdr_buf::tail in a Write chunk svcrdma: Find client-provided write and reply chunks once per reply nfsd: Update NFS server comments related to RDMA support nfsd: Fix a memory leak when meeting unsupported state_protect_how4 nfsd4: fix bad bounds checking
2016-03-24GFS2: ignore unlock failures after withdrawBenjamin Marzinski
After gfs2 has withdrawn the filesystem, it may still have many locks not in the unlocked state. If it is using lock_dlm, it will failed trying the unlocks since it has already unmounted the lock manager. Instead, it should set the SDF_SKIP_DLM_UNLOCK flag on withdraw, to signal that it can skip the lock_manager on unlocks, and failback to lock_nolock style unlocking. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-03-23ornagefs: ensure that truncate has an up to date inode sizeMartin Brandenburg
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-23orangefs: move code which sets i_link to orangefs_inode_getattrMartin Brandenburg
Everything else setting inode->i_ values is in there. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-23orangefs: remove needless wrapper around GFP_KERNELMartin Brandenburg
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-23orangefs: remove wrapper around mutex_lock(&inode->i_mutex)Martin Brandenburg
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-23orangefs: refactor inode type or link_target change detectionMartin Brandenburg
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-23orangefs: use new getattr for revalidate and remove old getattrMartin Brandenburg
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-23orangefs: use new getattr in inode getattr and permissionMartin Brandenburg
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-23orangefs: use new orangefs_inode_getattr to get size in write and llseekMartin Brandenburg
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-23orangefs: use new orangefs_inode_getattr to create new inodesMartin Brandenburg
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-23orangefs: rename orangefs_inode_getattr to orangefs_inode_old_getattrMartin Brandenburg
This is motivated by orangefs_inode_old_getattr's habit of writing over live inodes. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-23orangefs: remove inode->i_lock wrapperMartin Brandenburg
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-23nfsd: use short read as well as i_size to set eofBenjamin Coddington
Use the result of a local read to determine when to set the eof flag. This allows us to return the location of the end of the file atomically at the time of the read. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> [bfields: add some documentation] Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-23Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "This fixes the following issues: API: - Fix kzalloc error path crash in ecryptfs added by skcipher conversion. Note the subject of the commit is screwed up and the correct subject is actually in the body. Drivers: - A number of fixes to the marvell cesa hashing code. - Remove bogus nested irqsave that clobbers the saved flags in ccp" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: marvell/cesa - forward devm_ioremap_resource() error code crypto: marvell/cesa - initialize hash states crypto: marvell/cesa - fix memory leak crypto: ccp - fix lock acquisition code eCryptfs: Use skcipher and shash
2016-03-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge third patch-bomb from Andrew Morton: - more ocfs2 changes - a few hotfixes - Andy's compat cleanups - misc fixes to fatfs, ptrace, coredump, cpumask, creds, eventfd, panic, ipmi, kgdb, profile, kfifo, ubsan, etc. - many rapidio updates: fixes, new drivers. - kcov: kernel code coverage feature. Like gcov, but not "prohibitively expensive". - extable code consolidation for various archs * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (81 commits) ia64/extable: use generic search and sort routines x86/extable: use generic search and sort routines s390/extable: use generic search and sort routines alpha/extable: use generic search and sort routines kernel/...: convert pr_warning to pr_warn drivers: dma-coherent: use memset_io for DMA_MEMORY_IO mappings drivers: dma-coherent: use MEMREMAP_WC for DMA_MEMORY_MAP memremap: add MEMREMAP_WC flag memremap: don't modify flags kernel/signal.c: add compile-time check for __ARCH_SI_PREAMBLE_SIZE mm/mprotect.c: don't imply PROT_EXEC on non-exec fs ipc/sem: make semctl setting sempid consistent ubsan: fix tree-wide -Wmaybe-uninitialized false positives kfifo: fix sparse complaints scripts/gdb: account for changes in module data structure scripts/gdb: add cmdline reader command scripts/gdb: add version command kernel: add kcov code coverage profile: hide unused functions when !CONFIG_PROC_FS hpwdt: use nmi_panic() when kernel panics in NMI handler ...
2016-03-22eventfd: document lockless access in eventfd_pollPaolo Bonzini
Since commit e22553e2a25e ("eventfd: don't take the spinlock in eventfd_poll", 2015-02-17), eventfd is reading ctx->count outside ctx->wqh.lock. However, things aren't as simple as the read barrier in eventfd_poll would suggest. In fact, the read barrier, besides lacking a comment, is not paired in any obvious manner with another read barrier, and it is pointless because it is sitting between a write (deep in poll_wait) and the read of ctx->count. The read barrier is acting just as a compiler barrier, for which we can use READ_ONCE instead. This is what the code change in this patch does. The documentation change is just as important, however. The question, posed by Andrea Arcangeli, is then why the thing is safe on architectures where spin_unlock does not imply a store-load memory barrier. The answer is that it's safe because writes of ctx->count use the same lock as poll_wait, and hence an acquire barrier implicit in poll_wait provides the necessary synchronization between eventfd_poll and callers of wake_up_locked_poll. This is sort of mentioned in the commit message with respect to eventfd_ctx_read ("eventfd_read is similar, it will do a single decrement with the lock held") but it applies to all other callers too. It's tricky enough that it should be documented in the code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Mason <clm@fb.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22fs/coredump: prevent fsuid=0 dumps into user-controlled directoriesJann Horn
This commit fixes the following security hole affecting systems where all of the following conditions are fulfilled: - The fs.suid_dumpable sysctl is set to 2. - The kernel.core_pattern sysctl's value starts with "/". (Systems where kernel.core_pattern starts with "|/" are not affected.) - Unprivileged user namespace creation is permitted. (This is true on Linux >=3.8, but some distributions disallow it by default using a distro patch.) Under these conditions, if a program executes under secure exec rules, causing it to run with the SUID_DUMP_ROOT flag, then unshares its user namespace, changes its root directory and crashes, the coredump will be written using fsuid=0 and a path derived from kernel.core_pattern - but this path is interpreted relative to the root directory of the process, allowing the attacker to control where a coredump will be written with root privileges. To fix the security issue, always interpret core_pattern for dumps that are written under SUID_DUMP_ROOT relative to the root directory of init. Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22fat: add config option to set UTF-8 mount option by defaultMaciej S. Szmigiero
FAT has long supported its own default file name encoding config setting, separate from CONFIG_NLS_DEFAULT. However, if UTF-8 encoded file names are desired FAT character set should not be set to utf8 since this would make file names case sensitive even if case insensitive matching is requested. Instead, "utf8" mount options should be provided to enable UTF-8 file names in FAT file system. Unfortunately, there was no possibility to set the default value of this option so on UTF-8 system "utf8" mount option had to be added manually to most FAT mounts. This patch adds config option to set such default value. Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>