path: root/fs
Age  Commit message  Author
2016-09-13  f2fs: fix to set PageUptodate in f2fs_write_end correctly  (Jaegeuk Kim)
Previously, f2fs_write_begin sets PageUptodate all the time. But, when a user tries to update the entire page (i.e., len == PAGE_SIZE), we need to consider that the page may only be copied partially afterwards. In such a case, we would lose the remaining region of the page. This patch fixes this by setting PageUptodate in f2fs_write_end according to the copied result. In the short-copy case, it returns zero to let generic_perform_write retry copying user data again. As a result, f2fs_write_end() works as follows:
   PageUptodate  len   copied  return  retry
1. no            4096  4096    4096    false  -> return 4096
2. no            4096  1024    0       true   -> goto #1 case
3. yes           2048  2048    2048    false  -> return 2048
4. yes           2048  1024    1024    false  -> return 1024
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
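For illustration, a minimal stand-alone sketch (plain C, hypothetical names, not the actual f2fs code) of the decision table above:

    #include <stdbool.h>

    /*
     * Hypothetical sketch of the write_end decision described above: the
     * page only becomes Uptodate after a full copy, and a short copy into
     * a not-yet-uptodate page returns 0 so generic_perform_write() retries.
     */
    static unsigned int write_end_sketch(bool *uptodate, unsigned int len,
                                         unsigned int copied)
    {
        if (!*uptodate) {
            if (copied < len)
                return 0;        /* case 2: short copy, force a retry */
            *uptodate = true;    /* case 1: the whole page is now valid */
        }
        return copied;           /* cases 1, 3 and 4 */
    }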
2016-09-13  f2fs: fix parameters of __exchange_data_block  (Fan Li)
__exchange_data_block should take block indexes as parameters instead of offsets in bytes. Signed-off-by: Fan li <fanofcode.li@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-13  f2fs: avoid ENOMEM during roll-forward recovery  (Jaegeuk Kim)
This patch gives another chance during roll-forward recovery when -ENOMEM is encountered.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-12  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
Conflicts:
    drivers/net/ethernet/mediatek/mtk_eth_soc.c
    drivers/net/ethernet/qlogic/qed/qed_dcbx.c
    drivers/net/phy/Kconfig
All conflicts were cases of overlapping commits.
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-12  Merge tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs  (Linus Torvalds)
Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:
  Stable patches:
   - We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct state accounting
   - Fix the CREATE_SESSION slot number
  Bugfixes:
   - sunrpc: fix a UDP memory accounting regression
   - NFS: Fix an error reporting regression in nfs_file_write()
   - pNFS: Fix further layout stateid issues
   - RPC/rdma: Revert 3d4cf35bd4fa ("xprtrdma: Reply buffer exhaustion...")
   - RPC/rdma: Fix receive buffer accounting"
* tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4.1: Fix the CREATE_SESSION slot number accounting
  xprtrdma: Fix receive buffer accounting
  xprtrdma: Revert 3d4cf35bd4fa ("xprtrdma: Reply buffer exhaustion...")
  pNFS: Don't forget the layout stateid if there are outstanding LAYOUTGETs
  pNFS: Clear out all layout segments if the server unsets lrp->res.lrs_present
  pNFS: Fix pnfs_set_layout_stateid() to clear NFS_LAYOUT_INVALID_STID
  pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised
  NFS: Fix error reporting in nfs_file_write()
  sunrpc: fix UDP memory accounting
2016-09-12  f2fs: add common iget in add_fsync_inode  (Jaegeuk Kim)
There is no functional change. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-12  f2fs: check free_sections for defragmentation  (Jaegeuk Kim)
Fix wrong condition check for defragmentation of a file. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-12  f2fs: forbid to do fstrim if fs has some error  (Yunlei He)
This patch skips fstrim if the sbi has the SBI_NEED_FSCK flag set.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-12  f2fs: avoid page allocation for truncating partial inline_data  (Jaegeuk Kim)
When truncating cached inline_data, we don't need to allocate a new page all the time. Instead, it only needs to check the page cache.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-11  NFSv4.1: Fix the CREATE_SESSION slot number accounting  (Trond Myklebust)
Ensure that we conform to the algorithm described in RFC5661, section 18.36.4 for when to bump the sequence id. In essence we do it for all cases except when the RPC call timed out, or in case of the server returning NFS4ERR_DELAY or NFS4ERR_STALE_CLIENTID. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org
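As a rough illustration of the rule above (not the actual NFSv4.1 client code; the helper name and error handling are simplified), the decision reduces to:

    #include <linux/types.h>
    #include <linux/errno.h>
    #include <linux/nfs4.h>

    /* Bump the CREATE_SESSION slot sequence id except on RPC timeout or
     * when the server asks for a retry with the same id. Illustrative only.
     */
    static bool create_session_bump_seqid(int rpc_status, u32 nfs_status)
    {
        if (rpc_status == -ETIMEDOUT)
            return false;
        if (nfs_status == NFS4ERR_DELAY || nfs_status == NFS4ERR_STALE_CLIENTID)
            return false;
        return true;
    }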
2016-09-10  Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm  (Linus Torvalds)
Pull libnvdimm fixes from Dan Williams:
 "nvdimm fixes for v4.8, two of them are tagged for -stable:
  - Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise, DAX pmd mappings end up with an uncached pgprot, and unusable performance for the device-dax interface. The device-dax interface appeared in 4.7 so this is tagged for -stable.
  - Fix a couple VM_BUG_ON() checks in the show_smaps() path to understand DAX pmd entries. This fix is tagged for -stable.
  - Fix a mis-merge of the nfit machine-check handler to flip the polarity of an if() to match the final version of the patch that Vishal sent for 4.8-rc1. Without this the nfit machine check handler never detects / inserts new 'badblocks' entries which applications use to identify lost portions of files.
  - For test purposes, fix the nvdimm_clear_poison() path to operate on legacy / simulated nvdimm memory ranges. Without this fix a test can set badblocks, but never clear them on these ranges.
  - Fix the range checking done by dax_dev_pmd_fault(). This is not tagged for -stable since this problem is mitigated by specifying aligned resources at device-dax setup time.
  These patches have appeared in a next release over the past week. The recent rebase you can see in the timestamps was to drop an invalid fix as identified by the updated device-dax unit tests [1]. The -mm touches have an ack from Andrew"
 [1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs" https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm: allow legacy (e820) pmem region to clear bad blocks
  nfit, mce: Fix SPA matching logic in MCE handler
  mm: fix cache mode of dax pmd mappings
  mm: fix show_smap() for zone_device-pmd ranges
  dax: fix mapping size check
2016-09-10  Merge tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4  (Linus Torvalds)
Pull fscrypto fixes from Ted Ts'o:
 "Fix some brown-paper-bag bugs for fscrypto, including one which allows a malicious user to set an encryption policy on an empty directory which they do not own"
* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  fscrypto: require write access to mount to set encryption policy
  fscrypto: only allow setting encryption policy on directories
  fscrypto: add authorization check for setting encryption policy
2016-09-10  fscrypto: require write access to mount to set encryption policy  (Eric Biggers)
Since setting an encryption policy requires writing metadata to the filesystem, it should be guarded by mnt_want_write/mnt_drop_write. Otherwise, a user could cause a write to a frozen or readonly filesystem. This was handled correctly by f2fs but not by ext4. Make fscrypt_process_policy() handle it rather than relying on the filesystem to get it right. Signed-off-by: Eric Biggers <ebiggers@google.com> Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs} Signed-off-by: Theodore Ts'o <tytso@mit.edu> Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
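A minimal sketch of the guard described above, assuming a hypothetical set_policy() helper standing in for the filesystem-independent policy work done by fscrypt_process_policy():

    #include <linux/fs.h>
    #include <linux/mount.h>

    static int set_policy(struct file *filp, const void __user *arg); /* hypothetical */

    /* Refuse to write to frozen or read-only mounts while setting a policy. */
    static int set_policy_with_write_access(struct file *filp, const void __user *arg)
    {
        int err = mnt_want_write_file(filp);

        if (err)
            return err;
        err = set_policy(filp, arg);
        mnt_drop_write_file(filp);
        return err;
    }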
2016-09-09  Move check for prefix path to within cifs_get_root()  (Sachin Prabhu)
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Tested-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com>
2016-09-09  Compare prepaths when comparing superblocks  (Sachin Prabhu)
The patch "fs/cifs: make share unaccessible at root level mountable" makes use of prepaths when any component of the underlying path is inaccessible. When mounting 2 separate shares which have different prepaths but are otherwise similar, we end up sharing superblocks when we shouldn't be doing so.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-09-09  Fix memory leaks in cifs_do_mount()  (Sachin Prabhu)
Fix memory leaks introduced by the patch "fs/cifs: make share unaccessible at root level mountable". Also move allocation of cifs_sb->prepath to cifs_setup_cifs_sb().
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-09-09  fscrypto: only allow setting encryption policy on directories  (Eric Biggers)
The FS_IOC_SET_ENCRYPTION_POLICY ioctl allowed setting an encryption policy on nondirectory files. This was unintentional, and in the case of nonempty regular files did not behave as expected because existing data was not actually encrypted by the ioctl. In the case of ext4, the user could also trigger filesystem errors in ->empty_dir(), e.g. due to mismatched "directory" checksums when the kernel incorrectly tried to interpret a regular file as a directory. This bug affected ext4 with kernels v4.8-rc1 or later and f2fs with kernels v4.6 and later. It appears that older kernels only permitted directories and that the check was accidentally lost during the refactoring to share the file encryption code between ext4 and f2fs. This patch restores the !S_ISDIR() check that was present in older kernels. Signed-off-by: Eric Biggers <ebiggers@google.com> Cc: stable@vger.kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
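A minimal sketch of the restored check (the helper name and exact error code are illustrative, not the literal upstream hunk):

    #include <linux/fs.h>
    #include <linux/errno.h>

    /* Only directories may have an encryption policy set on them. */
    static int check_policy_target(struct inode *inode)
    {
        if (!S_ISDIR(inode->i_mode))
            return -EINVAL;
        return 0;
    }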
2016-09-09  fscrypto: add authorization check for setting encryption policy  (Eric Biggers)
On an ext4 or f2fs filesystem with file encryption supported, a user could set an encryption policy on any empty directory(*) to which they had readonly access. This is obviously problematic, since such a directory might be owned by another user and the new encryption policy would prevent that other user from creating files in their own directory (for example). Fix this by requiring inode_owner_or_capable() permission to set an encryption policy. This means that either the caller must own the file, or the caller must have the capability CAP_FOWNER. (*) Or also on any regular file, for f2fs v4.6 and later and ext4 v4.8-rc1 and later; a separate bug fix is coming for that. Signed-off-by: Eric Biggers <ebiggers@google.com> Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs} Signed-off-by: Theodore Ts'o <tytso@mit.edu>
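A minimal sketch of the added authorization check (helper name and error code are illustrative):

    #include <linux/fs.h>
    #include <linux/errno.h>

    /* The caller must own the inode or hold CAP_FOWNER to set a policy. */
    static int may_set_encryption_policy(struct inode *inode)
    {
        if (!inode_owner_or_capable(inode))
            return -EACCES;
        return 0;
    }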
2016-09-09  mm: fix show_smap() for zone_device-pmd ranges  (Dan Williams)
Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings currently results in the following VM_BUG_ONs: kernel BUG at mm/huge_memory.c:1105! task: ffff88045f16b140 task.stack: ffff88045be14000 RIP: 0010:[<ffffffff81268f9b>] [<ffffffff81268f9b>] follow_trans_huge_pmd+0x2cb/0x340 [..] Call Trace: [<ffffffff81306030>] smaps_pte_range+0xa0/0x4b0 [<ffffffff814c2755>] ? vsnprintf+0x255/0x4c0 [<ffffffff8123c46e>] __walk_page_range+0x1fe/0x4d0 [<ffffffff8123c8a2>] walk_page_vma+0x62/0x80 [<ffffffff81307656>] show_smap+0xa6/0x2b0 kernel BUG at fs/proc/task_mmu.c:585! RIP: 0010:[<ffffffff81306469>] [<ffffffff81306469>] smaps_pte_range+0x499/0x4b0 Call Trace: [<ffffffff814c2795>] ? vsnprintf+0x255/0x4c0 [<ffffffff8123c46e>] __walk_page_range+0x1fe/0x4d0 [<ffffffff8123c8a2>] walk_page_vma+0x62/0x80 [<ffffffff81307696>] show_smap+0xa6/0x2b0 These locations are sanity checking page flags that must be set for an anonymous transparent huge page, but are not set for the zone_device pages associated with dax mappings. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-09  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse  (Linus Torvalds)
Pull fuse fix from Miklos Szeredi:
 "This fixes a deadlock when fuse, direct I/O and loop device are combined"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: direct-io: don't dirty ITER_BVEC pages
2016-09-09  Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs  (Linus Torvalds)
Pull overlayfs fix from Miklos Szeredi:
 "This fixes a regression caused by the last pull request"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix workdir creation
2016-09-09  Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs  (Linus Torvalds)
Pull btrfs fixes from Chris Mason:
 "I'm not proud of how long it took me to track down that one liner in btrfs_sync_log(), but the good news is the patches I was trying to blame for these problems were actually fine (sorry Filipe)"
* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
  btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns
  btrfs: do not decrease bytes_may_use when replaying extents
2016-09-09  fs/efivarfs: Fix double kfree() in error path  (Matt Fleming)
Julia reported that we may double free 'name' in efivarfs_callback(), and that this bug was introduced by commit 0d22f33bc37c ("efi: Don't use spinlocks for efi vars"). Move one of the kfree()s until after the point at which we know we are definitely on the success path. Reported-by: Julia Lawall <julia.lawall@lip6.fr> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Sylvain Chouleur <sylvain.chouleur@gmail.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-09  efi: Don't use spinlocks for efi vars  (Sylvain Chouleur)
All efivars operations are protected by a spinlock which prevents interruptions and preemption. This is too restrictive; we just need a lock preventing concurrency. The idea is to use a semaphore of count 1 and to have two ways of locking, depending on the context:
 - In interrupt context, we call down_trylock(); if it fails we return an error.
 - In normal context, we call down_interruptible().
We don't use a mutex here because the mutex_trylock() function must not be called from interrupt context, whereas down_trylock() can be.
Signed-off-by: Sylvain Chouleur <sylvain.chouleur@intel.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Sylvain Chouleur <sylvain.chouleur@gmail.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
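A minimal sketch of this locking scheme, with illustrative names; treat it as an outline rather than the actual efivars code:

    #include <linux/semaphore.h>
    #include <linux/hardirq.h>    /* in_interrupt() */
    #include <linux/errno.h>

    static DEFINE_SEMAPHORE(efivars_lock);    /* count of 1: a sleepable lock */

    /* Trylock when we cannot sleep, interruptible down otherwise. */
    static int efivars_lock_acquire(void)
    {
        if (in_interrupt())
            return down_trylock(&efivars_lock) ? -EBUSY : 0;
        return down_interruptible(&efivars_lock);
    }

    static void efivars_lock_release(void)
    {
        up(&efivars_lock);
    }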
2016-09-08  ramoops: move spin_lock_init after kmalloc error checking  (Geliang Tang)
If the allocation of cxt->pstore.buf fails, there is no need to initialize cxt->pstore.buf_lock. So this patch moves spin_lock_init() after the error checking.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-09-08  pstore/ram: Use memcpy_fromio() to save old buffer  (Andrew Bresticker)
The ramoops buffer may be mapped as either I/O memory or uncached memory. On ARM64, this results in a device-type (strongly-ordered) mapping. Since unaligned accesses to device-type memory will generate an alignment fault (regardless of whether or not strict alignment checking is enabled), it is not safe to use memcpy(). memcpy_fromio() is guaranteed to only use aligned accesses, so use that instead.
Signed-off-by: Andrew Bresticker <abrestic@chromium.org>
Signed-off-by: Enric Balletbo Serra <enric.balletbo@collabora.com>
Reviewed-by: Puneet Kumar <puneetster@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
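For illustration, the copy direction this commit changes, sketched with a hypothetical helper (not the ramoops code itself):

    #include <linux/io.h>
    #include <linux/types.h>

    /* Copy a record out of an I/O-mapped ramoops buffer using only aligned
     * accesses, which is what memcpy_fromio() guarantees. */
    static void save_old_record(void *dst, const void __iomem *src, size_t size)
    {
        memcpy_fromio(dst, src, size);
    }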
2016-09-08  pstore/ram: Use memcpy_toio instead of memcpy  (Furquan Shaikh)
persistent_ram_update uses vmap / iomap based on whether the buffer is in memory region or reserved region. However, both map it as non-cacheable memory. For armv8 specifically, non-cacheable mapping requests use a memory type that has to be accessed aligned to the request size. memcpy() doesn't guarantee that. Signed-off-by: Furquan Shaikh <furquan@google.com> Signed-off-by: Enric Balletbo Serra <enric.balletbo@collabora.com> Reviewed-by: Aaron Durbin <adurbin@chromium.org> Reviewed-by: Olof Johansson <olofj@chromium.org> Tested-by: Furquan Shaikh <furquan@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org
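And the write direction from this companion change, again as a hypothetical sketch rather than the actual persistent_ram_update():

    #include <linux/io.h>
    #include <linux/types.h>

    /* Write into the uncached/iomapped persistent_ram buffer with aligned
     * accesses, as armv8 device-type mappings require. */
    static void update_record(void __iomem *dst, const void *src, size_t size)
    {
        memcpy_toio(dst, src, size);
    }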
2016-09-08  pstore/pmsg: drop bounce buffer  (Mark Salyzyn)
Removing a bounce-buffer copy operation in the pmsg driver path is always better. We also gain in overall performance by not requesting a vmalloc on every write, as this can cause precious RT tasks, such as user-facing media operations, to stall while memory is being reclaimed. Added a write_buf_user to the pstore functions, a backup platform write_buf_user that uses the small buffer that is part of the instance, and implemented a ramoops write_buf_user that only supports PSTORE_TYPE_PMSG.
Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-09-08  pstore/ram: Set pstore flags dynamically  (Namhyung Kim)
The ramoops can be configured to enable each pstore type by setting their size. In that case, it'd be better not to register disabled types in the first place. Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Kees Cook <keescook@chromium.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org>
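A sketch of the idea, as a fragment in the style of ramoops_probe(); it uses the ramoops fields and the per-type flags introduced by the following patch, and is illustrative rather than the exact upstream hunk:

    /* Only advertise pstore front-ends whose buffers were actually sized. */
    cxt->pstore.flags = PSTORE_FLAGS_DMESG;
    if (cxt->console_size)
        cxt->pstore.flags |= PSTORE_FLAGS_CONSOLE;
    if (cxt->ftrace_size)
        cxt->pstore.flags |= PSTORE_FLAGS_FTRACE;
    if (cxt->pmsg_size)
        cxt->pstore.flags |= PSTORE_FLAGS_PMSG;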
2016-09-08  pstore: Split pstore fragile flags  (Namhyung Kim)
This patch adds new PSTORE_FLAGS for each pstore type so that they can be enabled separately. This is a preparation for ongoing virtio-pstore work to support those types flexibly. The PSTORE_FLAGS_FRAGILE is changed to PSTORE_FLAGS_DMESG to preserve the original behavior. Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Kees Cook <keescook@chromium.org> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: linux-acpi@vger.kernel.org Cc: linux-efi@vger.kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> [kees: retained "FRAGILE" for now to make merges easier] Signed-off-by: Kees Cook <keescook@chromium.org>
2016-09-08  pstore/core: drop cmpxchg based updates  (Sebastian Andrzej Siewior)
I have here an FPGA behind PCIe which exports SRAM which I use for pstore. Now it seems that the FPGA no longer supports cmpxchg based updates and writes back 0xff…ff and returns the same. This leads to a crash during a crash, rendering pstore useless. Since I doubt that there is much benefit from using cmpxchg() here, I am dropping this atomic access and using the spinlock based version.
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Rabin Vincent <rabinv@axis.com>
Tested-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
[kees: remove "_locked" suffix since it's the only option now]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
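A simplified sketch of the spinlock-based start-pointer update that replaces the cmpxchg() loop (modelled loosely on persistent_ram's buffer handling, not the exact upstream code):

    #include <linux/spinlock.h>
    #include <linux/atomic.h>

    static DEFINE_RAW_SPINLOCK(buffer_lock);

    /* Advance the circular-buffer start by 'a' under a spinlock and return
     * the old position; works even on memory that cannot service cmpxchg. */
    static int buffer_start_add(atomic_t *start, int buffer_size, int a)
    {
        unsigned long flags;
        int old, new;

        raw_spin_lock_irqsave(&buffer_lock, flags);
        old = atomic_read(start);
        new = old + a;
        while (new >= buffer_size)
            new -= buffer_size;
        atomic_set(start, new);
        raw_spin_unlock_irqrestore(&buffer_lock, flags);

        return old;
    }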
2016-09-08  pstore/ramoops: fixup driver removal  (Sebastian Andrzej Siewior)
A basic rmmod ramoops segfaults. Let's see why. Commit 34f0ec82e0a9 ("pstore: Correct the max_dump_cnt clearing of ramoops") sets ->max_dump_cnt to zero before looping over ->przs, but we didn't use it before that either. And since commit ee1d267423a1 ("pstore: add pstore unregister") we free that memory on rmmod. But even then, we looped until a NULL pointer or ERR. I don't see where it is ensured that the last member is NULL. Let's try this instead: simplify error recovery and freeing. Clean up in the error case where resources were allocated, and then, in the free path, rely on ->max_dump_cnt.
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org # 4.4.x-
2016-09-08  rxrpc: Rewrite the data and ack handling code  (David Howells)
Rewrite the data and ack handling code such that:
(1) Parsing of received ACK and ABORT packets and the distribution and the filing of DATA packets happens entirely within the data_ready context called from the UDP socket. This allows us to process and discard ACK and ABORT packets much more quickly (they're no longer stashed on a queue for a background thread to process).
(2) We avoid calling skb_clone(), pskb_pull() and pskb_trim(). We instead keep track of the offset and length of the content of each packet in the sk_buff metadata. This means we don't do any allocation in the receive path.
(3) Jumbo DATA packet parsing is now done in data_ready context. Rather than cloning the packet once for each subpacket and pulling/trimming it, we file the packet multiple times with an annotation for each indicating which subpacket is there. From that we can directly calculate the offset and length.
(4) A call's receive queue can be accessed without taking locks (memory barriers do have to be used, though).
(5) Incoming calls are set up from preallocated resources and immediately made live. They can then have packets queued upon them and ACKs generated. If insufficient resources exist, DATA packet #1 is given a BUSY reply and other DATA packets are discarded.
(6) sk_buffs no longer take a ref on their parent call.
To make this work, the following changes are made:
(1) Each call's receive buffer is now a circular buffer of sk_buff pointers (rxtx_buffer) rather than a number of sk_buff_heads spread between the call and the socket. This permits each sk_buff to be in the buffer multiple times. The receive buffer is reused for the transmit buffer.
(2) A circular buffer of annotations (rxtx_annotations) is kept parallel to the data buffer. Transmission phase annotations indicate whether a buffered packet has been ACK'd or not and whether it needs retransmission. Receive phase annotations indicate whether a slot holds a whole packet or a jumbo subpacket and, if the latter, which subpacket. They also note whether the packet has been decrypted in place.
(3) DATA packet window tracking is much simplified. Each phase has just two numbers representing the window (rx_hard_ack/rx_top and tx_hard_ack/tx_top). The hard_ack number is the sequence number before the base of the window, representing the last packet the other side says it has consumed. hard_ack starts from 0 and the first packet is sequence number 1. The top number is the sequence number of the highest-numbered packet residing in the buffer. Packets between hard_ack+1 and top are soft-ACK'd to indicate they've been received, but not yet consumed. Four macros, before(), before_eq(), after() and after_eq(), are added to compare sequence numbers within the window. This allows for the top of the window to wrap when the hard-ack sequence number gets close to the limit. Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are also added to indicate when rx_top and tx_top point at the packets with the LAST_PACKET bit set, indicating the end of the phase.
(4) Calls are queued on the socket 'receive queue' rather than packets. This means that we don't have to invent dummy packets to queue to indicate abnormal/terminal states and we don't have to keep metadata packets (such as ABORTs) around.
(5) The offset and length of a (sub)packet's content are now passed to the verify_packet security op. This is currently expected to decrypt the packet in place and validate it.
However, there's now nowhere to store the revised offset and length of the actual data within the decrypted blob (there may be a header and padding to skip) because an sk_buff may represent multiple packets, so a locate_data security op is added to retrieve these details from the sk_buff content when needed. (6) recvmsg() now has to handle jumbo subpackets, where each subpacket is individually secured and needs to be individually decrypted. The code to do this is broken out into rxrpc_recvmsg_data() and shared with the kernel API. It now iterates over the call's receive buffer rather than walking the socket receive queue. Additional changes: (1) The timers are condensed to a single timer that is set for the soonest of three timeouts (delayed ACK generation, DATA retransmission and call lifespan). (2) Transmission of ACK and ABORT packets is effected immediately from process-context socket ops/kernel API calls that cause them instead of them being punted off to a background work item. The data_ready handler still has to defer to the background, though. (3) A shutdown op is added to the AF_RXRPC socket so that the AFS filesystem can shut down the socket and flush its own work items before closing the socket to deal with any in-progress service calls. Future additional changes that will need to be considered: (1) Make sure that a call doesn't hog the front of the queue by receiving data from the network as fast as userspace is consuming it to the exclusion of other calls. (2) Transmit delayed ACKs from within recvmsg() when we've consumed sufficiently more packets to avoid the background work item needing to run. Signed-off-by: David Howells <dhowells@redhat.com>
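For reference, wrap-safe sequence comparisons of the kind the window tracking in point (3) above relies on can be sketched in plain C as follows; these are illustrative, not the actual rxrpc definitions:

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t rxrpc_seq_t;

    /* Valid as long as the two sequence numbers are within 2^31 of each
     * other, which a bounded window guarantees; comparisons keep working
     * when the counter wraps. */
    static inline bool seq_before(rxrpc_seq_t a, rxrpc_seq_t b)
    {
        return (int32_t)(a - b) < 0;
    }

    static inline bool seq_after(rxrpc_seq_t a, rxrpc_seq_t b)
    {
        return (int32_t)(b - a) < 0;
    }

    static inline bool seq_before_eq(rxrpc_seq_t a, rxrpc_seq_t b)
    {
        return !seq_after(a, b);
    }

    static inline bool seq_after_eq(rxrpc_seq_t a, rxrpc_seq_t b)
    {
        return !seq_before(a, b);
    }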
2016-09-08  rxrpc: Preallocate peers, conns and calls for incoming service requests  (David Howells)
Make it possible for the data_ready handler called from the UDP transport socket to completely instantiate an rxrpc_call structure and make it immediately live by preallocating all the memory it might need. The idea is to cut out the background thread usage as much as possible. [Note that the preallocated structs are not actually used in this patch - that will be done in a future patch.] If insufficient resources are available in the preallocation buffers, it will be possible to discard the DATA packet in the data_ready handler or schedule a BUSY packet without the need to schedule an attempt at allocation in a background thread. To this end:
(1) Preallocate rxrpc_peer, rxrpc_connection and rxrpc_call structs to a maximum number each of the listen backlog size. The backlog size is limited to a maximum of 32. Only this many of each can be in the preallocation buffer.
(2) For userspace sockets, the preallocation is charged initially by listen() and will be recharged by accepting or rejecting pending new incoming calls.
(3) For kernel services, {,re,dis}charging of the preallocation buffers is handled manually. Two notifier callbacks have to be provided before kernel_listen() is invoked:
    (a) An indication that a new call has been instantiated. This can be used to trigger background recharging.
    (b) An indication that a call is being discarded. This is used when the socket is being released.
    A function, rxrpc_kernel_charge_accept(), is called by the kernel service to preallocate a single call. It should be passed the user ID to be used for that call and a callback to associate the rxrpc call with the kernel service's side of the ID.
(4) Discard the preallocation when the socket is closed.
(5) Temporarily bump the refcount on the call allocated in rxrpc_incoming_call() so that rxrpc_release_call() can ditch the preallocation ref on service calls unconditionally. This will no longer be necessary once the preallocation is used.
Note that this does not yet control the number of active service calls on a client - that will come in a later patch.
A future development would be to provide a setsockopt() call that allows a userspace server to manually charge the preallocation buffer. This would allow user call IDs to be provided in advance and the awkward manual accept stage to be bypassed.
Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07  f2fs: no need to make zeros beyond i_size  (Jaegeuk Kim)
We don't need to make zeros beyond i_size, since we already wrote that through the NEW_ADDR case.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07  f2fs: fix to detect temporary name of multimedia file  (Chao Yu)
Some applications may create a multimedia file with a temporary name like '*.jpg.tmp' or '*.mp4.tmp', then rename it to '*.jpg' or '*.mp4'. Currently, f2fs can only detect multimedia filenames with the format "filename + '.' + extension", so it will miss detecting multimedia files with such special temporary names, resulting in failure to set the cold flag on the file. This patch enhances the detection flow to enable lookup of the extension in the middle of a temporary filename.
Reported-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
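A rough, stand-alone sketch of the relaxed matching (plain C, hypothetical helper; not the f2fs implementation): accept the extension either at the end of the name or followed by another '.' in a temporary name such as "movie.mp4.tmp".

    #include <stdbool.h>
    #include <string.h>

    static bool name_has_extension(const char *name, const char *ext)
    {
        size_t extlen = strlen(ext);
        const char *p = name;

        while ((p = strstr(p, ext)) != NULL) {
            /* must be preceded by '.' ... */
            if (p > name && p[-1] == '.') {
                /* ... and followed by end-of-name or another '.' */
                if (p[extlen] == '\0' || p[extlen] == '.')
                    return true;
            }
            p += 1;
        }
        return false;
    }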
2016-09-07  f2fs: fix minor typo  (Chao Yu)
Correct typo from 'destory' to 'destroy'. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07  f2fs: set dentry bits on random location in memory  (Jaegeuk Kim)
This fixes a pointer panic when using inline_dentry, which was triggered when backporting to 3.10.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07  f2fs: fix to set superblock dirty correctly  (Chao Yu)
tests/generic/251 of fstest suit complains us with below message: ------------[ cut here ]------------ invalid opcode: 0000 [#1] PREEMPT SMP CPU: 2 PID: 7698 Comm: fstrim Tainted: G O 4.7.0+ #21 task: e9f4e000 task.stack: e7262000 EIP: 0060:[<f89fcefe>] EFLAGS: 00010202 CPU: 2 EIP is at write_checkpoint+0xfde/0x1020 [f2fs] EAX: f33eb300 EBX: eecac310 ECX: 00000001 EDX: ffff0001 ESI: eecac000 EDI: eecac5f0 EBP: e7263dec ESP: e7263d18 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 CR0: 80050033 CR2: b76ab01c CR3: 2eb89de0 CR4: 000406f0 Stack: 00000001 a220fb7b e9f4e000 00000002 419ff2d3 b3a05151 00000002 e9f4e5d8 e9f4e000 419ff2d3 b3a05151 eecac310 c10b8154 b3a05151 419ff2d3 c10b78bd e9f4e000 e9f4e000 e9f4e5d8 00000001 e9f4e000 ec409000 eecac2cc eecac288 Call Trace: [<c10b8154>] ? __lock_acquire+0x3c4/0x760 [<c10b78bd>] ? mark_held_locks+0x5d/0x80 [<f8a10632>] f2fs_trim_fs+0x1c2/0x2e0 [f2fs] [<f89e9f56>] f2fs_ioctl+0x6b6/0x10b0 [f2fs] [<c13d51df>] ? __this_cpu_preempt_check+0xf/0x20 [<c10b4281>] ? trace_hardirqs_off_caller+0x91/0x120 [<f89e98a0>] ? __exchange_data_block+0xd30/0xd30 [f2fs] [<c120b2e1>] do_vfs_ioctl+0x81/0x7f0 [<c11d57c5>] ? kmem_cache_free+0x245/0x2e0 [<c1217840>] ? get_unused_fd_flags+0x40/0x40 [<c1206eec>] ? putname+0x4c/0x50 [<c11f631e>] ? do_sys_open+0x16e/0x1d0 [<c1001990>] ? do_fast_syscall_32+0x30/0x1c0 [<c13d51df>] ? __this_cpu_preempt_check+0xf/0x20 [<c120baa8>] SyS_ioctl+0x58/0x80 [<c1001a01>] do_fast_syscall_32+0xa1/0x1c0 [<c178cc54>] sysenter_past_esp+0x45/0x74 EIP: [<f89fcefe>] write_checkpoint+0xfde/0x1020 [f2fs] SS:ESP 0068:e7263d18 ---[ end trace 4de95d7e6b3aa7c6 ]--- The reason is: with below call stack, we will encounter BUG_ON during doing fstrim. Thread A Thread B - write_checkpoint - do_checkpoint - f2fs_write_inode - update_inode_page - update_inode - set_page_dirty - f2fs_set_node_page_dirty - inc_page_count - percpu_counter_inc - set_sbi_flag(SBI_IS_DIRTY) - clear_sbi_flag(SBI_IS_DIRTY) Thread C Thread D - f2fs_write_node_page - set_node_addr - __set_nat_cache_dirty - nm_i->dirty_nat_cnt++ - do_vfs_ioctl - f2fs_ioctl - f2fs_trim_fs - write_checkpoint - f2fs_bug_on(nm_i->dirty_nat_cnt) Fix it by setting superblock dirty correctly in do_checkpoint and f2fs_write_node_page. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07  f2fs: add roll-forward recovery process for encrypted dentry  (Shuoran Liu)
Add a roll-forward recovery process for encrypted dentries, so the first fsync issued to an encrypted file does not need to write a checkpoint. This improves the performance of the following test with thousands of small files: open -> write -> fsync -> close
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: modify kernel message to show encrypted names]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07  f2fs: fix lost xattrs of directories  (Jaegeuk Kim)
This patch enhances the xattr consistency of dirs against sudden power cuts. A possible scenario would be:
1. dir->setxattr used by per-file encryption
2. file->setxattr goes into inline_xattr
3. file->fsync
In that case, we should do checkpoint for #1. Otherwise we'd lose the dir's key information for the file given #2.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07  f2fs: support async discard  (Chao Yu)
Like most filesystems, f2fs issues discard commands synchronously, so when the user triggers fstrim through ioctl, multiple discard commands are issued serially in sync mode, which results in poor performance. In this patch we try to support async discard, so that all discard commands can be issued first and their completions waited for in batch, improving performance.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07  f2fs: set encryption name flag in add inline entry path  (Shuoran Liu)
This patch sets encryption name flag in the add inline entry path if filename is encrypted. Signed-off-by: Shuoran Liu <liushuoran@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07  f2fs crypto: avoid unneeded memory allocation in ->readdir  (Chao Yu)
When decrypting dirents in ->readdir, fscrypt_fname_disk_to_usr won't change the content of the original encrypted dirent, so we don't need to allocate an additional buffer for storing a mirror of it; get rid of it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07  f2fs: fix to do security initialization of encrypted inode with original filename  (Chao Yu)
When creating a new inode, security_inode_init_security will be called to initialize security info related to the inode, and the filename is passed to the security module. This helps security modules such as SELinux to know which rule or label should be applied for the inode with the specified name. Previously, if the new inode was created as an encrypted one, f2fs would pass the encrypted filename to the security module, which may fail the security policy check for the inode. So in order to fix this issue, alter to pass the original unencrypted filename instead.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07  f2fs: do in batch synchronously readahead during GC  (Chao Yu)
In order to enhance performance, we try to readahead node page during GC, but before loading node page we should get block address of node page which is stored in NAT table, so synchronously read of single NAT page block our readahead flow. f2fs_submit_page_bio: dev = (251,0), ino = 2, page_index = 0xa1e, oldaddr = 0xa1e, newaddr = 0xa1e, rw = READ_SYNC(MP), type = META f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x35e9, oldaddr = 0x72d7a, newaddr = 0x72d7a, rw = READAHEAD ^H, type = NODE f2fs_submit_page_bio: dev = (251,0), ino = 2, page_index = 0xc1f, oldaddr = 0xc1f, newaddr = 0xc1f, rw = READ_SYNC(MP), type = META f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x389d, oldaddr = 0x72d7d, newaddr = 0x72d7d, rw = READAHEAD ^H, type = NODE f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x3a82, oldaddr = 0x72d7f, newaddr = 0x72d7f, rw = READAHEAD ^H, type = NODE f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x3bfa, oldaddr = 0x72d86, newaddr = 0x72d86, rw = READAHEAD ^H, type = NODE This patch adds one phase that do readahead NAT pages in batch before readahead node page for more effeciently. f2fs_submit_page_bio: dev = (251,0), ino = 2, page_index = 0x1952, oldaddr = 0x1952, newaddr = 0x1952, rw = READ_SYNC(MP), type = META f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc34, oldaddr = 0xc34, newaddr = 0xc34, rw = READ_SYNC(MP), type = META f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xa33, oldaddr = 0xa33, newaddr = 0xa33, rw = READ_SYNC(MP), type = META f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc30, oldaddr = 0xc30, newaddr = 0xc30, rw = READ_SYNC(MP), type = META f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc32, oldaddr = 0xc32, newaddr = 0xc32, rw = READ_SYNC(MP), type = META f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc26, oldaddr = 0xc26, newaddr = 0xc26, rw = READ_SYNC(MP), type = META f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xa2b, oldaddr = 0xa2b, newaddr = 0xa2b, rw = READ_SYNC(MP), type = META f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc23, oldaddr = 0xc23, newaddr = 0xc23, rw = READ_SYNC(MP), type = META f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc24, oldaddr = 0xc24, newaddr = 0xc24, rw = READ_SYNC(MP), type = META f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xa10, oldaddr = 0xa10, newaddr = 0xa10, rw = READ_SYNC(MP), type = META f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc2c, oldaddr = 0xc2c, newaddr = 0xc2c, rw = READ_SYNC(MP), type = META f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5db7, oldaddr = 0x6be00, newaddr = 0x6be00, rw = READAHEAD ^H, type = NODE f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5db9, oldaddr = 0x6be17, newaddr = 0x6be17, rw = READAHEAD ^H, type = NODE f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5dbc, oldaddr = 0x6be1a, newaddr = 0x6be1a, rw = READAHEAD ^H, type = NODE f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5dc3, oldaddr = 0x6be20, newaddr = 0x6be20, rw = READAHEAD ^H, type = NODE f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5dc7, oldaddr = 0x6be24, newaddr = 0x6be24, rw = READAHEAD ^H, type = NODE f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5dc9, oldaddr = 0x6be25, newaddr = 0x6be25, rw = READAHEAD ^H, type = NODE Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07  f2fs: schedule in between two continous batch discards  (Chao Yu)
The batch discard approach of fstrim grabs/releases the gc_mutex lock repeatedly, which makes contention on the lock more intense. So after one batch of discards has been issued in checkpoint and the lock has been released, it's better to do schedule() to increase the opportunity for other competitors to grab the gc_mutex lock.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
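The loop shape being described might be sketched as follows, with do_one_discard_batch() as a hypothetical stand-in for the per-checkpoint fstrim work; this is an outline of the idea, not the f2fs_trim_fs() code:

    #include <linux/mutex.h>
    #include <linux/sched.h>

    static void do_one_discard_batch(unsigned int batch); /* hypothetical */

    static void trim_in_batches(struct mutex *gc_mutex, unsigned int nr_batches)
    {
        unsigned int i;

        for (i = 0; i < nr_batches; i++) {
            mutex_lock(gc_mutex);
            do_one_discard_batch(i);
            mutex_unlock(gc_mutex);
            schedule();    /* let other gc_mutex waiters make progress */
        }
    }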
2016-09-07  Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.8  (Chris Mason)
2016-09-07  rxrpc: Add tracepoint for working out where aborts happen  (David Howells)
Add a tracepoint for working out where local aborts happen. Each tracepoint call is labelled with a 3-letter code so that they can be distinguished - and the DATA sequence number is added too where available. rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can indicate the circumstances when it aborts a call. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-06  jfs: Simplify code  (Christophe JAILLET)
Calling 'list_splice' followed by 'INIT_LIST_HEAD' is equivalent to 'list_splice_init'. This has been spotted with the following coccinelle script: ///// @@ expression y,z; @@ - list_splice(y,z); - INIT_LIST_HEAD(y); + list_splice_init(y,z); Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
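For clarity, the transformation the semantic patch performs amounts to the following; the two helpers are equivalent:

    #include <linux/list.h>

    /* Before: splice, then re-initialize the source list head. */
    static void splice_old_way(struct list_head *src, struct list_head *dst)
    {
        list_splice(src, dst);
        INIT_LIST_HEAD(src);
    }

    /* After: a single call does both. */
    static void splice_new_way(struct list_head *src, struct list_head *dst)
    {
        list_splice_init(src, dst);
    }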