summaryrefslogtreecommitdiff
path: root/Documentation/filesystems
AgeCommit message (Collapse)Author
2018-02-01Merge branch 'for-linus' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk Pull printk updates from Petr Mladek: - Add a console_msg_format command line option: The value "default" keeps the old "[time stamp] text\n" format. The value "syslog" allows to see the syslog-like "<log level>[timestamp] text" format. This feature was requested by people doing regression tests, for example, 0day robot. They want to have both filtered and full logs at hands. - Reduce the risk of softlockup: Pass the console owner in a busy loop. This is a new approach to the old problem. It was first proposed by Steven Rostedt on Kernel Summit 2017. It marks a context in which the console_lock owner calls console drivers and could not sleep. On the other side, printk() callers could detect this state and use a busy wait instead of a simple console_trylock(). Finally, the console_lock owner checks if there is a busy waiter at the end of the special context and eventually passes the console_lock to the waiter. The hand-off works surprisingly well and helps in many situations. Well, there is still a possibility of the softlockup, for example, when the flood of messages stops and the last owner still has too much to flush. There is increasing number of people having problems with printk-related softlockups. We might eventually need to get better solution. Anyway, this looks like a good start and promising direction. - Do not allow to schedule in console_unlock() called from printk(): This reverts an older controversial commit. The reschedule helped to avoid softlockups. But it also slowed down the console output. This patch is obsoleted by the new console waiter logic described above. In fact, the reschedule made the hand-off less effective. - Deprecate "%pf" and "%pF" format specifier: It was needed on ia64, ppc64 and parisc64 to dereference function descriptors and show the real function address. It is done transparently by "%ps" and "pS" format specifier now. Sergey Senozhatsky found that all the function descriptors were in a special elf section and could be easily detected. - Remove printk_symbol() API: It has been obsoleted by "%pS" format specifier, and this change helped to remove few continuous lines and a less intuitive old API. - Remove redundant memsets: Sergey removed unnecessary memset when processing printk.devkmsg command line option. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (27 commits) printk: drop redundant devkmsg_log_str memsets printk: Never set console_may_schedule in console_trylock() printk: Hide console waiter logic into helpers printk: Add console owner and waiter logic to load balance console writes kallsyms: remove print_symbol() function checkpatch: add pF/pf deprecation warning symbol lookup: introduce dereference_symbol_descriptor() parisc64: Add .opd based function descriptor dereference powerpc64: Add .opd based function descriptor dereference ia64: Add .opd based function descriptor dereference sections: split dereference_function_descriptor() openrisc: Fix conflicting types for _exext and _stext lib: do not use print_symbol() irq debug: do not use print_symbol() sysfs: do not use print_symbol() drivers: do not use print_symbol() x86: do not use print_symbol() unicore32: do not use print_symbol() sh: do not use print_symbol() mn10300: do not use print_symbol() ...
2018-01-31Merge tag 'docs-4.16' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "Documentation updates for 4.16. New stuff includes refcount_t documentation, errseq documentation, kernel-doc support for nested structure definitions, the removal of lots of crufty kernel-doc support for unused formats, SPDX tag documentation, the beginnings of a manual for subsystem maintainers, and lots of fixes and updates. As usual, some of the changesets reach outside of Documentation/ to effect kerneldoc comment fixes. It also adds the new LICENSES directory, of which Thomas promises I do not need to be the maintainer" * tag 'docs-4.16' of git://git.lwn.net/linux: (65 commits) linux-next: docs-rst: Fix typos in kfigure.py linux-next: DOC: HWPOISON: Fix path to debugfs in hwpoison.txt Documentation: Fix misconversion of #if docs: add index entry for networking/msg_zerocopy Documentation: security/credentials.rst: explain need to sort group_list LICENSES: Add MPL-1.1 license LICENSES: Add the GPL 1.0 license LICENSES: Add Linux syscall note exception LICENSES: Add the MIT license LICENSES: Add the BSD-3-clause "Clear" license LICENSES: Add the BSD 3-clause "New" or "Revised" License LICENSES: Add the BSD 2-clause "Simplified" license LICENSES: Add the LGPL-2.1 license LICENSES: Add the LGPL 2.0 license LICENSES: Add the GPL 2.0 license Documentation: Add license-rules.rst to describe how to properly identify file licenses scripts: kernel_doc: better handle show warnings logic fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at doc: md: Fix a file name to md-fault.c in fault-injection.txt errseq: Add to documentation tree ...
2018-01-31Merge branch 'work.misc' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "All kinds of misc stuff, without any unifying topic, from various people. Neil's d_anon patch, several bugfixes, introduction of kvmalloc analogue of kmemdup_user(), extending bitfield.h to deal with fixed-endians, assorted cleanups all over the place..." * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) alpha: osf_sys.c: use timespec64 where appropriate alpha: osf_sys.c: fix put_tv32 regression jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path dcache: delete unused d_hash_mask dcache: subtract d_hash_shift from 32 in advance fs/buffer.c: fold init_buffer() into init_page_buffers() fs: fold __inode_permission() into inode_permission() fs: add RWF_APPEND sctp: use vmemdup_user() rather than badly open-coding memdup_user() snd_ctl_elem_init_enum_names(): switch to vmemdup_user() replace_user_tlv(): switch to vmemdup_user() new primitive: vmemdup_user() memdup_user(): switch to GFP_USER eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget() eventfd: fold eventfd_ctx_read() into eventfd_read() eventfd: convert to use anon_inode_getfd() nfs4file: get rid of pointless include of btrfs.h uvc_v4l2: clean copyin/copyout up vme_user: don't use __copy_..._user() usx2y: don't bother with memdup_user() for 16-byte structure ...
2018-01-16kallsyms: remove print_symbol() functionSergey Senozhatsky
No more print_symbol()/__print_symbol() users left, remove these symbols. It was a very old API that encouraged people use continuous lines. It had been obsoleted by %pS format specifier in a normal printk() call. Link: http://lkml.kernel.org/r/20180105102538.GC471@jagdpanzerIV Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Salter <msalter@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Borislav Petkov <bp@alien8.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: LKML <linux-kernel@vger.kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: linux-ia64@vger.kernel.org Cc: linux-am33-list@redhat.com Cc: linux-sh@vger.kernel.org Cc: linux-edac@vger.kernel.org Cc: x86@kernel.org Cc: linux-snps-arc@lists.infradead.org Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Suggested-by: Joe Perches <joe@perches.com> [pmladek@suse.com: updated commit message] Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-13MAINTAINERS, nilfs2: change project home URLsRyusuke Konishi
The domain of NILFS project home was changed to "nilfs.sourceforge.io" to enable https access (the previous domain "nilfs.sourceforge.net" is redirected to the new one). Modify URLs of the project home to reflect this change and to replace their protocol from http to https. Link: http://lkml.kernel.org/r/1515416141-5614-1-git-send-email-konishi.ryusuke@lab.ntt.co.jp Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-01fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.atAdam Borowski
This link is replicated in most filesystems' config stanzas. Referring to an archived version of that site is pointless as it mostly deals with patches; user documentation is available elsewhere. Signed-off-by: Adam Borowski <kilobyte@angband.pl> CC: Alexander Viro <viro@zeniv.linux.org.uk> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Acked-by: David Sterba <dsterba@suse.com> Acked-by: "Yan, Zheng" <zyan@redhat.com> Acked-by: Chao Yu <yuchao0@huawei.com> Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> Acked-by: Steve French <smfrench@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2017-12-25VFS: don't keep disconnected dentries on d_anonNeilBrown
The original purpose of the per-superblock d_anon list was to keep disconnected dentries in the cache between consecutive requests to the NFS server. Dentries can be disconnected if a client holds a file open and repeatedly performs IO on it, and if the server drops the dentry, whether due to memory pressure, server restart, or "echo 3 > /proc/sys/vm/drop_caches". This purpose was thwarted by commit 75a6f82a0d10 ("freeing unlinked file indefinitely delayed") which caused disconnected dentries to be freed as soon as their refcount reached zero. This means that, when a dentry being used by nfsd gets disconnected, a new one needs to be allocated for every request (unless requests overlap). As the dentry has no name, no parent, and no children, there is little of value to cache. As small memory allocations are typically fast (from per-cpu free lists) this likely has little cost. This means that the original purpose of s_anon is no longer relevant: there is no longer any need to keep disconnected dentries on a list so they appear to be hashed. However, s_anon now has a new use. When you mount an NFS filesystem, the dentry stored in s_root is just a placebo. The "real" root dentry is allocated using d_obtain_root() and so it kept on the s_anon list. I don't know the reason for this, but suspect it related to NFSv4 where a mount of "server:/some/path" require NFS to look up the root filehandle on the server, then walk down "/some" and "/path" to get the filehandle to mount. Whatever the reason, NFS depends on the s_anon list and on shrink_dcache_for_umount() pruning all dentries on this list. So we cannot simply remove s_anon. We could just leave the code unchanged, but apart from that being potentially confusing, the (unfair) bit-spin-lock which protects s_anon can become a bottle neck when lots of disconnected dentries are being created. So this patch renames s_anon to s_roots, and stops storing disconnected dentries on the list. Only dentries obtained with d_obtain_root() are now stored on this list. There are many fewer of these (only NFS and NILFS2 use the call, and only during filesystem mount) so contention on the bit-lock will not be a problem. Possibly an alternate solution should be found for NFS and NILFS2, but that would require understanding their needs first. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-21Documentation/filesystems/vfat.txt: fix a remark that implies UCS2Adam Borowski
All non-historic operating systems support the full range of Unicode here, thus you can make filenames for example in Gothic (πŒΌπŒ΄π‰π…), the other Gothic (π“‚β„―β„΄π“Œ) or the third Gothic (π—†π–Ύπ—ˆπ—), or declare something as πŸ’©. Characters above U+FFFF are encoded on four bytes. Signed-off-by: Adam Borowski <kilobyte@angband.pl> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2017-12-11ovl: don't follow redirects if redirect_dir=offMiklos Szeredi
Overlayfs is following redirects even when redirects are disabled. If this is unintentional (probably the majority of cases) then this can be a problem. E.g. upper layer comes from untrusted USB drive, and attacker crafts a redirect to enable read access to otherwise unreadable directories. If "redirect_dir=off", then turn off following as well as creation of redirects. If "redirect_dir=follow", then turn on following, but turn off creation of redirects (which is what "redirect_dir=off" does now). This is a backward incompatible change, so make it dependent on a config option. Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-11-17Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: - a bit more MM - procfs updates - dynamic-debug fixes - lib/ updates - checkpatch - epoll - nilfs2 - signals - rapidio - PID management cleanup and optimization - kcov updates - sysvipc updates - quite a few misc things all over the place * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits) EXPERT Kconfig menu: fix broken EXPERT menu include/asm-generic/topology.h: remove unused parent_node() macro arch/tile/include/asm/topology.h: remove unused parent_node() macro arch/sparc/include/asm/topology_64.h: remove unused parent_node() macro arch/sh/include/asm/topology.h: remove unused parent_node() macro arch/ia64/include/asm/topology.h: remove unused parent_node() macro drivers/pcmcia/sa1111_badge4.c: avoid unused function warning mm: add infrastructure for get_user_pages_fast() benchmarking sysvipc: make get_maxid O(1) again sysvipc: properly name ipc_addid() limit parameter sysvipc: duplicate lock comments wrt ipc_addid() sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE initramfs: use time64_t timestamps drivers/watchdog: make use of devm_register_reboot_notifier() kernel/reboot.c: add devm_register_reboot_notifier() kcov: update documentation Makefile: support flag -fsanitizer-coverage=trace-cmp kcov: support comparison operands collection kcov: remove pointless current != NULL check kernel/panic.c: add TAINT_AUX ...
2017-11-17proc, coredump: add CoreDumping flag to /proc/pid/statusRoman Gushchin
Right now there is no convenient way to check if a process is being coredumped at the moment. It might be necessary to recognize such state to prevent killing the process and getting a broken coredump. Writing a large core might take significant time, and the process is unresponsive during it, so it might be killed by timeout, if another process is monitoring and killing/restarting hanging tasks. We're getting a significant number of corrupted coredump files on machines in our fleet, just because processes are being killed by timeout in the middle of the core writing process. We do have a process health check, and some agent is responsible for restarting processes which are not responding for health check requests. Writing a large coredump to the disk can easily exceed the reasonable timeout (especially on an overloaded machine). This flag will allow the agent to distinguish processes which are being coredumped, extend the timeout for them, and let them produce a full coredump file. To provide an ability to detect if a process is in the state of being coredumped, we can expose a boolean CoreDumping flag in /proc/pid/status. Example: $ cat core.sh #!/bin/sh echo "|/usr/bin/sleep 10" > /proc/sys/kernel/core_pattern sleep 1000 & PID=$! cat /proc/$PID/status | grep CoreDumping kill -ABRT $PID sleep 1 cat /proc/$PID/status | grep CoreDumping $ ./core.sh CoreDumping: 0 CoreDumping: 1 [guro@fb.com: document CoreDumping flag in /proc/<pid>/status] Link: http://lkml.kernel.org/r/20170928135357.GA8470@castle.DHCP.thefacebook.com Link: http://lkml.kernel.org/r/20170920230634.31572-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17Merge branch 'work.cramfs' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull cramfs updates from Al Viro: "Nicolas Pitre's cramfs work" * 'work.cramfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: cramfs: rehabilitate it cramfs: add mmap support cramfs: implement uncompressed and arbitrary data block positioning cramfs: direct memory access support
2017-11-17Merge branch 'work.misc' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff, really no common topic here" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: grab the lock instead of blocking in __fd_install during resizing vfs: stop clearing close on exec when closing a fd include/linux/fs.h: fix comment about struct address_space fs: make fiemap work from compat_ioctl coda: fix 'kernel memory exposure attempt' in fsync pstore: remove unneeded unlikely() vfs: remove unneeded unlikely() stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case make vfs_ustat() static do_handle_open() should be static elf_fdpic: fix unused variable warning fold destroy_super() into __put_super() new helper: destroy_unused_super() fix address space warnings in ipc/ acct.h: get rid of detritus
2017-11-16Merge tag 'afs-next-20171113' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS updates from David Howells: "kAFS filesystem driver overhaul. The major points of the overhaul are: (1) Preliminary groundwork is laid for supporting network-namespacing of kAFS. The remainder of the namespacing work requires some way to pass namespace information to submounts triggered by an automount. This requires something like the mount overhaul that's in progress. (2) sockaddr_rxrpc is used in preference to in_addr for holding addresses internally and add support for talking to the YFS VL server. With this, kAFS can do everything over IPv6 as well as IPv4 if it's talking to servers that support it. (3) Callback handling is overhauled to be generally passive rather than active. 'Callbacks' are promises by the server to tell us about data and metadata changes. Callbacks are now checked when we next touch an inode rather than actively going and looking for it where possible. (4) File access permit caching is overhauled to store the caching information per-inode rather than per-directory, shared over subordinate files. Whilst older AFS servers only allow ACLs on directories (shared to the files in that directory), newer AFS servers break that restriction. To improve memory usage and to make it easier to do mass-key removal, permit combinations are cached and shared. (5) Cell database management is overhauled to allow lighter locks to be used and to make cell records autonomous state machines that look after getting their own DNS records and cleaning themselves up, in particular preventing races in acquiring and relinquishing the fscache token for the cell. (6) Volume caching is overhauled. The afs_vlocation record is got rid of to simplify things and the superblock is now keyed on the cell and the numeric volume ID only. The volume record is tied to a superblock and normal superblock management is used to mediate the lifetime of the volume fscache token. (7) File server record caching is overhauled to make server records independent of cells and volumes. A server can be in multiple cells (in such a case, the administrator must make sure that the VL services for all cells correctly reflect the volumes shared between those cells). Server records are now indexed using the UUID of the server rather than the address since a server can have multiple addresses. (8) File server rotation is overhauled to handle VMOVED, VBUSY (and similar), VOFFLINE and VNOVOL indications and to handle rotation both of servers and addresses of those servers. The rotation will also wait and retry if the server says it is busy. (9) Data writeback is overhauled. Each inode no longer stores a list of modified sections tagged with the key that authorised it in favour of noting the modified region of a page in page->private and storing a list of keys that made modifications in the inode. This simplifies things and allows other keys to be used to actually write to the server if a key that made a modification becomes useless. (10) Writable mmap() is implemented. This allows a kernel to be build entirely on AFS. Note that Pre AFS-3.4 servers are no longer supported, though this can be added back if necessary (AFS-3.4 was released in 1998)" * tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits) afs: Protect call->state changes against signals afs: Trace page dirty/clean afs: Implement shared-writeable mmap afs: Get rid of the afs_writeback record afs: Introduce a file-private data record afs: Use a dynamic port if 7001 is in use afs: Fix directory read/modify race afs: Trace the sending of pages afs: Trace the initiation and completion of client calls afs: Fix documentation on # vs % prefix in mount source specification afs: Fix total-length calculation for multiple-page send afs: Only progress call state at end of Tx phase from rxrpc callback afs: Make use of the YFS service upgrade to fully support IPv6 afs: Overhaul volume and server record caching and fileserver rotation afs: Move server rotation code into its own file afs: Add an address list concept afs: Overhaul cell database management afs: Overhaul permit caching afs: Overhaul the callback handling afs: Rename struct afs_call server member to cm_server ...
2017-11-15mm: consolidate page table accountingKirill A. Shutemov
Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-14Merge branch 'for_linus' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota, ext2, isofs and udf fixes from Jan Kara: - two small quota error handling fixes - two isofs fixes for architectures with signed char - several udf block number overflow and signedness fixes - ext2 rework of mount option handling to avoid GFP_KERNEL allocation with spinlock held - ... it also contains a patch to implement auditing of responses to fanotify permission events. That should have been in the fanotify pull request but I mistakenly merged that patch into a wrong branch and noticed only now at which point I don't think it's worth rebasing and redoing. * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: be aware of error from dquot_initialize quota: fix potential infinite loop isofs: use unsigned char types consistently isofs: fix timestamps beyond 2027 udf: Fix some sign-conversion warnings udf: Fix signed/unsigned format specifiers udf: Fix 64-bit sign extension issues affecting blocks > 0x7FFFFFFF udf: Remove some outdate references from documentation udf: Avoid overflow when session starts at large offset ext2: Fix possible sleep in atomic during mount option parsing ext2: Parse mount options into a dedicated structure audit: Record fanotify access control decisions
2017-11-14Merge tag 'fscrypt-for-linus' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt Pull fscrypt updates from Ted Ts'o: "Lots of cleanups, mostly courtesy by Eric Biggers" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: fscrypt: lock mutex before checking for bounce page pool fscrypt: add a documentation file for filesystem-level encryption ext4: switch to fscrypt_prepare_setattr() ext4: switch to fscrypt_prepare_lookup() ext4: switch to fscrypt_prepare_rename() ext4: switch to fscrypt_prepare_link() ext4: switch to fscrypt_file_open() fscrypt: new helper function - fscrypt_prepare_setattr() fscrypt: new helper function - fscrypt_prepare_lookup() fscrypt: new helper function - fscrypt_prepare_rename() fscrypt: new helper function - fscrypt_prepare_link() fscrypt: new helper function - fscrypt_file_open() fscrypt: new helper function - fscrypt_require_key() fscrypt: remove unneeded empty fscrypt_operations structs fscrypt: remove ->is_encrypted() fscrypt: switch from ->is_encrypted() to IS_ENCRYPTED() fs, fscrypt: add an S_ENCRYPTED inode flag fscrypt: clean up include file mess
2017-11-13Merge branch 'locking-core-for-linus' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core locking updates from Ingo Molnar: "The main changes in this cycle are: - Another attempt at enabling cross-release lockdep dependency tracking (automatically part of CONFIG_PROVE_LOCKING=y), this time with better performance and fewer false positives. (Byungchul Park) - Introduce lockdep_assert_irqs_enabled()/disabled() and convert open-coded equivalents to lockdep variants. (Frederic Weisbecker) - Add down_read_killable() and use it in the VFS's iterate_dir() method. (Kirill Tkhai) - Convert remaining uses of ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE(). Most of the conversion was Coccinelle driven. (Mark Rutland, Paul E. McKenney) - Get rid of lockless_dereference(), by strengthening Alpha atomics, strengthening READ_ONCE() with smp_read_barrier_depends() and thus being able to convert users of lockless_dereference() to READ_ONCE(). (Will Deacon) - Various micro-optimizations: - better PV qspinlocks (Waiman Long), - better x86 barriers (Michael S. Tsirkin) - better x86 refcounts (Kees Cook) - ... plus other fixes and enhancements. (Borislav Petkov, Juergen Gross, Miguel Bernal Marin)" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits) locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE rcu: Use lockdep to assert IRQs are disabled/enabled netpoll: Use lockdep to assert IRQs are disabled/enabled timers/posix-cpu-timers: Use lockdep to assert IRQs are disabled/enabled sched/clock, sched/cputime: Use lockdep to assert IRQs are disabled/enabled irq_work: Use lockdep to assert IRQs are disabled/enabled irq/timings: Use lockdep to assert IRQs are disabled/enabled perf/core: Use lockdep to assert IRQs are disabled/enabled x86: Use lockdep to assert IRQs are disabled/enabled smp/core: Use lockdep to assert IRQs are disabled/enabled timers/hrtimer: Use lockdep to assert IRQs are disabled/enabled timers/nohz: Use lockdep to assert IRQs are disabled/enabled workqueue: Use lockdep to assert IRQs are disabled/enabled irq/softirqs: Use lockdep to assert IRQs are disabled/enabled locking/lockdep: Add IRQs disabled/enabled assertion APIs: lockdep_assert_irqs_enabled()/disabled() locking/pvqspinlock: Implement hybrid PV queued/unfair locks locking/rwlocks: Fix comments x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion() workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes ...
2017-11-13Merge tag 'docs-4.15' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "A relatively calm cycle for the docs tree again. - The old driver statement has been added to the kernel docs. - We have a couple of new helper scripts. find-unused-docs.sh from Sayli Karnic will point out kerneldoc comments that are not actually used in the documentation. Jani Nikula's documentation-file-ref-check finds references to non-existing files. - A new ftrace document from Steve Rostedt. - Vinod Koul converted the dmaengine docs to RST Beyond that, it's mostly simple fixes. This set reaches outside of Documentation/ a bit more than most. In all cases, the changes are to comment docs, mostly from Randy, in places where there didn't seem to be anybody better to take them" * tag 'docs-4.15' of git://git.lwn.net/linux: (52 commits) documentation: fb: update list of available compiled-in fonts MAINTAINERS: update DMAengine documentation location dmaengine: doc: ReSTize pxa_dma doc dmaengine: doc: ReSTize dmatest doc dmaengine: doc: ReSTize client API doc dmaengine: doc: ReSTize provider doc dmaengine: doc: Add ReST style dmaengine document ftrace/docs: Add documentation on how to use ftrace from within the kernel bug-hunting.rst: Fix an example and a typo in a Sphinx tag scripts: Add a script to find unused documentation samples: Convert timers to use timer_setup() documentation: kernel-api: add more info on bitmap functions Documentation: fix selftests related file refs Documentation: fix ref to power basic-pm-debugging Documentation: fix ref to trace stm content Documentation: fix ref to coccinelle content Documentation: fix ref to workqueue content Documentation: fix ref to sphinx/kerneldoc.py Documentation: fix locking rt-mutex doc refs docs: dev-tools: correct Coccinelle version number ...
2017-11-13afs: Fix documentation on # vs % prefix in mount source specificationDavid Howells
The documentation that describes the #-prefix and the %-prefix used when specifying the source to mount is has the descriptions the wrong way round. Switch them over. Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-05vfs: grab the lock instead of blocking in __fd_install during resizingMateusz Guzik
Explicit locking in the fallback case provides a safe state of the table. Getting rid of blocking semantics makes __fd_install usable again in non-sleepable contexts, which easies backporting efforts. There is a side effect of slightly nicer assembly for the common case as might_sleep can now be removed. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-31fscrypt: add a documentation file for filesystem-level encryptionEric Biggers
Perhaps long overdue, add a documentation file for filesystem-level encryption, a.k.a. fscrypt or fs/crypto/, to the Documentation directory. The new file is based loosely on the latest version of the "EXT4 Encryption Design Document (public version)" Google Doc, but with many improvements made, including: - Reflect the reality that it is not specific to ext4 anymore. - More thoroughly document the design and user-visible API/behavior. - Replace outdated information, such as the outdated explanation of how encrypted filenames are hashed for indexed directories and how encrypted filenames are presented to userspace without the key. (This was changed just before release.) For now the focus is on the design and user-visible API/behavior, not on how to add encryption support to a filesystem --- since the internal API is still pretty messy and any standalone documentation for it would become outdated as things get refactored over time. Reviewed-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-10-25locking/atomics, doc/filesystems: Convert ACCESS_ONCE() referencesPaul E. McKenney
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't currently harmful. However, for some features it is necessary to instrument reads and writes separately, which is not possible with ACCESS_ONCE(). This distinction is critical to correct operation. It's possible to transform the bulk of kernel code using the Coccinelle script below. However, this doesn't handle documentation, leaving references to ACCESS_ONCE() instances which have been removed. As a preparatory step, this patch converts the filesystems documentation to use {READ,WRITE}_ONCE() consistently. ---- virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Link: http://lkml.kernel.org/r/1508792849-3115-14-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-19Documentation: fix selftests related file refsTom Saeger
Make refs to selftests files valid including: - watchdog-test.c - dnotify_test.c Signed-off-by: Tom Saeger <tom.saeger@oracle.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2017-10-16udf: Remove some outdate references from documentationJan Kara
Remove outdated references to maintainer and mailing list from the documentation - reference MAINTAINERS instead. Also update reference to current repository of udf-tools. Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-15cramfs: rehabilitate itNicolas Pitre
Update documentation, pointer to latest tools, appoint myself as maintainer. Given it's been unloved for so long, I don't expect anyone will protest. Signed-off-by: Nicolas Pitre <nico@linaro.org> Tested-by: Chris Brandt <chris.brandt@renesas.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-06Merge branch 'overlayfs-linus' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Fix a regression in 4.14 and one in 4.13. The latter is a case when Docker is doing something it really shouldn't and gets away with it. We now print a warning instead of erroring out. There are also fixes to several error paths" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fix regression caused by exclusive upper/work dir protection ovl: fix missing unlock_rename() in ovl_do_copy_up() ovl: fix dentry leak in ovl_indexdir_cleanup() ovl: fix dput() of ERR_PTR in ovl_cleanup_index() ovl: fix error value printed in ovl_lookup_index() ovl: fix may_write_real() for overlayfs directories
2017-10-05ovl: fix regression caused by exclusive upper/work dir protectionAmir Goldstein
Enforcing exclusive ownership on upper/work dirs caused a docker regression: https://github.com/moby/moby/issues/34672. Euan spotted the regression and pointed to the offending commit. Vivek has brought the regression to my attention and provided this reproducer: Terminal 1: mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none merged/ Terminal 2: unshare -m Terminal 1: umount merged mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none merged/ mount: /root/overlay-testing/merged: none already mounted or mount point busy To fix the regression, I replaced the error with an alarming warning. With index feature enabled, mount does fail, but logs a suggestion to override exclusive dir protection by disabling index. Note that index=off mount does take the inuse locks, so a concurrent index=off will issue the warning and a concurrent index=on mount will fail. Documentation was updated to reflect this change. Fixes: 2cac0c00a6cd ("ovl: get exclusive ownership on upper/work dirs") Cc: <stable@vger.kernel.org> # v4.13 Reported-by: Euan Kemp <euank@euank.com> Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-03Merge tag 'driver-core-4.14-rc4' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are a few small fixes for 4.14-rc4. The removal of DRIVER_ATTR() was almost completed by 4.14-rc1, but one straggler made it in through some other tree (odds are, one of mine...) So there's a simple removal of the last user, and then finally the macro is removed from the tree. There's a fix for old crazy udev instances that insist on reloading a module when it is removed from the kernel due to the new uevents for bind/unbind. This fixes the reported regression, hopefully some year in the future we can drop the workaround, once users update to the latest version, but I'm not holding my breath. And then there's a build fix for a linker warning, and a buffer overflow fix to match the PCI fixes you took through the PCI tree in the same area. All of these have been in linux-next for a few weeks while I've been traveling, sorry for the delay" * tag 'driver-core-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: driver core: remove DRIVER_ATTR fpga: altera-cvp: remove DRIVER_ATTR() usage driver core: platform: Don't read past the end of "driver_override" buffer base: arch_topology: fix section mismatch build warnings driver core: suppress sending MODALIAS in UNBIND uevents
2017-09-26Documentation: fix little inconsistenciesPavel Machek
Fix little inconsistencies in Documentation: make case and spacing match surrounding text. Signed-off-by: Pavel Machek <pavel@ucw.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2017-09-19Merge tag '4.14-smb3-multidialect-support-and-fixes-for-stable' of ↡Linus Torvalds
git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Convert default dialect to smb2.1 or later to allow connecting to Windows 7 for example, also includes some fixes for stable" * tag '4.14-smb3-multidialect-support-and-fixes-for-stable' of git://git.samba.org/sfrench/cifs-2.6: Update version of cifs module cifs: hide unused functions SMB3: Add support for multidialect negotiate (SMB2.1 and later) CIFS/SMB3: Update documentation to reflect SMB3 and various changes cifs: check rsp for NULL before dereferencing in SMB2_open
2017-09-19driver core: remove DRIVER_ATTRGreg Kroah-Hartman
DRIVER_ATTR is no longer in use, and driver authors should be using DRIVER_ATTR_RW() or DRIVER_ATTR_RO() or DRIVER_ATTR_WO() instead in order to always get the permissions correct. So remove it so that no one can use it anymore. Acked-by: Alan Tull <atull@kernel.org> Reviewed-by: Moritz Fischer <mdf@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-17CIFS/SMB3: Update documentation to reflect SMB3 and various changesSteve French
Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-09-15Merge tag 'for-linus-4.14-ofs2' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs updates from Mike Marshall: "Some cleanups and a big bug fix for ACLs. When I was reviewing Jan Kara's ACL patch, I realized that Orangefs ACL code was busted, not just in the kernel module, but in the server as well. I've been working on the code in the server mostly, but here's one kernel patch, there will be more" * tag 'for-linus-4.14-ofs2' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: Adjust three checks for null pointers orangefs: Use kcalloc() in orangefs_prepare_cdm_array() orangefs: Delete error messages for a failed memory allocation in five functions orangefs: constify xattr_handler structure orangefs: don't call filemap_write_and_wait from fsync orangefs: off by ones in xattr size checks orangefs: documentation clean up orangefs: react properly to posix_acl_update_mode's aftermath. orangefs: Don't clear SGID when inheriting ACLs
2017-09-14Merge branch 'work.mount' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount flag updates from Al Viro: "Another chunk of fmount preparations from dhowells; only trivial conflicts for that part. It separates MS_... bits (very grotty mount(2) ABI) from the struct super_block ->s_flags (kernel-internal, only a small subset of MS_... stuff). This does *not* convert the filesystems to new constants; only the infrastructure is done here. The next step in that series is where the conflicts would be; that's the conversion of filesystems. It's purely mechanical and it's better done after the merge, so if you could run something like list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$') sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \ -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \ -e 's/\<MS_NODEV\>/SB_NODEV/g' \ -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \ -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \ -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \ -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \ -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \ -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \ -e 's/\<MS_SILENT\>/SB_SILENT/g' \ -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \ -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \ -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \ -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \ $list and commit it with something along the lines of 'convert filesystems away from use of MS_... constants' as commit message, it would save a quite a bit of headache next cycle" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: VFS: Differentiate mount flags (MS_*) from internal superblock flags VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-14orangefs: documentation clean upMike Marshall
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-09-13Merge branch 'overlayfs-linus' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "This fixes d_ino correctness in readdir, which brings overlayfs on par with normal filesystems regarding inode number semantics, as long as all layers are on the same filesystem. There are also some bug fixes, one in particular (random ioctl's shouldn't be able to modify lower layers) that touches some vfs code, but of course no-op for non-overlay fs" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fix false positive ESTALE on lookup ovl: don't allow writing ioctl on lower layer ovl: fix relatime for directories vfs: add flags to d_real() ovl: cleanup d_real for negative ovl: constant d_ino for non-merge dirs ovl: constant d_ino across copy up ovl: fix readdir error value ovl: check snprintf return
2017-09-12Merge tag 'f2fs-for-4.14' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've mostly tuned f2fs to provide better user experience for Android. Especially, we've worked on atomic write feature again with SQLite community in order to support it officially. And we added or modified several facilities to analyze and enhance IO behaviors. Major changes include: - add app/fs io stat - add inode checksum feature - support project/journalled quota - enhance atomic write with new ioctl() which exposes feature set - enhance background gc/discard/fstrim flows with new gc_urgent mode - add F2FS_IOC_FS{GET,SET}XATTR - fix some quota flows" * tag 'f2fs-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (63 commits) f2fs: hurry up to issue discard after io interruption f2fs: fix to show correct discard_granularity in sysfs f2fs: detect dirty inode in evict_inode f2fs: clear radix tree dirty tag of pages whose dirty flag is cleared f2fs: speed up gc_urgent mode with SSR f2fs: better to wait for fstrim completion f2fs: avoid race in between read xattr & write xattr f2fs: make get_lock_data_page to handle encrypted inode f2fs: use generic terms used for encrypted block management f2fs: introduce f2fs_encrypted_file for clean-up Revert "f2fs: add a new function get_ssr_cost" f2fs: constify super_operations f2fs: fix to wake up all sleeping flusher f2fs: avoid race in between atomic_read & atomic_inc f2fs: remove unneeded parameter of change_curseg f2fs: update i_flags correctly f2fs: don't check inode's checksum if it was dirtied or writebacked f2fs: don't need to update inode checksum for recovery f2fs: trigger fdatasync for non-atomic_write file f2fs: fix to avoid race in between aio and gc ...
2017-09-06fscache: remove unused ->now_uncached callbackJan Kara
Patch series "Ranged pagevec lookup", v2. In this series I make pagevec_lookup() update the index (to be consistent with pagevec_lookup_tag() and also as a preparation for ranged lookups), provide ranged variant of pagevec_lookup() and use it in places where it makes sense. This not only removes some common code but is also a measurable performance win for some use cases (see patch 4/10) where radix tree is sparse and searching & grabing of a page after the end of the range has measurable overhead. This patch (of 10): The callback doesn't ever get called. Remove it. Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06dax: use common 4k zero page for dax mmap readsRoss Zwisler
When servicing mmap() reads from file holes the current DAX code allocates a page cache page of all zeroes and places the struct page pointer in the mapping->page_tree radix tree. This has three major drawbacks: 1) It consumes memory unnecessarily. For every 4k page that is read via a DAX mmap() over a hole, we allocate a new page cache page. This means that if you read 1GiB worth of pages, you end up using 1GiB of zeroed memory. This is easily visible by looking at the overall memory consumption of the system or by looking at /proc/[pid]/smaps: 7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data Size: 1048576 kB Rss: 1048576 kB Pss: 1048576 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 1048576 kB Private_Dirty: 0 kB Referenced: 1048576 kB Anonymous: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Locked: 0 kB 2) It is slower than using a common zero page because each page fault has more work to do. Instead of just inserting a common zero page we have to allocate a page cache page, zero it, and then insert it. Here are the average latencies of dax_load_hole() as measured by ftrace on a random test box: Old method, using zeroed page cache pages: 3.4 us New method, using the common 4k zero page: 0.8 us This was the average latency over 1 GiB of sequential reads done by this simple fio script: [global] size=1G filename=/root/dax/data fallocate=none [io] rw=read ioengine=mmap 3) The fact that we had to check for both DAX exceptional entries and for page cache pages in the radix tree made the DAX code more complex. Solve these issues by following the lead of the DAX PMD code and using a common 4k zero page instead. As with the PMD code we will now insert a DAX exceptional entry into the radix tree instead of a struct page pointer which allows us to remove all the special casing in the DAX code. Note that we do still pretty aggressively check for regular pages in the DAX radix tree, especially where we take action based on the bits set in the page. If we ever find a regular page in our radix tree now that most likely means that someone besides DAX is inserting pages (which has happened lots of times in the past), and we want to find that out early and fail loudly. This solution also removes the extra memory consumption. Here is that same /proc/[pid]/smaps after 1GiB of reading from a hole with the new code: 7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data Size: 1048576 kB Rss: 0 kB Pss: 0 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 0 kB Anonymous: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Locked: 0 kB Overall system memory consumption is similarly improved. Another major change is that we remove dax_pfn_mkwrite() from our fault flow, and instead rely on the page fault itself to make the PTE dirty and writeable. The following description from the patch adding the vm_insert_mixed_mkwrite() call explains this a little more: "To be able to use the common 4k zero page in DAX we need to have our PTE fault path look more like our PMD fault path where a PTE entry can be marked as dirty and writeable as it is first inserted rather than waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call. Right now we can rely on having a dax_pfn_mkwrite() call because we can distinguish between these two cases in do_wp_page(): case 1: 4k zero page => writable DAX storage case 2: read-only DAX storage => writeable DAX storage This distinction is made by via vm_normal_page(). vm_normal_page() returns false for the common 4k zero page, though, just as it does for DAX ptes. Instead of special casing the DAX + 4k zero page case we will simplify our DAX PTE page fault sequence so that it matches our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper. We will instead use dax_iomap_fault() to handle write-protection faults. This means that insert_pfn() needs to follow the lead of insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If 'mkwrite' is set insert_pfn() will do the work that was previously done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path" Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-04vfs: add flags to d_real()Miklos Szeredi
Add a separate flags argument (in addition to the open flags) to control the behavior of d_real(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-08-26swap: Remove obsolete sentenceNikolay Borisov
Currently there are no ->swap_{in,out} method in address_space_operations sructure definition, so the statement that anything is going to be proxied through them is wrong. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2017-08-21f2fs: support journalled quotaChao Yu
This patch supports to enable f2fs to accept quota information through mount option: - {usr,grp,prj}jquota=<quota file path> - jqfmt=<quota type> Then, in ->mount flow, we can recover quota file during log replaying, by this, journelled quota can be supported. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: Fix wrong return values.] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-15f2fs: introduce gc_urgent mode for background GCJaegeuk Kim
This patch adds a sysfs entry to control urgent mode for background GC. If this is set, background GC thread conducts GC with gc_urgent_sleep_time all the time. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31f2fs: support project quotaChao Yu
This patch adds to support plain project quota. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-17VFS: Differentiate mount flags (MS_*) from internal superblock flagsDavid Howells
Differentiate the MS_* flags passed to mount(2) from the internal flags set in the super_block's s_flags. s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. In this patch, just the headers are altered and some kernel code where blind automated conversion isn't necessarily correct. Note that this shows up some interesting issues: (1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV -> MNT_NODEV) without passing this on to the filesystem, but some filesystems set such flags anyway. (2) The ->remount_fs() methods of some filesystems adjust the *flags argument by setting MS_* flags in it, such as MS_NOATIME - but these flags are then scrubbed by do_remount_sb() (only the occupants of MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK, MS_I_VERSION and MS_LAZYTIME) I'm not sure what's the best way to solve all these cases. Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-15Merge branch 'work.mount' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ->s_options removal from Al Viro: "Preparations for fsmount/fsopen stuff (coming next cycle). Everything gets moved to explicit ->show_options(), killing ->s_options off + some cosmetic bits around fs/namespace.c and friends. Basically, the stuff needed to work with fsmount series with minimum of conflicts with other work. It's not strictly required for this merge window, but it would reduce the PITA during the coming cycle, so it would be nice to have those bits and pieces out of the way" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: isofs: Fix isofs_show_options() VFS: Kill off s_options and helpers orangefs: Implement show_options 9p: Implement show_options isofs: Implement show_options afs: Implement show_options affs: Implement show_options befs: Implement show_options spufs: Implement show_options bpf: Implement show_options ramfs: Implement show_options pstore: Implement show_options omfs: Implement show_options hugetlbfs: Implement show_options VFS: Don't use save/replace_mount_options if not using generic_show_options VFS: Provide empty name qstr VFS: Make get_filesystem() return the affected filesystem VFS: Clean up whitespace in fs/namespace.c and fs/super.c Provide a function to create a NUL-terminated string from unterminated data
2017-07-12procfs: fdinfo: extend information about epoll target filesCyrill Gorcunov
Since it is possbile to have same number in tfd field (say file added, closed, then nother file dup'ed to same number and added back) it is imposible to distinguish such target files solely by their numbers. Strictly speaking regular applications don't need to recognize these targets at all but for checkpoint/restore sake we need to collect targets to be able to push them back on restore stage in a proper order. Thus lets add file position, inode and device number where this target lays. This three fields can be used as a primary key for sorting, and together with kcmp help CRIU can find out an exact file target (from the whole set of processes being checkpointed). Link: http://lkml.kernel.org/r/20170424154423.436491881@gmail.com Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12Merge branch 'overlayfs-linus' of ↡Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "This work from Amir introduces the inodes index feature, which provides: - hardlinks are not broken on copy up - infrastructure for overlayfs NFS export This also fixes constant st_ino for samefs case for lower hardlinks" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (33 commits) ovl: mark parent impure and restore timestamp on ovl_link_up() ovl: document copying layers restrictions with inodes index ovl: cleanup orphan index entries ovl: persistent overlay inode nlink for indexed inodes ovl: implement index dir copy up ovl: move copy up lock out ovl: rearrange copy up ovl: add flag for upper in ovl_entry ovl: use struct copy_up_ctx as function argument ovl: base tmpfile in workdir too ovl: factor out ovl_copy_up_inode() helper ovl: extract helper to get temp file in copy up ovl: defer upper dir lock to tempfile link ovl: hash overlay non-dir inodes by copy up origin ovl: cleanup bad and stale index entries on mount ovl: lookup index entry for copy up origin ovl: verify index dir matches upper dir ovl: verify upper root dir matches lower root dir ovl: introduce the inodes index dir feature ovl: generalize ovl_create_workdir() ...
2017-07-11VFS: Kill off s_options and helpersDavid Howells
Kill off s_options, save/replace_mount_options() and generic_show_options() as all filesystems now implement ->show_options() for themselves. This should make it easier to implement a context-based mount where the mount options can be passed individually over a file descriptor. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>