summaryrefslogtreecommitdiff
path: root/include/uapi/linux/fs.h
AgeCommit message (Collapse)Author
2024-07-21Merge tag 'mm-stable-2024-07-21-14-50' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - In the series "mm: Avoid possible overflows in dirty throttling" Jan Kara addresses a couple of issues in the writeback throttling code. These fixes are also targetted at -stable kernels. - Ryusuke Konishi's series "nilfs2: fix potential issues related to reserved inodes" does that. This should actually be in the mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad. - More folio conversions from Kefeng Wang in the series "mm: convert to folio_alloc_mpol()" - Kemeng Shi has sent some cleanups to the writeback code in the series "Add helper functions to remove repeated code and improve readability of cgroup writeback" - Kairui Song has made the swap code a little smaller and a little faster in the series "mm/swap: clean up and optimize swap cache index". - In the series "mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David Hildenbrand has reworked the rather sketchy handling of the use of the zeropage in MAP_SHARED mappings. I don't see any runtime effects here - more a cleanup/understandability/maintainablity thing. - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of higher addresses, for aarch64. The (poorly named) series is "Restructure va_high_addr_switch". - The core TLB handling code gets some cleanups and possible slight optimizations in Bang Li's series "Add update_mmu_tlb_range() to simplify code". - Jane Chu has improved the handling of our fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the series "Enhance soft hwpoison handling and injection". - Jeff Johnson has sent a billion patches everywhere to add MODULE_DESCRIPTION() to everything. Some landed in this pull. - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has simplified migration's use of hardware-offload memory copying. - Yosry Ahmed performs more folio API conversions in his series "mm: zswap: trivial folio conversions". - In the series "large folios swap-in: handle refault cases first", Chuanhua Han inches us forward in the handling of large pages in the swap code. This is a cleanup and optimization, working toward the end objective of full support of large folio swapin/out. - In the series "mm,swap: cleanup VMA based swap readahead window calculation", Huang Ying has contributed some cleanups and a possible fixlet to his VMA based swap readahead code. - In the series "add mTHP support for anonymous shmem" Baolin Wang has taught anonymous shmem mappings to use multisize THP. By default this is a no-op - users must opt in vis sysfs controls. Dramatic improvements in pagefault latency are realized. - David Hildenbrand has some cleanups to our remaining use of page_mapcount() in the series "fs/proc: move page_mapcount() to fs/proc/internal.h". - David also has some highmem accounting cleanups in the series "mm/highmem: don't track highmem pages manually". - Build-time fixes and cleanups from John Hubbard in the series "cleanups, fixes, and progress towards avoiding "make headers"". - Cleanups and consolidation of the core pagemap handling from Barry Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers and utilize them". - Lance Yang's series "Reclaim lazyfree THP without splitting" has reduced the latency of the reclaim of pmd-mapped THPs under fairly common circumstances. A 10x speedup is seen in a microbenchmark. It does this by punting to aother CPU but I guess that's a win unless all CPUs are pegged. - hugetlb_cgroup cleanups from Xiu Jianfeng in the series "mm/hugetlb_cgroup: rework on cftypes". - Miaohe Lin's series "Some cleanups for memory-failure" does just that thing. - Someone other than SeongJae has developed a DAMON feature in Honggyu Kim's series "DAMON based tiered memory management for CXL memory". This adds DAMON features which may be used to help determine the efficiency of our placement of CXL/PCIe attached DRAM. - DAMON user API centralization and simplificatio work in SeongJae Park's series "mm/damon: introduce DAMON parameters online commit function". - In the series "mm: page_type, zsmalloc and page_mapcount_reset()" David Hildenbrand does some maintenance work on zsmalloc - partially modernizing its use of pageframe fields. - Kefeng Wang provides more folio conversions in the series "mm: remove page_maybe_dma_pinned() and page_mkclean()". - More cleanup from David Hildenbrand, this time in the series "mm/memory_hotplug: use PageOffline() instead of PageReserved() for !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline() pages" and permits the removal of some virtio-mem hacks. - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()" is a cleanup to the anon folio handling in preparation for mTHP (multisize THP) swapin. - Kefeng Wang's series "mm: improve clear and copy user folio" implements more folio conversions, this time in the area of large folio userspace copying. - The series "Docs/mm/damon/maintaier-profile: document a mailing tool and community meetup series" tells people how to get better involved with other DAMON developers. From SeongJae Park. - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does that. - David Hildenbrand sends along more cleanups, this time against the migration code. The series is "mm/migrate: move NUMA hinting fault folio isolation + checks under PTL". - Jan Kara has found quite a lot of strangenesses and minor errors in the readahead code. He addresses this in the series "mm: Fix various readahead quirks". - SeongJae Park's series "selftests/damon: test DAMOS tried regions and {min,max}_nr_regions" adds features and addresses errors in DAMON's self testing code. - Gavin Shan has found a userspace-triggerable WARN in the pagecache code. The series "mm/filemap: Limit page cache size to that supported by xarray" addresses this. The series is marked cc:stable. - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations and cleanup" cleans up and slightly optimizes KSM. - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of code motion. The series (which also makes the memcg-v1 code Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put under config option" and "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1" - Dan Schatzberg's series "Add swappiness argument to memory.reclaim" adds an additional feature to this cgroup-v2 control file. - The series "Userspace controls soft-offline pages" from Jiaqi Yan permits userspace to stop the kernel's automatic treatment of excessive correctable memory errors. In order to permit userspace to monitor and handle this situation. - Kefeng Wang's series "mm: migrate: support poison recover from migrate folio" teaches the kernel to appropriately handle migration from poisoned source folios rather than simply panicing. - SeongJae Park's series "Docs/damon: minor fixups and improvements" does those things. - In the series "mm/zsmalloc: change back to per-size_class lock" Chengming Zhou improves zsmalloc's scalability and memory utilization. - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare refcount increments. So these paes can first be moved aside if they reside in the movable zone or a CMA block. - Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps for much faster reading of vma information. The series is "query VMAs from /proc/<pid>/maps". - In the series "mm: introduce per-order mTHP split counters" Lance Yang improves the kernel's presentation of developer information related to multisize THP splitting. - Michael Ellerman has developed the series "Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)". This permits userspace to use all available huge page sizes. - In the series "revert unconditional slab and page allocator fault injection calls" Vlastimil Babka removes a performance-affecting and not very useful feature from slab fault injection. * tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits) mm/mglru: fix ineffective protection calculation mm/zswap: fix a white space issue mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio mm/hugetlb: fix possible recursive locking detected warning mm/gup: clear the LRU flag of a page before adding to LRU batch mm/numa_balancing: teach mpol_to_str about the balancing mode mm: memcg1: convert charge move flags to unsigned long long alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting lib: reuse page_ext_data() to obtain codetag_ref lib: add missing newline character in the warning message mm/mglru: fix overshooting shrinker memory mm/mglru: fix div-by-zero in vmpressure_calc_level() mm/kmemleak: replace strncpy() with strscpy() mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB mm: ignore data-race in __swap_writepage hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr mm: shmem: rename mTHP shmem counters mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async() mm/migrate: putback split folios when numa hint migration fails ...
2024-07-12fs/procfs: add build ID fetching to PROCMAP_QUERY APIAndrii Nakryiko
The need to get ELF build ID reliably is an important aspect when dealing with profiling and stack trace symbolization, and /proc/<pid>/maps textual representation doesn't help with this. To get backing file's ELF build ID, application has to first resolve VMA, then use it's start/end address range to follow a special /proc/<pid>/map_files/<start>-<end> symlink to open the ELF file (this is necessary because backing file might have been removed from the disk or was already replaced with another binary in the same file path. Such approach, beyond just adding complexity of having to do a bunch of extra work, has extra security implications. Because application opens underlying ELF file and needs read access to its entire contents (as far as kernel is concerned), kernel puts additional capable() checks on following /proc/<pid>/map_files/<start>-<end> symlink. And that makes sense in general. But in the case of build ID, profiler/symbolizer doesn't need the contents of ELF file, per se. It's only build ID that is of interest, and ELF build ID itself doesn't provide any sensitive information. So this patch adds a way to request backing file's ELF build ID along the rest of VMA information in the same API. User has control over whether this piece of information is requested or not by either setting build_id_size field to zero or non-zero maximum buffer size they provided through build_id_addr field (which encodes user pointer as __u64 field). This is a completely optional piece of information, and so has no performance implications for user cases that don't care about build ID, while improving performance and simplifying the setup for those application that do need it. Kernel already implements build ID fetching, which is used from BPF subsystem. We are reusing this code here, but plan a follow up changes to make it work better under more relaxed assumption (compared to what existing code assumes) of being called from user process context, in which page faults are allowed. BPF-specific implementation currently bails out if necessary part of ELF file is not paged in, all due to extra BPF-specific restrictions (like the need to fetch build ID in restrictive contexts such as NMI handler). [andrii@kernel.org: fix integer to pointer cast warning in do_procmap_query()] Link: https://lkml.kernel.org/r/20240701174805.1897344-1-andrii@kernel.org Link: https://lkml.kernel.org/r/20240627170900.1672542-4-andrii@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12fs/procfs: implement efficient VMA querying API for /proc/<pid>/mapsAndrii Nakryiko
/proc/<pid>/maps file is extremely useful in practice for various tasks involving figuring out process memory layout, what files are backing any given memory range, etc. One important class of applications that absolutely rely on this are profilers/stack symbolizers (perf tool being one of them). Patterns of use differ, but they generally would fall into two categories. In on-demand pattern, a profiler/symbolizer would normally capture stack trace containing absolute memory addresses of some functions, and would then use /proc/<pid>/maps file to find corresponding backing ELF files (normally, only executable VMAs are of interest), file offsets within them, and then continue from there to get yet more information (ELF symbols, DWARF information) to get human-readable symbolic information. This pattern is used by Meta's fleet-wide profiler, as one example. In preprocessing pattern, application doesn't know the set of addresses of interest, so it has to fetch all relevant VMAs (again, probably only executable ones), store or cache them, then proceed with profiling and stack trace capture. Once done, it would do symbolization based on stored VMA information. This can happen at much later point in time. This patterns is used by perf tool, as an example. In either case, there are both performance and correctness requirement involved. This address to VMA information translation has to be done as efficiently as possible, but also not miss any VMA (especially in the case of loading/unloading shared libraries). In practice, correctness can't be guaranteed (due to process dying before VMA data can be captured, or shared library being unloaded, etc), but any effort to maximize the chance of finding the VMA is appreciated. Unfortunately, for all the /proc/<pid>/maps file universality and usefulness, it doesn't fit the above use cases 100%. First, it's main purpose is to emit all VMAs sequentially, but in practice captured addresses would fall only into a smaller subset of all process' VMAs, mainly containing executable text. Yet, library would need to parse most or all of the contents to find needed VMAs, as there is no way to skip VMAs that are of no use. Efficient library can do the linear pass and it is still relatively efficient, but it's definitely an overhead that can be avoided, if there was a way to do more targeted querying of the relevant VMA information. Second, it's a text based interface, which makes its programmatic use from applications and libraries more cumbersome and inefficient due to the need to handle text parsing to get necessary pieces of information. The overhead is actually payed both by kernel, formatting originally binary VMA data into text, and then by user space application, parsing it back into binary data for further use. For the on-demand pattern of usage, described above, another problem when writing generic stack trace symbolization library is an unfortunate performance-vs-correctness tradeoff that needs to be made. Library has to make a decision to either cache parsed contents of /proc/<pid>/maps (after initial processing) to service future requests (if application requests to symbolize another set of addresses (for the same process), captured at some later time, which is typical for periodic/continuous profiling cases) to avoid higher costs of re-parsing this file. Or it has to choose to cache the contents in memory to speed up future requests. In the former case, more memory is used for the cache and there is a risk of getting stale data if application loads or unloads shared libraries, or otherwise changed its set of VMAs somehow, e.g., through additional mmap() calls. In the latter case, it's the performance hit that comes from re-opening the file and re-parsing its contents all over again. This patch aims to solve this problem by providing a new API built on top of /proc/<pid>/maps. It's meant to address both non-selectiveness and text nature of /proc/<pid>/maps, by giving user more control of what sort of VMA(s) needs to be queried, and being binary-based interface eliminates the overhead of text formatting (on kernel side) and parsing (on user space side). It's also designed to be extensible and forward/backward compatible by including required struct size field, which user has to provide. We use established copy_struct_from_user() approach to handle extensibility. User has a choice to pick either getting VMA that covers provided address or -ENOENT if none is found (exact, least surprising, case). Or, with an extra query flag (PROCMAP_QUERY_COVERING_OR_NEXT_VMA), they can get either VMA that covers the address (if there is one), or the closest next VMA (i.e., VMA with the smallest vm_start > addr). The latter allows more efficient use, but, given it could be a surprising behavior, requires an explicit opt-in. There is another query flag that is useful for some use cases. PROCMAP_QUERY_FILE_BACKED_VMA instructs this API to only return file-backed VMAs. Combining this with PROCMAP_QUERY_COVERING_OR_NEXT_VMA makes it possible to efficiently iterate only file-backed VMAs of the process, which is what profilers/symbolizers are normally interested in. All the above querying flags can be combined with (also optional) set of desired VMA permissions flags. This allows to, for example, iterate only an executable subset of VMAs, which is what preprocessing pattern, used by perf tool, would benefit from, as the assumption is that captured stack traces would have addresses of executable code. This saves time by skipping non-executable VMAs altogether efficienty. All these querying flags (modifiers) are orthogonal and can be combined in a semantically meaningful and natural way. Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes sense given it's querying the same set of VMA data. It's also benefitial because permission checks for /proc/<pid>/maps is performed at open time once, and the actual data read of text contents of /proc/<pid>/maps is done without further permission checks. We piggyback on this pattern with ioctl()-based API as well, as that's a desired property. Both for performance reasons, but also for security and flexibility reasons. Allowing application to open an FD for /proc/self/maps without any extra capabilities, and then passing it to some sort of profiling agent through Unix-domain socket, would allow such profiling agent to not require some of the capabilities that are otherwise expected when opening /proc/<pid>/maps file for *another* process. This is a desirable property for some more restricted setups. This new ioctl-based implementation doesn't interfere with seq_file-based implementation of /proc/<pid>/maps textual interface, and so could be used together or independently without paying any price for that. Note also, that fetching VMA name (e.g., backing file path, or special hard-coded or user-provided names) is optional just like build ID. If user sets vma_name_size to zero, kernel code won't attempt to retrieve it, saving resources. Earlier versions of this patch set were adding per-VMA locking, which is why we have a code structure that is ready for abstracting mmap_lock vs vm_lock differences (query_vma_setup(), query_vma_teardown(), and query_vma_find_by_addr()), but given anon_vma_name() is not yet compatible with per-VMA locking, initial implementation sticks to using only mmap_lock for now. It will be easy to add back per-VMA locking once all the pieces are ready later on. Which is why we keep existing code structure with setup/teardown/query helper functions. [andrii@kernel.org: improve PROCMAP_QUERY's compat mode handling] Link: https://lkml.kernel.org/r/20240701174805.1897344-2-andrii@kernel.org Link: https://lkml.kernel.org/r/20240627170900.1672542-3-andrii@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-20fs: Initial atomic write supportPrasad Singamsetty
An atomic write is a write issued with torn-write protection, meaning that for a power failure or any other hardware failure, all or none of the data from the write will be stored, but never a mix of old and new data. Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the write is to be issued with torn-write prevention, according to special alignment and length rules. For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for iocb->ki_flags field to indicate the same. A call to statx will give the relevant atomic write info for a file: - atomic_write_unit_min - atomic_write_unit_max - atomic_write_segments_max Both min and max values must be a power-of-2. Applications can avail of atomic write feature by ensuring that the total length of a write is a power-of-2 in size and also sized between atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications must ensure that the write is at a naturally-aligned offset in the file wrt the total write length. The value in atomic_write_segments_max indicates the upper limit for IOV_ITER iovcnt. Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the flag set will have RWF_ATOMIC rejected and not just ignored. Add a type argument to kiocb_set_rw_flags() to allows reads which have RWF_ATOMIC set to be rejected. Helper function generic_atomic_write_valid() can be used by FSes to verify compliant writes. There we check for iov_iter type is for ubuf, which implies iovcnt==1 for pwritev2(), which is an initial restriction for atomic_write_segments_max. Initially the only user will be bdev file operations write handler. We will rely on the block BIO submission path to ensure write sizes are compliant for the bdev, so we don't need to check atomic writes sizes yet. Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com> jpg: merge into single patch and much rewrite Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-11Merge tag 'vfs-6.9.uuid' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs uuid updates from Christian Brauner: "This adds two new ioctl()s for getting the filesystem uuid and retrieving the sysfs path based on the path of a mounted filesystem. Getting the filesystem uuid has been implemented in filesystem specific code for a while it's now lifted as a generic ioctl" * tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: add support for FS_IOC_GETFSSYSFSPATH fs: add FS_IOC_GETFSSYSFSPATH fat: Hook up sb->s_uuid fs: FS_IOC_GETUUID ovl: convert to super_set_uuid() fs: super_set_uuid()
2024-02-12fs: add FS_IOC_GETFSSYSFSPATHKent Overstreet
Add a new ioctl for getting the sysfs name of a filesystem - the path under /sys/fs. This is going to let us standardize exporting data from sysfs across filesystems, e.g. time stats. The returned path will always be of the form "$FSTYP/$SYSFS_IDENTIFIER", where the sysfs identifier may be a UUID (for bcachefs) or a device name (xfs). Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <dchinner@redhat.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240207025624.1019754-6-kent.overstreet@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-08fs: FS_IOC_GETUUIDKent Overstreet
Add a new generic ioctls for querying the filesystem UUID. These are lifted versions of the ext4 ioctls, with one change: we're not using a flexible array member, because UUIDs will never be more than 16 bytes. This patch adds a generic implementation of FS_IOC_GETFSUUID, which reads from super_block->s_uuid. We're not lifting SETFSUUID from ext4 - that can be done on offline filesystems by the people who need it, trying to do it online is just asking for too much trouble. Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <dchinner@redhat.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240207025624.1019754-4-kent.overstreet@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-22vfs: add RWF_NOAPPEND flag for pwritev2Rich Felker
The pwrite function, originally defined by POSIX (thus the "p"), is defined to ignore O_APPEND and write at the offset passed as its argument. However, historically Linux honored O_APPEND if set and ignored the offset. This cannot be changed due to stability policy, but is documented in the man page as a bug. Now that there's a pwritev2 syscall providing a superset of the pwrite functionality that has a flags argument, the conforming behavior can be offered to userspace via a new flag. Since pwritev2 checks flag validity (in kiocb_set_rw_flags) and reports unknown ones with EOPNOTSUPP, callers will not get wrong behavior on old kernels that don't support the new flag; the error is reported and the caller can decide how to handle it. Signed-off-by: Rich Felker <dalias@libc.org> Link: https://lore.kernel.org/r/20200831153207.GO3265@brightrain.aerifal.cx Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-10fs/proc/task_mmu: report SOFT_DIRTY bits through the PAGEMAP_SCAN ioctlAndrei Vagin
The PAGEMAP_SCAN ioctl returns information regarding page table entries. It is more efficient compared to reading pagemap files. CRIU can start to utilize this ioctl, but it needs info about soft-dirty bits to track memory changes. We are aware of a new method for tracking memory changes implemented in the PAGEMAP_SCAN ioctl. For CRIU, the primary advantage of this method is its usability by unprivileged users. However, it is not feasible to transparently replace the soft-dirty tracker with the new one. The main problem here is userfault descriptors that have to be preserved between pre-dump iterations. It means criu continues supporting the soft-dirty method to avoid breakage for current users. The new method will be implemented as a separate feature. [avagin@google.com: update tools/include/uapi/linux/fs.h] Link: https://lkml.kernel.org/r/20231107164139.576046-1-avagin@google.com Link: https://lkml.kernel.org/r/20231106220959.296568-1-avagin@google.com Signed-off-by: Andrei Vagin <avagin@google.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEsMuhammad Usama Anjum
The PAGEMAP_SCAN IOCTL on the pagemap file can be used to get or optionally clear the info about page table entries. The following operations are supported in this IOCTL: - Scan the address range and get the memory ranges matching the provided criteria. This is performed when the output buffer is specified. - Write-protect the pages. The PM_SCAN_WP_MATCHING is used to write-protect the pages of interest. The PM_SCAN_CHECK_WPASYNC aborts the operation if non-Async Write Protected pages are found. The ``PM_SCAN_WP_MATCHING`` can be used with or without PM_SCAN_CHECK_WPASYNC. - Both of those operations can be combined into one atomic operation where we can get and write protect the pages as well. Following flags about pages are currently supported: - PAGE_IS_WPALLOWED - Page has async-write-protection enabled - PAGE_IS_WRITTEN - Page has been written to from the time it was write protected - PAGE_IS_FILE - Page is file backed - PAGE_IS_PRESENT - Page is present in the memory - PAGE_IS_SWAPPED - Page is in swapped - PAGE_IS_PFNZERO - Page has zero PFN - PAGE_IS_HUGE - Page is THP or Hugetlb backed This IOCTL can be extended to get information about more PTE bits. The entire address range passed by user [start, end) is scanned until either the user provided buffer is full or max_pages have been found. [akpm@linux-foundation.org: update it for "mm: hugetlb: add huge page size param to set_huge_pte_at()"] [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n warning] [arnd@arndb.de: hide unused pagemap_scan_backout_range() function] Link: https://lkml.kernel.org/r/20230927060257.2975412-1-arnd@kernel.org [sfr@canb.auug.org.au: fix "fs/proc/task_mmu: hide unused pagemap_scan_backout_range() function"] Link: https://lkml.kernel.org/r/20230928092223.0625c6bf@canb.auug.org.au Link: https://lkml.kernel.org/r/20230821141518.870589-3-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Miroslaw <emmir@google.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Paul Gofman <pgofman@codeweavers.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yun Zhou <yun.zhou@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-28treewide: uapi: Replace zero-length arrays with flexible-array membersGustavo A. R. Silva
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. This code was transformed with the help of Coccinelle: (linux-5.19-rc2$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file script.cocci --include-headers --dir . > output.patch) @@ identifier S, member, array; type T1, T2; @@ struct S { ... T1 member; T2 array[ - 0 ]; }; -fstrict-flex-arrays=3 is coming and we need to land these changes to prevent issues like these in the short future: ../fs/minix/dir.c:337:3: warning: 'strcpy' will always overflow; destination buffer has size 0, but the source string has length 2 (including NUL byte) [-Wfortify-source] strcpy(de3->name, "."); ^ Since these are all [0] to [] changes, the risk to UAPI is nearly zero. If this breaks anything, we can use a union with a new member name. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays Link: https://github.com/KSPP/linux/issues/78 Build-tested-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/lkml/62b675ec.wKX6AOZ6cbE71vtF%25lkp@intel.com/ Acked-by: Dan Williams <dan.j.williams@intel.com> # For ndctl.h Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-08-02block: add ioctl to read the disk sequence numberMatteo Croce
Add a new BLKGETDISKSEQ ioctl which retrieves the disk sequence number from the genhd structure. # ./getdiskseq /dev/loop* /dev/loop0: 13 /dev/loop0p1: 13 /dev/loop0p2: 13 /dev/loop0p3: 13 /dev/loop1: 14 /dev/loop1p1: 14 /dev/loop1p2: 14 /dev/loop2: 5 /dev/loop3: 6 Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matteo Croce <mcroce@microsoft.com> Tested-by: Luca Boccassi <bluca@debian.org> Link: https://lore.kernel.org/r/20210712230530.29323-4-mcroce@linux.microsoft.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-10block: uapi: fix comment about block device ioctlDamien Le Moal
Fix the comment mentioning ioctl command range used for zoned block devices to reflect the range of commands actually implemented. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Link: https://lore.kernel.org/r/20210509234806.3000-1-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-28fs/ext4: Introduce DAX inode flagIra Weiny
Add a flag ([EXT4|FS]_DAX_FL) to preserve FS_XFLAG_DAX in the ext4 inode. Set the flag to be user visible and changeable. Set the flag to be inherited. Allow applications to change the flag at any time except if it conflicts with the set of mutually exclusive flags (Currently VERITY, ENCRYPT, JOURNAL_DATA). Furthermore, restrict setting any of the exclusive flags if DAX is set. While conceptually possible, we do not allow setting EXT4_DAX_FL while at the same time clearing exclusion flags (or vice versa) for 2 reasons: 1) The DAX flag does not take effect immediately which introduces quite a bit of complexity 2) There is no clear use case for being this flexible Finally, on regular files, flag the inode to not be cached to facilitate changing S_DAX on the next creation of the inode. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-9-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-09-21Merge tag 'f2fs-for-5.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we introduced casefolding support in f2fs, and fixed various bugs in individual features such as IO alignment, checkpoint=disable, quota, and swapfile. Enhancement: - support casefolding w/ enhancement in ext4 - support fiemap for directory - support FS_IO_GET|SET_FSLABEL Bug fix: - fix IO stuck during checkpoint=disable - avoid infinite GC loop - fix panic/overflow related to IO alignment feature - fix livelock in swap file - fix discard command leak - disallow dio for atomic_write" * tag 'f2fs-for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits) f2fs: add a condition to detect overflow in f2fs_ioc_gc_range() f2fs: fix to add missing F2FS_IO_ALIGNED() condition f2fs: fix to fallback to buffered IO in IO aligned mode f2fs: fix to handle error path correctly in f2fs_map_blocks f2fs: fix extent corrupotion during directIO in LFS mode f2fs: check all the data segments against all node ones f2fs: Add a small clarification to CONFIG_FS_F2FS_FS_SECURITY f2fs: fix inode rwsem regression f2fs: fix to avoid accessing uninitialized field of inode page in is_alive() f2fs: avoid infinite GC loop due to stale atomic files f2fs: Fix indefinite loop in f2fs_gc() f2fs: convert inline_data in prior to i_size_write f2fs: fix error path of f2fs_convert_inline_page() f2fs: add missing documents of reserve_root/resuid/resgid f2fs: fix flushing node pages when checkpoint is disabled f2fs: enhance f2fs_is_checkpoint_ready()'s readability f2fs: clean up __bio_alloc()'s parameter f2fs: fix wrong error injection path in inc_valid_block_count() f2fs: fix to writeout dirty inode during node flush f2fs: optimize case-insensitive lookups ...
2019-09-18Merge tag 'fsverity-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fs-verity support from Eric Biggers: "fs-verity is a filesystem feature that provides Merkle tree based hashing (similar to dm-verity) for individual readonly files, mainly for the purpose of efficient authenticity verification. This pull request includes: (a) The fs/verity/ support layer and documentation. (b) fs-verity support for ext4 and f2fs. Compared to the original fs-verity patchset from last year, the UAPI to enable fs-verity on a file has been greatly simplified. Lots of other things were cleaned up too. fs-verity is planned to be used by two different projects on Android; most of the userspace code is in place already. Another userspace tool ("fsverity-utils"), and xfstests, are also available. e2fsprogs and f2fs-tools already have fs-verity support. Other people have shown interest in using fs-verity too. I've tested this on ext4 and f2fs with xfstests, both the existing tests and the new fs-verity tests. This has also been in linux-next since July 30 with no reported issues except a couple minor ones I found myself and folded in fixes for. Ted and I will be co-maintaining fs-verity" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: f2fs: add fs-verity support ext4: update on-disk format documentation for fs-verity ext4: add fs-verity read support ext4: add basic fs-verity support fs-verity: support builtin file signatures fs-verity: add SHA-512 support fs-verity: implement FS_IOC_MEASURE_VERITY ioctl fs-verity: implement FS_IOC_ENABLE_VERITY ioctl fs-verity: add data verification hooks for ->readpages() fs-verity: add the hook for file ->setattr() fs-verity: add the hook for file ->open() fs-verity: add inode and superblock fields fs-verity: add Kconfig and the helper functions for hashing fs: uapi: define verity bit for FS_IOC_GETFLAGS fs-verity: add UAPI header fs-verity: add MAINTAINERS file entry fs-verity: add a documentation file
2019-08-23fs: Reserve flag for casefoldingDaniel Rosenberg
In preparation for including the casefold feature within f2fs, elevate the EXT4_CASEFOLD_FL flag to FS_CASEFOLD_FL. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-12fs, fscrypt: move uapi definitions to new header <linux/fscrypt.h>Eric Biggers
More fscrypt definitions are being added, and we shouldn't use a disproportionate amount of space in <linux/fs.h> for fscrypt stuff. So move the fscrypt definitions to a new header <linux/fscrypt.h>. For source compatibility with existing userspace programs, <linux/fs.h> still includes the new header. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-07-28fs: uapi: define verity bit for FS_IOC_GETFLAGSEric Biggers
Add FS_VERITY_FL to the flags for FS_IOC_GETFLAGS, so that applications can easily determine whether a file is a verity file at the same time as they're checking other file flags. This flag will be gettable only; FS_IOC_SETFLAGS won't allow setting it, since an ioctl must be used instead to provide more parameters. This flag matches the on-disk bit that was already allocated for ext4. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-05-14fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writebackAmir Goldstein
23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE writeback") claims that sync_file_range(2) syscall was "created for userspace to be able to issue background writeout and so waiting for in-flight IO is undesirable there" and changes the writeback (back) to WB_SYNC_NONE. This claim is only partially true. It is true for users that use the flag SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was the reason for changing to WB_SYNC_NONE writeback. However, that claim is not true for users that use that flag combination SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}. Those users explicitly requested to wait for in-flight IO as well as to writeback of dirty pages. Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use WB_SYNC_ALL writeback to perform the full range sync request. Link: http://lkml.kernel.org/r/20190409114922.30095-1-amir73il@gmail.com Link: http://lkml.kernel.org/r/20190419072938.31320-1-amir73il@gmail.com Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Jan Kara <jack@suse.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-06Merge tag 'fscrypt_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt Pull fscrypt updates from Ted Ts'o: "Add Adiantum support for fscrypt" * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: fscrypt: add Adiantum support
2019-01-06fscrypt: add Adiantum supportEric Biggers
Add support for the Adiantum encryption mode to fscrypt. Adiantum is a tweakable, length-preserving encryption mode with security provably reducible to that of XChaCha12 and AES-256, subject to a security bound. It's also a true wide-block mode, unlike XTS. See the paper "Adiantum: length-preserving encryption for entry-level processors" (https://eprint.iacr.org/2018/720.pdf) for more details. Also see commit 059c2a4d8e16 ("crypto: adiantum - add Adiantum support"). On sufficiently long messages, Adiantum's bottlenecks are XChaCha12 and the NH hash function. These algorithms are fast even on processors without dedicated crypto instructions. Adiantum makes it feasible to enable storage encryption on low-end mobile devices that lack AES instructions; currently such devices are unencrypted. On ARM Cortex-A7, on 4096-byte messages Adiantum encryption is about 4 times faster than AES-256-XTS encryption; decryption is about 5 times faster. In fscrypt, Adiantum is suitable for encrypting both file contents and names. With filenames, it fixes a known weakness: when two filenames in a directory share a common prefix of >= 16 bytes, with CTS-CBC their encrypted filenames share a common prefix too, leaking information. Adiantum does not have this problem. Since Adiantum also accepts long tweaks (IVs), it's also safe to use the master key directly for Adiantum encryption rather than deriving per-file keys, provided that the per-file nonce is included in the IVs and the master key isn't used for any other encryption mode. This configuration saves memory and improves performance. A new fscrypt policy flag is added to allow users to opt-in to this configuration. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-12-20vfs: Suppress MS_* flag defs within the kernel unless explicitly enabledDavid Howells
Only the mount namespace code that implements mount(2) should be using the MS_* flags. Suppress them inside the kernel unless uapi/linux/mount.h is included. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Howells <dhowells@redhat.com>
2018-09-04crypto: speck - remove SpeckJason A. Donenfeld
These are unused, undesired, and have never actually been used by anybody. The original authors of this code have changed their mind about its inclusion. While originally proposed for disk encryption on low-end devices, the idea was discarded [1] in favor of something else before that could really get going. Therefore, this patch removes Speck. [1] https://marc.info/?l=linux-crypto-vger&m=153359499015659 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Eric Biggers <ebiggers@google.com> Cc: stable@vger.kernel.org Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-06-05Merge tag 'fscrypt_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt Pull fscrypt updates from Ted Ts'o: "Add bunch of cleanups, and add support for the Speck128/256 algorithms. Yes, Speck is contrversial, but the intention is to use them only for the lowest end Android devices, where the alternative *really* is no encryption at all for data stored at rest" * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: fscrypt: log the crypto algorithm implementations fscrypt: add Speck128/256 support fscrypt: only derive the needed portion of the key fscrypt: separate key lookup from key derivation fscrypt: use a common logging function fscrypt: remove internal key size constants fscrypt: remove unnecessary check for non-logon key type fscrypt: make fscrypt_operations.max_namelen an integer fscrypt: drop empty name check from fname_decrypt() fscrypt: drop max_namelen check from fname_decrypt() fscrypt: don't special-case EOPNOTSUPP from fscrypt_get_encryption_info() fscrypt: don't clear flags on crypto transform fscrypt: remove stale comment from fscrypt_d_revalidate() fscrypt: remove error messages for skcipher_request_alloc() failure fscrypt: remove unnecessary NULL check when allocating skcipher fscrypt: clean up after fscrypt_prepare_lookup() conversions fs, fscrypt: only define ->s_cop when FS_ENCRYPTION is enabled fscrypt: use unbound workqueue for decryption
2018-05-20fscrypt: add Speck128/256 supportEric Biggers
fscrypt currently only supports AES encryption. However, many low-end mobile devices have older CPUs that don't have AES instructions, e.g. the ARMv8 Cryptography Extensions. Currently, user data on such devices is not encrypted at rest because AES is too slow, even when the NEON bit-sliced implementation of AES is used. Unfortunately, it is infeasible to encrypt these devices at all when AES is the only option. Therefore, this patch updates fscrypt to support the Speck block cipher, which was recently added to the crypto API. The C implementation of Speck is not especially fast, but Speck can be implemented very efficiently with general-purpose vector instructions, e.g. ARM NEON. For example, on an ARMv7 processor, we measured the NEON-accelerated Speck128/256-XTS at 69 MB/s for both encryption and decryption, while AES-256-XTS with the NEON bit-sliced implementation was only 22 MB/s encryption and 19 MB/s decryption. There are multiple variants of Speck. This patch only adds support for Speck128/256, which is the variant with a 128-bit block size and 256-bit key size -- the same as AES-256. This is believed to be the most secure variant of Speck, and it's only about 6% slower than Speck128/128. Speck64/128 would be at least 20% faster because it has 20% rounds, and it can be even faster on CPUs that can't efficiently do the 64-bit operations needed for Speck128. However, Speck64's 64-bit block size is not preferred security-wise. ARM NEON also supports the needed 64-bit operations even on 32-bit CPUs, resulting in Speck128 being fast enough for our targeted use cases so far. The chosen modes of operation are XTS for contents and CTS-CBC for filenames. These are the same modes of operation that fscrypt defaults to for AES. Note that as with the other fscrypt modes, Speck will not be used unless userspace chooses to use it. Nor are any of the existing modes (which are all AES-based) being removed, of course. We intentionally don't make CONFIG_FS_ENCRYPTION select CONFIG_CRYPTO_SPECK, so people will have to enable Speck support themselves if they need it. This is because we shouldn't bloat the FS_ENCRYPTION dependencies with every new cipher, especially ones that aren't recommended for most users. Moreover, CRYPTO_SPECK is just the generic implementation, which won't be fast enough for many users; in practice, they'll need to enable CRYPTO_SPECK_NEON to get acceptable performance. More details about our choice of Speck can be found in our patches that added Speck to the crypto API, and the follow-on discussion threads. We're planning a publication that explains the choice in more detail. But briefly, we can't use ChaCha20 as we previously proposed, since it would be insecure to use a stream cipher in this context, with potential IV reuse during writes on f2fs and/or on wear-leveling flash storage. We also evaluated many other lightweight and/or ARX-based block ciphers such as Chaskey-LTS, RC5, LEA, CHAM, Threefish, RC6, NOEKEON, SPARX, and XTEA. However, all had disadvantages vs. Speck, such as insufficient performance with NEON, much less published cryptanalysis, or an insufficient security level. Various design choices in Speck make it perform better with NEON than competing ciphers while still having a security margin similar to AES, and in the case of Speck128 also the same available security levels. Unfortunately, Speck does have some political baggage attached -- it's an NSA designed cipher, and was rejected from an ISO standard (though for context, as far as I know none of the above-mentioned alternatives are ISO standards either). Nevertheless, we believe it is a good solution to the problem from a technical perspective. Certain algorithms constructed from ChaCha or the ChaCha permutation, such as MEM (Masked Even-Mansour) or HPolyC, may also meet our performance requirements. However, these are new constructions that need more time to receive the cryptographic review and acceptance needed to be confident in their security. HPolyC hasn't been published yet, and we are concerned that MEM makes stronger assumptions about the underlying permutation than the ChaCha stream cipher does. In contrast, the XTS mode of operation is relatively well accepted, and Speck has over 70 cryptanalysis papers. Of course, these ChaCha-based algorithms can still be added later if they become ready. The best known attack on Speck128/256 is a differential cryptanalysis attack on 25 of 34 rounds with 2^253 time complexity and 2^125 chosen plaintexts, i.e. only marginally faster than brute force. There is no known attack on the full 34 rounds. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-05-16fs: copy BTRFS_IOC_[SG]ET_FSLABEL to vfsEric Sandeen
This retains 256 chars as the maximum size through the interface, which is the btrfs limit and AFAIK exceeds any other filesystem's maximum label size. This just copies the ioctl for now and leaves it in place for btrfs for the time being. A later patch will allow btrfs to use the new common ioctl definition, but it may be sent after this is merged. (Note, Reviewed-by's were originally given for the combined vfs+btrfs patch, some license taken here.) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-25fs: add RWF_APPENDJürg Billeter
This is the per-I/O equivalent of O_APPEND to support atomic append operations on any open file. If a file is opened with O_APPEND, pwrite() ignores the offset and always appends data to the end of the file. RWF_APPEND enables atomic append and pwrite() with offset on a single file descriptor. Signed-off-by: Jürg Billeter <j@bitron.ch> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-02License cleanup: add SPDX license identifier to uapi header files with no ↵Greg Kroah-Hartman
license Many user space API headers are missing licensing information, which makes it hard for compliance tools to determine the correct license. By default are files without license information under the default license of the kernel, which is GPLV2. Marking them GPLV2 would exclude them from being included in non GPLV2 code, which is obviously not intended. The user space API headers fall under the syscall exception which is in the kernels COPYING file: NOTE! This copyright does *not* cover user programs that use kernel services by normal system calls - this is merely considered normal use of the kernel, and does *not* fall under the heading of "derived work". otherwise syscall usage would not be possible. Update the files which contain no license information with an SPDX license identifier. The chosen identifier is 'GPL-2.0 WITH Linux-syscall-note' which is the officially assigned identifier for the Linux syscall exception. SPDX license identifiers are a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. See the previous patch in this series for the methodology of how this patch was researched. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-31annotate RWF_... flagsChristoph Hellwig
[AV: added missing annotations in syscalls.h/compat.h] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-09Merge tag 'fscrypt_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt Pull fscrypt updates from Ted Ts'o: "Add support for 128-bit AES and some cleanups to fscrypt" * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: fscrypt: make ->dummy_context() return bool fscrypt: add support for AES-128-CBC fscrypt: inline fscrypt_free_filename()
2017-06-23fscrypt: add support for AES-128-CBCDaniel Walter
fscrypt provides facilities to use different encryption algorithms which are selectable by userspace when setting the encryption policy. Currently, only AES-256-XTS for file contents and AES-256-CBC-CTS for file names are implemented. This is a clear case of kernel offers the mechanism and userspace selects a policy. Similar to what dm-crypt and ecryptfs have. This patch adds support for using AES-128-CBC for file contents and AES-128-CBC-CTS for file name encryption. To mitigate watermarking attacks, IVs are generated using the ESSIV algorithm. While AES-CBC is actually slightly less secure than AES-XTS from a security point of view, there is more widespread hardware support. Using AES-CBC gives us the acceptable performance while still providing a moderate level of security for persistent storage. Especially low-powered embedded devices with crypto accelerators such as CAAM or CESA often only support AES-CBC. Since using AES-CBC over AES-XTS is basically thought of a last resort, we use AES-128-CBC over AES-256-CBC since it has less encryption rounds and yields noticeable better performance starting from a file size of just a few kB. Signed-off-by: Daniel Walter <dwalter@sigma-star.at> [david@sigma-star.at: addressed review comments] Signed-off-by: David Gstir <david@sigma-star.at> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-20fs: Introduce RWF_NOWAIT and FMODE_AIO_NOWAITGoldwyn Rodrigues
RWF_NOWAIT informs kernel to bail out if an AIO request will block for reasons such as file allocations, or a writeback triggered, or would block while allocating requests while performing direct I/O. RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags. FMODE_AIO_NOWAIT is a flag which identifies the file opened is capable of returning -EAGAIN if the AIO call will block. This must be set by supporting filesystems in the ->open() call. Filesystems xfs, btrfs and ext4 would be supported in the following patches. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20fs: Separate out kiocb flags setup based on RWF_* flagsGoldwyn Rodrigues
Also added RWF_SUPPORTED to encompass all flags. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-04-30fscrypt: Remove __packed from fscrypt_policyJoe Richey
This commit removes __packed from fscrypt_policy as it does not contain any implicit padding and does not refer to an on-disk structure. Even though this is a change to a UAPI file, no users will be broken as the structure doesn't change. Signed-off-by: Joe Richey <joerichey@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30fscrypt: Move key structure and constants to uapiJoe Richey
This commit exposes the necessary constants and structures for a userspace program to pass filesystem encryption keys into the keyring. The fscrypt_key structure was already part of the kernel ABI, this change just makes it so programs no longer have to redeclare these structures (like e4crypt in e2fsprogs currently does). Note that we do not expose the other FS_*_KEY_SIZE constants as they are not necessary. Only XTS is supported for contents_encryption_mode, so currently FS_MAX_KEY_SIZE bytes of key material must always be passed to the kernel. This commit also removes __packed from fscrypt_key as it does not contain any implicit padding and does not refer to an on-disk structure. Signed-off-by: Joe Richey <joerichey@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-02fs: Better permission checking for submountsEric W. Biederman
To support unprivileged users mounting filesystems two permission checks have to be performed: a test to see if the user allowed to create a mount in the mount namespace, and a test to see if the user is allowed to access the specified filesystem. The automount case is special in that mounting the original filesystem grants permission to mount the sub-filesystems, to any user who happens to stumble across the their mountpoint and satisfies the ordinary filesystem permission checks. Attempting to handle the automount case by using override_creds almost works. It preserves the idea that permission to mount the original filesystem is permission to mount the sub-filesystem. Unfortunately using override_creds messes up the filesystems ordinary permission checks. Solve this by being explicit that a mount is a submount by introducing vfs_submount, and using it where appropriate. vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let sget and friends know that a mount is a submount so they can take appropriate action. sget and sget_userns are modified to not perform any permission checks on submounts. follow_automount is modified to stop using override_creds as that has proven problemantic. do_mount is modified to always remove the new MS_SUBMOUNT flag so that we know userspace will never by able to specify it. autofs4 is modified to stop using current_real_cred that was put in there to handle the previous version of submount permission checking. cifs is modified to pass the mountpoint all of the way down to vfs_submount. debugfs is modified to pass the mountpoint all of the way down to trace_automount by adding a new parameter. To make this change easier a new typedef debugfs_automount_t is introduced to capture the type of the debugfs automount function. Cc: stable@vger.kernel.org Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid") Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds") Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-12-14Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "This merge request includes the dax-4.0-iomap-pmd branch which is needed for both ext4 and xfs dax changes to use iomap for DAX. It also includes the fscrypt branch which is needed for ubifs encryption work as well as ext4 encryption and fscrypt cleanups. Lots of cleanups and bug fixes, especially making sure ext4 is robust against maliciously corrupted file systems --- especially maliciously corrupted xattr blocks and a maliciously corrupted superblock. Also fix ext4 support for 64k block sizes so it works well on ppcle. Fixed mbcache so we don't miss some common xattr blocks that can be merged" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits) dax: Fix sleep in atomic contex in grab_mapping_entry() fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL fscrypt: Delay bounce page pool allocation until needed fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page() fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page() fscrypt: Never allocate fscrypt_ctx on in-place encryption fscrypt: Use correct index in decrypt path. fscrypt: move the policy flags and encryption mode definitions to uapi header fscrypt: move non-public structures and constants to fscrypt_private.h fscrypt: unexport fscrypt_initialize() fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info() fscrypto: move ioctl processing more fully into common code fscrypto: remove unneeded Kconfig dependencies MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches ext4: do not perform data journaling when data is encrypted ext4: return -ENOMEM instead of success ext4: reject inodes with negative size ext4: remove another test in ext4_alloc_file_blocks() Documentation: fix description of ext4's block_validity mount option ext4: fix checks for data=ordered and journal_async_commit options ...
2016-12-11fscrypt: move the policy flags and encryption mode definitions to uapi headerTheodore Ts'o
These constants are part of the UAPI, so they belong in include/uapi/linux/fs.h instead of include/linux/fscrypto.h Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Biggers <ebiggers@google.com>
2016-10-18blk-zoned: implement ioctlsShaun Tancheff
Adds the new BLKREPORTZONE and BLKRESETZONE ioctls for respectively obtaining the zone configuration of a zoned block device and resetting the write pointer of sequential zones of a zoned block device. The BLKREPORTZONE ioctl maps directly to a single call of the function blkdev_report_zones. The zone information result is passed as an array of struct blk_zone identical to the structure used internally for processing the REQ_OP_ZONE_REPORT operation. The BLKRESETZONE ioctl maps to a single call of the blkdev_reset_zones function. Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-13Merge tag 'xfs-reflink-for-linus-4.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs < XFS has gained super CoW powers! > ---------------------------------- \ ^__^ \ (oo)\_______ (__)\ )\/\ ||----w | || || Pull XFS support for shared data extents from Dave Chinner: "This is the second part of the XFS updates for this merge cycle. This pullreq contains the new shared data extents feature for XFS. Given the complexity and size of this change I am expecting - like the addition of reverse mapping last cycle - that there will be some follow-up bug fixes and cleanups around the -rc3 stage for issues that I'm sure will show up once the code hits a wider userbase. What it is: At the most basic level we are simply adding shared data extents to XFS - i.e. a single extent on disk can now have multiple owners. To do this we have to add new on-disk features to both track the shared extents and the number of times they've been shared. This is done by the new "refcount" btree that sits in every allocation group. When we share or unshare an extent, this tree gets updated. Along with this new tree, the reverse mapping tree needs to be updated to track each owner or a shared extent. This also needs to be updated ever share/unshare operation. These interactions at extent allocation and freeing time have complex ordering and recovery constraints, so there's a significant amount of new intent-based transaction code to ensure that operations are performed atomically from both the runtime and integrity/crash recovery perspectives. We also need to break sharing when writes hit a shared extent - this is where the new copy-on-write implementation comes in. We allocate new storage and copy the original data along with the overwrite data into the new location. We only do this for data as we don't share metadata at all - each inode has it's own metadata that tracks the shared data extents, the extents undergoing CoW and it's own private extents. Of course, being XFS, nothing is simple - we use delayed allocation for CoW similar to how we use it for normal writes. ENOSPC is a significant issue here - we build on the reservation code added in 4.8-rc1 with the reverse mapping feature to ensure we don't get spurious ENOSPC issues part way through a CoW operation. These mechanisms also help minimise fragmentation due to repeated CoW operations. To further reduce fragmentation overhead, we've also introduced a CoW extent size hint, which indicates how large a region we should allocate when we execute a CoW operation. With all this functionality in place, we can hook up .copy_file_range, .clone_file_range and .dedupe_file_range and we gain all the capabilities of reflink and other vfs provided functionality that enable manipulation to shared extents. We also added a fallocate mode that explicitly unshares a range of a file, which we implemented as an explicit CoW of all the shared extents in a file. As such, it's a huge chunk of new functionality with new on-disk format features and internal infrastructure. It warns at mount time as an experimental feature and that it may eat data (as we do with all new on-disk features until they stabilise). We have not released userspace suport for it yet - userspace support currently requires download from Darrick's xfsprogs repo and build from source, so the access to this feature is really developer/tester only at this point. Initial userspace support will be released at the same time the kernel with this code in it is released. The new code causes 5-6 new failures with xfstests - these aren't serious functional failures but things the output of tests changing slightly due to perturbations in layouts, space usage, etc. OTOH, we've added 150+ new tests to xfstests that specifically exercise this new functionality so it's got far better test coverage than any functionality we've previously added to XFS. Darrick has done a pretty amazing job getting us to this stage, and special mention also needs to go to Christoph (review, testing, improvements and bug fixes) and Brian (caught several intricate bugs during review) for the effort they've also put in. Summary: - unshare range (FALLOC_FL_UNSHARE) support for fallocate - copy-on-write extent size hints (FS_XFLAG_COWEXTSIZE) for fsxattr interface - shared extent support for XFS - copy-on-write support for shared extents - copy_file_range support - clone_file_range support (implements reflink) - dedupe_file_range support - defrag support for reverse mapping enabled filesystems" * tag 'xfs-reflink-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (71 commits) xfs: convert COW blocks to real blocks before unwritten extent conversion xfs: rework refcount cow recovery error handling xfs: clear reflink flag if setting realtime flag xfs: fix error initialization xfs: fix label inaccuracies xfs: remove isize check from unshare operation xfs: reduce stack usage of _reflink_clear_inode_flag xfs: check inode reflink flag before calling reflink functions xfs: implement swapext for rmap filesystems xfs: refactor swapext code xfs: various swapext cleanups xfs: recognize the reflink feature bit xfs: simulate per-AG reservations being critically low xfs: don't mix reflink and DAX mode for now xfs: check for invalid inode reflink flags xfs: set a default CoW extent size of 32 blocks xfs: convert unwritten status of reverse mappings for shared files xfs: use interval query for rmap alloc operations on shared files xfs: add shared rmap map/unmap/convert log item types xfs: increase log reservations for reflink ...
2016-10-03vfs: support FS_XFLAG_COWEXTSIZE and get/set of CoW extent size hintDarrick J. Wong
Introduce XFLAGs for the new XFS CoW extent size hint, and actually plumb the CoW extent size hint into the fsxattr structure. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-09-16locks: fix file locking on overlayfsMiklos Szeredi
This patch allows flock, posix locks, ofd locks and leases to work correctly on overlayfs. Instead of using the underlying inode for storing lock context use the overlay inode. This allows locks to be persistent across copy-up. This is done by introducing locks_inode() helper and using it instead of file_inode() to get the inode in locking code. For non-overlayfs the two are equivalent, except for an extra pointer dereference in locks_inode(). Since lock operations are in "struct file_operations" we must also make sure not to call underlying filesystem's lock operations. Introcude a super block flag MS_NOREMOTELOCK to this effect. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Acked-by: Jeff Layton <jlayton@poochiereds.net> Cc: "J. Bruce Fields" <bfields@fieldses.org>
2016-05-23Merge tag 'libnvdimm-for-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this update was stabilized before the merge window and appeared in -next. The "device dax" implementation was revised this week in response to review feedback, and to address failures detected by the recently expanded ndctl unit test suite. Not included in this pull request are two dax topic branches (dax error handling, and dax radix-tree locking). These topics were deferred to get a few more days of -next integration testing, and to coordinate a branch baseline with Ted and the ext4 tree. Vishal and Ross will send the error handling and locking topics respectively in the next few days. This branch has received a positive build result from the kbuild robot across 226 configs. Summary: - Device DAX for persistent memory: Device DAX is the device-centric analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped without need of an intervening file system. Device DAX is strict, precise and predictable. Specifically this interface: a) Guarantees fault granularity with respect to a given page size (pte, pmd, or pud) set at configuration time. b) Enforces deterministic behavior by being strict about what fault scenarios are supported. Persistent memory is the first target, but the mechanism is also targeted for exclusive allocations of performance/feature differentiated memory ranges. - Support for the HPE DSM (device specific method) command formats. This enables management of these first generation devices until a unified DSM specification materializes. - Further ACPI 6.1 compliance with support for the common dimm identifier format. - Various fixes and cleanups across the subsystem" * tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits) libnvdimm, dax: fix deletion libnvdimm, dax: fix alignment validation libnvdimm, dax: autodetect support libnvdimm: release ida resources Revert "block: enable dax for raw block devices" /dev/dax, core: file operations and dax-mmap /dev/dax, pmem: direct access to persistent memory libnvdimm: stop requiring a driver ->remove() method libnvdimm, dax: record the specified alignment of a dax-device instance libnvdimm, dax: reserve space to store labels for device-dax libnvdimm, dax: introduce device-dax infrastructure nfit: add sysfs dimm 'family' and 'dsm_mask' attributes tools/testing/nvdimm: ND_CMD_CALL support nfit: disable vendor specific commands nfit: export subsystem ids as attributes nfit: fix format interface code byte order per ACPI6.1 nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism nfit, libnvdimm: clarify "commands" vs "_DSMs" libnvdimm: increase max envelope size for ioctl acpi/nfit: Add sysfs "id" for NVDIMM ID ...
2016-05-20Revert "block: enable dax for raw block devices"Dan Williams
This reverts commit 5a023cdba50c5f5f2bc351783b3131699deb3937. The functionality is superseded by the new "Device DAX" facility. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jan Kara <jack@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-01fs: add RWF_DSYNC aand RWF_SYNCChristoph Hellwig
This is the per-I/O equivalent of O_DSYNC and O_SYNC, and very useful for all kinds of file servers and storage targets. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-21Merge tag 'for-f2fs-4.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "New Features: - uplift filesystem encryption into fs/crypto/ - give sysfs entries to control memroy consumption Enhancements: - aio performance by preallocating blocks in ->write_iter - use writepages lock for only WB_SYNC_ALL - avoid redundant inline_data conversion - enhance forground GC - use wait_for_stable_page as possible - speed up SEEK_DATA and fiiemap Bug Fixes: - corner case in terms of -ENOSPC for inline_data - hung task caused by long latency in shrinker - corruption between atomic write and f2fs_trace_pid - avoid garbage lengths in dentries - revoke atomicly written pages if an error occurs In addition, there are various minor bug fixes and clean-ups" * tag 'for-f2fs-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (81 commits) f2fs: submit node page write bios when really required f2fs: add missing argument to f2fs_setxattr stub f2fs: fix to avoid unneeded unlock_new_inode f2fs: clean up opened code with f2fs_update_dentry f2fs: declare static functions f2fs: use cryptoapi crc32 functions f2fs: modify the readahead method in ra_node_page() f2fs crypto: sync ext4_lookup and ext4_file_open fs crypto: move per-file encryption from f2fs tree to fs/crypto f2fs: mutex can't be used by down_write_nest_lock() f2fs: recovery missing dot dentries in root directory f2fs: fix to avoid deadlock when merging inline data f2fs: introduce f2fs_flush_merged_bios for cleanup f2fs: introduce f2fs_update_data_blkaddr for cleanup f2fs crypto: fix incorrect positioning for GCing encrypted data page f2fs: fix incorrect upper bound when iterating inode mapping tree f2fs: avoid hungtask problem caused by losing wake_up f2fs: trace old block address for CoWed page f2fs: try to flush inode after merging inline data f2fs: show more info about superblock recovery ...
2016-03-18Merge branches 'work.lookups', 'work.misc' and 'work.preadv2' into for-nextAl Viro
2016-03-17fs crypto: move per-file encryption from f2fs tree to fs/cryptoJaegeuk Kim
This patch adds the renamed functions moved from the f2fs crypto files. 1. definitions for per-file encryption used by ext4 and f2fs. 2. crypto.c for encrypt/decrypt functions a. IO preparation: - fscrypt_get_ctx / fscrypt_release_ctx b. before IOs: - fscrypt_encrypt_page - fscrypt_decrypt_page - fscrypt_zeroout_range c. after IOs: - fscrypt_decrypt_bio_pages - fscrypt_pullback_bio_page - fscrypt_restore_control_page 3. policy.c supporting context management. a. For ioctls: - fscrypt_process_policy - fscrypt_get_policy b. For context permission - fscrypt_has_permitted_context - fscrypt_inherit_context 4. keyinfo.c to handle permissions - fscrypt_get_encryption_info - fscrypt_free_encryption_info 5. fname.c to support filename encryption a. general wrapper functions - fscrypt_fname_disk_to_usr - fscrypt_fname_usr_to_disk - fscrypt_setup_filename - fscrypt_free_filename b. specific filename handling functions - fscrypt_fname_alloc_buffer - fscrypt_fname_free_buffer 6. Makefile and Kconfig Cc: Al Viro <viro@ftp.linux.org.uk> Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Uday Savagaonkar <savagaon@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-03-04vfs: add the RWF_HIPRI flag for preadv2/pwritev2Christoph Hellwig
This adds a flag that tells the file system that this is a high priority request for which it's worth to poll the hardware. The flag is purely advisory and can be ignored if not supported. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stephen Bates <stephen.bates@pmcs.com> Tested-by: Stephen Bates <stephen.bates@pmcs.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>