Age | Commit message (Collapse) | Author |
|
Patch series "More swap folio conversions".
These all seem like fairly straightforward conversions to me. A lot of
compound_head() calls get removed. And page_swap_info(), which is nice.
This patch (of 13):
Move the folio->page conversion into the callers that actually want that.
Most of the callers are happier with the folio anyway. If the
page_allocated boolean is set, the folio allocated is of order-0, so it is
safe to pass the page directly to swap_readpage().
Link: https://lkml.kernel.org/r/20231213215842.671461-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231213215842.671461-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
First of all, we need to rename acomp_ctx->dstmem field to buffer, since
we are now using for purposes other than compression.
Then we change per-cpu mutex and buffer to per-acomp_ctx, since them
belong to the acomp_ctx and are necessary parts when used in the
compress/decompress contexts.
So we can remove the old per-cpu mutex and dstmem.
Link: https://lkml.kernel.org/r/20231213-zswap-dstmem-v5-5-9382162bbf05@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Chris Li <chrisl@kernel.org> (Google)
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All callers have now been converted to folio_add_new_anon_rmap() and
folio_add_lru_vma() so we can remove the wrapper.
Link: https://lkml.kernel.org/r/20231211162214.2146080-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Finish two folio conversions".
Most callers of page_add_new_anon_rmap() and
lru_cache_add_inactive_or_unevictable() have been converted to their folio
equivalents, but there are still a few stragglers. There's a bit of
preparatory work in ksm and unuse_pte(), but after that it's pretty
mechanical.
This patch (of 9):
Accept a folio as an argument and return a folio result. Removes a call
to compound_head() in do_swap_page(), and prevents folio & page from
getting out of sync in unuse_pte().
Reviewed-by: David Hildenbrand <david@redhat.com>
[willy@infradead.org: fix smatch warning]
Link: https://lkml.kernel.org/r/ZXnPtblC6A1IkyAB@casper.infradead.org
[david@redhat.com: only adjust the page if the folio changed]
Link: https://lkml.kernel.org/r/6a8f2110-fa91-4c10-9eae-88315309a6e3@redhat.com
Link: https://lkml.kernel.org/r/20231211162214.2146080-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231211162214.2146080-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Implement the uABI of UFFDIO_MOVE ioctl.
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [2].
We see over 40% reduction (on a Google pixel 6 device) in the compacting
thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
measured using a benchmark that emulates a heap compaction implementation
using userfaultfd (to allow concurrent accesses by application threads).
More details of the usecase are explained in [2]. Furthermore,
UFFDIO_MOVE enables moving swapped-out pages without touching them within
the same vma. Today, it can only be done by mremap, however it forces
splitting the vma.
[1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/
[2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/
Update for the ioctl_userfaultfd(2) manpage:
UFFDIO_MOVE
(Since Linux xxx) Move a continuous memory chunk into the
userfault registered range and optionally wake up the blocked
thread. The source and destination addresses and the number of
bytes to move are specified by the src, dst, and len fields of
the uffdio_move structure pointed to by argp:
struct uffdio_move {
__u64 dst; /* Destination of move */
__u64 src; /* Source of move */
__u64 len; /* Number of bytes to move */
__u64 mode; /* Flags controlling behavior of move */
__s64 move; /* Number of bytes moved, or negated error */
};
The following value may be bitwise ORed in mode to change the
behavior of the UFFDIO_MOVE operation:
UFFDIO_MOVE_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault
resolution
UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
Allow holes in the source virtual range that is being moved.
When not specified, the holes will result in ENOENT error.
When specified, the holes will be accounted as successfully
moved memory. This is mostly useful to move hugepage aligned
virtual regions without knowing if there are transparent
hugepages in the regions or not, but preventing the risk of
having to split the hugepage during the operation.
The move field is used by the kernel to return the number of
bytes that was actually moved, or an error (a negated errno-
style value). If the value returned in move doesn't match the
value that was specified in len, the operation fails with the
error EAGAIN. The move field is output-only; it is not read by
the UFFDIO_MOVE operation.
The operation may fail for various reasons. Usually, remapping of
pages that are not exclusive to the given process fail; once KSM
might deduplicate pages or fork() COW-shares pages during fork()
with child processes, they are no longer exclusive. Further, the
kernel might only perform lightweight checks for detecting whether
the pages are exclusive, and return -EBUSY in case that check fails.
To make the operation more likely to succeed, KSM should be
disabled, fork() should be avoided or MADV_DONTFORK should be
configured for the source VMA before fork().
This ioctl(2) operation returns 0 on success. In this case, the
entire area was moved. On error, -1 is returned and errno is
set to indicate the error. Possible errors include:
EAGAIN The number of bytes moved (i.e., the value returned in
the move field) does not equal the value that was
specified in the len field.
EINVAL Either dst or len was not a multiple of the system page
size, or the range specified by src and len or dst and len
was invalid.
EINVAL An invalid bit was specified in the mode field.
ENOENT
The source virtual memory range has unmapped holes and
UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set.
EEXIST
The destination virtual memory range is fully or partially
mapped.
EBUSY
The pages in the source virtual memory range are either
pinned or not exclusive to the process. The kernel might
only perform lightweight checks for detecting whether the
pages are exclusive. To make the operation more likely to
succeed, KSM should be disabled, fork() should be avoided
or MADV_DONTFORK should be configured for the source virtual
memory area before fork().
ENOMEM Allocating memory needed for the operation failed.
ESRCH
The target process has exited at the time of a UFFDIO_MOVE
operation.
Link: https://lkml.kernel.org/r/20231206103702.3873743-3-surenb@google.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pull block fixes from Jens Axboe:
"Fix for a badly numbered flag, and a regression fix for the badblocks
updates from this merge window"
* tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux:
block: renumber QUEUE_FLAG_HW_WC
badblocks: avoid checking invalid range in badblocks_check()
|
|
Support governors update when the thermal instance's weight has changed.
This allows to adjust internal state for the governor.
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
[ rjw: Add two empty code lines aroung the locking ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Add a new callback to the struct thermal_governor. It can be used for
updating governors when there is a change in the thermal zone internals,
e.g. thermal cooling device is bind to the thermal zone.
That makes possible to move some heavy operations like memory allocations
related to the number of cooling instances out of the throttle() callback.
Both callback code paths (throttle() and update_tz()) are protected with
the same thermal zone lock, which guaranties the consistency.
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The kerneldoc comment for struct ethtool_fec_stats attempts to describe the
"total" and "lanes" fields of the ethtool_fec_stat substructure in a way
leading to these warnings:
./include/linux/ethtool.h:424: warning: Excess struct member 'lane' description in 'ethtool_fec_stats'
./include/linux/ethtool.h:424: warning: Excess struct member 'total' description in 'ethtool_fec_stats'
Reformat the comment to retain the information while eliminating the
warnings.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Revive proper alignment for the ksymtab and kcrctab sections
- Fix gen_compile_commands.py tool to resolve symbolic links
- Fix symbolic links to installed debug VDSO files
- Update MAINTAINERS
* tag 'kbuild-fixes-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
linux/export: Ensure natural alignment of kcrctab array
kbuild: fix build ID symlinks to installed debug VDSO files
gen_compile_commands.py: fix path resolve with symlinks in it
MAINTAINERS: Add scripts/clang-tools to Kbuild section
linux/export: Fix alignment for 64-bit ksymtab entries
|
|
The ___kcrctab section holds an array of 32-bit CRC values.
Add a .balign 4 to tell the linker the correct memory alignment.
Fixes: f3304ecd7f06 ("linux/export: use inline assembler to populate symbol CRCs")
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
There are 3 synchronization issues with thermal zone suspend-resume
during system-wide transitions:
1. The resume code runs in a PM notifier which is invoked after user
space has been thawed, so it can run concurrently with user space
which can trigger a thermal zone device removal. If that happens,
the thermal zone resume code may use a stale pointer to the next
list element and crash, because it does not hold thermal_list_lock
while walking thermal_tz_list.
2. The thermal zone resume code calls thermal_zone_device_init()
outside the zone lock, so user space or an update triggered by
the platform firmware may see an inconsistent state of a
thermal zone leading to unexpected behavior.
3. Clearing the in_suspend global variable in thermal_pm_notify()
allows __thermal_zone_device_update() to continue for all thermal
zones and it may as well run before the thermal_tz_list walk (or
at any point during the list walk for that matter) and attempt to
operate on a thermal zone that has not been resumed yet. It may
also race destructively with thermal_zone_device_init().
To address these issues, add thermal_list_lock locking to
thermal_pm_notify(), especially arount the thermal_tz_list,
make it call thermal_zone_device_init() back-to-back with
__thermal_zone_device_update() under the zone lock and replace
in_suspend with per-zone bool "suspend" indicators set and unset
under the given zone's lock.
Link: https://lore.kernel.org/linux-pm/20231218162348.69101-1-bo.ye@mediatek.com/
Reported-by: Bo Ye <bo.ye@mediatek.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
All usages of this struct have been removed from the kernel tree.
The struct is still referenced by scripts/check-sysctl-docs but that
script is broken anyways as it only supports the register_sysctl_paths()
API and not the currently used register_sysctl() one.
Fixes: 0199849acd07 ("sysctl: remove register_sysctl_paths()")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
It seems it was never used.
Fixes: 2f2665c13af4 ("sysctl: replace child with an enumeration")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
When running 'make htmldocs', I see the following warning:
Documentation/filesystems/api-summary:14: ./include/linux/fs.h:1659: WARNING: Definition list ends without a blank line; unexpected unindent.
The official guidance [1] seems to be to use lists, which will prevent
both the "unexpected unindent" warning as well as ensure that each line
is formatted on a separate line in the HTML output instead of being
all considered a single paragraph.
[1]: https://docs.kernel.org/doc-guide/kernel-doc.html#return-values
Fixes: 8802e580ee64 ("fs: create __sb_write_started() helper")
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Link: https://lore.kernel.org/r/20231228100608.3123987-1-vegard.nossum@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Track the file position above which the server is not expected to have any
data (the "zero point") and preemptively assume that we can satisfy
requests by filling them with zeroes locally rather than attempting to
download them if they're over that line - even if we've written data back
to the server. Assume that any data that was written back above that
position is held in the local cache. Note that we have to split requests
that straddle the line.
Make use of this to optimise away some reads from the server. We need to
set the zero point in the following circumstances:
(1) When we see an extant remote inode and have no cache for it, we set
the zero_point to i_size.
(2) On local inode creation, we set zero_point to 0.
(3) On local truncation down, we reduce zero_point to the new i_size if
the new i_size is lower.
(4) On local truncation up, we don't change zero_point.
(5) On local modification, we don't change zero_point.
(6) On remote invalidation, we set zero_point to the new i_size.
(7) If stored data is discarded from the pagecache or culled from fscache,
we must set zero_point above that if the data also got written to the
server.
(8) If dirty data is written back to the server, but not fscache, we must
set zero_point above that.
(9) If a direct I/O write is made, set zero_point above that.
Assuming the above, any read from the server at or above the zero_point
position will return all zeroes.
The zero_point value can be stored in the cache, provided the above rules
are applied to it by any code that culls part of the local cache.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a flag whereby a filesystem may request that cifs_perform_write()
perform write-through caching. This involves putting pages directly into
writeback rather than dirty and attaching them to a write operation as we
go.
Further, the writes being made are limited to the byte range being written
rather than whole folios being written. This can be used by cifs, for
example, to deal with strict byte-range locking.
This can't be used with content encryption as that may require expansion of
the write RPC beyond the write being made.
This doesn't affect writes via mmap - those are written back in the normal
way; similarly failed writethrough writes are marked dirty and left to
writeback to retry. Another option would be to simply invalidate them, but
the contents can be simultaneously accessed by read() and through mmap.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a launder_folio implementation for netfslib.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide an implementation of writepages for network filesystems to delegate
to.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Make netfslib pass the maximum length to the ->prepare_write() op to tell
the cache how much it can expand the length of a write to. This allows a
write to the server at the end of a file to be limited to a few bytes
whilst writing an entire block to the cache (something required by direct
I/O).
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a top-level-ish function that can be pointed to directly by
->read_iter file op.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide an entry point to delegate a filesystem's ->page_mkwrite() to.
This checks for conflicting writes, then attached any netfs-specific group
marking (e.g. ceph snap) to the page to be considered dirty.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Institute a netfs write helper, netfs_file_write_iter(), to be pointed at
by the network filesystem ->write_iter() call. Make it handled buffered
writes by calling the previously defined netfs_perform_write() to copy the
source data into the pagecache.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Implement support for unbuffered writes and direct I/O writes. If the
write is misaligned with respect to the fscrypt block size, then RMW cycles
are performed if necessary. DIO writes are a special case of unbuffered
writes with extra restriction imposed, such as block size alignment
requirements.
Also provide a field that can tell the code to add some extra space onto
the bounce buffer for use by the filesystem in the case of a
content-encrypted file.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Implement support for unbuffered and DIO reads in the netfs library,
utilising the existing read helper code to do block splitting and
individual queuing. The code also handles extraction of the destination
buffer from the supplied iterator, allowing async unbuffered reads to take
place.
The read will be split up according to the rsize setting and, if supplied,
the ->clamp_length() method. Note that the next subrequest will be issued
as soon as issue_op returns, without waiting for previous ones to finish.
The network filesystem needs to pause or handle queuing them if it doesn't
want to fire them all at the server simultaneously.
Once all the subrequests have finished, the state will be assessed and the
amount of data to be indicated as having being obtained will be
determined. As the subrequests may finish in any order, if an intermediate
subrequest is short, any further subrequests may be copied into the buffer
and then abandoned.
In the future, this will also take care of doing an unbuffered read from
encrypted content, with the decryption being done by the library.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a netfs write helper, netfs_perform_write() to buffer data to be
written in the pagecache and mark the modified folios dirty.
It will perform "streaming writes" for folios that aren't currently
resident, if possible, storing data in partially modified folios that are
marked dirty, but not uptodate. It will also tag pages as belonging to
fs-specific write groups if so directed by the filesystem.
This is derived from generic_perform_write(), but doesn't use
->write_begin() and ->write_end(), having that logic rolled in instead.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Dispatch one or more write reqeusts to process a writeback slice, where a
slice is tailored more to logical block divisions within the file (such as
crypto blocks, an object layout or cache granules) than the protocol RPC
maximum capacity.
The dispatch doesn't happen until throttling allows, at which point the
entire writeback slice is processed and queued. A slice may be written to
multiple destinations (one or more servers and the local cache) and the
writes to each destination might be split up along different lines.
The writeback slice holds the required folios pinned. An iov_iter is
provided in netfs_write_request that describes the buffer to be used. This
may be part of the pagecache, may have auxiliary padding pages attached or
may be a bounce buffer resulting from crypto or compression. Consequently,
the filesystem must not twiddle the folio markings directly.
The following API is available to the filesystem:
(1) The ->create_write_requests() method is called to ask the filesystem
to create the requests it needs. This is passed the writeback slice
to be processed.
(2) The filesystem should then call netfs_create_write_request() to create
the requests it needs.
(3) Once a request is initialised, netfs_queue_write_request() can be
called to dispatch it asynchronously, if not completed immediately.
(4) netfs_write_request_completed() should be called to note the
completion of a request.
(5) netfs_get_write_request() and netfs_put_write_request() are provided
to refcount a request. These take constants from the netfs_wreq_trace
enum for logging into ftrace.
(6) The ->free_write_request is method is called to ask the filesystem to
clean up a request.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Prepare to use folio->private to hold information write grouping and
streaming write. These are implemented in the same commit as they both
make use of folio->private and will be both checked at the same time in
several places.
"Write grouping" involves ordering the writeback of groups of writes, such
as is needed for ceph snaps. A group is represented by a
filesystem-supplied object which must contain a netfs_group struct. This
contains just a refcount and a pointer to a destructor.
"Streaming write" is the storage of data in folios that are marked dirty,
but not uptodate, to avoid unnecessary reads of data. This is represented
by a netfs_folio struct. This contains the offset and length of the
modified region plus the otherwise displaced write grouping pointer.
The way folio->private is multiplexed is:
(1) If private is NULL then neither is in operation on a dirty folio.
(2) If private is set, with bit 0 clear, then this points to a group.
(3) If private is set, with bit 0 set, then this points to a netfs_folio
struct (with bit 0 AND'ed out).
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add a hook for netfslib's write helpers to call to tell the network
filesystem that it should update its i_size.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Modify the netfs_io_request struct to act as a point around which writes
can be coordinated. It represents and pins a range of pages that need
writing and a list of regions of dirty data in that range of pages.
If RMW is required, the original data can be downloaded into the bounce
buffer, decrypted if necessary, the modifications made, then the modified
data can be reencrypted/recompressed and sent back to the server.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Limit a subrequest to a maximum size and/or a maximum number of contiguous
physical regions. This permits, for instance, an subreq's iterator to be
limited to the number of DMA'able segments that a large RDMA request can
handle.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add a function to work out how much of an ITER_BVEC or ITER_XARRAY iterator
we can use in a pagecount-limited and size-limited span. This will be
used, for example, to limit the number of segments in a subrequest to the
maximum number of elements that an RDMA transfer can handle.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide tools to create a buffer in an xarray, with a function to add new
folios with a mark. This will be used to create bounce buffer and can be
used more easily to create a list of folios the span of which would require
more than a page's worth of bio_vec structs.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add a bvec array pointer and an iterator to netfs_io_request for either
holding a copy of a DIO iterator or a list of all the bits of buffer
pointed to by a DIO iterator.
There are two problems: Firstly, if an iovec-class iov_iter is passed to
->read_iter() or ->write_iter(), this cannot be passed directly to
kernel_sendmsg() or kernel_recvmsg() as that may cause locking recursion if
a fault is generated, so we need to keep track of the pages involved
separately.
Secondly, if the I/O is asynchronous, we must copy the iov_iter describing
the buffer before returning to the caller as it may be immediately
deallocated.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Give BLK_DEF_MAX_SECTORS a _CAP postfix and document what it is used for.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227092305.279567-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
by moving cond_resched_rcu() to rcupdate_wait.h, we can kill another big
sched.h dependency.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We really only need types.h, list.h is big.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We're trying to get sched.h down to more or less just types only, not
code - rseq can live in its own header.
This helps us kill the dependency on preempt.h in sched.h.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Linux 6.7-rc7
|
|
Rename dsa_realloc_skb to skb_ensure_writable_head_tail and move it to
skbuff.c to use it as helper.
Signed-off-by: Radu Pirea (NXP OSS) <radu-nicolae.pirea@oss.nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
By mistake, dev_pm_opp_find_level_floor() used the level parameter as
unsigned long instead of unsigned int. Fix it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
|
|
Remove the unused 'node' member. It got replaced by device_driver chaining
more than 20 years ago in commit 4b4a837f2b57 ("PCI: start to use common
fields of struct device_driver more...") of the history.git tree.
Link: https://lore.kernel.org/r/20231220133505.8798-1-minipli@grsecurity.net
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Kalle Valo <kvalo@kernel.org>
|
|
The System EID (SEID) is an internal EID that is used by the SMCv2
software stack that has a predefined and constant value representing
the s390 physical machine that the OS is executing on. So it should
be managed by SMC stack instead of ISM driver and be consistent for
all ISMv2 device (including virtual ISM devices) on s390 architecture.
Suggested-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-and-tested-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
submit_bio_noacct allows completely invalid operations, or operations
that are not supported in the bio path. Extent the existing switch
statement to rejcect all invalid types.
Move the code point for REQ_OP_ZONE_APPEND so that it's not right in the
middle of the zone management operations and the switch statement can
follow the numerical order of the operations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231221070538.1112446-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For the QUEUE_FLAG_HW_WC to actually work, it needs to have a separate
number from QUEUE_FLAG_FUA, doh.
Fixes: 43c9835b144c ("block: don't allow enabling a cache on devices that don't support it")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231226081524.180289-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove the @of_xlate: lines to prevent the kernel-doc warning:
include/linux/iio/iio.h:534: warning: Excess struct member 'of_xlate' description in 'iio_info'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: linux-iio@vger.kernel.org
Link: https://lore.kernel.org/r/20231223050556.13948-1-rdunlap@infradead.org
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
|
|
Some ioctl commands do not require ioctl permission, but are routed to
other permissions such as FILE_GETATTR or FILE_SETATTR. This routing is
done by comparing the ioctl cmd to a set of 64-bit flags (FS_IOC_*).
However, if a 32-bit process is running on a 64-bit kernel, it emits
32-bit flags (FS_IOC32_*) for certain ioctl operations. These flags are
being checked erroneously, which leads to these ioctl operations being
routed to the ioctl permission, rather than the correct file
permissions.
This was also noted in a RED-PEN finding from a while back -
"/* RED-PEN how should LSM module know it's handling 32bit? */".
This patch introduces a new hook, security_file_ioctl_compat(), that is
called from the compat ioctl syscall. All current LSMs have been changed
to support this hook.
Reviewing the three places where we are currently using
security_file_ioctl(), it appears that only SELinux needs a dedicated
compat change; TOMOYO and SMACK appear to be functional without any
change.
Cc: stable@vger.kernel.org
Fixes: 0b24dcb7f2f7 ("Revert "selinux: simplify ioctl checking"")
Signed-off-by: Alfred Piccioni <alpic@google.com>
Reviewed-by: Stephen Smalley <stephen.smalley.work@gmail.com>
[PM: subject tweak, line length fixes, and alignment corrections]
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
Add three iov_iter structs:
(1) Add an iov_iter (->iter) to the I/O request to describe the
unencrypted-side buffer.
(2) Add an iov_iter (->io_iter) to the I/O request to describe the
encrypted-side I/O buffer. This may be a different size to the buffer
in (1).
(3) Add an iov_iter (->io_iter) to the I/O subrequest to describe the part
of the I/O buffer for that subrequest.
This will allow future patches to point to a bounce buffer instead for
purposes of handling oversize writes, decryption (where we want to save the
encrypted data to the cache) and decompression.
These iov_iters persist for the lifetime of the (sub)request, and so can be
accessed multiple times without worrying about them being deallocated upon
return to the caller.
The network filesystem must appropriately advance the iterator before
terminating the request.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Borrow NFS's direct-vs-buffered I/O locking into netfslib. Similar code is
also used in ceph.
Modify it to have the correct checker annotations for i_rwsem lock
acquisition/release and to return -ERESTARTSYS if waits are interrupted.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide default invalidate_folio and release_folio calls. These will need
to interact with invalidation correctly at some point. They will be needed
if netfslib is to make use of folio->private for its own purposes.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|