Documentation/filesystems/netfs_library.rst



.. SPDX-License-Identifier: GPL-2.0

===================================
Network Filesystem Services Library
===================================

.. Contents:

 - Overview.
   - Requests and streams.
   - Subrequests.
   - Result collection and retry.
   - Local caching.
   - Content encryption (fscrypt).
 - Per-inode context.
   - Inode context helper functions.
   - Inode locking.
   - Inode writeback.
 - High-level VFS API.
   - Unlocked read/write iter.
   - Pre-locked read/write iter.
   - Memory-mapped I/O API.
   - Monolithic files API.
 - High-level VM API.
   - Deprecated PG_private_2 API.
 - I/O request API.
   - Request structure.
   - Stream structure.
   - Subrequest structure.
   - Filesystem methods.
   - Terminating a subrequest.
   - Local cache API.
 - API function reference.


Overview
========

The network filesystem services library, netfslib, is a set of functions
designed to aid a network filesystem in implementing VM/VFS API operations.  It
takes over the normal buffered read, readahead, write and writeback and also
handles unbuffered and direct I/O.

The library provides support for (re-)negotiation of I/O sizes and retrying
failed I/O as well as local caching and will, in the future, provide content
encryption.

It insulates the filesystem from VM interface changes as much as possible and
handles VM features such as large multipage folios.  The filesystem basically
just has to provide a way to perform read and write RPC calls.

The way I/O is organised inside netfslib consists of a number of objects:

 * A *request*.  A request is used to track the progress of the I/O overall and
   to hold on to resources.  The collection of results is done at the request
   level.  The I/O within a request is divided into a number of parallel
   streams of subrequests.

 * A *stream*.  A non-overlapping series of subrequests.  The subrequests
   within a stream do not have to be contiguous.

 * A *subrequest*.  This is the basic unit of I/O.  It represents a single RPC
   call or a single cache I/O operation.  The library passes these to the
   filesystem and the cache to perform.

Requests and Streams
--------------------

When actually performing I/O (as opposed to just copying into the pagecache),
netfslib will create one or more requests to track the progress of the I/O and
to hold resources.

A read operation will have a single stream and the subrequests within that
stream may be of mixed origins, for instance mixing RPC subrequests and cache
subrequests.

On the other hand, a write operation may have multiple streams, where each
stream targets a different destination.  For instance, there may be one stream
writing to the local cache and one to the server.  Currently, only two streams
are allowed, but this could be increased if parallel writes to multiple servers
are desired.

The subrequests within a write stream do not need to match alignment or size
with the subrequests in another write stream and netfslib performs the tiling
of subrequests in each stream over the source buffer independently.  Further,
each stream may contain holes that don't correspond to holes in the other
stream.

In addition, the subrequests do not need to correspond to the boundaries of the
folios or vectors in the source/destination buffer.  The library handles the
collection of results and the wrangling of folio flags and references.
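
As a rough illustration of independent tiling, the following user-space sketch
(not kernel code) covers the same source range with slices for two streams that
have different maximum subrequest sizes.  The ``model_`` names and the size
limits are assumptions made for the example and are not part of netfslib.

```c
#include <stddef.h>

/* Hypothetical stand-in for one slice of a stream; not a kernel struct. */
struct model_slice {
	unsigned long long start;	/* File position of the slice */
	size_t len;			/* Length of the slice */
};

/*
 * Tile [start, start + len) into slices of at most max_len bytes, storing up
 * to max_slices of them in out[].  Returns the number of slices the range
 * requires, which may be greater than max_slices.
 */
static size_t model_tile_stream(unsigned long long start, size_t len,
				size_t max_len,
				struct model_slice *out, size_t max_slices)
{
	size_t n = 0;

	if (max_len == 0)
		return 0;

	while (len > 0) {
		size_t part = len < max_len ? len : max_len;

		if (n < max_slices) {
			out[n].start = start;
			out[n].len = part;
		}
		start += part;
		len -= part;
		n++;
	}
	return n;
}
```

Calling this once per stream with that stream's own limit gives slice
boundaries that need not line up between the streams, which is the property
described above.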

Subrequests
-----------

Subrequests are at the heart of the interaction between netfslib and the
filesystem using it.  Each subrequest is expected to correspond to a single
read or write RPC or cache operation.  The library will stitch together the
results from a set of subrequests to provide a higher level operation.

Netfslib has two interactions with the filesystem or the cache when setting up
a subrequest.  First, there's an optional preparatory step that allows the
filesystem to negotiate the limits on the subrequest, both in terms of maximum
number of bytes and maximum number of vectors (e.g. for RDMA).  This may
involve negotiating with the server (e.g. cifs needing to acquire credits).

And, secondly, there's the issuing step in which the subrequest is handed off
to the filesystem to perform.

Note that these two steps are done slightly differently between read and write:

 * For reads, the VM/VFS tells us how much is being requested up front, so the
   library can preset maximum values that the cache and then the filesystem can
   reduce.  The cache also gets consulted first on whether it wants to do a
   read before the filesystem is consulted.

 * For writeback, it is unknown how much there will be to write until the
   pagecache is walked, so no limit is set by the library.

Once a subrequest is completed, the filesystem or cache informs the library of
the completion and then collection is invoked.  Depending on whether the
request is synchronous or asynchronous, the collection of results will be done
in either the application thread or in a work queue.

Result Collection and Retry
---------------------------

As subrequests complete, the results are collected and collated by the library
and folio unlocking is performed progressively (if appropriate).  Once the
request is complete, async completion will be invoked (again, if appropriate).
It is possible for the filesystem to provide interim progress reports to the
library to cause folio unlocking to happen earlier if possible.

If any subrequests fail, netfslib can retry them.  It will wait until all
subrequests are completed, offer the filesystem the opportunity to fiddle with
the resources/state held by the request and poke at the subrequests before
re-preparing and re-issuing the subrequests.

This allows the tiling of contiguous sets of failed subrequests within a
stream to be changed, adding more subrequests or ditching excess as necessary
(for instance, if the network sizes change or the server decides it wants
smaller chunks).

Further, if one or more contiguous cache-read subrequests fail, the library
will pass them to the filesystem to perform instead, renegotiating and retiling
them as necessary to fit with the filesystem's parameters rather than those of
the cache.
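
The first part of that retry step, finding the contiguous runs of failed
subrequests that may be retiled together, might be modelled in user space as
below.  This is only a sketch under simplifying assumptions: the ``model_``
structures are invented for the example and failure is reduced to a single
boolean.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down view of a completed subrequest. */
struct model_subreq {
	unsigned long long start;	/* File position */
	size_t len;			/* Length of the slice */
	bool failed;			/* True if it needs retrying */
};

/* A merged region of the file that can be retiled as a unit. */
struct model_span {
	unsigned long long start;
	unsigned long long len;
};

/*
 * Walk a stream in order and merge failed subrequests that are adjacent in
 * the stream and contiguous in the file.  Stores up to max_spans spans and
 * returns the number of spans found.
 */
static size_t model_collect_failed_spans(const struct model_subreq *sr,
					 size_t nr,
					 struct model_span *spans,
					 size_t max_spans)
{
	unsigned long long cur_start = 0, cur_len = 0;
	bool in_span = false;
	size_t n = 0;

	for (size_t i = 0; i < nr; i++) {
		bool extends = in_span && sr[i].failed &&
			sr[i].start == cur_start + cur_len;

		if (extends) {
			cur_len += sr[i].len;
		} else if (sr[i].failed) {
			/* Start a new span at this subrequest. */
			cur_start = sr[i].start;
			cur_len = sr[i].len;
			n++;
		}
		in_span = sr[i].failed;
		if (sr[i].failed && n <= max_spans) {
			spans[n - 1].start = cur_start;
			spans[n - 1].len = cur_len;
		}
	}
	return n;
}
```

Each resulting span could then be divided up again using whatever limits the
filesystem negotiates at retry time.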

Local Caching
-------------

One of the services netfslib provides, via ``fscache``, is the option to cache
on local disk a copy of the data obtained from/written to a network filesystem.
The library will manage the storing, retrieval and some invalidation of data
automatically on behalf of the filesystem if a cookie is attached to the
``netfs_inode``.

Note that local caching used to use the PG_private_2 flag (aliased as
PG_fscache) to keep track of a page that was being written to the cache, but
this is now deprecated as PG_private_2 will be removed.

Instead, folios that are read from the server for which there was no data in
the cache will be marked as dirty and will have ``folio->private`` set to a
special value (``NETFS_FOLIO_COPY_TO_CACHE``) and left to writeback to write.
If the folio is modified before that happens, the special value will be
cleared and the folio will become normally dirty.

When writeback occurs, folios that are so marked will only be written to the
cache and not to the server.  Writeback handles mixed cache-only writes and
server-and-cache writes by using two streams, sending one to the cache and one
to the server.  The server stream will have gaps in it corresponding to those
folios.
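
The stream selection described above can be pictured with a small user-space
model.  The bitmask values and the function name are hypothetical and exist
only for this sketch; ``copy_to_cache`` stands for a folio whose
``folio->private`` holds ``NETFS_FOLIO_COPY_TO_CACHE``.

```c
#include <stdbool.h>

/* Hypothetical bitmask values naming the two write streams in this sketch. */
#define MODEL_STREAM_SERVER	0x1u
#define MODEL_STREAM_CACHE	0x2u

/*
 * Decide which streams a dirty folio would be written to.  cache_enabled
 * models whether a usable cache cookie is attached to the inode.
 */
static unsigned int model_folio_write_streams(bool copy_to_cache,
					      bool cache_enabled)
{
	unsigned int streams = 0;

	if (!copy_to_cache)
		streams |= MODEL_STREAM_SERVER;	/* Normally dirty: upload */
	if (cache_enabled)
		streams |= MODEL_STREAM_CACHE;	/* Store a local copy */
	return streams;
}
```

A folio for which only ``MODEL_STREAM_CACHE`` is returned corresponds to one of
the gaps in the server stream.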

Content Encryption (fscrypt)
----------------------------

Though it does not do so yet, at some point netfslib will acquire the ability
to do client-side content encryption on behalf of the network filesystem (Ceph,
for example).  fscrypt can be used for this if appropriate (it may not be -
cifs, for example).

The data will be stored encrypted in the local cache using the same manner of
encryption as the data written to the server and the library will impose bounce
buffering and RMW cycles as necessary.


Per-Inode Context
=================

The network filesystem helper library needs a place to store a bit of state for
its use on each netfs inode it is helping to manage.  To this end, a context
structure is defined::

	struct netfs_inode {
		struct inode inode;
		const struct netfs_request_ops *ops;
		struct fscache_cookie * cache;
		loff_t remote_i_size;
		unsigned long flags;
		...
	};

A network filesystem that wants to use netfslib must place one of these in its
inode wrapper struct instead of the VFS ``struct inode``.  This can be done in
a way similar to the following::

	struct my_inode {
		struct netfs_inode netfs; /* Netfslib context and vfs inode */
		...
	};

This allows netfslib to find its state by using ``container_of()`` from the
inode pointer, thereby allowing the netfslib helper functions to be pointed to
directly by the VFS/VM operation tables.
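
The embedding can be tried out in user space with cut-down stand-ins for the
kernel structures, as sketched below.  The structure bodies, the
``my_private`` field and the ``model_`` helpers are assumptions for the
example; the kernel's own ``container_of()`` additionally performs type
checking that this simplified macro omits.

```c
#include <stddef.h>

/* Minimal stand-ins so that the sketch builds outside the kernel. */
struct inode {
	unsigned long i_ino;
};

struct netfs_inode {
	struct inode inode;		/* VFS inode */
	unsigned long flags;
};

struct my_inode {
	struct netfs_inode netfs;	/* Netfslib context and vfs inode */
	int my_private;			/* Hypothetical filesystem field */
};

/* Simplified user-space equivalent of the kernel's container_of(). */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Step from the VFS inode to the netfslib context that contains it. */
static struct netfs_inode *model_netfs_inode(struct inode *inode)
{
	return model_container_of(inode, struct netfs_inode, inode);
}

/* Step from the VFS inode to the filesystem's wrapper. */
static struct my_inode *model_my_inode(struct inode *inode)
{
	return model_container_of(model_netfs_inode(inode),
				  struct my_inode, netfs);
}
```

Because no pointer needs to be stored to get from one structure to the other,
the same ``struct inode *`` handed out by the VFS serves the VFS, netfslib and
the filesystem alike.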

The structure contains the following fields that are of interest to the
filesystem:

 * ``inode``

   The VFS inode structure.

 * ``ops``

   The set of operations provided by the network filesystem to netfslib.

 * ``cache``

   Local caching cookie, or NULL if no caching is enabled.  This field does not
   exist if fscache is disabled.

 * ``remote_i_size``

   The size of the file on the server.  This differs from inode->i_size if
   local modifications have been made but not yet written back.

 * ``flags``

   A set of flags, some of which the filesystem might be interested in:

   * ``NETFS_ICTX_MODIFIED_ATTR``

     Set if netfslib modifies mtime/ctime.  The filesystem is free to ignore
     this or clear it.

   * ``NETFS_ICTX_UNBUFFERED``

     Do unbuffered I/O upon the file.  Like direct I/O but without the
     alignment limitations.  RMW will be performed if necessary.  The pagecache
     will not be used unless mmap() is also used.

   * ``NETFS_ICTX_WRITETHROUGH``

     Do writethrough caching upon the file.  I/O will be set up and dispatched
     as buffered writes are made to the page cache.  mmap() does the normal
     writeback thing.

   * ``NETFS_ICTX_SINGLE_NO_UPLOAD``

     Set if the file has a monolithic content that must be read entirely in a
     single go and must not be written back to the server, though it can be
     cached (e.g. AFS directories).

Inode Context Helper Functions
------------------------------

To help deal with the per-inode context, a number of helper functions are
provided.  Firstly, a function to perform basic initialisation on a context and
set the operations table pointer::

	void netfs_inode_init(struct netfs_inode *ctx,
			      const struct netfs_request_ops *ops);

then a function to cast from the VFS inode structure to the netfs context::

	struct netfs_inode *netfs_inode(struct inode *inode);

and finally, a function to get the cache cookie pointer from the context
attached to an inode (or NULL if fscache is disabled)::

	struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);

Inode Locking
-------------

A number of functions are provided to manage the locking of i_rwsem for I/O and
to effectively extend it to provide more separate classes of exclusion::

	int netfs_start_io_read(struct inode *inode);
	void netfs_end_io_read(struct inode *inode);
	int netfs_start_io_write(struct inode *inode);
	void netfs_end_io_write(struct inode *inode);
	int netfs_start_io_direct(struct inode *inode);
	void netfs_end_io_direct(struct inode *inode);

The exclusion breaks down into four separate classes:

 1) Buffered reads and writes.

    Buffered reads can run concurrently with each other and with buffered
    writes, but buffered writes cannot run concurrently with each other.

 2) Direct reads and writes.

    Direct (and unbuffered) reads and writes can run concurrently since they do
    not share local buffering (i.e. the pagecache) and, in a network
    filesystem, are expected to have exclusion managed on the server (though
    this may not be the case for, say, Ceph).

 3) Other major inode modifying operations (e.g. truncate, fallocate).

    These should just access i_rwsem directly.

 4) mmap().

    mmap'd accesses might operate concurrently with any of the other classes.
    They might form the buffer for an intra-file loopback DIO read/write.  They
    might be permitted on unbuffered files.

Inode Writeback
---------------

Netfslib will pin resources on an inode for future writeback (such as pinning
use of an fscache cookie) when an inode is dirtied.  However, this pinning
needs careful management.  To manage the pinning, the following sequence
occurs:

 1) An inode state flag ``I_PINNING_NETFS_WB`` is set by netfslib when the
    pinning begins (when a folio is dirtied, for example) if the cache is
    active to stop the cache structures from being discarded and the cache
    space from being culled.  This also prevents re-getting of cache resources
    if the flag is already set.

 2) This flag is then cleared inside the inode lock during inode writeback in
    the VM - and the fact that it was set is transferred to
    ``->unpinned_netfs_wb`` in ``struct writeback_control``.

 3) If ``->unpinned_netfs_wb`` is now set, the write_inode procedure is forced.

 4) The filesystem's ``->write_inode()`` function is invoked to do the cleanup.

 5) The filesystem invokes netfs to do its cleanup.

To do the cleanup, netfslib provides a function to do the resource unpinning::

	int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc);

If the filesystem doesn't need to do anything else, this may be set as its
``.write_inode`` method.

Further, if an inode is deleted, the filesystem's write_inode method may not
get called, so::

	void netfs_clear_inode_writeback(struct inode *inode, const void *aux);

must be called from ``->evict_inode()`` *before* ``clear_inode()`` is called.
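
The hand-off of the pin through the steps listed above can be followed in a
small user-space model.  Everything here is a stand-in invented for
illustration: the flag value, the ``pins`` counter (representing pinned cache
resources) and the ``model_`` functions do not exist in the kernel.

```c
#include <stdbool.h>

#define MODEL_I_PINNING_NETFS_WB	0x1u	/* Hypothetical bit value */

struct model_inode {
	unsigned int state;	/* Models the inode state flags */
	int pins;		/* Models pinned cache resources */
};

struct model_wbc {
	bool unpinned_netfs_wb;	/* Models the writeback_control member */
};

/* Step 1: pin on dirtying, unless the cache is off or already pinned. */
static void model_dirty_folio(struct model_inode *inode, bool cache_active)
{
	if (!cache_active || (inode->state & MODEL_I_PINNING_NETFS_WB))
		return;
	inode->state |= MODEL_I_PINNING_NETFS_WB;
	inode->pins++;
}

/*
 * Steps 2 and 3: clear the flag and transfer the fact that it was set.
 * Returns true if the write_inode procedure has to be forced.
 */
static bool model_vm_writeback(struct model_inode *inode,
			       struct model_wbc *wbc)
{
	wbc->unpinned_netfs_wb =
		(inode->state & MODEL_I_PINNING_NETFS_WB) != 0;
	inode->state &= ~MODEL_I_PINNING_NETFS_WB;
	return wbc->unpinned_netfs_wb;
}

/* Steps 4 and 5: the cleanup drops the pin if one was transferred. */
static void model_write_inode(struct model_inode *inode,
			      const struct model_wbc *wbc)
{
	if (wbc->unpinned_netfs_wb)
		inode->pins--;
}
```

Dirtying the inode twice before writeback only takes one pin in this model,
matching the note that the flag prevents re-getting of cache resources.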


High-Level VFS API
==================

Netfslib provides a number of sets of API calls for the filesystem to delegate
VFS operations to.  Netfslib, in turn, will call out to the filesystem and the
cache to negotiate I/O sizes, issue RPCs and provide places for it to intervene
at various times.

Unlocked Read/Write Iter
------------------------

The first API set is for the delegation of operations to netfslib when the
filesystem is called through the standard VFS read/write_iter methods::

	ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
	ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from);
	ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
	ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
	ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from);

They can be assigned directly to ``.read_iter`` and ``.write_iter``.  They
perform the inode locking themselves and the first two will switch between
buffered I/O and DIO as appropriate.

Pre-Locked Read/Write Iter
--------------------------

The second API set is for the delegation of operations to netfslib when the
filesystem is called through the standard VFS methods, but needs to do some
other stuff before or after calling netfslib whilst still inside the locked
section (e.g. Ceph negotiating caps).  The unbuffered read function is::

	ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter);

This must not be assigned directly to ``.read_iter`` and the filesystem is
responsible for performing the inode locking before calling it.  In the case of
buffered read, the filesystem should use ``filemap_read()``.

There are three functions for writes::

	ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
						 struct netfs_group *netfs_group);
	ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
				    struct netfs_group *netfs_group);
	ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter,
						   struct netfs_group *netfs_group);

These must not be assigned directly to ``.write_iter`` and the filesystem is
responsible for performing the inode locking before calling them.

The first two functions are for buffered writes; the first just adds some
standard write checks and jumps to the second, but if the filesystem wants to
do the checks itself, it can use the second directly.  The third function is
for unbuffered or DIO writes.

On all three write functions, there is a writeback group pointer (which should
be NULL if the filesystem doesn't use this).  Writeback groups are set on
folios when they're modified.  If a folio to-be-modified is already marked with
a different group, it is flushed first.  The writeback API allows writing back
of a specific group.

Memory-Mapped I/O API
---------------------

An API for support of mmap()'d I/O is provided::

	vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);

This allows the filesystem to delegate ``.page_mkwrite`` to netfslib.  The
filesystem should not take the inode lock before calling it, but, as with the
locked write functions above, this does take a writeback group pointer.  If the
page to be made writable is in a different group, it will be flushed first.

Monolithic Files API
--------------------

There is also a special API set for files for which the content must be read in
a single RPC (and not written back) and is maintained as a monolithic blob
(e.g. an AFS directory), though it can be stored and updated in the local cache::

	ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter);
	void netfs_single_mark_inode_dirty(struct inode *inode);
	int netfs_writeback_single(struct address_space *mapping,
				   struct writeback_control *wbc,
				   struct iov_iter *iter);

The first function reads from a file into the given buffer, reading from the
cache in preference if the data is cached there; the second function allows the
inode to be marked dirty, causing a later writeback; and the third function can
be called from the writeback code to write the data to the cache, if there is
one.

The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is to be
used.  The writeback function requires the buffer to be of ITER_FOLIOQ type.

High-Level VM API
=================

Netfslib also provides a number of sets of API calls for the filesystem to
delegate VM operations to.  Again, netfslib, in turn, will call out to the
filesystem and the cache to negotiate I/O sizes, issue RPCs and provide places
for it to intervene at various times::

	void netfs_readahead(struct readahead_control *);
	int netfs_read_folio(struct file *, struct folio *);
	int netfs_writepages(struct address_space *mapping,
			     struct writeback_control *wbc);
	bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
	void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
	bool netfs_release_folio(struct folio *folio, gfp_t gfp);

These are ``address_space_operations`` methods and can be set directly in the
operations table.

Deprecated PG_private_2 API
---------------------------

There is also a deprecated function for filesystems that still use the
``->write_begin`` method::

	int netfs_write_begin(struct netfs_inode *inode, struct file *file,
			      struct address_space *mapping, loff_t pos, unsigned int len,
			      struct folio **_folio, void **_fsdata);

It uses the deprecated PG_private_2 flag and so should not be used.


I/O Request API
===============

The I/O request API comprises a number of structures and a number of functions
that the filesystem may need to use.

Request Structure
-----------------

The request structure manages the request as a whole, holding some resources
and state on behalf of the filesystem and tracking the collection of results::

	struct netfs_io_request {
		enum netfs_io_origin	origin;
		struct inode		*inode;
		struct address_space	*mapping;
		struct netfs_group	*group;
		struct netfs_io_stream	io_streams[];
		void			*netfs_priv;
		void			*netfs_priv2;
		unsigned long long	start;
		unsigned long long	len;
		unsigned long long	i_size;
		unsigned int		debug_id;
		unsigned long		flags;
		...
	};

Many of the fields are for internal use, but the fields shown here are of
interest to the filesystem:

 * ``origin``

   The origin of the request (readahead, read_folio, DIO read, writeback, ...).

 * ``inode``
 * ``mapping``

   The inode and the address space of the file being read from.  The mapping
   may or may not point to inode->i_data.

 * ``group``

   The writeback group this request is dealing with or NULL.  This holds a ref
   on the group.

 * ``io_streams``

   The parallel streams of subrequests available to the request.  Currently two
   are available, but this may be made extensible in future.  ``NR_IO_STREAMS``
   indicates the size of the array.

 * ``netfs_priv``
 * ``netfs_priv2``

   The network filesystem's private data.  The value for this can be passed in
   to the helper functions or set during the request.

 * ``start``
 * ``len``

   The file position of the start of the read request and the length.  These
   may be altered by the ->expand_readahead() op.

 * ``i_size``

   The size of the file at the start of the request.

 * ``debug_id``

   A number allocated to this operation that can be displayed in trace lines
   for reference.

 * ``flags``

   Flags for managing and controlling the operation of the request.  Some of
   these may be of interest to the filesystem:

   * ``NETFS_RREQ_RETRYING``

     Netfslib sets this when generating retries.

   * ``NETFS_RREQ_PAUSE``

     The filesystem can set this to request to pause the library's subrequest
     issuing loop - but care needs to be taken as netfslib may also set it.

   * ``NETFS_RREQ_NONBLOCK``
   * ``NETFS_RREQ_BLOCKED``

     Netfslib sets the first to indicate that non-blocking mode was set by the
     caller and the filesystem can set the second to indicate that it would
     have had to block.

   * ``NETFS_RREQ_USE_PGPRIV2``

     The filesystem can set this if it wants to use PG_private_2 to track
     whether a folio is being written to the cache.  This is deprecated as
     PG_private_2 is going to go away.

If the filesystem wants more private data than is afforded by this structure,
then it should wrap it and provide its own allocator.

Stream Structure
----------------

A request is comprised of one or more parallel streams and each stream may be
aimed at a different target.

For read requests, only stream 0 is used.  This can contain a mixture of
subrequests aimed at different sources.  For write requests, stream 0 is used
for the server and stream 1 is used for the cache.  For buffered writeback,
stream 0 is not enabled unless a normal dirty folio is encountered, at which
point ->begin_writeback() will be invoked and the filesystem can mark the
stream available.

The stream struct looks like::

	struct netfs_io_stream {
		unsigned char		stream_nr;
		bool			avail;
		size_t			sreq_max_len;
		unsigned int		sreq_max_segs;
		unsigned int		submit_extendable_to;
		...
	};

A number of members are available for access/use by the filesystem:

 * ``stream_nr``

   The number of the stream within the request.

 * ``avail``

   True if the stream is available for use.  The filesystem should set this on
   stream zero in ->begin_writeback().

 * ``sreq_max_len``
 * ``sreq_max_segs``

   These are set by the filesystem or the cache in ->prepare_read() or
   ->prepare_write() for each subrequest to indicate the maximum number of
   bytes and, optionally, the maximum number of segments (if not 0) that that
   subrequest can support.

 * ``submit_extendable_to``

   The size that a subrequest can be rounded up to beyond the EOF, given the
   available buffer.  This allows the cache to work out if it can do a DIO read
   or write that straddles the EOF marker.
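
To give a feel for how the two limits interact, here is a user-space sketch
that works out how much of a segmented buffer a single subrequest could cover.
The function and its parameters are hypothetical; only the meaning of the
limits (a byte cap, and a segment cap that is ignored when zero) is taken from
the description above.

```c
#include <stddef.h>

/*
 * Work out how many bytes the next subrequest may cover, given the lengths
 * of the buffer segments still to be consumed and the stream limits.  A
 * max_segs of 0 means there is no limit on the number of segments.  A
 * subrequest may end part-way through a segment.
 */
static size_t model_limit_subreq(const size_t *seg_len, size_t nr_segs,
				 size_t max_len, unsigned int max_segs)
{
	size_t bytes = 0;

	for (size_t i = 0; i < nr_segs; i++) {
		if (max_segs && i >= max_segs)
			break;		/* Segment limit reached */
		if (seg_len[i] >= max_len - bytes)
			return max_len;	/* Byte limit reached */
		bytes += seg_len[i];
	}
	return bytes;
}
```

With three 4096-byte segments, for instance, a 10000-byte cap alone yields
10000 bytes, whereas adding a two-segment cap reduces that to 8192.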

Subrequest Structure
--------------------

Individual units of I/O are managed by the subrequest structure.  These
represent slices of the overall request and run independently::

	struct netfs_io_subrequest {
		struct netfs_io_request *rreq;
		struct iov_iter		io_iter;
		unsigned long long	start;
		size_t			len;
		size_t			transferred;
		unsigned long		flags;
		short			error;
		unsigned short		debug_index;
		unsigned char		stream_nr;
		...
	};

Each subrequest is expected to access a single source, though the library will
handle falling back from one source type to another.  The members are:

 * ``rreq``

   A pointer to the read request.

 * ``io_iter``

   An I/O iterator representing a slice of the buffer to be read into or
   written from.

 * ``start``
 * ``len``

   The file position of the start of this slice of the read request and the
   length.

 * ``transferred``

   The amount of data transferred so far for this subrequest.  This should be
   added to with the length of the transfer made by this issuance of the
   subrequest.  If this is less than ``len`` then the subrequest may be
   reissued to continue.

 * ``flags``

   Flags for managing the subrequest.  A number of these are of interest to
   the filesystem or cache:

   * ``NETFS_SREQ_MADE_PROGRESS``

     Set by the filesystem to indicate that at least one byte of data was read
     or written.

   * ``NETFS_SREQ_HIT_EOF``

     The filesystem should set this if a read hit the EOF on the file (in which
     case ``transferred`` should stop at the EOF).  Netfslib may expand the
     subrequest out to the size of the folio containing the EOF on the off
     chance that a third party change happened or a DIO read may have asked for
     more than is available.  The library will clear any excess pagecache.

   * ``NETFS_SREQ_CLEAR_TAIL``

     The filesystem can set this to indicate that the remainder of the slice,
     from transferred to len, should be cleared.  Do not set if HIT_EOF is set.

   * ``NETFS_SREQ_NEED_RETRY``

     The filesystem can set this to tell netfslib to retry the subrequest.

   * ``NETFS_SREQ_BOUNDARY``

     This can be set by the filesystem on a subrequest to indicate that it ends
     at a boundary with the filesystem structure (e.g. at the end of a Ceph
     object).  It tells netfslib not to retile subrequests across it.

   * ``NETFS_SREQ_SEEK_DATA_READ``

     This is a hint from netfslib to the cache that it might want to try
     skipping ahead to the next data (i.e. using SEEK_DATA).

 * ``error``

   This is for the filesystem to store the result of the subrequest.  It should
   be set to 0 if successful and a negative error code otherwise.

 * ``debug_index``
 * ``stream_nr``

   A number allocated to this slice that can be displayed in trace lines for
   reference and the number of the request stream that it belongs to.

If necessary, the filesystem can get and put extra refs on the subrequest it is
given::

	void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
				  enum netfs_sreq_ref_trace what);
	void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
				  enum netfs_sreq_ref_trace what);

using netfs trace codes to indicate the reason.  Care must be taken, however,
as once control of the subrequest is returned to netfslib, the same subrequest
can be reissued/retried.
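
How a filesystem might fill in ``transferred``, ``flags`` and ``error`` after
one issuance of a read can be modelled in user space as follows.  This is a
sketch only: the structure is a cut-down stand-in, the flag values are
hypothetical bitmasks chosen for the example, and ``model_read_result()`` is
not a netfslib function.

```c
#include <stddef.h>

/* Hypothetical bitmask values standing in for the subrequest flags. */
#define MODEL_SREQ_MADE_PROGRESS	0x1ul
#define MODEL_SREQ_HIT_EOF		0x2ul

/* Cut-down stand-in for the subrequest structure. */
struct model_subreq {
	unsigned long long start;
	size_t len;
	size_t transferred;
	unsigned long flags;
	short error;
};

/*
 * Record the outcome of one issuance of a read subrequest.  result is the
 * number of bytes the pretend RPC returned or a negative error code;
 * file_size is the size of the file as reported by the server.
 */
static void model_read_result(struct model_subreq *subreq, long result,
			      unsigned long long file_size)
{
	if (result < 0) {
		subreq->error = (short)result;
		return;
	}
	subreq->error = 0;
	if (result > 0) {
		/* Add to, rather than overwrite, the running total. */
		subreq->transferred += (size_t)result;
		subreq->flags |= MODEL_SREQ_MADE_PROGRESS;
	}
	if (subreq->start + subreq->transferred >= file_size)
		subreq->flags |= MODEL_SREQ_HIT_EOF;
}

/* True if the subrequest would need reissuing to fetch the remainder. */
static int model_read_is_short(const struct model_subreq *subreq)
{
	return subreq->error == 0 &&
		subreq->transferred < subreq->len &&
		!(subreq->flags & MODEL_SREQ_HIT_EOF);
}
```

A short read that did not reach the EOF leaves the subrequest eligible for
reissue, whereas one that stopped at the EOF does not, even though
``transferred`` is still less than ``len``.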

Filesystem Methods
------------------

The filesystem sets a table of operations in ``netfs_inode`` for netfslib to
use::

	struct netfs_request_ops {
		mempool_t *request_pool;
		mempool_t *subrequest_pool;
		int (*init_request)(struct netfs_io_request *rreq, struct file *file);
		void (*free_request)(struct netfs_io_request *rreq);
		void (*free_subrequest)(struct netfs_io_subrequest *rreq);
		void (*expand_readahead)(struct netfs_io_request *rreq);
		int (*prepare_read)(struct netfs_io_subrequest *subreq);
		void (*issue_read)(struct netfs_io_subrequest *subreq);
		void (*done)(struct netfs_io_request *rreq);
		void (*update_i_size)(struct inode *inode, loff_t i_size);
		void (*post_modify)(struct inode *inode);
		void (*begin_writeback)(struct netfs_io_request *wreq);
		void (*prepare_write)(struct netfs_io_subrequest *subreq);
		void (*issue_write)(struct netfs_io_subrequest *subreq);
		void (*retry_request)(struct netfs_io_request *wreq,
				      struct netfs_io_stream *stream);
		void (*invalidate_cache)(struct netfs_io_request *wreq);
	};

The table starts with a pair of optional pointers to memory pools from which
requests and subrequests can be allocated.  If these are not given, netfslib
has default pools that it will use instead.  If the filesystem wraps the netfs
structs in its own larger structs, then it will need to use its own pools.
Netfslib will allocate directly from the pools.
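
A minimal sketch of the wrapping arrangement follows, assuming a hypothetical
filesystem called ``myfs``.  The ``netfs_io_request`` here is a cut-down
stand-in and ``container_of()`` is redefined so that the example compiles in
userspace.  In the kernel, the elements of the filesystem's own pool would
have to be sized for the wrapper struct::

	#include <stddef.h>

	/* Stand-in for the kernel struct; only two fields are mocked. */
	struct netfs_io_request {
		long long		start;
		unsigned long long	len;
	};

	/* Hypothetical wrapper: the netfs struct is embedded, not pointed to. */
	struct myfs_io_request {
		struct netfs_io_request	rreq;		/* what netfslib sees */
		int			server_id;	/* hypothetical private state */
	};

	/* Userspace equivalent of the kernel's container_of(). */
	#define container_of(ptr, type, member) \
		((type *)((char *)(ptr) - offsetof(type, member)))

	/* Recover the wrapper from the pointer that netfslib hands back. */
	static struct myfs_io_request *myfs_req(struct netfs_io_request *rreq)
	{
		return container_of(rreq, struct myfs_io_request, rreq);
	}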

The methods defined in the table are:

 * ``init_request()``
 * ``free_request()``
 * ``free_subrequest()``

   [Optional] A filesystem may implement these to initialise or clean up any
   resources that it attaches to the request or subrequest.

 * ``expand_readahead()``

   [Optional] This is called to allow the filesystem to expand the size of a
   readahead request.  The filesystem gets to expand the request in both
   directions, though it must retain the initial region as that may represent
   an allocation already made.  If local caching is enabled, it gets to expand
   the request first.

   Expansion is communicated by changing ->start and ->len in the request
   structure.  Note that if any change is made, ->len must be increased by at
   least as much as ->start is reduced.

 * ``prepare_read()``

   [Optional] This is called to allow the filesystem to limit the size of a
   subrequest.  It may also limit the number of individual regions in the
   iterator, such as required by RDMA.  This information should be set on
   stream zero in::

	rreq->io_streams[0].sreq_max_len
	rreq->io_streams[0].sreq_max_segs

   The filesystem can use this, for example, to chop up a request that has to
   be split across multiple servers or to put multiple reads in flight.

   Zero should be returned on success and an error code otherwise.

 * ``issue_read()``

   [Required] Netfslib calls this to dispatch a subrequest to the server for
   reading.  In the subrequest, ->start, ->len and ->transferred indicate what
   data should be read from the server and ->io_iter indicates the buffer to be
   used.

   There is no return value; the ``netfs_read_subreq_terminated()`` function
   should be called to indicate that the subrequest completed either way.
   ->error, ->transferred and ->flags should be updated before completing.  The
   termination can be done asynchronously.

   Note: the filesystem must not deal with setting folios uptodate, unlocking
   them or dropping their refs - the library deals with this as it may have to
   stitch together the results of multiple subrequests that variously overlap
   the set of folios.

 * ``done()``

   [Optional] This is called after the folios in a read request have all been
   unlocked (and marked uptodate if applicable).

 * ``update_i_size()``

   [Optional] This is invoked by netfslib at various points during the write
   paths to ask the filesystem to update its idea of the file size.  If not
   given, netfslib will set i_size and i_blocks and update the local cache
   cookie.
   
 * ``post_modify()``

   [Optional] This is called after netfslib writes to the pagecache or when it
   allows an mmap'd page to be marked as writable.
   
 * ``begin_writeback()``

   [Optional] Netfslib calls this when processing a writeback request if it
   finds a dirty page that isn't simply marked NETFS_FOLIO_COPY_TO_CACHE,
   indicating it must be written to the server.  This allows the filesystem to
   only set up writeback resources when it knows it's going to have to perform
   a write.
   
 * ``prepare_write()``

   [Optional] This is called to allow the filesystem to limit the size of a
   subrequest.  It may also limit the number of individual regions in the
   iterator, such as required by RDMA.  This information should be set on
   the stream to which the subrequest belongs::

	rreq->io_streams[subreq->stream_nr].sreq_max_len
	rreq->io_streams[subreq->stream_nr].sreq_max_segs

   The filesystem can use this, for example, to chop up a request that has to
   be split across multiple servers or to put multiple writes in flight.

   This is not permitted to return an error.  Instead, in the event of failure,
   ``netfs_prepare_write_failed()`` must be called.

 * ``issue_write()``

   [Required] This is used to dispatch a subrequest to the server for writing.
   In the subrequest, ->start, ->len and ->transferred indicate what data
   should be written to the server and ->io_iter indicates the buffer to be
   used.

   There is no return value; the ``netfs_write_subrequest_terminated()``
   function should be called to indicate that the subrequest completed either
   way.  ->error, ->transferred and ->flags should be updated before
   completing.  The termination can be done asynchronously.

   Note: the filesystem must not deal with removing the dirty or writeback
   marks on folios involved in the operation and should not take refs or pins
   on them, but should leave retention to netfslib.

 * ``retry_request()``

   [Optional] Netfslib calls this at the beginning of a retry cycle.  This
   allows the filesystem to examine the state of the request, the subrequests
   in the indicated stream and of its own data and make adjustments or
   renegotiate resources.
   
 * ``invalidate_cache()``

   [Optional] This is called by netfslib to invalidate data stored in the local
   cache in the event that writing to the local cache fails, providing updated
   coherency data that netfs can't provide.
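
The ->start and ->len rule given for ``expand_readahead()`` above can be
sketched as follows.  This is an illustration only: the request struct is a
cut-down stand-in, and ``MYFS_GRANULE`` and ``myfs_expand_readahead()`` are
hypothetical names for a filesystem that wants requests aligned to a fixed
object size::

	/* Stand-in for the kernel struct; only two fields are mocked. */
	struct netfs_io_request {
		long long		start;
		unsigned long long	len;
	};

	#define MYFS_GRANULE 65536ULL	/* hypothetical server object size */

	static void myfs_expand_readahead(struct netfs_io_request *rreq)
	{
		unsigned long long start = (unsigned long long)rreq->start;
		unsigned long long end = start + rreq->len;

		/* Round the start down and the end up to granule boundaries. */
		start -= start % MYFS_GRANULE;
		end = (end + MYFS_GRANULE - 1) / MYFS_GRANULE * MYFS_GRANULE;

		/*
		 * The new region always covers the original one, so ->len
		 * grows by at least as much as ->start shrinks.
		 */
		rreq->start = (long long)start;
		rreq->len = end - start;
	}

For example, a request for 1000 bytes at offset 70000 becomes a request for
65536 bytes at offset 65536, which still contains the original region.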

Terminating a subrequest
------------------------

When a subrequest completes, there are a number of functions that the cache or
filesystem can call to inform netfslib of the status change.  One function is
provided to terminate a write subrequest at the preparation stage and acts
synchronously:

 * ``void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);``

   Indicate that the ->prepare_write() call failed.  The ``error`` field should
   have been updated.

Note that ->prepare_read() can return an error as a read can simply be aborted.
Dealing with writeback failure is trickier.

The other functions are used for subrequests that got as far as being issued:

 * ``void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);``

   Tell netfslib that a read subrequest has terminated.  The ``error``,
   ``flags`` and ``transferred`` fields should have been updated.

 * ``void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);``

   Tell netfslib that a write subrequest has terminated.  Either the amount of
   data processed or the negative error code can be passed in.  This can be
   used as a kiocb completion function.

 * ``void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);``

   This optionally updates netfslib on the incremental progress of a read,
   allowing some folios to be unlocked early; it does not actually terminate
   the subrequest.  The ``transferred`` field should have been
   updated.
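
To show the order of operations for a read, the sketch below fills in the
result fields and only then signals termination.  It is a userspace mock
under several assumptions: the subrequest struct is cut down, a plain ``buf``
pointer stands in for ->io_iter, ``netfs_read_subreq_terminated()`` is
replaced by a stub that records the call, and ``myfs_server_read()`` is a
hypothetical synchronous transport::

	#include <errno.h>
	#include <stddef.h>
	#include <string.h>

	/* Stand-in with only the fields that this sketch touches. */
	struct netfs_io_subrequest {
		unsigned long long	start;
		size_t			len;
		size_t			transferred;
		int			error;
		char			*buf;		/* stand-in for ->io_iter */
		int			terminated;	/* set by the stub below */
	};

	/* Stub for the netfslib call; it only records that it was called. */
	static void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq)
	{
		subreq->terminated = 1;
	}

	static const char myfs_server_data[] = "hello, netfs";

	/* Hypothetical transport: returns bytes copied or a negative errno. */
	static long myfs_server_read(unsigned long long pos, char *to, size_t want)
	{
		size_t size = sizeof(myfs_server_data) - 1;
		size_t n;

		if (pos > size)
			return -EIO;
		n = size - (size_t)pos;
		if (n > want)
			n = want;
		memcpy(to, myfs_server_data + pos, n);
		return (long)n;
	}

	static void myfs_issue_read(struct netfs_io_subrequest *subreq)
	{
		/* Skip whatever an earlier attempt already transferred. */
		unsigned long long pos = subreq->start + subreq->transferred;
		size_t want = subreq->len - subreq->transferred;
		long ret;

		ret = myfs_server_read(pos, subreq->buf + subreq->transferred, want);
		if (ret < 0) {
			subreq->error = (int)ret;
		} else {
			subreq->transferred += (size_t)ret;
			subreq->error = 0;
		}

		/* Results are in place; hand the subrequest back. */
		netfs_read_subreq_terminated(subreq);
	}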

Local Cache API
---------------

Netfslib provides a separate API for a local cache to implement, though its
routines are broadly similar to those of the filesystem request API.

Firstly, the netfs_io_request object contains a place for the cache to hang its
state::

	struct netfs_cache_resources {
		const struct netfs_cache_ops	*ops;
		void				*cache_priv;
		void				*cache_priv2;
		unsigned int			debug_id;
		unsigned int			inval_counter;
	};

This contains an operations table pointer and two private pointers plus the
debug ID of the fscache cookie for tracing purposes and an invalidation counter
that is cranked by calls to ``fscache_invalidate()`` allowing cache subrequests
to be invalidated after completion.

The cache operation table looks like the following::

	struct netfs_cache_ops {
		void (*end_operation)(struct netfs_cache_resources *cres);
		void (*expand_readahead)(struct netfs_cache_resources *cres,
					 loff_t *_start, size_t *_len, loff_t i_size);
		enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
						     loff_t i_size);
		int (*read)(struct netfs_cache_resources *cres,
			    loff_t start_pos,
			    struct iov_iter *iter,
			    bool seek_data,
			    netfs_io_terminated_t term_func,
			    void *term_func_priv);
		void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
		void (*issue_write)(struct netfs_io_subrequest *subreq);
	};

With a termination handler function pointer::

	typedef void (*netfs_io_terminated_t)(void *priv,
					      ssize_t transferred_or_error,
					      bool was_async);
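
A handler matching this typedef might look like the sketch below.  It is a
userspace illustration, not netfslib code: ``struct my_read_state``,
``my_read_done()`` and ``my_cache_complete()`` are hypothetical names, and
the completion is driven synchronously for simplicity::

	#include <stdbool.h>
	#include <stddef.h>
	#include <sys/types.h>		/* ssize_t */

	typedef void (*netfs_io_terminated_t)(void *priv,
					      ssize_t transferred_or_error,
					      bool was_async);

	/* Hypothetical per-read state, passed as the private pointer. */
	struct my_read_state {
		size_t	transferred;
		int	error;
		bool	done;
	};

	/* Split the combined argument into a byte count or an error code. */
	static void my_read_done(void *priv, ssize_t transferred_or_error,
				 bool was_async)
	{
		struct my_read_state *st = priv;

		(void)was_async;
		if (transferred_or_error < 0)
			st->error = (int)transferred_or_error;
		else
			st->transferred += (size_t)transferred_or_error;
		st->done = true;
	}

	/* Hypothetical backend completing a read in the caller's context. */
	static void my_cache_complete(netfs_io_terminated_t term_func,
				      void *term_func_priv, ssize_t result)
	{
		if (term_func)
			term_func(term_func_priv, result, false);
	}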

The methods defined in the table are:

 * ``end_operation()``

   [Required] Called to clean up the resources at the end of the read request.

 * ``expand_readahead()``

   [Optional] Called at the beginning of a readahead operation to allow the
   cache to expand a request in either direction.  This allows the cache to
   size the request appropriately for the cache granularity.

 * ``prepare_read()``

   [Required] Called to configure the next slice of a request.  ->start and
   ->len in the subrequest indicate where and how big the next slice can be;
   the cache gets to reduce the length to match its granularity requirements.

   The function is passed the subrequest, plus the size of the file for
   reference, and adjusts the start and length in the subrequest
   appropriately.  It should return one of:

   * ``NETFS_FILL_WITH_ZEROES``
   * ``NETFS_DOWNLOAD_FROM_SERVER``
   * ``NETFS_READ_FROM_CACHE``
   * ``NETFS_INVALID_READ``

   to indicate whether the slice should just be cleared or whether it should be
   downloaded from the server or read from the cache - or whether slicing
   should be given up at the current point.

 * ``read()``

   [Required] Called to read from the cache.  The start file offset is given
   along with an iterator to read to, which gives the length also.  It can be
   given a hint requesting that it seek forward from that start position for
   data.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function.  The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``prepare_write_subreq()``

   [Required] This is called to allow the cache to limit the size of a
   subrequest.  It may also limit the number of individual regions in the
   iterator, such as required by DIO/DMA.  This information should be set on
   the stream to which the subrequest belongs::

	rreq->io_streams[subreq->stream_nr].sreq_max_len
	rreq->io_streams[subreq->stream_nr].sreq_max_segs

   The filesystem can use this, for example, to chop up a request that has to
   be split across multiple servers or to put multiple writes in flight.

   This is not permitted to return an error.  In the event of failure,
   ``netfs_prepare_write_failed()`` must be called.

 * ``issue_write()``

   [Required] This is used to dispatch a subrequest to the cache for writing.
   In the subrequest, ->start, ->len and ->transferred indicate what data
   should be written to the cache and ->io_iter indicates the buffer to be
   used.

   There is no return value; the ``netfs_write_subrequest_terminated()`` function
   should be called to indicate that the subrequest completed either way.
   ->error, ->transferred and ->flags should be updated before completing.  The
   termination can be done asynchronously.
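
The decision made by the cache's ``prepare_read()`` method, described above,
can be sketched as follows.  This is a simplified userspace model, not the
behaviour of any real cache backend: the enum lists only the four return
values named above, the subrequest struct is cut down, and the cache is
assumed to hold a single hypothetical extent covering bytes 0 to
``MY_CACHED_END``::

	#include <stddef.h>

	/* Stand-in enum listing only the values described above. */
	enum netfs_io_source {
		NETFS_FILL_WITH_ZEROES,
		NETFS_DOWNLOAD_FROM_SERVER,
		NETFS_READ_FROM_CACHE,
		NETFS_INVALID_READ,
	};

	/* Stand-in with only the fields that this sketch touches. */
	struct netfs_io_subrequest {
		long long	start;
		size_t		len;
	};

	#define MY_CACHED_END 8192LL	/* hypothetical: cache holds [0, 8192) */

	static enum netfs_io_source my_prepare_read(struct netfs_io_subrequest *subreq,
						    long long i_size)
	{
		if (subreq->len == 0)
			return NETFS_INVALID_READ;

		/* Nothing exists beyond the end of the file. */
		if (subreq->start >= i_size)
			return NETFS_FILL_WITH_ZEROES;

		if (subreq->start < MY_CACHED_END) {
			/* Trim the slice so that it ends with the cached extent. */
			if (subreq->start + (long long)subreq->len > MY_CACHED_END)
				subreq->len = (size_t)(MY_CACHED_END - subreq->start);
			return NETFS_READ_FROM_CACHE;
		}

		return NETFS_DOWNLOAD_FROM_SERVER;
	}

A slice that starts inside the cached extent but runs past it is shortened, so
the remainder is presented as a separate slice on the next call.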


API Function Reference
======================

.. kernel-doc:: include/linux/netfs.h
.. kernel-doc:: fs/netfs/buffered_read.c