From 18ccb2233fc5f7c27b5be17f5b6585c2fa62d919 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:17 +0100 Subject: docs: filesystems: convert orangefs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/6f438eeff5b029d229197a602bd9b74004fe9b63.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/orangefs.rst | 554 +++++++++++++++++++++++++++++++++ Documentation/filesystems/orangefs.txt | 529 ------------------------------- 3 files changed, 555 insertions(+), 529 deletions(-) create mode 100644 Documentation/filesystems/orangefs.rst delete mode 100644 Documentation/filesystems/orangefs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index fbee77175840..fed53f831192 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -79,6 +79,7 @@ Documentation for filesystem implementations. ocfs2 ocfs2-online-filecheck omfs + orangefs overlayfs virtiofs vfat diff --git a/Documentation/filesystems/orangefs.rst b/Documentation/filesystems/orangefs.rst new file mode 100644 index 000000000000..7d6d4cad73c4 --- /dev/null +++ b/Documentation/filesystems/orangefs.rst @@ -0,0 +1,554 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======== +ORANGEFS +======== + +OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal +for large storage problems faced by HPC, BigData, Streaming Video, +Genomics, Bioinformatics. + +Orangefs, originally called PVFS, was first developed in 1993 by +Walt Ligon and Eric Blumer as a parallel file system for Parallel +Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns +of parallel programs. + +Orangefs features include: + + * Distributes file data among multiple file servers + * Supports simultaneous access by multiple clients + * Stores file data and metadata on servers using local file system + and access methods + * Userspace implementation is easy to install and maintain + * Direct MPI support + * Stateless + + +Mailing List Archives +===================== + +http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/ + + +Mailing List Submissions +======================== + +devel@lists.orangefs.org + + +Documentation +============= + +http://www.orangefs.org/documentation/ + + +Userspace Filesystem Source +=========================== + +http://www.orangefs.org/download + +Orangefs versions prior to 2.9.3 would not be compatible with the +upstream version of the kernel client. + + +Running ORANGEFS On a Single Server +=================================== + +OrangeFS is usually run in large installations with multiple servers and +clients, but a complete filesystem can be run on a single machine for +development and testing. + +On Fedora, install orangefs and orangefs-server:: + + dnf -y install orangefs orangefs-server + +There is an example server configuration file in +/etc/orangefs/orangefs.conf. Change localhost to your hostname if +necessary. + +To generate a filesystem to run xfstests against, see below. + +There is an example client configuration file in /etc/pvfs2tab. It is a +single line. Uncomment it and change the hostname if necessary. This +controls clients which use libpvfs2. This does not control the +pvfs2-client-core. 
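+
+As an illustration, an uncommented /etc/pvfs2tab entry typically looks
+like the single line below; the hostname, port, filesystem name and
+mount point shown here are just the example values used throughout this
+document, so adjust them to match your server configuration::
+
+    tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0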
+
+Create the filesystem::
+
+    pvfs2-server -f /etc/orangefs/orangefs.conf
+
+Start the server::
+
+    systemctl start orangefs-server
+
+Test the server::
+
+    pvfs2-ping -m /pvfsmnt
+
+Start the client. The module must be compiled in or loaded before this
+point::
+
+    systemctl start orangefs-client
+
+Mount the filesystem::
+
+    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
+
+
+Building ORANGEFS on a Single Server
+====================================
+
+Where OrangeFS cannot be installed from distribution packages, it may be
+built from source.
+
+You can omit --prefix if you don't care that things are sprinkled around
+in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by
+default; we will probably change the default to LMDB soon.
+
+::
+
+    ./configure --prefix=/opt/ofs --with-db-backend=lmdb
+
+    make
+
+    make install
+
+Create an orangefs config file::
+
+    /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
+
+Create an /etc/pvfs2tab file::
+
+    echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
+        /etc/pvfs2tab
+
+Create the mount point you specified in the tab file if needed::
+
+    mkdir /pvfsmnt
+
+Bootstrap the server::
+
+    /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf
+
+Start the server::
+
+    /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf
+
+Now the server should be running. Pvfs2-ls is a simple
+test to verify that the server is running::
+
+    /opt/ofs/bin/pvfs2-ls /pvfsmnt
+
+If stuff seems to be working, load the kernel module and
+turn on the client core::
+
+    /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core
+
+Mount your filesystem::
+
+    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
+
+
+Running xfstests
+================
+
+It is useful to use a scratch filesystem with xfstests. This can be
+done with only one server.
+
+Make a second copy of the FileSystem section in the server configuration
+file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch.
+Change the ID to something other than the ID of the first FileSystem
+section (2 is usually a good choice).
+
+Then there are two FileSystem sections: orangefs and scratch.
+
+This change should be made before creating the filesystem.
+
+::
+
+    pvfs2-server -f /etc/orangefs/orangefs.conf
+
+To run xfstests, create /etc/xfsqa.config::
+
+    TEST_DIR=/orangefs
+    TEST_DEV=tcp://localhost:3334/orangefs
+    SCRATCH_MNT=/scratch
+    SCRATCH_DEV=tcp://localhost:3334/scratch
+
+Then xfstests can be run::
+
+    ./check -pvfs2
+
+
+Options
+=======
+
+The following mount options are accepted:
+
+  acl
+    Allow the use of Access Control Lists on files and directories.
+
+  intr
+    Some operations between the kernel client and the user space
+    filesystem can be interruptible, such as changes in debug levels
+    and the setting of tunable parameters.
+
+  local_lock
+    Enable posix locking from the perspective of "this" kernel. The
+    default file_operations lock action is to return ENOSYS. Posix
+    locking kicks in if the filesystem is mounted with -o local_lock.
+    Distributed locking is being worked on for the future.
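+
+For example, to mount the example filesystem used earlier in this
+document with ACLs and local locking enabled (the server address,
+filesystem name and mount point below are illustrative), one might
+run::
+
+    mount -t pvfs2 -o acl,local_lock tcp://localhost:3334/orangefs /pvfsmnt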
+
+
+Debugging
+=========
+
+If you want the debug (GOSSIP) statements in a particular
+source file (inode.c for example) to go to syslog::
+
+    echo inode > /sys/kernel/debug/orangefs/kernel-debug
+
+No debugging (the default)::
+
+    echo none > /sys/kernel/debug/orangefs/kernel-debug
+
+Debugging from several source files::
+
+    echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug
+
+All debugging::
+
+    echo all > /sys/kernel/debug/orangefs/kernel-debug
+
+Get a list of all debugging keywords::
+
+    cat /sys/kernel/debug/orangefs/debug-help
+
+
+Protocol between Kernel Module and Userspace
+============================================
+
+Orangefs is a user space filesystem and an associated kernel module.
+We'll just refer to the user space part of Orangefs as "userspace"
+from here on out. Orangefs descends from PVFS, and userspace code
+still uses PVFS for function and variable names. Userspace typedefs
+many of the important structures. Function and variable names in
+the kernel module have been transitioned to "orangefs", and The Linux
+Coding Style avoids typedefs, so kernel module structures that
+correspond to userspace structures are not typedefed.
+
+The kernel module implements a pseudo device that userspace
+can read from and write to. Userspace can also manipulate the
+kernel module through the pseudo device with ioctl.
+
+The Bufmap
+----------
+
+At startup userspace allocates two page-size-aligned (posix_memalign)
+mlocked memory buffers: one is used for IO and one is used for readdir
+operations. The IO buffer is 41943040 bytes and the readdir buffer is
+4194304 bytes. Each buffer contains logical chunks, or partitions, and
+a pointer to each buffer is added to its own PVFS_dev_map_desc structure
+which also describes its total size, as well as the size and number of
+the partitions.
+
+A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
+mapping routine in the kernel module with an ioctl. The structure is
+copied from user space to kernel space with copy_from_user and is used
+to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
+then contains:
+
+  * refcnt - a reference counter
+  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
+    partition size, which represents the filesystem's block size and
+    is used for s_blocksize in super blocks.
+  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
+    partitions in the IO buffer.
+  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
+  * total_size - the total size of the IO buffer.
+  * page_count - the number of 4096 byte pages in the IO buffer.
+  * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
+    of kcalloced memory. This memory is used as an array of pointers
+    to each of the pages in the IO buffer through a call to get_user_pages.
+  * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
+    bytes of kcalloced memory. This memory is further initialized:
+
+    user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
+    structure. user_desc->ptr points to the IO buffer.
+
+    ::
+
+      pages_per_desc = bufmap->desc_size / PAGE_SIZE
+      offset = 0
+
+      bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
+      bufmap->desc_array[0].array_count = pages_per_desc = 1024
+      bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
+      offset += 1024
+      .
+      .
+      .
+ bufmap->desc_array[9].page_array = &bufmap->page_array[offset] + bufmap->desc_array[9].array_count = pages_per_desc = 1024 + bufmap->desc_array[9].uaddr = (user_desc->ptr) + + (9 * 1024 * 4096) + offset += 1024 + + * buffer_index_array - a desc_count sized array of ints, used to + indicate which of the IO buffer's partitions are available to use. + * buffer_index_lock - a spinlock to protect buffer_index_array during update. + * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element + int array used to indicate which of the readdir buffer's partitions are + available to use. + * readdir_index_lock - a spinlock to protect readdir_index_array during + update. + +Operations +---------- + +The kernel module builds an "op" (struct orangefs_kernel_op_s) when it +needs to communicate with userspace. Part of the op contains the "upcall" +which expresses the request to userspace. Part of the op eventually +contains the "downcall" which expresses the results of the request. + +The slab allocator is used to keep a cache of op structures handy. + +At init time the kernel module defines and initializes a request list +and an in_progress hash table to keep track of all the ops that are +in flight at any given time. + +Ops are stateful: + + * unknown + - op was just initialized + * waiting + - op is on request_list (upward bound) + * inprogr + - op is in progress (waiting for downcall) + * serviced + - op has matching downcall; ok + * purged + - op has to start a timer since client-core + exited uncleanly before servicing op + * given up + - submitter has given up waiting for it + +When some arbitrary userspace program needs to perform a +filesystem operation on Orangefs (readdir, I/O, create, whatever) +an op structure is initialized and tagged with a distinguishing ID +number. The upcall part of the op is filled out, and the op is +passed to the "service_operation" function. + +Service_operation changes the op's state to "waiting", puts +it on the request list, and signals the Orangefs file_operations.poll +function through a wait queue. Userspace is polling the pseudo-device +and thus becomes aware of the upcall request that needs to be read. + +When the Orangefs file_operations.read function is triggered, the +request list is searched for an op that seems ready-to-process. +The op is removed from the request list. The tag from the op and +the filled-out upcall struct are copy_to_user'ed back to userspace. + +If any of these (and some additional protocol) copy_to_users fail, +the op's state is set to "waiting" and the op is added back to +the request list. Otherwise, the op's state is changed to "in progress", +and the op is hashed on its tag and put onto the end of a list in the +in_progress hash table at the index the tag hashed to. + +When userspace has assembled the response to the upcall, it +writes the response, which includes the distinguishing tag, back to +the pseudo device in a series of io_vecs. This triggers the Orangefs +file_operations.write_iter function to find the op with the associated +tag and remove it from the in_progress hash table. As long as the op's +state is not "canceled" or "given up", its state is set to "serviced". +The file_operations.write_iter function returns to the waiting vfs, +and back to service_operation through wait_for_matching_downcall. + +Service operation returns to its caller with the op's downcall +part (the response to the upcall) filled out. + +The "client-core" is the bridge between the kernel module and +userspace. 
The client-core is a daemon. The client-core has an
+associated watchdog daemon. If the client-core is ever signaled
+to die, the watchdog daemon restarts the client-core. Even though
+the client-core is restarted "right away", there is a period of
+time during such an event that the client-core is dead. A dead client-core
+can't be triggered by the Orangefs file_operations.poll function.
+Ops that pass through service_operation during a "dead spell" can time out
+on the wait queue and one attempt is made to recycle them. Obviously,
+if the client-core stays dead too long, the arbitrary userspace processes
+trying to use Orangefs will be negatively affected. Waiting ops
+that can't be serviced will be removed from the request list and
+have their states set to "given up". In-progress ops that can't
+be serviced will be removed from the in_progress hash table and
+have their states set to "given up".
+
+Readdir and I/O ops are atypical with respect to their payloads.
+
+  - readdir ops use the smaller of the two pre-allocated pre-partitioned
+    memory buffers. The readdir buffer is only available to userspace.
+    The kernel module obtains an index to a free partition before launching
+    a readdir op. Userspace deposits the results into the indexed partition
+    and then writes them back to the pvfs device.
+
+  - io (read and write) ops use the larger of the two pre-allocated
+    pre-partitioned memory buffers. The IO buffer is accessible from
+    both userspace and the kernel module. The kernel module obtains an
+    index to a free partition before launching an io op. The kernel module
+    deposits write data into the indexed partition, to be consumed
+    directly by userspace. Userspace deposits the results of read
+    requests into the indexed partition, to be consumed directly
+    by the kernel module.
+
+Responses to kernel requests are all packaged in pvfs2_downcall_t
+structs. Besides a few other members, pvfs2_downcall_t contains a
+union of structs, each of which is associated with a particular
+response type.
+
+The members outside of the union are:
+
+  ``int32_t type``
+    - type of operation.
+  ``int32_t status``
+    - return code for the operation.
+  ``int64_t trailer_size``
+    - 0 unless readdir operation.
+  ``char *trailer_buf``
+    - initialized to NULL, used during readdir operations.
+
+The appropriate member inside the union is filled out for any
+particular response.
+
+  PVFS2_VFS_OP_FILE_IO
+    fill a pvfs2_io_response_t
+
+  PVFS2_VFS_OP_LOOKUP
+    fill a PVFS_object_kref
+
+  PVFS2_VFS_OP_CREATE
+    fill a PVFS_object_kref
+
+  PVFS2_VFS_OP_SYMLINK
+    fill a PVFS_object_kref
+
+  PVFS2_VFS_OP_GETATTR
+    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
+    fill in a string with the link target when the object is a symlink.
+
+  PVFS2_VFS_OP_MKDIR
+    fill a PVFS_object_kref
+
+  PVFS2_VFS_OP_STATFS
+    fill a pvfs2_statfs_response_t with useless info. It is hard for
+    us to know, in a timely fashion, these statistics about our
+    distributed network filesystem.
+
+  PVFS2_VFS_OP_FS_MOUNT
+    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
+    except its members are in a different order and "__pad1" is replaced
+    with "id".
+
+  PVFS2_VFS_OP_GETXATTR
+    fill a pvfs2_getxattr_response_t
+
+  PVFS2_VFS_OP_LISTXATTR
+    fill a pvfs2_listxattr_response_t
+
+  PVFS2_VFS_OP_PARAM
+    fill a pvfs2_param_response_t
+
+  PVFS2_VFS_OP_PERF_COUNT
+    fill a pvfs2_perf_count_response_t
+
+  PVFS2_VFS_OP_FSKEY
+    fill a pvfs2_fs_key_response_t
+
+  PVFS2_VFS_OP_READDIR
+    jam everything needed to represent a pvfs2_readdir_response_t into
+    the readdir buffer descriptor specified in the upcall.
+
+Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
+made by the kernel side.
+
+A buffer_list containing:
+
+  - a pointer to the prepared response to the request from the
+    kernel (struct pvfs2_downcall_t).
+  - and also, in the case of a readdir request, a pointer to a
+    buffer containing descriptors for the objects in the target
+    directory.
+
+... is sent to the function (PINT_dev_write_list) which performs
+the writev.
+
+PINT_dev_write_list has a local iovec array: struct iovec io_array[10];
+
+The first four elements of io_array are initialized like this for all
+responses::
+
+  io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
+  io_array[0].iov_len = sizeof(int32_t)
+
+  io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
+  io_array[1].iov_len = sizeof(int32_t)
+
+  io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
+  io_array[2].iov_len = sizeof(int64_t)
+
+  io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
+    of global variable vfs_request (vfs_request_t)
+  io_array[3].iov_len = sizeof(pvfs2_downcall_t)
+
+Readdir responses initialize the fifth element of io_array like this::
+
+  io_array[4].iov_base = contents of member trailer_buf (char *)
+    from out_downcall member of global variable
+    vfs_request
+  io_array[4].iov_len = contents of member trailer_size (PVFS_size)
+    from out_downcall member of global variable
+    vfs_request
+
+Orangefs exploits the dcache in order to avoid sending redundant
+requests to userspace. We keep object inode attributes up-to-date with
+orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to
+help it decide whether or not to update an inode: "new" and "bypass".
+Orangefs keeps private data in an object's inode that includes a short
+timeout value, getattr_time, which allows any iteration of
+orangefs_inode_getattr to know how long it has been since the inode was
+updated. When the object is not new (new == 0) and the bypass flag is not
+set (bypass == 0), orangefs_inode_getattr returns without updating the inode
+if getattr_time has not timed out. Getattr_time is updated each time the
+inode is updated.
+
+Creation of a new object (file, dir, sym-link) includes the evaluation of
+its pathname, resulting in a negative directory entry for the object.
+A new inode is allocated and associated with the dentry, turning it from
+a negative dentry into a "productive full member of society". Orangefs
+obtains the new inode from Linux with new_inode() and associates
+the inode with the dentry by sending the pair back to Linux with
+d_instantiate().
+
+The evaluation of a pathname for an object resolves to its corresponding
+dentry. If there is no corresponding dentry, one is created for it in
+the dcache. Whenever a dentry is modified or verified, Orangefs stores a
+short timeout value in the dentry's d_time, and the dentry will be trusted
+for that amount of time.
Orangefs is a network filesystem, and objects +can potentially change out-of-band with any particular Orangefs kernel module +instance, so trusting a dentry is risky. The alternative to trusting +dentries is to always obtain the needed information from userspace - at +least a trip to the client-core, maybe to the servers. Obtaining information +from a dentry is cheap, obtaining it from userspace is relatively expensive, +hence the motivation to use the dentry when possible. + +The timeout values d_time and getattr_time are jiffy based, and the +code is designed to avoid the jiffy-wrap problem:: + + "In general, if the clock may have wrapped around more than once, there + is no way to tell how much time has elapsed. However, if the times t1 + and t2 are known to be fairly close, we can reliably compute the + difference in a way that takes into account the possibility that the + clock may have wrapped between times." + +from course notes by instructor Andy Wang + diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.txt deleted file mode 100644 index f4ba94950e3f..000000000000 --- a/Documentation/filesystems/orangefs.txt +++ /dev/null @@ -1,529 +0,0 @@ -ORANGEFS -======== - -OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal -for large storage problems faced by HPC, BigData, Streaming Video, -Genomics, Bioinformatics. - -Orangefs, originally called PVFS, was first developed in 1993 by -Walt Ligon and Eric Blumer as a parallel file system for Parallel -Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns -of parallel programs. - -Orangefs features include: - - * Distributes file data among multiple file servers - * Supports simultaneous access by multiple clients - * Stores file data and metadata on servers using local file system - and access methods - * Userspace implementation is easy to install and maintain - * Direct MPI support - * Stateless - - -MAILING LIST ARCHIVES -===================== - -http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/ - - -MAILING LIST SUBMISSIONS -======================== - -devel@lists.orangefs.org - - -DOCUMENTATION -============= - -http://www.orangefs.org/documentation/ - - -USERSPACE FILESYSTEM SOURCE -=========================== - -http://www.orangefs.org/download - -Orangefs versions prior to 2.9.3 would not be compatible with the -upstream version of the kernel client. - - -RUNNING ORANGEFS ON A SINGLE SERVER -=================================== - -OrangeFS is usually run in large installations with multiple servers and -clients, but a complete filesystem can be run on a single machine for -development and testing. - -On Fedora, install orangefs and orangefs-server. - -dnf -y install orangefs orangefs-server - -There is an example server configuration file in -/etc/orangefs/orangefs.conf. Change localhost to your hostname if -necessary. - -To generate a filesystem to run xfstests against, see below. - -There is an example client configuration file in /etc/pvfs2tab. It is a -single line. Uncomment it and change the hostname if necessary. This -controls clients which use libpvfs2. This does not control the -pvfs2-client-core. - -Create the filesystem. - -pvfs2-server -f /etc/orangefs/orangefs.conf - -Start the server. - -systemctl start orangefs-server - -Test the server. - -pvfs2-ping -m /pvfsmnt - -Start the client. The module must be compiled in or loaded before this -point. - -systemctl start orangefs-client - -Mount the filesystem. 
- -mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt - - -BUILDING ORANGEFS ON A SINGLE SERVER -==================================== - -Where OrangeFS cannot be installed from distribution packages, it may be -built from source. - -You can omit --prefix if you don't care that things are sprinkled around -in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by -default, we will probably be changing the default to LMDB soon. - -./configure --prefix=/opt/ofs --with-db-backend=lmdb - -make - -make install - -Create an orangefs config file. - -/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf - -Create an /etc/pvfs2tab file. - -echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \ - /etc/pvfs2tab - -Create the mount point you specified in the tab file if needed. - -mkdir /pvfsmnt - -Bootstrap the server. - -/opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf - -Start the server. - -/opt/osf/sbin/pvfs2-server /etc/pvfs2.conf - -Now the server should be running. Pvfs2-ls is a simple -test to verify that the server is running. - -/opt/ofs/bin/pvfs2-ls /pvfsmnt - -If stuff seems to be working, load the kernel module and -turn on the client core. - -/opt/ofs/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core - -Mount your filesystem. - -mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt - - -RUNNING XFSTESTS -================ - -It is useful to use a scratch filesystem with xfstests. This can be -done with only one server. - -Make a second copy of the FileSystem section in the server configuration -file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch. -Change the ID to something other than the ID of the first FileSystem -section (2 is usually a good choice). - -Then there are two FileSystem sections: orangefs and scratch. - -This change should be made before creating the filesystem. - -pvfs2-server -f /etc/orangefs/orangefs.conf - -To run xfstests, create /etc/xfsqa.config. - -TEST_DIR=/orangefs -TEST_DEV=tcp://localhost:3334/orangefs -SCRATCH_MNT=/scratch -SCRATCH_DEV=tcp://localhost:3334/scratch - -Then xfstests can be run - -./check -pvfs2 - - -OPTIONS -======= - -The following mount options are accepted: - - acl - Allow the use of Access Control Lists on files and directories. - - intr - Some operations between the kernel client and the user space - filesystem can be interruptible, such as changes in debug levels - and the setting of tunable parameters. - - local_lock - Enable posix locking from the perspective of "this" kernel. The - default file_operations lock action is to return ENOSYS. Posix - locking kicks in if the filesystem is mounted with -o local_lock. - Distributed locking is being worked on for the future. - - -DEBUGGING -========= - -If you want the debug (GOSSIP) statements in a particular -source file (inode.c for example) go to syslog: - - echo inode > /sys/kernel/debug/orangefs/kernel-debug - -No debugging (the default): - - echo none > /sys/kernel/debug/orangefs/kernel-debug - -Debugging from several source files: - - echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug - -All debugging: - - echo all > /sys/kernel/debug/orangefs/kernel-debug - -Get a list of all debugging keywords: - - cat /sys/kernel/debug/orangefs/debug-help - - -PROTOCOL BETWEEN KERNEL MODULE AND USERSPACE -============================================ - -Orangefs is a user space filesystem and an associated kernel module. -We'll just refer to the user space part of Orangefs as "userspace" -from here on out. 
Orangefs descends from PVFS, and userspace code -still uses PVFS for function and variable names. Userspace typedefs -many of the important structures. Function and variable names in -the kernel module have been transitioned to "orangefs", and The Linux -Coding Style avoids typedefs, so kernel module structures that -correspond to userspace structures are not typedefed. - -The kernel module implements a pseudo device that userspace -can read from and write to. Userspace can also manipulate the -kernel module through the pseudo device with ioctl. - -THE BUFMAP: - -At startup userspace allocates two page-size-aligned (posix_memalign) -mlocked memory buffers, one is used for IO and one is used for readdir -operations. The IO buffer is 41943040 bytes and the readdir buffer is -4194304 bytes. Each buffer contains logical chunks, or partitions, and -a pointer to each buffer is added to its own PVFS_dev_map_desc structure -which also describes its total size, as well as the size and number of -the partitions. - -A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a -mapping routine in the kernel module with an ioctl. The structure is -copied from user space to kernel space with copy_from_user and is used -to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which -then contains: - - * refcnt - a reference counter - * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's - partition size, which represents the filesystem's block size and - is used for s_blocksize in super blocks. - * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of - partitions in the IO buffer. - * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks. - * total_size - the total size of the IO buffer. - * page_count - the number of 4096 byte pages in the IO buffer. - * page_array - a pointer to page_count * (sizeof(struct page*)) bytes - of kcalloced memory. This memory is used as an array of pointers - to each of the pages in the IO buffer through a call to get_user_pages. - * desc_array - a pointer to desc_count * (sizeof(struct orangefs_bufmap_desc)) - bytes of kcalloced memory. This memory is further intialized: - - user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc - structure. user_desc->ptr points to the IO buffer. - - pages_per_desc = bufmap->desc_size / PAGE_SIZE - offset = 0 - - bufmap->desc_array[0].page_array = &bufmap->page_array[offset] - bufmap->desc_array[0].array_count = pages_per_desc = 1024 - bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096) - offset += 1024 - . - . - . - bufmap->desc_array[9].page_array = &bufmap->page_array[offset] - bufmap->desc_array[9].array_count = pages_per_desc = 1024 - bufmap->desc_array[9].uaddr = (user_desc->ptr) + - (9 * 1024 * 4096) - offset += 1024 - - * buffer_index_array - a desc_count sized array of ints, used to - indicate which of the IO buffer's partitions are available to use. - * buffer_index_lock - a spinlock to protect buffer_index_array during update. - * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element - int array used to indicate which of the readdir buffer's partitions are - available to use. - * readdir_index_lock - a spinlock to protect readdir_index_array during - update. - -OPERATIONS: - -The kernel module builds an "op" (struct orangefs_kernel_op_s) when it -needs to communicate with userspace. Part of the op contains the "upcall" -which expresses the request to userspace. 
Part of the op eventually -contains the "downcall" which expresses the results of the request. - -The slab allocator is used to keep a cache of op structures handy. - -At init time the kernel module defines and initializes a request list -and an in_progress hash table to keep track of all the ops that are -in flight at any given time. - -Ops are stateful: - - * unknown - op was just initialized - * waiting - op is on request_list (upward bound) - * inprogr - op is in progress (waiting for downcall) - * serviced - op has matching downcall; ok - * purged - op has to start a timer since client-core - exited uncleanly before servicing op - * given up - submitter has given up waiting for it - -When some arbitrary userspace program needs to perform a -filesystem operation on Orangefs (readdir, I/O, create, whatever) -an op structure is initialized and tagged with a distinguishing ID -number. The upcall part of the op is filled out, and the op is -passed to the "service_operation" function. - -Service_operation changes the op's state to "waiting", puts -it on the request list, and signals the Orangefs file_operations.poll -function through a wait queue. Userspace is polling the pseudo-device -and thus becomes aware of the upcall request that needs to be read. - -When the Orangefs file_operations.read function is triggered, the -request list is searched for an op that seems ready-to-process. -The op is removed from the request list. The tag from the op and -the filled-out upcall struct are copy_to_user'ed back to userspace. - -If any of these (and some additional protocol) copy_to_users fail, -the op's state is set to "waiting" and the op is added back to -the request list. Otherwise, the op's state is changed to "in progress", -and the op is hashed on its tag and put onto the end of a list in the -in_progress hash table at the index the tag hashed to. - -When userspace has assembled the response to the upcall, it -writes the response, which includes the distinguishing tag, back to -the pseudo device in a series of io_vecs. This triggers the Orangefs -file_operations.write_iter function to find the op with the associated -tag and remove it from the in_progress hash table. As long as the op's -state is not "canceled" or "given up", its state is set to "serviced". -The file_operations.write_iter function returns to the waiting vfs, -and back to service_operation through wait_for_matching_downcall. - -Service operation returns to its caller with the op's downcall -part (the response to the upcall) filled out. - -The "client-core" is the bridge between the kernel module and -userspace. The client-core is a daemon. The client-core has an -associated watchdog daemon. If the client-core is ever signaled -to die, the watchdog daemon restarts the client-core. Even though -the client-core is restarted "right away", there is a period of -time during such an event that the client-core is dead. A dead client-core -can't be triggered by the Orangefs file_operations.poll function. -Ops that pass through service_operation during a "dead spell" can timeout -on the wait queue and one attempt is made to recycle them. Obviously, -if the client-core stays dead too long, the arbitrary userspace processes -trying to use Orangefs will be negatively affected. Waiting ops -that can't be serviced will be removed from the request list and -have their states set to "given up". In-progress ops that can't -be serviced will be removed from the in_progress hash table and -have their states set to "given up". 
- -Readdir and I/O ops are atypical with respect to their payloads. - - - readdir ops use the smaller of the two pre-allocated pre-partitioned - memory buffers. The readdir buffer is only available to userspace. - The kernel module obtains an index to a free partition before launching - a readdir op. Userspace deposits the results into the indexed partition - and then writes them to back to the pvfs device. - - - io (read and write) ops use the larger of the two pre-allocated - pre-partitioned memory buffers. The IO buffer is accessible from - both userspace and the kernel module. The kernel module obtains an - index to a free partition before launching an io op. The kernel module - deposits write data into the indexed partition, to be consumed - directly by userspace. Userspace deposits the results of read - requests into the indexed partition, to be consumed directly - by the kernel module. - -Responses to kernel requests are all packaged in pvfs2_downcall_t -structs. Besides a few other members, pvfs2_downcall_t contains a -union of structs, each of which is associated with a particular -response type. - -The several members outside of the union are: - - int32_t type - type of operation. - - int32_t status - return code for the operation. - - int64_t trailer_size - 0 unless readdir operation. - - char *trailer_buf - initialized to NULL, used during readdir operations. - -The appropriate member inside the union is filled out for any -particular response. - - PVFS2_VFS_OP_FILE_IO - fill a pvfs2_io_response_t - - PVFS2_VFS_OP_LOOKUP - fill a PVFS_object_kref - - PVFS2_VFS_OP_CREATE - fill a PVFS_object_kref - - PVFS2_VFS_OP_SYMLINK - fill a PVFS_object_kref - - PVFS2_VFS_OP_GETATTR - fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need) - fill in a string with the link target when the object is a symlink. - - PVFS2_VFS_OP_MKDIR - fill a PVFS_object_kref - - PVFS2_VFS_OP_STATFS - fill a pvfs2_statfs_response_t with useless info . It is hard for - us to know, in a timely fashion, these statistics about our - distributed network filesystem. - - PVFS2_VFS_OP_FS_MOUNT - fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref - except its members are in a different order and "__pad1" is replaced - with "id". - - PVFS2_VFS_OP_GETXATTR - fill a pvfs2_getxattr_response_t - - PVFS2_VFS_OP_LISTXATTR - fill a pvfs2_listxattr_response_t - - PVFS2_VFS_OP_PARAM - fill a pvfs2_param_response_t - - PVFS2_VFS_OP_PERF_COUNT - fill a pvfs2_perf_count_response_t - - PVFS2_VFS_OP_FSKEY - file a pvfs2_fs_key_response_t - - PVFS2_VFS_OP_READDIR - jamb everything needed to represent a pvfs2_readdir_response_t into - the readdir buffer descriptor specified in the upcall. - -Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests -made by the kernel side. - -A buffer_list containing: - - a pointer to the prepared response to the request from the - kernel (struct pvfs2_downcall_t). - - and also, in the case of a readdir request, a pointer to a - buffer containing descriptors for the objects in the target - directory. -... is sent to the function (PINT_dev_write_list) which performs -the writev. 
- -PINT_dev_write_list has a local iovec array: struct iovec io_array[10]; - -The first four elements of io_array are initialized like this for all -responses: - - io_array[0].iov_base = address of local variable "proto_ver" (int32_t) - io_array[0].iov_len = sizeof(int32_t) - - io_array[1].iov_base = address of global variable "pdev_magic" (int32_t) - io_array[1].iov_len = sizeof(int32_t) - - io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t) - io_array[2].iov_len = sizeof(int64_t) - - io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t) - of global variable vfs_request (vfs_request_t) - io_array[3].iov_len = sizeof(pvfs2_downcall_t) - -Readdir responses initialize the fifth element io_array like this: - - io_array[4].iov_base = contents of member trailer_buf (char *) - from out_downcall member of global variable - vfs_request - io_array[4].iov_len = contents of member trailer_size (PVFS_size) - from out_downcall member of global variable - vfs_request - -Orangefs exploits the dcache in order to avoid sending redundant -requests to userspace. We keep object inode attributes up-to-date with -orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to -help it decide whether or not to update an inode: "new" and "bypass". -Orangefs keeps private data in an object's inode that includes a short -timeout value, getattr_time, which allows any iteration of -orangefs_inode_getattr to know how long it has been since the inode was -updated. When the object is not new (new == 0) and the bypass flag is not -set (bypass == 0) orangefs_inode_getattr returns without updating the inode -if getattr_time has not timed out. Getattr_time is updated each time the -inode is updated. - -Creation of a new object (file, dir, sym-link) includes the evaluation of -its pathname, resulting in a negative directory entry for the object. -A new inode is allocated and associated with the dentry, turning it from -a negative dentry into a "productive full member of society". Orangefs -obtains the new inode from Linux with new_inode() and associates -the inode with the dentry by sending the pair back to Linux with -d_instantiate(). - -The evaluation of a pathname for an object resolves to its corresponding -dentry. If there is no corresponding dentry, one is created for it in -the dcache. Whenever a dentry is modified or verified Orangefs stores a -short timeout value in the dentry's d_time, and the dentry will be trusted -for that amount of time. Orangefs is a network filesystem, and objects -can potentially change out-of-band with any particular Orangefs kernel module -instance, so trusting a dentry is risky. The alternative to trusting -dentries is to always obtain the needed information from userspace - at -least a trip to the client-core, maybe to the servers. Obtaining information -from a dentry is cheap, obtaining it from userspace is relatively expensive, -hence the motivation to use the dentry when possible. - -The timeout values d_time and getattr_time are jiffy based, and the -code is designed to avoid the jiffy-wrap problem: - -"In general, if the clock may have wrapped around more than once, there -is no way to tell how much time has elapsed. However, if the times t1 -and t2 are known to be fairly close, we can reliably compute the -difference in a way that takes into account the possibility that the -clock may have wrapped between times." - - from course notes by instructor Andy Wang - -- cgit