summaryrefslogtreecommitdiff
path: root/net/ceph/messenger.c
AgeCommit message (Collapse)Author
2012-06-06libceph: have messages point to their connectionAlex Elder
When a ceph message is queued for sending it is placed on a list of pending messages (ceph_connection->out_queue). When they are actually sent over the wire, they are moved from that list to another (ceph_connection->out_sent). When acknowledgement for the message is received, it is removed from the sent messages list. During that entire time the message is "in the possession" of a single ceph connection. Keep track of that connection in the message. This will be used in the next patch (and is a helpful bit of information for debugging anyway). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06libceph: tweak ceph_alloc_msg()Alex Elder
The function ceph_alloc_msg() is only used to allocate a message that will be assigned to a connection's in_msg pointer. Rename the function so this implied usage is more clear. In addition, make that assignment inside the function (again, since that's precisely what it's intended to be used for). This allows us to return what is now provided via the passed-in address of a "skip" variable. The return type is now Boolean to be explicit that there are only two possible outcomes. Make sure the result of an ->alloc_msg method call always sets the value of *skip properly. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06libceph: fully initialize connection in con_init()Alex Elder
Move the initialization of a ceph connection's private pointer, operations vector pointer, and peer name information into ceph_con_init(). Rearrange the arguments so the connection pointer is first. Hide the byte-swapping of the peer entity number inside ceph_con_init() Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01libceph: set CLOSED state bit in con_initAlex Elder
Once a connection is fully initialized, it is really in a CLOSED state, so make that explicit by setting the bit in its state field. It is possible for a connection in NEGOTIATING state to get a failure, leading to ceph_fault() and ultimately ceph_con_close(). Clear that bits if it is set in that case, to reflect that the connection truly is closed and is no longer participating in a connect sequence. Issue a warning if ceph_con_open() is called on a connection that is not in CLOSED state. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01libceph: start tracking connection socket stateAlex Elder
Start explicitly keeping track of the state of a ceph connection's socket, separate from the state of the connection itself. Create placeholder functions to encapsulate the state transitions. -------- | NEW* | transient initial state -------- | con_sock_state_init() v ---------- | CLOSED | initialized, but no socket (and no ---------- TCP connection) ^ \ | \ con_sock_state_connecting() | ---------------------- | \ + con_sock_state_closed() \ |\ \ | \ \ | ----------- \ | | CLOSING | socket event; \ | ----------- await close \ | ^ | | | | | + con_sock_state_closing() | | / \ | | / --------------- | | / \ v | / -------------- | / -----------------| CONNECTING | socket created, TCP | | / -------------- connect initiated | | | con_sock_state_connected() | | v ------------- | CONNECTED | TCP connection established ------------- Make the socket state an atomic variable, reinforcing that it's a distinct transtion with no possible "intermediate/both" states. This is almost certainly overkill at this point, though the transitions into CONNECTED and CLOSING state do get called via socket callback (the rest of the transitions occur with the connection mutex held). We can back out the atomicity later. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil<sage@inktank.com>
2012-06-01libceph: start separating connection flags from stateAlex Elder
A ceph_connection holds a mixture of connection state (as in "state machine" state) and connection flags in a single "state" field. To make the distinction more clear, define a new "flags" field and use it rather than the "state" field to hold Boolean flag values. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil<sage@inktank.com>
2012-06-01libceph: embed ceph messenger structure in ceph_clientAlex Elder
A ceph client has a pointer to a ceph messenger structure in it. There is always exactly one ceph messenger for a ceph client, so there is no need to allocate it separate from the ceph client structure. Switch the ceph_client structure to embed its ceph_messenger structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01libceph: rename kvec_reset and kvec_add functionsAlex Elder
The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add() are entirely private functions, so drop the "ceph_" prefix in their name to make them slightly more wieldy. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01libceph: rename socket callbacksAlex Elder
Change the names of the three socket callback functions to make it more obvious they're specifically associated with a connection's socket (not the ceph connection that uses it). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01libceph: kill bad_proto ceph connection opAlex Elder
No code sets a bad_proto method in its ceph connection operations vector, so just get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-06-01libceph: eliminate connection state "DEAD"Alex Elder
The ceph connection state "DEAD" is never set and is therefore not needed. Eliminate it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-05-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-clientLinus Torvalds
Pull ceph updates from Sage Weil: "There are some updates and cleanups to the CRUSH placement code, a bug fix with incremental maps, several cleanups and fixes from Josh Durgin in the RBD block device code, a series of cleanups and bug fixes from Alex Elder in the messenger code, and some miscellaneous bounds checking and gfp cleanups/fixes." Fix up trivial conflicts in net/ceph/{messenger.c,osdmap.c} due to the networking people preferring "unsigned int" over just "unsigned". * git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (45 commits) libceph: fix pg_temp updates libceph: avoid unregistering osd request when not registered ceph: add auth buf in prepare_write_connect() ceph: rename prepare_connect_authorizer() ceph: return pointer from prepare_connect_authorizer() ceph: use info returned by get_authorizer ceph: have get_authorizer methods return pointers ceph: ensure auth ops are defined before use ceph: messenger: reduce args to create_authorizer ceph: define ceph_auth_handshake type ceph: messenger: check return from get_authorizer ceph: messenger: rework prepare_connect_authorizer() ceph: messenger: check prepare_write_connect() result ceph: don't set WRITE_PENDING too early ceph: drop msgr argument from prepare_write_connect() ceph: messenger: send banner in process_connect() ceph: messenger: reset connection kvec caller libceph: don't reset kvec in prepare_write_banner() ceph: ignore preferred_osd field ceph: fully initialize new layout ...
2012-05-18ceph: add auth buf in prepare_write_connect()Alex Elder
Move the addition of the authorizer buffer to a connection's out_kvec out of get_connect_authorizer() and into its caller. This way, the caller--prepare_write_connect()--can avoid adding the connect header to out_kvec before it has been fully initialized. Prior to this patch, it was possible for a connect header to be sent over the wire before the authorizer protocol or buffer length fields were initialized. An authorizer buffer associated with that header could also be queued to send only after the connection header that describes it was on the wire. Fixes http://tracker.newdream.net/issues/2424 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17ceph: rename prepare_connect_authorizer()Alex Elder
Change the name of prepare_connect_authorizer(). The next patch is going to make this function no longer add anything to the connection's out_kvec, so it will no longer fit the pattern of the rest of the prepare_connect_*() functions. In addition, pass the address of a variable that will hold the authorization protocol to use. Move the assignment of that to the connection's out_connect structure into prepare_write_connect(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17ceph: return pointer from prepare_connect_authorizer()Alex Elder
Change prepare_connect_authorizer() so it returns a pointer (or pointer-coded error). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17ceph: use info returned by get_authorizerAlex Elder
Rather than passing a bunch of arguments to be filled in with the content of the ceph_auth_handshake buffer now returned by the get_authorizer method, just use the returned information in the caller, and drop the unnecessary arguments. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17ceph: have get_authorizer methods return pointersAlex Elder
Have the get_authorizer auth_client method return a ceph_auth pointer rather than an integer, pointer-encoding any returned error value. This is to pave the way for making use of the returned value in an upcoming patch. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17ceph: messenger: check return from get_authorizerAlex Elder
In prepare_connect_authorizer(), a connection's get_authorizer method is called but ignores its return value. This function can return an error, so check for it and return it if that ever occurs. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17ceph: messenger: rework prepare_connect_authorizer()Alex Elder
Change prepare_connect_authorizer() so it returns without dropping the connection mutex if the connection has no get_authorizer method. Use the symbolic CEPH_AUTH_UNKNOWN instead of 0 when assigning authorization protocols. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17ceph: messenger: check prepare_write_connect() resultAlex Elder
prepare_write_connect() can return an error, but only one of its callers checks for it. All the rest are in functions that already return errors, so it should be fine to return the error if one gets returned. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17ceph: don't set WRITE_PENDING too earlyAlex Elder
prepare_write_connect() prepares a connect message, then sets WRITE_PENDING on the connection. Then *after* this, it calls prepare_connect_authorizer(), which updates the content of the connection buffer already queued for sending. It's also possible it will result in prepare_write_connect() returning -EAGAIN despite the WRITE_PENDING big getting set. Fix this by preparing the connect authorizer first, setting the WRITE_PENDING bit only after that is done. Partially addresses http://tracker.newdream.net/issues/2424 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17ceph: drop msgr argument from prepare_write_connect()Alex Elder
In all cases, the value passed as the msgr argument to prepare_write_connect() is just con->msgr. Just get the msgr value from the ceph connection and drop the unneeded argument. The only msgr passed to prepare_write_banner() is also therefore just the one from con->msgr, so change that function to drop the msgr argument as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17ceph: messenger: send banner in process_connect()Alex Elder
prepare_write_connect() has an argument indicating whether a banner should be sent out before sending out a connection message. It's only ever set in one of its callers, so move the code that arranges to send the banner into that caller and drop the "include_banner" argument from prepare_write_connect(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17ceph: messenger: reset connection kvec callerAlex Elder
Reset a connection's kvec fields in the caller rather than in prepare_write_connect(). This ends up repeating a few lines of code but it's improving the separation between distinct operations on the connection, which we can take advantage of later. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17libceph: don't reset kvec in prepare_write_banner()Alex Elder
Move the kvec reset for a connection out of prepare_write_banner and into its only caller. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-14ceph: messenger: change read_partial() to take "end" argAlex Elder
Make the second argument to read_partial() be the ending input byte position rather than the beginning offset it now represents. This amounts to moving the addition "to + size" into the caller. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-14ceph: messenger: update "to" in read_partial() callerAlex Elder
read_partial() always increases whatever "to" value is supplied by adding the requested size to it, and that's the only thing it does with that pointed-to value. Do that pointer advance in the caller (and then only when the updated value will be subsequently used), and change the "to" parameter to be an in-only and non-pointer value. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-14ceph: messenger: use read_partial() in read_partial_message()Alex Elder
There are two blocks of code in read_partial_message()--those that read the header and footer of the message--that can be replaced by a call to read_partial(). Do that. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-04-15net: cleanup unsigned to unsigned intEric Dumazet
Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-22libceph: isolate kmap() call in write_partial_msg_pages()Alex Elder
In write_partial_msg_pages(), every case now does an identical call to kmap(page). Instead, just call it once inside the CRC-computing block where it's needed. Move the definition of kaddr inside that block, and make it a (char *) to ensure portable pointer arithmetic. We still don't kunmap() it until after the sendpage() call, in case that also ends up needing to use the mapping. Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: rename "page_shift" variable to something sensibleAlex Elder
In write_partial_msg_pages() there is a local variable used to track the starting offset within a bio segment to use. Its name, "page_shift" defies the Linux convention of using that name for log-base-2(page size). Since it's only used in the bio case rename it "bio_offset". Use it along with the page_pos field to compute the memory offset when computing CRC's in that function. This makes the bio case match the others more closely. Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: get rid of zero_page_addressAlex Elder
There's not a lot of benefit to zero_page_address, which basically holds a mapping of the zero page through the life of the messenger module. Even with our own mapping, the sendpage interface where it's used may need to kmap() it again. It's almost certain to be in low memory anyway. So stop treating the zero page specially in write_partial_msg_pages() and just get rid of zero_page_address entirely. Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: only call kernel_sendpage() via helperAlex Elder
Make ceph_tcp_sendpage() be the only place kernel_sendpage() is used, by using this helper in write_partial_msg_pages(). Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: use kernel_sendpage() for sending zeroesAlex Elder
If a message queued for send gets revoked, zeroes are sent over the wire instead of any unsent data. This is done by constructing a message and passing it to kernel_sendmsg() via ceph_tcp_sendmsg(). Since we are already working with a page in this case we can use the sendpage interface instead. Create a new ceph_tcp_sendpage() helper that sets up flags to match the way ceph_tcp_sendmsg() does now. Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: fix inverted crc option logicAlex Elder
CRC's are computed for all messages between ceph entities. The CRC computation for the data portion of message can optionally be disabled using the "nocrc" (common) ceph option. The default is for CRC computation for the data portion to be enabled. Unfortunately, the code that implements this feature interprets the feature flag wrong, meaning that by default the CRC's have *not* been computed (or checked) for the data portion of messages unless the "nocrc" option was supplied. Fix this, in write_partial_msg_pages() and read_partial_message(). Also change the flag variable in write_partial_msg_pages() to be "no_datacrc" to match the usage elsewhere in the file. This fixes http://tracker.newdream.net/issues/2064 Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: some simple changesAlex Elder
Nothing too big here. - define the size of the buffer used for consuming ignored incoming data using a symbolic constant - simplify the condition determining whether to unmap the page in write_partial_msg_pages(): do it for crc but not if the page is the zero page Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: small refactor in write_partial_kvec()Alex Elder
Make a small change in the code that counts down kvecs consumed by a ceph_tcp_sendmsg() call. Same functionality, just blocked out a little differently. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: do crc calculations outside loopAlex Elder
Move blocks of code out of loops in read_partial_message_section() and read_partial_message(). They were only was getting called at the end of the last iteration of the loop anyway. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: separate CRC calculation from byte swappingAlex Elder
Calculate CRC in a separate step from rearranging the byte order of the result, to improve clarity and readability. Use offsetof() to determine the number of bytes to include in the CRC calculation. In read_partial_message(), switch which value gets byte-swapped, since the just-computed CRC is already likely to be in a register. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: use "do" in CRC-related Boolean variablesAlex Elder
Change the name (and type) of a few CRC-related Boolean local variables so they contain the word "do", to distingish their purpose from variables used for holding an actual CRC value. Note that in the process of doing this I identified a fairly serious logic error in write_partial_msg_pages(): the value of "do_crc" assigned appears to be the opposite of what it should be. No attempt to fix this is made here; this change preserves the erroneous behavior. The problem I found is documented here: http://tracker.newdream.net/issues/2064 Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: a few small changesAlex Elder
This gathers a number of very minor changes: - use %hu when formatting the a socket address's address family - null out the ceph_msgr_wq pointer after the queue has been destroyed - drop a needless cast in ceph_write_space() - add a WARN() call in ceph_state_change() in the event an unrecognized socket state is encountered - rearrange the logic in ceph_con_get() and ceph_con_put() so that: - the reference counts are only atomically read once - the values displayed via dout() calls are known to be meaningful at the time they are formatted Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: make ceph_tcp_connect() return intAlex Elder
There is no real need for ceph_tcp_connect() to return the socket pointer it creates, since it already assigns it to con->sock, which is visible to the caller. Instead, have it return an error code, which tidies things up a bit. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: encapsulate some messenger cleanup codeAlex Elder
Define a helper function to perform various cleanup operations. Use it both in the exit routine and in the init routine in the event of an error. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: make ceph_msgr_wq privateAlex Elder
The messenger workqueue has no need to be public. So give it static scope. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: encapsulate connection kvec operationsAlex Elder
Encapsulate the operation of adding a new chunk of data to the next open slot in a ceph_connection's out_kvec array. Also add a "reset" operation to make subsequent add operations start at the beginning of the array again. Use these routines throughout, avoiding duplicate code and ensuring all calls are handled consistently. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22libceph: move prepare_write_banner()Alex Elder
One of the arguments to prepare_write_connect() indicates whether it is being called immediately after a call to prepare_write_banner(). Move the prepare_write_banner() call inside prepare_write_connect(), and reinterpret (and rename) the "after_banner" argument so it indicates that prepare_write_connect() should *make* the call rather than should know it has already been made. This was split out from the next patch to highlight this change in logic. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: eliminate some abusive castsAlex Elder
This fixes some spots where a type cast to (void *) was used as as a universal type hiding mechanism. Instead, properly cast the type to the intended target type. Signed-off-by: Alex Elder <elder@newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: eliminate some needless castsAlex Elder
This eliminates type casts in some places where they are not required. Signed-off-by: Alex Elder <elder@newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: kill addr_str_lock spinlock; use atomic insteadAlex Elder
A spinlock is used to protect a value used for selecting an array index for a string used for formatting a socket address for human consumption. The index is reset to 0 if it ever reaches the maximum index value. Instead, use an ever-increasing atomic variable as a sequence number, and compute the array index by masking off all but the sequence number's lowest bits. Make the number of entries in the array a power of two to allow the use of such a mask (to avoid jumps in the index value when the sequence number wraps). The length of these strings is somewhat arbitrarily set at 60 bytes. The worst-case length of a string produced is 54 bytes, for an IPv6 address that can't be shortened, e.g.: [1234:5678:9abc:def0:1111:2222:123.234.210.100]:32767 Change it so we arbitrarily use 64 bytes instead; if nothing else it will make the array of these line up better in hex dumps. Rename a few things to reinforce the distinction between the number of strings in the array and the length of individual strings. Signed-off-by: Alex Elder <elder@newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22ceph: make use of "else" where appropriateAlex Elder
Rearrange ceph_tcp_connect() a bit, making use of "else" rather than re-testing a value with consecutive "if" statements. Don't record a connection's socket pointer unless the connect operation is successful. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>