summaryrefslogtreecommitdiff
path: root/fs/ceph/osdmap.c
AgeCommit message (Collapse)Author
2010-10-20ceph: factor out libceph from Ceph file systemYehuda Sadeh
This factors out protocol and low-level storage parts of ceph into a separate libceph module living in net/ceph and include/linux/ceph. This is mostly a matter of moving files around. However, a few key pieces of the interface change as well: - ceph_client becomes ceph_fs_client and ceph_client, where the latter captures the mon and osd clients, and the fs_client gets the mds client and file system specific pieces. - Mount option parsing and debugfs setup is correspondingly broken into two pieces. - The mon client gets a generic handler callback for otherwise unknown messages (mds map, in this case). - The basic supported/required feature bits can be expanded (and are by ceph_fs_client). No functional change, aside from some subtle error handling cases that got cleaned up in the refactoring process. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: lookup pool in osdmap by nameYehuda Sadeh
Implement a pool lookup by name. This will be used by rbd. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-03ceph: whitespace cleanupSage Weil
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02ceph: fix decoding of pool snap infoSage Weil
The pool info contains a vector for snap_info_t, not snap ids. This fixes the broken decoding, which would declare teh update corrupt when a pool snapshot was created. Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01ceph: print useful error message when crush rule not foundSage Weil
Include the crush_ruleset in the error message. Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01ceph: code cleanupYehuda Sadeh
Mainly fixing minor issues reported by sparse. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23ceph: fix pg_mapping leak on pg_temp updatesSage Weil
Free the ceph_pg_mapping structs when they are removed from the pg_temp rbtree. Also fix a leak in the __insert_pg_mapping() error path. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-08ceph: add kfree() to error pathDan Carpenter
We leak a "pi" on this error path. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-17ceph: fix crush map update decodingSage Weil
If the incremental osdmap has a new crush map, advance the position after decoding so that we can parse the rest of the osdmap properly. Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29fs/ceph: Use ERR_CASTJulia Lawall
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more clear what is the purpose of the operation, which otherwise looks like a no-op. In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of the returned value is the same as the type of the enclosing function. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ type T; T x; identifier f; @@ T f (...) { <+... - ERR_PTR(PTR_ERR(x)) + x ...+> } @@ expression x; @@ - ERR_PTR(PTR_ERR(x)) + ERR_CAST(x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11ceph: resubmit requests on pg mapping change (not just primary change)Sage Weil
OSD requests need to be resubmitted on any pg mapping change, not just when the pg primary changes. Resending only when the primary changes results in occasional 'hung' requests during osd cluster recovery or rebalancing. Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-14Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: use separate class for ceph sockets' sk_lock ceph: reserve one more caps space when doing readdir ceph: queue_cap_snap should always queue dirty context ceph: fix dentry reference leak in dcache readdir ceph: decode v5 of osdmap (pool names) [protocol change] ceph: fix ack counter reset on connection reset ceph: fix leaked inode ref due to snap metadata writeback race ceph: fix snap context reference leaks ceph: allow writeback of snapped pages older than 'oldest' snapc ceph: fix dentry rehashing on virtual .snap dir
2010-04-09ceph: decode v5 of osdmap (pool names) [protocol change]Sage Weil
Teach the client to decode an updated format for the osdmap. The new format includes pool names, which will be useful shortly. Get this change in earlier rather than later. Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-23ceph: fix pg pool decoding from incremental osdmap updateSage Weil
The incremental map decoding of pg pool updates wasn't skipping the snaps and removed_snaps vectors. This caused osd requests to stall when pool snapshots were created or fs snapshots were deleted. Use a common helper for full and incremental map decoders that decodes pools properly. Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-01ceph: fix osdmap decoding when pools include (removed) snapsSage Weil
Add missing pointer dereference (p is a void **). Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-17ceph: use rbtree for pg pools; decode new osdmap formatSage Weil
Since we can now create and destroy pg pools, the pool ids will be sparse, and an array no longer makes sense for looking up by pool id. Use an rbtree instead. The OSDMap encoding also no longer has a max pool count (previously used to allocate the array). There is a new pool_max, that is the largest pool id we've ever used, although we don't actually need it in the client. Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-17ceph: fix memory leak when destroying osdmap with pg_temp mappingsSage Weil
Also move _lookup_pg_mapping into a helper. Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-11ceph: add uid field to ceph_pg_poolSage Weil
Also verify encoding version as we go. Signed-off-by: Sage Weil <sage@newdream.net>
2010-01-25ceph: precede encoded ceph_pg_pool struct with versionSage Weil
Signed-off-by: Sage Weil <sage@newdream.net>
2009-12-21ceph: fix incremental osdmap pg_temp decoding bugSage Weil
An incremental pg_temp wasn't being decoded properly (wrong bound on for loop). Also remove unused local variable, while we're at it. Signed-off-by: Sage Weil <sage@newdream.net>
2009-12-21ceph: fix error paths for corrupt osdmap messagesSage Weil
Both osdmap_decode() and osdmap_apply_incremental() should never return NULL. Signed-off-by: Sage Weil <sage@newdream.net>
2009-12-21ceph: hex dump corrupt server data to KERN_DEBUGSage Weil
Also, print fsid using standard format, NOT hex dump. Signed-off-by: Sage Weil <sage@newdream.net>
2009-12-09ceph: do not feed bad device ids to crushSage Weil
Do not feed bad (large) device ids to CRUSH. Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-07ceph: make CRUSH hash function a bucket propertySage Weil
Make the integer hash function a property of the bucket it is used on. This allows us to gracefully add support for new hash functions without starting from scatch. Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-06ceph: make object hash a pg_pool propertySage Weil
The object will be hashed to a placement seed (ps) based on the pg_pool's hash function. This allows new hashes to be introduced into an existing object store, or selection of a hash appropriate to the objects that will be stored in a particular pool. Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-06ceph: clean up 'osd%d down' console msgSage Weil
No ceph prefix. Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-04ceph: fix endian conversions for ceph_pgSage Weil
The endian conversions don't quite work with the old union ceph_pg. Just make it a regular struct, and make each field __le. This is simpler and it has the added bonus of actually working. Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-03ceph: use fixed endian encoding for ceph_entity_addrSage Weil
We exchange struct ceph_entity_addr over the wire and store it on disk. The sockaddr_storage.ss_family field, however, is host endianness. So, fix ss_family endianness to big endian when sending/receiving over the wire. Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-30ceph: fix intra strip unit length calculationNoah Watkins
Commit 645a102581b3639836b17d147c35d574fd6e8267 fixes calculation of object offset for layouts with multiple stripes per object. This updates the calculation of the length written to take into account multiple stripes per object. Signed-off-by: Noah Watkins <noah@noahdesu.com> Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-28ceph: fix object striping calculation for non-default striping schemesSage Weil
We were incorrectly calculationing of object offset. If we have multiple stripe units per object, we need to shift to the start of the current su in addition to the offset within the su. Also rename bno to ono (object number) to avoid some variable naming confusion. Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-28ceph: correct comment to match striping calculationSage Weil
The object extent offset is the file offset _modulo_ the stripe unit. The code was correct, the comment was wrong. Reported-by: Noah Watkins <jayhawk@soe.ucsc.edu> Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-28ceph: remove redundant use of le32_to_cpuNoah Watkins
Using stripe unit size calculated and saved on the stack to avoid a redundant call to le32_to_cpu. Signed-off-by: Noah Watkins <noah@noahdesu.com> Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-19ceph: include preferred osd in placement seedSage Weil
Mix the preferred osd (if any) into the placement seed that is fed into the CRUSH object placement calculation. This prevents all the placement pgs from peering with the same osds. Rev the osd client protocol with this change. Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-14ceph: convert encode/decode macros to inlinesSage Weil
This avoids the fugly pass by reference and makes the code a bit easier to read. Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-09ceph: fail gracefully on corrupt osdmap (bad pg_temp mapping)Sage Weil
Return an error and report a corrupt map instead of crying BUG(). Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06ceph: OSD clientSage Weil
The OSD client is responsible for reading and writing data from/to the object storage pool. This includes determining where objects are stored in the cluster, and ensuring that requests are retried or redirected in the event of a node failure or data migration. If an OSD does not respond before a timeout expires, keepalive messages are sent across the lossless, ordered communications channel to ensure that any break in the TCP is discovered. If the session does reset, a reconnection is attempted and affected requests are resent (by the message transport layer). Signed-off-by: Sage Weil <sage@newdream.net>