X-Git-Url: https://git.ozlabs.org/?p=ccan;a=blobdiff_plain;f=ccan%2Ftdb2%2Fdoc%2Fdesign.txt;h=c2994a4ce08785f0e2c106dfdf3929ea1fea51e7;hp=2f0d22cd9e921d849e8a71326f7088cc56a4ab1a;hb=a42bba8ec446284256a7c9146ba3525404de474c;hpb=39f01834db9b6a21d076e67d1e3143ab99aaf43e
diff --git a/ccan/tdb2/doc/design.txt b/ccan/tdb2/doc/design.txt
index 2f0d22cd..c2994a4c 100644
--- a/ccan/tdb2/doc/design.txt
+++ b/ccan/tdb2/doc/design.txt
@@ -2,7 +2,7 @@ TDB2: A Redesigning The Trivial DataBase
 Rusty Russell, IBM Corporation
 
-26-July-2010
+1-December-2010
 
 Abstract
 
@@ -74,7 +74,7 @@ optional hashing function and an optional logging function
 argument. Additional arguments to open would require the
 introduction of a tdb_open_ex2 call etc.
 
-2.1.1 Proposed Solution 
+2.1.1 Proposed Solution
 
 tdb_open() will take a linked-list of attributes:
 
@@ -129,6 +129,10 @@ union tdb_attribute {
 This allows future attributes to be added, even if this expands
 the size of the union.
 
+2.1.2 Status
+
+Complete.
+
 2.2 tdb_traverse Makes Impossible Guarantees
 
 tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions,
 
@@ -148,6 +152,11 @@ occur during your traversal, otherwise you will see some subset.
 You can prevent changes by using a transaction or the locking
 API.
 
+2.2.2 Status
+
+Complete. Delete-during-traverse will still delete every record,
+too (assuming no other changes).
+
 2.3 Nesting of Transactions Is Fraught
 
 TDB has alternated between allowing nested transactions and not
 
@@ -182,6 +191,10 @@ However, this behavior can be simulated with a wrapper which uses
 tdb_add_flags() and tdb_remove_flags(), so the API should not be
 expanded for this relatively-obscure case.
 
+2.3.2 Status
+
+Incomplete; nesting flag is still defined as per tdb1.
+
 2.4 Incorrect Hash Function Is Not Detected
 
 tdb_open_ex() allows the calling code to specify a different hash
 
@@ -195,6 +208,10 @@ The header should contain an example hash result (eg. the hash of
 0xdeadbeef), and tdb_open_ex() should check that the given hash
 function produces the same answer, or fail the tdb_open call.
 
+2.4.2 Status
+
+Complete.
+
 2.5 tdb_set_max_dead/TDB_VOLATILE Expose Implementation
 
 In response to scalability issues with the free list ([TDB-Freelist-Is]
 
@@ -216,6 +233,11 @@ hint that store and delete of records will be at least as common
 as fetch in order to allow some internal tuning, but initially
 will become a no-op.
 
+2.5.2 Status
+
+Incomplete. TDB_VOLATILE is still defined, but the implementation
+should fail on unknown flags to be future-proof.
+
 2.6 TDB Files Cannot Be Opened Multiple Times In The Same Process
 
@@ -251,6 +273,10 @@ whether re-opening is allowed, though there may be some
 benefit to adding a call to detect when a tdb_context is shared,
 to allow others to create such an API.
 
+2.6.2 Status
+
+Incomplete.
+
 2.7 TDB API Is Not POSIX Thread-safe
 
 The TDB API uses an error code which can be queried after an
 
@@ -277,7 +303,13 @@ maintained.
 
 The aim is that building tdb with -DTDB_PTHREAD will result in a
 pthread-safe version of the library, and otherwise no overhead
-will exist.
+will exist. Alternatively, a hooking mechanism similar to that
+proposed for [Proposed-Solution-locking-hook] could be used to
+enable pthread locking at runtime.
+
+2.7.2 Status
+
+Incomplete.
 
 2.8 *_nonblock Functions And *_mark Functions Expose
   Implementation
 
@@ -341,6 +373,10 @@ locks it doesn't need to obtain.
 
 It also keeps the complexity out of the API, and in ctdbd where
 it is needed.
 
+2.8.2 Status
+
+Incomplete.
+
 2.9 tdb_chainlock Functions Expose Implementation
 
 tdb_chainlock locks some number of records, including the record
 
@@ -389,6 +425,10 @@ EINVAL if the signal occurs before the kernel is entered,
 otherwise EAGAIN.
 ]
 
+2.10.2 Status
+
+Incomplete.
+
 2.11 The API Uses Gratuitous Typedefs, Capitals
 
 typedefs are useful for providing source compatibility when types
 
@@ -431,6 +471,10 @@ the tdb_open_ex for logging.
 
 It should simply take an extra
 argument, since we are prepared to break the API/ABI.
 
+2.12.2 Status
+
+Complete.
+
 2.13 Various Callback Functions Are Not Typesafe
 
 The callback functions in tdb_set_logging_function (after [tdb_log_func-Doesnt-Take]
 
@@ -453,6 +497,10 @@ their parameter.
 
 See CCAN's typesafe_cb module at
 http://ccan.ozlabs.org/info/typesafe_cb.html
 
+2.13.2 Status
+
+Incomplete.
+
 2.14 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens,
   tdb_reopen_all Problematic
 
@@ -473,6 +521,89 @@ it alone has opened the TDB and will erase it.
 
 Remove TDB_CLEAR_IF_FIRST. Other workarounds are possible, but
 see [TDB_CLEAR_IF_FIRST-Imposes-Performance].
 
+2.14.2 Status
+
+Incomplete; TDB_CLEAR_IF_FIRST is still defined, but not
+implemented.
+
+2.15 Extending The Header Is Difficult
+
+We have reserved (zeroed) words in the TDB header, which can be
+used for future features. If the future features are compulsory,
+the version number must be updated to prevent old code from
+accessing the database. But if the future feature is optional, we
+have no way of telling if older code is accessing the database or
+not.
+
+2.15.1 Proposed Solution
+
+The header should contain a “format variant” value (64-bit). This
+is divided into two 32-bit parts:
+
+1. The lower part reflects the format variant understood by code
+   accessing the database.
+
+2. The upper part reflects the format variant you must understand
+   to write to the database (otherwise you can only open for
+   reading).
+
+The latter field can only be written at creation time; the former
+should be written under the OPEN_LOCK when opening the database
+for writing, if the variant of the code is lower than the current
+lowest variant.
+
+This should allow backwards-compatible features to be added, and
+allow detection when older code (which doesn't understand the
+feature) writes to the database.
+
+2.15.2 Status
+
+Incomplete.
+
+2.16 Record Headers Are Not Expandable
+
+If we later want to add (say) checksums on keys and data, it
+would require another format change, which we'd like to avoid.
+
+2.16.1 Proposed Solution
+
+We often have extra padding at the tail of a record. If we ensure
+that the first byte (if any) of this padding is zero, we will
+have a way for future changes to detect code which doesn't
+understand a new format: the new code would write (say) a 1 at
+the tail, and thus if there is no tail or the first byte is 0, we
+would know the extension is not present on that record.
+
+2.16.2 Status
+
+Incomplete.
+
+2.17 TDB Does Not Use Talloc
+
+Many users of TDB (particularly Samba) use the talloc allocator,
+and thus have to wrap TDB in a talloc context to use it
+conveniently.
+
+2.17.1 Proposed Solution
+
+The allocation within TDB is not complicated enough to justify
+the use of talloc, and I am reluctant to force another
+(excellent) library on TDB users. Nonetheless a compromise is
+possible. An attribute (see [attributes]) can be added later to
+tdb_open() to provide an alternate allocation mechanism,
+specifically for talloc but usable by any other allocator (which
+would ignore the “context” argument).
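+
+As an illustrative sketch only (the attribute name, the hook
+signatures and the TDB_ATTRIBUTE_ALLOCATOR value are hypothetical,
+not part of the current API), such an attribute might look like:
+
+struct tdb_attribute_allocator {
+
+	struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_ALLOCATOR */
+
+	/* The owner is the object the allocation belongs to, eg. the
+	   tdb_context; a talloc implementation would hang the
+	   allocation off it, other allocators simply ignore it. */
+	void *(*alloc)(const void *owner, size_t len, void *priv);
+
+	void (*free)(void *ptr, void *priv);
+
+	/* Allocator-private data, eg. a talloc context. */
+	void *priv;
+};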
+
+With talloc as the allocator, this would form a talloc hierarchy
+as expected, but the caller would still have to attach a
+destructor to the tdb context returned from tdb_open to close it.
+All TDB_DATA fields would be children of the tdb_context, and the
+caller would still have to manage them (using talloc_free() or
+talloc_steal()).
+
+2.17.2 Status
+
+Deferred.
+
 3 Performance And Scalability Issues
 
 3.1 TDB_CLEAR_IF_FIRST
 
@@ -502,6 +633,10 @@ Remove the flag. It was a neat idea, but even trivial servers
 tend to know when they are initializing for the first time and
 can simply unlink the old tdb at that point.
 
+3.1.2 Status
+
+Incomplete; TDB_CLEAR_IF_FIRST is still defined, but does nothing.
+
 3.2 TDB Files Have a 4G Limit
 
 This seems to be becoming an issue (so much for “trivial”!),
 
@@ -528,6 +663,10 @@ Old versions of tdb will fail to open the new TDB files (since
 28 August 2009, commit 398d0c29290: prior to that any unrecognized
 file format would be erased and initialized as a fresh tdb!)
 
+3.2.2 Status
+
+Complete.
+
 3.3 TDB Records Have a 4G Limit
 
 This has not been a reported problem, and the API uses size_t
 
@@ -542,6 +681,10 @@ implementation would return TDB_ERR_OOM in a similar case). It
 seems unlikely that 32 bit keys will be a limitation, so the
 implementation may not support this (see [sub:Records-Incur-A]).
 
+3.3.2 Status
+
+Complete.
+
 3.4 Hash Size Is Determined At TDB Creation Time
 
 TDB contains a number of hash chains in the header; the number is
 
@@ -551,7 +694,7 @@ long), that LDB uses 10,000 for this hash. In general it is
 impossible to know what the 'right' answer is at database
 creation time.
 
-3.4.1 Proposed Solution 
+3.4.1 Proposed Solution
 
 After comprehensive performance testing on various scalable hash
 variants[footnote:
@@ -559,31 +702,33 @@ http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94
 This was annoying because I was previously convinced that an
 expanding tree of hashes would be very close to optimal.
 ], it became clear that it is hard to beat a straight linear hash
-table which doubles in size when it reaches saturation. There are
-three details which become important:
-
-1. On encountering a full bucket, we use the next bucket.
-
-2. Extra hash bits are stored with the offset, to reduce
-   comparisons.
-
-3. A marker entry is used on deleting an entry.
-
-The doubling of the table must be done under a transaction; we
-will not reduce it on deletion, so it will be an unusual case. It
-will either be placed at the head (other entries will be moved
-out the way so we can expand). We could have a pointer in the
-header to the current hashtable location, but that pointer would
-have to be read frequently to check for hashtable moves.
-
-The locking for this is slightly more complex than the chained
-case; we currently have one lock per bucket, and that means we
-would need to expand the lock if we overflow to the next bucket.
-The frequency of such collisions will effect our locking
-heuristics: we can always lock more buckets than we need.
-
-One possible optimization is to only re-check the hash size on an
-insert or a lookup miss.
+table which doubles in size when it reaches saturation.
+Unfortunately, altering the hash table introduces serious locking
+complications: the entire hash table needs to be locked to
+enlarge it, and others might be holding locks. Particularly
+insidious are insertions done under tdb_chainlock.
+
+Thus an expanding layered hash will be used: an array of hash
+groups, with each hash group exploding into pointers to lower
+hash groups once it fills, turning into a hash tree. This has
+implications for locking: we must lock the entire group in case
+we need to expand it, yet we don't know how deep the tree is at
+that point.
+
+Note that bits from the hash table entries should be stolen to
+hold more hash bits to reduce the penalty of collisions. We can
+use the otherwise-unused lower 3 bits. If we limit the size of
+the database to 64 exabytes, we can use the top 8 bits of the
+hash entry as well. These 11 bits would reduce false positives
+down to 1 in 2000, which is more than we need: we can use one of
+the bits to indicate that the extra hash bits are valid. This
+means we can choose not to re-hash all entries when we expand a
+hash group; simply use the next bits we need and mark them
+invalid.
+
+3.4.2 Status
+
+Complete.
 
 3.5 TDB Freelist Is Highly Contended
 
@@ -660,50 +805,57 @@ to the size of the hash table, but as it is rare to walk a large
 number of free list entries we can use far fewer, say 1/32 of the
 number of hash buckets.
 
+It seems tempting to try to reuse the hash implementation which
+we use for records here, but we have two ways of searching for
+free entries: for allocation we search by size (and possibly
+zone), which produces too many clashes for our hash table to
+handle well, and for coalescing we search by address. Thus an
+array of doubly-linked free lists seems preferable.
+
 There are various benefits in using per-size free lists (see [sub:TDB-Becomes-Fragmented]
 ) but it's not clear this would reduce contention in the common
 case where all processes are allocating/freeing the same size.
 Thus we almost certainly need to divide in other ways: the most
 obvious is to divide the file into zones, and to use a free list
-(or set of free lists) for each. This approximates address
+(or table of free lists) for each. This approximates address
 ordering.
 
-Note that this means we need to split the free lists when we
-expand the file; this is probably acceptable when we double the
-hash table size, since that is such an expensive operation
-already. In the case of increasing the file size, there is an
-optimization we can use: if we use M in the formula above as the
-file size rounded up to the next power of 2, we only need
-reshuffle free lists when the file size crosses a power of 2
-boundary, and reshuffling the free lists is trivial: we simply
-merge every consecutive pair of free lists.
+Unfortunately it is difficult to know what heuristics should be
+used to determine zone sizes, and our transaction code relies on
+being able to create a “recovery area” by simply appending to the
+file (difficult if it would need to create a new zone header).
+Thus we use a linked list of free tables; currently we only ever
+create one, but if there is more than one we choose one at random
+to use. In future we may use heuristics to add new free tables on
+contention. We only expand the file when all free tables are
+exhausted.
 
 The basic algorithm is as follows. Freeing is simple:
 
-1. Identify the correct zone.
+1. Identify the correct free list.
 
 2. Lock the corresponding list.
 
-3. Re-check the zone (we didn't have a lock, sizes could have
+3. Re-check the list (we didn't have a lock, sizes could have
    changed): relock if necessary.
 
-4. Place the freed entry in the list for that zone.
+4. Place the freed entry in the list.
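+
+To make this concrete, here is a toy model of freeing steps 1 and
+4 (bucket by size, then insert into a doubly-linked list). It is
+a sketch with invented names, not the real implementation; the
+locking and re-checking of steps 2 and 3 (one lock range per
+list) is elided:
+
+#include <stdio.h>
+
+#define NUM_LISTS 8
+
+struct free_entry {
+	size_t len;
+	struct free_entry *prev, *next; /* doubly-linked for easy deletion */
+};
+
+struct free_table {
+	struct free_entry *list[NUM_LISTS];
+};
+
+/* Step 1: identify the correct list (bucket by log2 of the size). */
+static unsigned size_to_list(size_t len)
+{
+	unsigned i = 0;
+
+	while (i < NUM_LISTS - 1 && (len >>= 1))
+		i++;
+	return i;
+}
+
+/* Step 4: place the freed entry at the head of its list. */
+static void free_record(struct free_table *ft, struct free_entry *e)
+{
+	unsigned i = size_to_list(e->len);
+
+	e->prev = NULL;
+	e->next = ft->list[i];
+	if (e->next)
+		e->next->prev = e;
+	ft->list[i] = e;
+}
+
+int main(void)
+{
+	struct free_table ft = { { NULL } };
+	struct free_entry e = { 100, NULL, NULL };
+
+	free_record(&ft, &e);
+	printf("100-byte entry lands in list %u\n", size_to_list(e.len));
+	return 0;
+}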
 
 Allocation is a little more complicated, as we perform delayed
 coalescing at this point:
 
-1. Pick a zone either the zone we last freed into, or based on a “
-   random” number.
+1. Pick a free table; usually the previous one.
 
 2. Lock the corresponding list.
 
-3. Re-check the zone: relock if necessary.
-
-4. If the top entry is -large enough, remove it from the list and
+3. If the top entry is large enough, remove it from the list and
    return it.
 
-5. Otherwise, coalesce entries in the list.If there was no entry
-   large enough, unlock the list and try the next zone.
+4. Otherwise, coalesce entries in the list. If there was no entry
+   large enough, unlock the list and try the next largest list.
+
+5. If no list has an entry which meets our needs, try the next
+   free table.
 
 6. If no free table satisfies, expand the file.
 
@@ -714,9 +866,9 @@ ordering seems to be fairly good for keeping fragmentation low
 does not need a tailer to coalesce, though if we needed one we
 could have one cheaply: see [sub:Records-Incur-A].
 
-I anticipate that the number of entries in each free zone would
-be small, but it might be worth using one free entry to hold
-pointers to the others for cache efficiency.
+Each free entry has the free table number in the header (it must
+be less than 255). It also contains doubly-linked list pointers
+for easy deletion.
 
 3.6 TDB Becomes Fragmented
 
@@ -822,7 +974,9 @@ block: this is diminishing returns after a handful of bits (at
 10 bits, it reduces 99.9% of false memcmp). As an aside, as the
 lower bits are already incorporated in the hash table
-   resolution, the upper bits should be used here.
+   resolution, the upper bits should be used here. Note that it's
+   not clear that these bits will be a win, given the extra bits
+   in the hash table itself (see [sub:Hash-Size-Solution]).
 
 5. 'magic' does not need to be enlarged: it currently reflects
    one of 5 values (used, free, dead, recovery, and
 
@@ -843,13 +997,13 @@ This produces a 16 byte used header like this:
 
 struct tdb_used_record {
 
-	uint32_t magic : 16,
+	uint32_t used_magic : 16,
+
-	         prev_is_free: 1,
 	         key_data_divide: 5,
-	         top_hash: 10;
+	         top_hash: 11;
 
 	uint32_t extra_octets;
 
@@ -861,17 +1015,27 @@ And a free record like this:
 
 struct tdb_free_record {
 
-	uint32_t free_magic;
+	uint64_t free_magic: 8,
 
-	uint64_t total_length;
+	         prev : 56;
 
-	...
-	uint64_t tailer;
+
+	uint64_t free_table: 8,
+
+	         total_length : 56;
+
+	uint64_t next;
 };
 
+Note that by limiting valid offsets to 56 bits, we can pack
+everything we need into 3 64-bit words (24 bytes), meaning our
+minimum record size is 8 bytes: a 16-byte used header plus 8
+bytes of key and data is just enough to hold the free record when
+the record is freed.
+
+3.7.2 Status
+
+Complete.
 
 3.8 Transaction Commit Requires 4 fdatasync
 
@@ -924,11 +1088,14 @@ but need only be done at open. For running databases, a separate
 header field can be used to indicate a transaction in progress;
 we need only check for recovery if this is set.
 
-3.9 TDB Does Not Have Snapshot Support
+3.8.2 Status
 
-3.9.1 Proposed Solution
+Deferred.
 
-None. At some point you say “use a real database”.
+3.9 TDB Does Not Have Snapshot Support
+
+3.9.1 Proposed Solution
+
+None. At some point you say “use a real database” (but see
+[replay-attribute]).
 
 But as a thought experiment, if we implemented transactions to
 only overwrite free entries (this is tricky: there must not be a
 
@@ -947,6 +1114,10 @@ rewrite some sections of the hash, too.
 
 We could then implement snapshots using a similar method, using
 multiple different hash tables/free tables.
 
+3.9.2 Status
+
+Deferred.
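+
+As an aside, the size claims in the record layouts above can be
+checked mechanically. The following sketch repeats the two structs
+(the key_and_data_length field name is assumed, as the 16-byte
+total implies an 8-byte length member; note too that bitfield
+layout is implementation-defined, though common ABIs pack these
+as intended):
+
+#include <assert.h>
+#include <stdint.h>
+
+struct tdb_used_record {
+	uint32_t used_magic : 16,
+	         key_data_divide: 5,
+	         top_hash: 11;
+	uint32_t extra_octets;
+	uint64_t key_and_data_length; /* field name assumed */
+};
+
+struct tdb_free_record {
+	uint64_t free_magic: 8,
+	         prev : 56;
+	uint64_t free_table: 8,
+	         total_length : 56;
+	uint64_t next;
+};
+
+int main(void)
+{
+	assert(sizeof(struct tdb_used_record) == 16);  /* 16-byte used header */
+	assert(sizeof(struct tdb_free_record) == 24);  /* 3 64-bit words */
+	return 0;
+}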
+
 3.10 Transactions Cannot Operate in Parallel
 
 This would be useless for ldb, as it hits the index records with
 
@@ -957,11 +1128,15 @@
 failed.
 
 3.10.1 Proposed Solution
 
-We could solve a small part of the problem by providing read-only
-transactions. These would allow one write transaction to begin,
-but it could not commit until all r/o transactions are done. This
-would require a new RO_TRANSACTION_LOCK, which would be upgraded
-on commit.
+None (but see [replay-attribute]). We could solve a small part of
+the problem by providing read-only transactions. These would
+allow one write transaction to begin, but it could not commit
+until all r/o transactions are done. This would require a new
+RO_TRANSACTION_LOCK, which would be upgraded on commit.
+
+3.10.2 Status
+
+Deferred.
 
 3.11 Default Hash Function Is Suboptimal
 
@@ -984,6 +1159,10 @@ The seed should be created at tdb-creation time from some random
 source, and placed in the header. This is far from foolproof, but
 adds a little bit of protection against hash bombing.
 
+3.11.2 Status
+
+Complete.
+
 3.12 Reliable Traversal Adds Complexity
 
 We lock a record during traversal iteration, and try to grab that
 
@@ -998,6 +1177,10 @@ indefinitely.
 
 Remove reliability guarantees; see [traverse-Proposed-Solution].
 
+3.12.2 Status
+
+Complete.
+
 3.13 Fcntl Locking Adds Overhead
 
 Placing a fcntl lock means a system call, as does removing one.
 
@@ -1056,3 +1239,21 @@ filled). On crash, tdb_open() would examine the array of top
 levels, and apply the transactions until it encountered an
 invalid checksum.
 
+3.15 Tracing Is Fragile, Replay Is External
+
+The current TDB has compile-time-enabled tracing code, but it
+often breaks as it is not enabled by default. In a similar way,
+the ctdb code has an external wrapper which does replay tracing
+so it can coordinate cluster-wide transactions.
+
+3.15.1 Proposed Solution
+
+Tridge points out that an attribute can later be added to
+tdb_open (see [attributes]) to provide replay/trace hooks, which
+could become the basis for this and future parallel transactions
+and snapshot support.
+
+3.15.2 Status
+
+Deferred.
+
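+
+As a sketch of what the replay/trace hook of 3.15.1 might look
+like (the attribute name, hook signatures and TDB_ATTRIBUTE_REPLAY
+value are all hypothetical, following the attribute style of
+[attributes]):
+
+struct tdb_attribute_replay {
+
+	struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_REPLAY */
+
+	/* Called around each modifying operation, so a tracer (or a
+	   ctdb-style wrapper) can log operations and replay them. */
+	void (*pre_op)(struct tdb_context *tdb, int op, TDB_DATA key,
+		       TDB_DATA data, void *priv);
+
+	void (*post_op)(struct tdb_context *tdb, int op, int ret,
+			void *priv);
+
+	/* Private data passed back to the hooks. */
+	void *priv;
+};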