tdb2: update documentation

[ccan] / ccan / tdb2 / doc / design.txt
diff --git a/ccan/tdb2/doc/design.txt b/ccan/tdb2/doc/design.txt

index 2f0d22cd9e921d849e8a71326f7088cc56a4ab1a..c2994a4ce08785f0e2c106dfdf3929ea1fea51e7 100644 (file)
--- a/ccan/tdb2/doc/design.txt
+++ b/ccan/tdb2/doc/design.txt
@@ -2,7 +2,7 @@ TDB2: A Redesigning The Trivial DataBase
  
  Rusty Russell, IBM Corporation
  
  
  Rusty Russell, IBM Corporation
  
-26-July-2010
+1-December-2010
  
  Abstract
  
  
  Abstract
  
@@ -74,7 +74,7 @@ optional hashing function and an optional logging function
  argument. Additional arguments to open would require the 
  introduction of a tdb_open_ex2 call etc.
  
  argument. Additional arguments to open would require the 
  introduction of a tdb_open_ex2 call etc.
  
-2.1.1 Proposed Solution
+2.1.1 Proposed Solution<attributes>
  
  tdb_open() will take a linked-list of attributes:
  
  
  tdb_open() will take a linked-list of attributes:
  
@@ -129,6 +129,10 @@ union tdb_attribute {
  This allows future attributes to be added, even if this expands 
  the size of the union.
  
  This allows future attributes to be added, even if this expands 
  the size of the union.
  
+2.1.2 Status
+
+Complete.
+
  2.2 tdb_traverse Makes Impossible Guarantees
  
  tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions, 
  2.2 tdb_traverse Makes Impossible Guarantees
  
  tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions, 
@@ -148,6 +152,11 @@ occur during your traversal, otherwise you will see some subset.
  You can prevent changes by using a transaction or the locking 
  API.
  
  You can prevent changes by using a transaction or the locking 
  API.
  
+2.2.2 Status
+
+Complete. Delete-during-traverse will still delete every record, 
+too (assuming no other changes).
+
  2.3 Nesting of Transactions Is Fraught
  
  TDB has alternated between allowing nested transactions and not 
  2.3 Nesting of Transactions Is Fraught
  
  TDB has alternated between allowing nested transactions and not 
@@ -182,6 +191,10 @@ However, this behavior can be simulated with a wrapper which uses
  tdb_add_flags() and tdb_remove_flags(), so the API should not be 
  expanded for this relatively-obscure case.
  
  tdb_add_flags() and tdb_remove_flags(), so the API should not be 
  expanded for this relatively-obscure case.
  
+2.3.2 Status
+
+Incomplete; nesting flag is still defined as per tdb1.
+
  2.4 Incorrect Hash Function is Not Detected
  
  tdb_open_ex() allows the calling code to specify a different hash 
  2.4 Incorrect Hash Function is Not Detected
  
  tdb_open_ex() allows the calling code to specify a different hash 
@@ -195,6 +208,10 @@ The header should contain an example hash result (eg. the hash of
  0xdeadbeef), and tdb_open_ex() should check that the given hash 
  function produces the same answer, or fail the tdb_open call.
  
  0xdeadbeef), and tdb_open_ex() should check that the given hash 
  function produces the same answer, or fail the tdb_open call.
  
+2.4.2 Status
+
+Complete.
+
  2.5 tdb_set_max_dead/TDB_VOLATILE Expose Implementation
  
  In response to scalability issues with the free list ([TDB-Freelist-Is]
  2.5 tdb_set_max_dead/TDB_VOLATILE Expose Implementation
  
  In response to scalability issues with the free list ([TDB-Freelist-Is]
@@ -216,6 +233,11 @@ hint that store and delete of records will be at least as common
  as fetch in order to allow some internal tuning, but initially 
  will become a no-op.
  
  as fetch in order to allow some internal tuning, but initially 
  will become a no-op.
  
+2.5.2 Status
+
+Incomplete. TDB_VOLATILE still defined, but implementation should 
+fail on unknown flags to be future-proof.
+
  2.6 <TDB-Files-Cannot>TDB Files Cannot Be Opened Multiple Times 
    In The Same Process
  
  2.6 <TDB-Files-Cannot>TDB Files Cannot Be Opened Multiple Times 
    In The Same Process
  
@@ -251,6 +273,10 @@ whether re-opening is allowed, as though there may be some
  benefit to adding a call to detect when a tdb_context is shared, 
  to allow other to create such an API.
  
  benefit to adding a call to detect when a tdb_context is shared, 
  to allow other to create such an API.
  
+2.6.2 Status
+
+Incomplete.
+
  2.7 TDB API Is Not POSIX Thread-safe
  
  The TDB API uses an error code which can be queried after an 
  2.7 TDB API Is Not POSIX Thread-safe
  
  The TDB API uses an error code which can be queried after an 
@@ -277,7 +303,13 @@ maintained.
  
  The aim is that building tdb with -DTDB_PTHREAD will result in a 
  pthread-safe version of the library, and otherwise no overhead 
  
  The aim is that building tdb with -DTDB_PTHREAD will result in a 
  pthread-safe version of the library, and otherwise no overhead 
-will exist.
+will exist. Alternatively, a hooking mechanism similar to that 
+proposed for [Proposed-Solution-locking-hook] could be used to 
+enable pthread locking at runtime.
+
+2.7.2 Status
+
+Incomplete.
  
  2.8 *_nonblock Functions And *_mark Functions Expose 
    Implementation
  
  2.8 *_nonblock Functions And *_mark Functions Expose 
    Implementation
@@ -341,6 +373,10 @@ locks it doesn't need to obtain.
  It also keeps the complexity out of the API, and in ctdbd where 
  it is needed.
  
  It also keeps the complexity out of the API, and in ctdbd where 
  it is needed.
  
+2.8.2 Status
+
+Incomplete.
+
  2.9 tdb_chainlock Functions Expose Implementation
  
  tdb_chainlock locks some number of records, including the record 
  2.9 tdb_chainlock Functions Expose Implementation
  
  tdb_chainlock locks some number of records, including the record 
@@ -389,6 +425,10 @@ EINVAL if the signal occurs before the kernel is entered,
  otherwise EAGAIN.
  ]
  
  otherwise EAGAIN.
  ]
  
+2.10.2 Status
+
+Incomplete.
+
  2.11 The API Uses Gratuitous Typedefs, Capitals
  
  typedefs are useful for providing source compatibility when types 
  2.11 The API Uses Gratuitous Typedefs, Capitals
  
  typedefs are useful for providing source compatibility when types 
@@ -431,6 +471,10 @@ the tdb_open_ex for logging.
  It should simply take an extra argument, since we are prepared to 
  break the API/ABI.
  
  It should simply take an extra argument, since we are prepared to 
  break the API/ABI.
  
+2.12.2 Status
+
+Complete.
+
  2.13 Various Callback Functions Are Not Typesafe
  
  The callback functions in tdb_set_logging_function (after [tdb_log_func-Doesnt-Take]
  2.13 Various Callback Functions Are Not Typesafe
  
  The callback functions in tdb_set_logging_function (after [tdb_log_func-Doesnt-Take]
@@ -453,6 +497,10 @@ their parameter.
  See CCAN's typesafe_cb module at 
  http://ccan.ozlabs.org/info/typesafe_cb.html
  
  See CCAN's typesafe_cb module at 
  http://ccan.ozlabs.org/info/typesafe_cb.html
  
+2.13.2 Status
+
+Incomplete.
+
  2.14 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens, 
    tdb_reopen_all Problematic
  
  2.14 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens, 
    tdb_reopen_all Problematic
  
@@ -473,6 +521,89 @@ it alone has opened the TDB and will erase it.
  Remove TDB_CLEAR_IF_FIRST. Other workarounds are possible, but 
  see [TDB_CLEAR_IF_FIRST-Imposes-Performance].
  
  Remove TDB_CLEAR_IF_FIRST. Other workarounds are possible, but 
  see [TDB_CLEAR_IF_FIRST-Imposes-Performance].
  
+2.14.2 Status
+
+Incomplete, TDB_CLEAR_IF_FIRST still defined, but not 
+implemented.
+
+2.15 Extending The Header Is Difficult
+
+We have reserved (zeroed) words in the TDB header, which can be 
+used for future features. If the future features are compulsory, 
+the version number must be updated to prevent old code from 
+accessing the database. But if the future feature is optional, we 
+have no way of telling if older code is accessing the database or 
+not.
+
+2.15.1 Proposed Solution
+
+The header should contain a “format variant” value (64-bit). This 
+is divided into two 32-bit parts:
+
+1. The lower part reflects the format variant understood by code 
+  accessing the database.
+
+2. The upper part reflects the format variant you must understand 
+  to write to the database (otherwise you can only open for 
+  reading).
+
+The latter field can only be written at creation time, the former 
+should be written under the OPEN_LOCK when opening the database 
+for writing, if the variant of the code is lower than the current 
+lowest variant.
+
+This should allow backwards-compatible features to be added, and 
+detection if older code (which doesn't understand the feature) 
+writes to the database.
+
+2.15.2 Status
+
+Incomplete.
+
+2.16 Record Headers Are Not Expandible
+
+If we later want to add (say) checksums on keys and data, it 
+would require another format change, which we'd like to avoid.
+
+2.16.1 Proposed Solution
+
+We often have extra padding at the tail of a record. If we ensure 
+that the first byte (if any) of this padding is zero, we will 
+have a way for future changes to detect code which doesn't 
+understand a new format: the new code would write (say) a 1 at 
+the tail, and thus if there is no tail or the first byte is 0, we 
+would know the extension is not present on that record.
+
+2.16.2 Status
+
+Incomplete.
+
+2.17 TDB Does Not Use Talloc
+
+Many users of TDB (particularly Samba) use the talloc allocator, 
+and thus have to wrap TDB in a talloc context to use it 
+conveniently.
+
+2.17.1 Proposed Solution
+
+The allocation within TDB is not complicated enough to justify 
+the use of talloc, and I am reluctant to force another 
+(excellent) library on TDB users. Nonetheless a compromise is 
+possible. An attribute (see [attributes]) can be added later to 
+tdb_open() to provide an alternate allocation mechanism, 
+specifically for talloc but usable by any other allocator (which 
+would ignore the “context” argument).
+
+This would form a talloc heirarchy as expected, but the caller 
+would still have to attach a destructor to the tdb context 
+returned from tdb_open to close it. All TDB_DATA fields would be 
+children of the tdb_context, and the caller would still have to 
+manage them (using talloc_free() or talloc_steal()).
+
+2.17.2 Status
+
+Deferred.
+
  3 Performance And Scalability Issues
  
  3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST 
  3 Performance And Scalability Issues
  
  3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST 
@@ -502,6 +633,10 @@ Remove the flag. It was a neat idea, but even trivial servers
  tend to know when they are initializing for the first time and 
  can simply unlink the old tdb at that point.
  
  tend to know when they are initializing for the first time and 
  can simply unlink the old tdb at that point.
  
+3.1.2 Status
+
+Incomplete; TDB_CLEAR_IF_FIRST still defined, but does nothing.
+
  3.2 TDB Files Have a 4G Limit
  
  This seems to be becoming an issue (so much for “trivial”!), 
  3.2 TDB Files Have a 4G Limit
  
  This seems to be becoming an issue (so much for “trivial”!), 
@@ -528,6 +663,10 @@ Old versions of tdb will fail to open the new TDB files (since 28
  August 2009, commit 398d0c29290: prior to that any unrecognized 
  file format would be erased and initialized as a fresh tdb!)
  
  August 2009, commit 398d0c29290: prior to that any unrecognized 
  file format would be erased and initialized as a fresh tdb!)
  
+3.2.2 Status
+
+Complete.
+
  3.3 TDB Records Have a 4G Limit
  
  This has not been a reported problem, and the API uses size_t 
  3.3 TDB Records Have a 4G Limit
  
  This has not been a reported problem, and the API uses size_t 
@@ -542,6 +681,10 @@ implementation would return TDB_ERR_OOM in a similar case). It
  seems unlikely that 32 bit keys will be a limitation, so the 
  implementation may not support this (see [sub:Records-Incur-A]).
  
  seems unlikely that 32 bit keys will be a limitation, so the 
  implementation may not support this (see [sub:Records-Incur-A]).
  
+3.3.2 Status
+
+Complete.
+
  3.4 Hash Size Is Determined At TDB Creation Time
  
  TDB contains a number of hash chains in the header; the number is 
  3.4 Hash Size Is Determined At TDB Creation Time
  
  TDB contains a number of hash chains in the header; the number is 
@@ -551,7 +694,7 @@ long), that LDB uses 10,000 for this hash. In general it is
  impossible to know what the 'right' answer is at database 
  creation time.
  
  impossible to know what the 'right' answer is at database 
  creation time.
  
-3.4.1 Proposed Solution
+3.4.1 <sub:Hash-Size-Solution>Proposed Solution
  
  After comprehensive performance testing on various scalable hash 
  variants[footnote:
  
  After comprehensive performance testing on various scalable hash 
  variants[footnote:
@@ -559,31 +702,33 @@ http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94
  This was annoying because I was previously convinced that an 
  expanding tree of hashes would be very close to optimal.
  ], it became clear that it is hard to beat a straight linear hash 
  This was annoying because I was previously convinced that an 
  expanding tree of hashes would be very close to optimal.
  ], it became clear that it is hard to beat a straight linear hash 
-table which doubles in size when it reaches saturation. There are 
-three details which become important:
-
-1. On encountering a full bucket, we use the next bucket.
-
-2. Extra hash bits are stored with the offset, to reduce 
-  comparisons.
-
-3. A marker entry is used on deleting an entry.
-
-The doubling of the table must be done under a transaction; we 
-will not reduce it on deletion, so it will be an unusual case. It 
-will either be placed at the head (other entries will be moved 
-out the way so we can expand). We could have a pointer in the 
-header to the current hashtable location, but that pointer would 
-have to be read frequently to check for hashtable moves.
-
-The locking for this is slightly more complex than the chained 
-case; we currently have one lock per bucket, and that means we 
-would need to expand the lock if we overflow to the next bucket. 
-The frequency of such collisions will effect our locking 
-heuristics: we can always lock more buckets than we need.
-
-One possible optimization is to only re-check the hash size on an 
-insert or a lookup miss.
+table which doubles in size when it reaches saturation. 
+Unfortunately, altering the hash table introduces serious locking 
+complications: the entire hash table needs to be locked to 
+enlarge the hash table, and others might be holding locks. 
+Particularly insidious are insertions done under tdb_chainlock.
+
+Thus an expanding layered hash will be used: an array of hash 
+groups, with each hash group exploding into pointers to lower 
+hash groups once it fills, turning into a hash tree. This has 
+implications for locking: we must lock the entire group in case 
+we need to expand it, yet we don't know how deep the tree is at 
+that point.
+
+Note that bits from the hash table entries should be stolen to 
+hold more hash bits to reduce the penalty of collisions. We can 
+use the otherwise-unused lower 3 bits. If we limit the size of 
+the database to 64 exabytes, we can use the top 8 bits of the 
+hash entry as well. These 11 bits would reduce false positives 
+down to 1 in 2000 which is more than we need: we can use one of 
+the bits to indicate that the extra hash bits are valid. This 
+means we can choose not to re-hash all entries when we expand a 
+hash group; simply use the next bits we need and mark them 
+invalid.
+
+3.4.2 Status
+
+Complete.
  
  3.5 <TDB-Freelist-Is>TDB Freelist Is Highly Contended
  
  
  3.5 <TDB-Freelist-Is>TDB Freelist Is Highly Contended
  
@@ -660,50 +805,57 @@ to the size of the hash table, but as it is rare to walk a large
  number of free list entries we can use far fewer, say 1/32 of the 
  number of hash buckets.
  
  number of free list entries we can use far fewer, say 1/32 of the 
  number of hash buckets.
  
+It seems tempting to try to reuse the hash implementation which 
+we use for records here, but we have two ways of searching for 
+free entries: for allocation we search by size (and possibly 
+zone) which produces too many clashes for our hash table to 
+handle well, and for coalescing we search by address. Thus an 
+array of doubly-linked free lists seems preferable.
+
  There are various benefits in using per-size free lists (see [sub:TDB-Becomes-Fragmented]
  ) but it's not clear this would reduce contention in the common 
  case where all processes are allocating/freeing the same size. 
  Thus we almost certainly need to divide in other ways: the most 
  obvious is to divide the file into zones, and using a free list 
  There are various benefits in using per-size free lists (see [sub:TDB-Becomes-Fragmented]
  ) but it's not clear this would reduce contention in the common 
  case where all processes are allocating/freeing the same size. 
  Thus we almost certainly need to divide in other ways: the most 
  obvious is to divide the file into zones, and using a free list 
-(or set of free lists) for each. This approximates address 
+(or table of free lists) for each. This approximates address 
  ordering.
  
  ordering.
  
-Note that this means we need to split the free lists when we 
-expand the file; this is probably acceptable when we double the 
-hash table size, since that is such an expensive operation 
-already. In the case of increasing the file size, there is an 
-optimization we can use: if we use M in the formula above as the 
-file size rounded up to the next power of 2, we only need 
-reshuffle free lists when the file size crosses a power of 2 
-boundary, and reshuffling the free lists is trivial: we simply 
-merge every consecutive pair of free lists.
+Unfortunately it is difficult to know what heuristics should be 
+used to determine zone sizes, and our transaction code relies on 
+being able to create a “recovery area” by simply appending to the 
+file (difficult if it would need to create a new zone header). 
+Thus we use a linked-list of free tables; currently we only ever 
+create one, but if there is more than one we choose one at random 
+to use. In future we may use heuristics to add new free tables on 
+contention. We only expand the file when all free tables are 
+exhausted.
  
  The basic algorithm is as follows. Freeing is simple:
  
  
  The basic algorithm is as follows. Freeing is simple:
  
-1. Identify the correct zone.
+1. Identify the correct free list.
  
  2. Lock the corresponding list.
  
  
  2. Lock the corresponding list.
  
-3. Re-check the zone (we didn't have a lock, sizes could have 
+3. Re-check the list (we didn't have a lock, sizes could have 
    changed): relock if necessary.
  
    changed): relock if necessary.
  
-4. Place the freed entry in the list for that zone.
+4. Place the freed entry in the list.
  
  Allocation is a little more complicated, as we perform delayed 
  coalescing at this point:
  
  
  Allocation is a little more complicated, as we perform delayed 
  coalescing at this point:
  
-1. Pick a zone either the zone we last freed into, or based on a “
-  random” number.
+1. Pick a free table; usually the previous one.
  
  2. Lock the corresponding list.
  
  
  2. Lock the corresponding list.
  
-3. Re-check the zone: relock if necessary.
-
-4. If the top entry is -large enough, remove it from the list and 
+3. If the top entry is -large enough, remove it from the list and 
    return it.
  
    return it.
  
-5. Otherwise, coalesce entries in the list.If there was no entry 
-  large enough, unlock the list and try the next zone.
+4. Otherwise, coalesce entries in the list.If there was no entry 
+  large enough, unlock the list and try the next largest list
+
+5. If no list has an entry which meets our needs, try the next 
+  free table.
  
  6. If no zone satisfies, expand the file.
  
  
  6. If no zone satisfies, expand the file.
  
@@ -714,9 +866,9 @@ ordering seems to be fairly good for keeping fragmentation low
  does not need a tailer to coalesce, though if we needed one we 
  could have one cheaply: see [sub:Records-Incur-A]. 
  
  does not need a tailer to coalesce, though if we needed one we 
  could have one cheaply: see [sub:Records-Incur-A]. 
  
-I anticipate that the number of entries in each free zone would 
-be small, but it might be worth using one free entry to hold 
-pointers to the others for cache efficiency.
+Each free entry has the free table number in the header: less 
+than 255. It also contains a doubly-linked list for easy 
+deletion.
  
  3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented
  
  
  3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented
  
@@ -822,7 +974,9 @@ block:
    this is diminishing returns after a handful of bits (at 10 
    bits, it reduces 99.9% of false memcmp). As an aside, as the 
    lower bits are already incorporated in the hash table 
    this is diminishing returns after a handful of bits (at 10 
    bits, it reduces 99.9% of false memcmp). As an aside, as the 
    lower bits are already incorporated in the hash table 
-  resolution, the upper bits should be used here.
+  resolution, the upper bits should be used here. Note that it's 
+  not clear that these bits will be a win, given the extra bits 
+  in the hash table itself (see [sub:Hash-Size-Solution]).
  
  5. 'magic' does not need to be enlarged: it currently reflects 
    one of 5 values (used, free, dead, recovery, and 
  
  5. 'magic' does not need to be enlarged: it currently reflects 
    one of 5 values (used, free, dead, recovery, and 
@@ -843,13 +997,13 @@ This produces a 16 byte used header like this:
  
  struct tdb_used_record {
  
  
  struct tdb_used_record {
  
-        uint32_t magic : 16,
+        uint32_t used_magic : 16,
+
  
  
-                 prev_is_free: 1,
  
                   key_data_divide: 5,
  
  
                   key_data_divide: 5,
  
-                 top_hash: 10;
+                 top_hash: 11;
  
          uint32_t extra_octets;
  
  
          uint32_t extra_octets;
  
@@ -861,17 +1015,27 @@ And a free record like this:
  
  struct tdb_free_record {
  
  
  struct tdb_free_record {
  
-        uint32_t free_magic;
+        uint64_t free_magic: 8,
  
  
-        uint64_t total_length;
+                   prev : 56;
  
  
-        ...
  
  
-        uint64_t tailer;
+
+        uint64_t free_table: 8,
+
+                 total_length : 56
+
+        uint64_t next;;
  
  };
  
  
  };
  
+Note that by limiting valid offsets to 56 bits, we can pack 
+everything we need into 3 64-byte words, meaning our minimum 
+record size is 8 bytes.
  
  
+3.7.2 Status
+
+Complete.
  
  3.8 Transaction Commit Requires 4 fdatasync
  
  
  3.8 Transaction Commit Requires 4 fdatasync
  
@@ -924,11 +1088,14 @@ but need only be done at open. For running databases, a separate
  header field can be used to indicate a transaction in progress; 
  we need only check for recovery if this is set.
  
  header field can be used to indicate a transaction in progress; 
  we need only check for recovery if this is set.
  
-3.9 <sub:TDB-Does-Not>TDB Does Not Have Snapshot Support
+3.8.2 Status
  
  
-3.9.1 Proposed Solution
+Deferred.
  
  
-None. At some point you say “use a real database”.
+3.9 <sub:TDB-Does-Not>TDB Does Not Have Snapshot Support
+
+3.9.1 Proposed SolutionNone. At some point you say “use a real 
+  database” (but see [replay-attribute]).
  
  But as a thought experiment, if we implemented transactions to 
  only overwrite free entries (this is tricky: there must not be a 
  
  But as a thought experiment, if we implemented transactions to 
  only overwrite free entries (this is tricky: there must not be a 
@@ -947,6 +1114,10 @@ rewrite some sections of the hash, too.
  We could then implement snapshots using a similar method, using 
  multiple different hash tables/free tables.
  
  We could then implement snapshots using a similar method, using 
  multiple different hash tables/free tables.
  
+3.9.2 Status
+
+Deferred.
+
  3.10 Transactions Cannot Operate in Parallel
  
  This would be useless for ldb, as it hits the index records with 
  3.10 Transactions Cannot Operate in Parallel
  
  This would be useless for ldb, as it hits the index records with 
@@ -957,11 +1128,15 @@ failed.
  
  3.10.1 Proposed Solution
  
  
  3.10.1 Proposed Solution
  
-We could solve a small part of the problem by providing read-only 
-transactions. These would allow one write transaction to begin, 
-but it could not commit until all r/o transactions are done. This 
-would require a new RO_TRANSACTION_LOCK, which would be upgraded 
-on commit.
+None (but see [replay-attribute]). We could solve a small part of 
+the problem by providing read-only transactions. These would 
+allow one write transaction to begin, but it could not commit 
+until all r/o transactions are done. This would require a new 
+RO_TRANSACTION_LOCK, which would be upgraded on commit.
+
+3.10.2 Status
+
+Deferred.
  
  3.11 Default Hash Function Is Suboptimal
  
  
  3.11 Default Hash Function Is Suboptimal
  
@@ -984,6 +1159,10 @@ The seed should be created at tdb-creation time from some random
  source, and placed in the header. This is far from foolproof, but 
  adds a little bit of protection against hash bombing.
  
  source, and placed in the header. This is far from foolproof, but 
  adds a little bit of protection against hash bombing.
  
+3.11.2 Status
+
+Complete.
+
  3.12 <Reliable-Traversal-Adds>Reliable Traversal Adds Complexity
  
  We lock a record during traversal iteration, and try to grab that 
  3.12 <Reliable-Traversal-Adds>Reliable Traversal Adds Complexity
  
  We lock a record during traversal iteration, and try to grab that 
@@ -998,6 +1177,10 @@ indefinitely.
  
  Remove reliability guarantees; see [traverse-Proposed-Solution].
  
  
  Remove reliability guarantees; see [traverse-Proposed-Solution].
  
+3.12.2 Status
+
+Complete.
+
  3.13 Fcntl Locking Adds Overhead
  
  Placing a fcntl lock means a system call, as does removing one. 
  3.13 Fcntl Locking Adds Overhead
  
  Placing a fcntl lock means a system call, as does removing one. 
@@ -1056,3 +1239,21 @@ filled). On crash, tdb_open() would examine the array of top
  levels, and apply the transactions until it encountered an 
  invalid checksum.
  
  levels, and apply the transactions until it encountered an 
  invalid checksum.
  
+3.15 Tracing Is Fragile, Replay Is External
+
+The current TDB has compile-time-enabled tracing code, but it 
+often breaks as it is not enabled by default. In a similar way, 
+the ctdb code has an external wrapper which does replay tracing 
+so it can coordinate cluster-wide transactions.
+
+3.15.1 Proposed Solution<replay-attribute>
+
+Tridge points out that an attribute can be later added to 
+tdb_open (see [attributes]) to provide replay/trace hooks, which 
+could become the basis for this and future parallel transactions 
+and snapshot support.
+
+3.15.2 Status
+
+Deferred.
+