Rusty Russell, IBM Corporation
-26-July-2010
+1-December-2010
Abstract
argument. Additional arguments to open would require the
introduction of a tdb_open_ex2 call etc.
-2.1.1 Proposed Solution
+2.1.1 Proposed Solution<attributes>
tdb_open() will take a linked-list of attributes:
This allows future attributes to be added, even if this expands
the size of the union.
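+
+As an illustration, such a chain could look something like the
+sketch below (the structure and field names are illustrative, not
+necessarily the final API): every attribute starts with a common
+base which identifies it and points to the next attribute in the
+list passed to tdb_open().
+
+#include <stdint.h>
+#include <stddef.h>
+
+union tdb_attribute;
+
+struct tdb_attribute_base {
+	uint64_t attr;			/* which attribute this is */
+	union tdb_attribute *next;	/* next in the chain, or NULL */
+};
+
+struct tdb_attribute_hash {
+	struct tdb_attribute_base base;
+	uint64_t (*fn)(const void *key, size_t len, void *priv);
+	void *priv;
+};
+
+union tdb_attribute {
+	struct tdb_attribute_base base;
+	struct tdb_attribute_hash hash;
+	/* future attribute types are added here */
+};
+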
+2.1.2 Status
+
+Complete.
+
2.2 tdb_traverse Makes Impossible Guarantees
tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions,
You can prevent changes by using a transaction or the locking
API.
+2.2.2 Status
+
+Complete. Delete-during-traverse will still delete every record,
+too (assuming no other changes).
+
2.3 Nesting of Transactions Is Fraught
TDB has alternated between allowing nested transactions and not
tdb_add_flags() and tdb_remove_flags(), so the API should not be
expanded for this relatively-obscure case.
+2.3.2 Status
+
+Incomplete; nesting flag is still defined as per tdb1.
+
2.4 Incorrect Hash Function is Not Detected
tdb_open_ex() allows the calling code to specify a different hash
0xdeadbeef), and tdb_open_ex() should check that the given hash
function produces the same answer, or fail the tdb_open call.
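+
+A minimal sketch of that check (the header field name and the
+example input below are assumptions for illustration, not the
+actual tdb header layout):
+
+#include <stdint.h>
+#include <stdbool.h>
+
+struct example_header {
+	uint64_t hash_test;	/* hash of the example value, set at creation */
+};
+
+static bool hash_fn_matches(const struct example_header *hdr,
+			    uint64_t (*hashfn)(const void *, size_t))
+{
+	uint32_t example = 0xdeadbeef;
+
+	/* If the supplied hash function disagrees with the one used
+	 * to create this database, tdb_open should fail rather than
+	 * silently corrupt lookups. */
+	return hdr->hash_test == hashfn(&example, sizeof(example));
+}
+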
+2.4.2 Status
+
+Complete.
+
2.5 tdb_set_max_dead/TDB_VOLATILE Expose Implementation
In response to scalability issues with the free list ([TDB-Freelist-Is]
as fetch in order to allow some internal tuning, but initially
will become a no-op.
+2.5.2 Status
+
+Incomplete. TDB_VOLATILE still defined, but implementation should
+fail on unknown flags to be future-proof.
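+
+For example, such a check at open time might look like this (the
+flag set below is only an example of what an implementation could
+accept):
+
+#define TDB_KNOWN_FLAGS (TDB_DEFAULT | TDB_INTERNAL | TDB_NOLOCK | \
+			 TDB_NOMMAP | TDB_CONVERT | TDB_NOSYNC | \
+			 TDB_SEQNUM)
+
+/* ... then, inside tdb_open(): */
+	if (flags & ~TDB_KNOWN_FLAGS) {
+		/* Refuse flags this version does not understand,
+		 * rather than silently ignoring them. */
+		return NULL;
+	}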
+
2.6 <TDB-Files-Cannot>TDB Files Cannot Be Opened Multiple Times
In The Same Process
benefit to adding a call to detect when a tdb_context is shared,
-to allow other to create such an API.
+to allow others to create such an API.
+2.6.2 Status
+
+Incomplete.
+
2.7 TDB API Is Not POSIX Thread-safe
The TDB API uses an error code which can be queried after an
The aim is that building tdb with -DTDB_PTHREAD will result in a
pthread-safe version of the library, and otherwise no overhead
-will exist.
+will exist. Alternatively, a hooking mechanism similar to that
+proposed for [Proposed-Solution-locking-hook] could be used to
+enable pthread locking at runtime.
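+
+For the compile-time approach, the usual zero-cost pattern would
+be something like this (a sketch, not the actual tdb source):
+
+#ifdef TDB_PTHREAD
+#include <pthread.h>
+#define tdb_mutex_lock(m)	pthread_mutex_lock(m)
+#define tdb_mutex_unlock(m)	pthread_mutex_unlock(m)
+#else
+/* Without -DTDB_PTHREAD the locking calls compile away entirely. */
+#define tdb_mutex_lock(m)	((void)0)
+#define tdb_mutex_unlock(m)	((void)0)
+#endif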
+
+2.7.2 Status
+
+Incomplete.
2.8 *_nonblock Functions And *_mark Functions Expose
Implementation
It also keeps the complexity out of the API, and in ctdbd where
it is needed.
+2.8.2 Status
+
+Incomplete.
+
2.9 tdb_chainlock Functions Expose Implementation
tdb_chainlock locks some number of records, including the record
otherwise EAGAIN.
]
+2.10.2 Status
+
+Incomplete.
+
2.11 The API Uses Gratuitous Typedefs, Capitals
typedefs are useful for providing source compatibility when types
It should simply take an extra argument, since we are prepared to
break the API/ABI.
+2.12.2 Status
+
+Complete.
+
2.13 Various Callback Functions Are Not Typesafe
The callback functions in tdb_set_logging_function (after [tdb_log_func-Doesnt-Take]
See CCAN's typesafe_cb module at
http://ccan.ozlabs.org/info/typesafe_cb.html
+2.13.2 Status
+
+Incomplete.
+
2.14 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens,
tdb_reopen_all Problematic
Remove TDB_CLEAR_IF_FIRST. Other workarounds are possible, but
see [TDB_CLEAR_IF_FIRST-Imposes-Performance].
+2.14.2 Status
+
+Incomplete; TDB_CLEAR_IF_FIRST still defined, but not
+implemented.
+
+2.15 Extending The Header Is Difficult
+
+We have reserved (zeroed) words in the TDB header, which can be
+used for future features. If the future features are compulsory,
+the version number must be updated to prevent old code from
+accessing the database. But if the future feature is optional, we
+have no way of telling if older code is accessing the database or
+not.
+
+2.15.1 Proposed Solution
+
+The header should contain a “format variant” value (64-bit). This
+is divided into two 32-bit parts:
+
+1. The lower part reflects the format variant understood by code
+ accessing the database.
+
+2. The upper part reflects the format variant you must understand
+ to write to the database (otherwise you can only open for
+ reading).
+
+The latter field can only be written at creation time; the former
+should be written under the OPEN_LOCK when opening the database
+for writing, if the variant of the code is lower than the current
+lowest variant.
+
+This should allow backwards-compatible features to be added, and
+detection if older code (which doesn't understand the feature)
+writes to the database.
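+
+A sketch of the open-time logic this implies (the constant and
+field handling below are illustrative, not the final header
+layout):
+
+#include <stdint.h>
+#include <stdbool.h>
+
+#define CODE_VARIANT 1	/* the format variant this code understands */
+
+/* format_variant is the 64-bit header field: the upper 32 bits
+ * are the variant required to write, the lower 32 bits the lowest
+ * variant which has written to the database. */
+static bool check_variant(uint64_t *format_variant, bool for_writing)
+{
+	uint32_t need_to_write = *format_variant >> 32;
+	uint32_t lowest_writer = *format_variant & 0xffffffff;
+
+	if (!for_writing)
+		return true;
+	if (CODE_VARIANT < need_to_write)
+		return false;	/* we may only open read-only */
+
+	/* Done under the OPEN_LOCK: record that older code (us) has
+	 * now written to the database. */
+	if (CODE_VARIANT < lowest_writer)
+		*format_variant = ((uint64_t)need_to_write << 32)
+			| CODE_VARIANT;
+	return true;
+}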
+
+2.15.2 Status
+
+Incomplete.
+
+2.16 Record Headers Are Not Expandible
+
+If we later want to add (say) checksums on keys and data, it
+would require another format change, which we'd like to avoid.
+
+2.16.1 Proposed Solution
+
+We often have extra padding at the tail of a record. If we ensure
+that the first byte (if any) of this padding is zero, we will
+have a way for future changes to detect code which doesn't
+understand a new format: the new code would write (say) a 1 at
+the tail, and thus if there is no tail or the first byte is 0, we
+would know the extension is not present on that record.
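+
+A sketch of how code could test for such an extension on a record
+(the names are invented for illustration):
+
+#include <stdbool.h>
+#include <stddef.h>
+
+static bool has_tail_extension(const unsigned char *rec,
+			       size_t used_len, size_t alloc_len)
+{
+	/* No tail padding at all: the extension cannot be present. */
+	if (alloc_len == used_len)
+		return false;
+	/* Current code always zeroes this byte; code which knows
+	 * about the extension writes a non-zero marker here. */
+	return rec[used_len] != 0;
+}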
+
+2.16.2 Status
+
+Incomplete.
+
+2.17 TDB Does Not Use Talloc
+
+Many users of TDB (particularly Samba) use the talloc allocator,
+and thus have to wrap TDB in a talloc context to use it
+conveniently.
+
+2.17.1 Proposed Solution
+
+The allocation within TDB is not complicated enough to justify
+the use of talloc, and I am reluctant to force another
+(excellent) library on TDB users. Nonetheless a compromise is
+possible. An attribute (see [attributes]) can be added later to
+tdb_open() to provide an alternate allocation mechanism,
+specifically for talloc but usable by any other allocator (which
+would ignore the “context” argument).
+
+This would form a talloc hierarchy as expected, but the caller
+would still have to attach a destructor to the tdb context
+returned from tdb_open to close it. All TDB_DATA fields would be
+children of the tdb_context, and the caller would still have to
+manage them (using talloc_free() or talloc_steal()).
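+
+Such an allocator attribute might look something like this (a
+sketch only, reusing the attribute base sketched in [attributes];
+none of these names are a committed API):
+
+struct tdb_attribute_allocator {
+	struct tdb_attribute_base base;
+	/* malloc/free-style hooks: a talloc wrapper would parent
+	 * allocations on ctx, any other allocator simply ignores it. */
+	void *(*alloc)(size_t len, void *ctx);
+	void (*free)(void *ptr, void *ctx);
+	void *ctx;	/* e.g. a talloc context */
+};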
+
+2.17.2 Status
+
+Deferred.
+
3 Performance And Scalability Issues
3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST
tend to know when they are initializing for the first time and
can simply unlink the old tdb at that point.
+3.1.2 Status
+
+Incomplete; TDB_CLEAR_IF_FIRST still defined, but does nothing.
+
3.2 TDB Files Have a 4G Limit
This seems to be becoming an issue (so much for “trivial”!),
August 2009, commit 398d0c29290: prior to that any unrecognized
file format would be erased and initialized as a fresh tdb!)
+3.2.2 Status
+
+Complete.
+
3.3 TDB Records Have a 4G Limit
This has not been a reported problem, and the API uses size_t
seems unlikely that 32 bit keys will be a limitation, so the
implementation may not support this (see [sub:Records-Incur-A]).
+3.3.2 Status
+
+Complete.
+
3.4 Hash Size Is Determined At TDB Creation Time
TDB contains a number of hash chains in the header; the number is
impossible to know what the 'right' answer is at database
creation time.
-3.4.1 Proposed Solution
+3.4.1 <sub:Hash-Size-Solution>Proposed Solution
After comprehensive performance testing on various scalable hash
variants[footnote:
This was annoying because I was previously convinced that an
expanding tree of hashes would be very close to optimal.
], it became clear that it is hard to beat a straight linear hash
-table which doubles in size when it reaches saturation. There are
-three details which become important:
-
-1. On encountering a full bucket, we use the next bucket.
-
-2. Extra hash bits are stored with the offset, to reduce
- comparisons.
-
-3. A marker entry is used on deleting an entry.
-
-The doubling of the table must be done under a transaction; we
-will not reduce it on deletion, so it will be an unusual case. It
-will either be placed at the head (other entries will be moved
-out the way so we can expand). We could have a pointer in the
-header to the current hashtable location, but that pointer would
-have to be read frequently to check for hashtable moves.
-
-The locking for this is slightly more complex than the chained
-case; we currently have one lock per bucket, and that means we
-would need to expand the lock if we overflow to the next bucket.
-The frequency of such collisions will effect our locking
-heuristics: we can always lock more buckets than we need.
-
-One possible optimization is to only re-check the hash size on an
-insert or a lookup miss.
+table which doubles in size when it reaches saturation.
+Unfortunately, altering the hash table introduces serious locking
+complications: the entire hash table needs to be locked to
+enlarge it, and other processes might already be holding locks.
+Particularly insidious are insertions done under tdb_chainlock.
+
+Thus an expanding layered hash will be used: an array of hash
+groups, with each hash group exploding into pointers to lower
+hash groups once it fills, turning into a hash tree. This has
+implications for locking: we must lock the entire group in case
+we need to expand it, yet we don't know how deep the tree is at
+that point.
+
+Note that bits from the hash table entries should be stolen to
+hold more hash bits to reduce the penalty of collisions. We can
+use the otherwise-unused lower 3 bits. If we limit the size of
+the database to 64 petabytes (2^56 bytes), we can use the top 8
+bits of the hash entry as well. These 11 bits would reduce false
+positives to about 1 in 2000, which is more than we need: we can
+use one of the bits to indicate that the extra hash bits are
+valid. This
+means we can choose not to re-hash all entries when we expand a
+hash group; simply use the next bits we need and mark them
+invalid.
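+
+As a sketch of the bit-stealing (the exact bit layout here is
+illustrative): entries are 8-byte-aligned offsets below 2^56, so
+the low 3 bits and top 8 bits are free.
+
+#include <stdint.h>
+#include <stdbool.h>
+
+#define ENTRY_OFF_MASK	0x00fffffffffffff8ULL	/* the record offset */
+#define ENTRY_EXTRA_OK	0x0000000000000001ULL	/* extra hash bits valid? */
+#define ENTRY_LOW_HASH	0x0000000000000006ULL	/* 2 extra hash bits */
+#define ENTRY_TOP_HASH	0xff00000000000000ULL	/* 8 extra hash bits */
+
+static bool entry_might_match(uint64_t entry, uint64_t hash)
+{
+	uint64_t want;
+
+	/* Not re-hashed (e.g. after its group expanded): we cannot
+	 * rule it out without comparing the actual key. */
+	if (!(entry & ENTRY_EXTRA_OK))
+		return true;
+
+	/* Compare 10 extra hash bits: 2 in the low bits, 8 at the top. */
+	want = ((hash & 0x3) << 1) | ((hash & 0x3fc) << 54);
+	return (entry & (ENTRY_LOW_HASH | ENTRY_TOP_HASH)) == want;
+}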
+
+3.4.2 Status
+
+Complete.
3.5 <TDB-Freelist-Is>TDB Freelist Is Highly Contended
number of free list entries we can use far fewer, say 1/32 of the
number of hash buckets.
+It seems tempting to try to reuse the hash implementation which
+we use for records here, but we have two ways of searching for
+free entries: for allocation we search by size (and possibly
+zone) which produces too many clashes for our hash table to
+handle well, and for coalescing we search by address. Thus an
+array of doubly-linked free lists seems preferable.
+
There are various benefits in using per-size free lists (see [sub:TDB-Becomes-Fragmented]
) but it's not clear this would reduce contention in the common
case where all processes are allocating/freeing the same size.
Thus we almost certainly need to divide in other ways: the most
obvious is to divide the file into zones, and using a free list
-(or set of free lists) for each. This approximates address
+(or table of free lists) for each. This approximates address
ordering.
-Note that this means we need to split the free lists when we
-expand the file; this is probably acceptable when we double the
-hash table size, since that is such an expensive operation
-already. In the case of increasing the file size, there is an
-optimization we can use: if we use M in the formula above as the
-file size rounded up to the next power of 2, we only need
-reshuffle free lists when the file size crosses a power of 2
-boundary, and reshuffling the free lists is trivial: we simply
-merge every consecutive pair of free lists.
+Unfortunately it is difficult to know what heuristics should be
+used to determine zone sizes, and our transaction code relies on
+being able to create a “recovery area” by simply appending to the
+file (difficult if it would need to create a new zone header).
+Thus we use a linked-list of free tables; currently we only ever
+create one, but if there is more than one we choose one at random
+to use. In future we may use heuristics to add new free tables on
+contention. We only expand the file when all free tables are
+exhausted.
The basic algorithm is as follows. Freeing is simple:
-1. Identify the correct zone.
+1. Identify the correct free list.
2. Lock the corresponding list.
-3. Re-check the zone (we didn't have a lock, sizes could have
+3. Re-check the list (we didn't have a lock, sizes could have
changed): relock if necessary.
-4. Place the freed entry in the list for that zone.
+4. Place the freed entry in the list.
Allocation is a little more complicated, as we perform delayed
coalescing at this point:
-1. Pick a zone either the zone we last freed into, or based on a “
- random” number.
+1. Pick a free table; usually the previous one.
2. Lock the corresponding list.
-3. Re-check the zone: relock if necessary.
-
-4. If the top entry is -large enough, remove it from the list and
+3. If the top entry is large enough, remove it from the list and
return it.
-5. Otherwise, coalesce entries in the list.If there was no entry
- large enough, unlock the list and try the next zone.
+4. Otherwise, coalesce entries in the list. If there was no entry
+   large enough, unlock the list and try the next largest list.
+
+5. If no list has an entry which meets our needs, try the next
+ free table.
-6. If no zone satisfies, expand the file.
+6. If no free table satisfies, expand the file.
does not need a tailer to coalesce, though if we needed one we
could have one cheaply: see [sub:Records-Incur-A].
-I anticipate that the number of entries in each free zone would
-be small, but it might be worth using one free entry to hold
-pointers to the others for cache efficiency.
+Each free entry has the free table number in its header (a value
+less than 255). It also contains a doubly-linked list for easy
+deletion.
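+
+Putting the allocation steps above together, the search looks
+roughly like this (pseudo-C: every helper and field here is
+assumed for illustration, not part of the real API):
+
+static tdb_off_t alloc_from_ftables(struct tdb_context *tdb, size_t size)
+{
+	unsigned int ftable = tdb->last_ftable;	/* usually the previous one */
+	unsigned int i, list;
+
+	for (i = 0; i < tdb->num_ftables; i++) {
+		/* Start at the list for this size, then try larger lists. */
+		for (list = size_to_list(size); list < NUM_LISTS; list++) {
+			tdb_off_t off;
+
+			lock_free_list(tdb, ftable, list);
+			off = pop_if_large_enough(tdb, ftable, list, size);
+			if (!off) {
+				coalesce_list(tdb, ftable, list);
+				off = pop_if_large_enough(tdb, ftable, list,
+							  size);
+			}
+			unlock_free_list(tdb, ftable, list);
+			if (off)
+				return off;
+		}
+		ftable = next_ftable(tdb, ftable);	/* try another free table */
+	}
+	/* No free table could satisfy the request: expand the file. */
+	return expand_and_retry(tdb, size);
+}
+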
3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented
this is diminishing returns after a handful of bits (at 10
bits, it reduces 99.9% of false memcmp). As an aside, as the
lower bits are already incorporated in the hash table
- resolution, the upper bits should be used here.
+ resolution, the upper bits should be used here. Note that it's
+ not clear that these bits will be a win, given the extra bits
+ in the hash table itself (see [sub:Hash-Size-Solution]).
5. 'magic' does not need to be enlarged: it currently reflects
one of 5 values (used, free, dead, recovery, and
struct tdb_used_record {
- uint32_t magic : 16,
+ uint32_t used_magic : 16,
+
- prev_is_free: 1,
key_data_divide: 5,
- top_hash: 10;
+ top_hash: 11;
uint32_t extra_octets;
struct tdb_free_record {
- uint32_t free_magic;
+ uint64_t free_magic: 8,
- uint64_t total_length;
+ prev : 56;
- ...
- uint64_t tailer;
+
+	uint64_t free_table: 8,
+		 total_length : 56;
+
+	uint64_t next;
};
+Note that by limiting valid offsets to 56 bits, we can pack
+everything we need into 3 64-bit words, meaning our minimum
+record size is 8 bytes.
+3.7.2 Status
+
+Complete.
3.8 Transaction Commit Requires 4 fdatasync
header field can be used to indicate a transaction in progress;
we need only check for recovery if this is set.
-3.9 <sub:TDB-Does-Not>TDB Does Not Have Snapshot Support
+3.8.2 Status
-3.9.1 Proposed Solution
+Deferred.
-None. At some point you say “use a real database”.
+3.9 <sub:TDB-Does-Not>TDB Does Not Have Snapshot Support
+
+3.9.1 Proposed Solution
+
+None. At some point you say “use a real database” (but see
+[replay-attribute]).
But as a thought experiment, if we implemented transactions to
only overwrite free entries (this is tricky: there must not be a
We could then implement snapshots using a similar method, using
multiple different hash tables/free tables.
+3.9.2 Status
+
+Deferred.
+
3.10 Transactions Cannot Operate in Parallel
This would be useless for ldb, as it hits the index records with
3.10.1 Proposed Solution
-We could solve a small part of the problem by providing read-only
-transactions. These would allow one write transaction to begin,
-but it could not commit until all r/o transactions are done. This
-would require a new RO_TRANSACTION_LOCK, which would be upgraded
-on commit.
+None (but see [replay-attribute]). We could solve a small part of
+the problem by providing read-only transactions. These would
+allow one write transaction to begin, but it could not commit
+until all r/o transactions are done. This would require a new
+RO_TRANSACTION_LOCK, which would be upgraded on commit.
+
+3.10.2 Status
+
+Deferred.
3.11 Default Hash Function Is Suboptimal
source, and placed in the header. This is far from foolproof, but
adds a little bit of protection against hash bombing.
+3.11.2 Status
+
+Complete.
+
3.12 <Reliable-Traversal-Adds>Reliable Traversal Adds Complexity
We lock a record during traversal iteration, and try to grab that
Remove reliability guarantees; see [traverse-Proposed-Solution].
+3.12.2 Status
+
+Complete.
+
3.13 Fcntl Locking Adds Overhead
Placing a fcntl lock means a system call, as does removing one.
levels, and apply the transactions until it encountered an
invalid checksum.
+3.15 Tracing Is Fragile, Replay Is External
+
+The current TDB has compile-time-enabled tracing code, but it
+often breaks as it is not enabled by default. In a similar way,
+the ctdb code has an external wrapper which does replay tracing
+so it can coordinate cluster-wide transactions.
+
+3.15.1 Proposed Solution<replay-attribute>
+
+Tridge points out that an attribute can be later added to
+tdb_open (see [attributes]) to provide replay/trace hooks, which
+could become the basis for this and future parallel transactions
+and snapshot support.
+
+3.15.2 Status
+
+Deferred.
+