Rusty Russell, IBM Corporation
-1-September-2010
+1-December-2010
Abstract
argument. Additional arguments to open would require the
introduction of a tdb_open_ex2 call etc.
-2.1.1 Proposed Solution
+2.1.1 Proposed Solution<attributes>
tdb_open() will take a linked-list of attributes:
This allows future attributes to be added, even if this expands
the size of the union.
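As a sketch of the pattern (names here are illustrative, not the final API), each attribute begins with a common base holding its type and a pointer to the next attribute, so the union can grow later without breaking existing callers:

```c
#include <stddef.h>

/* Hypothetical attribute types for illustration only. */
enum tdb_attribute_type {
	TDB_ATTRIBUTE_LOG = 0,
	TDB_ATTRIBUTE_HASH = 1,
};

/* Every attribute starts with this base, so the list can be walked
 * without knowing each attribute's full contents. */
struct tdb_attribute_base {
	enum tdb_attribute_type attr;
	union tdb_attribute *next;
};

struct tdb_attribute_hash {
	struct tdb_attribute_base base;	/* .attr = TDB_ATTRIBUTE_HASH */
	unsigned long (*hash_fn)(const void *key, size_t len, void *priv);
	void *hash_private;
};

union tdb_attribute {
	struct tdb_attribute_base base;
	struct tdb_attribute_hash hash;
};

/* Walk the list looking for an attribute of the given type. */
static const union tdb_attribute *
find_attribute(const union tdb_attribute *attr, enum tdb_attribute_type type)
{
	while (attr) {
		if (attr->base.attr == type)
			return attr;
		attr = attr->base.next;
	}
	return NULL;
}
```

Adding a new attribute kind is then only a new enum value and a new union member; callers which never use it are unaffected even if the union grows.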
+2.1.2 Status
+
+Complete.
+
2.2 tdb_traverse Makes Impossible Guarantees
tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions,
You can prevent changes by using a transaction or the locking
API.
+2.2.2 Status
+
+Complete. Delete-during-traverse will still delete every record,
+too (assuming no other changes).
+
2.3 Nesting of Transactions Is Fraught
TDB has alternated between allowing nested transactions and not
tdb_add_flags() and tdb_remove_flags(), so the API should not be
expanded for this relatively-obscure case.
+2.3.2 Status
+
+Incomplete; nesting flag is still defined as per tdb1.
+
2.4 Incorrect Hash Function is Not Detected
tdb_open_ex() allows the calling code to specify a different hash
0xdeadbeef), and tdb_open_ex() should check that the given hash
function produces the same answer, or fail the tdb_open call.
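A minimal sketch of the check (field and function names are illustrative, and the two toy hash functions exist only for demonstration): the hash of a fixed value is stored in the header at creation, and open recomputes it with the caller's hash function and compares:

```c
#include <stdint.h>
#include <stddef.h>

#define TDB_HASH_CHECK_VALUE 0xdeadbeefULL

typedef uint64_t (*tdb_hash_fn)(const void *key, size_t len);

struct header_sketch {
	uint64_t hash_test;	/* hash of TDB_HASH_CHECK_VALUE at create time */
};

static void header_init(struct header_sketch *hdr, tdb_hash_fn hash)
{
	uint64_t v = TDB_HASH_CHECK_VALUE;
	hdr->hash_test = hash(&v, sizeof(v));
}

/* Returns 0 if the supplied hash function matches the creator's,
 * -1 otherwise (the open call would then fail). */
static int check_hash(const struct header_sketch *hdr, tdb_hash_fn hash)
{
	uint64_t v = TDB_HASH_CHECK_VALUE;
	return hdr->hash_test == hash(&v, sizeof(v)) ? 0 : -1;
}

/* Two toy hash functions, for demonstration only. */
static uint64_t hash_fnv(const void *key, size_t len)
{
	const unsigned char *p = key;
	uint64_t h = 14695981039346656037ULL;
	while (len--) { h ^= *p++; h *= 1099511628211ULL; }
	return h;
}

static uint64_t hash_sum(const void *key, size_t len)
{
	const unsigned char *p = key;
	uint64_t h = 0;
	while (len--) h += *p++;
	return h;
}
```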
+2.4.2 Status
+
+Complete.
+
2.5 tdb_set_max_dead/TDB_VOLATILE Expose Implementation
In response to scalability issues with the free list ([TDB-Freelist-Is]
as fetch in order to allow some internal tuning, but initially
will become a no-op.
+2.5.2 Status
+
+Incomplete. TDB_VOLATILE still defined, but implementation should
+fail on unknown flags to be future-proof.
+
2.6 <TDB-Files-Cannot>TDB Files Cannot Be Opened Multiple Times
In The Same Process
benefit to adding a call to detect when a tdb_context is shared,
to allow others to create such an API.
+2.6.2 Status
+
+Incomplete.
+
2.7 TDB API Is Not POSIX Thread-safe
The TDB API uses an error code which can be queried after an
The aim is that building tdb with -DTDB_PTHREAD will result in a
pthread-safe version of the library, and otherwise no overhead
-will exist.
+will exist. Alternatively, a hooking mechanism similar to that
+proposed for [Proposed-Solution-locking-hook] could be used to
+enable pthread locking at runtime.
+
+2.7.2 Status
+
+Incomplete.
2.8 *_nonblock Functions And *_mark Functions Expose
Implementation
It also keeps the complexity out of the API, and in ctdbd where
it is needed.
+2.8.2 Status
+
+Incomplete.
+
2.9 tdb_chainlock Functions Expose Implementation
tdb_chainlock locks some number of records, including the record
otherwise EAGAIN.
]
+2.10.2 Status
+
+Incomplete.
+
2.11 The API Uses Gratuitous Typedefs, Capitals
typedefs are useful for providing source compatibility when types
It should simply take an extra argument, since we are prepared to
break the API/ABI.
+2.12.2 Status
+
+Complete.
+
2.13 Various Callback Functions Are Not Typesafe
The callback functions in tdb_set_logging_function (after [tdb_log_func-Doesnt-Take]
See CCAN's typesafe_cb module at
http://ccan.ozlabs.org/info/typesafe_cb.html
+2.13.2 Status
+
+Incomplete.
+
2.14 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens,
tdb_reopen_all Problematic
Remove TDB_CLEAR_IF_FIRST. Other workarounds are possible, but
see [TDB_CLEAR_IF_FIRST-Imposes-Performance].
+2.14.2 Status
+
+Incomplete; TDB_CLEAR_IF_FIRST still defined, but not
+implemented.
+
+2.15 Extending The Header Is Difficult
+
+We have reserved (zeroed) words in the TDB header, which can be
+used for future features. If the future features are compulsory,
+the version number must be updated to prevent old code from
+accessing the database. But if the future feature is optional, we
+have no way of telling if older code is accessing the database or
+not.
+
+2.15.1 Proposed Solution
+
+The header should contain a “format variant” value (64-bit). This
+is divided into two 32-bit parts:
+
+1. The lower part reflects the format variant understood by code
+ accessing the database.
+
+2. The upper part reflects the format variant you must understand
+ to write to the database (otherwise you can only open for
+ reading).
+
+The latter field can only be written at creation time, the former
+should be written under the OPEN_LOCK when opening the database
+for writing, if the variant of the code is lower than the current
+lowest variant.
+
+This should allow backwards-compatible features to be added, and
+detection if older code (which doesn't understand the feature)
+writes to the database.
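The split described above can be sketched as follows (a sketch only: the constant and helper names are hypothetical, and the code variant is a compile-time example value):

```c
#include <stdint.h>
#include <stdbool.h>

#define CODE_VARIANT 2u	/* format variant this code understands (example) */

/* Upper 32 bits: variant you must understand to write (set at creation). */
static uint32_t variant_to_write(uint64_t format_variant)
{
	return (uint32_t)(format_variant >> 32);
}

/* Lower 32 bits: lowest variant that has written the database. */
static uint32_t lowest_writer_variant(uint64_t format_variant)
{
	return (uint32_t)format_variant;
}

static bool can_write(uint64_t format_variant)
{
	return CODE_VARIANT >= variant_to_write(format_variant);
}

/* On opening for write (under the open lock): record ourselves as the
 * lowest-variant writer if we are older than any writer seen so far. */
static uint64_t note_writer(uint64_t format_variant)
{
	if (can_write(format_variant) &&
	    CODE_VARIANT < lowest_writer_variant(format_variant)) {
		format_variant &= ~0xffffffffULL;
		format_variant |= CODE_VARIANT;
	}
	return format_variant;
}
```

Newer code can then inspect the lower field to detect whether older code (which doesn't understand an optional feature) has written to the database since it was created.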
+
+2.15.2 Status
+
+Incomplete.
+
+2.16 Record Headers Are Not Expandable
+
+If we later want to add (say) checksums on keys and data, it
+would require another format change, which we'd like to avoid.
+
+2.16.1 Proposed Solution
+
+We often have extra padding at the tail of a record. If we ensure
+that the first byte (if any) of this padding is zero, we will
+have a way for future changes to detect code which doesn't
+understand a new format: the new code would write (say) a 1 at
+the tail, and thus if there is no tail or the first byte is 0, we
+would know the extension is not present on that record.
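The tail-byte scheme can be sketched as follows (structure and field names are illustrative, not the on-disk format): a record whose allocated length exceeds its used length has padding, and the first padding byte signals whether an extension is present:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

struct record_sketch {
	uint64_t alloc_len;	/* space allocated for key+data */
	uint64_t used_len;	/* space actually used */
	unsigned char body[64];
};

static void record_init(struct record_sketch *r, uint64_t alloc, uint64_t used)
{
	r->alloc_len = alloc;
	r->used_len = used;
	memset(r->body, 0xaa, sizeof(r->body));	/* simulate stale padding */
	if (used < alloc)
		r->body[used] = 0;	/* first padding byte: "no extension" */
}

/* A future format would write a nonzero marker at body[used_len];
 * no tail, or a zero tail byte, means the extension is absent. */
static bool has_extension(const struct record_sketch *r)
{
	return r->used_len < r->alloc_len && r->body[r->used_len] != 0;
}
```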
+
+2.16.2 Status
+
+Incomplete.
+
+2.17 TDB Does Not Use Talloc
+
+Many users of TDB (particularly Samba) use the talloc allocator,
+and thus have to wrap TDB in a talloc context to use it
+conveniently.
+
+2.17.1 Proposed Solution
+
+The allocation within TDB is not complicated enough to justify
+the use of talloc, and I am reluctant to force another
+(excellent) library on TDB users. Nonetheless a compromise is
+possible. An attribute (see [attributes]) can be added later to
+tdb_open() to provide an alternate allocation mechanism,
+specifically for talloc but usable by any other allocator (which
+would ignore the “context” argument).
+
+This would form a talloc hierarchy as expected, but the caller
+would still have to attach a destructor to the tdb context
+returned from tdb_open to close it. All TDB_DATA fields would be
+children of the tdb_context, and the caller would still have to
+manage them (using talloc_free() or talloc_steal()).
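Such an attribute might look like the following sketch (the structure and callback signatures are hypothetical, not a committed API); a talloc user would supply talloc-based callbacks, while the defaults shown here simply ignore the context argument:

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical allocator attribute: alloc gets a parent context
 * (meaningful to talloc, ignored by plain malloc) plus private data. */
struct tdb_attribute_allocator_sketch {
	void *(*alloc)(const void *ctx, size_t len, void *priv);
	void (*free_fn)(void *ptr, void *priv);
	void *priv;
};

/* Default callbacks: plain malloc/free, context ignored. */
static void *default_alloc(const void *ctx, size_t len, void *priv)
{
	(void)ctx;
	(void)priv;
	return malloc(len);
}

static void default_free(void *ptr, void *priv)
{
	(void)priv;
	free(ptr);
}

static const struct tdb_attribute_allocator_sketch default_allocator = {
	default_alloc, default_free, NULL
};
```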
+
+2.17.2 Status
+
+Deferred.
+
3 Performance And Scalability Issues
3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST
tend to know when they are initializing for the first time and
can simply unlink the old tdb at that point.
+3.1.2 Status
+
+Incomplete; TDB_CLEAR_IF_FIRST still defined, but does nothing.
+
3.2 TDB Files Have a 4G Limit
This seems to be becoming an issue (so much for “trivial”!),
August 2009, commit 398d0c29290: prior to that any unrecognized
file format would be erased and initialized as a fresh tdb!)
+3.2.2 Status
+
+Complete.
+
3.3 TDB Records Have a 4G Limit
This has not been a reported problem, and the API uses size_t
seems unlikely that 32 bit keys will be a limitation, so the
implementation may not support this (see [sub:Records-Incur-A]).
+3.3.2 Status
+
+Complete.
+
3.4 Hash Size Is Determined At TDB Creation Time
TDB contains a number of hash chains in the header; the number is
expanding tree of hashes would be very close to optimal.
], it became clear that it is hard to beat a straight linear hash
table which doubles in size when it reaches saturation.
-
-1.
-
-2.
-
-3.
-
-
-
-
-
- Unfortunately, altering the hash table introduces serious
-locking complications: the entire hash table needs to be locked
-to enlarge the hash table, and others might be holding locks.
+Unfortunately, altering the hash table introduces serious locking
+complications: the entire hash table needs to be locked to
+enlarge the hash table, and others might be holding locks.
Particularly insidious are insertions done under tdb_chainlock.
Thus an expanding layered hash will be used: an array of hash
hash group; simply use the next bits we need and mark them
invalid.
+3.4.2 Status
+
+Complete.
+
3.5 <TDB-Freelist-Is>TDB Freelist Is Highly Contended
TDB uses a single linked list for the free list. Allocation
case where all processes are allocating/freeing the same size.
Thus we almost certainly need to divide in other ways: the most
obvious is to divide the file into zones, and using a free list
-(or set of free lists) for each. This approximates address
+(or table of free lists) for each. This approximates address
ordering.
-Note that this means we need to split the free lists when we
-expand the file; this is probably acceptable when we double the
-hash table size, since that is such an expensive operation
-already. In the case of increasing the file size, there is an
-optimization we can use: if we use M in the formula above as the
-file size rounded up to the next power of 2, we only need
-reshuffle free lists when the file size crosses a power of 2
-boundary, and reshuffling the free lists is trivial: we simply
-merge every consecutive pair of free lists.
+Unfortunately it is difficult to know what heuristics should be
+used to determine zone sizes, and our transaction code relies on
+being able to create a “recovery area” by simply appending to the
+file (difficult if it would need to create a new zone header).
+Thus we use a linked-list of free tables; currently we only ever
+create one, but if there is more than one we choose one at random
+to use. In future we may use heuristics to add new free tables on
+contention. We only expand the file when all free tables are
+exhausted.
The basic algorithm is as follows. Freeing is simple:
-1. Identify the correct zone.
+1. Identify the correct free list.
2. Lock the corresponding list.
-3. Re-check the zone (we didn't have a lock, sizes could have
+3. Re-check the list (we didn't have a lock, sizes could have
changed): relock if necessary.
-4. Place the freed entry in the list for that zone.
+4. Place the freed entry in the list.
Allocation is a little more complicated, as we perform delayed
coalescing at this point:
-1. Pick a zone either the zone we last freed into, or based on a “
- random” number.
+1. Pick a free table; usually the previous one.
2. Lock the corresponding list.
-3. Re-check the zone: relock if necessary.
-
-4. If the top entry is -large enough, remove it from the list and
+3. If the top entry is large enough, remove it from the list and
+   return it.
-5. Otherwise, coalesce entries in the list.If there was no entry
- large enough, unlock the list and try the next zone.
+4. Otherwise, coalesce entries in the list. If there was no entry
+   large enough, unlock the list and try the next largest list.
+
+5. If no list has an entry which meets our needs, try the next
+ free table.
-6. If no zone satisfies, expand the file.
+6. If no free table satisfies, expand the file.
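The allocation walk can be sketched in memory as follows; the `free_table`/`free_entry` structures, the bucket scheme, and `alloc_from_tables` are toy illustrations (locking and coalescing elided), not the on-disk layout:

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_LISTS 4	/* size-bucketed free lists per table (example) */

struct free_entry {
	uint64_t len;
	struct free_entry *next;
};

struct free_table {
	struct free_entry *list[NUM_LISTS];
	struct free_table *next;
};

/* Toy bucketing: roughly log2 of the length, capped at NUM_LISTS-1. */
static unsigned bucket_for(uint64_t len)
{
	unsigned b = 0;
	while ((len >>= 1) && b + 1 < NUM_LISTS)
		b++;
	return b;
}

static struct free_entry *
alloc_from_tables(struct free_table *t, uint64_t need)
{
	for (; t; t = t->next) {		/* step 5: next free table */
		unsigned b;
		for (b = bucket_for(need); b < NUM_LISTS; b++) {
			struct free_entry **e = &t->list[b];
			/* Step 3: top entry large enough? Remove and return. */
			if (*e && (*e)->len >= need) {
				struct free_entry *r = *e;
				*e = r->next;
				return r;
			}
			/* Step 4: otherwise try the next largest list. */
		}
	}
	return NULL;	/* step 6: caller expands the file */
}
```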
does not need a tailer to coalesce, though if we needed one we
could have one cheaply: see [sub:Records-Incur-A].
-I anticipate that the number of entries in each free zone would
-be small, but it might be worth using one free entry to hold
-pointers to the others for cache efficiency.
-
-<freelist-in-zone>If we want to avoid locking complexity
-(enlarging the free lists when we enlarge the file) we could
-place the array of free lists at the beginning of each zone. This
-means existing array lists never move, but means that a record
-cannot be larger than a zone. That in turn implies that zones
-should be variable sized (say, power of 2), which makes the
-question “what zone is this record in?” much harder (and “pick a
-random zone”, but that's less common). It could be done with as
-few as 4 bits from the record header.[footnote:
-Using 2^{16+N*3}means 0 gives a minimal 65536-byte zone, 15 gives
-the maximal 2^{61} byte zone. Zones range in factor of 8 steps.
-]
+Each free entry has the free table number in the header: less
+than 255. It also contains a doubly-linked list for easy
+deletion.
3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented
struct tdb_used_record {
-        uint32_t magic : 16,
+        uint32_t used_magic : 16,
-                 prev_is_free: 1,
                  key_data_divide: 5,
-                 top_hash: 10;
+                 top_hash: 11;
         uint32_t extra_octets;
struct tdb_free_record {
-        uint32_t free_magic;
-        uint64_t total_length;
-        uint64_t prev, next;
-        ...
-        uint64_t tailer;
+        uint64_t free_magic : 8,
+                 prev : 56;
+        uint64_t free_table : 8,
+                 total_length : 56;
+        uint64_t next;
};
-We might want to take some bits from the used record's top_hash
-(and the free record which has 32 bits of padding to spare
-anyway) if we use variable sized zones. See [freelist-in-zone].
+Note that by limiting valid offsets to 56 bits, we can pack
+everything we need into 3 64-bit words, meaning our minimum
+record size is 8 bytes.
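The 8/56-bit split in the free record's words can be illustrated with a small packing sketch (helper names are mine, for illustration): an 8-bit field such as the free magic or free table number shares a 64-bit word with a 56-bit offset or length:

```c
#include <stdint.h>

#define MASK56 ((1ULL << 56) - 1)

/* Pack an 8-bit tag and a 56-bit value into one 64-bit word. */
static uint64_t pack8_56(uint8_t hi, uint64_t lo56)
{
	return ((uint64_t)hi << 56) | (lo56 & MASK56);
}

static uint8_t unpack_hi8(uint64_t word)
{
	return (uint8_t)(word >> 56);
}

static uint64_t unpack_lo56(uint64_t word)
{
	return word & MASK56;
}
```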
+
+3.7.2 Status
+
+Complete.
3.8 Transaction Commit Requires 4 fdatasync
header field can be used to indicate a transaction in progress;
we need only check for recovery if this is set.
-3.9 <sub:TDB-Does-Not>TDB Does Not Have Snapshot Support
+3.8.2 Status
-3.9.1 Proposed Solution
+Deferred.
-None. At some point you say “use a real database”.
+3.9 <sub:TDB-Does-Not>TDB Does Not Have Snapshot Support
+
+3.9.1 Proposed Solution
+
+None. At some point you say “use a real database” (but see
+[replay-attribute]).
But as a thought experiment, if we implemented transactions to
only overwrite free entries (this is tricky: there must not be a
We could then implement snapshots using a similar method, using
multiple different hash tables/free tables.
+3.9.2 Status
+
+Deferred.
+
3.10 Transactions Cannot Operate in Parallel
This would be useless for ldb, as it hits the index records with
3.10.1 Proposed Solution
-We could solve a small part of the problem by providing read-only
-transactions. These would allow one write transaction to begin,
-but it could not commit until all r/o transactions are done. This
-would require a new RO_TRANSACTION_LOCK, which would be upgraded
-on commit.
+None (but see [replay-attribute]). We could solve a small part of
+the problem by providing read-only transactions. These would
+allow one write transaction to begin, but it could not commit
+until all r/o transactions are done. This would require a new
+RO_TRANSACTION_LOCK, which would be upgraded on commit.
+
+3.10.2 Status
+
+Deferred.
3.11 Default Hash Function Is Suboptimal
source, and placed in the header. This is far from foolproof, but
adds a little bit of protection against hash bombing.
+3.11.2 Status
+
+Complete.
+
3.12 <Reliable-Traversal-Adds>Reliable Traversal Adds Complexity
We lock a record during traversal iteration, and try to grab that
Remove reliability guarantees; see [traverse-Proposed-Solution].
+3.12.2 Status
+
+Complete.
+
3.13 Fcntl Locking Adds Overhead
Placing a fcntl lock means a system call, as does removing one.
levels, and apply the transactions until it encountered an
invalid checksum.
+3.15 Tracing Is Fragile, Replay Is External
+
+The current TDB has compile-time-enabled tracing code, but it
+often breaks as it is not enabled by default. In a similar way,
+the ctdb code has an external wrapper which does replay tracing
+so it can coordinate cluster-wide transactions.
+
+3.15.1 Proposed Solution<replay-attribute>
+
+Tridge points out that an attribute can be later added to
+tdb_open (see [attributes]) to provide replay/trace hooks, which
+could become the basis for this and future parallel transactions
+and snapshot support.
+
+3.15.2 Status
+
+Deferred.
+